Volume 12, Issue 03

Original 45nm Intel® Core™ Microarchitecture


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1203.03

  • Volume 12
  • Issue 03
  • Published November 7, 2008

  Section 4 of 15  

Improvements in the Intel® Core™2 Processor Family Architecture and Microarchitecture

INTRODUCTION TO SSE4.1

Many of the SSE4.1 instructions were created by identifying multi-instruction sequences that recur in common kernels and that could readily be implemented as a single instruction in hardware. Other instructions fill gaps in the existing instruction set: the new PMINxx, PMAXxx, PEXTRx, PINSRx, PACKUSDW, PCMPEQQ, and PMULLD. PMULDQ is the signed version of PMULUDQ. Please refer to the Software Developer's Manual for details [4].

The Super Shuffle, which breaks the 64-bit wall, is a key enabler of the performance improvement of many SSE4.1 instructions. PMOVSX, PMOVZX, PEXTRx, PINSRx, and INSERTPS all require the new Super Shuffle to realize their full potential. Figure 1 shows PMOVZXDQ moving data across the 64-bit wall without a special operation to move data from the low 64 bits to the high 64 bits. The MPSADBW and PTEST instructions also depend on crossing the 64-bit boundary, albeit in other execution blocks.



Figure 1: PMOVZXDQ crosses the 64-bit boundary on this family of processors without a special 64-bit operation
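The data movement that PMOVZXDQ performs can be sketched in portable scalar C. This is a minimal model of the architectural result only, not of how the Super Shuffle produces it; the function name `pmovzxdq_model` and the array representation of an XMM register are assumptions made for illustration. In compiled code the instruction is normally reached through the `_mm_cvtepu32_epi64` intrinsic.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PMOVZXDQ (hypothetical helper, not an Intel API).
 * src: the four doublewords of the source register, src[0] = bits 31:0.
 * dst: the two quadwords of the destination, dst[0] = bits 63:0. */
void pmovzxdq_model(const uint32_t src[4], uint64_t dst[2])
{
    assert(src != 0 && dst != 0);
    dst[0] = (uint64_t)src[0];  /* bits 31:0  -> bits 63:0, upper half zeroed   */
    dst[1] = (uint64_t)src[1];  /* bits 63:32 -> bits 127:64, across the wall   */
}
```

The second assignment is the interesting one: the doubleword that starts in the low 64 bits of the source ends up in the high 64 bits of the destination.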

The SSE4.1 instructions DPPS, DPPD, and INSERTPS remove the need for additional instructions to selectively zero portions of a register. This built-in zeroing effectively compresses two instructions into one for INSERTPS and four instructions into one for DPPS and DPPD.

As shown in Figure 2, DPPD and DPPS are the first floating-point SSE instructions to perform multiple floating-point operations.



Figure 2: DPPD provides zeroing after both floating-point operations
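A scalar sketch of the four-lane DPPS may help make the multiply, add, and zeroing steps concrete. The helper name `dpps_model` and the array view of the registers are assumptions for illustration. The immediate layout used here (bits 7:4 select which products enter the sum, bits 3:0 select which destination lanes receive it) and the pairwise addition order follow the author's reading of the Software Developer's Manual pseudocode and should be checked against it. DPPD works the same way on two lanes, with bits 5:4 and 1:0 of the immediate.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of DPPS dst, src, imm8 (hypothetical helper).
 * dst and src each hold four single-precision lanes, lane 0 first. */
void dpps_model(float dst[4], const float src[4], uint8_t imm8)
{
    assert(dst != 0 && src != 0);
    float p[4];
    /* Multiply step: a lane contributes only if its bit in imm8[7:4] is set. */
    for (int i = 0; i < 4; i++)
        p[i] = (imm8 & (0x10u << i)) ? dst[i] * src[i] : 0.0f;
    /* Add step: pairwise order assumed from the manual pseudocode. */
    float sum = (p[0] + p[1]) + (p[2] + p[3]);
    /* Broadcast-or-zero step: imm8[3:0] picks the lanes that get the sum. */
    for (int i = 0; i < 4; i++)
        dst[i] = (imm8 & (1u << i)) ? sum : 0.0f;
}
```

With an immediate of 0x71, for example, the model sums the products of lanes 0 to 2, writes the result to lane 0, and zeroes the other three lanes in the same operation.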

Two new rounding instructions, ROUNDPS and ROUNDPD, round floating-point values to integers and return the results in floating-point format. The rounding control is selectable from either the instruction immediate or MXCSR.RC. The user can also suppress Precision Exceptions through a bit in the immediate.
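The selection of the rounding control can be illustrated with a restricted scalar model. The names `round_lane_model` and `roundps_model` are hypothetical. The sketch assumes the immediate layout described in the Software Developer's Manual (bits 1:0 carry the rounding mode, bit 2 selects MXCSR.RC instead, bit 3 suppresses the precision exception), handles only values with magnitude below 2^23, and does not model NaN, infinity, the sign of zero, or exception state.

```c
#include <assert.h>
#include <stdint.h>

/* Rounding-control encodings shared by imm8[1:0] and MXCSR.RC. */
enum { RC_NEAREST_EVEN = 0, RC_DOWN = 1, RC_UP = 2, RC_TRUNCATE = 3 };

/* Round one lane to an integral value, returned as a float.
 * Simplification: requires |x| < 2^23. */
float round_lane_model(float x, unsigned rc)
{
    assert(x > -8388608.0f && x < 8388608.0f);
    float t = (float)(int32_t)x;          /* C conversion truncates toward zero */
    switch (rc & 3u) {
    case RC_DOWN:     return (t > x) ? t - 1.0f : t;
    case RC_UP:       return (t < x) ? t + 1.0f : t;
    case RC_TRUNCATE: return t;
    default: {                            /* RC_NEAREST_EVEN */
        float a = (x < 0.0f) ? -x : x;    /* work on the magnitude */
        float f = (float)(int32_t)a;      /* integer part of |x| */
        float d = a - f;                  /* fractional part of |x| */
        if (d > 0.5f || (d == 0.5f && ((int32_t)f & 1)))
            f += 1.0f;                    /* ties go to the even integer */
        return (x < 0.0f) ? -f : f;
    }
    }
}

/* Scalar model of ROUNDPS dst, src, imm8; mxcsr_rc stands in for MXCSR.RC. */
void roundps_model(float dst[4], const float src[4], uint8_t imm8,
                   unsigned mxcsr_rc)
{
    assert(dst != 0 && src != 0);
    /* imm8 bit 2 chooses between the immediate's mode and MXCSR.RC. */
    unsigned rc = (imm8 & 0x4u) ? (mxcsr_rc & 3u) : (imm8 & 3u);
    /* imm8 bit 3 suppresses the precision exception; this model keeps
     * no exception state, so that bit has no visible effect here. */
    for (int i = 0; i < 4; i++)
        dst[i] = round_lane_model(src[i], rc);
}
```

The results stay in floating-point format, as in the instruction itself; no integer register is involved. The corresponding intrinsic is `_mm_round_ps`.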

Prior to SSE4.1, SSE instructions always operated on subsets of 128 bits. PTEST is the first SSE operation to operate on the entire 128 bits as a single entity. PTEST is useful for detecting all 0s or all 1s; it reports the result in the flags, which simplifies the decisions that follow.
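The flag results of PTEST can be sketched as follows. The type `ptest_flags_t` and the three function names are hypothetical, and the register is modeled as two 64-bit halves. The flag definitions (ZF set when DEST AND SRC is zero, CF set when (NOT DEST) AND SRC is zero) follow the Software Developer's Manual; the all-0s and all-1s idioms built on them are one common usage, shown here as an illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { bool zf; bool cf; } ptest_flags_t;

/* Scalar model of PTEST dst, src over the full 128 bits. */
ptest_flags_t ptest_model(const uint64_t dst[2], const uint64_t src[2])
{
    assert(dst != 0 && src != 0);
    ptest_flags_t f;
    f.zf = (((dst[0] & src[0]) | (dst[1] & src[1])) == 0);    /* DEST AND SRC == 0     */
    f.cf = (((~dst[0] & src[0]) | (~dst[1] & src[1])) == 0);  /* NOT DEST AND SRC == 0 */
    return f;
}

/* All 128 bits are 0 exactly when testing x against itself sets ZF. */
bool is_all_zeros(const uint64_t x[2])
{
    return ptest_model(x, x).zf;
}

/* All 128 bits are 1 exactly when testing x against all-ones sets CF. */
bool is_all_ones(const uint64_t x[2])
{
    const uint64_t ones[2] = { ~(uint64_t)0, ~(uint64_t)0 };
    return ptest_model(x, ones).cf;
}
```

In intrinsic form the two flags are exposed separately as `_mm_testz_si128` and `_mm_testc_si128`.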

MPSADBW performs a series of eight 4-byte SAD (Sum of Absolute Differences) operations, comparing one 4-byte block of the source against eight overlapping 4-byte blocks at consecutive byte offsets in the destination. The starting points of the SRC and DEST windows are selectable using the instruction immediate.
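A scalar sketch of the sliding comparison is given below. The function name `mpsadbw_model` and the separate `out` array are assumptions for illustration (the real instruction overwrites the destination register). The immediate layout assumed here follows the Software Developer's Manual: bits 1:0 select one of four 4-byte blocks in the source, and bit 2 selects a starting byte offset of 0 or 4 in the destination.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of MPSADBW dst, src, imm8 (hypothetical helper).
 * dst, src: the 16 bytes of each register, byte 0 first.
 * out: the eight 16-bit sums that the instruction would write to dst. */
void mpsadbw_model(const uint8_t dst[16], const uint8_t src[16],
                   uint8_t imm8, uint16_t out[8])
{
    assert(dst != 0 && src != 0 && out != 0);
    unsigned src_off = (imm8 & 0x3u) * 4u;         /* 0, 4, 8 or 12 */
    unsigned dst_off = ((imm8 >> 2) & 0x1u) * 4u;  /* 0 or 4        */
    for (unsigned i = 0; i < 8; i++) {             /* eight window positions */
        unsigned sad = 0;
        for (unsigned j = 0; j < 4; j++) {         /* four bytes per SAD */
            int diff = (int)dst[dst_off + i + j] - (int)src[src_off + j];
            sad += (unsigned)(diff < 0 ? -diff : diff);
        }
        out[i] = (uint16_t)sad;
    }
}
```

Each window position advances by one byte, so the eight sums together read eleven consecutive destination bytes. The matching intrinsic is `_mm_mpsadbw_epu8`.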

PHMINPOSUW forms a very useful counterpart to MPSADBW because it finds the minimum word and returns both the value and position of the minimum word.
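The search that PHMINPOSUW performs can be sketched in scalar C, for example over the eight sums that an MPSADBW produces. The function name `phminposuw_model` is hypothetical. The result layout assumed here (minimum value in word 0, its index in the low bits of word 1, remaining words zero, lowest index winning ties) follows the Software Developer's Manual description.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PHMINPOSUW dst, src (hypothetical helper).
 * src: eight unsigned words, word 0 first.
 * dst: word 0 = minimum value, word 1 = its index, words 2..7 = 0. */
void phminposuw_model(const uint16_t src[8], uint16_t dst[8])
{
    assert(src != 0 && dst != 0);
    unsigned best = 0;
    for (unsigned i = 1; i < 8; i++)
        if (src[i] < src[best])      /* strict compare: lowest index wins ties */
            best = i;
    uint16_t min = src[best];        /* read before writing, in case dst == src */
    dst[0] = min;
    dst[1] = (uint16_t)best;
    for (unsigned i = 2; i < 8; i++)
        dst[i] = 0;
}
```

In a block-matching search, the index returned in word 1 identifies the window offset with the smallest SAD. The intrinsic form is `_mm_minpos_epu16`.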

INSERTPS is a very general insertion between XMM registers. It allows any single-precision element to be selected from the source and inserted into any position in the destination. Both the source element and the destination position are selected by the instruction immediate. INSERTPS also allows selected element positions in the destination to be zeroed, again under control of the immediate.
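The three fields of the INSERTPS immediate can be illustrated with a short scalar model of the register-source form. The name `insertps_model` is hypothetical. The field positions assumed here (bits 7:6 select the source element, bits 5:4 select the destination position, bits 3:0 form the zero mask, applied after the insertion) follow the Software Developer's Manual.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of INSERTPS dst, src, imm8, register-source form
 * (hypothetical helper). Lane 0 is the lowest element. */
void insertps_model(float dst[4], const float src[4], uint8_t imm8)
{
    assert(dst != 0 && src != 0);
    unsigned src_sel = (imm8 >> 6) & 3u;   /* which source element to take   */
    unsigned dst_sel = (imm8 >> 4) & 3u;   /* where to place it in dst       */
    float value = src[src_sel];            /* read first, in case dst == src */
    dst[dst_sel] = value;
    for (unsigned i = 0; i < 4; i++)       /* zero mask is applied last */
        if (imm8 & (1u << i))
            dst[i] = 0.0f;
}
```

An immediate of 0x98, for instance, copies source element 2 into destination position 1 and zeroes position 3 in the same operation. The intrinsic form is `_mm_insert_ps`.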

EXTRACTPS sounds as if it should be the complement of INSERTPS, but it is not. EXTRACTPS extracts to a General Purpose (GP) Register instead of to another XMM register, which makes EXTRACTPS very similar to PEXTRD. The only difference between EXTRACTPS and PEXTRD is the handling of REX.W. With REX.W set, EXTRACTPS still extracts a 32-bit value and zero-extends it, whereas PEXTRD becomes the 64-bit instruction PEXTRQ.
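The REX.W difference can be made concrete with a small scalar model. The function names `extractps_model` and `pextr_model` and the `rex_w` parameter are hypothetical; the XMM register is represented as four 32-bit lanes of raw bits, and the return value stands for a 64-bit GP register.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Scalar model of EXTRACTPS to a 64-bit GP register (hypothetical helper).
 * REX.W does not change the operation: one 32-bit lane, zero-extended. */
uint64_t extractps_model(const uint32_t xmm[4], uint8_t imm8, bool rex_w)
{
    assert(xmm != 0);
    (void)rex_w;
    return (uint64_t)xmm[imm8 & 3u];
}

/* Scalar model of PEXTRD, which becomes PEXTRQ when REX.W is set. */
uint64_t pextr_model(const uint32_t xmm[4], uint8_t imm8, bool rex_w)
{
    assert(xmm != 0);
    if (!rex_w)
        return (uint64_t)xmm[imm8 & 3u];             /* PEXTRD: 32-bit lane  */
    unsigned q = imm8 & 1u;                          /* PEXTRQ: 64-bit lane  */
    return (uint64_t)xmm[2u * q] | ((uint64_t)xmm[2u * q + 1u] << 32);
}
```

With the same immediate, the two models agree when `rex_w` is false and diverge when it is true: the first still returns a 32-bit element, the second a full quadword.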

The BLENDxx and BLENDVxx instructions perform a per-element select between the source and destination registers. For the BLEND instructions, the select is controlled by the immediate; for the BLENDV instructions, it is controlled by the element sign bits of a third XMM register, which must be XMM0.
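The two forms of select can be sketched for the single-precision case as follows. The names `blendps_model` and `blendvps_model` are hypothetical, the `xmm0` parameter stands for the implicit mask register, and the code assumes a 32-bit IEEE 754 `float`. A set control bit selects the source element; a clear bit leaves the destination element unchanged.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of BLENDPS dst, src, imm8: imm8 bit i controls lane i. */
void blendps_model(float dst[4], const float src[4], uint8_t imm8)
{
    assert(dst != 0 && src != 0);
    for (unsigned i = 0; i < 4; i++)
        if (imm8 & (1u << i))
            dst[i] = src[i];
}

/* Scalar model of BLENDVPS dst, src, <XMM0>: the sign bit of each
 * xmm0 lane controls the corresponding lane. */
void blendvps_model(float dst[4], const float src[4], const float xmm0[4])
{
    assert(dst != 0 && src != 0 && xmm0 != 0);
    assert(sizeof(float) == sizeof(uint32_t));
    for (unsigned i = 0; i < 4; i++) {
        uint32_t bits;
        memcpy(&bits, &xmm0[i], sizeof bits);  /* raw bit pattern of the lane */
        if (bits >> 31)                        /* sign bit set: take source   */
            dst[i] = src[i];
    }
}
```

Because only the sign bit is examined, the result of a packed compare (all 1s or all 0s per element) can serve directly as the BLENDV mask. The intrinsic forms are `_mm_blend_ps` and `_mm_blendv_ps`.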
