Technology & Research

Intel® Technology Journal Home

Volume 12, Issue 03

Original 45nm Intel® Core™ Microarchitecture


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1203.03

  • Volume 12
  • Issue 03
  • Published November 7, 2008

Original 45nm Intel® Core™ Microarchitecture

  Section 3 of 15  

Improvements in the Intel® Core™2 Processor Family Architecture and Microarchitecture

SUPER SHUFFLE

One shuffler is better than two

Merom architecture dramatically improved SSE performance through a simple but highly effective method—doubling the width of SSE execution from 64 bits to 128 bits by instantiating two 64-bit execution units side by side and adding two small shuffle units to communicate between the 64-bit halves that handled only quad words. While increasing the execution width to 128 bits dramatically improved performance, the 64-bit “wall” between the execution halves was left in place.

The 64-bit wall caused some legacy instructions to be slower than would be expected. For example, the legacy instruction SHUFPS followed these steps in the Merom architecture:

  • Gathered all input DEST bits to [63:0] side of the “wall.”
  • Gathered all SRC bits to the [127:64] side of the “wall.”
  • Shuffled according to the immediate instruction.

SHUFPS output bits [127:64] always come from the SRC, and output bits [63:0] always come from the DEST, so we had to move SRC bits [63:0] to the upper half of the SSE execution unit. SHUFPS output bits [63:0] always come from the DEST input, so we had to move DEST bits [127:64] to the lower half of the SSE unit.

Some of the performance penalty of having three operations is covered by the Merom architecture having shuffle units on two ports as well as another SSE shuffler that handled only 64-bit data sizes on a third port.

The family of processor's solution to this problem is to merge the two shuffle units into a single Super-Shuffle unit that does not have a 64-bit wall. The Super-Shuffle is significantly more costly in terms of routing, but the added area cost is covered by merging the area from the two old shuffle units into the Super-Shuffle.

In the family of processors, the SHUFPS shuffling algorithm becomes:

1. SHUFPS!

While the Merom algorithm had a nominal throughput of 1, it used three operations. The implementation also has a throughput of 1, but it uses only one operation, leaving more execution bandwidth for other instructions. In the architecture we reduced the SHUFPS latency from 2 to 1.

We made similar improvements to SSE instructions PACKxSxx, PUNPCKxxx, PHADD*, PSHUFB, PALIGNR, PINSRW, PEXTRW, UNPCKLPS, UNPCKHPS, HPADDPS, HSUBPS, and PSHUFD.

  Section 3 of 15  

Back to Top

In this article

Download a PDF of this article.