Volume 12, Issue 03

Original 45nm Intel® Core™ Microarchitecture


Intel Technology Journal - Featuring Intel's recent research and development

ISSN 1535-864X DOI 10.1535/itj.1203.03

Published November 7, 2008

  Section 6 of 15  

Improvements in the Intel® Core™2 Processor Family Architecture and Microarchitecture

SSE4.1 EXAMPLES

SSE4 instructions were created to speed up several classes of applications. Two instructions in particular, MPSADBW and PHMINPOSUW, combined with the Super Shuffler, provide large performance improvements in the block-matching algorithms commonly used in motion estimation. The block-matching performance (a 1.6x–3.8x function-level speedup) and the way these instructions deliver it are discussed in detail in Silicon Performance [1].

Two other SSE4 examples are discussed in this section. First, we briefly present the measured performance of streaming loads [2] to conclude the discussion in the previous section. Then we discuss the DPPS/DPPD instructions and the usage models in which they improve performance. We provide an example that shows how DPPS and another SSE4.1 instruction (EXTRACTPS) can be used together to speed up collision detection.

Streaming loads

In this section we continue the discussion from the previous section by briefly examining the measured results and optimization guidelines for streaming loads [2]. To maximize streaming load throughput, software needs to use the streaming load buffers of two cores at the same time. That is, two software threads executing on two different cores perform streaming loads from separate parts of USWC (Uncacheable Speculative Write Combining) memory at the same time and copy the data into separate WB (write-back) cacheable memory buffers (see Figure 4). The WB buffers have to be small enough to fit in the first-level cache to minimize resource contention, and the four streaming loads that make up one cache line (64 bytes) need to be issued close together.



Figure 4: Example of streaming loads: accessing graphics card memory and using two threads to maximize memory throughput
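The per-thread copy loop described above can be sketched in C with the `_mm_stream_load_si128` intrinsic, which compiles to the SSE4.1 streaming load instruction (MOVNTDQA). This is a minimal sketch under stated assumptions, not the code used for the measurements: the helper name `copy_lines_streaming` and the `SSE41_FN` macro are ours, both pointers are assumed to be 16-byte aligned, the size is assumed to be a multiple of 64 bytes, and thread creation and the mapping of USWC memory are left out. On ordinary write-back memory the loop still copies correctly, but the streaming load behaves like a normal load.

```c
#include <stddef.h>
#include <smmintrin.h>  /* SSE4.1 intrinsics: _mm_stream_load_si128 (MOVNTDQA) */

/* Assumption: GCC and Clang need SSE4.1 enabled per function unless the
   whole file is built with -msse4.1; other compilers get an empty macro. */
#if defined(__GNUC__)
#define SSE41_FN __attribute__((target("sse4.1")))
#else
#define SSE41_FN
#endif

/* Hypothetical helper: copy `bytes` (a multiple of 64) from `src` to `dst`.
   Both pointers must be 16-byte aligned. */
SSE41_FN
void copy_lines_streaming(void *dst, const void *src, size_t bytes)
{
    __m128i *d = (__m128i *)dst;
    __m128i *s = (__m128i *)src;  /* some headers declare a non-const parameter */
    size_t lines = bytes / 64;

    for (size_t i = 0; i < lines; ++i, s += 4, d += 4) {
        /* Four 16-byte streaming loads cover one 64-byte cache line;
           issue them back to back before any other memory access. */
        __m128i r0 = _mm_stream_load_si128(s + 0);
        __m128i r1 = _mm_stream_load_si128(s + 1);
        __m128i r2 = _mm_stream_load_si128(s + 2);
        __m128i r3 = _mm_stream_load_si128(s + 3);

        /* Ordinary aligned stores into the cacheable (WB) buffer. */
        _mm_store_si128(d + 0, r0);
        _mm_store_si128(d + 1, r1);
        _mm_store_si128(d + 2, r2);
        _mm_store_si128(d + 3, r3);
    }
}
```

In the two-thread arrangement, each thread would call this helper on its own USWC source region and its own WB destination buffer, in chunks small enough for the destination to stay in the first-level cache.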

Tests were conducted on a 45nm Intel Core 2 desktop processor (E7200) with a 1067 MT/sec FSB.

The theoretical memory throughput (cacheable and uncacheable memory) can be calculated as follows:

Theoretical memory throughput
= FSB transfers/sec * bytes/transfer
= 1067 MT/sec * 8 bytes/transfer
= 8.53 GB/sec

The single-threaded streaming load implementation, which used one core's streaming load buffers, was measured at approximately 50 percent of the theoretical memory throughput. The dual-threaded implementation, which used the streaming load buffers of two cores, was measured at approximately 90 percent of the theoretical memory throughput. Using the streaming load buffers of two cores is therefore the recommended way to get the highest memory throughput out of streaming loads.
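As a quick check of the arithmetic, the following C sketch recomputes the theoretical throughput and the share of it reached in the two measurements. The function names are ours and purely illustrative, and 1 GB is assumed to mean 1000 MB.

```c
/* Peak bus throughput in GB/sec, taking 1 GB = 1000 MB (assumption). */
double peak_throughput_gbs(double mega_transfers_per_sec,
                           double bytes_per_transfer)
{
    return mega_transfers_per_sec * bytes_per_transfer / 1000.0;
}

/* Throughput reached at a given fraction of the peak (0.5 for 50 percent). */
double achieved_throughput_gbs(double peak_gbs, double fraction)
{
    return peak_gbs * fraction;
}
```

For the 1067 MT/sec, 8-byte-wide FSB used in the tests, `peak_throughput_gbs(1067.0, 8.0)` gives about 8.5 GB/sec, so the single-threaded and dual-threaded results correspond to roughly 4.3 GB/sec and 7.7 GB/sec respectively.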

Single-Precision Floating-Point Dot Product

The Dot Product of Packed Single-Precision Floating-Point Values (DPPS) instruction, and the DPPD instruction for double-precision floating-point numbers, can provide performance benefits in games, multimedia, and high-performance computing applications. DPPS has a high latency because it performs several operations at once. It therefore provides the most benefit when the Array of Structures (AOS) data layout is used rather than the Structure of Arrays (SOA) data layout [3]. The AOS layout is usually not Single Instruction Multiple Data (SIMD)-friendly, except for horizontal instructions such as DPPS and HADDPS. Users can use these horizontal instructions to avoid the heavy data swizzling costs [3] of converting to the SOA data layout. An SSE3 implementation of a dot product of vector length 4 in the AOS format can be written with the HADDPS instruction as shown:

void dot_product_vlength4_SSE3
(float *src,float *dst,int Count)
{
__asm {
   mov esi, dword ptr [src]
   mov edi, dword ptr [dst]
   mov ecx, Count
start:
   //a3, a2, a1, a0
   movaps xmm0, [esi]
   //a3*b3,a2*b2,a1*b1,a0*b0
   mulps xmm0, [esi + 16]
   //a3*b3+a2*b2,a1*b1+a0*b0,
   //a3*b3+a2*b2,a1*b1+a0*b0
   haddps xmm0, xmm0
   //keep a copy of the two partial sums
   movaps xmm1, xmm0
   //move a3*b3+a2*b2 down to element 0
   psrlq xmm0, 32
   //element 0 = sum of all four products
   addss xmm0, xmm1
   movss [edi],xmm0
   //next vector pair, next result
   add esi, 32
   add edi, 4
   sub ecx, 1
   jnz start
  }
 }          <Code 2>
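The reduction in the SSE3 listing is easier to follow when the register elements are traced one by one. The scalar C model below is our own illustration, not part of the measured code: each four-element array stands in for an XMM register, with index 0 as the lowest element, and the helper names are hypothetical.

```c
/* Model of "haddps x, x": with the same register as both operands, the
   pairwise sums are written to the low half and repeated in the high half. */
static void haddps_self(float v[4])
{
    float lo = v[0] + v[1];
    float hi = v[2] + v[3];
    v[0] = lo; v[1] = hi; v[2] = lo; v[3] = hi;
}

/* Scalar model of one loop iteration of the SSE3 dot product. */
float dot4_haddps_model(const float a[4], const float b[4])
{
    float x[4], y[4];
    int i;

    for (i = 0; i < 4; ++i)      /* mulps xmm0, [esi + 16] */
        x[i] = a[i] * b[i];

    haddps_self(x);              /* haddps xmm0, xmm0 */

    for (i = 0; i < 4; ++i)      /* movaps xmm1, xmm0 */
        y[i] = x[i];

    /* psrlq xmm0, 32: each 64-bit half shifts right by 32 bits, so element 1
       moves to element 0, element 3 moves to element 2, and the vacated
       elements become zero. */
    x[0] = x[1]; x[1] = 0.0f;
    x[2] = x[3]; x[3] = 0.0f;

    x[0] += y[0];                /* addss xmm0, xmm1 */
    return x[0];                 /* movss [edi], xmm0 */
}
```

After the shift, element 0 holds a3*b3+a2*b2 and the saved copy holds a1*b1+a0*b0 in element 0, so the final scalar add produces the full dot product.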

Notice that the SSE3 implementation of the dot product requires the MULPS+HADDPS+MOVAPS+PSRLQ+ADDSS instructions. The SSE4 implementation replaces all of these instructions with one: DPPS (Code 3).

void dot_product_vlength4_SSE4
(float *src,float *dst,int Count)
{
__asm {
   mov esi, dword ptr [src]
   mov edi, dword ptr [dst]
   mov ecx, Count
start:
   //a3, a2, a1, a0
   movaps xmm0, [esi]
   //imm8 0xF1: multiply all four pairs,
   //sum them, write the sum to element 0
   dpps xmm0, [esi + 16], 0xF1
   movss [edi], xmm0
   add esi, 32
   add edi, 4
   sub ecx, 1
   jnz start
  }
}             <Code 3>
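For compilers without inline assembly support, the same loop can be sketched with the `_mm_dp_ps` intrinsic, which maps to DPPS. This is an illustrative sketch rather than the measured implementation: the function name and the `SSE41_FN` macro are ours, and `src` is assumed to be 16-byte aligned and laid out as in the listings (vector a followed by vector b, 32 bytes per pair). The mask 0xF1 selects all four products for the sum (upper four bits) and writes the sum to element 0 only (lower four bits).

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics: _mm_dp_ps (DPPS) */

/* Assumption: GCC and Clang need SSE4.1 enabled per function unless the
   whole file is built with -msse4.1; other compilers get an empty macro. */
#if defined(__GNUC__)
#define SSE41_FN __attribute__((target("sse4.1")))
#else
#define SSE41_FN
#endif

/* Hypothetical intrinsic counterpart of the SSE4 listing.
   `src` must be 16-byte aligned; `dst` receives one float per vector pair. */
SSE41_FN
void dot_product_vlength4_dpps(const float *src, float *dst, int count)
{
    for (int i = 0; i < count; ++i, src += 8) {
        __m128 a  = _mm_load_ps(src);        /* a3, a2, a1, a0 */
        __m128 b  = _mm_load_ps(src + 4);    /* b3, b2, b1, b0 */
        __m128 dp = _mm_dp_ps(a, b, 0xF1);   /* dot product in element 0 */
        _mm_store_ss(dst + i, dp);           /* store the scalar result */
    }
}
```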

Table 1 shows the measured performance of the two dot product implementations in the AOS data layout, compared to the C implementation. The DPPS instruction can speed up many vector and matrix operations that require a dot product, such as vector normalization [3] and collision detection.
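The difference between the two data layouts mentioned in this section can be made concrete with a plain C sketch. The type and function names below are ours and purely illustrative; they are not taken from the measured C implementation. In the AOS layout each vector's four components sit next to each other, so one SIMD register holds one vector and the sum has to cross register elements, which is the case horizontal instructions such as DPPS and HADDPS address. In the SOA layout each component has its own array, so four consecutive values of `i` fill one register per component and ordinary vertical multiplies and adds can produce four dot products at once.

```c
#include <stddef.h>

/* AOS: one structure per vector, components adjacent in memory. */
typedef struct { float x, y, z, w; } Vec4;

/* SOA: one array per component, shared index across arrays. */
typedef struct { const float *x, *y, *z, *w; } Vec4SoA;

/* Dot products of n vector pairs stored in the AOS layout. */
void dot_aos(const Vec4 *a, const Vec4 *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i].x * b[i].x + a[i].y * b[i].y
               + a[i].z * b[i].z + a[i].w * b[i].w;
}

/* The same dot products with the inputs stored in the SOA layout. */
void dot_soa(const Vec4SoA *a, const Vec4SoA *b, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = a->x[i] * b->x[i] + a->y[i] * b->y[i]
               + a->z[i] * b->z[i] + a->w[i] * b->w[i];
}
```

Converting existing AOS data to SOA costs extra shuffles, which is the swizzling overhead that the horizontal instructions let the programmer avoid.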

