Original 45nm Intel® Core™ Microarchitecture
Original 45nm Intel® Core™2 Processor Performance
NEW INSTRUCTIONS (SSE4.1)
While many of the microarchitecture enhancements in the PenrynΔ family of processors can be utilized without recompilation, media-related kernels achieve the maximum performance and power-efficiency gains when they are recompiled with the Intel compiler and/or manually optimized to use the new SSE4.1 instructions introduced in the Penryn family of processors.
Intel works closely with industry partners, including independent software vendors (ISVs), to understand their performance needs and to improve their applications’ performance. The new SSE4 instructions in the Penryn family of processors are a customer-driven response to the demand for better performance in audio-, video-, and image-editing applications, video encoders, 3-D applications, and games. In this section we discuss performance results achieved by using the SSE4 instructions.
Intel® HD Boost technology
Intel HD Boost, the combination of SSE4 instructions and the Penryn family of processors’ Super Shuffle Engine, can provide large speedups on a wide range of applications. The following instructions in particular can provide significant benefits to video, imaging, and audio applications.
- There are twelve new integer format conversions that can perform a conversion such as Byte->Double-Word in one cycle with one instruction.
- The new MPSADBW instruction performs eight sums of absolute differences (SAD) in one instruction. This is twice what the SSE2 PSADBW instruction can do.
- The new PHMINPOSUW instruction can be used to perform a horizontal minimum search that locates the minimum unsigned word, and its index, in an XMM register or a __m128i data type.
The MPSADBW and PHMINPOSUW SSE4 instructions can be used to significantly improve motion vector search algorithms (also known as block matching) used in motion estimation for video applications. An Intel whitepaper[2] showcases how to use these two instructions for block matching. The whitepaper reports a 1.6× to 3.8× performance improvement (see Figure 1).
Figure 1: SSE4.1 function-level speedups to motion vector search, also known as block matching, used in motion estimation.
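As a minimal sketch of the idea (not the whitepaper's code), the following C function scores one 4-byte block against eight consecutive positions of a reference row with MPSADBW, then picks the best position with PHMINPOSUW. The function name `best_match_offset` and the helper macro `SSE41` are hypothetical; the macro assumes a GCC or Clang toolchain where SSE4.1 is enabled per function rather than with a command-line flag.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */
#include <stdint.h>
#include <string.h>

/* Assumption: GCC/Clang need SSE4.1 enabled for the calling function. */
#if defined(__GNUC__)
#define SSE41 __attribute__((target("sse4.1")))
#else
#define SSE41
#endif

/* Compares block[0..3] against ref[i..i+3] for i = 0..7.
   Returns the best offset i and stores its SAD in *best_sad.
   Only ref[0..10] influence the result, but 16 bytes are loaded. */
SSE41 static int best_match_offset(const uint8_t ref[16],
                                   const uint8_t block[4],
                                   unsigned *best_sad)
{
    int32_t packed;
    memcpy(&packed, block, sizeof packed);

    __m128i window = _mm_loadu_si128((const __m128i *)ref);
    __m128i blk    = _mm_cvtsi32_si128(packed);

    /* MPSADBW: eight 16-bit SADs, one per candidate offset. */
    __m128i sads = _mm_mpsadbw_epu8(window, blk, 0);

    /* PHMINPOSUW: word 0 = minimum SAD, word 1 = its index. */
    __m128i best = _mm_minpos_epu16(sads);

    *best_sad = (unsigned)_mm_extract_epi16(best, 0);
    return _mm_extract_epi16(best, 1);
}
```

A full motion search would repeat this over the rows of a block and accumulate the eight SADs before taking the minimum; this sketch covers a single row only.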
The integer format conversions are commonly used in imaging and video applications. For example, they can be used when converting RGBA from four bytes to four floats prior to computation on a pixel. One SSE4 convert instruction can do the same thing as four SIMD instructions did previously, as shown.
SSE2:
movd xmm0, m32
pxor xmm7, xmm7
punpcklbw xmm0, xmm7
punpcklwd xmm0, xmm7
cvtdq2ps xmm0, xmm0
SSE4:
pmovzxbd xmm0, m32
cvtdq2ps xmm0, xmm0
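The same conversion can be sketched with compiler intrinsics. In this hedged example, `_mm_cvtepu8_epi32` maps to PMOVZXBD and `_mm_cvtepi32_ps` to CVTDQ2PS; the function name `rgba8_to_float4` and the macro `SSE41` are hypothetical, and the macro assumes GCC or Clang.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */
#include <stdint.h>
#include <string.h>

/* Assumption: GCC/Clang need SSE4.1 enabled for the calling function. */
#if defined(__GNUC__)
#define SSE41 __attribute__((target("sse4.1")))
#else
#define SSE41
#endif

/* Converts one RGBA pixel (four bytes) to four floats. */
SSE41 static void rgba8_to_float4(const uint8_t rgba[4], float out[4])
{
    int32_t packed;
    memcpy(&packed, rgba, sizeof packed);        /* 32-bit load of the pixel */

    __m128i bytes  = _mm_cvtsi32_si128(packed);  /* pixel in the low dword   */
    __m128i dwords = _mm_cvtepu8_epi32(bytes);   /* PMOVZXBD: byte -> dword  */
    _mm_storeu_ps(out, _mm_cvtepi32_ps(dwords)); /* CVTDQ2PS: dword -> float */
}
```

The zero-extension step replaces the PXOR and the two unpack instructions of the SSE2 sequence.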
Conditional moves, blends, early outs
Branches have always been one of the limitations of SIMD code. SSE4 provides new instructions (six Blend instructions plus a PTEST instruction) that can be used to replace either some branches or existing lengthy SIMD code written to get around branches.
The Blend instructions can be used to replace conditional move flows. For example, the PBLENDVB instruction can replace the PAND, PANDN, and POR sequence commonly used in conditional moves where masks are created from a comparison instruction. Another SSE4 instruction, PTEST, can be used as an early out because it tests the entire 128-bit register in one pass. This instruction suits conditions that are expected to be infrequent, such as divide-by-zero exceptions. One of the benefits of these new instructions is that they give the compiler more vectorization opportunities; that is, more opportunities to optimize the high-level code by compiling it to use SIMD instructions.
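A small sketch of both ideas, under stated assumptions: `select_sse2` shows the classic three-instruction select, and `safe_div4` (a hypothetical function) uses PTEST as an early out for the common case of no zero divisors, then a single BLENDVPS for the rare case. The `SSE41` macro is also hypothetical and assumes GCC or Clang. Floating-point exceptions are assumed to be masked, which is the default environment.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */

/* Assumption: GCC/Clang need SSE4.1 enabled for the calling function. */
#if defined(__GNUC__)
#define SSE41 __attribute__((target("sse4.1")))
#else
#define SSE41
#endif

/* SSE/SSE2 style select: (mask & a) | (~mask & b), three instructions. */
static __m128 select_sse2(__m128 mask, __m128 a, __m128 b)
{
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

/* Divides four floats; lanes with a zero divisor yield `fallback`. */
SSE41 static __m128 safe_div4(__m128 num, __m128 den, __m128 fallback)
{
    __m128  is_zero = _mm_cmpeq_ps(den, _mm_setzero_ps());
    __m128i zmask   = _mm_castps_si128(is_zero);
    __m128  q       = _mm_div_ps(num, den);

    /* PTEST early out: returns 1 when (zmask & zmask) is all zeros,
       that is, when no lane has a zero divisor. */
    if (_mm_testz_si128(zmask, zmask))
        return q;

    /* BLENDVPS: take `fallback` where the mask sign bit is set, else q. */
    return _mm_blendv_ps(q, fallback, is_zero);
}
```

For full-lane comparison masks, `_mm_blendv_ps(b, a, mask)` gives the same result as `select_sse2(mask, a, b)` in one instruction.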
However, the real benefit of the Blend and PTEST instructions comes when multiple branches in a loop can be replaced with multiple Blend and PTEST instructions. The Mandelbrot[3] code shown in Figure 2 is an example that demonstrates how multiple branches can be replaced with multiple PTEST and Blend instructions. In the SSE4 implementation (Figure 3), notice the use of two PTEST instructions:
if (_mm_test_all_ones(_mm_castps_si128(vmask)))
if (_mm_test_all_zeros(_mm_castps_si128(vmask),
_mm_castps_si128(vmask)))
and three Blend instructions:
sx = _mm_blendv_ps(x + sx*sx - sy*sy,sx,vmask);
sy = _mm_blendv_ps(y +
_F_TWO_*old_sx*sy,sy,vmask);
iter = I32vec4(_mm_blendv_epi8(iter+_I_ONE_,
iter,_mm_castps_si128(vmask)));
Figure 2: C implementation of Mandelbrot.
Figure 3: SSE4 (using F32vec4) implementation of Mandelbrot.
Using the new SSE4 instructions improves Mandelbrot performance by 2.8× over the C implementation.
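The following is a hedged sketch of such a loop written with plain intrinsics. It is not the code from Figure 3, which uses the F32vec4 classes. The function name `mandelbrot4`, the escape threshold of 4.0, and the `SSE41` macro (GCC or Clang assumed) are illustrative choices.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */

/* Assumption: GCC/Clang need SSE4.1 enabled for the calling function. */
#if defined(__GNUC__)
#define SSE41 __attribute__((target("sse4.1")))
#else
#define SSE41
#endif

/* Iterates z = z*z + c for four points at once and returns the
   per-lane iteration counts (capped at max_iter). */
SSE41 static __m128i mandelbrot4(__m128 cx, __m128 cy, int max_iter)
{
    const __m128  four = _mm_set1_ps(4.0f);
    const __m128  two  = _mm_set1_ps(2.0f);
    const __m128i one  = _mm_set1_epi32(1);
    __m128  zx = _mm_setzero_ps(), zy = _mm_setzero_ps();
    __m128i iter = _mm_setzero_si128();
    int i;

    for (i = 0; i < max_iter; ++i) {
        __m128 zx2 = _mm_mul_ps(zx, zx);
        __m128 zy2 = _mm_mul_ps(zy, zy);

        /* All-ones in lanes where |z|^2 > 4 (the point has escaped). */
        __m128  escaped = _mm_cmpgt_ps(_mm_add_ps(zx2, zy2), four);
        __m128i emask   = _mm_castps_si128(escaped);

        /* PTEST early out: stop once every lane has escaped. */
        if (_mm_test_all_ones(emask))
            break;

        __m128 nx = _mm_add_ps(_mm_sub_ps(zx2, zy2), cx);
        __m128 ny = _mm_add_ps(_mm_mul_ps(two, _mm_mul_ps(zx, zy)), cy);

        /* Blends replace per-lane branches: escaped lanes keep old values. */
        zx   = _mm_blendv_ps(nx, zx, escaped);
        zy   = _mm_blendv_ps(ny, zy, escaped);
        iter = _mm_blendv_epi8(_mm_add_epi32(iter, one), iter, emask);
    }
    return iter;
}
```

Each lane behaves like the scalar loop "iterate while |z|² ≤ 4 and the count is below max_iter", but the only branch left in the loop body is the PTEST early out.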
Graphics building blocks
SSE4 instructions can be used to speed up graphical applications such as games. The DPPS and DPPD instructions can be used to speed up collision detection and common vector and matrix operations such as vector normalization. A detailed example of collision detection, with usage guidelines for the DPPS and DPPD instructions, is discussed in [1]. The example showcases a 1.5× speedup in collision detection by using the DPPS instruction and the EXTRACTPS instruction.
A common problem in graphics applications is ‘Data Swizzling’, or converting from an Array-of-Structures (AOS) data layout to a more SIMD-friendly Structure-of-Arrays (SOA) data layout in order to use SIMD. Users have to weigh the cost of these conversions before deciding whether it is worth using SIMD. By using the INSERTPS instruction, the data-swizzling operation on the next-generation Penryn microprocessors now takes 15 cycles per four vertices, down from 23 cycles on the Intel 65-nm Core 2 Duo microprocessors, codename Merom[5] (see Figure 4).
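As an illustration of DPPS (not the collision-detection example from [1]), the sketch below computes a three-component dot product and a vector normalization. The function names `dot3` and `normalize3` and the `SSE41` macro (GCC or Clang assumed) are hypothetical. In the immediate operand, the high four bits select which lanes are multiplied and the low four bits select which result lanes receive the sum.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */

/* Assumption: GCC/Clang need SSE4.1 enabled for the calling function. */
#if defined(__GNUC__)
#define SSE41 __attribute__((target("sse4.1")))
#else
#define SSE41
#endif

/* Dot product of the x, y, z lanes; the w lane is ignored. */
SSE41 static float dot3(__m128 a, __m128 b)
{
    /* 0x71: multiply lanes 0..2, write the sum to lane 0 only. */
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0x71));
}

/* Scales v so that its x, y, z lanes have unit length.
   A zero-length input produces NaN lanes; callers must handle that. */
SSE41 static __m128 normalize3(__m128 v)
{
    /* 0x7F: multiply lanes 0..2, broadcast the sum to all four lanes. */
    __m128 len2 = _mm_dp_ps(v, v, 0x7F);
    return _mm_div_ps(v, _mm_sqrt_ps(len2));
}
```

Full-precision square root and division are used here for clarity; a faster reciprocal square root approximation is a common alternative when reduced accuracy is acceptable.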
Figure 4: Data swizzling improvements per four vertices (one iteration converts four vertices).
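One way to sketch an INSERTPS-based swizzle for four vertices is shown below. This is an illustrative sequence, not necessarily the one measured for Figure 4; the `Vertex` layout, the function name `aos_to_soa4`, and the `SSE41` macro (GCC or Clang assumed) are all hypothetical.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */

/* Assumption: GCC/Clang need SSE4.1 enabled for the calling function. */
#if defined(__GNUC__)
#define SSE41 __attribute__((target("sse4.1")))
#else
#define SSE41
#endif

typedef struct { float x, y, z; } Vertex;  /* hypothetical AOS layout */

/* Gathers x, y, z of four AOS vertices into three SOA registers. */
SSE41 static void aos_to_soa4(const Vertex v[4],
                              __m128 *xs, __m128 *ys, __m128 *zs)
{
    /* Lane 0 comes from a scalar load of vertex 0. */
    __m128 x = _mm_load_ss(&v[0].x);
    __m128 y = _mm_load_ss(&v[0].y);
    __m128 z = _mm_load_ss(&v[0].z);

    /* INSERTPS: immediate bits 5:4 choose the destination lane;
       bits 7:6 (source lane) and 3:0 (zero mask) stay 0. */
    x = _mm_insert_ps(x, _mm_load_ss(&v[1].x), 0x10);
    y = _mm_insert_ps(y, _mm_load_ss(&v[1].y), 0x10);
    z = _mm_insert_ps(z, _mm_load_ss(&v[1].z), 0x10);

    x = _mm_insert_ps(x, _mm_load_ss(&v[2].x), 0x20);
    y = _mm_insert_ps(y, _mm_load_ss(&v[2].y), 0x20);
    z = _mm_insert_ps(z, _mm_load_ss(&v[2].z), 0x20);

    x = _mm_insert_ps(x, _mm_load_ss(&v[3].x), 0x30);
    y = _mm_insert_ps(y, _mm_load_ss(&v[3].y), 0x30);
    z = _mm_insert_ps(z, _mm_load_ss(&v[3].z), 0x30);

    *xs = x; *ys = y; *zs = z;
}
```

A compiler may fold each scalar load into the memory-operand form of INSERTPS, so the nine inserts need not cost nine extra load instructions.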
Another potential game improvement is the streaming load instruction MOVNTDQA. This instruction provides a fast method to execute a 16-byte aligned load from Write Combining (WC) memory, such as graphics memory, with a non-temporal hint such that the cache is not polluted. This instruction can provide a 5× to 7×[6] increase in memory throughput.
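A minimal sketch of a copy loop built on MOVNTDQA follows. The function name `stream_copy16` and the `SSE41` macro (GCC or Clang assumed) are hypothetical. The speedup described above applies only when the source is WC memory; on ordinary write-back memory the non-temporal hint is typically ignored and the loop behaves like a regular aligned copy.

```c
#include <smmintrin.h>  /* SSE4.1 intrinsics */
#include <stddef.h>

/* Assumption: GCC/Clang need SSE4.1 enabled for the calling function. */
#if defined(__GNUC__)
#define SSE41 __attribute__((target("sse4.1")))
#else
#define SSE41
#endif

/* Copies `count` 16-byte blocks from src to dst using streaming loads.
   Both pointers must be 16-byte aligned. The source pointer is non-const
   because some headers declare _mm_stream_load_si128(__m128i *). */
SSE41 static void stream_copy16(__m128i *dst, __m128i *src, size_t count)
{
    size_t i;
    for (i = 0; i < count; ++i) {
        __m128i v = _mm_stream_load_si128(&src[i]);  /* MOVNTDQA */
        _mm_store_si128(&dst[i], v);
    }
}
```

In practice the source pointer would come from a graphics or driver API that maps WC memory; obtaining such a mapping is outside the scope of this sketch.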
Intel tools
Intel Integrated Performance Primitives (IPP) Version 6.0 has over a thousand functions optimized with SSE4. The average speedup across all SSE4-optimized IPP functions vs. SSSE3-optimized IPP functions is 1.12×. Figure 5 shows which IPP categories have been optimized with SSE4 and their SSE4 speedup over SSSE3.
SSE4 instructions also enhance the compiler's ability to vectorize certain loops. Vectorization is the optimization in which the compiler rewrites a loop to use SIMD instructions, including SSE4 instructions. The Intel Compiler, Version 10.0 and later, can generate SSE4-optimized code specifically for the Penryn family of processors when invoked with the QxS compiler flag (/QxS on Windows, -xS on Linux).
Figure 5: Intel Performance Primitive categories and their SSE4 speedups over SSSE3 versions.
SSE4 instructions combined with the Super Shuffle Engine can significantly improve the performance of imaging, video, audio, multimedia, and high-performance computing applications. Users can add these instructions to their applications via assembly code or by using Intel tools such as the Intel Compiler 10.0 and IPP. For detailed information on the SSE4 instructions, including throughput, latency, and optimization guidelines, please see the Intel 64 and IA-32 Architectures Optimization Reference Manual[5] and the instruction manuals [7, 8].
Δ Any codenames featured in this document are used internally within Intel to identify products that are in development and not yet publicly announced for release. For ease of reference, some codenames have been used in this document for products that have already been released. Customers, licensees, and other third parties are not authorized by Intel to use codenames in advertising, promotion or marketing of any product or services, and any such use of Intel’s internal codenames is at the sole risk of the user.
