|
The core of the Intel® Core™ Duo processor-based technology is an enhanced Pentium® M processor 755/745¹ core
converted to 65nm process technology. The main focus of the core enhancements was to do the following:
- Support virtualization (Virtualization Technology²) [3].
- Support the new Streaming SIMD Extension (SSE3) [4].
- Address performance inefficiencies mainly in the handling of SSE/SSE2, FP (x87) and some long latency integer
instructions.
Intel® Core™ Duo processor-based technology core performance improvements
Intel® Core™ Duo processor-based technology introduces performance improvements in the following areas:
- Streaming SIMD Extensions (SSE/2/3)
- Floating Point (x87)
- Integer
The main difficulty with SSE implementation in Pentium M is caused by the fact that SSE/2/3 is a 128-bit wide
microarchitecture while the Pentium M execution core is 64 bits wide (in order to meet power and energy constraints).
Making the machine twice as wide may produce more heat and so will have a significant impact on the Thermal Design Point
(TDP) of the system as well as some impact on battery life. Since the Pentium M was primarily designed for mobility we
preferred to make it relatively narrow and cope with the SSE performance issues. The by-product of this tradeoff is
that each SSE vector operation is "broken" into 64-bit wide micro-operation (uOp) pairs. Such instructions suffer
from several performance bottlenecks in the Pentium M pipeline, mainly in the Front End (FE) of the pipeline. For example,
the Instruction Decoder in the Pentium M processor can potentially handle three instructions per cycle but only the first
decoder in a row is capable of handling complex instructions. The other two decoders are limited to single uOp instructions
only. This works fine in most cases since the most frequent instructions are single uOp. However, this is not the case with
SSE instructions: only scalar SSE operations are single uOps while the vector operations are typically 2-4 uOps. This
results in several potential bottlenecks in the FE: the Instruction Decoder in the Pentium M can only handle one SSE vector
operation per cycle, causing starvation in the rest of the machine. This bottleneck was addressed in the Intel® Core™ Duo
core: a new mechanism was introduced that allows lamination of pairs of similar uOps. This mechanism along with enhanced
uOp fusion allows handling of the SSE/2/3 vector operation by a single laminated uOp. The instruction decoders were
modified to handle three such instructions per cycle, significantly increasing the decode bandwidth of SSE vector
operations. At a certain point in the pipe the laminated uOps are un-laminated, reproducing the
64-bit wide uOp pairs that feed the machine. These changes not only improve performance of vector operations but also
save some energy, since the FE, no longer a bottleneck, can be clock gated whenever its uOp buffer is filled beyond a certain
watermark.
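The decode restriction described above can be illustrated with a toy model. This is only a sketch under simplifying
assumptions: the function name decode_cycles is hypothetical, decode is strictly in order, and every other pipeline
effect is ignored. It counts the cycles a three-wide decoder needs when only the first decoder accepts multi-uOp
instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model (hypothetical, not the real decoder logic): cycles needed to
 * decode n instructions, in order, with three decoders where only decoder 0
 * accepts instructions of more than one uOp. uops[i] is the uOp count of
 * instruction i. */
unsigned decode_cycles(const unsigned *uops, size_t n)
{
    assert(uops != NULL || n == 0);
    unsigned cycles = 0;
    size_t i = 0;
    while (i < n) {
        i++; /* decoder 0 takes the next instruction, simple or complex */
        /* decoders 1 and 2 take following instructions only if single-uOp */
        for (int d = 1; d < 3 && i < n && uops[i] == 1; d++)
            i++;
        cycles++;
    }
    return cycles;
}
```

In this model, six vector operations that each split into two uOps take six cycles to decode, while the same six
operations presented as single laminated uOps take two.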
Another bottleneck that was discovered was the handling of the floating point (FP) Control Word (CW). The FP CW is part
of the x87 state and was usually viewed as "constant"; namely it is loaded once at the beginning and stays constant
throughout the program. This is indeed how most programs use the FP CW. However, some FP
applications manipulate the "rounding control" field located in this register: the default rounding mode is
"round to nearest even," but before converting results to fixed point, some applications change the rounding control to
"chop" (this is the rule with C programs, for example, because C float-to-integer conversion truncates). The Pentium M
core handled such behavior rather inefficiently: each manipulation of the FP CW effectively stalled the pipeline until
it completed. The Intel® Core™ Duo core
introduced a new renaming mechanism for the FP CW so that four different versions of this register can coexist on the fly
without stalling the machine.
Intel® Core™ Duo also improved the latency of some long latency integer operations such as Integer Divide (IDIV).
Although these instructions are not very frequent, their extremely long latencies make their cumulative effect on
integer benchmark scores very significant. The basic Divide algorithm has remained unchanged; however,
Intel® Core™ Duo Divide logic exploits opportunities for "early exit." The Divide logic calculates in advance the number of
iterations that are required to accomplish the operation. This number is data dependent, but it is often significantly
smaller than the maximal number of iterations. Once the required number of iterations is accomplished the divider
wraps up the results. This does not change the maximal Integer Divide latency; however, on average the divide is much faster.
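The early-exit idea can be sketched with a simple shift-and-subtract divider. This is an assumption-laden illustration,
not the actual divider design: the text does not describe the hardware algorithm, a real divider typically retires
more than one quotient bit per iteration, and the names sig_bits and div_early_exit are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Number of significant bits in v (0 when v == 0). */
unsigned sig_bits(uint32_t v)
{
    unsigned n = 0;
    while (v) { n++; v >>= 1; }
    return n;
}

/* Unsigned shift-and-subtract divide that computes the iteration count up
 * front from the operand widths and runs only that many iterations. */
uint32_t div_early_exit(uint32_t dividend, uint32_t divisor,
                        uint32_t *remainder, unsigned *iterations)
{
    assert(divisor != 0);
    unsigned nd = sig_bits(dividend);
    unsigned nv = sig_bits(divisor);
    /* The quotient has at most nd - nv + 1 bits, one iteration per bit. */
    unsigned iters = (nd >= nv) ? nd - nv + 1 : 0;
    uint64_t rem = dividend;
    uint32_t q = 0;
    for (unsigned i = iters; i-- > 0; ) {
        uint64_t shifted = (uint64_t)divisor << i;
        if (rem >= shifted) {
            rem -= shifted;
            q |= (uint32_t)1 << i;
        }
    }
    *remainder = (uint32_t)rem;
    *iterations = iters;
    return q;
}
```

In this sketch, dividing 100 by 7 takes 5 iterations instead of the fixed 32 of a naive 32-bit loop, while dividing
0xFFFFFFFF by 1 still takes all 32, so the worst case is unchanged.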
Another enhancement that benefits different kinds of applications is the introduction of a new H/W prefetcher
mechanism. This mechanism identifies streaming loads at a very early stage in the machine and speculatively predicts the
future occurrences of these loads. These speculative requests are looked up in the shared L2 cache and, if they miss, they are
speculatively prefetched from the external memory. This mechanism is dynamically deactivated whenever there are many
demand requests pending (a watermark mechanism). The benefit of this change is an average reduction in load latency.
The performance implications of these enhancements for single-threaded (ST) applications as well as for multithreaded
(MT) applications are discussed in [1].
|