Intel® Centrino® Duo Mobile Technology
Intel® Technology Journal
Volume 10    Issue 02    Published May 15, 2006
ISSN 1535-864X    DOI: 10.1535/itj.1002.01

Introduction to Intel® Core™ Duo processor architecture
The improved Pentium® M processor-based cores

The core of the Intel® Core™ Duo processor-based technology is an enhanced Pentium® M processor 755/745¹ core converted to 65nm process technology. The core enhancements focused on the following:

  • Support virtualization (Virtualization Technology²) [3].
  • Support the new Streaming SIMD Extension (SSE3) [4].
  • Address performance inefficiencies mainly in the handling of SSE/SSE2, FP (x87) and some long latency integer instructions.

Intel® Core™ Duo processor-based technology core performance improvements

Intel® Core™ Duo processor-based technology introduces performance improvements in the following areas:

  • Streaming SIMD Extensions (SSE/2/3)
  • Floating Point (x87)
  • Integer

The main difficulty with the SSE implementation in the Pentium M is that SSE/2/3 vector operations are 128 bits wide, while the Pentium M execution core is 64 bits wide (in order to meet power and energy constraints). Making the machine twice as wide would produce more heat, with a significant impact on the Thermal Design Point (TDP) of the system and some impact on battery life. Since the Pentium M was designed primarily for mobility, we preferred to keep it relatively narrow and cope with the SSE performance issues. The by-product of this tradeoff is that each SSE vector operation is "broken" into pairs of 64-bit wide micro-operations (uOps).

Such instructions suffer from several performance bottlenecks in the Pentium M pipeline, mainly in the Front End (FE). For example, the Instruction Decoder in the Pentium M processor can potentially handle three instructions per cycle, but only the first decoder is capable of handling complex instructions; the other two decoders are limited to single-uOp instructions. This works well in most cases, since the most frequent instructions are single uOp. It is not the case with SSE instructions, however: only scalar SSE operations are single uOps, while vector operations are typically 2-4 uOps. As a result, the Instruction Decoder in the Pentium M can handle only one SSE vector operation per cycle, which starves the rest of the machine.

This bottleneck was addressed in the Intel® Core™ Duo core: a new mechanism was introduced that allows lamination of pairs of similar uOps. This mechanism, along with enhanced uOp fusion, allows an SSE/2/3 vector operation to be handled as a single laminated uOp. The instruction decoders were modified to handle three such instructions per cycle, significantly increasing the decode bandwidth of SSE vector operations. At a certain point, the laminated uOps streaming down the pipe are un-laminated, reproducing the 64-bit wide uOp pairs that feed the machine. These changes not only improve the performance of vector operations but also save some energy, since the FE, no longer a bottleneck, can be clock gated whenever its uOp buffer is filled beyond a certain watermark.
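
The decode-bandwidth argument can be illustrated with a toy model. The sketch below is not a description of the real decoder logic: it only assumes the rule stated above (decoder 0 accepts any instruction, decoders 1 and 2 accept single-uOp instructions only, decoding is in order). The function names `decode_cycles` and `laminate` are hypothetical, and `laminate` simply treats every multi-uOp vector operation as one laminated uOp.

```python
def decode_cycles(uop_counts):
    """Count decode cycles for an in-order stream of instructions.

    uop_counts[i] is the number of uOps produced by instruction i.
    Assumed rule: per cycle, decoder 0 takes any instruction, and
    decoders 1 and 2 each take the next instruction only if it is
    a single-uOp instruction.
    """
    cycles = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        cycles += 1
        i += 1  # decoder 0 accepts the next instruction, simple or complex
        for _ in range(2):  # decoders 1 and 2
            if i < n and uop_counts[i] == 1:
                i += 1
            else:
                break  # a complex instruction must wait for decoder 0
    return cycles


def laminate(uop_counts):
    """Toy lamination: every multi-uOp instruction becomes one laminated uOp.

    This is a simplifying assumption for illustration only.
    """
    return [1 for _ in uop_counts]


if __name__ == "__main__":
    vector_stream = [2] * 12  # twelve SSE vector ops, two 64-bit uOps each
    print(decode_cycles(vector_stream))            # one instruction per cycle
    print(decode_cycles(laminate(vector_stream)))  # three instructions per cycle
```

In this model a stream of twelve two-uOp vector operations takes twelve decode cycles without lamination and four with it, which is the threefold decode-bandwidth gain the text describes.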

Another bottleneck that was discovered was the handling of the floating point (FP) Control Word (CW). The FP CW is part of the x87 state and was usually viewed as "constant": it is loaded once at the beginning and stays unchanged throughout the program. This is indeed how most programs use the FP CW. However, some FP applications manipulate the "rounding control" field located in this register: the default rounding mode is "round to nearest even," but before converting results to fixed point, some applications change the rounding control to "chop" (this is the rule with C programs, for example). The Pentium M core treated such behavior rather inefficiently: each manipulation of the FP CW effectively stalled the pipeline until it completed. The Intel® Core™ Duo core introduced a new renaming mechanism for the FP CW so that four different versions of this register can coexist on the fly without stalling the machine.
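
To make the two rounding rules concrete, here is a small illustration. Python does not expose the x87 control word, so this sketch only contrasts the arithmetic results: the built-in `round()` rounds halfway cases to the nearest even integer, and `math.trunc()` chops toward zero. The helper name `compare_rounding` is hypothetical.

```python
import math


def compare_rounding(values):
    """Return (value, round-to-nearest-even, chop) for each input value."""
    return [(v, round(v), math.trunc(v)) for v in values]


if __name__ == "__main__":
    for value, nearest_even, chopped in compare_rounding([0.5, 1.5, 2.5, 3.7, -3.7]):
        print(value, nearest_even, chopped)
```

The results differ for inputs such as 1.5 and 3.7, which is why code that needs truncating conversions has to switch the rounding control away from its default before converting.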

Intel® Core™ Duo also improved the latency of some long-latency integer operations such as Integer Divide (IDIV). Although these instructions are not very frequent, their latencies are so long that their cumulative effect on integer benchmark scores has been shown to be very significant. The basic Divide algorithm remains unchanged; however, the Intel® Core™ Duo Divide logic exploits opportunities for "early exit." The Divide logic calculates in advance the number of iterations required to complete the operation. This number is data dependent, but it is often significantly smaller than the maximal number of iterations. Once the required number of iterations has been performed, the divider wraps up the result. This does not change the maximal Integer Divide latency; on average, however, the operation is much faster.
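
The early-exit idea can be sketched with a simple radix-2 shift-and-subtract divider. The article does not specify the actual divide algorithm, so the scheme below is an assumption chosen for illustration: the iteration count is derived in advance from the bit lengths of the operands instead of always running for the full operand width. The function name `divide_early_exit` is hypothetical.

```python
def divide_early_exit(dividend, divisor):
    """Unsigned shift-and-subtract division with a precomputed iteration count.

    Returns (quotient, remainder, iterations). The quotient always fits in
    bit_length(dividend) - bit_length(divisor) + 1 bits, so only that many
    iterations are needed, however wide the operand registers are.
    """
    if dividend < 0 or divisor <= 0:
        raise ValueError("expects dividend >= 0 and divisor > 0")

    # Data-dependent iteration count, computed before the loop starts.
    iterations = max(dividend.bit_length() - divisor.bit_length() + 1, 0)

    quotient = 0
    remainder = dividend
    for bit in range(iterations - 1, -1, -1):
        shifted = divisor << bit
        if remainder >= shifted:
            remainder -= shifted
            quotient |= 1 << bit
    return quotient, remainder, iterations


if __name__ == "__main__":
    q, r, n = divide_early_exit(100, 7)
    print(q, r, n)  # n is far below a fixed 32-iteration worst case
```

For 100 divided by 7 the loop runs five times, while a worst-case pair such as a full-width dividend and a divisor of 1 still needs as many iterations as the dividend has bits, matching the observation that the maximal latency is unchanged.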

Another enhancement that benefits different kinds of applications is the introduction of a new hardware (H/W) prefetcher mechanism. This mechanism identifies streaming loads at a very early stage in the machine and speculatively predicts future instances of these loads. These speculative requests are looked up in the shared L2 cache and, if they miss, are speculatively prefetched from external memory. The mechanism is dynamically deactivated whenever many demand requests are pending (a watermark mechanism). The benefit of this change is a reduction in average load latency.
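
A minimal software model of such a prefetcher is sketched below. It is an illustration under stated assumptions, not the hardware design: stream detection is modeled as a constant-stride check, and the class name `StreamPrefetcher`, the `confirm` threshold, and the `watermark` value are all hypothetical.

```python
class StreamPrefetcher:
    """Toy stride-based prefetcher with a demand-request watermark."""

    def __init__(self, watermark=4, confirm=2):
        self.watermark = watermark  # pending demand requests that disable prefetching
        self.confirm = confirm      # repeated strides needed before predicting
        self.last_addr = None
        self.stride = None
        self.repeats = 0
        self.l2 = set()             # addresses assumed present in the shared L2

    def observe(self, addr, pending_demand):
        """Observe one load; return the address fetched from memory, or None."""
        fetched = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.stride:
                self.repeats += 1
            else:
                self.stride = stride
                self.repeats = 0
            stream_detected = self.repeats >= self.confirm
            # Watermark: stay quiet while many demand requests are pending.
            if stream_detected and pending_demand < self.watermark:
                predicted = addr + self.stride
                if predicted not in self.l2:  # L2 miss: go to external memory
                    self.l2.add(predicted)
                    fetched = predicted
        self.last_addr = addr
        return fetched


if __name__ == "__main__":
    prefetcher = StreamPrefetcher()
    for address in range(0, 320, 64):
        print(address, prefetcher.observe(address, pending_demand=0))
```

With an idle memory system the model starts prefetching once the 64-byte stride has repeated twice; with `pending_demand` at or above the watermark it issues nothing, which mirrors the dynamic deactivation described above.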

The performance implications of these enhancements for single-threaded (ST) and multithreaded (MT) applications are discussed in [1].

  • 1 Intel® processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.
  • 2 Intel® Virtualization Technology requires a computer system with a processor, chipset, BIOS, virtual machine monitor (VMM) and applications enabled for virtualization technology. Functionality, performance or other virtualization technology benefits will vary depending on hardware and software configurations. Virtualization technology-enabled BIOS and VMM applications are currently in development.

 

