The Microarchitecture of the Pentium® 4 Processor (continued)
OVERVIEW OF THE INTEL NETBURST® MICROARCHITECTURE
A fast processor requires balancing and tuning of many microarchitectural
features that compete for processor die cost and for design and validation
efforts. Figure 1 shows the basic Intel NetBurst® microarchitecture of the
Pentium® 4 processor. As you can see, there are four main sections: the
in-order front end, the out-of-order execution engine, the integer and
floating-point execution units, and the memory subsystem.

Figure 1: Basic block diagram
In-Order Front End
The in-order front end is the part of the machine that fetches the
instructions to be executed next in the program and prepares them to be used
later in the machine pipeline. Its job is to supply a high-bandwidth stream
of decoded instructions to the out-of-order execution core, which actually
completes the instructions. The front end has highly accurate
branch prediction logic that uses the past history of program execution to
speculate where the program is going to execute next. The predicted
instruction address, from this front-end branch prediction logic, is used to
fetch instruction bytes from the Level 2 (L2) cache. These IA-32 instruction
bytes are then decoded into basic operations called uops (micro-operations)
that the execution core is able to execute.
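To illustrate how past branch history can drive a prediction, here is a minimal sketch in Python. It uses a textbook two-bit saturating counter per branch address, which is an assumption chosen for illustration and not a description of the actual Pentium 4 predictor. The names TwoBitPredictor and run_loop_branch are hypothetical.

```python
class TwoBitPredictor:
    """Toy predictor: one 2-bit saturating counter per branch address.

    Illustrative only; the real front-end predictor is not described here.
    """

    def __init__(self):
        self.counters = {}  # branch address -> counter value in 0..3

    def predict(self, addr):
        # Values 2 and 3 mean "predict taken"; unseen branches start at 1.
        return self.counters.get(addr, 1) >= 2

    def update(self, addr, taken):
        # Move the counter toward the observed outcome, saturating at 0 and 3.
        c = self.counters.get(addr, 1)
        self.counters[addr] = min(c + 1, 3) if taken else max(c - 1, 0)


def run_loop_branch(predictor, addr, iterations):
    """Simulate a loop branch: taken (iterations - 1) times, then not taken.

    Returns the number of correct predictions.
    """
    correct = 0
    for i in range(iterations):
        taken = i < iterations - 1
        if predictor.predict(addr) == taken:
            correct += 1
        predictor.update(addr, taken)
    return correct
```

Running the same ten-iteration loop twice on one predictor gives more correct predictions the second time, because the counter has already been trained by the first pass.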
The Intel NetBurst® microarchitecture has an advanced form of a Level 1 (L1)
instruction cache called the Execution Trace Cache. Unlike conventional
instruction caches, the Trace Cache sits between the instruction decode
logic and the execution core as shown in Figure 1. In this location the
Trace Cache is able to store already decoded IA-32 instructions, that is, uops.
Storing already decoded instructions removes the IA-32 decoding from the main
execution loop. Typically the instructions are decoded once and placed in
the Trace Cache and then used repeatedly from there like a normal instruction
cache on previous machines. The IA-32 instruction decoder is only used when
the machine misses the Trace Cache and needs to go to the L2 cache to get
and decode new IA-32 instruction bytes.
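The decode-once, reuse-many-times behavior described above can be sketched as a small Python model. This is a simplified illustration under stated assumptions: the cache is indexed by a single instruction address, has unlimited capacity, and never evicts. The names TraceCacheModel and toy_decode are hypothetical.

```python
class TraceCacheModel:
    """Toy model of a cache that stores decoded uops instead of raw bytes."""

    def __init__(self, decode):
        self.decode = decode    # hypothetical decoder: address -> list of uops
        self.traces = {}        # address -> previously decoded uops
        self.decoder_uses = 0   # how often the decoder was actually needed

    def fetch_uops(self, addr):
        # On a miss, decode once and remember the result.
        if addr not in self.traces:
            self.decoder_uses += 1
            self.traces[addr] = self.decode(addr)
        # On a hit, the decoder is bypassed entirely.
        return self.traces[addr]


def toy_decode(addr):
    """Stand-in decoder: pretends every instruction becomes two uops."""
    return [f"uop{addr:#x}.{i}" for i in range(2)]
```

If a three-instruction loop body runs 100 times through this model, the decoder is used only three times. That is the sense in which decoding is removed from the main execution loop.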
Out-of-Order Execution Logic
The out-of-order execution engine is where the instructions are prepared for
execution. The out-of-order execution logic has several buffers that it uses
to smooth and re-order the flow of instructions to optimize performance as
they go down the pipeline and get scheduled for execution. Instructions are
aggressively re-ordered to allow them to execute as soon as their input
operands are ready. This out-of-order execution allows instructions in the
program following delayed instructions to proceed around them as long as they
do not depend on those delayed instructions. Out-of-order execution allows
the execution resources such as the ALUs and the cache to be kept as busy as
possible executing independent instructions that are ready to execute.
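A short Python sketch shows how independent instructions can proceed around a delayed one. It is a toy model, not the actual scheduler. It assumes unlimited execution resources, that each destination register is written only once (as if registers were already renamed), and made-up latencies. The function name schedule and the tuple layout are hypothetical.

```python
def schedule(uops, initially_ready, max_cycles=1000):
    """Issue each uop in the first cycle in which all of its sources are ready.

    uops: list of (name, dest, sources, latency) tuples in program order.
    Returns a list of (cycle, name) issue events.
    """
    ready_at = {reg: 0 for reg in initially_ready}  # register -> cycle available
    waiting = list(uops)
    issued = []
    cycle = 0
    while waiting:
        if cycle >= max_cycles:
            raise RuntimeError("some uop never became ready")
        still_waiting = []
        for name, dest, sources, latency in waiting:
            if all(ready_at.get(s, float("inf")) <= cycle for s in sources):
                issued.append((cycle, name))
                ready_at[dest] = cycle + latency  # result appears after latency
            else:
                still_waiting.append((name, dest, sources, latency))
        waiting = still_waiting
        cycle += 1
    return issued


program = [
    ("load", "r1", ["r0"], 10),       # slow: for example, a cache miss
    ("add",  "r2", ["r1"], 1),        # depends on the load
    ("mul",  "r3", ["r4", "r5"], 3),  # independent of the load
    ("sub",  "r6", ["r3"], 1),        # depends only on the mul
]
events = schedule(program, ["r0", "r4", "r5"])
# events == [(0, "load"), (0, "mul"), (3, "sub"), (10, "add")]
```

In this example the mul and sub issue long before the add, even though they come later in program order, because they do not depend on the delayed load.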
The retirement logic is what reorders the instructions, executed in an
out-of-order manner, back to the original program order. This retirement
logic receives the completion status of the executed instructions from the
execution units and processes the results so that the proper architectural
state is committed (or retired) according to the program order. The Pentium
4 processor can retire up to three uops per clock cycle. This retirement
logic ensures that exceptions occur only if the operation causing the
exception is the oldest, non-retired operation in the machine. This logic
also reports branch history information to the branch predictors at the front
end of the machine so they can train with the latest known-good branch-history
information.
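In-order retirement can be sketched with a simple queue held in program order. The limit of three uops per clock comes from the text above. Everything else is a simplified assumption made for illustration, including the name retire_cycle and the dictionary layout of each entry. Exception handling is not modeled.

```python
from collections import deque

RETIRE_WIDTH = 3  # the text states that up to three uops retire per clock


def retire_cycle(rob):
    """Retire the oldest completed uops, in program order, for one clock.

    rob: deque of {"name": str, "done": bool} entries, oldest first.
    Retirement stops at the first uop that has not finished executing,
    even if younger uops behind it are already done.
    """
    retired = []
    while rob and rob[0]["done"] and len(retired) < RETIRE_WIDTH:
        retired.append(rob.popleft()["name"])
    return retired
```

A younger uop that finished early still waits behind an older, unfinished one. That ordering is what lets the architectural state be committed according to the program order.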
Integer and Floating-Point Execution Units
The execution units are where the instructions are actually executed. This
section includes the register files that store the integer and floating-point
data operand values that the instructions need to execute. It also includes
several types of integer and floating-point execution units that compute the
results, as well as the L1 data cache that is used for most load and store
operations.
Memory Subsystem
Figure 1 also shows the memory subsystem. This includes the L2 cache and the
system bus. The L2 cache stores both instructions and data that cannot fit
in the Execution Trace Cache and the L1 data cache. The external system bus
is connected to the backside of the second-level cache and is used to access
main memory when the L2 cache has a cache miss, and to access the system I/O
resources.
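The lookup order for a data load can be sketched as follows. This is a deliberately simplified Python model that uses plain dictionaries as caches. The fill policy (copying the data into both cache levels on a miss) is an assumption, and capacity, eviction, and timing are not modeled. The function name load is hypothetical.

```python
def load(addr, l1, l2, memory):
    """Return (value, level), where level names what served the access."""
    if addr in l1:
        return l1[addr], "L1"
    if addr in l2:
        l1[addr] = l2[addr]      # assumed: fill L1 on an L2 hit
        return l1[addr], "L2"
    value = memory[addr]         # L2 miss: go over the system bus to main memory
    l2[addr] = value             # assumed: fill both cache levels on the way back
    l1[addr] = value
    return value, "memory"
```

The first access to an address is served by main memory, and later accesses are served by the caches for as long as the data stays there.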