An Overview of the Intel TFLOPS Supercomputer (Continued)

Introduction
From the beginning of the computer era, scientists and engineers have posed problems that could not be solved on routinely available computer systems. These problems required large amounts of memory and vast numbers of floating point computations. The special computers built to solve these large problems were called supercomputers.
Among these problems, certain ones stand out by virtue of the extraordinary demands they place on a supercomputer. For example, the best climate modeling programs solve, at each time step, separate models for the ocean, the atmosphere, and solar radiation. This leads to enormous multi-physics simulations that challenge the most powerful supercomputers in the world.
So what is the most powerful supercomputer in the world? To answer this question we must first agree on how to measure a computer's power. One possibility is to measure a system's peak rate for carrying out floating point arithmetic. In practice, however, these rates are only rarely approached. A more realistic approach is to use a common application to measure computer performance. Since computational linear algebra is at the heart of many scientific problems, the de facto standard benchmark has become the linear algebra benchmark, LINPACK [1,7].
The LINPACK benchmark measures the time it takes to solve a dense system of linear equations. Originally, the system size was fixed at 100, and users of the benchmark had to run a specific code. This form of the benchmark, however, tested the quality of compilers, not the relative speeds of computer systems. To make it a better computer performance metric, the LINPACK benchmark was extended to systems with 1000 linear equations, and as long as residual tests were passed, any benchmark implementation, tuning, or assembly coding was allowed. This worked quite well until computer performance increased to a point where even the LINPACK-1000 benchmark took an insignificant amount of time. So, about 15 years ago, the rules for the LINPACK benchmark were modified so any size linear system could be used. This resulted in the MP-LINPACK benchmark.
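To make the measurement concrete, the following is a minimal sketch of a LINPACK-style run: it times the solution of a random dense system, checks the residual, and converts the time to a floating point rate. It is not the benchmark code itself. The function names, the random test matrix, and the pure-Python Gaussian elimination are illustrative assumptions; the operation count 2n^3/3 + 2n^2 is the figure conventionally used to convert LINPACK solve times into rates.

```python
import random
import time


def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on copies so the caller's data can be reused for the residual check.
    a = [row[:] for row in a]
    x = b[:]
    for k in range(n):
        # Partial pivoting: move the largest entry of column k onto the diagonal.
        p = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ValueError("matrix is singular")
        a[k], a[p] = a[p], a[k]
        x[k], x[p] = x[p], x[k]
        # Eliminate column k from the rows below the pivot.
        for i in range(k + 1, n):
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            x[i] -= m * x[k]
    # Back substitution on the upper-triangular system.
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (x[i] - s) / a[i][i]
    return x


def max_residual(a, b, x):
    """Largest absolute entry of a x - b (a simple residual test)."""
    return max(
        abs(sum(aij * xj for aij, xj in zip(row, x)) - bi)
        for row, bi in zip(a, b)
    )


def linpack_like_rate(n, seed=0):
    """Return (floating point operations per second, max residual) for size n."""
    rng = random.Random(seed)
    a = [[rng.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(n)]
    b = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    start = time.perf_counter()
    x = solve_dense(a, b)
    elapsed = max(time.perf_counter() - start, 1e-12)  # guard against a zero timer reading
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2  # conventional LINPACK operation count
    return flops / elapsed, max_residual(a, b, x)


if __name__ == "__main__":
    rate, residual = linpack_like_rate(100)
    print(f"n=100: {rate / 1e6:.2f} MFLOPS, max residual {residual:.2e}")
```

Calling `linpack_like_rate` with larger values of `n` mirrors the idea behind the later rule change: the problem size grows until the solve time is long enough to be a meaningful measurement on the machine being tested.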
Using the MP-LINPACK benchmark as our metric, we can revisit our original question: which computer is the most powerful supercomputer in the world? Table 1 answers this question by listing the MP-LINPACK world record holders of the 1990s.
All the machines in Table 1 are massively parallel processor (MPP) supercomputers. Furthermore, all the machines are based on Commercial Commodity Off the Shelf (CCOTS) microprocessors. Finally, all the machines achieve their high performance with scalable interconnection networks that let them use large numbers of processors.

The current record holder is a supercomputer built by Intel for the U.S. Department of Energy (DOE). In December 1996, this machine, known as the ASCI Option Red Supercomputer, ran the MP-LINPACK benchmark at a rate of 1.06 trillion floating point operations per second (TFLOPS). This was the first time the MP-LINPACK benchmark had ever been run in excess of 1 TFLOPS. In June 1997, when the full machine was installed, we reran the benchmark and achieved a rate of 1.34 TFLOPS.
In Table 2, we briefly summarize the machine's key parameters. The numbers are impressive. It occupies 1,600 sq. ft. of floor space (not counting supporting network resources, tertiary storage, and other supporting hardware). The system's 9,216 Pentium® Pro processors with 596 Gbytes of RAM are connected through a 38 x 32 x 2 mesh. The system has a peak computation rate of 1.8 TFLOPS and a cross-section bandwidth (measured across the two 38 x 32 planes) of over 51 GB/sec.
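A few derived figures help put these parameters in perspective. The short sketch below uses only numbers quoted in the text (the June 1997 MP-LINPACK rate, the peak rate, the processor count, the RAM size, and the mesh dimensions). The function name and the choice of 1 Gbyte = 1024 Mbytes are assumptions made for illustration.

```python
def summarize_option_red():
    """Derive simple ratios from the figures quoted in the text."""
    peak_tflops = 1.8          # peak computation rate
    measured_tflops = 1.34     # MP-LINPACK rate on the full machine, June 1997
    processors = 9216          # Pentium Pro processors
    ram_gbytes = 596           # total RAM
    mesh = (38, 32, 2)         # interconnect mesh dimensions

    return {
        # Fraction of peak delivered on MP-LINPACK.
        "efficiency": measured_tflops / peak_tflops,
        # Peak rate attributable to each processor, in MFLOPS.
        "peak_mflops_per_processor": peak_tflops * 1e6 / processors,
        # Average memory per processor, assuming 1 Gbyte = 1024 Mbytes.
        "ram_mbytes_per_processor": ram_gbytes * 1024 / processors,
        # Number of positions in the 38 x 32 x 2 mesh.
        "mesh_positions": mesh[0] * mesh[1] * mesh[2],
    }


if __name__ == "__main__":
    for name, value in summarize_option_red().items():
        print(f"{name}: {value:.4g}")
```

On these numbers, the full machine delivered roughly 74% of its peak rate on MP-LINPACK, which illustrates the earlier point that peak rates are only rarely approached in practice.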
Getting so much hardware to work together in a single supercomputer was challenging. Equally challenging was the problem of developing operating systems that can run on such a large scalable system. For the ASCI Option Red Supercomputer, we used different operating systems for different parts of the machine. The nodes involved with computation (compute nodes) run an efficient, small operating system called Cougar. The nodes that support interactive user services (service nodes) and booting services (system nodes) run a distributed UNIX operating system. The two operating systems work together so the user sees the system as a single integrated supercomputer. These operating systems and how they support scalable computation, I/O, and high performance communication are discussed in another paper in this Q1'98 issue of the Intel Technology Journal entitled Achieving Large Scale Parallelism Through Operating System Resource Management on the Intel TFLOPS Supercomputer [8].
When scaling to so many nodes, even low probability points of failure can become a major problem. To build a robust system with so many nodes, the hardware and software must be explicitly designed for Reliability, Availability, and Serviceability (RAS). All major components are hot-swappable and repairable while the system remains under power. If several applications are running on the system when a component fails, only the application using the failed component will shut down. In many cases, other applications continue to run while the failed components are replaced. For example, any of the 4,536 compute nodes and 16 on-line hot spares can be replaced without having to cycle the power of any other module. Similarly, system operation can continue if any of the 308 patch service boards (to support RAS functionality), 640 disks, 1,540 power supplies, or 616 interconnection facility (ICF) back-planes should fail.
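The effect of scale on reliability can be illustrated with a simple probability model. The sketch below assumes that components fail independently and that every component has the same failure probability over some fixed interval. Both the model and the per-component probability of 0.001 are hypothetical and are not measurements from this machine; only the component counts come from the text.

```python
def prob_any_failure(p_single, count):
    """Probability that at least one of `count` independent components fails,
    given each fails with probability `p_single` over the interval."""
    return 1.0 - (1.0 - p_single) ** count


# Component counts quoted in the text.
COMPONENT_COUNTS = {
    "compute nodes": 4536,
    "patch service boards": 308,
    "disks": 640,
    "power supplies": 1540,
    "ICF back-planes": 616,
}

if __name__ == "__main__":
    p = 0.001  # hypothetical per-component failure probability over the interval
    for name, count in COMPONENT_COUNTS.items():
        print(f"{name:22s} {count:5d}  P(at least one failure) = "
              f"{prob_any_failure(p, count):.3f}")
```

Under these assumptions, a failure that is rare for any single component becomes close to certain somewhere among thousands of them, which is why the ability to replace a failed part without interrupting the rest of the system matters so much at this scale.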

Keeping track of the status of such a large scalable supercomputer and controlling its RAS capabilities is a difficult job. The system responsible for this job is the Scalable Platform Services (SPS). The design and function of SPS are described in another paper in this issue of the Intel Technology Journal entitled Scalable Platform Services on the Intel TFLOPS Supercomputer [9].
Finally, a supercomputer is only of value if it delivers super performance on real applications. Between MP-LINPACK and production applications running on the machine, significant results have been produced. Some of these results and a detailed discussion of performance issues related to the system are described in another paper in this issue of the Intel Technology Journal entitled The Performance of the Intel TFLOPS Supercomputer [10].
In this paper, we describe the motivation behind this machine, the system hardware and software, and how the system is used by both programmers and end users. The level of detail varies. When a topic is addressed elsewhere, we discuss it only briefly; for example, we say very little about SPS. When a topic is not discussed elsewhere, we go into great detail, as we do for the node boards in this computer. As a result, the level of detail in this paper is uneven, but, in our opinion, this is acceptable given the importance of getting the full story about this machine into the literature.