The Intel MKL implementation of the ScaLAPACK library is specially tuned for Itanium®, Intel® Xeon®, and Intel® Pentium® processor-based systems. ScaLAPACK covers two areas of linear algebra: direct solvers and eigenvalue problems. We therefore look at both PDGETRF (a direct solver routine used for solving linear systems of equations) and PDSYEV (used for solving eigenvalue problems). PDGETRF (Parallel, Double precision, GEneral, TRiangular matrix Factorization) is a key function in the linear equations solver area because it is a general factorization routine that applies to many classes of matrices, and because the lower-upper (LU) factorization that it performs is the performance-intensive portion of a linear equations solver.
In our tests, we compare the Intel MKL implementation of ScaLAPACK to the publicly available implementation from NETLIB. We show the performance of NETLIB ScaLAPACK using BLAS from Intel MKL as well as from ATLAS*. More information on the ScaLAPACK library is available at http://www.netlib.org/scalapack/.
Raw Performance
Figure 2 shows performance on a 32-node cluster with 64 Intel Xeon processors for various problem and memory sizes.
Figure 2 illustrates that:
1. Intel MKL ScaLAPACK significantly outperforms NETLIB ScaLAPACK.
2. The advantage of Intel MKL is larger still when NETLIB ScaLAPACK is linked against ATLAS* BLAS.
Figure 2: PDGETRF Performance Comparison Varying Problem Size
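When reproducing a comparison of this kind, LU factorization performance is commonly expressed as a floating-point rate, using the standard leading-order operation count of roughly 2N³/3 for an N×N matrix. The following is a minimal sketch of that conversion. The function names are hypothetical, and the timings in the usage loop are made-up placeholders, not measurements taken from Figure 2.

```python
def lu_flop_count(n):
    """Leading-order floating-point operation count for LU of an n x n matrix."""
    return (2.0 / 3.0) * n ** 3


def lu_gflops(n, seconds):
    """Achieved rate in GFLOP/s for a factorization that took `seconds`."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return lu_flop_count(n) / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical timings for illustration only; not data from the article.
    for n, seconds in [(10000, 12.0), (20000, 80.0)]:
        print(f"N={n}: {lu_gflops(n, seconds):.1f} GFLOP/s")
```

Expressing results as a rate rather than as elapsed time makes runs at different problem sizes directly comparable on one plot.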
Because NETLIB ScaLAPACK requires users to link to an implementation of BLAS, the gains from Intel's ScaLAPACK optimizations can be separated from those due to its BLAS optimizations. A comparison of Intel MKL with NETLIB, where both use Intel MKL BLAS, shows that the optimizations Intel has made specifically for ScaLAPACK account for a 15% performance advantage over NETLIB ScaLAPACK. The combined optimizations in Intel MKL ScaLAPACK and BLAS deliver an overall improvement of approximately 50% when compared to NETLIB ScaLAPACK using ATLAS* BLAS.
Figure 3 below looks at PDSYEV, which computes the eigenvalues and eigenvectors of a real symmetric matrix. On the same 32-node (64-core) cluster of Intel® Xeon® processors, Intel MKL delivers double the performance of NETLIB ScaLAPACK.
Figure 3: PDSYEV Performance Comparison Varying Problem Size
A major benefit of distributed memory parallel computing (clusters) is the ability to scale to very large parallel systems. Cluster users therefore have a particular interest in how well software performance scales with system size. The classic test is to increase the problem size in proportion to the number of nodes and observe how closely performance follows linear growth. Figure 4 below shows the result of this test: Intel MKL provides substantial gains over NETLIB ScaLAPACK using ATLAS* BLAS on large systems.
Figure 4: Performance Comparison Varying Cluster Size
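The scaled test described above can be sketched in a few lines. This sketch assumes that "proportional" means holding the memory per node constant, so that the matrix order N grows with the square root of the node count (an N×N matrix holds N² elements). It also defines scaling efficiency as measured speedup divided by ideal linear speedup. Both function names are hypothetical, and the article does not state which convention its test used.

```python
import math


def scaled_matrix_order(base_order, base_nodes, nodes):
    """Matrix order that keeps the number of matrix elements per node constant.

    Assumes a dense N x N matrix, so total memory grows as N**2.
    """
    return int(round(base_order * math.sqrt(nodes / base_nodes)))


def scaling_efficiency(base_rate, base_nodes, rate, nodes):
    """Measured speedup divided by ideal linear speedup (1.0 means perfect)."""
    return (rate / base_rate) / (nodes / base_nodes)


if __name__ == "__main__":
    # Hypothetical baseline: order 10000 on 1 node, scaled up to 4, 16, 32 nodes.
    for nodes in (4, 16, 32):
        print(nodes, scaled_matrix_order(10000, 1, nodes))
```

An efficiency close to 1.0 across node counts corresponds to the near-linear growth that the test looks for.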
Block Size Robustness
When running ScaLAPACK, you must decide how to “block” your data. Distributing your data among nodes involves choosing an appropriate block size, which determines the amount of data that goes to each node. Making this choice takes effort, and choosing the wrong block size can have a significant adverse effect on performance. The Intel MKL implementation of ScaLAPACK is tolerant of block size differences. Figure 5 below shows that Intel MKL 9.0 provides approximately the same high performance regardless of block size. The same cannot be said for NETLIB ScaLAPACK.
Figure 5: Performance Comparison Varying Block Size
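To make the role of the block size concrete, the sketch below shows how a block-cyclic distribution, the layout ScaLAPACK uses, assigns rows (or columns) to processes along one dimension of the process grid. The counting logic is modeled on ScaLAPACK's NUMROC utility, but this is an illustrative Python rewrite with hypothetical function names and 0-based indices, not the library routine itself.

```python
def block_owner(global_index, block_size, nprocs, src_proc=0):
    """Process that owns a 0-based global row/column index."""
    return (src_proc + global_index // block_size) % nprocs


def local_count(n, block_size, proc, nprocs, src_proc=0):
    """Number of rows/columns of an n-long dimension stored on `proc`."""
    # Distance of this process from the one holding the first block.
    dist = (nprocs + proc - src_proc) % nprocs
    full_blocks = n // block_size
    # Every process gets this many complete rounds of full blocks.
    count = (full_blocks // nprocs) * block_size
    extra_blocks = full_blocks % nprocs
    if dist < extra_blocks:
        count += block_size          # one more full block
    elif dist == extra_blocks:
        count += n % block_size      # the trailing partial block, if any
    return count


if __name__ == "__main__":
    n, nprocs = 10, 3
    for nb in (1, 2, 4):
        counts = [local_count(n, nb, p, nprocs) for p in range(nprocs)]
        print(f"block size {nb}: rows per process = {counts}")
```

Small blocks spread the data more evenly across processes, while large blocks give each process bigger contiguous pieces to work on. That trade-off is one reason the choice of block size affects performance.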