|
Data mining is the extraction of hidden predictive information from large
databases. Emerging data-mining applications are important factors driving the
architecture of future microprocessors. This paper analyzes the performance
scalability of such applications on parallel architectures to understand how
best to architect the next generation of microprocessors, which will have many
CPU cores on chip.
Bioinformatics is one of the most active research areas in computer science,
and it relies heavily on many types of data-mining techniques. In this paper,
we report on the performance scalability analysis of six bioinformatics
applications on a 16-way SMP system based on Intel® Xeon™ microprocessors.
These applications are very compute intensive, and they
manipulate very large data sets; many of them are freely accessible.
Bioinformatics is a good proxy for workload analysis of general data-mining
applications. Our experiments show that these applications exhibit good
parallel behavior after some algorithm-level reformulation or careful
parallelism selection. Most of them scale well with increased numbers of
processors, with a speed-up of up to 14.4X on 16 processors.
We start with an introduction to data mining. The data-mining techniques
studied are briefly described, and the selected workloads using these
techniques are listed. We then provide a brief description of the methodology
used for the studies. We present the scalability analysis of three workloads
related to Bayesian Network (BN) structure, two workloads relevant to
recognition, and one workload related to optimization. We conclude with the key
lessons of the study: these workloads are compute intensive and data parallel,
and they manipulate large amounts of data that stress the cache hierarchy.
Techniques that optimize the use of caches are key to ensuring performance
scalability of these workloads on parallel architectures.
|