Decision Tree Construction for Data Mining on Cluster of Shared-Memory Multiprocessors
Abstract
Classification of very large datasets is a challenging problem in data
mining. It is desirable to have decision-tree classifiers that can
handle large datasets, because a large dataset often increases the
accuracy of the resulting classification model. Classification tree
algorithms can benefit from parallelization because of large memory
and computation requirements for handling large datasets. Clusters of
shared-memory multiprocessors (SMPs), in which each shared-memory node
has a small number of processors (e.g., 2--8 processors) and is
connected to the other nodes via a high-speed interconnect, have
become a popular alternative to pure distributed-memory and
shared-memory machines. A cluster of SMPs provides a two-tier
architecture, in which a combination of shared-memory and
distributed-memory paradigms can be employed. In this paper we
investigate decision tree construction on a cluster of SMPs. We
present an algorithm that employs a hybrid approach. The
classification training dataset is partitioned across the SMP nodes so
that each SMP node performs tree construction using a subset of the
records in the dataset. Within each SMP node, on the other hand, the
tasks associated with individual attributes are dynamically scheduled
onto the lightweight threads running on that node. We present experimental
results on a Linux PC cluster with dual-processor SMP nodes.
(Also cross-referenced as UMIACS-TR-2000-78)
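The two-tier hybrid scheme described in the abstract — records partitioned across SMP nodes, with attribute tasks dynamically scheduled onto threads within each node — can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the trivial split-scoring placeholder, and the use of a work queue for dynamic scheduling are illustrative assumptions.

```python
import threading
import queue

def best_split_for_attribute(partition, attribute):
    # Placeholder split evaluation: in a real classifier this would scan
    # the local records and compute a split criterion (e.g., gini index)
    # for this attribute; here we just return a toy score.
    return (attribute, sum(rec[attribute] for rec in partition))

def node_worker(partition, tasks, results, lock):
    # Each lightweight thread on an SMP node repeatedly pulls the next
    # available attribute task from the shared queue (dynamic scheduling).
    while True:
        try:
            attribute = tasks.get_nowait()
        except queue.Empty:
            return
        split = best_split_for_attribute(partition, attribute)
        with lock:
            results.append(split)

def evaluate_node(partition, num_attributes, num_threads=2):
    # Intra-node tier: attribute tasks are placed on a queue and consumed
    # by a small pool of threads (e.g., 2 on a dual-processor node).
    tasks = queue.Queue()
    for a in range(num_attributes):
        tasks.put(a)
    results, lock = [], threading.Lock()
    threads = [threading.Thread(target=node_worker,
                                args=(partition, tasks, results, lock))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The inter-node tier (partitioning the training records across SMP nodes and combining per-node split statistics) would sit above `evaluate_node`, typically via message passing between nodes; it is omitted here for brevity.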