Fast Sequential and Parallel Algorithms for Association Rule Mining:
A Comparison
Fast Sequential and Parallel Algorithms for Association Rule Mining:
A Comparison
Loading...
Files
Publication or External Link
Date
1998-10-15
Authors
Mueller, Andreas
Advisor
Citation
DRUM DOI
Abstract
The field of knowledge discovery in databases, or "Data Mining",
has received increasing attention during recent years as large
organizations have begun to realize the potential value of the
information that is stored implicitly in their databases. One specific
data mining task is the mining of Association Rules, particularly from
retail data. The task is to determine patterns (or rules) that
characterize the shopping behavior of customers from a large database of
previous consumer transactions. The rules can then be used to focus
marketing efforts such as product placement and sales promotions.
Because early algorithms required an unpredictably large number
of IO operations, reducing IO cost has been the primary target of the
algorithms presented in the literature. One of the most recent proposed
algorithms, called PARTITION, uses a new TID-list data representation and
a new partitioning technique. The partitioning technique reduces IO cost
to a constant amount by processing one database portion at a time in
memory. We implemented an algorithm called SPTID that incorporates both
TID-lists and partitioning to study their benefits. For comparison, a
non-partitioning algorithm called SEAR, which is based on a new
prefix-tree data structure, is used. Our experiments with SPTID and SEAR
indicate that TID-lists have inherent inefficiencies; furthermore,
because all of the algorithms tested tend to be CPU-boundn trading
CPU-overhead against I/O operations by partitioning did not lead to
better performance.
In order to scale mining algorithms to the huge databases (e.g.,
multiple Terabytes) that large organizations will manage in the near
future, we implemented parallel versions of SEAR and SPEAR (its
partitioned counterpart). The performance results show that, while both
algorithms parallelize easily and obtain good speedup and scale-up
results, the parallel SEAR version performs better than parallel SPEAR,
despite the fact that it uses more communication.