University of Maryland DRUM  
University of Maryland Digital Repository at the University of Maryland

Please use this identifier to cite or link to this item: http://hdl.handle.net/1903/437

Title: Fast Sequential and Parallel Algorithms for Association Rule Mining: A Comparison
Authors: Mueller, Andreas
Type: Technical Report
Issue Date: 15-Oct-1998
Series/Report no.: UM Computer Science Department; CS-TR-3515
Abstract: The field of knowledge discovery in databases, or "Data Mining", has received increasing attention in recent years as large organizations have begun to realize the potential value of the information stored implicitly in their databases. One specific data mining task is the mining of Association Rules, particularly from retail data. The task is to determine patterns (or rules) that characterize the shopping behavior of customers from a large database of previous consumer transactions. The rules can then be used to focus marketing efforts such as product placement and sales promotions. Because early algorithms required an unpredictably large number of I/O operations, reducing I/O cost has been the primary target of the algorithms presented in the literature. One of the most recently proposed algorithms, called PARTITION, uses a new TID-list data representation and a new partitioning technique. The partitioning technique reduces I/O cost to a constant amount by processing one database portion at a time in memory. We implemented an algorithm called SPTID that incorporates both TID-lists and partitioning to study their benefits. For comparison, a non-partitioning algorithm called SEAR, which is based on a new prefix-tree data structure, is used. Our experiments with SPTID and SEAR indicate that TID-lists have inherent inefficiencies; furthermore, because all of the algorithms tested tend to be CPU-bound, trading CPU overhead against I/O operations by partitioning did not lead to better performance. In order to scale mining algorithms to the huge databases (e.g., multiple terabytes) that large organizations will manage in the near future, we implemented parallel versions of SEAR and SPEAR (its partitioned counterpart). The performance results show that, while both algorithms parallelize easily and obtain good speedup and scale-up results, the parallel SEAR version performs better than parallel SPEAR, despite the fact that it uses more communication.
URI: http://hdl.handle.net/1903/437
Appears in Collections: Technical Reports of the Computer Science Department
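
Illustration (not part of the report): the abstract refers to a TID-list data representation. As a rough sketch of the general idea only, and not the report's SPTID or PARTITION implementation, each item can be mapped to the set of IDs of the transactions that contain it, so that the support of an itemset is the size of the intersection of its items' TID-lists. The function names `build_tid_lists` and `support` and the sample transactions are hypothetical, chosen for this example.

```python
from collections import defaultdict


def build_tid_lists(transactions):
    """Map each item to the set of transaction IDs (TIDs) that contain it.

    Hypothetical helper: a TID is simply the transaction's position in the list.
    """
    tid_lists = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            tid_lists[item].add(tid)
    return dict(tid_lists)


def support(itemset, tid_lists):
    """Count the transactions that contain every item in `itemset`.

    The count is the size of the intersection of the items' TID-lists.
    An item that never occurs contributes an empty TID-list.
    """
    lists = [tid_lists.get(item, set()) for item in itemset]
    if not lists:
        return 0
    return len(set.intersection(*lists))


# Made-up retail transactions, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
tid_lists = build_tid_lists(transactions)
pair_support = support({"bread", "milk"}, tid_lists)
triple_support = support({"bread", "butter", "milk"}, tid_lists)
```

Here `pair_support` counts the transactions containing both bread and milk, and `triple_support` counts those containing all three items. A partitioned variant would, under the same assumptions, build such lists for one in-memory portion of the database at a time.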

Files in This Item:

File            Description                           Size       Format      No. of Downloads
CS-TR-3515.pdf  Auto-generated copy of CS-TR-3515.ps  552.83 kB  Adobe PDF   681
CS-TR-3515.ps                                         795.49 kB  Postscript  303

All items in DRUM are protected by copyright, with all rights reserved.

 

DRUM is brought to you by the University of Maryland Libraries
University of Maryland, College Park, MD 20742-7011 (301)314-1328.