Quantifiable Data Mining Using Principal Component Analysis

dc.contributor.author: Korn, Flip [en_US]
dc.contributor.author: Labrinidis, Alexandros [en_US]
dc.contributor.author: Kotidis, Yannis [en_US]
dc.contributor.author: Faloutsos, Christos [en_US]
dc.contributor.author: Kaplunovich, Alex [en_US]
dc.contributor.author: Perkovic, Dejan [en_US]
dc.date.accessioned: 2004-05-31T22:44:25Z
dc.date.available: 2004-05-31T22:44:25Z
dc.date.created: 1997-02 [en_US]
dc.date.issued: 1998-10-15 [en_US]
dc.description.abstract: Association Rule Mining algorithms operate on a data matrix (e.g., customers x products) to derive rules. We propose a single-pass algorithm for mining linear rules in such a matrix based on Principal Component Analysis (PCA). PCA detects correlated columns of the matrix, which correspond to, e.g., products that sell together. The first contribution of this work is that we propose to quantify the "goodness" of a set of discovered rules. We define the "guessing error": the root-mean-square error of the reconstructed values of the cells of the given matrix, when we pretend that they are unknown. The second contribution is a novel method to guess missing/hidden values from the linear rules that our method derives. For example, if somebody bought $10 of milk and $3 of bread, our rules can "guess" the amount spent on, say, butter. Thus, we can perform a variety of important tasks such as forecasting, "what-if" scenarios, outlier detection, and visualization. Moreover, we show that we can compute the principal components with a single pass over the dataset. Experiments on real datasets (e.g., NBA statistics) demonstrate that the proposed method consistently achieves a "guessing error" up to 5 times lower than that of the straightforward competitor. (Also cross-referenced as UMIACS-TR-97-13) [en_US]
dc.format.extent: 692963 bytes
dc.format.mimetype: application/postscript
dc.identifier.uri: http://hdl.handle.net/1903/879
dc.language.iso: en_US
dc.relation.isAvailableAt: Digital Repository at the University of Maryland [en_US]
dc.relation.isAvailableAt: University of Maryland (College Park, Md.) [en_US]
dc.relation.isAvailableAt: Tech Reports in Computer Science and Engineering [en_US]
dc.relation.isAvailableAt: UMIACS Technical Reports [en_US]
dc.relation.ispartofseries: UM Computer Science Department; CS-TR-3754 [en_US]
dc.relation.ispartofseries: UMIACS; UMIACS-TR-97-13 [en_US]
dc.title: Quantifiable Data Mining Using Principal Component Analysis [en_US]
dc.type: Technical Report [en_US]
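
The abstract names three ideas: principal components computed in a single pass, guessing hidden cells from the derived linear rules, and the "guessing error" as a quality measure. The Python sketch below illustrates them under stated assumptions and is not the report's own code. This record does not give the algorithmic details, so the function names, the covariance-accumulation scheme, and the least-squares fill for hidden cells are all illustrative choices.

```python
import numpy as np

def single_pass_stats(rows):
    """Accumulate the count, column sums, and X^T X in one pass over the rows."""
    n, s, ss = 0, None, None
    for row in rows:
        x = np.asarray(row, dtype=float)
        if s is None:
            s = np.zeros_like(x)
            ss = np.zeros((x.size, x.size))
        n += 1
        s += x
        ss += np.outer(x, x)
    mean = s / n
    cov = ss / n - np.outer(mean, mean)  # population covariance
    return mean, cov

def principal_components(cov, k):
    """Return, as columns, the k eigenvectors of cov with the largest eigenvalues."""
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, order]

def guess_hidden(row, hidden, mean, V):
    """Fill the hidden cells of a row from its known cells (assumed approach)."""
    x = np.asarray(row, dtype=float)
    known = np.setdiff1d(np.arange(x.size), hidden)
    # Least-squares coordinates in the rule subspace that best fit the known cells.
    coeffs, *_ = np.linalg.lstsq(V[known], x[known] - mean[known], rcond=None)
    filled = x.copy()
    filled[hidden] = mean[hidden] + V[hidden] @ coeffs
    return filled

def guessing_error(X, mean, V):
    """Root-mean-square error when each cell is hidden in turn and reconstructed."""
    X = np.asarray(X, dtype=float)
    sq = 0.0
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            sq += (guess_hidden(X[i], [j], mean, V)[j] - X[i, j]) ** 2
    return np.sqrt(sq / X.size)
```

With made-up spending rows in which milk, bread, and butter always occur in the ratio 10 : 3 : 2, a single component is enough, and `guess_hidden([10.0, 3.0, 0.0], [2], mean, V)` recovers a butter amount of about 2. On real data the rows would not lie exactly on a line, so the guessing error would be nonzero.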

Files

Original bundle
Name: CS-TR-3754.ps
Size: 676.72 KB
Format: Postscript Files

Name: CS-TR-3754.pdf
Size: 567.98 KB
Format: Adobe Portable Document Format
Description: Auto-generated copy of CS-TR-3754.ps