On the Efficiency of Nearest Neighbor Searching with Data Clustered in Lower Dimensions
Abstract
In nearest neighbor searching we are given a set of n data points in real d-dimensional space, R^d, and the problem is to preprocess these points into a data structure so that, given a query point, the nearest data point to the query point can be reported efficiently. Because data sets can be quite large, we are interested in data structures that use optimal O(dn) storage. Given the limitation of linear storage, the best known data structures suffer from expected-case query times that grow exponentially in d. However, it is widely observed in practice that data sets in high dimensional spaces tend to consist of clusters residing in much lower dimensional subspaces. This raises the question of whether data structures for nearest neighbor searching adapt to the presence of lower dimensional clustering, and, if so, how performance varies when the clusters are aligned with the coordinate axes.
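To make the problem statement concrete, here is a minimal brute-force sketch of nearest neighbor searching in R^d; it uses the optimal O(dn) storage but spends O(dn) time per query, which is the baseline that tree-based structures such as the kd-tree aim to beat. The function name and example points are illustrative, not from the thesis.

```python
import math

def nearest_neighbor(points, query):
    """Brute-force nearest neighbor: a linear scan over all n points.

    `points` is a list of d-dimensional tuples. Storage is O(dn), the
    optimum, but every query also costs O(dn) time.
    """
    best, best_dist = None, math.inf
    for p in points:
        dist = math.dist(p, query)  # Euclidean distance in R^d
        if dist < best_dist:
            best, best_dist = p, dist
    return best

pts = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
print(nearest_neighbor(pts, (0.9, 1.2)))  # -> (1.0, 1.0)
```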
We analyze the popular kd-tree data structure in the form of two variants based on a modification of the splitting method, which produces cells that satisfy the basic packing properties needed for efficiency without producing empty cells. We show that when data points are uniformly distributed on a k-dimensional hyperplane for k <= d, the expected number of leaves visited in such a kd-tree grows exponentially in k, but not in d. We show that the growth rate is smaller still if the hyperplane is aligned with the coordinate axes. We present empirical studies to support our theoretical results.
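To illustrate the kind of splitting modification discussed above, the sketch below implements one split step in the spirit of the sliding-midpoint rule, a well-known kd-tree splitting method that avoids empty cells by sliding the midpoint cut toward the data when one side would be empty. This is an assumed, simplified rule for illustration only, not necessarily the exact variants analyzed in the thesis.

```python
def sliding_midpoint_split(points, lo, hi):
    """One split step of a kd-tree cell [lo, hi] (given as corner tuples).

    Cut at the midpoint of the cell's longest side; if every point falls
    on one side, 'slide' the cut to the nearest point coordinate so that
    neither child cell is empty. Keeping cuts near the midpoint preserves
    the packing (fat-cell) properties needed for efficient queries.
    Hypothetical helper written for this sketch.
    """
    d = max(range(len(lo)), key=lambda i: hi[i] - lo[i])  # longest side
    cut = (lo[d] + hi[d]) / 2.0
    left = [p for p in points if p[d] <= cut]
    right = [p for p in points if p[d] > cut]
    if not left:    # all points right of the cut: slide up to the minimum
        cut = min(p[d] for p in points)
        left = [p for p in points if p[d] <= cut]
        right = [p for p in points if p[d] > cut]
    elif not right:  # all points left of the cut: slide down to the maximum
        cut = max(p[d] for p in points)
        left = [p for p in points if p[d] < cut]
        right = [p for p in points if p[d] >= cut]
    return d, cut, left, right

# Points clustered in the right half of the cell [0,10] x [0,1]:
# the midpoint cut at 5 would leave an empty left cell, so it slides to 7.
print(sliding_midpoint_split([(8, 0.5), (9, 0.2), (7, 0.9)], (0, 0), (10, 1)))
```

Recursing with this split on each child (shrinking the corresponding side of lo or hi to the cut value) builds the full tree; because no cell is ever empty, the number of leaves stays O(n).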