On the Efficiency of Nearest Neighbor Searching with Data Clustered in Lower Dimensions
On the Efficiency of Nearest Neighbor Searching with Data Clustered in Lower Dimensions
Loading...
Files
Publication or External Link
Date
2001-05-10
Authors
Maneewongvatana, Songrit
Mount, David M.
Advisor
Citation
DRUM DOI
Abstract
In nearest neighbor searching we are given a set of n data points in
real d-dimensional space, R^d, and the problem is to preprocess
these points into a data structure, so that given a query point, the
nearest data point to the query point can be reported efficiently.
Because data sets can be quite large, we are interested in data
structures that use optimal O(dn) storage. Given the limitation of
linear storage, the best known data structures suffer from expected-case
query times that grow exponentially in d. However, it is widely
regarded in practice that data sets in high dimensional spaces tend to
consist of clusters residing in much lower dimensional subspaces.
This raises the question of whether data structures for nearest neighbor
searching adapt to the presence of lower dimensional clustering, and
further how performance varies when the clusters are aligned with the
coordinate axes.
We analyze the popular kd-tree data structure in the form of two variants
based on a modification of the splitting method, which produces cells
satisfy the basic packing properties needed for efficiency without
producing empty cells. We show that when data points are uniformly
distributed on a k-dimensional hyperplane for k <= d, then expected
number of leaves visited in such a kd-tree grows exponentially in k,
but not in d. We show that the growth rate is even smaller still if
the hyperplane is aligned with the coordinate axes. We present empirical
studies to support our theoretical results.