On the Efficiency of Nearest Neighbor Searching with Data Clustered in Lower Dimensions
Publication or External Link
In nearest neighbor searching we are given a set of n data points in real d-dimensional space, R^d, and the problem is to preprocess these points into a data structure, so that given a query point, the nearest data point to the query point can be reported efficiently. Because data sets can be quite large, we are interested in data structures that use optimal O(dn) storage. Given the limitation of linear storage, the best known data structures suffer from expected-case query times that grow exponentially in d. However, it is widely regarded in practice that data sets in high dimensional spaces tend to consist of clusters residing in much lower dimensional subspaces. This raises the question of whether data structures for nearest neighbor searching adapt to the presence of lower dimensional clustering, and further how performance varies when the clusters are aligned with the coordinate axes.
We analyze the popular kd-tree data structure in the form of two variants
based on a modification of the splitting method, which produces cells
satisfy the basic packing properties needed for efficiency without producing empty cells. We show that when data points are uniformly
distributed on a k-dimensional hyperplane for k <= d, then expected number of leaves visited in such a kd-tree grows exponentially in k, but not in d. We show that the growth rate is even smaller still if the hyperplane is aligned with the coordinate axes. We present empirical studies to support our theoretical results.