\title{KDTREE 2: Fortran 95 and C++ software to efficiently search for near neighbors in a multi-dimensional Euclidean space}
\author{Matthew B. Kennel}
\email{mkennel@ucsd.edu}
\affiliation{Institute for Nonlinear Science, University of California, San Diego}

\begin{abstract}
Many data-based statistical algorithms require that one find \textit{near or nearest neighbors} to a given vector among a set of points in that vector space, usually with Euclidean topology. The k-d data structure and its search algorithms generalize classical binary search trees to higher-dimensional spaces, so that one may locate near neighbors to an example vector in $O(\log N)$ time instead of the brute-force $O(N)$ time, with $N$ being the size of the data base. KDTREE2 is a Fortran 95 module, and a parallel set of C++ classes, which implement tree construction and search routines to find either a set of $m$ nearest neighbors to an example, or all the neighbors within some Euclidean distance $r$. The two versions are independent and fully functional on their own. Considerable care has been taken in the implementation of the search methods, resulting in substantially higher computational efficiency (up to an order of magnitude faster) than the author's previous Internet-distributed version. Architectural improvements include rearrangement of the data for memory-cache-friendly performance, heap-based priority queues for large-$m$ searches, and more effective pruning of search paths by geometrical constraints to avoid wasted effort. The improvements are most potent in the more difficult and slowest cases: larger data bases, data sets lying on higher-dimensional manifolds, and larger numbers of neighbors to search for. The C++ implementation requires that the Standard Template Library as well as the BOOST C++ library be installed.
\end{abstract}

\section{Introduction}

Given a fixed data base of points on the real line, how would one efficiently find the closest point to some example $q$ (the query), assuming that the question will be asked for many different arbitrary values of $q$? The solution here is classical and obvious: sort the original data base, and perform a recursive binary bisection for each query, successively narrowing the range of the original data points which could be neighbors to the query point, until the true nearest neighbor has been found. On average it will take $O(\log N)$ bisections to locate the nearest neighbor, much less than the effort needed to \emph{exhaustively} search all $N$ points and remember the one with the smallest distance to $q$.
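To make the classical solution concrete, here is a minimal sketch of nearest-neighbor search on the line by bisection of a pre-sorted array. It is illustrative only and not part of the KDTREE2 package; the function name is hypothetical.

\begin{verbatim}
! Sketch only: index of the nearest neighbor of q
! in a pre-sorted array a(1:n).
function nearest_1d(a, n, q) result(best)
  integer, intent(in) :: n
  real, intent(in)    :: a(n), q
  integer :: best, lo, hi, mid
  lo = 1; hi = n
  do while (hi - lo > 1)       ! O(log n) bisections
     mid = (lo + hi) / 2
     if (q < a(mid)) then
        hi = mid
     else
        lo = mid
     end if
  end do
  ! Two candidates remain; keep the closer one.
  if (abs(q - a(lo)) <= abs(q - a(hi))) then
     best = lo
  else
     best = hi
  end if
end function nearest_1d
\end{verbatim}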
Although an algorithm this simple is not usually implemented this way, one may view the binary search as progressing down a tree of depth $c\,\log N$, with each node in the tree specifying an interval, namely the support of the points that it represents. Descendants of any node partition the interval represented by their parent, without overlap. Searching for nearest neighbors involves descending those nodes of the tree whose support intervals contain the query point.

The k-d tree is the natural generalization to a $k$-dimensional Euclidean space, instead of the 1-d line. Each node of the k-d tree now has an associated support hyper-rectangle (an outer product of $k$ intervals) instead of a simple 1-d interval. As before, each non-terminal node has two descendants, whose hyper-rectangles partition the parent's support space in two along a certain dimension, known as the \textit{cut dimension}, which is chosen in a data-dependent way by the tree-building algorithm. Similarly, each node is associated with a subset of the data base elements; this subset is guaranteed to lie within the node's associated hyper-rectangle, known as its \textit{bounding box}. The non-trivial algorithmic complication, compared to searching on a line, is that near-neighbor candidates may be contained not only in those nodes whose bounding boxes contain the query (even though one generally expects the best matches there), but also in a number of additional neighboring nodes. The trick is to minimize the number of these ``extra'' searches which must be performed.

K-d trees are clever, but not magic. The infamous ``curse of dimensionality'' still reigns, and will effectively thwart even good k-d tree methods when the underlying dimensionality of the point set (i.e., the dimension of a manifold containing the data base) is sufficiently high. Why? In high-dimensional spaces, contrary to our intuition trained in 2- or 3-dimensional space, the distance to even the nearest neighbor is almost as large as the distance to a random point. In practice this means that for sufficiently high dimension, the k-d tree search ends up examining many ``extra'' nodes and offers little improvement over brute-force exhaustive search. K-d trees are excellent for 3-dimensional data, offer significant benefits up to perhaps 10--15-dimensional data, but are useless for 100-dimensional data.

\section{Build and search algorithms}

K-d trees have been discussed a number of times in the literature; we refer the reader to \cite{Moore91} for a good tutorial and the references therein. I will discuss the specific implementation choices made in the KDTREE2 package. First, KDTREE2 implements \emph{only} the Euclidean metric, i.e., it finds near neighbors relative to the squared distance between points $x$ and $y$, $d^{2}(x,y)=\sum_{i=1}^{d}(x_{i}-y_{i})^{2}$. The most common alternative, periodic topology in some coordinates, can be simulated easily by converting each periodic component $\theta$ of the original space into two components in Euclidean space, $(A\cos\theta, A\sin\theta)$, with a scaling factor $A$ chosen as appropriate.
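For instance, an angle coordinate can be embedded before the tree is built; a minimal sketch, with a hypothetical helper that is not part of the package:

\begin{verbatim}
! Sketch only: embed a periodic coordinate theta (radians) as the
! two Euclidean coordinates (A*cos(theta), A*sin(theta)).  The
! scale A weights this coordinate relative to the others.
subroutine embed_periodic(theta, a, x, y)
  real, intent(in)  :: theta(:), a
  real, intent(out) :: x(size(theta)), y(size(theta))
  x = a * cos(theta)
  y = a * sin(theta)
end subroutine embed_periodic
\end{verbatim}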
\subsection{Building}

The software assumes a fixed, unchanging input data base of points, and builds a tree structure for the entire data set. Each tree node stores the range of points it represents (as viewed through a separately maintained index array), the identity and value of the dimension on which its children are split (the ``cut dimension'' and ``cut value''), pointers to the left and right child nodes, and the bounding box of all points in the node. The root node corresponds to all points, and its bounding box is computed explicitly by exhaustive enumeration of the maximum and minimum values of each coordinate.

Given a bounding box, the build algorithm splits on the dimension with the maximum extent, i.e., the largest difference between maximum and minimum values. The split location is defined (initially) as the arithmetic average of the maximum and minimum, and an efficient one-pass algorithm then moves the index entries appropriately for the two children, corresponding to points with coordinate less than (left child) or greater than (right child) the cut value.

During the top-down build process, only an approximate bounding box is used, for efficiency. Finding the maximum and minimum of the points along each dimension, especially when the data base is large, can be time consuming, requiring $O(N\cdot d)$ work for each level of the tree. The approximation is to recompute the bounding box only for the dimension that the parent split on, and to copy the parent's bounding box for all other dimensions. This creates bounding boxes which are overestimates of the true bounding boxes. When the number of points in a node is less than or equal to a user-specified constant known as the \emph{bucket size}, no children are attached, and the node becomes a terminal node of the k-d tree.

Once the points have been assigned to nodes, the exact bounding boxes are refined in an efficient bottom-up pass. The true bounding boxes are computed explicitly for the terminal nodes; once these exist, the exact bounding box of any internal node is computed rapidly as the union of its children's bounding boxes. As a result, using approximate bounding boxes during tree creation has little effect on search performance, but significantly reduces the time necessary to build the tree. Along the way, the cut value is refined to the arithmetic average of the maximum of the left child's box and the minimum of the right child's box.
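The following sketch illustrates one build step as described above: pick the cut dimension as the bounding-box side of largest extent, cut at the midpoint, and partition the index array in one pass. Names and details are illustrative, not the KDTREE2 internals.

\begin{verbatim}
! Sketch only.  The node owns ind(lo:hi); on return the left child
! owns ind(lo:m) and the right child owns ind(m+1:hi).
subroutine split_node(pts, ind, lo, hi, bmin, bmax, cutdim, cutval, m)
  real, intent(in)       :: pts(:,:)          ! pts(1:d, 1:N)
  integer, intent(inout) :: ind(:)            ! index array, permuted
  integer, intent(in)    :: lo, hi
  real, intent(in)       :: bmin(:), bmax(:)  ! node's bounding box
  integer, intent(out)   :: cutdim, m
  real, intent(out)      :: cutval
  integer :: i, j, t
  cutdim = maxloc(bmax - bmin, dim=1)     ! largest extent
  cutval = 0.5 * (bmax(cutdim) + bmin(cutdim))
  i = lo; j = hi
  do while (i <= j)                       ! one-pass partition
     if (pts(cutdim, ind(i)) <= cutval) then
        i = i + 1
     else
        t = ind(i); ind(i) = ind(j); ind(j) = t
        j = j - 1
     end if
  end do
  m = j
end subroutine split_node
\end{verbatim}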
Once the complete tree has been created, the build procedure can optionally create a new internal copy of the data set, permuting its order so that the data for points in the same terminal node lie contiguously in memory, and points in topologically nearby nodes lie near one another. This is recommended for performance when the data base is too large for the main CPU cache; in that circumstance, the rearrangement approximately doubles performance.

\subsection{Searching}

The search algorithm is simple, but implemented carefully for efficiency. There are two search modes: the fixed-neighbor-number search (find the closest $m$ points to the query vector $q$) and the fixed-radius search (find all points $x_{i}$ with $d^{2}(x_{i},q)\leq r^{2}$). The fixed-radius search is the simpler to describe. At each non-terminal node, the query vector is compared to the cut value in the cut dimension, which designates one child node as ``closer'' and the other as ``farther''. The tree is descended recursively, depth-first, searching all closer nodes unconditionally. At the terminal nodes (buckets), the distances from the points in the terminal node to the query vector are computed explicitly, and any point with squared distance less than $r^{2}$ is added to the list of results. As the terminal-node search is a significant performance bottleneck, it was implemented carefully to avoid needless computation and cache misses, and to minimize memory bandwidth.

The farther nodes are also searched, but only if the bounding box of the farther node intersects the sphere of radius $r$ centered at the query vector (all comparisons are carried out with squared distances). Again for efficiency, this test is implemented in two stages. First, the perpendicular distance from the query vector to the cut plane is computed, as the cut value is available at the node itself; if its square exceeds $r^{2}$, the farther node is excluded immediately. If not, the squared distance from the query vector to the closest point of the farther node's bounding box is computed; if this is less than $r^{2}$, the farther node is searched, and otherwise it is excluded as well. In principle the perpendicular test is unnecessary: whenever it rejects the farther node, the bounding-box test would too. But the bounding-box test takes more computation, and it proved empirically worthwhile to obviate some of it with the cheap perpendicular test.
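A sketch of the two-stage pruning test follows, with squared quantities throughout; the function is illustrative rather than the package's actual internals.

\begin{verbatim}
! Sketch only: should the "farther" child be searched?
logical function search_farther(qv, cutdim, cutval, bmin, bmax, r2)
  real, intent(in)    :: qv(:)             ! query vector
  integer, intent(in) :: cutdim
  real, intent(in)    :: cutval            ! cut plane location
  real, intent(in)    :: bmin(:), bmax(:)  ! farther child's box
  real, intent(in)    :: r2                ! squared search radius
  real    :: box2, t
  integer :: k
  ! Stage 1: cheap perpendicular test against the cut plane.
  if ((qv(cutdim) - cutval)**2 > r2) then
     search_farther = .false.
     return
  end if
  ! Stage 2: squared distance from qv to the nearest point of the
  ! bounding box [bmin, bmax].
  box2 = 0.0
  do k = 1, size(qv)
     t = max(bmin(k) - qv(k), qv(k) - bmax(k), 0.0)
     box2 = box2 + t*t
  end do
  search_farther = (box2 <= r2)
end function search_farther
\end{verbatim}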
Searching for the $m$ nearest neighbors is similar, except that the $r^{2}$ value shrinks dynamically during the search. A list of up to $m$ candidate neighbors is maintained during the search procedure described above, and the distance to the $m$th neighbor (i.e., the largest distance on the current list) serves as the search ball's squared radius $r^{2}$. In the terminal-node search, any point with squared distance less than this value must replace a point on the working list, and the new $r^{2}$ is recomputed as the new maximum on the list. In the present implementation this is done with a priority queue based on a binary heap, allowing replacement of the maximum value and retrieval of the new maximum in $O(\log m)$ time instead of the $O(m)$ time needed to maintain an always-sorted list. This efficiency gain is important for larger-$m$ searches. At the beginning of the search, $r^{2}$ is set to $+\infty$, and any point found is added to the list until it holds $m$ candidates.

By default, points are returned in arbitrary order for both searches, except that for the fixed-$m$ search the point with the \emph{largest} distance comes first on the list, a consequence of the priority-queue implementation. A subroutine is provided to sort results in ascending order of distance; it can be called manually by the user or, if so specified at tree-creation time, will be applied automatically to all searches on that tree.

In many geometrical time-series algorithms, for example the false-nearest-neighbors method for testing embeddings, the points are ordered in time by their index. One often wants to find neighbors close to an existing point of the data set, but to exclude the reference point itself (which provides the query vector) along with a window of points close to it in time, often known as the ``decorrelation window'' $W$. The software offers search routines for this task: the query vector is set to the value of the point at the reference index $i$, and in the terminal-node search any candidate neighbor $x_{j}$ is excluded from the results if $|i-j|<W$.
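The heap operation at the core of the fixed-$m$ update can be sketched as follows: the $m$ candidates are kept in a binary max-heap keyed on squared distance, so the current $r^{2}$ sits at the root; a closer point replaces the root and is sifted down in $O(\log m)$ steps. This is illustrative, not the package's actual code.

\begin{verbatim}
! Sketch only: replace the worst candidate (the root of a 1-based
! binary max-heap) and restore the heap property by sifting down.
! Assumes newdis < dis(1); afterwards dis(1) is the new r^2.
subroutine heap_replace_root(dis, idx, m, newdis, newidx)
  integer, intent(in)    :: m, newidx
  real, intent(inout)    :: dis(m)     ! squared distances (heap)
  integer, intent(inout) :: idx(m)     ! point indices, in lockstep
  real, intent(in)       :: newdis
  integer :: i, c, ti
  real    :: td
  dis(1) = newdis; idx(1) = newidx
  i = 1
  do
     c = 2*i                              ! left child
     if (c > m) exit
     if (c < m) then
        if (dis(c+1) > dis(c)) c = c + 1  ! take the larger child
     end if
     if (dis(c) <= dis(i)) exit           ! heap property holds
     td = dis(i); dis(i) = dis(c); dis(c) = td
     ti = idx(i); idx(i) = idx(c); idx(c) = ti
     i = c
  end do
end subroutine heap_replace_root
\end{verbatim}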
\section{Interface and examples}

The codes also offer the option of defining distances along fewer dimensions than exist in the input data matrix, i.e., projecting the input $d$-dimensional vectors onto their first $d'<d$ coordinates.

\subsection{Fortran 95}

The Fortran code is in \texttt{kdtree2.f90}, in module \texttt{kdtree2\_module}. The interfaces ought to be self-explanatory:

\begin{verbatim}
function kdtree2_create(input_data,dim,sort,rearrange) Result (mr)
  ! create a tree from input_data(1:d,1:N)
  ! if PRESENT(dim), use this as dimensionality.
  ! if (sort .eqv. .true.) then sort all search results
  ! by increasing distance.
  real, target :: input_data(:,:)
  integer, intent(in), optional :: dim
  logical, intent(in), optional :: sort
  logical, intent(in), optional :: rearrange
  !
  Type (kdtree2), Pointer :: mr  ! the master record for the tree.

subroutine kdtree2_destroy(tp)
  ! Deallocates all memory for the tree, except input data matrix
  Type (kdtree2), Pointer :: tp

subroutine kdtree2_n_nearest(tp,qv,nn,results)
\end{verbatim}
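A minimal usage sketch of this interface follows. The signatures of \texttt{kdtree2\_create}, \texttt{kdtree2\_n\_nearest}, and \texttt{kdtree2\_destroy} are as listed above; the \texttt{kdtree2\_result} type with fields \texttt{idx} and \texttt{dis}, and the default real precision, are assumptions to be checked against \texttt{kdtree2.f90}.

\begin{verbatim}
program kdtree2_example
  use kdtree2_module
  implicit none
  integer, parameter :: d = 3, n = 10000, nn = 5
  real, target :: my_data(d, n)          ! points, one per column
  real         :: qv(d)                  ! query vector
  type(kdtree2), pointer :: tree
  type(kdtree2_result)   :: results(nn)  ! assumed result type
  integer :: j

  call random_number(my_data)            ! sample data
  call random_number(qv)

  ! Build; sort results by distance, rearrange data for cache.
  tree => kdtree2_create(my_data, sort=.true., rearrange=.true.)

  call kdtree2_n_nearest(tree, qv, nn, results)
  do j = 1, nn
     print *, results(j)%idx, results(j)%dis  ! assumed field names
  end do

  call kdtree2_destroy(tree)
end program kdtree2_example
\end{verbatim}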