CN112308122B - High-dimensional vector space sample rapid searching method and device based on double trees

Publication number: CN112308122B (application CN202011127725.9A; earlier publication CN112308122A)
Authority: CN (China)
Inventor: 徐国天
Assignee: China Criminal Police University
Legal status: Active (granted)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 - Classification techniques relating to the classification model, based on distances to training or reference patterns
    • G06F18/24147 - Distances to closest patterns, e.g. nearest neighbour classification
    • G06F18/243 - Classification techniques relating to the number of classes
    • G06F18/24323 - Tree-organised classifiers
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

A dual-tree based method and device for fast searching of samples in a high-dimensional vector space. A very small number of data points are filtered out of the original data point set to form a pruning point set, and the remaining data points form a deleted point set; the pruning point set preserves the distribution form of the original data point set in the multi-dimensional space as far as possible, so that the nearest neighbor of a point to be checked within the pruning point set lies very close to its K nearest neighbors in the complete set. A pruning tree is built from the pruning point set and a deleted tree from the deleted point set. Because the pruning point set contains few data points, the nearest neighbor can be located rapidly in the pruning tree; that neighbor is then used as the initial nearest neighbor for the K-nearest-neighbor search in the deleted tree and the complete tree. Since the initial neighbors are not fixed but lie near the point to be checked, the pruning radius is effectively reduced, the numbers of spatial-distance computations and comparisons fall, the pruning effect improves, and overall query efficiency rises.

Description

High-dimensional vector space sample rapid searching method and device based on double trees
Technical Field
The invention relates to the technical field of sample processing, in particular to a high-dimensional vector space sample rapid searching method and device based on double trees.
Background
The K-nearest-neighbor (K Nearest Neighbor, KNN) algorithm works as follows: given a point to be checked, the K sample points closest to it are found in a multidimensional vector-space sample point set, and if most of those K sample points belong to a certain class, the point to be checked is assigned to that class as well. K-nearest-neighbor search is widely applied in data analysis, for example in machine learning, geographic information systems, and artificial intelligence. When facing high-dimensional, massive sample sets, the spatial distance must be computed and compared against a large number of sample points, so the computation is heavy and the query slow, which limits practical application of the KNN algorithm.
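For orientation, the baseline cost is easy to make concrete. The short Python sketch below (illustrative only; the names are our own) scans every sample point per query, which is exactly the behavior the approaches discussed next try to avoid:

import heapq

def knn_brute_force(points, query, k):
    # Return the k sample points nearest to `query`. Every sample point is
    # examined, so each query costs one distance computation per point.
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # nsmallest keeps the k closest points without fully sorting the set.
    return heapq.nsmallest(k, points, key=lambda p: sq_dist(p, query))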
Aiming at the heavy computation and low query efficiency of the KNN algorithm, current improvements follow three main directions:
First, pruning the sample set. For example, the prior art analyzes the sample set and removes redundant elements from it, shrinking the sample set and thus the computation in the K-nearest-neighbor search stage. When facing high-dimensional, massive sample sets, however, the achievable reduction in scale is limited, and the computation remains large.
Second, adopting a parallel computing architecture to raise K-nearest-neighbor query speed. The prior art proposes a parallel KNN algorithm based on the CUDA model: general matrix multiplication is used to accelerate distance computation, and distances are compared with one of two methods chosen according to the value of K. This parallel computing method effectively improves K-nearest-neighbor query efficiency.
Third, adopting space-partitioning data structures for fast K-nearest-neighbor search. The sample point set is partitioned according to a specific geometric form and a binary search tree is constructed; in the K-nearest-neighbor search stage the tree is pruned by comparing spatial distances, reducing the number of sample points whose distances must be computed and compared, and thereby speeding up the search. Common space-partitioning data structures include the Metric-tree, Ball-tree, Cover-tree, VP-tree, MVP-tree, R-tree, and KD-tree.
These space-partitioning data structures differ mainly in three aspects: first, the partition geometry; second, the space-division strategy used to build the search tree; third, the strategy for searching with the data structure. The Ball-tree partitions the data set with hyperspheres, has clear advantages for K-nearest-neighbor queries, and is widely applied. However, the initial K neighbors of a Ball-tree search are fixed: they are the K leftmost leaf nodes of the Ball-tree. If the point to be checked is spatially close to these initial K neighbors, the pruning effect is good, many data points are pruned and filtered away, and the K-nearest-neighbor query is fast. When the point to be checked is far from the initial K neighbors, however, the pruning radius becomes too large, the pruning effect is poor, only a few data points are filtered out, and the query slows down; this defect is especially obvious for high-dimensional, massive sample sets.
Disclosure of Invention
Therefore, the invention provides a dual-tree based high-dimensional vector space sample fast searching method and device, which preserve the distribution form of the original data point set in the high-dimensional space to the greatest extent, locate the nearest neighbor rapidly, effectively reduce the pruning radius, improve the pruning effect, and raise K-nearest-neighbor query efficiency.
In order to achieve the above object, in a first aspect, the present invention provides a dual-tree based high-dimensional vector space sample fast searching method, including:
Step 1: during construction of the dual-tree structure, collect statistics on every subtree and store the results in a node array. Each element of the node array holds the statistics of one subtree: the number of data points in the subtree, the hypersphere radius corresponding to the subtree, and the subtree's data point set. The array elements are sorted in descending order of the number of data points; elements with equal point counts are sorted in ascending order of hypersphere radius.
Step 2: filter every subtree whose number of data points is greater than or equal to the maximum pruning point count: keep the data point closest to the subtree's centroid, and filter the remaining data points into the deleted point set. After all subtrees are processed, build the deleted tree from the deleted point set according to the dual-tree construction algorithm; the elements remaining after removing the deleted point set from the full sample point set form the pruning point set, from which the pruning tree is built according to the dual-tree construction algorithm.
Step 3: first invoke the K-nearest-neighbor algorithm on the pruning tree to find Kg neighbors of the point to be checked, where 1 ≤ Kg ≤ K; then, using the 1st and the Kg-th of these neighbors in turn as initial neighbors of the deleted tree, search for the K neighbors of the point to be checked in the deleted tree; when fewer than K neighbors are found, set the nearest-neighbor array to empty and search for the K neighbors of the point to be checked in the complete tree.
Step 4: divide the original data set into a test set and a training set at a given ratio, generate several groups of test and training sets with a random algorithm, and compute the average of the optimal query parameters obtained on each group as the final optimal parameter value Kg, letting a given number of points to be checked find their K nearest neighbors in the pruning tree and the deleted tree.
As a preferred scheme of the dual-tree based high-dimensional vector space sample fast searching method, in step 2 the pruning tree is denoted reduce_tree and the deleted tree pruned_tree;
the input parameters are: the data point set data to be pruned; the node array node_density storing the statistics of every subtree, whose elements are sorted in descending order of the number of data points in each subtree, elements with equal point_num values being sorted in ascending order of the subtree's hypersphere radius; and the integer variable max_points, the maximum pruning point count;
the output parameters are: the deleted tree pruned_tree built from the pruned data points, and the pruning tree reduce_tree built from the remaining data points;
the construction of the pruning tree reduce_tree and the deleted tree pruned_tree comprises:
step 2.1, let the deleted-tree array store the deleted data points, with initial value empty; let the pruning-tree array store the remaining data points, with initial value empty;
step 2.2, form the set T = {T1, T2, …, Tn} from all node-array elements whose number of data points is greater than or equal to the maximum pruning point count; classify the subtrees by their number of data points; define a real variable avg_max_radius for the average of the maximum radii of the subtrees pruned so far, initialized to T1.radius; define an integer variable current_point_num for the number of data points of the subtree class currently processed, initialized to T1.point_num; define a real variable previous_radius for the hypersphere radius of the previous subtree, initialized to T1.radius; define an integer variable i, initialized to 2;
step 2.3, fetch element Ti from the set T; if Ti.point_num != current_point_num, i.e. the current subtree class has been fully processed, execute step 2.6, otherwise execute step 2.4;
step 2.4, if Ti.radius − previous_radius ≤ 2 or Ti.radius < avg_max_radius, execute step 2.5; otherwise let i = i + 1, and if i ≤ len(T) execute step 2.3, skipping the remaining subtrees of this class in turn, otherwise execute step 2.7;
step 2.5, let center be the centroid of the data point set Ti.points and min_dist_point the point of Ti.points closest to center; let pruned_tree = pruned_tree ∪ (Ti.points − min_dist_point), i.e. merge all data points of Ti.points except min_dist_point into the pruned_tree array; let previous_radius = Ti.radius and update the avg_max_radius value; let i = i + 1, and if i ≤ len(T) execute step 2.3, otherwise execute step 2.7;
step 2.6, let current_point_num = Ti.point_num, updating the point count of the subtree class currently processed, then execute step 2.5;
step 2.7, let reduce_tree = data − pruned_tree; steps 2.1 to 2.6 thus construct the deleted tree from all pruned data points and the pruning tree from all remaining data points.
As a preferred scheme of the dual-tree based high-dimensional vector space sample fast searching method, step 3 includes:
step 3.1, call the single-tree K-nearest-neighbor search algorithm to determine the Kg neighbors of the target to be checked in the pruning tree, and store the Kg neighbors found in the reduce_tree.KNN_result array;
step 3.2, preset the nearest neighbor found in the pruning tree into the K-neighbor array of the deleted tree and determine the K neighbors of the target to be checked in the deleted tree; when the number num of neighbors located is greater than or equal to K, the search ends, otherwise execute step 3.3;
step 3.3, preset the Kg-th neighbor found in the pruning tree into the K-neighbor array of the deleted tree and determine the K neighbors of the target to be checked in the deleted tree again; when num ≥ K the search ends, otherwise execute step 3.4;
step 3.4, preset the K-neighbor array of the complete tree to empty and search for the K neighbors in the complete tree with the single-tree K-nearest-neighbor search algorithm; the search then ends.
As a preferred scheme of the dual-tree based high-dimensional vector space sample fast searching method, in step 4 the input parameters are defined as: the data point set train_data for training; the data point set test_data for testing; the node array node_density storing the statistics of every subtree; the number K of nearest neighbors to find; the upper limit max_len of the pruning count;
the output parameters are: the optimal pruning count max_points; the optimal pruning-tree neighbor count Kg; the starting neighbor begin_point; the ending neighbor end_point;
the step of determining the optimal query parameters includes:
step 4.1, build a Ball-tree structure from the train_data set; define real variables T1 and T2 for time statistics, both with initial value 0; define an integer variable Len with initial value 2;
step 4.2, build the pruning tree reduce_tree and the deleted tree from the train_data set with the pruning count set to Len; define an integer variable i with initial value 1;
step 4.3, if i ≤ len(test_data) execute step 4.4; otherwise let Len = Len + 1, and if Len < max_len execute step 4.2, otherwise execute step 4.8 to compute the optimal parameters;
step 4.4, define an integer variable j with initial value 2; if j ≤ K execute step 4.5, otherwise let i = i + 1 and execute step 4.3;
step 4.5, query the j neighbors of test_data[i] in the pruning tree; T1 records the time spent on this query;
step 4.6, define an integer variable n with initial value 1; if n ≤ j − 1 execute step 4.7, otherwise let j = j + 1, and if j ≤ K execute step 4.5, otherwise let i = i + 1 and execute step 4.3;
step 4.7, let begin_point = n and end_point = j; query the K neighbors of test_data[i] in the dual-tree structure, with the number of neighbors queried in the pruning tree Kg = j and the two initial neighbors of the deleted tree reduce_tree.KNN_result[j−1] and reduce_tree.KNN_result[n−1]; T2 records the time spent on this query. Accumulate the total query time consumed under each query parameter combination in a result array whose element structure is [max_points, K_g, begin_point, end_point, T], the first four items forming the primary key: if no corresponding record exists in the result array, add a new record, otherwise add the value T1 + T2 to the corresponding record. Let n = n + 1; if n ≤ j − 1 execute step 4.7, otherwise let j = j + 1, and if j ≤ K execute step 4.5, otherwise let i = i + 1 and execute step 4.3;
step 4.8, sort the result array elements in ascending order of the time statistics; after sorting, the field values of element result[0] are the optimal parameter values.
As a preferred scheme of the fast search method for the high-dimensional vector space samples based on the double trees, the method is used for sample searching in the processes of information retrieval, text classification, pattern recognition, data mining and image processing.
In a second aspect, a dual-tree based high-dimensional vector space sample fast searching device is provided, employing the dual-tree based high-dimensional vector space sample fast searching method of the first aspect or any implementation thereof, and including:
an initial statistics module, configured to collect statistics on every subtree during construction of the dual-tree structure and store the results in a node array, where each element of the node array holds the statistics of one subtree: the number of data points in the subtree, the hypersphere radius corresponding to the subtree, and the subtree's data point set; the array elements are sorted in descending order of the number of data points in each subtree, and elements with equal point counts are sorted in ascending order of hypersphere radius;
a deleted-tree and pruning-tree construction module, configured to filter every subtree whose number of data points is greater than or equal to the maximum pruning point count, keeping the data point closest to the subtree's centroid and filtering the remaining data points into the deleted point set; after all subtrees are processed, the data points of the deleted point set form the deleted tree according to the dual-tree construction algorithm, and the elements remaining after removing the deleted point set from the full sample point set form the pruning point set, whose data points form the pruning tree according to the dual-tree construction algorithm;
a recursive search module, configured to first call the K-nearest-neighbor algorithm on the pruning tree to find Kg neighbors of the point to be checked, where 1 ≤ Kg ≤ K, then, using the 1st and the Kg-th of these neighbors in turn as initial neighbors of the deleted tree, search for the K neighbors of the point to be checked in the deleted tree, and, when fewer than K neighbors are found, set the nearest-neighbor array to empty and search for the K neighbors of the point to be checked in the complete tree;
an optimal query parameter determining module, configured to divide the original data set into a test set and a training set at a given ratio, generate several groups of test and training sets with a random algorithm, and compute the average of the optimal query parameters obtained on each group as the final optimal parameter value Kg, letting a given number of points to be checked find their K nearest neighbors in the pruning tree and the deleted tree.
In a third aspect, a computer readable storage medium is provided, in which program code for dual tree based high dimensional vector space sample fast searching is stored, the program code comprising instructions for performing the dual tree based high dimensional vector space sample fast searching method of the first aspect or any implementation thereof.
In a fourth aspect, an electronic device is provided, the electronic device comprising a processor coupled to a storage medium; when the processor executes the instructions in the storage medium, the electronic device performs the dual-tree based high-dimensional vector space sample fast searching method of the first aspect or any implementation thereof.
According to the invention, a small number of data points are filtered out of the original data point set to form the pruning point set, and the remaining filtered data points form the deleted point set; the data points of the pruning point set preserve the distribution form of the original data point set in the multidimensional space to the greatest extent, so that the nearest neighbor of the point to be checked within the pruning point set is very close to its K nearest neighbors in the complete set. The pruning tree is formed from the pruning point set and the deleted tree from the deleted point set. Because the number of data points in the pruning point set is small, the nearest neighbor can be located rapidly in the pruning tree; the nearest neighbor is then used as the initial nearest neighbor, and the K nearest neighbors are searched in the deleted tree and the complete tree. The positions of the initial neighbors are not fixed but lie near the point to be checked, which effectively reduces the pruning radius; with the shortened pruning distance more subtrees are pruned and filtered away, the numbers of spatial-distance computations and comparisons fall, the pruning effect improves, and overall query efficiency rises.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. It will be apparent to those of ordinary skill in the art that the following drawings are exemplary only, and that other implementations can be obtained from them without inventive effort.
FIG. 1 is a schematic diagram illustrating a Node splitting process in a Ball-tree in the prior art;
FIG. 2 is a schematic diagram of a Ball-tree structure formed by 44 sample points in a two-dimensional space in the prior art;
FIG. 3 is a schematic diagram of the 5-nearest-neighbor search process for a target to be checked in a two-dimensional space under ideal conditions;
FIG. 4 is a schematic diagram of the 5-nearest-neighbor search process for a target to be checked in a two-dimensional space under non-ideal conditions;
FIG. 5 is a flowchart of the dual-tree based high-dimensional vector space sample fast searching method provided in embodiment 1 of the present invention;
FIG. 6 is a diagram showing how adjusting the value of the pruning count max_points affects the pruning result;
FIG. 7 is the reduce_tree structure constructed from the sample points of FIG. 2 according to algorithm 2;
FIG. 8 is the pruned_tree structure constructed according to algorithm 2;
FIG. 9 is a schematic diagram of a partial pruning process according to algorithm 2;
FIG. 10 is a diagram of an example process of K-nearest neighbor search based on dual trees;
FIG. 11 is a diagram illustrating the statistics of optimal query parameters;
FIG. 12 is a graph showing neighbor distance statistics for 400 query points under the action of the optimal and worst parameters;
FIG. 13 is a schematic diagram of the dual-tree based high-dimensional vector space sample fast searching device.
Detailed Description
Other advantages and effects of the present invention will become readily apparent to those skilled in the art from the following description, which illustrates the invention by way of certain specific embodiments rather than all embodiments. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
The following knowledge is known to those skilled in the art:
The Ball-tree structure was proposed by Moore. A Ball-tree is constructed as follows: under the initial condition, all sample points in the multidimensional vector space form a minimal hypersphere (a minimal circle in two-dimensional space), and this hypersphere is the root node of the Ball-tree. All sample points are then divided into two parts according to a fixed principle, constructing two minimal sub-hyperspheres that become the root nodes of the left and right subtrees of the root node. This splitting process continues until each sub-hypersphere contains only 1 sample point or has an extremely small radius, at which point splitting stops and the Ball-tree is complete. The weighted Euclidean distance is typically used as the distance between sample points in the multidimensional vector space.
Each node in the Ball-tree comprises five important attributes: the centroid pivot (the center of the sub-hypersphere); the radius, satisfying formula (1); the sample point set points (all sample points within the sub-hypersphere); and Child1 and Child2, the left and right subtree root nodes, which for a non-leaf node satisfy formulas (2) and (3):

Node.radius = max{|x − Node.pivot| : x ∈ Node.points}   (1)

Node.Child1.points ∩ Node.Child2.points = ∅   (2)

Node.Child1.points ∪ Node.Child2.points = Node.points   (3)
Each Node in the Ball-tree splits according to the following principle: first compute the sample point point1 farthest from the centroid inside the sub-hypersphere, then compute the sample point point2 farthest from point1; all sample points of the set points are then divided into two subsets Child1 and Child2, sample points closer to point1 going into the Child1 set and sample points closer to point2 going into the Child2 set. The node splitting principle is described by formula (4):

x ∈ Node.Child1.points if |x − point1| ≤ |x − point2|, otherwise x ∈ Node.Child2.points   (4)
Referring to FIG. 1, for the splitting of node Node1, the 7 sample points A to G in the two-dimensional space form a minimal circle, the Node1 node of the Ball-tree. Point A is farthest from the centroid and point B is farthest from point A; the hyperplane between points A and B divides all 7 sample points into the two subsets Child1 and Child2, which form two minimal sub-circles, namely the left and right subtree root nodes of the Node1 node.
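The construction and split just described can be sketched compactly. The Python below is a minimal illustration under names of our own choosing (Node, build_ball_tree, dist, centroid); it is not the patent's code, and it uses the centroid-farthest-point split of formula (4) with the plain (unweighted) Euclidean distance:

import math

def dist(a, b):
    # Euclidean distance between two points given as tuples of coordinates.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def centroid(points):
    # Coordinate-wise mean of a non-empty list of points.
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

class Node:
    def __init__(self, points):
        self.points = points                    # all sample points in the hypersphere
        self.pivot = centroid(points)           # centroid, per formula (1)
        self.radius = max(dist(p, self.pivot) for p in points)
        self.child1 = None
        self.child2 = None

def build_ball_tree(points):
    node = Node(points)
    if len(points) <= 1:
        return node                             # leaf: a single sample point
    point1 = max(points, key=lambda p: dist(p, node.pivot))  # farthest from centroid
    point2 = max(points, key=lambda p: dist(p, point1))      # farthest from point1
    child1 = [p for p in points if dist(p, point1) <= dist(p, point2)]
    child2 = [p for p in points if dist(p, point1) > dist(p, point2)]
    if not child1 or not child2:
        return node                             # degenerate split (duplicate points)
    node.child1 = build_ball_tree(child1)
    node.child2 = build_ball_tree(child2)
    return node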
In the K-nearest-neighbor search algorithm, the spatial distance between the point to be checked p in the multidimensional space and any sample point x in a Node satisfies formulas (5) and (6). From the triangle inequality we obtain:

|x − p| ≥ |p − Node.pivot| − |x − Node.pivot|

Given x ∈ Node.points, |x − Node.pivot| ≤ Node.radius is satisfied; formula (5) is therefore proved, and formula (6) is proved in the same way.

|x − p| ≥ |p − Node.pivot| − Node.radius   (5)

|x − p| ≤ |p − Node.pivot| + Node.radius   (6)
Root in formula (7) is the root node of the Ball-tree structure, and dist_min(p, Node) is the smallest possible spatial distance between the point to be checked p and node Node:

dist_min(p, Node) = max(|p − Node.pivot| − Node.radius, dist_min(p, Node.parent)),
dist_min(p, Root) = max(|p − Root.pivot| − Root.radius, 0)   (7)

The formula is derived from the boundary given by formula (5) and from the fact that all sample points in a child node must be covered by its parent node. This property means that the dist_min of a child can never be less than the minimum distance of its ancestors. FIG. 1 shows how dist_min(p, Node1) is determined by this comparison and taken as the minimum distance value from p to the children of Node1.
In the query stage, formula (7) enables fast pruning of subtrees and speeds up the retrieval process. As shown in FIG. 1, the nearest neighbor of the point to be checked p is H. Since dist_min(p, Node1) > |p − H|, all 7 sample points in the Node1 node are pruned and filtered: there is no need to compute their spatial distances to p one by one or to compare them with |p − H|, which accelerates retrieval. Since dist_min(p, Node2) < |p − H|, the spatial distances from p to the 3 sample points in the Node2 node are computed one by one and compared with |p − H|; after three comparisons, sample point M is determined to be the nearest neighbor of the point to be checked p.
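The bound of formula (7) reduces to a few lines of code. A minimal sketch, reusing dist and the Node fields from the build sketch above (parent_bound, our own name, stands for the ancestor's dist_min):

def min_possible_dist(p, node, parent_bound=0.0):
    # Lower bound on |x - p| for any x in node.points, per formulas (5) and (7):
    # a child's bound is tightened by its parent's, because every child
    # hypersphere is covered by its parent.
    return max(dist(p, node.pivot) - node.radius, parent_bound, 0.0)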
K-nearest-neighbor recursive search algorithm 1 based on the Ball-tree structure: search_KNN(KNN_result, node, target, K):
Input parameters: the current nearest-neighbor array KNN_result, which is empty on the first call; the node currently being processed, which is the Ball-tree root node on the first call; the target to be checked; the number K of nearest neighbors to find.
Output parameters: the K nearest data points of the target to be checked.
T1. If the current node is empty, execute T7, otherwise execute T2.
T2. If the current node is a leaf node, execute T3, otherwise execute T4.
T3. Let the n data points of the node form the set S = {S1, S2, …, Sn}, where Si is the i-th data point of the node. Take each data point Si out of S in turn and compute the spatial distance between the target to be checked and Si. Let the integer variable len be the number of elements in the KNN_result array. If len < K, append the tuple (Si, distance) to the end of the KNN_result array and sort the array in descending order of each tuple's distance value. If len ≥ K, compare distance with KNN_result[0][1], the distance value of the first tuple of the array: if distance < KNN_result[0][1], update the array by deleting the first tuple, adding the tuple (Si, distance), and sorting in descending order of distance; if distance ≥ KNN_result[0][1], do not update the KNN_result array.
T4. Let radius be the hypersphere radius of the node currently processed, center its centroid, and distance_c the spatial distance between center and the target to be checked. If distance_c > radius + KNN_result[0][1], no K-nearest-neighbor data point of the target can exist in the current node, so prune and filter the whole subtree of the node and execute T7. If distance_c ≤ radius + KNN_result[0][1], execute T5: K-nearest-neighbor data points of the target may exist in the current node, and its left and right subtrees must be processed separately.
T5. Let node.Child1 be the root node of the left subtree of the current node; recursively call the K-nearest-neighbor search algorithm search_KNN(KNN_result, node.Child1, target, K) to process the left subtree.
T6. Let node.Child2 be the root node of the right subtree of the current node; recursively call the K-nearest-neighbor search algorithm search_KNN(KNN_result, node.Child2, target, K) to process the right subtree.
T7. The algorithm ends, returning the current KNN_result.
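Steps T1 to T7 map directly onto a recursive function. The Python sketch below is one possible reading, reusing the Node fields of the build sketch; note that, as written in T4, the pruning test fires as soon as KNN_result is non-empty, which is what later allows a seeded search to return fewer than K points:

def search_knn(knn_result, node, target, k):
    # knn_result holds (point, distance) tuples sorted in descending distance,
    # so knn_result[0][1] is the worst current candidate and the pruning threshold.
    if node is None:                                        # T1
        return knn_result
    if node.child1 is None and node.child2 is None:         # T2 -> T3: leaf node
        for s in node.points:
            d = dist(target, s)
            if len(knn_result) < k:
                knn_result.append((s, d))
            elif d < knn_result[0][1]:
                knn_result[0] = (s, d)                      # replace the worst tuple
            else:
                continue
            knn_result.sort(key=lambda t: t[1], reverse=True)
        return knn_result
    # T4: prune the whole subtree when it cannot hold a better candidate.
    if knn_result and dist(target, node.pivot) > node.radius + knn_result[0][1]:
        return knn_result
    search_knn(knn_result, node.child1, target, k)          # T5
    search_knn(knn_result, node.child2, target, k)          # T6
    return knn_result                                       # T7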
Under K-nearest-neighbor recursive search algorithm 1, wherever the point to be checked lies in the multidimensional space, the initial value of the KNN_result array is always the K leftmost leaf nodes of the Ball-tree structure. When the point to be checked is close to these leaves, the pruning effect is good and query efficiency high; otherwise the pruning effect is poor and query efficiency low.
FIG. 2 shows a Ball-tree structure formed by 44 sample points in a two-dimensional space; the coordinate position of the target to be checked in the two-dimensional space is [80,400], and FIG. 3 shows the 5-nearest-neighbor search process for that target. In the initial condition, the 5 leftmost leaf nodes of the Ball-tree (numbers 20, 8, 39, 41, 44) fill the KNN_result array, whose elements are sorted in descending order of distance value. The algorithm then computes, from left to right, the spatial distance between each leaf node and the target one by one, compares it with the distance value of the first element of the KNN_result array, and updates the array elements dynamically so that the current 5 nearest neighbors are always stored. Because the target to be checked is spatially close to the initial 5 neighbors, the pruning effect is good, and the entire right branch subtree of the Ball-tree, containing 33 sample points, is pruned during the query. The whole query compares the distance values of only 11 sample points against the target to finish the 5-nearest-neighbor search, so query efficiency is high.
In the example shown in FIG. 4, the position of the target to be checked in the two-dimensional space is [66,212], far from the 5 leftmost leaf nodes of the Ball-tree structure, namely the initial 5 neighbors. The pruning effect is therefore poor throughout the query: only two pruning operations are performed, covering 10 sample points in total, and the spatial distance must be compared against 34 leaf nodes to finish the 5-nearest-neighbor search, so query efficiency is low.
The K-nearest-neighbor search algorithm of the original Ball-tree structure is inefficient for the following reason:
while the KNN_result array holds fewer than K elements, the leaf nodes processed first are stored in it so that the element count reaches K as soon as possible; the array is then continually updated with better neighbors until the algorithm ends. The main disadvantage of this original algorithm is that the initial K data points in the KNN_result array are fixed, namely the K leftmost leaf nodes of the Ball-tree structure, wherever the target to be checked lies in the multidimensional space. The positions of targets vary randomly, yet the initial K neighbors of every target point are the same and may well be far from the target; pruning the Ball-tree with them makes the pruning distance too large, and an ideal pruning effect is difficult to achieve.
If a group of relatively close neighbors can be found quickly near the target point to be checked and used as the initial neighbors for pruning the Ball-tree, the shortened pruning distance lets more subtrees be pruned and filtered away, reduces the numbers of spatial-distance computations and comparisons, and thereby raises overall query efficiency.
On the basis of the prior art and the problems described above, the specific implementation of the invention is as follows:
example 1
Referring to FIG. 5, a dual-tree based high-dimensional vector space sample fast searching method is provided for sample searching in information retrieval, text classification, pattern recognition, data mining, and image processing, comprising:
S1: during construction of the dual-tree structure, collect statistics on every subtree and store the results in a node array. Each element of the node array holds the statistics of one subtree: the number of data points in the subtree, the hypersphere radius corresponding to the subtree, and the subtree's data point set. The array elements are sorted in descending order of the number of data points; elements with equal point counts are sorted in ascending order of hypersphere radius.
S2: filter every subtree whose number of data points is greater than or equal to the maximum pruning point count: keep the data point closest to the subtree's centroid, and filter the remaining data points into the deleted point set. After all subtrees are processed, build the deleted tree from the deleted point set according to the dual-tree construction algorithm; the elements remaining after removing the deleted point set from the full sample point set form the pruning point set, from which the pruning tree is built according to the dual-tree construction algorithm.
S3: first invoke the K-nearest-neighbor algorithm on the pruning tree to find Kg neighbors of the point to be checked, where 1 ≤ Kg ≤ K; then, using the 1st and the Kg-th of these neighbors in turn as initial neighbors of the deleted tree, search for the K neighbors of the point to be checked in the deleted tree; when fewer than K neighbors are found, set the nearest-neighbor array to empty and search for the K neighbors of the point to be checked in the complete tree.
S4: divide the original data set into a test set and a training set at a given ratio, generate several groups of test and training sets with a random algorithm, and compute the average of the optimal query parameters obtained on each group as the final optimal parameter value Kg, letting a given number of points to be checked find their K nearest neighbors in the pruning tree and the deleted tree.
One processing step is added to the original Ball-tree construction algorithm: the statistics of each subtree are collected and the results stored in the node_density array. Each element of node_density holds the statistics of one subtree: point_num, the number of data points in the subtree (i.e. its number of leaf nodes); radius, the hypersphere radius corresponding to the subtree; and points, the data point set corresponding to the subtree. The array elements are sorted in descending order of point_num, and elements with equal point_num values are sorted in ascending order of radius.
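A minimal sketch of this added statistics pass, assuming the Node fields of the earlier build sketch (the dictionary keys mirror the patent's point_num, radius, and points):

def collect_node_density(root):
    # Walk the tree, record each subtree's statistics, then apply the
    # ordering described above: descending point_num, ties by ascending radius.
    stats = []
    def walk(node):
        if node is None:
            return
        stats.append({"point_num": len(node.points),
                      "radius": node.radius,
                      "points": node.points})
        walk(node.child1)
        walk(node.child2)
    walk(root)
    stats.sort(key=lambda s: (-s["point_num"], s["radius"]))
    return stats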
Specifically, in step S2 the pruning tree is denoted reduce_tree and the deleted tree pruned_tree;
the input parameters are: the data point set data to be pruned; the node array node_density storing the statistics of every subtree, whose elements are sorted in descending order of the number of data points in each subtree, elements with equal point_num values being sorted in ascending order of the subtree's hypersphere radius; and the integer variable max_points, the maximum pruning point count;
the output parameters are: the deleted tree pruned_tree built from the pruned data points, and the pruning tree reduce_tree built from the remaining data points;
defining pruning-tree and deleted-tree construction algorithm 2:
reduce_and_pruned_tree(node_density, data, max_points)
the construction of the pruning tree reduce_tree and the deleted tree pruned_tree comprises:
S2.1: let the deleted-tree array store the deleted data points, with initial value empty; let the pruning-tree array store the remaining data points, with initial value empty;
S2.2: form the set T = {T1, T2, …, Tn} from all node-array elements whose number of data points is greater than or equal to the maximum pruning point count; classify the subtrees by their number of data points; define a real variable avg_max_radius for the average of the maximum radii of the subtrees pruned so far, initialized to T1.radius; define an integer variable current_point_num for the number of data points of the subtree class currently processed, initialized to T1.point_num; define a real variable previous_radius for the hypersphere radius of the previous subtree, initialized to T1.radius; define an integer variable i, initialized to 2;
S2.3: fetch element Ti from the set T; if Ti.point_num != current_point_num, i.e. the current subtree class has been fully processed, execute S2.6, otherwise execute S2.4;
S2.4: if Ti.radius − previous_radius ≤ 2 or Ti.radius < avg_max_radius, execute S2.5; otherwise let i = i + 1, and if i ≤ len(T) execute S2.3, skipping the remaining subtrees of this class in turn, otherwise execute S2.7;
S2.5: let center be the centroid of the data point set Ti.points and min_dist_point the point of Ti.points closest to center; let pruned_tree = pruned_tree ∪ (Ti.points − min_dist_point), i.e. merge all data points of Ti.points except min_dist_point into the pruned_tree array; let previous_radius = Ti.radius and update the avg_max_radius value; let i = i + 1, and if i ≤ len(T) execute S2.3, otherwise execute S2.7;
S2.6: let current_point_num = Ti.point_num, updating the point count of the subtree class currently processed, then execute S2.5;
S2.7: let reduce_tree = data − pruned_tree; S2.1 to S2.6 thus construct the deleted tree from all pruned data points and the pruning tree from all remaining data points.
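The steps above condense into a short routine. The Python sketch below is our reading of algorithm 2, reusing the earlier sketches; sample points are assumed to be hashable coordinate tuples so the set arithmetic of S2.5 and S2.7 is direct, and corner cases the patent leaves open (for example a point retained in one subtree but deleted through an overlapping subtree) are not treated specially:

def reduce_and_pruned_tree(node_density, data, max_points):
    # node_density comes from collect_node_density: descending point_num,
    # ties by ascending radius.  data is a list of coordinate tuples.
    T = [t for t in node_density if t["point_num"] >= max_points]
    pruned = set()                                   # deleted-point set (S2.1)
    if T:
        pruned_radii = [T[0]["radius"]]
        avg_max_radius = T[0]["radius"]              # S2.2 initial values
        current_point_num = T[0]["point_num"]
        previous_radius = T[0]["radius"]
        for t in T[1:]:
            if t["point_num"] != current_point_num:  # S2.3 -> S2.6: new class
                current_point_num = t["point_num"]
            elif not (t["radius"] - previous_radius <= 2
                      or t["radius"] < avg_max_radius):
                continue                             # S2.4 fails: keep this subtree
            c = centroid(t["points"])                # S2.5: keep only the point
            keep = min(t["points"], key=lambda p: dist(p, c))  # nearest the centroid
            pruned |= set(t["points"]) - {keep}
            previous_radius = t["radius"]
            pruned_radii.append(t["radius"])
            avg_max_radius = sum(pruned_radii) / len(pruned_radii)
    reduce_points = [p for p in data if p not in pruned]       # S2.7
    pruned_points = sorted(pruned)
    reduce_tree = build_ball_tree(reduce_points)
    pruned_tree = build_ball_tree(pruned_points) if pruned_points else None
    return reduce_tree, pruned_tree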
Referring to FIG. 6, the method keeps the elements of the reduce_tree set as few as possible while preserving the spatial morphology of the original sample points to the greatest extent. In the original sample point set, outlier data points are relatively few in number but play an important role in maintaining the spatial morphology of the original sample points, so these sample points should be retained as far as possible. S2.4 of algorithm 2 computes the radius difference between sub-hyperspheres: when the radius increment is too large and the radius exceeds the average radius of all sub-hyperspheres already filtered, the sample points contained in the sub-hypersphere are considered important for preserving the spatial form, and pruning of the data points in that sub-hypersphere stops.
The integer variable max_points defines the pruning strength; increasing this value causes more data points to be pruned, and the value must be determined by accurate calculation to achieve the best result. FIG. 6 compares the pruning effect on 3960 sample points in a two-dimensional space as the pruning count max_points is adjusted: the density of the remaining sample points differs, but the basic form of the original sample points is retained to the greatest extent, and the outliers are retained to the greatest extent.
Referring to FIG. 7, the reduce_tree structure constructed by algorithm 2 from the 44 sample points shown in FIG. 2 is shown, with the max_points value set to 10; 16 sample points remain after pruning. FIG. 8 is the corresponding pruned_tree structure constructed by algorithm 2. FIG. 9 shows part of the pruning process: the first three subtrees each have five leaf nodes and all meet the pruning condition, so each subtree retains only the data point closest to its sub-hypersphere centroid and the remaining four data points are deleted. The hypersphere radius of the fourth subtree exceeds that of the previous subtree by 6.258964, more than the critical value, and its radius is also greater than the average radius of all pruned subtrees, so no pruning is performed and all 5 leaf nodes of the subtree (2, 4, 10, 11, 33) are retained. In the subsequent processing, however, the left and right children of that subtree, two subtrees with 3 and 2 leaf nodes respectively, both meet the pruning condition, so finally only the two leaf nodes (11, 2) are retained and the other three leaf nodes (4, 10, 33) are pruned. This process embodies the idea of maximally preserving the spatial morphology of the original sample points.
Defining dual-tree K-nearest-neighbor recursive search algorithm 3:
search_KNN_IN_double_tree(ball_tree, pruned_tree, reduce_tree, target, K, K_g)
Input parameters: the original ball_tree root node; the deleted-tree root node pruned_tree; the pruning-tree root node reduce_tree; the target to be checked; the number K of nearest neighbors to find; the number K_g of nearest neighbors to find in the pruning tree.
Output parameters: the K nearest data points of the target to be checked.
Specifically, step S3 includes:
S3.1: call the single-tree K-nearest-neighbor search algorithm to determine the Kg neighbors of the target to be checked in the pruning tree, and store the Kg neighbors found in the reduce_tree.KNN_result array;
S3.2: preset the nearest neighbor found in the pruning tree into the K-neighbor array of the deleted tree and determine the K neighbors of the target to be checked in the deleted tree; when the number num of neighbors located is greater than or equal to K, the search ends, otherwise execute S3.3;
S3.3: preset the Kg-th neighbor found in the pruning tree into the K-neighbor array of the deleted tree and determine the K neighbors of the target to be checked in the deleted tree again; when num ≥ K the search ends, otherwise execute S3.4;
S3.4: preset the K-neighbor array of the complete tree to empty and search for the K neighbors in the complete tree with the single-tree K-nearest-neighbor search algorithm; the search then ends.
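Combining the pieces gives a compact version of algorithm 3. In this Python sketch the seed array returned by search_knn is sorted in descending distance, so its last element is the 1st (nearest) neighbor and its first element the Kg-th; the staged fallbacks follow S3.2 to S3.4, and the pruning tree is assumed to hold at least k_g points:

def search_knn_in_double_tree(ball_tree, pruned_tree, reduce_tree, target, k, k_g):
    # S3.1: locate k_g neighbours of the target in the small pruning tree.
    seed = search_knn([], reduce_tree, target, k_g)
    # S3.2 and S3.3: seed the deleted-tree search with the 1st and then the
    # k_g-th pruning-tree neighbour.
    for start in (seed[-1], seed[0]):
        result = search_knn([start], pruned_tree, target, k)
        if len(result) >= k:
            return result
    # S3.4: fall back to the complete tree with an empty seed array.
    return search_knn([], ball_tree, target, k)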
Referring to FIG. 10, an example of a 5-nearest-neighbor search within the dual-tree structure of FIGS. 7 and 8 is shown. With the target to be checked at position [66,212] in the two-dimensional space, the K-nearest-neighbor algorithm of the original Ball-tree structure prunes and filters only 10 leaf nodes and must compute and compare the spatial distance against 34 leaf nodes to finish the 5-nearest-neighbor search, so query efficiency is low. With the dual-tree K-nearest-neighbor algorithm, the spatial distances against 10 leaf nodes are compared in the pruning tree to determine 3 neighbors there, and the spatial distance 24.08 between the point to be checked and the leaf node numbered 36 is set as the initial neighbor value of the deleted tree. The initial neighbor value of the original algorithm is 238.13; the dual-tree initial value is much closer to the 5th neighbor of the point to be checked, the pruning effect is better, and the 5 neighbors are determined after 7 comparisons in the deleted tree. The whole query computes and compares the spatial distance of 17 leaf nodes and prunes and filters out 27 leaf nodes in total, a clear efficiency improvement over the original algorithm. Experimental results show that the pruning count max_points and the pruning-tree neighbor count Kg greatly influence overall query efficiency: for example, with max_points in the value interval [2,11] and Kg in [2,5], each Kg value admits Kg − 1 choices of starting neighbor, giving 10 (Kg, begin_point) combinations and thus 10 × 10 = 100 groups of selectable parameters, whose query efficiencies differ greatly.
The original data set is divided in the ratio 8:2: 20% of the data points are selected at random as the test set, the randomness keeping the test data points evenly distributed, and the remaining 80% of the data points are used as the training set. Ten groups of test and training data sets are generated with a random algorithm, and the average of the optimal parameters obtained on each group is computed as the final optimal parameter value.
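A minimal sketch of this splitting step (the names make_splits, n_groups, test_ratio, and seed are our own illustrative choices):

import random

def make_splits(data, n_groups=10, test_ratio=0.2, seed=0):
    # Generate n_groups random 8:2 (train, test) splits of the data set.
    rng = random.Random(seed)
    splits = []
    for _ in range(n_groups):
        shuffled = list(data)
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * test_ratio)
        splits.append((shuffled[cut:], shuffled[:cut]))  # (train, test)
    return splits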
Defining optimal query parameter determination algorithm 4:
determine_optimal_parameters(node_density, train_data, test_data, K, max_len):
Specifically, the input parameters are defined as: the data point set train_data for training; the data point set test_data for testing; the node array node_density storing the statistics of every subtree; the number K of nearest neighbors to find; the upper limit max_len of the pruning count;
the output parameters are: the optimal pruning count max_points; the optimal pruning-tree neighbor count Kg; the starting neighbor begin_point; the ending neighbor end_point;
the step of determining the optimal query parameters includes:
S4.1: build a Ball-tree structure from the train_data set; define real variables T1 and T2 for time statistics, both with initial value 0; define an integer variable Len with initial value 2;
S4.2: build the pruning tree reduce_tree and the deleted tree from the train_data set with the pruning count set to Len; define an integer variable i with initial value 1;
S4.3: if i ≤ len(test_data) execute S4.4; otherwise let Len = Len + 1, and if Len < max_len execute S4.2, otherwise execute S4.8 to compute the optimal parameters;
S4.4: define an integer variable j with initial value 2; if j ≤ K execute S4.5, otherwise let i = i + 1 and execute S4.3;
S4.5: query the j neighbors of test_data[i] in the pruning tree; T1 records the time spent on this query;
S4.6: define an integer variable n with initial value 1; if n ≤ j − 1 execute S4.7, otherwise let j = j + 1, and if j ≤ K execute S4.5, otherwise let i = i + 1 and execute S4.3;
S4.7: let begin_point = n and end_point = j; query the K neighbors of test_data[i] in the dual-tree structure, with the number of neighbors queried in the pruning tree Kg = j and the two initial neighbors of the deleted tree reduce_tree.KNN_result[j−1] and reduce_tree.KNN_result[n−1]; T2 records the time spent on this query. Accumulate the total query time consumed under each query parameter combination in a result array whose element structure is [max_points, K_g, begin_point, end_point, T], the first four items forming the primary key: if no corresponding record exists in the result array, add a new record, otherwise add the value T1 + T2 to the corresponding record. Let n = n + 1; if n ≤ j − 1 execute S4.7, otherwise let j = j + 1, and if j ≤ K execute S4.5, otherwise let i = i + 1 and execute S4.3;
S4.8: sort the result array elements in ascending order of the time statistics; after sorting, the field values of element result[0] are the optimal parameter values.
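S4.1 to S4.8 amount to a grid search over (max_points, K_g, begin_point, end_point) timed over the test points. A condensed Python sketch under the same assumptions as the earlier sketches (the pruning tree is assumed to hold at least j points per query):

import time
from collections import defaultdict

def determine_optimal_parameters(train_data, test_data, K, max_len):
    ball_tree = build_ball_tree(train_data)                  # S4.1
    node_density = collect_node_density(ball_tree)
    totals = defaultdict(float)                              # primary key -> total time
    for max_points in range(2, max_len):                     # S4.2-S4.3: Len loop
        reduce_tree, pruned_tree = reduce_and_pruned_tree(
            node_density, train_data, max_points)
        for target in test_data:
            for j in range(2, K + 1):                        # S4.4
                t0 = time.perf_counter()                     # S4.5: T1
                seed = search_knn([], reduce_tree, target, j)
                t1 = time.perf_counter() - t0
                for n in range(1, j):                        # S4.6-S4.7
                    t0 = time.perf_counter()                 # T2
                    for start in (seed[j - 1], seed[n - 1]): # the two initial seeds
                        result = search_knn([start], pruned_tree, target, K)
                        if len(result) >= K:
                            break
                    else:
                        search_knn([], ball_tree, target, K) # complete-tree fallback
                    t2 = time.perf_counter() - t0
                    totals[(max_points, j, n, j)] += t1 + t2
    # S4.8: the parameter combination with the smallest summed time is optimal.
    max_points, k_g, begin_point, end_point = min(totals, key=totals.get)
    return max_points, k_g, begin_point, end_point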
For the 3960 sample points in the two-dimensional space shown in FIG. 6(a), the training and test sets are divided in the ratio 8:2, the number K of neighbors to find is limited to 5, the pruning count ranges over [6,29], and the pruning-tree neighbor count Kg over [2,5]. FIG. 11 shows the averages of the 10 groups of training and test results computed with the optimal parameters obtained by algorithm 4: the optimum is max_points = 14 and K_g = 2, under which the average query time over all test points is 3.2 seconds; the worst parameters are found to be max_points = 6 and K_g = 2, under which the average query time is 10.8 seconds.
FIG. 12 shows, under the optimal and worst parameters, a statistical analysis of the distances from 400 test points to their initial and ending neighbors in the pruning tree (reduce_min_radius and reduce_max_radius) and to their K-th neighbor in the complete tree (all_max_radius). FIG. 12(a) shows the optimal-parameter test results arranged in descending order of reduce_min_radius: under the optimal parameters, for most test points the distance value all_max_radius to the K-th neighbor is smaller than reduce_min_radius, so the K neighbors are hit with a single search of the deleted tree; only a few test points have all_max_radius greater than reduce_max_radius, so overall query efficiency is highest. FIG. 12(b) shows the worst-parameter test results arranged in descending order of all_max_radius: under the worst parameters, for most test points all_max_radius is greater than reduce_min_radius but smaller than reduce_max_radius, i.e. two searches of the deleted tree are needed to determine the K neighbors, and a very few test points need to search the complete tree to determine the K neighbors.
In the training stage, the original data set is divided into a training set and a testing set at a ratio of 8:2, 10 groups of training and testing sets are generated by random selection, and the optimal dual-tree construction parameters are obtained through statistical analysis. Using the optimal parameters, a very small number of data points are filtered out of the original data point set to form the pruning tree, and the remaining data points form the deleted tree; the pruning tree preserves the distribution of the original data point set in the high-dimensional space to the greatest extent. In the query stage, because the pruning tree contains few data points, the nearest neighbor can be located rapidly; it is then used as the initial nearest neighbor for the deleted tree, in which the K nearest neighbors are searched. Experimental results show that the initial adjacent point is not fixed but lies near the point to be checked, which effectively reduces the pruning radius, improves the pruning effect, and raises the K-neighbor query efficiency.
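A minimal runnable sketch of this two-stage query follows, assuming scikit-learn's BallTree as a stand-in for the patent's tree structures and a radius-bounded query to emulate the seeded search; pruning_pts is the small remaining point set (reduce_tree) and deleted_pts the pruned-out majority (pruned_tree).

    import numpy as np
    from sklearn.neighbors import BallTree

    def dual_tree_knn(q, pruning_pts, deleted_pts, K, Kg=2):
        q = np.asarray(q, dtype=float).reshape(1, -1)
        pruning_tree = BallTree(pruning_pts)   # small representative tree
        deleted_tree = BallTree(deleted_pts)   # large tree of pruned-out points
        # Stage 1: Kg coarse neighbors from the small pruning tree.
        seed_dist, _ = pruning_tree.query(q, k=min(Kg, len(pruning_pts)))
        # Stage 2: radius-bounded search of the deleted tree, using the
        # Kg-th seed distance as the pruning radius.
        ind, dist = deleted_tree.query_radius(
            q, r=seed_dist[0, -1], return_distance=True, sort_results=True)
        if len(ind[0]) >= K:
            return dist[0][:K], ind[0][:K]     # indices refer to deleted_pts
        # Fallback: exact K-neighbor query over the complete point set
        # (indices here refer to the stacked array).
        complete_tree = BallTree(np.vstack([pruning_pts, deleted_pts]))
        dist, ind = complete_tree.query(q, k=K)
        return dist[0], ind[0]

Note the simplification: the sketch seeds directly with the Kg-th coarse distance rather than trying the 1st and Kg-th seeds in turn, and its fallback rebuilds an exact complete tree per call, which a real implementation would construct once.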
Example 2
Referring to FIG. 13, a dual-tree based fast search device for high-dimensional vector space samples is provided, adopting the dual-tree based fast search method for high-dimensional vector space samples of embodiment 1 or any implementation thereof, and comprising:
the initial statistics module 1 is used for counting information of each subtree during construction of the double-tree structure and storing the statistics into a node array, wherein each element of the node array corresponds to the statistics of one subtree; the statistics comprise the number of data points in the subtree, the super sphere radius corresponding to the subtree, and the data point set corresponding to the subtree; the elements of the node array are arranged in descending order of the number of data points in each subtree, and elements with the same number of data points are arranged in ascending order of the corresponding super sphere radius, as shown in the sketch after this module list;
the deleted tree and pruning tree construction module 2 is used for filtering each subtree whose number of data points is greater than or equal to the maximum pruning point number: the data point closest to the centroid of each such subtree is reserved, and the data points which are not reserved are filtered out and stored into a deleted point set; after all subtrees are processed, the data points in the deleted point set form a deleted tree according to the double-tree construction algorithm; the deleted point set is removed from the full sample point set, the remaining elements form a pruning point set, and the data points in the pruning point set form a pruning tree according to the double-tree construction algorithm;
the recursive search module 3 is used for first calling the K nearest neighbor algorithm in the pruning tree to find the Kg adjacent points of the point to be checked, wherein Kg is more than or equal to 1 and less than or equal to K; then respectively using the 1st and the Kg-th neighbor points as initial neighbor points of the deleted tree and searching the K neighbor points of the point to be checked in the deleted tree; when the number of found neighbor points is still less than K, setting the nearest neighbor array to empty and searching the K neighbor points of the point to be checked in the complete tree;
the optimal query parameter determining module 4 is configured to divide the original data set into a test set and a training set at a given ratio, generate a plurality of groups of test and training sets with a random algorithm, and take the average of the optimal query parameters obtained on each group of test and training sets as the final optimal parameter values, so that a given number of points to be checked can find their K nearest neighbors in the pruning tree and the deleted tree.
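As a small illustration of the ordering maintained by the initial statistics module (a Python sketch; the dict fields are our own naming, not part of the disclosure):

    # Hypothetical node-array records: point_num is the number of data points
    # in the subtree, radius the corresponding super sphere radius.
    node_array = [
        {"point_num": 87, "radius": 2.4},
        {"point_num": 130, "radius": 5.0},
        {"point_num": 130, "radius": 3.7},
    ]
    # Descending by point count; ties broken by ascending radius.
    node_array.sort(key=lambda t: (-t["point_num"], t["radius"]))
    # Result order: (130, 3.7), (130, 5.0), (87, 2.4)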
It should be noted that, because the information interaction and execution processes between the modules/units of the above device are based on the same concept as the method embodiment in embodiment 1 of the present application, their technical effects are the same as those of the method embodiment, and for specific details reference may be made to the description of method embodiment 1 above.
Example 3
There is provided a computer readable storage medium having stored therein program code for dual tree based high dimensional vector space sample fast searching, the program code comprising instructions for performing the dual tree based high dimensional vector space sample fast searching method of embodiment 1 or any implementation thereof.
Computer readable storage media can be any available media that can be accessed by a computer, or data storage devices such as servers or data centers that integrate one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid state disks (Solid State Disk, SSD)), etc.
Example 4
There is provided an electronic device comprising a processor coupled to a storage medium, which when executing instructions in the storage medium, causes the electronic device to perform the dual-tree based high-dimensional vector space sample fast search method of embodiment 1 or any implementation thereof.
Specifically, the processor may be implemented by hardware or software. When implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like; when implemented by software, the processor may be a general-purpose processor realized by reading software code stored in a memory, and the memory may be integrated in the processor or located outside the processor as a standalone component.
In the above embodiments, the implementation may be wholly or partly realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be wholly or partly realized in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave, etc.).
Specifically, a Central Processing Unit (CPU) performs various processes according to a program stored in a Read Only Memory (ROM) or a program loaded from a storage section into a Random Access Memory (RAM). In the RAM, data required when the CPU executes various processes and the like is also stored as needed. The CPU, ROM, and RAM are connected to each other via a bus. An input/output interface is also connected to the bus.
The following components are connected to the input/output interface: an input section (including a keyboard, a mouse, etc.), an output section (including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker), a storage section (including a hard disk), and a communication section (including a network interface card such as a LAN card or a modem). The communication section performs communication processing via a network such as the Internet. A drive may also be connected to the input/output interface as needed. Removable media such as magnetic disks, optical disks, magneto-optical disks, and semiconductor memories may be mounted on the drive as needed, so that a computer program read therefrom is installed into the storage section as needed.
In the case of implementing the above-described series of processes by software, a program constituting the software is installed from a network such as the internet or a storage medium such as a removable medium.
Those skilled in the art will appreciate that examples of such removable media include magnetic disks (including floppy disks (registered trademark)), optical disks (including compact disk read-only memories (CD-ROM) and digital versatile disks (DVD)), magneto-optical disks (including mini-discs (MD) (registered trademark)), and semiconductor memories. Alternatively, the storage medium may be a ROM, a hard disk contained in the storage section, or the like, in which the program is stored and which is distributed to users together with the device containing it.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented with a general-purpose computing device; they may be concentrated on a single computing device or distributed across a network of computing devices. Optionally, they may be implemented with program code executable by computing devices, so that they can be stored in a memory device and executed by the computing devices; in some cases the steps shown or described may be performed in a different order than here; alternatively, they may be fabricated as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
While the invention has been described in detail in the foregoing general description and specific examples, it will be apparent to those skilled in the art that modifications and improvements can be made thereto. Accordingly, such modifications or improvements made without departing from the spirit of the invention are intended to fall within the scope of the invention as claimed.

Claims (7)

1. A dual-tree based high-dimensional vector space sample fast searching method, characterized by comprising the following steps:
step 1, in the construction process of a double tree structure, counting information of each subtree, storing a counting result into a node array, wherein each element in the node array corresponds to the counting information of one subtree, the counting information comprises the number of data points in the subtree, the super sphere radius corresponding to the subtree and a data point set corresponding to the subtree, the elements in the node array are arranged in descending order according to the number of the data points in each subtree, and the elements with the same number of the data points in the subtree are arranged in ascending order according to the super sphere radius corresponding to the subtree;
step 2, filtering each subtree whose number of data points is greater than or equal to the maximum pruning point number: reserving the data point closest to the centroid in each such subtree, filtering out the data points which are not reserved, and storing them into a deleted point set; after all subtrees are processed, forming the data points in the deleted point set into a deleted tree according to the double-tree construction algorithm; removing the deleted point set from the full sample point set, forming the remaining elements into a pruning point set, and forming the data points in the pruning point set into a pruning tree according to the double-tree construction algorithm;
step 3, firstly calling the K neighbor algorithm in the pruning tree to find the Kg adjacent points of the point to be checked, wherein Kg is more than or equal to 1 and less than or equal to K; then respectively using the 1st and the Kg-th neighbor points as initial neighbor points of the deleted tree and searching the K neighbor points of the point to be checked in the deleted tree; when the number of found neighbor points is still less than K, setting the nearest neighbor array to empty and searching the K neighbor points of the point to be checked in the complete tree;
and step 4, dividing the original data set into a test set and a training set according to a given proportion, generating a plurality of groups of test and training sets with a random algorithm, and taking the average of the optimal query parameters obtained on each group of test and training sets as the final optimal parameter values, so that a given number of points to be checked can find their K nearest neighbors in the pruning tree and the deleted tree.
2. The dual-tree based high-dimensional vector space sample fast searching method according to claim 1, wherein in step 2 the pruning tree is denoted reduce_tree and the deleted tree is denoted pruned_tree;
the input parameters are: the data point set data to be pruned; the node array node_density storing the statistics of each subtree, whose elements are arranged in descending order of the number of data points in each subtree, elements with the same point_num value being arranged in ascending order of the corresponding super sphere radius; and the integer variable max_points representing the maximum pruning point number;
the output parameters are: the deleted tree pruned_tree constructed from the pruned-out data points, and the pruning tree reduce_tree constructed from the remaining data points;
the construction steps of the deleted tree pruned_tree and the pruning tree reduce_tree comprise:
step 2.1, letting the deleted tree array store the pruned-out data points, with initial value null; letting the pruning tree array store the remaining data points after pruning, with initial value null;
step 2.2, forming a set T from all elements of the node array whose number of data points is greater than or equal to the maximum pruning point number, T={T1, T2, …, Tn}, so that subtrees are grouped by their number of data points; defining a real variable avg_max_radius representing the average of the maximum radii of the pruned subtrees, with initial value T1.radius; defining an integer variable current_point_num representing the number of data points of the subtree class currently processed, with initial value T1.point_num; defining a real variable previous_radius representing the sphere radius of the previous subtree, with initial value T1.radius; defining an integer variable i with initial value 2;
step 2.3, retrieving the element Ti from the set T; if Ti.point_num != current_point_num, the current class of subtrees has been processed and step 2.6 is executed; otherwise step 2.4 is executed;
step 2.4, if Ti.radius − previous_radius is less than or equal to 2, or Ti.radius < avg_max_radius, executing step 2.5; otherwise letting i=i+1 and, if i is less than or equal to len(T), executing step 2.3, thereby skipping the remaining subtrees of this class in turn; otherwise executing step 2.7;
step 2.5, letting center be the centroid of the data point set Ti.points and min_dist_point be the point of Ti.points closest to center; letting pruned_tree = pruned_tree ∪ (Ti.points − min_dist_point), i.e., incorporating the data points of Ti.points other than min_dist_point into the pruned_tree array; letting previous_radius = Ti.radius and updating the avg_max_radius value at the same time; letting i=i+1; if i is less than or equal to len(T), executing step 2.3, otherwise executing step 2.7;
step 2.6, letting current_point_num = Ti.point_num to update the point count of the subtree class currently processed, and executing step 2.5;
step 2.7, letting reduce_tree = data − pruned_tree; through steps 2.1 to 2.6, the deleted tree pruned_tree composed of all pruned-out data points and the pruning tree reduce_tree composed of all remaining data points are constructed.
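By way of illustration, a simplified Python sketch of this point-set split follows; split_point_sets and the "indices" field are our own naming, subtree statistics are assumed to carry index arrays into the sample matrix, and the radius conditions of step 2.4 are deliberately omitted.

    import numpy as np

    def split_point_sets(all_points, subtrees, max_points):
        # For each subtree holding at least max_points points, retain only the
        # point closest to the subtree centroid; every other point of that
        # subtree joins the pruned point set (the deleted tree). subtrees is
        # the node statistics list: dicts carrying an "indices" array into
        # all_points, processed in descending order of size as in step 2.2.
        pruned = set()
        big = [t for t in subtrees if len(t["indices"]) >= max_points]
        for t in sorted(big, key=lambda t: -len(t["indices"])):
            pts = all_points[t["indices"]]
            centroid = pts.mean(axis=0)                      # step 2.5
            keep = t["indices"][np.argmin(np.linalg.norm(pts - centroid, axis=1))]
            pruned.update(int(i) for i in t["indices"] if i != keep)
        # step 2.7: the remaining indices form the pruning tree point set.
        remaining = np.setdiff1d(np.arange(len(all_points)), sorted(pruned))
        return np.array(sorted(pruned), dtype=int), remaining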
3. The fast search method of high-dimensional vector space samples based on double trees according to claim 1, wherein the step 3 comprises:
Step 3.1, calling a single tree structure K neighbor searching algorithm, determining Kg neighbor points of a target to be checked in a pruning tree, and storing the found Kg neighbor points in a reduce_tree.KNN_result array;
step 3.2, presetting the nearest neighbor point of the target to be checked in the pruning tree into the K neighbor array of the deleted tree, and determining the K adjacent points of the target to be checked in the deleted tree; when the number num of located neighbor points is greater than or equal to K, the search ends; otherwise step 3.3 is executed;
step 3.3, presetting the Kg-th neighbor point of the target to be checked in the pruning tree into the K neighbor array of the deleted tree, and determining the K adjacent points of the target to be checked in the deleted tree; when the number num of located adjacent points is greater than or equal to K, the search ends; otherwise step 3.4 is executed;
and 3.4, presetting a K neighbor array of the complete tree to be null, and searching K neighbor points in the complete tree by using a K neighbor search algorithm with a single tree structure, wherein the searching is finished.
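A control-flow sketch of steps 3.1-3.4 follows, assuming hypothetical callables knn(tree, q, k) and seeded_knn(tree, q, k, seed), where the latter presets the K neighbor array with the given seed point and may return fewer than k points when the seeded bound prunes too aggressively.

    def cascade_query(q, pruning_tree, deleted_tree, complete_tree, K, Kg,
                      knn, seeded_knn):
        seeds = knn(pruning_tree, q, Kg)                  # step 3.1: Kg coarse seeds
        result = seeded_knn(deleted_tree, q, K, seeds[0])   # step 3.2: 1st seed
        if len(result) >= K:
            return result[:K]
        result = seeded_knn(deleted_tree, q, K, seeds[-1])  # step 3.3: Kg-th seed
        if len(result) >= K:
            return result[:K]
        return knn(complete_tree, q, K)                   # step 3.4: exact fallback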
4. The fast search method of high-dimensional vector space samples based on dual trees according to claim 3, wherein in step 4 the input parameters are defined as follows: a data point set train_data for training; a data point set test_data for testing; a node array node_density storing the statistics of each subtree; the number K of nearest neighbors to be checked; and the upper limit max_len of the pruning number;
the output parameters are: the optimal pruning number max_points; the number Kg of neighbor points of the optimal pruning tree; the starting neighbor point begin_point; and the ending neighbor point end_point;
the step of determining optimal query parameters includes:
step 4.1, constructing a Ball-tree structure from the train_data set; defining real variables T1 and T2 for time statistics, both with initial value 0; defining an integer variable Len with initial value 2;
step 4.2, constructing the pruning tree reduce_tree and the deleted tree pruned_tree from the train_data set, setting the pruning number to Len; defining an integer variable i with initial value 1;
step 4.3, if i is less than or equal to len(test_data), executing step 4.4; otherwise letting Len=Len+1; if Len < max_len, executing step 4.2; otherwise executing step 4.8 to count the optimal parameters;
step 4.4, defining an integer variable j, wherein the initial value is 2, if j is less than or equal to K, executing step 4.5, otherwise, letting i=i+1, and executing step 4.3;
step 4.5, querying the j adjacent points of test_data[i] in the pruning tree, with T1 being the time spent on the step 4.5 query;
step 4.6, defining an integer variable n with initial value 1; if n is less than or equal to j-1, executing step 4.7; otherwise letting j=j+1; if j is less than or equal to K, executing step 4.5; otherwise letting i=i+1 and executing step 4.3;
step 4.7, letting begin_point=n and end_point=j; querying the K adjacent points of test_data[i] in the double-tree structure, with the number of adjacent points queried in the pruning tree defined as Kg=j and the two initial adjacent points of the deleted tree defined as reduce_tree.KNN_result[j-1] and reduce_tree.KNN_result[n-1]; T2 is the time spent on the query in step 4.7; counting the sum of the query time consumed under each query parameter combination, where the result array element structure is [max_points, K_g, begin_point, end_point, T] and the first four items are primary key fields; if no corresponding record exists in the result array, a new record is added, otherwise the value T1+T2 is added to the corresponding record; letting n=n+1; if n is less than or equal to j-1, executing step 4.7; otherwise letting j=j+1; if j is less than or equal to K, executing step 4.5; otherwise letting i=i+1 and executing step 4.3;
and step 4.8, according to the time statistics, sorting the result array elements in ascending order of total query time; after sorting, the field values of the result[0] element are the optimal parameter values.
5. A dual-tree based high-dimensional vector space sample fast search device, adopting the dual-tree based high-dimensional vector space sample fast searching method according to any one of claims 1 to 4, characterized by comprising:
the initial statistics module, used for counting information of each subtree during construction of the double-tree structure and storing the statistics into a node array, wherein each element of the node array corresponds to the statistics of one subtree; the statistics comprise the number of data points in the subtree, the super sphere radius corresponding to the subtree, and the data point set corresponding to the subtree; the elements of the node array are arranged in descending order of the number of data points in each subtree, and elements with the same number of data points in the subtree are arranged in ascending order of the corresponding super sphere radius;
the deleted tree and pruning tree construction module, used for filtering each subtree whose number of data points is greater than or equal to the maximum pruning point number: the data point closest to the centroid of each such subtree is reserved, and the data points which are not reserved are filtered out and stored into a deleted point set; after all subtrees are processed, the data points in the deleted point set form a deleted tree according to the double-tree construction algorithm; the deleted point set is removed from the full sample point set, the remaining elements form a pruning point set, and the data points in the pruning point set form a pruning tree according to the double-tree construction algorithm;
the recursive search module, used for first calling the K nearest neighbor algorithm in the pruning tree to find the Kg adjacent points of the point to be checked, wherein Kg is more than or equal to 1 and less than or equal to K; then respectively using the 1st and the Kg-th neighbor points as initial neighbor points of the deleted tree and searching the K neighbor points of the point to be checked in the deleted tree; when the number of found neighbor points is still less than K, setting the nearest neighbor array to empty and searching the K neighbor points of the point to be checked in the complete tree;
and the optimal query parameter determining module, used for dividing the original data set into a test set and a training set at a given ratio, generating a plurality of groups of test and training sets with a random algorithm, and taking the average of the optimal query parameters obtained on each group of test and training sets as the final optimal parameter values, so that a given number of points to be checked can find their K nearest neighbors in the pruning tree and the deleted tree.
6. A computer readable storage medium, wherein the computer readable storage medium has stored therein program code for dual tree based high dimensional vector space sample fast searching, the program code comprising instructions for performing the dual tree based high dimensional vector space sample fast searching method of any of claims 1 to 4.
7. An electronic device comprising a processor coupled to a storage medium, which when executing instructions in the storage medium, causes the electronic device to perform the dual-tree based high-dimensional vector space sample fast search method of any one of claims 1 to 4.
CN202011127725.9A 2020-10-20 2020-10-20 High-dimensional vector space sample rapid searching method and device based on double trees Active CN112308122B (en)
