CN108710626A - An approximate nearest neighbor search method and search system for high-dimensional data based on a satellite system graph - Google Patents

An approximate nearest neighbor search method and search system for high-dimensional data based on a satellite system graph

Info

Publication number
CN108710626A
Authority
CN
China
Prior art keywords
point
candidate
point set
satellite system
points
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810229529.9A
Other languages
Chinese (zh)
Inventor
付聪
蔡登
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201810229529.9A priority Critical patent/CN108710626A/en
Publication of CN108710626A publication Critical patent/CN108710626A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an approximate nearest neighbor search method and search system for high-dimensional data based on a satellite system graph. The method comprises: (1) building a satellite system graph over the high-dimensional database point set; (2) for a query point, randomly selecting several data points as the candidate point set and performing a greedy approximate nearest neighbor search on the satellite system graph; (3) returning a given number of points from the resulting candidate point set as the result, i.e. the nearest neighbor set of the query point. With the invention, search can reach logarithmic complexity; it not only greatly improves the search accuracy achievable within a given time, but also significantly reduces the memory footprint during search and the time required to build the index.

Description

An approximate nearest neighbor search method and search system for high-dimensional data based on a satellite system graph
Technical field
The present invention relates to the field of data retrieval technology, and in particular to an approximate nearest neighbor search method and search system for high-dimensional data based on a satellite system graph.
Background art
In recent years, approximate nearest neighbor search has been applied ever more widely, and it plays a very important role in fields that must process large-scale high-dimensional data, such as computer vision, machine learning, data mining, natural language processing, and text and image retrieval. Approximate nearest neighbor search operates on a large-scale high-dimensional data point set and must quickly find, among these data points, the several points nearest to a given query point.
For large-scale high-dimensional data, exhaustive brute-force search in the original space is computationally expensive. To improve the efficiency of nearest neighbor search, researchers have proposed a series of approximate nearest neighbor search algorithms. Common approaches mainly include tree-based methods, hashing-based methods, product-quantization-based methods, and graph-based methods.
Because data in real industrial settings has complex structure, existing methods struggle to perform well on very large sets of high-dimensional real-valued vectors. With tree-based methods, search accuracy drops sharply as the data dimension grows. With hashing-based methods, search efficiency is severely limited by the expressive power of the hash functions and by the lookup efficiency of the hash table itself. Product-quantization-based methods work well in the low-accuracy regime, but in high-accuracy search their accuracy ceiling is severely limited by quantization error. Recently, graph-based methods have shown great potential, and some have been shown experimentally to outperform traditional hashing-based, quantization-based, and tree-based methods. However, the efficiency of a graph-based method depends on the intrinsic structure of the graph, and existing graph-based methods variously suffer from time-consuming index construction, low search efficiency, or large memory usage.
To fully demonstrate the efficiency of the method of the present invention, it is compared with a series of graph-based algorithms. These include: KGraph, a method based on an approximate k-nearest-neighbor graph, described in the paper "Efficient k-nearest neighbor graph construction for generic similarity measures" at the 20th International Conference on World Wide Web in 2011, a top international conference on web data mining; Efanna, a composite index method based on tree structures and an approximate k-nearest-neighbor graph, proposed in the Chinese patent document with publication number CN105550358A, which discloses an approximate nearest neighbor search method and search system for high-dimensional data; a method based on the FANNG graph structure, disclosed in the paper "FANNG: Fast Approximate Nearest Neighbour Graphs" at the 2016 IEEE Conference on Computer Vision and Pattern Recognition; HNSW, a method based on a multi-layer navigable pruned graph structure with navigation points, disclosed in the technical article "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs" on Cornell University's arXiv website; DPG, a method based on an angle-diversified undirected graph structure, disclosed in the technical article "Approximate Nearest Neighbor Search on High Dimensional Data—Experiments, Analyses, and Improvement (v1.0)" on Cornell University's arXiv website; and NSG, a method based on a radiating spreading graph with a single navigation point, proposed in the Chinese patent document with publication number CN107729348A, which discloses an approximate nearest neighbor search method and search system for high-dimensional data based on a radiating spreading graph.
Of these, NSG was the most efficient search method prior to the present invention: its search performance is better than that of the other graph-based methods and significantly better than that of tree-based, hashing-based, and product-quantization-based methods. NSG first builds a radiating spreading graph with a navigation point, and then performs a greedy search for the query point on that graph, starting from the navigation point.
However, NSG must select a navigation point for search and cannot start from randomly chosen points. In addition, its edge-pruning strategy selects edges by an either-or rule that relies mainly on edge length; it cannot adjust the angle threshold to the distribution of the data set in order to adapt to the characteristics of the data, and therefore its edges cannot radiate over the neighborhood surrounding every point of the data set.
Summary of the invention
The present invention provides an approximate nearest neighbor search method for high-dimensional data based on a satellite system graph, which markedly improves search efficiency while greatly reducing memory usage.
An approximate nearest neighbor search method for high-dimensional data based on a satellite system graph, characterized by comprising the following steps:
(1) Build a satellite system graph over the high-dimensional database point set;
(2) For a query point, randomly select several data points as the candidate point set and perform a greedy approximate nearest neighbor search on the satellite system graph;
(3) Return a given number of points from the resulting candidate point set as the result, i.e. the nearest neighbor set of the query point.
Step (1) specifically comprises:
(1-1) Build an approximate nearest neighbor graph of the high-dimensional database point set. The approximate nearest neighbor graph is a directed graph in which every point has a fixed number k of out-edges; the neighbors reached by these k edges are not necessarily all among the point's k true nearest neighbors.
(1-2) For any point a in the database, take a as the point under examination, and collect its neighbors in the approximate nearest neighbor graph together with the neighbors of those neighbors to form the index point set. Compute the distance from every point in the index point set to a, sort the points by increasing distance, keep the L nearest points, and delete the rest from the index point set. L is a predetermined value, adjusted according to data set size and data dimension.
(1-3) Starting from the point in the index point set with the smallest distance to a, remove each point from the index point set and add it to the result point set, then check whether the current result point set satisfies full radiation; if not, delete the newly added point. Full radiation means: for any two points b and c in the result point set, the angle between edges ab and ac is greater than or equal to m degrees, where m is a preset value.
(1-4) Stop when the size of the result point set reaches the predetermined value R or all points in the index point set have been traversed; the result point set then becomes the neighbor set of point a in the satellite system graph. R is adjusted according to data set size and data dimension.
(1-5) Repeat steps (1-2) to (1-4) until all database points have been traversed, yielding the intermediate result graph.
(1-6) Choose any point d from the data set and, starting from d, use depth-first search to find the strongly connected components of the intermediate result graph.
(1-7) In the intermediate result graph, add bidirectional edges between any two connected components that are found consecutively.
(1-8) Repeat steps (1-6) and (1-7) until a certain upper limit on the number of repetitions is reached, yielding the satellite system graph.
In step (2), the greedy approximate nearest neighbor search comprises:
(2-1) Create an empty candidate point set, add several points randomly selected from the database point set to it, and mark them as unvisited.
(2-2) Take the unvisited point in the candidate point set nearest to the query point as the point under examination, and mark it as visited.
(2-3) Look up the neighbors of the point under examination in the satellite system graph, mark all of them as unvisited, add them to the candidate point set, and sort the candidate point set by increasing distance to the query point.
(2-4) If the size of the candidate point set exceeds the predetermined value M, delete the points farthest from the query point so that its size does not exceed M. M is adjusted according to data set size and data dimension.
(2-5) Repeat steps (2-2) to (2-4) until the candidate point set contains no unvisited points, then return the specified number of points in the candidate point set nearest to the query point as the result.
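The greedy search of steps (2-1) to (2-5) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name greedy_search, the parameters num_init and seed, and the toy chain graph are hypothetical, and the sketch assumes that a point already seen is never re-added to the candidate set, which the text does not spell out.

```python
import math
import random

def greedy_search(data, graph, query, k, M, num_init=2, seed=0):
    """Greedy search on a neighbor graph; returns ids of the k nearest candidates."""
    rng = random.Random(seed)
    seen = set()       # ids that have ever entered the candidate set (assumption: no re-adding)
    expanded = set()   # ids already used as the point under examination ("visited")
    pool = []          # candidate set as a list of (distance, id), kept sorted
    # (2-1) start from a few randomly selected database points
    for i in rng.sample(range(len(data)), min(num_init, len(data))):
        seen.add(i)
        pool.append((math.dist(data[i], query), i))
    pool.sort()
    del pool[M:]
    while True:
        # (2-2) nearest candidate that has not been examined yet
        current = next((i for _, i in pool if i not in expanded), None)
        if current is None:
            break  # (2-5) no unvisited candidates remain
        expanded.add(current)
        # (2-3) add the graph neighbors of the examined point
        for n in graph[current]:
            if n not in seen:
                seen.add(n)
                pool.append((math.dist(data[n], query), n))
        pool.sort()
        # (2-4) keep at most M candidates
        del pool[M:]
    return [i for _, i in pool[:k]]

# Toy data: ten points on a line, each linked to its immediate neighbors.
data = [(float(i),) for i in range(10)]
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
result = greedy_search(data, graph, query=(6.2,), k=2, M=4)
```

A larger M widens the candidate set, which trades search time for accuracy.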
In the present invention, the input to the satellite system graph construction method is the high-dimensional database point set, and its output is the satellite system graph. The input to the greedy approximate nearest neighbor method is the query point, the high-dimensional database point set, and the satellite system graph. The input to the satellite-system-graph-based approximate nearest neighbor search method is likewise the query point, the high-dimensional database point set, and the satellite system graph.
The present invention also provides an approximate nearest neighbor search system for high-dimensional data based on a satellite system graph, comprising an offline satellite system graph part and an online search part, wherein the offline satellite system graph part comprises:
a nearest neighbor module, for building an approximate nearest neighbor graph over the high-dimensional database point set;
a graph-construction candidate point set acquisition module, for collecting the neighbors obtained by neighbor expansion of a given point under examination in the database point set, which form the graph-construction index point set;
a result point set screening module, for selecting the result point set from the points in the graph-construction index point set;
a graph-construction iteration judgment module, for judging whether the graph-construction iteration has reached its termination condition; iteration stops when every point in the database point set has obtained its result point set, yielding the intermediate result graph;
a strong connectivity enhancement module, for detecting the strongly connected components present in the intermediate result graph and linking them into a single strongly connected graph;
a satellite system graph output module, for assembling the result point sets of all points into the satellite system graph while ensuring that no point in the graph has more than a given number of neighbors; if a point has more neighbors than the given number, its farther neighbors are deleted;
The online search part comprises:
an initialization module, for providing the input to the greedy approximate nearest neighbor search module, including the query point and the satellite system graph;
a greedy approximate nearest neighbor search module, for obtaining, according to the satellite system graph, the points in the high-dimensional database point set closest to the query point;
a result output module, for returning as the result the k points nearest to the query point in the candidate point set obtained by the greedy approximate nearest neighbor search, k being a predetermined value;
The greedy approximate nearest neighbor search module is the core module and comprises:
a search initialization submodule, for creating an empty candidate point set, adding the given initial points to it, and marking them as unvisited;
a point-under-examination acquisition submodule, for taking the unvisited point in the current candidate point set nearest to the query point as the point under examination and marking it as visited;
a candidate point set update submodule, for looking up the satellite system graph, adding the neighbors of the point under examination to the candidate point set, and sorting the set by increasing distance to the query point;
a candidate point set screening submodule, for pruning the candidate point set: when its size exceeds the given value, the points farthest from the query point are deleted so that its size does not exceed the given value;
a search iteration control submodule, for calling in turn the point-under-examination acquisition submodule, the candidate point set update submodule, and the candidate point set screening submodule until the candidate point set contains no unvisited points, at which point iteration stops;
a search result output submodule, for returning as the result the specified number of points in the candidate point set nearest to the query point.
The approximate nearest neighbor search method provided by the invention adds several randomly selected points to the initial candidate point set and iteratively expands this set through the satellite system graph (that is, neighbors of points in the candidate point set are added to the candidate point set). It computes the actual distance between each candidate point and the query point, derives a better candidate neighbor set from the candidate neighbors according to their distance to the query point, and iterates until the nearest neighbor set of the query point is obtained.
The search algorithm of the present invention can reach logarithmic time complexity; it not only greatly improves the search accuracy achievable within a given time, but also significantly reduces the memory footprint during search and the time required to build the index.
Brief description of the drawings
Fig. 1 is a flow diagram of the approximate nearest neighbor search method for high-dimensional data based on a satellite system graph according to the present invention;
Fig. 2 is a flow diagram of the greedy approximate nearest neighbor search method of the present invention;
Fig. 3 compares recall and search time on the SIFT1M data set between the present invention and other graph-based algorithms, for k = 100;
Fig. 4 compares recall and search time on the GIST1M data set between the present invention and other graph-based algorithms, for k = 100;
Fig. 5 is a schematic diagram of the module structure of the approximate nearest neighbor search system for high-dimensional data based on a satellite system graph according to the present invention.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.
As shown in Fig. 1, approximate nearest neighbor search for high-dimensional data based on a satellite system graph comprises an offline stage and an online stage.
The purpose of the offline stage is to build the satellite system graph; it comprises steps S101 to S107.
S101: Build an approximate nearest neighbor graph of the high-dimensional database point set. The approximate nearest neighbor graph is a directed graph in which every point has a fixed number k of out-edges; the neighbors reached by these k edges are not necessarily all among the point's k true nearest neighbors.
S102: Take a point pi from the database as the point under examination. On the approximate nearest neighbor graph, look up the neighbors of pi and the neighbors of those neighbors, add them to the index point set P, sort P by increasing distance to pi, and keep only the L nearest points, where L is a predetermined value.
Covering both the neighbors and the neighbors of neighbors lets the edges around the point under examination cover a wide neighborhood of pi, so that search on the graph can make long-range jumps.
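The neighbor expansion of S102 can be sketched as follows. This is an illustrative sketch only: the function name collect_candidates and the toy data are hypothetical, and excluding the point under examination from its own index point set is an assumption the text does not state.

```python
import math

def collect_candidates(data, knn_graph, a, L):
    """Collect neighbors and neighbors-of-neighbors of point a; keep the L nearest."""
    pool = set(knn_graph[a])
    for n in knn_graph[a]:
        pool.update(knn_graph[n])
    pool.discard(a)  # assumption: a point is not a candidate neighbor of itself
    # sort by increasing distance to a and truncate to the L nearest
    return sorted(pool, key=lambda i: math.dist(data[a], data[i]))[:L]

# Toy example: a small 2-nearest-neighbor graph over five 2-D points.
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 0.0), (0.0, 4.0)]
knn_graph = {0: [1, 2], 1: [0, 3], 2: [0, 4], 3: [1, 0], 4: [2, 0]}
candidates = collect_candidates(data, knn_graph, 0, L=3)
```

In this toy graph, points 3 and 4 are not direct neighbors of point 0 and enter the pool only through the second hop, which is what gives the resulting edges their longer reach.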
S103: Starting from the point at the smallest distance, remove each point from the index point set and add it to the result point set S, then check whether the current S satisfies full radiation. If it does not, delete the newly added point. Full radiation means that for any two points b and c in the result point set, the angle at pi between the edge from pi to b and the edge from pi to c is greater than or equal to m degrees, where m is a preset value. At the same time, the points in the result point set are kept as close to pi as possible. This property lets the edges at any point of the graph fully cover the neighborhood around that point; experiments show that this helps the greedy search algorithm traverse the graph quickly.
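The full-radiation screening of S103, together with the size limit R, can be sketched as follows. The names angle_deg and select_neighbors are hypothetical; the sketch checks the angle condition before adding a point, which is assumed to be equivalent to adding the point and deleting it again when the check fails.

```python
import math

def angle_deg(a, b, c):
    """Angle in degrees at point a between edge ab and edge ac."""
    ab = [x - y for x, y in zip(b, a)]
    ac = [x - y for x, y in zip(c, a)]
    norm = math.hypot(*ab) * math.hypot(*ac)
    if norm == 0.0:
        return 0.0  # assumption: a duplicate of a is treated as a zero angle
    cos = sum(x * y for x, y in zip(ab, ac)) / norm
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def select_neighbors(data, a, sorted_candidates, m, R):
    """Scan candidates from nearest to farthest, keeping those that stay
    at least m degrees away from every neighbor already kept, up to R."""
    result = []
    for c in sorted_candidates:
        if len(result) >= R:
            break
        if all(angle_deg(data[a], data[b], data[c]) >= m for b in result):
            result.append(c)
    return result

# Toy example around point 0; candidates are listed by increasing distance.
data = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.1), (0.0, 1.5), (-2.0, 0.0), (1.0, 1.0)]
neighbors = select_neighbors(data, 0, [1, 5, 3, 4, 2], m=60, R=3)
```

Point 5 is rejected because it lies only 45 degrees away from the already selected point 1, and point 2 is never examined because the result set has already reached R.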
S104: result point set screening control. Check whether the graph-construction index point set is empty; if so, stop screening. Also check whether the result point set has reached the specified size R; if so, stop iterating.
S105: graph-construction iteration control. Check whether the current point under examination pi is the last point of the database. If so, stop iterating; otherwise increase i by 1 and return to step S102. When iteration stops, the intermediate result graph is obtained. Because the processing of different points under examination is mutually independent, this stage is easy to parallelize.
S106: strong connectivity enhancement. This module converts a graph that is not strongly connected into a strongly connected one, improving the connectivity of the graph. The intermediate result graph is traversed with a depth-first search; during the traversal, the module checks whether back edges exist or cycles are formed, and thereby determines the number of strongly connected components of the intermediate result graph. Bidirectional edges are added between any two strongly connected components that are found consecutively, so that all strongly connected components are joined into one strongly connected graph, which improves connectivity and helps raise search accuracy.
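One way to realize the connectivity enhancement of S106 is sketched below. The text only states that depth-first search is used; this sketch swaps in Kosaraju's two-pass depth-first algorithm to find the strongly connected components, and linking the first point of each component is an arbitrary illustrative choice. The function names are hypothetical, and every point is assumed to appear as a key of the graph dictionary.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm; graph maps each node to its list of out-neighbors."""
    order, seen = [], set()
    for s in graph:                      # first pass: record DFS finishing order
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(graph[s]))]
        while stack:
            node, it = stack[-1]
            for nxt in it:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:                        # all out-neighbors done: node is finished
                order.append(node)
                stack.pop()
    reverse = {v: [] for v in graph}     # reversed graph for the second pass
    for v, outs in graph.items():
        for w in outs:
            reverse[w].append(v)
    comps, assigned = [], set()
    for s in reversed(order):            # second pass: one DFS tree per component
        if s in assigned:
            continue
        assigned.add(s)
        comp, stack = [], [s]
        while stack:
            v = stack.pop()
            comp.append(v)
            for w in reverse[v]:
                if w not in assigned:
                    assigned.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps

def strengthen(graph):
    """Add bidirectional edges between consecutively found components (in place)."""
    comps = strongly_connected_components(graph)
    for prev, cur in zip(comps, comps[1:]):
        u, v = prev[0], cur[0]
        if v not in graph[u]:
            graph[u].append(v)
        if u not in graph[v]:
            graph[v].append(u)
    return graph

g = {0: [1], 1: [0], 2: [3], 3: [2], 4: [0]}
before = len(strongly_connected_components(g))
after = len(strongly_connected_components(strengthen(g)))
```

Because each component is internally strongly connected, one bidirectional edge between consecutive components is enough to merge them into a single strongly connected graph.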
S107: output. The result point sets of all database points are assembled into the satellite system graph and output.
The online stage obtains the nearest neighbor set by greedy approximate nearest neighbor search on the satellite system graph; it comprises steps S111 to S113.
S111: search initialization. The query point q and the satellite system graph G are provided as input to the greedy approximate nearest neighbor search module.
S112: Perform the greedy approximate nearest neighbor search on the satellite system graph according to the given output parameters, obtaining the nearest neighbor candidate point set of q.
S113: output. Return the k points in the candidate point set nearest to q as the result, where k is a predetermined value.
As shown in Fig. 2, the core of the approximate nearest neighbor search method for high-dimensional data is a greedy approximate nearest neighbor search module, comprising steps S201 to S207.
S201: Create an empty candidate point set whose maximum capacity is the predetermined value p. Add several points randomly selected from the database point set to the candidate point set and mark them as unvisited.
Each point object in the candidate point set has three attributes: the index (or subscript) of the point, its distance to the query point, and a visited flag. This makes it convenient to sort the candidate points and saves computation.
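A candidate point object with these three attributes might look like the following sketch. The class name Candidate and its field names are hypothetical; ordering is defined on the distance alone, so that a sorted insertion keeps the candidate set ordered without recomputing distances.

```python
import bisect
from dataclasses import dataclass, field

@dataclass(order=True)
class Candidate:
    distance: float                                       # distance to the query point (sort key)
    index: int = field(compare=False)                     # index of the point in the database
    visited: bool = field(default=False, compare=False)   # visited flag

pool = []
for idx, d in [(7, 2.5), (3, 0.4), (9, 1.1)]:
    bisect.insort(pool, Candidate(d, idx))   # insertion keeps pool sorted by distance

order = [c.index for c in pool]
pool[0].visited = True                       # mark the nearest candidate as visited
```

Storing the distance in the object means each database point's distance to the query point is computed only once.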
S202: Take the unvisited point in the candidate point set nearest to the query point as the point under examination and mark it as visited. The neighbors of the point under examination are likely to be closer to the query point. Marking the point as visited ensures that later iterations do not examine it again, which would cause redundant computation.
S203: Look up the neighbors of the point under examination in the satellite system graph. Mark all neighbors as unvisited, compute their distances to the query point, and add them to the candidate point set, keeping the set sorted by increasing distance on insertion. The rationale is that this greedy search resembles a depth-first search: when moving in the direction nearest to the query point, it may reach a local optimum and become stuck. Examining the neighbors of the next-nearest unvisited point at that moment may allow the search to escape the local optimum and so increases search accuracy.
S204 to S205: If the size of the candidate point set exceeds the predetermined value M, delete the points farthest from the query point so that the size does not exceed M. This allows the algorithm to stop after a bounded number of iterations instead of continuing until all points have been traversed.
S206 to S207: iteration termination check. Check whether all points in the candidate point set have been visited. If not, new nearer neighbors may still be found; otherwise no new nearest neighbors can be found under the current parameter settings, and the search stops.
The first k points of the candidate point set, i.e. the k points nearest to the query point, are returned as the result, where k is a predetermined value.
To describe the accuracy of the nearest neighbors of the query point more intuitively, the accuracy is quantified. Specifically, it is described by the average recall, computed as follows:

$$\mathrm{recall} = \frac{1}{q\,k} \sum_{i=1}^{q} \sum_{j=1}^{k} p_{i,j}$$

where q is the number of query points, k is the number of nearest neighbors returned for each query point, recall is the average accuracy of the nearest neighbors, and p_{i,j} indicates whether the j-th nearest neighbor of the i-th query point is an exact nearest neighbor: p_{i,j} is 1 if so and 0 otherwise.
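The average recall described above, i.e. the fraction of returned neighbors that are exact nearest neighbors averaged over all query points, can be computed as in the following sketch. The function name average_recall and the toy id lists are hypothetical; ground_truth is assumed to hold the exact nearest neighbor ids of each query point, for example from a brute-force search.

```python
def average_recall(retrieved, ground_truth, k):
    """Fraction of the k returned neighbors that are among the k exact
    nearest neighbors, averaged over all query points."""
    hits = 0
    for found, truth in zip(retrieved, ground_truth):
        truth_set = set(truth[:k])
        hits += sum(1 for j in found[:k] if j in truth_set)   # sum of p_{i,j} over j
    return hits / (len(retrieved) * k)

retrieved = [[1, 2, 3], [4, 5, 6]]      # ids returned for two query points
ground_truth = [[1, 2, 9], [4, 5, 6]]   # exact nearest neighbor ids
recall = average_recall(retrieved, ground_truth, k=3)
```

Here five of the six returned ids are exact nearest neighbors, so the recall is 5/6.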
As the approximate nearest neighbor search method above shows, an initial candidate point set is obtained from points in the database and is expanded through the satellite system graph (that is, neighbors of the data points in the initial candidate point set are added to the candidate point set). The actual distance between each candidate point and the query point is computed, a better candidate neighbor set is derived from the candidate neighbors according to their distance to the query point, and iteration yields the nearest neighbor set of the query point.
The present invention makes full use of the fast convergence of graph-based methods in approximate nearest neighbor search and of the advantages of the satellite system graph, greatly improving the efficiency of nearest neighbor search for high-dimensional data while greatly shortening index construction time and reducing memory usage.
A preferred implementation of the approximate nearest neighbor search method of the present invention is given below; the detailed procedure is as follows:
This implementation is described in further detail using GIST image feature data as an example; Table 1 gives the information on the GIST data set.
Table 1
Data set    Base set points    Test set points    Dimension
GIST        1000000            10000              960
The 10000 data points of the test set in the GIST data set (distinct from the base set points) are used as query points, and the 1000000 data points of the base set form the database point set, on which the satellite system graph is built in the offline stage.
Step a: Build the approximate nearest neighbor graph N on the GIST data set with k = 300, i.e. every point in the graph has 300 neighbors (out-edges).
Step b: Take a point p_i from the database as the point under examination and perform neighbor expansion for p_i on graph N. Create an empty graph-construction index point set P; P is an ordered array sorted by increasing distance. Add the neighbors of p_i and the neighbors of those neighbors to the index point set, whose size is set to 500. If the index point set holds more than 500 points, delete the farther points and keep only the 500 nearest.
Step c: P is an ordered array whose points are sorted by increasing distance to p_i. Create the result point set Fi. Starting from the first point of P, remove the point from P, add it to Fi, and check whether Fi satisfies full radiation; if it does not, delete the newly added point. Then continue with the new first point of P, and iterate. Here, full radiation means that for any two points b and c in the result point set, the angle at p_i between the edges to b and to c is greater than or equal to 60 degrees.
Iteration stops when P becomes empty or the number of points in Fi reaches the predetermined value 70.
Step d: Check whether i is greater than or equal to 1000000; if so, stop iterating; otherwise set i = i + 1 and return to step b.
Step e: Assemble the result point sets Fi of all points into the satellite system graph G and output it as the result.
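Step a above starts from a nearest neighbor graph with a fixed out-degree k. The text does not specify how the approximate graph is built; as a stand-in for small data sets, an exact brute-force graph can be computed as in the sketch below. The function name brute_force_knn_graph and the toy points are hypothetical, and the quadratic cost makes this unsuitable for a base set of the size used in this embodiment.

```python
import math

def brute_force_knn_graph(data, k):
    """Exact k-nearest-neighbor graph: every point gets k out-edges
    to its nearest other points (O(n^2) distance computations)."""
    graph = {}
    for i, p in enumerate(data):
        others = sorted((j for j in range(len(data)) if j != i),
                        key=lambda j: math.dist(p, data[j]))
        graph[i] = others[:k]
    return graph

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0), (5.0, 5.0)]
knn = brute_force_knn_graph(points, k=2)
```

The resulting dictionary has the same shape as the graph N of step a and can be fed directly into the neighbor expansion of step b.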
In the search stage, the candidate neighbors nearest to the query point are obtained by the following steps:
Step 1: Take as input the query point q and the satellite system graph G above, and set p, the maximum capacity of the candidate point set T of the greedy nearest neighbor search. p is a tunable parameter: the larger p is, the higher the accuracy and the longer the search time.
Step 2: Perform the greedy nearest neighbor search with the above parameters, obtaining the set T of p candidate points.
Step 3: Return the k points in T nearest to q as the result.
The accuracy of the k nearest neighbors is computed with the average recall:

$$\mathrm{recall} = \frac{1}{q\,k} \sum_{i=1}^{q} \sum_{j=1}^{k} p_{i,j}$$

where q is the number of query points, here 10000, k is the number of nearest neighbors returned for each query point, recall is the average accuracy of the nearest neighbors, and p_{i,j} indicates whether the j-th nearest neighbor of the i-th query point is an exact nearest neighbor: p_{i,j} is 1 if so and 0 otherwise.
Using this formula, the recall and elapsed time of the nearest-neighbor retrieval results are computed. On the same datasets, the method of the present invention (the SSG algorithm) is tested alongside the NSG algorithm, the SSG-Naive algorithm (a simplified version of SSG that builds the graph without neighbor expansion), the DPG algorithm, the KGraph algorithm, the HNSW algorithm, the FANNG algorithm and the Efanna algorithm. For each, the recall and retrieval time are recorded and the number of queries processed per unit time is derived.
With k = 100 nearest neighbors, the recall and elapsed time of the retrieval results obtained in this embodiment are recorded, along with the recall and the number of queries processed per unit time for the method of the present invention (SSG) and for the NSG, SSG-Naive, DPG, KGraph, HNSW, FANNG and Efanna algorithms.
All of the compared methods are among the best known approximate nearest neighbor search methods based on various graph structures.
Fig. 3 compares, for k = 100 on the SIFT1M dataset, the recall and the number of queries processed per unit time of this embodiment with those of the other graph-based algorithms. Fig. 4 shows the same comparison for k = 100 on the GIST1M dataset. As Figs. 3 and 4 show, at the same number of queries processed per unit time, the recall of the results obtained by this embodiment is clearly higher than that of the NSG, SSG-Naive (the simplified SSG that builds the graph without neighbor expansion), DPG, KGraph, HNSW, FANNG and Efanna algorithms. The approximate nearest neighbor search method for high-dimensional data provided by the present invention therefore has the highest retrieval efficiency.
Table 2 records the index size (memory use) and index building time when building indexes on the SIFT1M and GIST1M datasets. The index building time of the method of the present invention is relatively short and its index is the smallest, while it still achieves the best retrieval performance described above.
Table 2
As shown in Fig. 5, an approximate nearest neighbor search system for high-dimensional data based on a satellite system graph comprises an offline satellite system graph part and an online search part. The offline satellite system graph part includes:
A nearest neighbor module, for building an approximate nearest neighbor graph over the high-dimensional database point set;
A graph-construction candidate point set acquisition module, for collecting the neighbors obtained by neighbor expansion of a given point under investigation in the database point set, which form the graph-construction index point set;
A result point set screening module, for selecting the result point set from the points in the graph-construction index point set;
A graph-construction iteration judgment module, for judging whether the construction iteration has reached its termination condition; iteration stops once every point in the database point set has its corresponding result point set, yielding the intermediate result graph;
A strong connectivity enhancement module, for detecting the strongly connected components present in the intermediate result graph and connecting them into one complete strongly connected graph;
A satellite system graph output module, for assembling the result point sets of all points into the satellite system graph and ensuring that no point has more neighbors than a given value; if a point has more neighbors than the given value, its farther neighbors are deleted.
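The strong connectivity enhancement performed by the offline part can be illustrated with the following sketch. It is only one possible reading: it finds the strongly connected components with Kosaraju's two-pass depth-first search (a standard algorithm swapped in here for illustration), and it links consecutive components through their first member, which is an arbitrary choice made for this example. The function names and the adjacency-dictionary layout are hypothetical, and every node is assumed to appear as a key of `graph`.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm; graph maps each node to a list of out-neighbors."""
    # Pass 1: iterative DFS recording nodes in order of completion.
    order, seen = [], set()
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, neighbors = stack[-1]
            advanced = False
            for nxt in neighbors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    advanced = True
                    break
            if not advanced:
                order.append(node)
                stack.pop()
    # Pass 2: DFS on the reversed graph in decreasing completion order.
    reverse = {node: [] for node in graph}
    for node, outs in graph.items():
        for nxt in outs:
            reverse[nxt].append(node)
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = [], [start]
        while stack:
            node = stack.pop()
            component.append(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


def link_components(graph):
    """Add bidirectional edges between consecutive components (in place)."""
    components = strongly_connected_components(graph)
    for left, right in zip(components, components[1:]):
        a, b = left[0], right[0]  # representatives: an assumption of this sketch
        if b not in graph[a]:
            graph[a].append(b)
        if a not in graph[b]:
            graph[b].append(a)
    return graph
```

Because each component is strongly connected internally and consecutive components are joined in both directions, the graph returned by `link_components` forms a single strongly connected component. A practical implementation would probably pick the linking endpoints by distance rather than taking the first member.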
The online search part includes:
An initialization module, for providing input to the greedy approximate nearest neighbor search module, including the query point and the satellite system graph;
A greedy approximate nearest neighbor search module, for obtaining, according to the satellite system graph, the several points in the high-dimensional database point set that are nearest to the query point;
A result output module, for returning as the result the k points nearest to the query point in the candidate point set produced by the greedy approximate nearest neighbor search, where k is a predetermined value;
The greedy approximate nearest neighbor search module is the core module and includes:
An initialization search submodule, for constructing an empty candidate point set, adding the given initial points to it, and marking them as unvisited;
An investigation point acquisition submodule, for taking the unvisited point in the current candidate point set that is nearest to the query point as the investigation point and marking it as visited;
A candidate point set update submodule, for querying the satellite system graph to obtain the neighbors of the investigation point, adding them to the candidate point set, and sorting the set in ascending order of distance to the query point;
A candidate point set screening submodule, for pruning the candidate point set: when its size exceeds a given value, the points farthest from the query point are deleted so that the size does not exceed the given value;
A search iteration control submodule, for calling the investigation point acquisition submodule, the candidate point set update submodule and the candidate point set screening submodule in turn until the candidate point set contains no unvisited points, at which point iteration stops;
A search result output submodule, for returning as the result the specified number of points in the candidate point set nearest to the query point.
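The submodules of the greedy approximate nearest neighbor search can be sketched together as one loop. The following is a minimal illustration under stated assumptions rather than the patented implementation: the name `greedy_search`, the adjacency-dictionary layout of `graph`, and the use of Euclidean distance via `math.dist` are choices made for this example. The sketch also keeps a `seen` set so that a point is never inserted into the candidate set twice, a safeguard added here to guarantee termination.

```python
import math


def greedy_search(graph, points, query, init_ids, capacity, k):
    """Greedy search over a neighbor graph.

    graph    : dict mapping point id -> list of neighbor ids (assumed layout)
    points   : sequence of coordinate tuples indexed by point id
    init_ids : initial candidate ids (for example, randomly selected)
    capacity : maximum size of the candidate point set
    k        : number of nearest points to return
    """
    def dist(i):
        return math.dist(points[i], query)

    # Candidate set as a list of (distance, id) pairs, kept in ascending order.
    candidates = sorted((dist(i), i) for i in set(init_ids))
    seen = set(init_ids)   # every id ever inserted (assumption of this sketch)
    visited = set()        # ids already used as the investigation point
    while True:
        # Nearest unvisited candidate becomes the investigation point.
        current = next((i for _, i in candidates if i not in visited), None)
        if current is None:
            break          # no unvisited candidates remain
        visited.add(current)
        for neighbor in graph[current]:
            if neighbor not in seen:
                seen.add(neighbor)
                candidates.append((dist(neighbor), neighbor))
        candidates.sort()
        del candidates[capacity:]   # drop the farthest points beyond capacity
    return [i for _, i in candidates[:k]]
```

Each iteration marks one new point as visited, so the loop ends after at most as many iterations as there are points. A larger `capacity` explores more of the graph before stopping, which is the precision and time trade-off attributed to the candidate set size in the text.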

Claims (6)

1. An approximate nearest neighbor search method for high-dimensional data based on a satellite system graph, characterized by comprising the following steps:
(1) building a satellite system graph over the high-dimensional database point set;
(2) for the query point, randomly selecting several data points as the candidate point set and performing a greedy approximate nearest neighbor search on the satellite system graph;
(3) taking a given number of points from the resulting candidate point set as the result, namely the nearest neighbor set of the query point.
2. The approximate nearest neighbor search method for high-dimensional data based on a satellite system graph according to claim 1, characterized in that step (1) specifically comprises:
(1-1) building an approximate nearest neighbor graph of the high-dimensional database point set;
(1-2) taking any point a in the database as the point under investigation, collecting its neighbors in the approximate nearest neighbor graph and the neighbors of those neighbors to form an index point set; computing the distance from every point in the index point set to the point a, sorting the points in ascending order of distance, retaining the L nearest points, where L is a predetermined value, and deleting the remaining points from the index point set;
(1-3) starting from the point in the index point set nearest to a, removing each point from the index point set and adding it to the result point set, then verifying whether the current result point set satisfies abundant radiativity; if not, deleting the newly added point;
(1-4) when the size of the result point set reaches a predetermined value R or all points in the index point set have been traversed, taking the result point set as the neighbor set of point a in the satellite system graph;
(1-5) repeating steps (1-2) to (1-4) until all points in the database have been traversed, obtaining the intermediate result graph;
(1-6) choosing any point d from the dataset and, starting from d, finding the strongly connected components of the intermediate result graph by depth-first search;
(1-7) adding bidirectional edges to the intermediate result graph between any two connected components found consecutively;
(1-8) repeating steps (1-6) and (1-7) until the maximum number of iterations T is reached, where T is a predetermined value, obtaining the satellite system graph.
3. The approximate nearest neighbor search method for high-dimensional data based on a satellite system graph according to claim 2, characterized in that in step (1-1) the approximate nearest neighbor graph is a directed graph in which every point has a fixed number k of outgoing edges, and the neighbors connected by these k edges are not necessarily all of its k true nearest neighbors.
4. The approximate nearest neighbor search method for high-dimensional data based on a satellite system graph according to claim 2, characterized in that in step (1-3) abundant radiativity means: for any two points b and c in the result point set, the angle between edges ab and ac is greater than or equal to m degrees, where m is a preset value.
5. The approximate nearest neighbor search method for high-dimensional data based on a satellite system graph according to claim 1, characterized in that in step (2) the greedy approximate nearest neighbor search comprises:
(2-1) creating an empty candidate point set, adding several points randomly selected from the database point set to it, and marking them as unvisited;
(2-2) taking the unvisited point in the candidate point set nearest to the query point as the investigation point and marking it as visited;
(2-3) querying the satellite system graph to obtain the neighbors of the investigation point, marking all these neighbors as unvisited, adding them to the candidate point set, and sorting the candidate point set in ascending order of distance to the query point;
(2-4) if the size of the candidate point set exceeds a predetermined value M, deleting the points in the candidate point set farthest from the query point so that its size does not exceed M;
(2-5) repeating steps (2-2) to (2-4) until the candidate point set contains no unvisited points, then returning as the result the specified number of points in the candidate point set nearest to the query point.
6. An approximate nearest neighbor search system for high-dimensional data based on a satellite system graph, characterized by comprising an offline satellite system graph part and an online search part, wherein the offline satellite system graph part includes:
A nearest neighbor module, for building an approximate nearest neighbor graph over the high-dimensional database point set;
A graph-construction candidate point set acquisition module, for collecting the neighbors obtained by neighbor expansion of a given point under investigation in the database point set, which form the graph-construction index point set;
A result point set screening module, for selecting the result point set from the points in the graph-construction index point set;
A graph-construction iteration judgment module, for judging whether the construction iteration has reached its termination condition; iteration stops once every point in the database point set has its corresponding result point set, yielding the intermediate result graph;
A strong connectivity enhancement module, for detecting the strongly connected components present in the intermediate result graph and connecting them into one complete strongly connected graph;
A satellite system graph output module, for assembling the result point sets of all points into the satellite system graph and ensuring that no point has more neighbors than a given value; if a point has more neighbors than the given value, its farther neighbors are deleted;
The online search part includes:
An initialization module, for providing input to the greedy approximate nearest neighbor search module, including the query point and the satellite system graph;
A greedy approximate nearest neighbor search module, for obtaining, according to the satellite system graph, the several points in the high-dimensional database point set that are nearest to the query point;
A result output module, for returning as the result the k points nearest to the query point in the candidate point set produced by the greedy approximate nearest neighbor search, where k is a predetermined value;
Wherein the greedy approximate nearest neighbor search module is the core module and includes:
An initialization search submodule, for constructing an empty candidate point set, adding the given initial points to it, and marking them as unvisited;
An investigation point acquisition submodule, for taking the unvisited point in the current candidate point set that is nearest to the query point as the investigation point and marking it as visited;
A candidate point set update submodule, for querying the satellite system graph to obtain the neighbors of the investigation point, adding them to the candidate point set, and sorting the set in ascending order of distance to the query point;
A candidate point set screening submodule, for pruning the candidate point set: when its size exceeds a given value, the points farthest from the query point are deleted so that the size does not exceed the given value;
A search iteration control submodule, for calling the investigation point acquisition submodule, the candidate point set update submodule and the candidate point set screening submodule in turn until the candidate point set contains no unvisited points, at which point iteration stops;
A search result output submodule, for returning as the result the specified number of points in the candidate point set nearest to the query point.
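Steps (1-2) to (1-4) of claim 2, together with the angle criterion of claim 4, can be illustrated with the sketch below. It is a hedged example, not the claimed implementation: the names `angle_deg` and `select_neighbors`, the parameter names `pool_size` (L), `max_degree` (R) and `min_angle` (m), the Euclidean distance, and the treatment of a zero-length edge as a 0 degree angle are all assumptions of this example.

```python
import math


def angle_deg(a, b, c):
    """Angle at a between edges a->b and a->c, in degrees."""
    ab = [y - x for x, y in zip(a, b)]
    ac = [y - x for x, y in zip(a, c)]
    norm = math.hypot(*ab) * math.hypot(*ac)
    if norm == 0.0:
        return 0.0  # degenerate edge: treated as failing the criterion
    cosine = sum(u * v for u, v in zip(ab, ac)) / norm
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


def select_neighbors(a_id, knn_graph, points, pool_size, max_degree, min_angle):
    """Choose the neighbors of point a_id from its expanded candidate pool."""
    # (1-2) pool = neighbors of a plus neighbors of those neighbors
    pool = set(knn_graph[a_id])
    for neighbor in knn_graph[a_id]:
        pool.update(knn_graph[neighbor])
    pool.discard(a_id)
    a = points[a_id]
    # Keep only the pool_size points nearest to a, in ascending distance order.
    ranked = sorted(pool, key=lambda i: math.dist(a, points[i]))[:pool_size]
    # (1-3), (1-4) add points nearest first, keeping a point only if it forms
    # an angle of at least min_angle with every point already kept.
    result = []
    for candidate in ranked:
        if len(result) >= max_degree:
            break
        if all(angle_deg(a, points[candidate], points[kept]) >= min_angle
               for kept in result):
            result.append(candidate)
    return result
```

Checking each new point only against the points already kept is equivalent to checking every pair, because all earlier pairs passed the same test when they were added.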
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810229529.9A CN108710626A (en) 2018-03-20 2018-03-20 A kind of the approximate KNN search method and searching system of the high dimensional data based on satellite system figure

Publications (1)

Publication Number Publication Date
CN108710626A true CN108710626A (en) 2018-10-26

Family

ID=63866172

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810229529.9A Pending CN108710626A (en) 2018-03-20 2018-03-20 A kind of the approximate KNN search method and searching system of the high dimensional data based on satellite system figure

Country Status (1)

Country Link
CN (1) CN108710626A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181026