CN108710626A - Approximate nearest neighbor search method and search system for high-dimensional data based on a satellite system graph - Google Patents
- Publication number: CN108710626A
- Application number: CN201810229529.9
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an approximate nearest neighbor search method and search system for high-dimensional data based on a satellite system graph. The method comprises: (1) building a satellite system graph over the high-dimensional database point set; (2) for a query point, randomly selecting several data points as the candidate point set and performing greedy approximate nearest neighbor search on the satellite system graph; (3) returning the given number of points from the resulting candidate point set as the result, i.e. the nearest neighbor set of the query point. The invention can achieve search complexity of logarithmic order; it not only greatly improves the search accuracy attainable within a given time, but also significantly reduces the memory footprint during search and the time required to build the index.
Description
Technical field
The present invention relates to the technical field of data retrieval, and in particular to an approximate nearest neighbor search method and search system for high-dimensional data based on a satellite system graph.
Background technology
In recent years, approximate nearest neighbor search has been applied more and more widely. It plays a very important role in fields that need to process large-scale high-dimensional data, such as computer vision, machine learning, data mining, natural language processing, and text and image retrieval. Approximate nearest neighbor search operates on a large-scale high-dimensional data point set and must quickly find, among these data points, the several points closest to a given query point.
For large-scale high-dimensional data, exhaustive brute-force search in the original space is computationally expensive. To improve the efficiency of nearest neighbor search, researchers have proposed a series of approximate nearest neighbor search algorithms. Common approaches mainly include tree-based methods, hashing-based methods, quantization-based methods, and graph-based methods.
Because data in real industrial production scenarios has a complicated structure, existing methods struggle to perform well on ultra-large-scale, high-dimensional real-valued vector data. With tree-based methods, search accuracy drops sharply as the data dimension increases. With hashing-based methods, search efficiency is severely limited by the expressive power of the hash functions and by the lookup efficiency of the hash table itself. Quantization-based methods work well in the low-accuracy range, but in high-accuracy search their accuracy ceiling is severely limited by quantization error. Recently, graph-based methods have shown great potential, and some of them have been shown experimentally to outperform traditional hashing-based, quantization-based, and tree-based methods. However, the efficiency of a graph-based method depends on the intrinsic structure of the graph, and existing graph-based methods variously suffer from time-consuming index construction, low search efficiency, and large memory usage.
To fully demonstrate the efficiency of the method of the present invention, it is compared with a series of graph-based algorithms. These include: KGraph, a method based on an approximate k-nearest-neighbor graph, described in "Efficient k-nearest neighbor graph construction for generic similarity measures", presented in 2011 at the 20th International Conference on World Wide Web, a top international conference on web data mining; Efanna, a composite index method based on tree structures and an approximate k-nearest-neighbor graph, proposed in the Chinese patent document with publication number CN105550358A, which discloses an approximate nearest neighbor search method and search system for high-dimensional data; the method based on the FANNG graph structure, disclosed in the article "FANNG: Fast Approximate Nearest Neighbour Graphs" at the 2016 IEEE Conference on Computer Vision and Pattern Recognition; HNSW, a method based on a multi-layer navigable pruned graph structure with navigation points, disclosed in the technical article "Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs" on Cornell University's arXiv website; DPG, a method based on an angle-diversified undirected graph structure, disclosed in the technical article "Approximate Nearest Neighbor Search on High Dimensional Data - Experiments, Analyses, and Improvement (v1.0)" on Cornell University's arXiv website; and NSG, a method based on a radiating spreading graph with a single navigation point, proposed in the Chinese patent document with publication number CN107729348A, which discloses an approximate nearest neighbor search method and search system for high-dimensional data based on a radiating spreading graph.
Of these, NSG was the most efficient search method prior to the present invention. Its search performance is better than that of the other graph-based methods and significantly better than that of tree-based, hashing-based, and quantization-based methods. NSG first builds a radiating spreading graph with a navigation point, and then performs greedy search for the query point on that graph, starting from the navigation point.
However, the NSG method must select a navigation point at search time and cannot choose its starting points at random. In addition, its edge-pruning strategy selects edges by an elimination rule that mainly considers edge length. It cannot adjust the angle between edges according to the distribution of the data set in order to adapt to the characteristics of the data, and therefore its edges cannot cover the neighboring region around every point of the data set.
Summary of the invention
The present invention provides an approximate nearest neighbor search method for high-dimensional data based on a satellite system graph, which markedly improves search efficiency and greatly reduces memory usage.
An approximate nearest neighbor search method for high-dimensional data based on a satellite system graph, characterized by comprising the following steps:
(1) build a satellite system graph over the high-dimensional database point set;
(2) for a query point, randomly select several data points as the candidate point set, and perform greedy approximate nearest neighbor search on the satellite system graph;
(3) return the given number of points from the resulting candidate point set as the result, i.e. the nearest neighbor set of the query point.
Step (1) specifically comprises:
(1-1) Build an approximate nearest neighbor graph over the high-dimensional database point set. The approximate nearest neighbor graph is a directed graph in which every point has a fixed number k of outgoing edges; the neighbors reached by these k edges are not necessarily all among the point's k true nearest neighbors.
(1-2) Take any point a in the database as the point under examination. Collect its neighbors in the approximate nearest neighbor graph, together with the neighbors of those neighbors, to form the index point set. Compute the distance from every point in the index point set to a, sort the points in ascending order of distance, keep the L nearest points, and delete the remaining points from the index point set. L is a predetermined value that is adjusted according to the size and dimensionality of the data set.
(1-3) Starting from the point in the index point set that is closest to a, delete the point from the index point set and add it to the result point set, then check whether the current result point set satisfies the sufficient radiation property; if it does not, delete the newly added point. The sufficient radiation property is: for any two points b and c in the result point set, the angle between edges ab and ac is greater than or equal to m degrees, where m is a preset value.
(1-4) When the size of the result point set reaches the predetermined value R, or all points in the index point set have been traversed, take the result point set as the neighbor set of point a in the satellite system graph. R is a predetermined value that is adjusted according to the size and dimensionality of the data set.
(1-5) Repeat steps (1-2) to (1-4) until all points in the database have been traversed, obtaining an intermediate result graph.
(1-6) Choose any point d from the data set and, starting from d, use depth-first search to find the strongly connected components of the intermediate result graph.
(1-7) Add a bidirectional edge between any two strongly connected components of the intermediate result graph that are found consecutively.
(1-8) Repeat steps (1-6) and (1-7) until a given upper limit on the number of repetitions is reached, obtaining the satellite system graph.
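The angle-based screening of steps (1-3) and (1-4) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `angle_deg` and `select_neighbors`, the use of Euclidean distance, and the handling of coincident points are all hypothetical choices. The sketch tests the angle before adding a point, which gives the same result as adding the point and then deleting it when the check fails.

```python
import math

def angle_deg(a, b, c):
    """Angle in degrees at vertex a between the edges a->b and a->c."""
    u = [x - y for x, y in zip(b, a)]
    v = [x - y for x, y in zip(c, a)]
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    if nu == 0.0 or nv == 0.0:
        # Assumption: a point coinciding with a is treated as angle 0 (rejected).
        return 0.0
    cos = max(-1.0, min(1.0, dot / (nu * nv)))  # clamp against rounding error
    return math.degrees(math.acos(cos))

def select_neighbors(a, index_set, points, m_deg=60.0, R=70):
    """Scan the index point set in ascending distance from point a and keep a
    point only if its angle to every already kept point is at least m_deg."""
    ordered = sorted(index_set, key=lambda i: math.dist(points[a], points[i]))
    result = []
    for i in ordered:
        if i == a:
            continue
        if all(angle_deg(points[a], points[i], points[j]) >= m_deg for j in result):
            result.append(i)
        if len(result) >= R:  # stop once the result set reaches size R
            break
    return result

# Usage: point 2 lies almost in the same direction as point 1, so it is pruned.
points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.1), (0.0, 1.0), (-1.0, 0.0)]
print(select_neighbors(0, [1, 2, 3, 4], points))
```

The defaults `m_deg=60.0` and `R=70` mirror the values used later in the embodiment; both are tunable.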
In step (2), the greedy approximate nearest neighbor search comprises:
(2-1) Create an empty candidate point set, add several randomly selected points from the database point set to it, and mark them as unvisited.
(2-2) Take the unvisited point in the candidate point set that is closest to the query point as the point under examination, and mark it as visited.
(2-3) Obtain the neighbors of the point under examination by querying the satellite system graph, mark all of these neighbors as unvisited, add them to the candidate point set, and sort the candidate point set in ascending order of distance to the query point.
(2-4) If the size of the candidate point set exceeds the predetermined value M, delete the points farthest from the query point so that the size of the candidate point set does not exceed M. M is a predetermined value that is adjusted according to the size and dimensionality of the data set.
(2-5) Repeat steps (2-2) to (2-4) until the candidate point set contains no unvisited points, then return the specified number of points in the candidate point set that are closest to the query point as the result.
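Steps (2-1) to (2-5) can be sketched as below. This is a hedged illustration rather than the patent's code: the function name `greedy_search`, the parameters `n_init` and `seed`, and the use of Euclidean distance are assumptions. The sketch also adds a `seen` set, which the text does not mention, so that a point is inserted into the candidate set at most once and the loop is guaranteed to terminate.

```python
import math
import random

def greedy_search(graph, points, query, k, M, n_init=3, seed=0):
    """Greedy search on a neighbor graph (dict: index -> list of neighbor indices)."""
    rng = random.Random(seed)
    pool = {}     # candidate set: index -> [distance to query, visited flag]
    seen = set()  # assumption: every point enters the pool at most once
    # (2-1) start from several randomly selected, unvisited points
    for i in rng.sample(range(len(points)), min(n_init, len(points))):
        pool[i] = [math.dist(points[i], query), False]
        seen.add(i)
    while True:
        unvisited = [(d, i) for i, (d, visited) in pool.items() if not visited]
        if not unvisited:  # (2-5) stop when no unvisited candidate is left
            break
        # (2-2) examine the closest unvisited candidate and mark it visited
        _, cur = min(unvisited)
        pool[cur][1] = True
        # (2-3) add its graph neighbors as unvisited candidates
        for nb in graph[cur]:
            if nb not in seen:
                seen.add(nb)
                pool[nb] = [math.dist(points[nb], query), False]
        # (2-4) keep only the M candidates closest to the query
        if len(pool) > M:
            keep = sorted(pool, key=lambda i: (pool[i][0], i))[:M]
            pool = {i: pool[i] for i in keep}
    return sorted(pool, key=lambda i: (pool[i][0], i))[:k]

# Usage on a toy one-dimensional chain graph 0 - 1 - ... - 9.
points = [(float(i),) for i in range(10)]
graph = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}
print(greedy_search(graph, points, (7.2,), k=2, M=4))
```

A larger `M` lets the search keep more fallback candidates, which trades search time for accuracy.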
In the present invention, the input of the satellite system graph construction method is the high-dimensional database point set, and its output is the satellite system graph. The input of the greedy approximate nearest neighbor method is the query point, the high-dimensional database point set, and the satellite system graph. The input of the approximate nearest neighbor search method based on the satellite system graph is likewise the query point, the high-dimensional database point set, and the satellite system graph.
The present invention also provides an approximate nearest neighbor search system for high-dimensional data based on a satellite system graph, comprising an offline satellite system graph part and an online search part, wherein the offline satellite system graph part includes:
a nearest neighbor module, for building an approximate nearest neighbor graph over the high-dimensional database point set;
a graph-construction candidate point set acquisition module, for performing neighbor expansion on a given point under examination in the database point set and collecting the resulting neighbor points to form the graph-construction index point set;
a result point set screening module, for screening the result point set out of the points in the graph-construction index point set;
a graph-construction iteration judgment module, for judging whether the graph-construction iteration has reached its termination condition; iteration stops when every point in the database point set has obtained its corresponding result point set, yielding the intermediate result graph;
a strong connectivity enhancement module, for detecting the strongly connected components present in the intermediate result graph and connecting them into one complete strongly connected graph;
a satellite system graph result output module, for assembling the result point sets of all points into the satellite system graph and ensuring that no point in the satellite system graph has more neighbors than a given value; if a point has more neighbors than the given value, its farther neighbors are deleted.
The online search part includes:
an initialization module, for providing input to the greedy approximate nearest neighbor search module, including the query point and the satellite system graph;
a greedy approximate nearest neighbor search module, for obtaining, according to the satellite system graph, the several points in the high-dimensional database point set that are closest to the query point;
a result output module, for returning as the result the k points closest to the query point in the candidate point set produced by the greedy approximate nearest neighbor search, where k is a predetermined value.
The greedy approximate nearest neighbor search module is the core module and includes:
an initialization search submodule, for creating an empty candidate point set, adding the given initial points to it, and marking them as unvisited;
a point-under-examination acquisition submodule, for obtaining the unvisited point in the current candidate point set that is closest to the query point, taking it as the point under examination, and marking it as visited;
a search candidate set update submodule, for querying the satellite system graph, adding the neighbors of the point under examination to the candidate point set, and sorting the set in ascending order of distance to the query point;
a search candidate set screening submodule, for screening the candidate points in the candidate point set: when the size of the candidate point set exceeds a given value, the points farthest from the query point are deleted so that the size of the candidate point set does not exceed the given value;
a search iteration control submodule, for calling in turn the point-under-examination acquisition submodule, the search candidate set update submodule, and the search candidate set screening submodule until the candidate point set contains no unvisited points, at which point iteration stops;
a search result output submodule, for returning as the result the specified number of points in the candidate point set that are closest to the query point.
The approximate nearest neighbor search method for high-dimensional data based on a satellite system graph provided by the present invention adds several randomly selected points to an initial candidate point set and iteratively expands this set through the satellite system graph (that is, several neighbor points of the points in the initial candidate point set are added to the candidate point set). It computes the actual distance between each candidate point and the query point, derives a better neighbor candidate set from the candidate neighbor points according to their distance to the query point, and iterates repeatedly to obtain the nearest neighbor set of the query point.
The search algorithm of the present invention can achieve time complexity of logarithmic order. It not only greatly improves the search accuracy attainable within a given time, but also significantly reduces the memory footprint during search and the time required to build the index.
Description of the drawings
Fig. 1 is a flow diagram of the approximate nearest neighbor search method for high-dimensional data based on a satellite system graph according to the present invention;
Fig. 2 is a flow diagram of the greedy approximate nearest neighbor search method of the present invention;
Fig. 3 compares the recall and search time of the present invention and other graph-based algorithms on the SIFT1M data set, with k = 100;
Fig. 4 compares the recall and search time of the present invention and other graph-based algorithms on the GIST1M data set, with k = 100;
Fig. 5 is a schematic diagram of the module structure of the approximate nearest neighbor search system for high-dimensional data based on a satellite system graph according to the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings.
As shown in Fig. 1, an approximate nearest neighbor search system for high-dimensional data based on a satellite system graph comprises an offline phase and an online phase.
The purpose of the offline phase is to build the satellite system graph; it comprises steps S101 to S107.
S101: Build an approximate nearest neighbor graph over the high-dimensional database point set. The approximate nearest neighbor graph is a directed graph in which every point has a fixed number k of outgoing edges; the neighbors reached by these k edges are not necessarily all among the point's k true nearest neighbors.
S102: Take a point pi from the database as the point under examination. On the approximate nearest neighbor graph, look up the neighbors of pi and the neighbors of those neighbors, add them to the index point set P, sort P in ascending order of distance to pi, and keep only the L nearest points, where L is a predetermined value.
Because both the neighbors and the neighbors of neighbors are covered, the edges around the point under examination can reach neighboring regions over a wide range around pi, so that long-range jumps are possible when searching on the graph.
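The two-hop expansion of S102 can be sketched as follows, assuming the approximate nearest neighbor graph is stored as a mapping from a point index to a list of neighbor indices. The function name `build_index_point_set` and the Euclidean distance are illustrative assumptions, not names taken from the patent.

```python
import math

def build_index_point_set(a, knn_graph, points, L):
    """Collect the neighbors of point a and the neighbors of those neighbors,
    then keep the L points closest to a, in ascending order of distance."""
    pool = set(knn_graph[a])
    for nb in knn_graph[a]:
        pool.update(knn_graph[nb])  # second hop: neighbors of neighbors
    pool.discard(a)                 # the examined point is not its own neighbor
    ordered = sorted(pool, key=lambda i: (math.dist(points[a], points[i]), i))
    return ordered[:L]

# Usage on a toy one-dimensional data set.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
knn_graph = {0: [1, 2], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 2]}
print(build_index_point_set(0, knn_graph, points, L=2))
```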
S103: Starting from the closest point, delete each point from the index point set, add it to the result point set S, and check whether the current result point set S satisfies the sufficient radiation property. If it does not, delete the newly added point. The sufficient radiation property is: for any two points b and c in the result point set, the angle between edges ab and ac is greater than or equal to m degrees, where m is a preset value. At the same time, the points in the result point set should be as close as possible to the point under examination pi. This property lets the edges at any point of the graph fully cover the neighboring region around that point; experiments show that this helps the greedy search algorithm traverse the graph quickly.
S104: Result point set screening control module. Determine whether the graph-construction index point set is empty; if so, stop screening. Alternatively, determine whether the result point set has reached the specified size r; if so, stop iterating.
S105: Graph-construction iteration control module. Determine whether the current point under examination pi is the last point in the database. If so, stop iterating; otherwise increase i by 1 and return to step S102 to continue iterating. When iteration stops, the intermediate result graph is obtained. Because the processing of different points under examination is mutually independent, this step is easy to parallelize.
S106: Strong connectivity enhancement module, which converts a graph that is not strongly connected into a strongly connected graph and thereby enhances the connectivity of the graph. The intermediate result graph is traversed with a depth-first search algorithm; during traversal, the module checks whether a backward path exists or whether a cycle is formed, and thereby detects the strongly connected components of the intermediate result graph. A bidirectional edge is added between any two strongly connected components that are found consecutively, so that all strongly connected components are joined into one strongly connected graph. This enhances the connectivity of the graph and helps improve search accuracy.
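One way to realize the connectivity enhancement of S106 is sketched below. The sketch substitutes Kosaraju's algorithm (two depth-first passes) for the depth-first traversal the text describes, and links the first member of each component to the first member of the next; both choices, and the function names, are assumptions made for illustration. The patent does not state which points of two components are joined.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm, iterative. graph: dict node -> list of out-neighbors."""
    order, seen = [], set()
    for start in graph:                      # first pass: record finishing order
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, it = stack[-1]
            for nb in it:
                if nb not in seen:
                    seen.add(nb)
                    stack.append((nb, iter(graph[nb])))
                    break
            else:                            # all out-edges explored
                order.append(node)
                stack.pop()
    reverse = {u: [] for u in graph}         # second pass runs on reversed edges
    for u, nbs in graph.items():
        for v in nbs:
            reverse[v].append(u)
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        comp, stack = [], [start]
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in reverse[u]:
                if v not in assigned:
                    assigned.add(v)
                    stack.append(v)
        components.append(comp)
    return components

def enhance_connectivity(graph):
    """Add a bidirectional edge between consecutively found components (in place)."""
    comps = strongly_connected_components(graph)
    for prev, cur in zip(comps, comps[1:]):
        u, v = prev[0], cur[0]               # assumption: link first members
        if v not in graph[u]:
            graph[u].append(v)
        if u not in graph[v]:
            graph[v].append(u)
    return graph

# Usage: three separate components become one strongly connected graph.
g = {0: [1], 1: [0], 2: [3], 3: [2], 4: []}
enhance_connectivity(g)
print(len(strongly_connected_components(g)))
```

Because each component is strongly connected internally and consecutive components are joined in both directions, a single pass already yields one strongly connected graph in this sketch.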
S107: Output module. The result point sets obtained for all database points are assembled into the satellite system graph, which is output.
The online phase obtains the nearest neighbor set through greedy approximate nearest neighbor search on the satellite system graph; it comprises steps S111 to S113.
S111: Initialization search module. The query point q and the satellite system graph G are provided as input to the greedy approximate nearest neighbor search module.
S112: According to the given parameters, perform the greedy approximate nearest neighbor search on the satellite system graph to obtain the nearest neighbor candidate set of point q.
S113: Output module. Return as the result the k points in the candidate point set that are closest to q, where k is a predetermined value.
As shown in Fig. 2, the approximate nearest neighbor search method for high-dimensional data includes a key greedy approximate nearest neighbor search module, comprising steps S201 to S207.
S201: Create an empty candidate point set whose maximum capacity is the predetermined value p. Add several randomly selected points from the database point set to the candidate point set and mark them as unvisited.
Each point object in the candidate point set has three attributes: the index value (or subscript) of the point, its distance to the query point, and a visited flag. The purpose is to make the candidate points easy to sort and to save computation.
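A candidate record with these three attributes might be modelled as below. This is a sketch under assumptions: the class name `Candidate`, the field names, and the helper `insert_sorted` are hypothetical. Ordering compares the distance first, so the candidate set can be kept sorted by distance as points are inserted.

```python
import bisect
from dataclasses import dataclass, field

@dataclass(order=True)
class Candidate:
    distance: float                  # distance to the query point (sort key)
    index: int                       # index (subscript) of the database point
    visited: bool = field(default=False, compare=False)  # visited flag

def insert_sorted(pool, cand):
    """Insert cand into pool, keeping pool in ascending order of distance."""
    bisect.insort(pool, cand)

# Usage: candidates come out ordered by distance regardless of insertion order.
pool = []
for dist, idx in [(2.5, 7), (0.5, 3), (1.0, 9)]:
    insert_sorted(pool, Candidate(distance=dist, index=idx))
print([c.index for c in pool])
```

Storing the distance alongside the index means each distance is computed only once per candidate.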
S202: Take the unvisited point in the candidate point set that is closest to the query point as the point under examination, and mark it as visited. The neighbors of the point under examination are likely to be closer to the query point. Marking the point as visited ensures that later checks do not examine it again, which would cause redundant computation.
S203: Obtain the neighbors of the point under examination by querying the satellite system graph. Mark all of these neighbors as unvisited, compute their distances to the query point, and add them to the candidate point set, ensuring that the candidate point set stays sorted in ascending order of distance as they are inserted. The rationale is as follows. This greedy search resembles a depth-first search: when it moves in the direction closest to the query point, it may reach a local optimum and become trapped. Examining the neighbors of the next-closest point that has not yet been examined may then let the search escape the local optimum and increases search accuracy.
S204 to S205: If the size of the candidate point set exceeds the predetermined value M, delete the points in the candidate point set that are farthest from the query point so that its size does not exceed M. The purpose is to let the algorithm stop searching after a bounded number of iterations, instead of continuing to loop until all points have been traversed.
S206 to S207: Iteration termination judgment module. Check whether all points in the candidate point set have been visited. If not, new nearest neighbor points may still be found; otherwise, no new nearest neighbor point can be found under the current parameter settings, and the search stops.
The first k points of the candidate point set, i.e. the k points closest to the query point, are returned as the result, where k is a predetermined value.
To describe the accuracy of the nearest neighbor points found for the query points more intuitively, the accuracy is quantified. Specifically, it is described by the average recall, which is computed as follows:
$$\mathrm{recall} = \frac{1}{q\,k} \sum_{i=1}^{q} \sum_{j=1}^{k} p_{i,j}$$
where q is the number of query points, k is the number of nearest neighbor points returned for each query point, recall is the average accuracy of the nearest neighbor points, and p_{i,j} indicates whether the j-th nearest neighbor point of the i-th query point is an exact nearest neighbor point: p_{i,j} is 1 if it is, and 0 if it is not.
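The average recall described here, i.e. the fraction of returned neighbors that are exact nearest neighbors, averaged over all query points, can be computed as in the following sketch. The function name `average_recall` and the list-of-lists input format are assumptions for illustration.

```python
def average_recall(retrieved, ground_truth, k):
    """retrieved[i] and ground_truth[i] hold point indices for the i-th query.
    Returns hits / (q * k), where a hit is a returned index that appears among
    the k exact nearest neighbors of the same query."""
    q = len(retrieved)
    hits = 0
    for found, truth in zip(retrieved, ground_truth):
        truth_set = set(truth[:k])
        hits += sum(1 for idx in found[:k] if idx in truth_set)
    return hits / (q * k)

# Usage: 2 of 3 correct for the first query, 1 of 3 for the second.
print(average_recall([[1, 2, 3], [4, 5, 6]], [[1, 2, 9], [4, 8, 7]], k=3))
```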
As the above approximate nearest neighbor search method for high-dimensional data shows, an initial candidate point set is obtained using database hub points and is then expanded through the satellite system graph (several neighbor points of the data points in the initial candidate point set are added to the candidate point set). The actual distance between each candidate point and the query point is computed, a better neighbor candidate set is derived from the candidate neighbor points according to their distance to the query point, and repeated iteration yields the nearest neighbor set of the query point.
The present invention takes full advantage of the fast convergence of graph-based methods and of the strengths of the satellite system graph in approximate nearest neighbor search. It greatly improves the efficiency of nearest neighbor search on high-dimensional data, while substantially shortening index construction time and reducing memory usage.
The following is a preferred implementation of the approximate nearest neighbor search method for high-dimensional data of the present invention. The detailed process is as follows.
This implementation is described in further detail using GIST image feature data as an example; Table 1 gives information on the GIST data set.
Table 1
Data set | Number of base-set points | Number of test-set points | Dimension |
GIST | 1000000 | 10000 | 960 |
The 10000 data points of the test set in the GIST data set (which are distinct from the base-set data points) are used as query points, and the 1000000 data points of the base set form the database point set. In the offline phase, the satellite system graph is built on this database point set.
Step a: Build an approximate nearest neighbor graph N on the GIST data set with k = 300, i.e. every point in the graph has 300 neighbors (outgoing edges).
Step b: Take a point p_i from the database as the point under examination and perform neighbor expansion on p_i in graph N. Create an empty graph-construction index point set P; P is an ordered array sorted in ascending order of distance. Add the neighbors of p_i and the neighbors of those neighbors to the index point set, whose size is set to 500. If the index point set holds more than 500 points, delete the farther points and keep only the 500 nearest.
Step c: P is an ordered array whose points are sorted in ascending order of distance to p_i. Create the result point set Fi. Starting from the first point of P, delete the point from P, add it to Fi, and check whether Fi satisfies the sufficient radiation property. If it does not, delete the newly added point. Then take the next first point of P, remove it from P, and repeat. Here the sufficient radiation property is: for any two points b and c in Fi, the angle between the edge from p_i to b and the edge from p_i to c is greater than or equal to 60 degrees.
Iteration stops when P becomes empty or the number of points in Fi reaches the predetermined value 70.
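With the 60-degree threshold of this embodiment, the angle check reduces to a comparison on the cosine, so no inverse trigonometric function is needed. Writing a for the point under examination and b, c for two points of the result set (vector notation assumed here for illustration), and using the fact that the arccosine is decreasing:

```latex
\angle(\vec{ab}, \vec{ac}) \ge 60^{\circ}
\iff
\frac{(b - a) \cdot (c - a)}{\lVert b - a \rVert \, \lVert c - a \rVert}
\le \cos 60^{\circ} = \frac{1}{2}
```

A newly added point is therefore rejected as soon as its normalized dot product with any already accepted point exceeds 1/2.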
Step d: Check whether i is greater than or equal to 1000000. If so, stop iterating; otherwise set i = i + 1 and return to step b.
Step e: Assemble the result point sets Fi of all points into the satellite system graph G and output it as the result.
In the retrieval phase, the candidate neighbors nearest to the query point are obtained with the following steps:
Step 1: the input is the query point q and the satellite system graph G built above; set p as the maximum capacity of the candidate set T used by the greedy nearest-neighbor search. Here p is a tunable parameter: the larger p is, the higher the precision and the longer the retrieval time.
Step 2: perform the greedy nearest-neighbor search with these parameters to obtain a set T containing p candidate points.
Step 3: return the k points in T nearest to q as the result.
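The three retrieval steps can be sketched as follows. This is a hedged illustration, not the embodiment's code: the name `greedy_search`, the `entry_points` argument (the steps above do not say how the search is seeded), the adjacency-list graph format, and Euclidean distance are assumptions.

```python
import math

def dist(u, v):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def greedy_search(q, points, graph, entry_points, p, k):
    """Greedy nearest-neighbor search on a graph; returns the ids of the k nearest found."""
    # Candidate set T: (distance to q, id) pairs kept in ascending order.
    T = sorted((dist(q, points[e]), e) for e in set(entry_points))
    visited = set()
    while True:
        # Nearest candidate that has not been expanded yet.
        current = next((c for c in T if c[1] not in visited), None)
        if current is None:
            break  # every candidate has been expanded
        visited.add(current[1])
        in_T = {c[1] for c in T}
        for n in graph[current[1]]:
            if n not in in_T:
                T.append((dist(q, points[n]), n))
                in_T.add(n)
        T.sort()
        del T[p:]  # keep at most p candidates (the capacity of T)
    return [idx for _, idx in T[:k]]
```

Each loop iteration expands one previously unexpanded point, so the search ends after at most as many iterations as there are points; a larger `p` lets more candidates survive truncation, which is the precision-versus-time trade-off of step 1.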
The accuracy of the k nearest neighbors is measured by the average recall:
$$\mathrm{recall} = \frac{1}{q \cdot k} \sum_{i=1}^{q} \sum_{j=1}^{k} p_{i,j}$$
where q is the number of query points (10,000 here), k is the number of nearest neighbors returned for each query point, recall is the average precision of the returned nearest neighbors, and p_{i,j} indicates whether the j-th nearest neighbor returned for the i-th query point is an exact nearest neighbor: p_{i,j} is 1 if it is and 0 if it is not.
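As a worked illustration of the average recall just defined, the sketch below counts the indicators p_{i,j} directly. It assumes that "exact nearest neighbor" means membership in the exact k-nearest-neighbor list of the same query; the function name `average_recall` and the list-of-lists inputs are hypothetical.

```python
def average_recall(retrieved, ground_truth, k):
    """Fraction of returned points, over all queries, that are exact k nearest neighbors."""
    hits = 0
    for found, exact in zip(retrieved, ground_truth):
        exact_set = set(exact[:k])
        # p_ij = 1 when the j-th returned point of query i is an exact nearest neighbor.
        hits += sum(1 for j in found[:k] if j in exact_set)
    return hits / (len(retrieved) * k)
```

For example, with two queries and k = 2, returning `[[1, 2], [3, 4]]` against exact lists `[[2, 1], [3, 5]]` gives three hits out of four, i.e. a recall of 0.75.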
Using the above formula, the recall and the elapsed time of the nearest-neighbor retrieval results are computed. Likewise, on the same datasets, the method of the present invention (the SSG algorithm), the NSG algorithm, the SSG-Naive algorithm (a simplified version of SSG that builds the graph without neighbor expansion), the DPG algorithm, the KGraph algorithm, the HNSW algorithm, the FANNG algorithm, and the Efanna algorithm are tested; the recall and retrieval time of their results are measured, and the number of queries processed per unit time is derived from them.
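The number of queries processed per unit time follows from the retrieval time as the query count divided by the elapsed time. A minimal timing sketch, assuming a single-threaded run and a hypothetical `search_fn` callable that answers one query:

```python
import time

def queries_per_second(search_fn, queries):
    """Run search_fn on every query; return the results and the measured throughput."""
    start = time.perf_counter()
    results = [search_fn(q) for q in queries]
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very coarse clocks.
    qps = len(queries) / elapsed if elapsed > 0 else float("inf")
    return results, qps
```

The measured value depends on hardware and load, so it is only meaningful when all compared algorithms are timed under the same conditions.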
For k = 100 nearest neighbors, the recall and elapsed time of the retrieval results obtained by this embodiment are recorded, together with the recall and the number of queries processed per unit time obtained by the method of the present invention (the SSG algorithm), the NSG algorithm, the SSG-Naive algorithm (the simplified version of SSG that builds the graph without neighbor expansion), the DPG algorithm, the KGraph algorithm, the HNSW algorithm, the FANNG algorithm, and the Efanna algorithm.
All of the compared methods are among the best known approximate nearest neighbor search methods based on various graph structures.
Fig. 3 compares, for k = 100 on the SIFT1M dataset, the recall and the number of queries processed per unit time of this embodiment with those of the other graph-based algorithms. Fig. 4 shows the same comparison for k = 100 on the GIST1M dataset. As Figs. 3 and 4 show, at an equal number of queries processed per unit time, the recall of the retrieval results obtained by this embodiment is clearly higher than that of the NSG algorithm, the SSG-Naive algorithm (the simplified version of SSG that builds the graph without neighbor expansion), the DPG algorithm, the KGraph algorithm, the HNSW algorithm, the FANNG algorithm, and the Efanna algorithm. The approximate nearest neighbor search method for high-dimensional data provided by the invention therefore has the highest retrieval efficiency among the compared methods.
Table 2 records the index size (memory use) and the index construction time (building time) when indexes are built on the SIFT1M and GIST1M datasets. The index construction time of the method of the present invention is relatively short and its index size is the smallest, while it still achieves the best retrieval performance described above.
Table 2
As shown in Fig. 5, an approximate nearest neighbor search system for high-dimensional data based on the satellite system graph comprises an offline satellite system graph part and an online search part. The offline satellite system graph part includes:
a nearest-neighbor module, for building an approximate k-nearest-neighbor graph on the high-dimensional database point set;
a graph-construction candidate set acquisition module, for collecting the neighboring points obtained by neighbor expansion of a point under examination in the database point set, which form the graph-construction candidate set;
a result point set screening module, for screening the result point set out of the points in the graph-construction candidate set;
a graph-construction iteration judgment module, for judging whether the graph-construction iteration has reached its termination condition; iteration stops once every point in the database point set has obtained its result point set, yielding the intermediate result graph;
a strong-connectivity enhancement module, for detecting the strongly connected components present in the intermediate result graph and connecting them into one complete strongly connected graph;
a satellite system graph output module, for assembling the result point sets of all points into the satellite system graph and ensuring that no point in the graph has more neighbors than a given value; if a point has more neighbors than the given value, its farther neighbors are deleted.
The online search part includes:
an initialization module, for providing the input of the greedy approximate nearest neighbor search module, including the query point and the satellite system graph;
a greedy approximate nearest neighbor search module, for obtaining, from the satellite system graph, the points of the high-dimensional database point set that are nearest to the query point;
a result output module, for returning as the result the k points nearest to the query point in the candidate set obtained by the greedy approximate nearest neighbor search, k being a predetermined value.
The greedy approximate nearest neighbor search module is the core module and includes:
an initialization submodule, for constructing an empty candidate set, adding the given initialization points to it, and marking them as unvisited;
an examination point acquisition submodule, for obtaining the unvisited point in the current candidate set that is nearest to the query point, taking it as the point under examination, and marking it as visited;
a candidate set update submodule, for querying the satellite system graph, adding the neighbors of the point under examination to the candidate set, and sorting the candidate set by ascending distance to the query point;
a candidate set screening submodule, for screening the points in the candidate set: when the size of the candidate set exceeds a given value, the points farthest from the query point are deleted so that the size does not exceed the given value;
an iteration control submodule, for calling the examination point acquisition submodule, the candidate set update submodule, and the candidate set screening submodule in turn, and stopping the iteration once the candidate set contains no unvisited points;
a retrieval result output submodule, for returning as the result the specified number of points in the candidate set that are nearest to the query point.
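The strong-connectivity enhancement described above can be sketched in Python as follows. This is an illustration under assumptions, not the patented procedure: strongly connected components are found with Kosaraju's algorithm (two depth-first passes), and consecutive components are joined by a two-way edge between their first members, a linking rule chosen here only for simplicity. The function names and the adjacency-list format are hypothetical.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm with iterative DFS; graph is a list of out-neighbor lists."""
    n = len(graph)
    order, seen = [], [False] * n
    for s in range(n):
        if seen[s]:
            continue
        seen[s] = True
        stack = [(s, iter(graph[s]))]
        while stack:
            u, it = stack[-1]
            for v in it:
                if not seen[v]:
                    seen[v] = True
                    stack.append((v, iter(graph[v])))
                    break
            else:
                order.append(u)  # all out-edges explored: record finish order
                stack.pop()
    # Second pass on the reversed graph, in decreasing finish order.
    reverse = [[] for _ in range(n)]
    for u in range(n):
        for v in graph[u]:
            reverse[v].append(u)
    comps, assigned = [], [False] * n
    for s in reversed(order):
        if assigned[s]:
            continue
        assigned[s] = True
        comp, stack = [], [s]
        while stack:
            u = stack.pop()
            comp.append(u)
            for v in reverse[u]:
                if not assigned[v]:
                    assigned[v] = True
                    stack.append(v)
        comps.append(comp)
    return comps

def connect_components(graph):
    """Join consecutive components with a two-way edge (hypothetical linking rule)."""
    comps = strongly_connected_components(graph)
    for a, b in zip(comps, comps[1:]):
        u, v = a[0], b[0]
        if v not in graph[u]:
            graph[u].append(v)
        if u not in graph[v]:
            graph[v].append(u)
    return graph
```

Because every pair of consecutive components is linked in both directions, the modified graph has a single strongly connected component. The added edges can push a point past the neighbor limit, so a degree cap such as the one in the output module would still have to be applied with care.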
Claims (6)
1. An approximate nearest neighbor search method for high-dimensional data based on a satellite system graph, characterized by comprising the following steps:
(1) building a satellite system graph on the high-dimensional database point set;
(2) for a query point, randomly selecting several data points as the candidate set and performing a greedy approximate nearest neighbor search on the satellite system graph;
(3) taking a given number of points from the resulting candidate set as the result, i.e., the nearest-neighbor set of the query point.
2. The approximate nearest neighbor search method for high-dimensional data based on a satellite system graph according to claim 1, characterized in that step (1) specifically comprises:
(1-1) building an approximate k-nearest-neighbor graph of the high-dimensional database point set;
(1-2) taking any point a in the database as the point under examination, collecting its neighbors in the approximate k-nearest-neighbor graph and the neighbors of those neighbors to form a graph-construction candidate set; computing the distance from every point in the candidate set to the point under examination a, sorting the points by ascending distance, keeping the L nearest points, L being a predetermined value, and deleting the remaining points from the candidate set;
(1-3) starting from the point in the graph-construction candidate set nearest to a, removing each point from the candidate set and adding it to the result point set, and verifying whether the current result point set satisfies the sufficient-radiation property; if not, deleting the newly added point;
(1-4) when the size of the result point set reaches a predetermined value R or all points in the graph-construction candidate set have been traversed, taking the result point set as the neighbor set of point a in the satellite system graph;
(1-5) repeating steps (1-2) to (1-4) until all points of the database have been traversed, obtaining an intermediate result graph;
(1-6) choosing any point d from the dataset and, starting from d, finding the strongly connected components of the intermediate result graph by depth-first search;
(1-7) adding bidirectional edges to the intermediate result graph between any two consecutively found connected components;
(1-8) repeating steps (1-6) to (1-7) until the maximum number of iterations T is reached, T being a predetermined value, obtaining the satellite system graph.
3. The approximate nearest neighbor search method for high-dimensional data based on a satellite system graph according to claim 2, characterized in that, in step (1-1), the approximate k-nearest-neighbor graph is a directed graph in which every point has a fixed number k of out-edges, and the neighbors connected by these k edges are not necessarily all of its k nearest neighbors.
4. The approximate nearest neighbor search method for high-dimensional data based on a satellite system graph according to claim 2, characterized in that, in step (1-3), the sufficient-radiation property is: for any two points b and c in the result point set, the angle between edges ab and ac is greater than or equal to m degrees, where m is a preset value.
5. The approximate nearest neighbor search method for high-dimensional data based on a satellite system graph according to claim 1, characterized in that, in step (2), the greedy approximate nearest neighbor search comprises:
(2-1) creating an empty candidate set, adding several points randomly selected from the database point set to it, and marking them as unvisited;
(2-2) taking the unvisited point in the candidate set nearest to the query point as the point under examination and marking it as visited;
(2-3) querying the satellite system graph to obtain the neighbors of the point under examination, marking all of these neighbors as unvisited, adding them to the candidate set, and sorting the candidate set by ascending distance to the query point;
(2-4) if the size of the candidate set exceeds a predetermined value M, deleting the points in the candidate set farthest from the query point so that its size does not exceed M;
(2-5) repeating steps (2-2) to (2-4) until the candidate set contains no unvisited points, and returning as the result the specified number of points in the candidate set nearest to the query point.
6. An approximate nearest neighbor search system for high-dimensional data based on a satellite system graph, characterized by comprising an offline satellite system graph part and an online search part, wherein the offline satellite system graph part includes:
a nearest-neighbor module, for building an approximate k-nearest-neighbor graph on the high-dimensional database point set;
a graph-construction candidate set acquisition module, for collecting the neighboring points obtained by neighbor expansion of a point under examination in the database point set, which form the graph-construction candidate set;
a result point set screening module, for screening the result point set out of the points in the graph-construction candidate set;
a graph-construction iteration judgment module, for judging whether the graph-construction iteration has reached its termination condition, iteration stopping once every point in the database point set has obtained its result point set, yielding an intermediate result graph;
a strong-connectivity enhancement module, for detecting the strongly connected components present in the intermediate result graph and connecting them into one complete strongly connected graph;
a satellite system graph output module, for assembling the result point sets of all points into the satellite system graph and ensuring that no point in the graph has more neighbors than a given value, the farther neighbors of a point being deleted if it has more neighbors than the given value;
the online search part includes:
an initialization module, for providing the input of the greedy approximate nearest neighbor search module, including the query point and the satellite system graph;
a greedy approximate nearest neighbor search module, for obtaining, from the satellite system graph, the points of the high-dimensional database point set that are nearest to the query point;
a result output module, for returning as the result the k points nearest to the query point in the candidate set obtained by the greedy approximate nearest neighbor search, k being a predetermined value;
wherein the greedy approximate nearest neighbor search module is the core module and includes:
an initialization submodule, for constructing an empty candidate set, adding the given initialization points to it, and marking them as unvisited;
an examination point acquisition submodule, for obtaining the unvisited point in the current candidate set that is nearest to the query point, taking it as the point under examination, and marking it as visited;
a candidate set update submodule, for querying the satellite system graph, adding the neighbors of the point under examination to the candidate set, and sorting the candidate set by ascending distance to the query point;
a candidate set screening submodule, for screening the points in the candidate set, the points farthest from the query point being deleted when the size of the candidate set exceeds a given value so that the size does not exceed the given value;
an iteration control submodule, for calling the examination point acquisition submodule, the candidate set update submodule, and the candidate set screening submodule in turn, and stopping the iteration once the candidate set contains no unvisited points;
a retrieval result output submodule, for returning as the result the specified number of points in the candidate set that are nearest to the query point.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810229529.9A CN108710626A (en) | 2018-03-20 | 2018-03-20 | A kind of the approximate KNN search method and searching system of the high dimensional data based on satellite system figure |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108710626A true CN108710626A (en) | 2018-10-26 |
Family
ID=63866172
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810229529.9A Pending CN108710626A (en) | 2018-03-20 | 2018-03-20 | A kind of the approximate KNN search method and searching system of the high dimensional data based on satellite system figure |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108710626A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20181026 |