CN103020321A - Neighbor searching method and neighbor searching system - Google Patents


Info

Publication number
CN103020321A
CN103020321A (application CN201310011407.XA; granted publication CN103020321B)
Authority
CN
China
Prior art keywords
data
point
data point
dimensionality reduction
kernel matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310011407XA
Other languages
Chinese (zh)
Other versions
CN103020321B (en)
Inventor
钟海兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GUANGDONG TUTUSOU NETWORK TECHNOLOGY Co Ltd
Original Assignee
GUANGDONG TUTUSOU NETWORK TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GUANGDONG TUTUSOU NETWORK TECHNOLOGY Co Ltd
Priority to CN201310011407.XA
Publication of CN103020321A
Application granted
Publication of CN103020321B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a neighbor search method and system comprising offline learning and online search. A learning function counting the data points in the small regions on either side of each hash hyperplane is minimized so that the hyperplane passes through a sparse region of the data, which guarantees high accuracy in neighbor search. An approximately-balanced-bucket regularization term, derived from an approximate bucket-balance condition, is added to the learning function so that the hash hyperplanes partition the data points more evenly, which guarantees high search speed. Whether the data set is small or massive, the method and system achieve neighbor search with both high accuracy and high speed.

Description

Neighbor search method and system
Technical field
The present invention relates to the technical field of information retrieval, and in particular to a neighbor search method and system.
Background technology
With the rapid development of information technology, improved data-acquisition capability has caused both the volume and the dimensionality of data in every field to grow exponentially. This rapid growth makes search exceptionally difficult. For example, given an input picture, when we need to find identical or similar pictures in a massive image library, we must find them not only accurately but also quickly. In this example an image can be abstracted as a data point; the similarity between data points is usually measured by Euclidean distance, and neighbor search is the task of finding, under this similarity measure, the data points most similar to a query point.
Comparing the query point one by one against every point in the database guarantees retrieval precision, but becomes very slow for massive high-dimensional data. Traditional tree-based neighbor search techniques can achieve high accuracy, but their retrieval speed drops rapidly as the data dimensionality increases.
Summary of the invention
In view of the above, the present invention proposes a neighbor search method and system that improve the speed of neighbor search while guaranteeing accuracy.
A neighbor search method comprises offline learning and online search.
The offline learning comprises the steps of:
randomly and uniformly selecting a predetermined number of data points from the data set as anchor points, obtaining a kernel matrix by computing the distances between the data points and the anchor points, and centering the kernel matrix;
learning a projection and a threshold for each of a predetermined number of binary bits, the learning of each bit comprising: computing density complementary information and balance complementary information, and learning the projection and the threshold from the centered kernel matrix and the two kinds of complementary information, the objective function being to minimize the number of data points in the small regions on either side of the hash hyperplane while keeping the buckets approximately balanced;
converting the data points of the data set into binary strings using the centered kernel matrix and the learned projections and thresholds, placing data points with identical binary strings into the same bucket, and building a hash table.
The online search comprises the steps of:
for each query point, obtaining a centered kernel vector using the same anchor points and the same kernel-matrix column means;
converting each query point into a binary string using its centered kernel vector and the learned projections and thresholds;
finding, according to the binary string of the query point, a predetermined number of data points in the corresponding buckets of the hash table as the neighbors of the query point.
A neighbor search system comprises an offline unit and an online search unit.
The offline unit comprises:
a training-point kernel matrix determination module, for randomly and uniformly selecting a predetermined number of data points from the data set as anchor points, obtaining a kernel matrix by computing the distances between the data points and the anchor points, and centering the kernel matrix;
a projection and threshold learning module, for learning a projection and a threshold for each of a predetermined number of binary bits, the learning of each bit comprising: computing density complementary information and balance complementary information, and learning the projection and the threshold from the centered kernel matrix and the two kinds of complementary information, the objective function being to minimize the number of data points in the small regions on either side of the hash hyperplane while keeping the buckets approximately balanced;
a hash table building module, for converting the data points of the data set into binary strings using the centered kernel matrix and the learned projections and thresholds, placing data points with identical binary strings into the same bucket, and building a hash table.
The online search unit comprises:
a query-point kernel vector determination module, for obtaining, for each query point, a centered kernel vector using the same anchor points and the same kernel-matrix column means;
a binary string conversion module, for converting each query point into a binary string using its centered kernel vector and the learned projections and thresholds;
a hash bucket lookup module, for finding, according to the binary string of the query point, a predetermined number of data points in the corresponding buckets of the hash table as the neighbors of the query point.
In the neighbor search method and system of the present invention, minimizing a learning function that counts the data points in the small regions on either side of the hash hyperplane makes the hyperplane pass through a sparse region of the data, guaranteeing high search accuracy; adding an approximately-balanced-bucket regularization term to the learning function, derived from the approximate bucket-balance condition, makes the hash hyperplanes partition the data points more evenly, guaranteeing high search speed. Whether the data set is small or massive, the method and system perform neighbor search with both high accuracy and high speed.
Description of drawings
Fig. 1 is a flow diagram of the neighbor search method of the present invention;
Fig. 2 is a flow diagram of the offline learning of the neighbor search method of the present invention;
Fig. 3 is a flow diagram of the online search of the neighbor search method of the present invention;
Fig. 4 is a structural diagram of the neighbor search system of the present invention;
Fig. 5 is a schematic diagram of one embodiment of the neighbor search system of the present invention.
Embodiment
The present invention is a neighbor search method and system based on a hashing algorithm: data are converted into short binary strings, and a hash table is then built to enable efficient search. Since data of any dimensionality are ultimately converted into a short binary string (for example, 0110 is a 4-bit binary string), hashing-based neighbor search is insensitive to dimensionality and can retrieve high-dimensional data quickly. The present invention is explained in detail below with reference to the accompanying drawings and embodiments.
As shown in Fig. 1, the neighbor search method of the present invention comprises two stages: offline learning and online search.
The offline learning process, shown in Fig. 1, is as follows:
Step S101: randomly and uniformly select some data points from the training data set as anchor points, obtain a kernel matrix by computing the distances between the training data points and these anchor points, and center the kernel matrix.
The training data set (hereinafter simply the data set) consists of n data points of dimension d. For example, all the pixels of a 32*32 gray-level image can be concatenated into a 1024-dimensional vector, which is then a 1024-dimensional data point; alternatively, a d-dimensional feature can be extracted from a picture, and that feature is a d-dimensional data point. The purpose of selecting the anchor points randomly and uniformly is to prevent the selected anchors from concentrating in one region of the data space, i.e. the anchor points should be distributed evenly over the whole data space.
Suppose the data set X consists of n d-dimensional data points, $X=[x_1,\ldots,x_n]\in R^{d\times n}$. Uniformly selecting m data points $\Delta_1,\ldots,\Delta_m$ at random as anchor points, we compute the kernel matrix

$$K=\begin{pmatrix} k(x_1,\Delta_1) & \cdots & k(x_1,\Delta_m) \\ \vdots & \ddots & \vdots \\ k(x_n,\Delta_1) & \cdots & k(x_n,\Delta_m) \end{pmatrix},$$

where $k(\cdot,\cdot)$ is a kernel function; here the Gaussian kernel $k(x,y)=\exp(-\|x-y\|^2/2\sigma^2)$ is selected, m = 300, and $\sigma$ is chosen as the mean pairwise distance among 3000 randomly selected points. Centering the kernel matrix then gives

$$\bar K=\begin{pmatrix} k(x_1,\Delta_1)-\mu_1 & \cdots & k(x_1,\Delta_m)-\mu_m \\ \vdots & \ddots & \vdots \\ k(x_n,\Delta_1)-\mu_1 & \cdots & k(x_n,\Delta_m)-\mu_m \end{pmatrix}=\begin{pmatrix}\overline{k(x_1)}^T\\ \vdots \\ \overline{k(x_n)}^T\end{pmatrix}, \qquad \mu_i=\frac{1}{n}\sum_{j=1}^n k(x_j,\Delta_i).$$
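The kernel-matrix construction and centering of step S101 can be sketched as follows. This is an illustrative Python sketch with toy sizes, not part of the patent disclosure: the patent uses m = 300 anchors and sets σ to the mean pairwise distance among 3000 random points, and the helper name `centered_kernel_matrix` is an assumption.

```python
import numpy as np

def centered_kernel_matrix(X, anchors, sigma):
    """Gaussian kernel matrix K[i, j] = exp(-||x_i - a_j||^2 / (2 sigma^2)),
    centered by subtracting each column's mean mu_j."""
    # squared Euclidean distances between every data point and every anchor
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    mu = K.mean(axis=0)            # mu_j = (1/n) * sum_i k(x_i, a_j)
    return K - mu, mu              # centered kernel matrix and column means

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))     # n = 100 points of dimension d = 16 (toy sizes)
anchors = X[rng.choice(100, size=8, replace=False)]   # m = 8 anchors
Kbar, mu = centered_kernel_matrix(X, anchors, sigma=2.0)
```

By construction every column of the centered matrix has zero mean, which is what allows the same μ values to be reused for query points online.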
Step S102: learn the projection and threshold of each of several binary bits. The learning of each bit comprises: first computing the density complementary information and the balance complementary information, then learning the projection and threshold from the centered kernel matrix and these two kinds of complementary information; the objective function is to minimize the number of data points in the small regions on either side of the hash hyperplane while keeping the buckets approximately balanced.
Suppose c hash functions are to be learned so as to convert a data point into a c-bit binary string. For the k-th hash function $\mathrm{sgn}(f_k(x))=\mathrm{sgn}(p_k^T\overline{k(x)}-b_k)$, the projection $p_k$ and the threshold $b_k$ are learned by minimizing the objective

$$\sum_{i=1}^{n} u_i^k\, H\!\big(\epsilon - f_k(x_i)\,\mathrm{sgn}(f_k(x_i))\big) + \alpha\,\|V_{k-1}^T v_k\|^2,$$

where $u_i^k$ is called the density complementary information (its precise definition was given as formula images in the original publication and is not recoverable here), $H(x)$ is the unit step function, $v_k=[\mathrm{sgn}(f_k(x_1)),\ldots,\mathrm{sgn}(f_k(x_n))]^T$, and $V_{k-1}=[\mathbf{1},v_1,\ldots,v_{k-1}]$ is called the balance complementary information; $\mathrm{sgn}(x)$ is the sign function, and $\alpha$ and $\epsilon$ are input parameters of the algorithm, here $\alpha=0.1$ and $\epsilon=0.01s$, where $s$ is the mean distance of all points to the bisecting hyperplane.
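For a candidate projection and threshold, the per-bit objective above can be evaluated roughly as follows. This is a hedged sketch: the density weights $u_i^k$ are set to 1 because their exact definition is not recoverable from the publication, the function name and toy sizes are assumptions, and how the patent actually optimizes $p_k$ and $b_k$ is not shown here.

```python
import numpy as np

def hash_objective(Kbar, p, b, V_prev, alpha, eps):
    """Sketch of the per-bit learning objective: count points falling in the
    narrow margin |f(x)| < eps around the hyperplane, plus a balance penalty
    coupling the new bit to previously learned bits. Density weights u_i are
    taken as 1 (an assumption; the patent's definition is in lost images)."""
    f = Kbar @ p - b                       # f_k(x_i) for every point
    v = np.sign(f)
    v[v == 0] = 1.0
    margin_term = np.sum(eps - f * v > 0)  # H(eps - f*sgn(f)): points within eps
    balance_term = alpha * np.sum((V_prev.T @ v) ** 2)
    return margin_term + balance_term

rng = np.random.default_rng(1)
Kbar = rng.normal(size=(50, 8))            # centered kernel matrix (toy sizes)
p, b = rng.normal(size=8), 0.0
V_prev = np.ones((50, 1))                  # V_0 = [1]: penalizes unbalanced bits
obj = hash_objective(Kbar, p, b, V_prev, alpha=0.1, eps=0.01)
```

Note how the all-ones column in `V_prev` makes the regularizer penalize any bit whose +1/-1 assignments are unbalanced, which is the approximate bucket-balance condition.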
Step S103: using the centered kernel matrix and the learned projections and thresholds, convert the training data points into binary strings, place data points with identical binary strings into the same bucket, and build the hash table.
The c hash functions turn each d-dimensional data point x into a c-bit binary string; the k-th hash function maps x to the k-th bit, 0 or 1. (The explicit bit-conversion formula appeared as an image in the original publication; the standard convention is that the bit is 1 when $f_k(x)\ge 0$ and 0 otherwise.)
All data points are converted to binary strings in this manner, and data points with identical binary strings are put into the same bucket (the index of a bucket is exactly this binary string), thereby building the hash table.
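The conversion of points to binary strings and their grouping into buckets in step S103 might look like the following sketch, assuming the bit convention 1 iff $f_k(x)\ge 0$; names and sizes are illustrative.

```python
import numpy as np

def build_hash_table(Kbar, P, B):
    """Convert each point to a c-bit binary string via c hash functions
    (columns of P, thresholds B) and group equal strings into buckets."""
    bits = (Kbar @ P - B >= 0).astype(int)  # 1 iff f_k(x) >= 0 (assumed convention)
    table = {}
    for i, row in enumerate(bits):
        key = ''.join(map(str, row))        # the bucket index is the binary string
        table.setdefault(key, []).append(i)
    return table

rng = np.random.default_rng(2)
Kbar = rng.normal(size=(60, 8))             # centered kernel matrix (toy sizes)
P = rng.normal(size=(8, 4))                 # c = 4 learned projections
B = np.zeros(4)                             # c = 4 learned thresholds
table = build_hash_table(Kbar, P, B)
```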
At this point, offline learning has built the hash table, and performing online search with the learned results already guarantees high accuracy and speed. To further improve accuracy and speed, however, the following steps may also be performed, as shown in Fig. 2:
Step S104: compute the pairwise distances between all n points of the data set, and for each data point sort its distances to the other data points in ascending order, yielding for each data point a ranking of all other points. Taking the first k points of each ranking produces an n × k correspondence table from approximate nearest neighbors to exact nearest neighbors; here k = 50.
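The correspondence table of step S104 can be built with a brute-force pairwise-distance computation, sketched below with toy sizes (the patent uses k = 50 on the full data set; the function name is an assumption).

```python
import numpy as np

def exact_nn_table(X, k):
    """For each point, sort all other points by Euclidean distance and keep
    the k closest, giving an n x k approximate-NN -> exact-NN lookup table."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # all pairwise distances
    np.fill_diagonal(d2, np.inf)     # exclude each point from its own neighbor list
    return np.argsort(d2, axis=1)[:, :k]

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
table = exact_nn_table(X, k=5)
```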
Step S105: perform principal component analysis (PCA) on the data set to obtain a PCA dimensionality-reduction matrix (a d × d′ matrix); multiplying the data set by this matrix yields the reduced data set (an n × d′ matrix). The value of d′ needs to be tuned per data set: it is 40 on GIST-1M and 32 on SIFT-1M.
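The PCA dimensionality reduction of step S105 can be sketched via the singular value decomposition (the function name and toy sizes are assumptions; the patent uses d′ = 40 on GIST-1M and d′ = 32 on SIFT-1M).

```python
import numpy as np

def pca_reduce(X, d_prime):
    """Project the data onto its d' leading principal components.
    Returns the d x d' projection matrix and the reduced n x d' data."""
    Xc = X - X.mean(axis=0)
    # rows of Vt are principal directions, ordered by singular value
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:d_prime].T               # the d x d' dimensionality-reduction matrix
    return W, Xc @ W

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 10))        # n = 80, d = 10 (toy sizes)
W, Xr = pca_reduce(X, d_prime=4)
```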
The online search process, shown in Fig. 1, is as follows:
Step S201: for each query point, obtain its centered kernel vector using the same anchor points.
For a d-dimensional query point x, using the same anchor points $\Delta_1,\ldots,\Delta_m$ and the same kernel-matrix column means $\mu_1,\ldots,\mu_m$, the centered kernel vector of x is obtained as $\overline{k(x)}^T=\big(k(x,\Delta_1)-\mu_1,\ \ldots,\ k(x,\Delta_m)-\mu_m\big)$.
Step S202: using the centered kernel vector and the learned projections and thresholds, convert each query point into a binary string.
The c hash functions learned offline are applied to the centered kernel vector $\overline{k(x)}$ of x, turning x into a c-bit binary string; the k-th hash function maps x to the k-th bit, 0 or 1, in the same manner as in step S103.
Step S203: according to the binary string obtained in step S202, find in the hash table all buckets within Hamming radius r (a Hamming radius of r means the bucket index may differ from the binary string obtained in step S202 in at most r bits; here r = 2), and take out the data points in these buckets.
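Enumerating all bucket indices within Hamming radius r of the query string, as in step S203, can be sketched as follows (r = 2 per the patent; the helper name is an assumption).

```python
from itertools import combinations

def buckets_within_radius(key, r):
    """All binary strings differing from `key` in at most r positions --
    the bucket indices probed during search (the patent uses r = 2)."""
    c = len(key)
    out = []
    for radius in range(r + 1):
        for flips in combinations(range(c), radius):
            bits = list(key)
            for i in flips:
                bits[i] = '1' if bits[i] == '0' else '0'  # flip the chosen bit
            out.append(''.join(bits))
    return out

probe = buckets_within_radius('0110', 2)   # 1 + C(4,1) + C(4,2) = 11 buckets
```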
The data points taken out of the buckets can already serve as the neighbors of the query point, but at this stage they are unordered. To rank the neighbors and further improve search accuracy and speed, the online search may also comprise the following steps, as shown in Fig. 3:
Step S204: multiply each query point by the PCA dimensionality-reduction matrix learned offline to obtain the reduced query point (a d′-dimensional vector).
Step S205: compute the distances between the reduced query point and the reduced candidate data points, sort them in ascending order, and take the first $m_1$ data points; recompute their distances in the original dimension, sort again in ascending order, and take the first $m_2$ points. For each of these $m_2$ points, look up the approximate-NN-to-exact-NN correspondence table and take $m_3$ candidate points per point; after removing duplicate points this yields a new candidate data point set. Preferably, $m_1=100$, $m_2=10$, $m_3=50$.
Step S206: compute the distances between the reduced query point and the new reduced candidate data points, sort them in ascending order, take the first $m_4$ data points, recompute and sort their distances in the original dimension, and finally obtain the neighbors of the query point. Preferably, $m_4=100$.
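The coarse-to-fine re-ranking of steps S205–S206 can be sketched in simplified form as follows. The correspondence-table expansion between the two stages is omitted for brevity, and all names and sizes are assumptions, not the patent's implementation.

```python
import numpy as np

def rerank(q, X, W, candidates, m1, m2):
    """Two-stage filtering sketch: sort candidates by distance in the PCA
    subspace, keep the m1 closest, then re-sort those m1 by full-dimensional
    distance and keep the m2 closest."""
    qr, Xr = q @ W, X[candidates] @ W
    d_low = ((Xr - qr) ** 2).sum(axis=1)         # cheap d'-dimensional distances
    stage1 = [candidates[i] for i in np.argsort(d_low)[:m1]]
    d_full = ((X[stage1] - q) ** 2).sum(axis=1)  # exact d-dimensional distances
    return [stage1[i] for i in np.argsort(d_full)[:m2]]

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
W = np.linalg.qr(rng.normal(size=(10, 4)))[0]    # stand-in for the PCA matrix
q = X[0] + 0.01 * rng.normal(size=10)            # query very near point 0
result = rerank(q, X, W, candidates=list(range(200)), m1=20, m2=5)
```

The design point illustrated here is the one the patent relies on: the cheap d′-dimensional pass prunes most candidates, so the expensive full-dimensional distances are computed only for a short list.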
Tables 1 and 2 compare the search accuracy and search time of this method, using 32 hash bits, against Flann kdtree (currently a mainstream tree-based method; the parameter nChecks = 256 is used here) on the GIST-1M and SIFT-1M data sets respectively. GIST-1M is a 384-dimensional data set of one million points, and SIFT-1M is a 128-dimensional data set of one million points. Both tables report the average search accuracy (the 1nn accuracy is the rate of finding the exact nearest neighbor; the 50nn accuracy is the rate of finding the top 50 neighbors) and the total search time over 1000 query points. Tables 1 and 2 show that on both data sets this method achieves higher accuracy than Flann kdtree while requiring less search time.
Table 1 and Table 2 (presented as images in the original publication; their numeric contents are not recoverable here)
Corresponding to the above neighbor search method, the neighbor search system of the present invention, as shown in Fig. 4, comprises two parts: offline learning and online search.
The offline unit comprises:
a training-point kernel matrix determination module, for randomly and uniformly selecting a predetermined number of data points from the data set as anchor points, obtaining a kernel matrix by computing the distances between the data points and the anchor points, and centering the kernel matrix;
a projection and threshold learning module, for learning a projection and a threshold for each of a predetermined number of binary bits, the learning of each bit comprising: computing density complementary information and balance complementary information, and learning the projection and the threshold from the centered kernel matrix and the two kinds of complementary information, the objective function being to minimize the number of data points in the small regions on either side of the hash hyperplane while keeping the buckets approximately balanced;
a hash table building module, for converting the data points of the data set into binary strings using the centered kernel matrix and the learned projections and thresholds, placing data points with identical binary strings into the same bucket, and building a hash table.
The online search unit comprises:
a query-point kernel vector determination module, for obtaining, for each query point, a centered kernel vector using the same anchor points and the same kernel-matrix column means;
a binary string conversion module, for converting each query point into a binary string using its centered kernel vector and the learned projections and thresholds;
a hash bucket lookup module, for finding, according to the binary string of the query point, a predetermined number of data points in the corresponding buckets of the hash table as the neighbors of the query point.
As a preferred embodiment, as shown in Fig. 5, the offline unit may further comprise:
a correspondence table building module, for computing the pairwise distances between all data points of the data set and sorting them, determining a predetermined number of exact nearest neighbors for each data point, and building a correspondence table from approximate nearest neighbors to exact nearest neighbors, the distance computation and sorting being: compute the pairwise distances between data points, and sort the distances of each data point to the other data points in ascending order;
a dimensionality-reduction matrix determination module, for performing principal component analysis (PCA) on the data set to obtain a PCA dimensionality-reduction matrix, and using this matrix to reduce the data set, obtaining the reduced data set.
As a preferred embodiment, as shown in Fig. 5, the online search unit may further comprise:
a query point dimensionality-reduction module, for reducing each query point with the PCA dimensionality-reduction matrix, obtaining the reduced query point;
a candidate data point determination module, for performing the distance computation and sorting between the reduced query point and the reduced data points of the predetermined number found in the corresponding buckets of the hash table, performing the distance computation and sorting again in the original dimension for the first $m_1$ data points, taking the first $m_2$ data points and looking up the approximate-NN-to-exact-NN correspondence table, taking $m_3$ candidate points for each data point, and removing duplicate data points to obtain a candidate data point set;
a neighbor determination module, for performing the distance computation and sorting between the reduced query point and the reduced candidate data points of the candidate data point set, performing the distance computation and sorting again in the original dimension for the first $m_4$ data points, and obtaining the final neighbors of the query point.
As a preferred embodiment, $m_1$ is 100, $m_2$ is 100, $m_3$ is 100, and $m_4$ is 100.
As a preferred embodiment, the corresponding buckets of the hash table are all buckets within Hamming radius 2 in the hash table.
The beneficial effects of the present invention are summarized as follows:
1. The present invention improves the accuracy of neighbor search: compared with other neighbor search methods and systems, this method and system address the hard problem of how a hashing model attains high accuracy by minimizing a learning function that counts the data points in the small regions on either side of the hash hyperplane, and by using the approximate-NN-to-exact-NN correspondence table, greatly improving search accuracy.
2. The present invention improves the speed of neighbor search: compared with other neighbor search methods and systems, this method and system make the hash hyperplanes partition the data points more evenly by adding an approximately-balanced-bucket regularization term to the learning function, and use PCA-based search filtering, thereby greatly reducing search time.
3. The present invention improves the data-compression ability of the hashing algorithm: compared with other hashing algorithms, this method and system learn complementary projections from the density complementary information and the balance complementary information, so that each bit of the hash functions has strong discriminative power; this not only further improves the accuracy and speed of neighbor search but also improves the hashing algorithm's ability to compress data.
The embodiments above express only several implementations of the present invention, and their description is relatively specific and detailed, but they shall not therefore be construed as limiting the scope of the claims. It should be pointed out that, for a person of ordinary skill in the art, several variations and improvements can be made without departing from the inventive concept, and these all belong to the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims (8)

1. A neighbor search method, characterized by comprising offline learning and online search,
the offline learning comprising the steps of:
randomly and uniformly selecting a predetermined number of data points from the data set as anchor points, obtaining a kernel matrix by computing the distances between the data points and the anchor points, and centering the kernel matrix;
learning a projection and a threshold for each of a predetermined number of binary bits, the learning of each bit comprising: computing density complementary information and balance complementary information, and learning the projection and the threshold from the centered kernel matrix and the two kinds of complementary information, the objective function being to minimize the number of data points in the small regions on either side of the hash hyperplane while keeping the buckets approximately balanced;
converting the data points of the data set into binary strings using the centered kernel matrix and the learned projections and thresholds, placing data points with identical binary strings into the same bucket, and building a hash table;
the online search comprising the steps of:
for each query point, obtaining a centered kernel vector using the same anchor points and the same kernel-matrix column means;
converting each query point into a binary string using its centered kernel vector and the learned projections and thresholds;
finding, according to the binary string of the query point, a predetermined number of data points in the corresponding buckets of the hash table as the neighbors of the query point.
2. The neighbor search method according to claim 1, characterized in that the offline learning further comprises the steps of:
computing the pairwise distances between all data points of the data set and sorting them, determining a predetermined number of exact nearest neighbors for each data point, and building a correspondence table from approximate nearest neighbors to exact nearest neighbors, the distance computation and sorting being: compute the pairwise distances between data points, and sort the distances of each data point to the other data points in ascending order;
performing principal component analysis (PCA) on the data set to obtain a PCA dimensionality-reduction matrix, and using this matrix to reduce the data set, obtaining the reduced data set;
and the online search further comprises the steps of:
reducing each query point with the PCA dimensionality-reduction matrix to obtain the reduced query point;
performing the distance computation and sorting between the reduced query point and the reduced data points of the predetermined number found in the corresponding buckets of the hash table, performing the distance computation and sorting again in the original dimension for the first $m_1$ data points, taking the first $m_2$ data points and looking up the approximate-NN-to-exact-NN correspondence table, taking $m_3$ candidate points for each data point, and removing duplicate data points to obtain a candidate data point set;
performing the distance computation and sorting between the reduced query point and the reduced candidate data points of the candidate data point set, performing the distance computation and sorting again in the original dimension for the first $m_4$ data points, and obtaining the final neighbors of the query point.
3. The neighbor search method according to claim 2, characterized in that $m_1$ is 100, $m_2$ is 100, $m_3$ is 100, and $m_4$ is 100.
4. The neighbor search method according to claim 1, 2 or 3, characterized in that the corresponding buckets of the hash table are all buckets within Hamming radius 2 in the hash table.
5. A neighbor search system, characterized in that it comprises an offline learning unit and an online search unit, wherein
The offline learning unit comprises:
A training-point kernel matrix determination module, configured to uniformly and randomly select a predetermined number of data points from the data set as anchor points, obtain a kernel matrix by computing the distances between the data points and said anchor points, and center the kernel matrix;
A projection and threshold learning module, configured to learn projections and thresholds for a predetermined number of binary bits, the learning of each binary bit comprising: computing density and bucket-balancing supplementary information, and using the centered kernel matrix together with said density and balancing information to learn a projection and a threshold, the objective function being to minimize the number of data points in the small regions on both sides of the hash hyperplane while ensuring that the buckets are approximately balanced;
A hash table building module, configured to convert the data points of the data set into binary strings by means of the centered kernel matrix and the learned projections and thresholds of the predetermined number of binary bits, place data points with identical binary strings into the same bucket, and thereby build a hash table;
The online search unit comprises:
A query-point kernel matrix determination module, configured to obtain a centered kernel matrix for each query data point by using the same said anchor points and the mean of the kernel matrix;
A binary string conversion module, configured to convert each query data point into a binary string by using the centered query-point kernel matrix and the learned projections and thresholds;
A hash bucket lookup module, configured to find, according to the binary string converted from the query data point, a predetermined number of data points in the corresponding buckets of said hash table as the neighbors of the query data point.
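A minimal sketch of the offline/online split in claim 5: anchor-based kernel features, kernel centering, per-bit projections and thresholds, and a bucket table keyed by the resulting binary strings. The patent learns the projections and thresholds from its density/balance objective; here an RBF kernel, random projections `W`, and median thresholds stand in as placeholders, so this only illustrates the data flow, not the learning.

```python
import numpy as np

def build_hash(X, n_anchors=32, n_bits=16, seed=0):
    """Offline stage: kernel matrix from anchor distances, centering,
    (placeholder) projections and thresholds, and the bucket table."""
    rng = np.random.default_rng(seed)
    anchors = X[rng.choice(len(X), n_anchors, replace=False)]
    d = np.linalg.norm(X[:, None, :] - anchors[None, :, :], axis=2)
    sigma = d.mean()
    K = np.exp(-d**2 / (2 * sigma**2))          # kernel matrix
    mean = K.mean(axis=0)                        # stored for query centering
    W = rng.standard_normal((n_anchors, n_bits)) # placeholder projections
    proj = (K - mean) @ W                        # centered kernel projections
    t = np.median(proj, axis=0)                  # thresholds that roughly balance buckets
    codes = proj > t                             # binary strings, one row per point
    table = {}
    for i, c in enumerate(codes):
        table.setdefault(c.tobytes(), []).append(i)
    return anchors, sigma, mean, W, t, table

def hash_query(q, anchors, sigma, mean, W, t):
    """Online stage: same anchors and kernel mean give the query's code."""
    d = np.linalg.norm(anchors - q, axis=1)
    K = np.exp(-d**2 / (2 * sigma**2))
    return (K - mean) @ W > t
```

Because the query reuses the training anchors, bandwidth, and kernel mean, a query identical to a training point lands in that point's bucket.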
6. The neighbor search system according to claim 5, characterized in that the offline learning unit further comprises:
A correspondence table building module, configured to perform distance calculation and sorting on all data points in the data set, determine a predetermined number of exact nearest neighbors for each data point, and build a correspondence table from approximate nearest neighbors to exact nearest neighbors, the process of distance calculation and sorting being: computing the pairwise distances between data points, and sorting the distances from each data point to the other data points in ascending order;
A dimensionality reduction matrix determination module, configured to perform principal component analysis (PCA) on the data set to obtain a PCA dimensionality reduction matrix, and to use this matrix to reduce the dimensionality of the data set, obtaining the dimensionality-reduced information of the data set;
And the online search unit further comprises:
A query point dimensionality reduction module, configured to reduce the dimensionality of each query data point by using said PCA dimensionality reduction matrix, obtaining the dimensionality-reduced information of the query data point;
A candidate data point determination module, configured to perform said distance calculation and sorting on the dimensionality-reduced information of the query data point and the dimensionality-reduced information of the predetermined number of data points found in the corresponding buckets of said hash table, perform said distance calculation and sorting again in the original dimensions on the top m1 data points, take the top m2 data points and query said correspondence table from approximate nearest neighbors to exact nearest neighbors, taking m3 candidate points for each data point, and remove duplicate data points to obtain a candidate data point set;
A neighbor determination module, configured to perform said distance calculation and sorting on the query data point and the dimensionality-reduced information of the candidate data points in said candidate data point set, take the top m4 data points, and perform said distance calculation and sorting again in the original dimensions, obtaining the final neighbors of the query data point.
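Claim 6's offline correspondence table from approximate to exact nearest neighbors, and the candidate expansion it enables online, might look like the following sketch. Illustrative Python only: the brute-force table construction and the value of `k` are our assumptions, not the patent's.

```python
import numpy as np

def build_nn_table(X, k=3):
    """Offline: for every data point, record its k exact nearest neighbors
    (excluding itself) -- the approximate-NN -> exact-NN correspondence table."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)      # a point is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]

def expand_candidates(seed_ids, nn_table):
    """Online: expand the hash-bucket survivors with their table neighbors,
    then drop duplicates to form the candidate data point set."""
    out = set(seed_ids)
    for i in seed_ids:
        out.update(nn_table[i].tolist())
    return sorted(out)
```

The expansion recovers true neighbors that hashing missed: even if only one good point survives the bucket lookup, its table entries pull the rest of its exact neighborhood into the candidate set.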
7. The neighbor search system according to claim 6, characterized in that m1 is 100, m2 is 100, m3 is 100, and m4 is 100.
8. The neighbor search system according to claim 5, 6 or 7, characterized in that the corresponding buckets of said hash table are all buckets in said hash table within a Hamming radius of 2.
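Claims 4 and 8 probe every bucket within Hamming radius 2 of the query's binary string. For an n-bit code that is 1 + n + n(n-1)/2 buckets, and enumerating them is straightforward (illustrative Python; codes are represented as tuples of 0/1 bits):

```python
from itertools import combinations

def buckets_within_radius_2(code):
    """Yield all bit strings within Hamming distance 2 of `code`,
    i.e. every bucket key probed under a Hamming radius of 2."""
    n = len(code)
    yield code                                # distance 0: the bucket itself
    for i in range(n):                        # distance 1: flip one bit
        flipped = list(code)
        flipped[i] ^= 1
        yield tuple(flipped)
    for i, j in combinations(range(n), 2):    # distance 2: flip two bits
        flipped = list(code)
        flipped[i] ^= 1
        flipped[j] ^= 1
        yield tuple(flipped)
```

For 16-bit codes this is 137 bucket lookups per query, which is why multi-probing stays cheap while sharply reducing the chance that a true neighbor is missed by a single-bucket lookup.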
CN201310011407.XA 2013-01-11 2013-01-11 Neighbor search method and system Expired - Fee Related CN103020321B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310011407.XA CN103020321B (en) 2013-01-11 2013-01-11 Neighbor search method and system

Publications (2)

Publication Number Publication Date
CN103020321A true CN103020321A (en) 2013-04-03
CN103020321B CN103020321B (en) 2015-08-19

Family

ID=47968924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310011407.XA Expired - Fee Related CN103020321B (en) 2013-01-11 2013-01-11 Neighbor search method and system

Country Status (1)

Country Link
CN (1) CN103020321B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103345491A (en) * 2013-06-26 2013-10-09 浙江大学 Method for quickly obtaining neighborhoods using hash bucket partitioning
WO2015165037A1 (en) * 2014-04-29 2015-11-05 中国科学院自动化研究所 Cascaded binary coding based image matching method
CN106777038A (en) * 2016-12-09 2017-05-31 厦门大学 Ultra-low-complexity image retrieval method based on ranking-preserving hashing
CN107729290A (en) * 2017-09-21 2018-02-23 北京大学深圳研究生院 Representation learning method for ultra-large-scale graphs using locality-sensitive hashing optimization
CN108171777A (en) * 2017-12-26 2018-06-15 广州泼墨神网络科技有限公司 Method for searching anchor points of adjacent sequence frames based on genetic algorithm
CN109299097A (en) * 2018-09-27 2019-02-01 宁波大学 Online high-dimensional data nearest neighbor search method based on hash learning
CN113377294A (en) * 2021-08-11 2021-09-10 武汉泰乐奇信息科技有限公司 Big data storage method and device based on binary data conversion

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2310321C (en) * 1997-11-17 2004-11-16 Telcordia Technologies, Inc. Method and system for determining approximate hamming distance and approximate nearest neighbors in an electronic storage device
CN102422319A (en) * 2009-03-04 2012-04-18 公立大学法人大阪府立大学 Image retrieval method, image retrieval program, and image registration method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MOUNIA LALMAS: Advances in Information Retrieval, 30 April 2006 *
ZHANG D: Self-taught hashing for fast similarity search, IN PROCEEDINGS OF THE 33RD ANNUAL INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 31 December 2010 (2010-12-31) *
LING KANG: Research on similarity search technology based on locality-sensitive hashing, Wanfang Dissertations, 31 December 2012 (2012-12-31) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103345491A (en) * 2013-06-26 2013-10-09 浙江大学 Method for quickly obtaining neighborhoods using hash bucket partitioning
CN103345491B (en) * 2013-06-26 2016-11-23 浙江大学 Method for quickly obtaining neighborhoods using hash bucket partitioning
WO2015165037A1 (en) * 2014-04-29 2015-11-05 中国科学院自动化研究所 Cascaded binary coding based image matching method
CN106777038A (en) * 2016-12-09 2017-05-31 厦门大学 Ultra-low-complexity image retrieval method based on ranking-preserving hashing
CN106777038B (en) * 2016-12-09 2019-06-14 厦门大学 Ultra-low-complexity image retrieval method based on ranking-preserving hashing
CN107729290A (en) * 2017-09-21 2018-02-23 北京大学深圳研究生院 Representation learning method for ultra-large-scale graphs using locality-sensitive hashing optimization
CN108171777A (en) * 2017-12-26 2018-06-15 广州泼墨神网络科技有限公司 Method for searching anchor points of adjacent sequence frames based on genetic algorithm
CN108171777B (en) * 2017-12-26 2021-08-10 广州泼墨神网络科技有限公司 Method for searching anchor points of adjacent sequence frames based on genetic algorithm
CN109299097A (en) * 2018-09-27 2019-02-01 宁波大学 Online high-dimensional data nearest neighbor search method based on hash learning
CN113377294A (en) * 2021-08-11 2021-09-10 武汉泰乐奇信息科技有限公司 Big data storage method and device based on binary data conversion
CN113377294B (en) * 2021-08-11 2021-10-22 武汉泰乐奇信息科技有限公司 Big data storage method and device based on binary data conversion

Also Published As

Publication number Publication date
CN103020321B (en) 2015-08-19

Similar Documents

Publication Publication Date Title
CN103020321B (en) Neighbor search method and system
CN107577990B (en) Large-scale face recognition method based on GPU (graphics processing Unit) accelerated retrieval
CN106682233B (en) Hash image retrieval method based on deep learning and local feature fusion
Popat et al. Hierarchical document clustering based on cosine similarity measure
EP3867819A1 (en) Semi-supervised person re-identification using multi-view clustering
CN108280187B (en) Hierarchical image retrieval method based on depth features of convolutional neural network
CN105808709B (en) Face recognition rapid retrieval method and device
CN102236675B (en) Method for processing matched pairs of characteristic points of images, image retrieval method and image retrieval equipment
CN106407311A (en) Method and device for obtaining search result
CN104765768A (en) Mass face database rapid and accurate retrieval method
CN105589938A (en) Image retrieval system and retrieval method based on FPGA
CN109919084B (en) Pedestrian re-identification method based on depth multi-index hash
CN107291895B (en) Quick hierarchical document query method
CN103617217A (en) Hierarchical index based image retrieval method and system
CN104573130A (en) Entity resolution method and device based on group computation
US20220414144A1 (en) Multi-task deep hash learning-based retrieval method for massive logistics product images
Mohan et al. Environment selection and hierarchical place recognition
CN103258210A (en) High-definition image classification method based on dictionary learning
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
CN105320764A (en) 3D model retrieval method and 3D model retrieval apparatus based on slow increment features
CN104731882A (en) Self-adaptive query method based on Hash code weighting ranking
CN102163285A (en) Cross-domain video semantic concept detection method based on active learning
CN109871379A (en) Online hash nearest neighbor search method based on data block learning
CN104731884A (en) Query method based on multi-feature-fusion multiple hash tables
CN112818859A (en) Deep hash-based multi-level retrieval pedestrian re-identification method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150819

Termination date: 20170111
