CN103020321A - Neighbor searching method and neighbor searching system - Google Patents


Info

Publication number
CN103020321A
CN103020321A (application CN201310011407.XA; granted publication CN103020321B)
Authority
CN
China
Prior art keywords
data
point
data point
dimensionality reduction
kernel matrix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310011407XA
Other languages
Chinese (zh)
Other versions
CN103020321B (en)
Inventor
钟海兰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
GUANGDONG TUTUSOU NETWORK TECHNOLOGY Co Ltd
Original Assignee
GUANGDONG TUTUSOU NETWORK TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GUANGDONG TUTUSOU NETWORK TECHNOLOGY Co Ltd
Priority to CN201310011407.XA
Publication of CN103020321A
Application granted
Publication of CN103020321B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a neighbor search method and system comprising offline learning and online search. A learning function counting the data points in the small regions on either side of each hash hyperplane is minimized so that the hyperplane passes through a sparse region of the data, which guarantees high accuracy in neighbor search. An approximately-balanced-bucket regularization term, derived from an approximate bucket-balance condition, is added to the learning function so that the hash hyperplanes partition the data points more evenly, which guarantees high search speed. Whether the data set is small or massive, the method and system achieve neighbor search with both high accuracy and high speed.

Description

Neighbor search method and system
Technical field
The present invention relates to the technical field of information retrieval, and in particular to a neighbor search method and system.
Background technology
With the rapid development of information technology, improved data-acquisition capability has caused both the volume and the dimensionality of data in every field to grow exponentially. This rapid growth makes search exceptionally difficult. For example, given an input picture, when we need to find identical or similar pictures in a massive image library, we must find them not only accurately but also quickly. In this example an image can be abstracted as a data point; the similarity between data points is usually measured by Euclidean distance, and neighbor search is the task of finding, under this similarity measure, the data points most similar to a query point.
Comparing the query point one by one against every point in the database guarantees retrieval precision, but becomes very slow for massive high-dimensional data. Traditional tree-based neighbor search techniques can achieve high accuracy, but their retrieval speed drops rapidly as the data dimensionality increases.
Summary of the invention
In view of the above, the present invention proposes a neighbor search method and system that improve the speed of neighbor search while guaranteeing accuracy.
A neighbor search method comprises offline learning and online search.
The offline learning comprises the steps of:
randomly and uniformly selecting a predetermined number of data points from the data set as anchor points, obtaining a kernel matrix by computing the distances between the data points and the anchor points, and centering the kernel matrix;
learning a projection and a threshold for each of a predetermined number of binary bits, the learning of each bit comprising: computing density complementary information and balance complementary information, and learning the projection and the threshold from the centered kernel matrix and the two kinds of complementary information, the objective function being to minimize the number of data points in the small regions on either side of the hash hyperplane while keeping the buckets approximately balanced;
converting the data points of the data set into binary strings using the centered kernel matrix and the learned projections and thresholds, placing data points with identical binary strings into the same bucket, and building a hash table.
The online search comprises the steps of:
for each query point, obtaining a centered kernel vector using the same anchor points and the same kernel-matrix column means;
converting each query point into a binary string using its centered kernel vector and the learned projections and thresholds;
finding, according to the binary string of the query point, a predetermined number of data points in the corresponding buckets of the hash table as the neighbors of the query point.
A neighbor search system comprises an offline unit and an online search unit.
The offline unit comprises:
a training-point kernel matrix determination module, for randomly and uniformly selecting a predetermined number of data points from the data set as anchor points, obtaining a kernel matrix by computing the distances between the data points and the anchor points, and centering the kernel matrix;
a projection and threshold learning module, for learning a projection and a threshold for each of a predetermined number of binary bits, the learning of each bit comprising: computing density complementary information and balance complementary information, and learning the projection and the threshold from the centered kernel matrix and the two kinds of complementary information, the objective function being to minimize the number of data points in the small regions on either side of the hash hyperplane while keeping the buckets approximately balanced;
a hash table building module, for converting the data points of the data set into binary strings using the centered kernel matrix and the learned projections and thresholds, placing data points with identical binary strings into the same bucket, and building a hash table.
The online search unit comprises:
a query-point kernel vector determination module, for obtaining, for each query point, a centered kernel vector using the same anchor points and the same kernel-matrix column means;
a binary string conversion module, for converting each query point into a binary string using its centered kernel vector and the learned projections and thresholds;
a hash bucket lookup module, for finding, according to the binary string of the query point, a predetermined number of data points in the corresponding buckets of the hash table as the neighbors of the query point.
In the neighbor search method and system of the present invention, minimizing a learning function that counts the data points in the small regions on either side of the hash hyperplane makes the hyperplane pass through a sparse region of the data, guaranteeing high search accuracy; adding an approximately-balanced-bucket regularization term to the learning function, derived from the approximate bucket-balance condition, makes the hash hyperplanes partition the data points more evenly, guaranteeing high search speed. Whether the data set is small or massive, the method and system perform neighbor search with both high accuracy and high speed.
Description of drawings
Fig. 1 is a flow diagram of the neighbor search method of the present invention;
Fig. 2 is a flow diagram of the offline learning of the neighbor search method of the present invention;
Fig. 3 is a flow diagram of the online search of the neighbor search method of the present invention;
Fig. 4 is a structural diagram of the neighbor search system of the present invention;
Fig. 5 is a schematic diagram of one embodiment of the neighbor search system of the present invention.
Embodiment
The present invention is a neighbor search method and system based on a hashing algorithm: data are converted into short binary strings, and a hash table is then built to enable efficient search. Since data of any dimensionality are ultimately converted into a short binary string (for example, 0110 is a 4-bit binary string), hashing-based neighbor search is insensitive to dimensionality and can retrieve high-dimensional data quickly. The present invention is explained in detail below with reference to the accompanying drawings and embodiments.
As shown in Fig. 1, the neighbor search method of the present invention comprises two stages: offline learning and online search.
The offline learning process, shown in Fig. 1, is as follows:
Step S101: randomly and uniformly select some data points from the training data set as anchor points, obtain a kernel matrix by computing the distances between the training data points and these anchor points, and center the kernel matrix.
The training data set (hereinafter simply the data set) consists of n data points of dimension d. For example, all the pixels of a 32*32 gray-level image can be concatenated into a 1024-dimensional vector, which is then a 1024-dimensional data point; alternatively, a d-dimensional feature can be extracted from a picture, and that feature is a d-dimensional data point. The purpose of selecting the anchor points randomly and uniformly is to prevent the selected anchors from concentrating in one region of the data space, i.e. the anchor points should be distributed evenly over the whole data space.
Suppose the data set X consists of n d-dimensional data points, $X=[x_1,\ldots,x_n]\in R^{d\times n}$. Uniformly selecting m data points $\Delta_1,\ldots,\Delta_m$ at random as anchor points, we compute the kernel matrix

$$K=\begin{pmatrix} k(x_1,\Delta_1) & \cdots & k(x_1,\Delta_m) \\ \vdots & \ddots & \vdots \\ k(x_n,\Delta_1) & \cdots & k(x_n,\Delta_m) \end{pmatrix},$$

where $k(\cdot,\cdot)$ is a kernel function; here the Gaussian kernel $k(x,y)=\exp(-\|x-y\|^2/2\sigma^2)$ is selected, m = 300, and $\sigma$ is chosen as the mean pairwise distance among 3000 randomly selected points. Centering the kernel matrix then gives

$$\bar K=\begin{pmatrix} k(x_1,\Delta_1)-\mu_1 & \cdots & k(x_1,\Delta_m)-\mu_m \\ \vdots & \ddots & \vdots \\ k(x_n,\Delta_1)-\mu_1 & \cdots & k(x_n,\Delta_m)-\mu_m \end{pmatrix}=\begin{pmatrix}\overline{k(x_1)}^T\\ \vdots \\ \overline{k(x_n)}^T\end{pmatrix}, \qquad \mu_i=\frac{1}{n}\sum_{j=1}^n k(x_j,\Delta_i).$$
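The kernel-matrix construction and centering of step S101 can be sketched as follows. This is an illustrative Python sketch with toy sizes, not part of the patent disclosure: the patent uses m = 300 anchors and sets σ to the mean pairwise distance among 3000 random points, and the helper name `centered_kernel_matrix` is an assumption.

```python
import numpy as np

def centered_kernel_matrix(X, anchors, sigma):
    """Gaussian kernel matrix K[i, j] = exp(-||x_i - a_j||^2 / (2 sigma^2)),
    centered by subtracting each column's mean mu_j."""
    # squared Euclidean distances between every data point and every anchor
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)
    K = np.exp(-d2 / (2.0 * sigma ** 2))
    mu = K.mean(axis=0)            # mu_j = (1/n) * sum_i k(x_i, a_j)
    return K - mu, mu              # centered kernel matrix and column means

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))     # n = 100 points of dimension d = 16 (toy sizes)
anchors = X[rng.choice(100, size=8, replace=False)]   # m = 8 anchors
Kbar, mu = centered_kernel_matrix(X, anchors, sigma=2.0)
```

By construction every column of the centered matrix has zero mean, which is what allows the same μ values to be reused for query points online.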
Step S102: learn the projection and threshold of each of several binary bits. The learning of each bit comprises: first computing the density complementary information and the balance complementary information, then learning the projection and threshold from the centered kernel matrix and these two kinds of complementary information; the objective function is to minimize the number of data points in the small regions on either side of the hash hyperplane while keeping the buckets approximately balanced.
Suppose c hash functions are to be learned so as to convert a data point into a c-bit binary string. For the k-th hash function $\mathrm{sgn}(f_k(x))=\mathrm{sgn}(p_k^T\overline{k(x)}-b_k)$, the projection $p_k$ and the threshold $b_k$ are learned by minimizing the objective

$$\sum_{i=1}^{n} u_i^k\, H\!\big(\epsilon - f_k(x_i)\,\mathrm{sgn}(f_k(x_i))\big) + \alpha\,\|V_{k-1}^T v_k\|^2,$$

where $u_i^k$ is called the density complementary information (its precise definition was given as formula images in the original publication and is not recoverable here), $H(x)$ is the unit step function, $v_k=[\mathrm{sgn}(f_k(x_1)),\ldots,\mathrm{sgn}(f_k(x_n))]^T$, and $V_{k-1}=[\mathbf{1},v_1,\ldots,v_{k-1}]$ is called the balance complementary information; $\mathrm{sgn}(x)$ is the sign function, and $\alpha$ and $\epsilon$ are input parameters of the algorithm, here $\alpha=0.1$ and $\epsilon=0.01s$, where $s$ is the mean distance of all points to the bisecting hyperplane.
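For a candidate projection and threshold, the per-bit objective above can be evaluated roughly as follows. This is a hedged sketch: the density weights $u_i^k$ are set to 1 because their exact definition is not recoverable from the publication, the function name and toy sizes are assumptions, and how the patent actually optimizes $p_k$ and $b_k$ is not shown here.

```python
import numpy as np

def hash_objective(Kbar, p, b, V_prev, alpha, eps):
    """Sketch of the per-bit learning objective: count points falling in the
    narrow margin |f(x)| < eps around the hyperplane, plus a balance penalty
    coupling the new bit to previously learned bits. Density weights u_i are
    taken as 1 (an assumption; the patent's definition is in lost images)."""
    f = Kbar @ p - b                       # f_k(x_i) for every point
    v = np.sign(f)
    v[v == 0] = 1.0
    margin_term = np.sum(eps - f * v > 0)  # H(eps - f*sgn(f)): points within eps
    balance_term = alpha * np.sum((V_prev.T @ v) ** 2)
    return margin_term + balance_term

rng = np.random.default_rng(1)
Kbar = rng.normal(size=(50, 8))            # centered kernel matrix (toy sizes)
p, b = rng.normal(size=8), 0.0
V_prev = np.ones((50, 1))                  # V_0 = [1]: penalizes unbalanced bits
obj = hash_objective(Kbar, p, b, V_prev, alpha=0.1, eps=0.01)
```

Note how the all-ones column in `V_prev` makes the regularizer penalize any bit whose +1/-1 assignments are unbalanced, which is the approximate bucket-balance condition.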
Step S103: using the centered kernel matrix and the learned projections and thresholds, convert the training data points into binary strings, place data points with identical binary strings into the same bucket, and build the hash table.
The c hash functions turn each d-dimensional data point x into a c-bit binary string; the k-th hash function maps x to the k-th bit, 0 or 1. (The explicit bit-conversion formula appeared as an image in the original publication; the standard convention is that the bit is 1 when $f_k(x)\ge 0$ and 0 otherwise.)
All data points are converted to binary strings in this manner, and data points with identical binary strings are put into the same bucket (the index of a bucket is exactly this binary string), thereby building the hash table.
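The conversion of points to binary strings and their grouping into buckets in step S103 might look like the following sketch, assuming the bit convention 1 iff $f_k(x)\ge 0$; names and sizes are illustrative.

```python
import numpy as np

def build_hash_table(Kbar, P, B):
    """Convert each point to a c-bit binary string via c hash functions
    (columns of P, thresholds B) and group equal strings into buckets."""
    bits = (Kbar @ P - B >= 0).astype(int)  # 1 iff f_k(x) >= 0 (assumed convention)
    table = {}
    for i, row in enumerate(bits):
        key = ''.join(map(str, row))        # the bucket index is the binary string
        table.setdefault(key, []).append(i)
    return table

rng = np.random.default_rng(2)
Kbar = rng.normal(size=(60, 8))             # centered kernel matrix (toy sizes)
P = rng.normal(size=(8, 4))                 # c = 4 learned projections
B = np.zeros(4)                             # c = 4 learned thresholds
table = build_hash_table(Kbar, P, B)
```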
At this point, offline learning has built the hash table, and performing online search with the learned results already guarantees high accuracy and speed. To further improve accuracy and speed, however, the following steps may also be performed, as shown in Fig. 2:
Step S104: compute the pairwise distances between all n points of the data set, and for each data point sort its distances to the other data points in ascending order, yielding for each data point a ranking of all other points. Taking the first k points of each ranking produces an n × k correspondence table from approximate nearest neighbors to exact nearest neighbors; here k = 50.
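The correspondence table of step S104 can be built with a brute-force pairwise-distance computation, sketched below with toy sizes (the patent uses k = 50 on the full data set; the function name is an assumption).

```python
import numpy as np

def exact_nn_table(X, k):
    """For each point, sort all other points by Euclidean distance and keep
    the k closest, giving an n x k approximate-NN -> exact-NN lookup table."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # all pairwise distances
    np.fill_diagonal(d2, np.inf)     # exclude each point from its own neighbor list
    return np.argsort(d2, axis=1)[:, :k]

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
table = exact_nn_table(X, k=5)
```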
Step S105: perform principal component analysis (PCA) on the data set to obtain a PCA dimensionality-reduction matrix (a d × d′ matrix); multiplying the data set by this matrix yields the reduced data set (an n × d′ matrix). The value of d′ needs to be tuned per data set: it is 40 on GIST-1M and 32 on SIFT-1M.
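The PCA dimensionality reduction of step S105 can be sketched via the singular value decomposition (the function name and toy sizes are assumptions; the patent uses d′ = 40 on GIST-1M and d′ = 32 on SIFT-1M).

```python
import numpy as np

def pca_reduce(X, d_prime):
    """Project the data onto its d' leading principal components.
    Returns the d x d' projection matrix and the reduced n x d' data."""
    Xc = X - X.mean(axis=0)
    # rows of Vt are principal directions, ordered by singular value
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:d_prime].T               # the d x d' dimensionality-reduction matrix
    return W, Xc @ W

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 10))        # n = 80, d = 10 (toy sizes)
W, Xr = pca_reduce(X, d_prime=4)
```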
The online search process, shown in Fig. 1, is as follows:
Step S201: for each query point, obtain its centered kernel vector using the same anchor points.
For a d-dimensional query point x, using the same anchor points $\Delta_1,\ldots,\Delta_m$ and the same kernel-matrix column means $\mu_1,\ldots,\mu_m$, the centered kernel vector of x is obtained as $\overline{k(x)}^T=\big(k(x,\Delta_1)-\mu_1,\ \ldots,\ k(x,\Delta_m)-\mu_m\big)$.
Step S202: using the centered kernel vector and the learned projections and thresholds, convert each query point into a binary string.
The c hash functions learned offline are applied to the centered kernel vector $\overline{k(x)}$ of x, turning x into a c-bit binary string; the k-th hash function maps x to the k-th bit, 0 or 1, in the same manner as in step S103.
Step S203: according to the binary string obtained in step S202, find in the hash table all buckets within Hamming radius r (a Hamming radius of r means the bucket index may differ from the binary string obtained in step S202 in at most r bits; here r = 2), and take out the data points in these buckets.
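Enumerating all bucket indices within Hamming radius r of the query string, as in step S203, can be sketched as follows (r = 2 per the patent; the helper name is an assumption).

```python
from itertools import combinations

def buckets_within_radius(key, r):
    """All binary strings differing from `key` in at most r positions --
    the bucket indices probed during search (the patent uses r = 2)."""
    c = len(key)
    out = []
    for radius in range(r + 1):
        for flips in combinations(range(c), radius):
            bits = list(key)
            for i in flips:
                bits[i] = '1' if bits[i] == '0' else '0'  # flip the chosen bit
            out.append(''.join(bits))
    return out

probe = buckets_within_radius('0110', 2)   # 1 + C(4,1) + C(4,2) = 11 buckets
```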
The data points taken out of the buckets can already serve as the neighbors of the query point, but at this stage they are unordered. To rank the neighbors and further improve search accuracy and speed, the online search may also comprise the following steps, as shown in Fig. 3:
Step S204: multiply each query point by the PCA dimensionality-reduction matrix learned offline to obtain the reduced query point (a d′-dimensional vector).
Step S205: compute the distances between the reduced query point and the reduced candidate data points, sort them in ascending order, and take the first $m_1$ data points; recompute their distances in the original dimension, sort again in ascending order, and take the first $m_2$ points. For each of these $m_2$ points, look up the approximate-NN-to-exact-NN correspondence table and take $m_3$ candidate points per point; after removing duplicate points this yields a new candidate data point set. Preferably, $m_1=100$, $m_2=10$, $m_3=50$.
Step S206: compute the distances between the reduced query point and the new reduced candidate data points, sort them in ascending order, take the first $m_4$ data points, recompute and sort their distances in the original dimension, and finally obtain the neighbors of the query point. Preferably, $m_4=100$.
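The coarse-to-fine re-ranking of steps S205–S206 can be sketched in simplified form as follows. The correspondence-table expansion between the two stages is omitted for brevity, and all names and sizes are assumptions, not the patent's implementation.

```python
import numpy as np

def rerank(q, X, W, candidates, m1, m2):
    """Two-stage filtering sketch: sort candidates by distance in the PCA
    subspace, keep the m1 closest, then re-sort those m1 by full-dimensional
    distance and keep the m2 closest."""
    qr, Xr = q @ W, X[candidates] @ W
    d_low = ((Xr - qr) ** 2).sum(axis=1)         # cheap d'-dimensional distances
    stage1 = [candidates[i] for i in np.argsort(d_low)[:m1]]
    d_full = ((X[stage1] - q) ** 2).sum(axis=1)  # exact d-dimensional distances
    return [stage1[i] for i in np.argsort(d_full)[:m2]]

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 10))
W = np.linalg.qr(rng.normal(size=(10, 4)))[0]    # stand-in for the PCA matrix
q = X[0] + 0.01 * rng.normal(size=10)            # query very near point 0
result = rerank(q, X, W, candidates=list(range(200)), m1=20, m2=5)
```

The design point illustrated here is the one the patent relies on: the cheap d′-dimensional pass prunes most candidates, so the expensive full-dimensional distances are computed only for a short list.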
Tables 1 and 2 compare the search accuracy and search time of this method, using 32 hash bits, against Flann kdtree (currently a mainstream tree-based method; the parameter nChecks = 256 is used here) on the GIST-1M and SIFT-1M data sets respectively. GIST-1M is a 384-dimensional data set of one million points, and SIFT-1M is a 128-dimensional data set of one million points. Both tables report the average search accuracy (the 1nn accuracy is the rate of finding the exact nearest neighbor; the 50nn accuracy is the rate of finding the top 50 neighbors) and the total search time over 1000 query points. Tables 1 and 2 show that on both data sets this method achieves higher accuracy than Flann kdtree while requiring less search time.
Table 1 and Table 2 (presented as images in the original publication; their numeric contents are not recoverable here)
Corresponding to the above neighbor search method, the neighbor search system of the present invention, as shown in Fig. 4, comprises two parts: offline learning and online search.
The offline unit comprises:
a training-point kernel matrix determination module, for randomly and uniformly selecting a predetermined number of data points from the data set as anchor points, obtaining a kernel matrix by computing the distances between the data points and the anchor points, and centering the kernel matrix;
a projection and threshold learning module, for learning a projection and a threshold for each of a predetermined number of binary bits, the learning of each bit comprising: computing density complementary information and balance complementary information, and learning the projection and the threshold from the centered kernel matrix and the two kinds of complementary information, the objective function being to minimize the number of data points in the small regions on either side of the hash hyperplane while keeping the buckets approximately balanced;
a hash table building module, for converting the data points of the data set into binary strings using the centered kernel matrix and the learned projections and thresholds, placing data points with identical binary strings into the same bucket, and building a hash table.
The online search unit comprises:
a query-point kernel vector determination module, for obtaining, for each query point, a centered kernel vector using the same anchor points and the same kernel-matrix column means;
a binary string conversion module, for converting each query point into a binary string using its centered kernel vector and the learned projections and thresholds;
a hash bucket lookup module, for finding, according to the binary string of the query point, a predetermined number of data points in the corresponding buckets of the hash table as the neighbors of the query point.
As a preferred embodiment, as shown in Fig. 5, the offline unit may further comprise:
a correspondence table building module, for computing the pairwise distances between all data points of the data set and sorting them, determining a predetermined number of exact nearest neighbors for each data point, and building a correspondence table from approximate nearest neighbors to exact nearest neighbors, the distance computation and sorting being: compute the pairwise distances between data points, and sort the distances of each data point to the other data points in ascending order;
a dimensionality-reduction matrix determination module, for performing principal component analysis (PCA) on the data set to obtain a PCA dimensionality-reduction matrix, and using this matrix to reduce the data set, obtaining the reduced data set.
As a preferred embodiment, as shown in Fig. 5, the online search unit may further comprise:
a query point dimensionality-reduction module, for reducing each query point with the PCA dimensionality-reduction matrix, obtaining the reduced query point;
a candidate data point determination module, for performing the distance computation and sorting between the reduced query point and the reduced data points of the predetermined number found in the corresponding buckets of the hash table, performing the distance computation and sorting again in the original dimension for the first $m_1$ data points, taking the first $m_2$ data points and looking up the approximate-NN-to-exact-NN correspondence table, taking $m_3$ candidate points for each data point, and removing duplicate data points to obtain a candidate data point set;
a neighbor determination module, for performing the distance computation and sorting between the reduced query point and the reduced candidate data points of the candidate data point set, performing the distance computation and sorting again in the original dimension for the first $m_4$ data points, and obtaining the final neighbors of the query point.
As a preferred embodiment, $m_1$ is 100, $m_2$ is 100, $m_3$ is 100, and $m_4$ is 100.
As a preferred embodiment, the corresponding buckets of the hash table are all buckets within Hamming radius 2 in the hash table.
The beneficial effects of the present invention are summarized as follows:
1. The present invention improves the accuracy of neighbor search: compared with other neighbor search methods and systems, this method and system address the hard problem of how a hashing model attains high accuracy by minimizing a learning function that counts the data points in the small regions on either side of the hash hyperplane, and by using the approximate-NN-to-exact-NN correspondence table, greatly improving search accuracy.
2. The present invention improves the speed of neighbor search: compared with other neighbor search methods and systems, this method and system make the hash hyperplanes partition the data points more evenly by adding an approximately-balanced-bucket regularization term to the learning function, and use PCA-based search filtering, thereby greatly reducing search time.
3. The present invention improves the data-compression ability of the hashing algorithm: compared with other hashing algorithms, this method and system learn complementary projections from the density complementary information and the balance complementary information, so that each bit of the hash functions has strong discriminative power; this not only further improves the accuracy and speed of neighbor search but also improves the hashing algorithm's ability to compress data.
The embodiments above express only several implementations of the present invention, and their description is relatively specific and detailed, but they shall not therefore be construed as limiting the scope of the claims. It should be pointed out that, for a person of ordinary skill in the art, several variations and improvements can be made without departing from the inventive concept, and these all belong to the protection scope of the present invention. Therefore, the protection scope of this patent shall be determined by the appended claims.

Claims (8)

1. A neighbor search method, characterized by comprising offline learning and online search,
the offline learning comprising the steps of:
randomly and uniformly selecting a predetermined number of data points from the data set as anchor points, obtaining a kernel matrix by computing the distances between the data points and the anchor points, and centering the kernel matrix;
learning a projection and a threshold for each of a predetermined number of binary bits, the learning of each bit comprising: computing density complementary information and balance complementary information, and learning the projection and the threshold from the centered kernel matrix and the two kinds of complementary information, the objective function being to minimize the number of data points in the small regions on either side of the hash hyperplane while keeping the buckets approximately balanced;
converting the data points of the data set into binary strings using the centered kernel matrix and the learned projections and thresholds, placing data points with identical binary strings into the same bucket, and building a hash table;
the online search comprising the steps of:
for each query point, obtaining a centered kernel vector using the same anchor points and the same kernel-matrix column means;
converting each query point into a binary string using its centered kernel vector and the learned projections and thresholds;
finding, according to the binary string of the query point, a predetermined number of data points in the corresponding buckets of the hash table as the neighbors of the query point.
2. The neighbor search method according to claim 1, characterized in that the offline learning further comprises the steps of:
computing the pairwise distances between all data points of the data set and sorting them, determining a predetermined number of exact nearest neighbors for each data point, and building a correspondence table from approximate nearest neighbors to exact nearest neighbors, the distance computation and sorting being: compute the pairwise distances between data points, and sort the distances of each data point to the other data points in ascending order;
performing principal component analysis (PCA) on the data set to obtain a PCA dimensionality-reduction matrix, and using this matrix to reduce the data set, obtaining the reduced data set;
and the online search further comprises the steps of:
reducing each query point with the PCA dimensionality-reduction matrix to obtain the reduced query point;
performing the distance computation and sorting between the reduced query point and the reduced data points of the predetermined number found in the corresponding buckets of the hash table, performing the distance computation and sorting again in the original dimension for the first $m_1$ data points, taking the first $m_2$ data points and looking up the approximate-NN-to-exact-NN correspondence table, taking $m_3$ candidate points for each data point, and removing duplicate data points to obtain a candidate data point set;
performing the distance computation and sorting between the reduced query point and the reduced candidate data points of the candidate data point set, performing the distance computation and sorting again in the original dimension for the first $m_4$ data points, and obtaining the final neighbors of the query point.
3. The neighbor search method according to claim 2, characterized in that $m_1$ is 100, $m_2$ is 100, $m_3$ is 100, and $m_4$ is 100.
4. The neighbor search method according to claim 1, 2 or 3, characterized in that the corresponding buckets of the hash table are all buckets within Hamming radius 2 in the hash table.
5. A neighbor search system, characterized in that it comprises an offline learning unit and an online search unit, wherein
The offline learning unit comprises:
A training-point kernel matrix determination module, configured to uniformly and randomly select a predetermined number of data points from the data set as anchor points, obtain a kernel matrix by computing the distances between the data points and said anchor points, and center the kernel matrix;
A projection and threshold learning module, configured to learn projections and thresholds for a predetermined number of binary bits, the learning of each binary bit comprising: computing density and bucket-balancing supplementary information, and using the centered kernel matrix together with said density and balancing information to learn a projection and a threshold, the objective function being to minimize the number of data points in the small regions on both sides of the hash hyperplane while ensuring that the buckets are approximately balanced;
A hash table building module, configured to convert the data points of the data set into binary strings by means of the centered kernel matrix and the learned projections and thresholds of the predetermined number of binary bits, place data points with identical binary strings into the same bucket, and thereby build a hash table;
The online search unit comprises:
A query-point kernel matrix determination module, configured to obtain a centered kernel matrix for each query data point by using the same said anchor points and the mean of the kernel matrix;
A binary string conversion module, configured to convert each query data point into a binary string by using the centered query-point kernel matrix and the learned projections and thresholds;
A hash bucket lookup module, configured to find, according to the binary string converted from the query data point, a predetermined number of data points in the corresponding buckets of said hash table as the neighbors of the query data point.
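A minimal sketch of the offline/online split in claim 5: anchor-based kernel features, kernel centering, per-bit projections and thresholds, and a bucket table keyed by the resulting binary strings. The patent learns the projections and thresholds from its density/balance objective; here an RBF kernel, random projections `W`, and median thresholds stand in as placeholders, so this only illustrates the data flow, not the learning.

```python
import numpy as np

def build_hash(X, n_anchors=32, n_bits=16, seed=0):
    """Offline stage: kernel matrix from anchor distances, centering,
    (placeholder) projections and thresholds, and the bucket table."""
    rng = np.random.default_rng(seed)
    anchors = X[rng.choice(len(X), n_anchors, replace=False)]
    d = np.linalg.norm(X[:, None, :] - anchors[None, :, :], axis=2)
    sigma = d.mean()
    K = np.exp(-d**2 / (2 * sigma**2))          # kernel matrix
    mean = K.mean(axis=0)                        # stored for query centering
    W = rng.standard_normal((n_anchors, n_bits)) # placeholder projections
    proj = (K - mean) @ W                        # centered kernel projections
    t = np.median(proj, axis=0)                  # thresholds that roughly balance buckets
    codes = proj > t                             # binary strings, one row per point
    table = {}
    for i, c in enumerate(codes):
        table.setdefault(c.tobytes(), []).append(i)
    return anchors, sigma, mean, W, t, table

def hash_query(q, anchors, sigma, mean, W, t):
    """Online stage: same anchors and kernel mean give the query's code."""
    d = np.linalg.norm(anchors - q, axis=1)
    K = np.exp(-d**2 / (2 * sigma**2))
    return (K - mean) @ W > t
```

Because the query reuses the training anchors, bandwidth, and kernel mean, a query identical to a training point lands in that point's bucket.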
6. The neighbor search system according to claim 5, characterized in that the offline learning unit further comprises:
A correspondence table building module, configured to perform distance calculation and sorting on all data points in the data set, determine a predetermined number of exact nearest neighbors for each data point, and build a correspondence table from approximate nearest neighbors to exact nearest neighbors, the process of distance calculation and sorting being: computing the pairwise distances between data points, and sorting the distances from each data point to the other data points in ascending order;
A dimensionality reduction matrix determination module, configured to perform principal component analysis (PCA) on the data set to obtain a PCA dimensionality reduction matrix, and to use this matrix to reduce the dimensionality of the data set, obtaining the dimensionality-reduced information of the data set;
And the online search unit further comprises:
A query point dimensionality reduction module, configured to reduce the dimensionality of each query data point by using said PCA dimensionality reduction matrix, obtaining the dimensionality-reduced information of the query data point;
A candidate data point determination module, configured to perform said distance calculation and sorting on the dimensionality-reduced information of the query data point and the dimensionality-reduced information of the predetermined number of data points found in the corresponding buckets of said hash table, perform said distance calculation and sorting again in the original dimensions on the top m1 data points, take the top m2 data points and query said correspondence table from approximate nearest neighbors to exact nearest neighbors, taking m3 candidate points for each data point, and remove duplicate data points to obtain a candidate data point set;
A neighbor determination module, configured to perform said distance calculation and sorting on the query data point and the dimensionality-reduced information of the candidate data points in said candidate data point set, take the top m4 data points, and perform said distance calculation and sorting again in the original dimensions, obtaining the final neighbors of the query data point.
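Claim 6's offline correspondence table from approximate to exact nearest neighbors, and the candidate expansion it enables online, might look like the following sketch. Illustrative Python only: the brute-force table construction and the value of `k` are our assumptions, not the patent's.

```python
import numpy as np

def build_nn_table(X, k=3):
    """Offline: for every data point, record its k exact nearest neighbors
    (excluding itself) -- the approximate-NN -> exact-NN correspondence table."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)      # a point is not its own neighbor
    return np.argsort(d, axis=1)[:, :k]

def expand_candidates(seed_ids, nn_table):
    """Online: expand the hash-bucket survivors with their table neighbors,
    then drop duplicates to form the candidate data point set."""
    out = set(seed_ids)
    for i in seed_ids:
        out.update(nn_table[i].tolist())
    return sorted(out)
```

The expansion recovers true neighbors that hashing missed: even if only one good point survives the bucket lookup, its table entries pull the rest of its exact neighborhood into the candidate set.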
7. The neighbor search system according to claim 6, characterized in that m1 is 100, m2 is 100, m3 is 100, and m4 is 100.
8. The neighbor search system according to claim 5, 6 or 7, characterized in that the corresponding buckets of said hash table are all buckets in said hash table within a Hamming radius of 2.
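Claims 4 and 8 probe every bucket within Hamming radius 2 of the query's binary string. For an n-bit code that is 1 + n + n(n-1)/2 buckets, and enumerating them is straightforward (illustrative Python; codes are represented as tuples of 0/1 bits):

```python
from itertools import combinations

def buckets_within_radius_2(code):
    """Yield all bit strings within Hamming distance 2 of `code`,
    i.e. every bucket key probed under a Hamming radius of 2."""
    n = len(code)
    yield code                                # distance 0: the bucket itself
    for i in range(n):                        # distance 1: flip one bit
        flipped = list(code)
        flipped[i] ^= 1
        yield tuple(flipped)
    for i, j in combinations(range(n), 2):    # distance 2: flip two bits
        flipped = list(code)
        flipped[i] ^= 1
        flipped[j] ^= 1
        yield tuple(flipped)
```

For 16-bit codes this is 137 bucket lookups per query, which is why multi-probing stays cheap while sharply reducing the chance that a true neighbor is missed by a single-bucket lookup.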
CN201310011407.XA 2013-01-11 2013-01-11 Neighbor search method and system Expired - Fee Related CN103020321B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310011407.XA CN103020321B (en) 2013-01-11 2013-01-11 Neighbor search method and system

Publications (2)

Publication Number Publication Date
CN103020321A true CN103020321A (en) 2013-04-03
CN103020321B CN103020321B (en) 2015-08-19

Family

ID=47968924

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310011407.XA Expired - Fee Related CN103020321B (en) 2013-01-11 2013-01-11 Neighbor search method and system

Country Status (1)

Country Link
CN (1) CN103020321B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103345491A (en) * 2013-06-26 2013-10-09 浙江大学 Method for quickly obtaining neighborhoods using hash bucket partitioning
WO2015165037A1 (en) * 2014-04-29 2015-11-05 中国科学院自动化研究所 Cascaded binary coding based image matching method
CN106777038A (en) * 2016-12-09 2017-05-31 厦门大学 Ultra-low-complexity image retrieval method based on ranking-preserving hashing
CN107729290A (en) * 2017-09-21 2018-02-23 北京大学深圳研究生院 Representation learning method for ultra-large-scale graphs using locality-sensitive hashing optimization
CN108171777A (en) * 2017-12-26 2018-06-15 广州泼墨神网络科技有限公司 Method for searching anchor points of adjacent sequence frames based on genetic algorithm
CN109299097A (en) * 2018-09-27 2019-02-01 宁波大学 Online high-dimensional data nearest neighbor search method based on hash learning
CN113377294A (en) * 2021-08-11 2021-09-10 武汉泰乐奇信息科技有限公司 Big data storage method and device based on binary data conversion

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2310321C (en) * 1997-11-17 2004-11-16 Telcordia Technologies, Inc. Method and system for determining approximate hamming distance and approximate nearest neighbors in an electronic storage device
CN102422319A (en) * 2009-03-04 2012-04-18 公立大学法人大阪府立大学 Image retrieval method, image retrieval program, and image registration method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MOUNIA LALMAS: Advances in Information Retrieval, 30 April 2006 *
ZHANG D: Self-taught hashing for fast similarity search, IN PROCEEDINGS OF THE 33RD ANNUAL INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, 31 December 2010 (2010-12-31) *
LING KANG: Research on similarity search technology based on locality-sensitive hashing, Wanfang Dissertations, 31 December 2012 (2012-12-31) *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103345491A (en) * 2013-06-26 2013-10-09 浙江大学 Method for quickly obtaining neighborhoods using hash bucket partitioning
CN103345491B (en) * 2013-06-26 2016-11-23 浙江大学 Method for quickly obtaining neighborhoods using hash bucket partitioning
WO2015165037A1 (en) * 2014-04-29 2015-11-05 中国科学院自动化研究所 Cascaded binary coding based image matching method
CN106777038A (en) * 2016-12-09 2017-05-31 厦门大学 Ultra-low-complexity image retrieval method based on ranking-preserving hashing
CN106777038B (en) * 2016-12-09 2019-06-14 厦门大学 Ultra-low-complexity image retrieval method based on ranking-preserving hashing
CN107729290A (en) * 2017-09-21 2018-02-23 北京大学深圳研究生院 Representation learning method for ultra-large-scale graphs using locality-sensitive hashing optimization
CN108171777A (en) * 2017-12-26 2018-06-15 广州泼墨神网络科技有限公司 Method for searching anchor points of adjacent sequence frames based on genetic algorithm
CN108171777B (en) * 2017-12-26 2021-08-10 广州泼墨神网络科技有限公司 Method for searching anchor points of adjacent sequence frames based on genetic algorithm
CN109299097A (en) * 2018-09-27 2019-02-01 宁波大学 Online high-dimensional data nearest neighbor search method based on hash learning
CN113377294A (en) * 2021-08-11 2021-09-10 武汉泰乐奇信息科技有限公司 Big data storage method and device based on binary data conversion
CN113377294B (en) * 2021-08-11 2021-10-22 武汉泰乐奇信息科技有限公司 Big data storage method and device based on binary data conversion

Also Published As

Publication number Publication date
CN103020321B (en) 2015-08-19

Similar Documents

Publication Publication Date Title
CN103020321B (en) Neighbor search method and system
CN107577990B (en) Large-scale face recognition method based on GPU (graphics processing Unit) accelerated retrieval
CN106682233B (en) Hash image retrieval method based on deep learning and local feature fusion
Popat et al. Hierarchical document clustering based on cosine similarity measure
EP3867819A1 (en) Semi-supervised person re-identification using multi-view clustering
CN108280187B (en) Hierarchical image retrieval method based on depth features of convolutional neural network
CN105808709B (en) Face recognition rapid retrieval method and device
CN102236675B (en) Method for processing matched pairs of characteristic points of images, image retrieval method and image retrieval equipment
CN106407311A (en) Method and device for obtaining search result
CN104765768A (en) Mass face database rapid and accurate retrieval method
CN105589938A (en) Image retrieval system and retrieval method based on FPGA
CN109919084B (en) Pedestrian re-identification method based on depth multi-index hash
CN107291895B (en) Quick hierarchical document query method
CN103617217A (en) Hierarchical index based image retrieval method and system
CN104573130A (en) Entity resolution method and device based on group computation
US20220414144A1 (en) Multi-task deep hash learning-based retrieval method for massive logistics product images
Mohan et al. Environment selection and hierarchical place recognition
CN103258210A (en) High-definition image classification method based on dictionary learning
CN110751027B (en) Pedestrian re-identification method based on deep multi-instance learning
CN105320764A (en) 3D model retrieval method and 3D model retrieval apparatus based on slow increment features
CN104731882A (en) Self-adaptive query method based on Hash code weighting ranking
CN102163285A (en) Cross-domain video semantic concept detection method based on active learning
CN109871379A (en) Online hash nearest neighbor search method based on data block learning
CN104731884A (en) Query method based on multi-feature-fusion multiple hash tables
CN112818859A (en) Deep hash-based multi-level retrieval pedestrian re-identification method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20150819

Termination date: 20170111
