CN105574214B

CN105574214B - Similarity retrieval method based on IDistance with fine-grained bit code filtering

Info

Publication number: CN105574214B
Application number: CN201610124087.2A
Authority: CN
Inventors: 袁鑫攀; 汪灿飞; 何岸; 向平; 向一平; 朱艳辉; 满君丰; 李长云
Original assignee: Hunan University of Technology
Current assignee: HUNAN YUN ZHI IOT NETWORKTECHNOLOGY Co.,Ltd.
Priority date: 2016-03-04
Filing date: 2016-03-04
Publication date: 2019-04-09
Anticipated expiration: 2036-03-04
Also published as: CN105574214A

Abstract

The present invention proposes a similarity retrieval method based on IDistance with fine-grained bit code (FGBC) filtering. When the index is built, the method partitions each cluster into finer-grained regions, each corresponding to one FGBC code, and uses the FGBC codes to filter the candidate set of the ring search more precisely. Compared with BC code filtering, FGBC-IDistance can reduce the number of distance calculations to as little as 1/2^(2d). In terms of the number of distance calculations: FGBC-IDistance ≤ BC-IDistance ≤ IDistance.

Description

A similarity retrieval method based on IDistance with fine-grained bit code filtering

Technical field

The present invention relates to the field of data indexing, and more particularly to a similarity retrieval method based on IDistance with fine-grained bit code (FGBC) filtering.

Background technique

IDistance is a high-dimensional vector indexing method based on metric space. The basic idea of its index construction is as follows: several anchor points are chosen in the data space, and each anchor point corresponds to one cluster subset. Every vector in the data space is assigned to the cluster subset of the anchor point nearest to it. Each high-dimensional vector is then converted, through its distance to its anchor point, into a comparable one-dimensional key value iDist, and a B⁺-tree organizes the key values of all high-dimensional vectors. The key value iDist is computed as:

iDist(x) = dist(P_i, x) + i*c

where x is any vector, P_i is the anchor point of the cluster subset containing x, dist() is the Euclidean distance function, and iDist() is the one-dimensional key value function.

As shown in Figure 1, P₀, P₁ and P₂ are anchor points; C_i is the distance from anchor point P_i to the farthest vector in its vector subset, i.e. the radius of the vector subset of P_i; c is a constant greater than every C_i.
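The key mapping above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `idist_key`, `anchors` and `c`, and the sample coordinates, are assumptions introduced here.

```python
import math

def idist_key(x, anchors, c):
    """Return (i, key): index of the nearest anchor and iDist(x) = dist(P_i, x) + i*c."""
    dists = [math.dist(p, x) for p in anchors]
    i = min(range(len(anchors)), key=dists.__getitem__)
    return i, dists[i] + i * c

anchors = [(0.0, 0.0), (10.0, 0.0)]   # hypothetical anchor points P_0, P_1
c = 100.0                             # constant assumed larger than every radius C_i
print(idist_key((3.0, 4.0), anchors, c))    # nearest anchor is P_0
print(idist_key((10.0, 2.0), anchors, c))   # nearest anchor is P_1
```

Because c exceeds every C_i, the keys of cluster i fall in [i*c, i*c + C_i], so the key ranges of different clusters do not overlap in the one-dimensional index.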

Let the full data set be D. A similarity range query retrieves the set of vectors whose distance to the query point q is less than the radius r: Range(q, r) = {x ∈ D : dist(q, x) < r}, where dist(q, x) denotes the distance from the query point q to a vector x.

The retrieval process of IDistance is as follows:

(1) Compute the distance from q to each anchor point P_i to determine whether the search circle of the query point q intersects the vector subset of P_i.

The condition for intersection is: dist(q, P_i) < C_i + r

The condition for non-intersection is: dist(q, P_i) > C_i + r

If they do not intersect, the vector subset of that anchor point contains no target points.

If they intersect, determine the ring search range in terms of the distance (dist) from anchor point P_i:

{x ∈ P_i : max(dist(P_i, q) - r, 0) < dist(P_i, x) < min(dist(P_i, q) + r, C_i)}

and hence the search range of iDist:

{x ∈ P_i : i*c + max(dist(P_i, q) - r, 0) < iDist(x) < i*c + min(dist(P_i, q) + r, C_i)}

The set of vectors retrieved is the candidate set.

(2) Compute the distance between q and each vector in the candidate set; a vector whose distance is less than r enters the final result set.
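The two retrieval steps can be sketched as below. A sorted list searched with `bisect` stands in for the B⁺-tree, which is an assumption made to keep the example short. The function names and sample data are likewise illustrative. The key range is searched with closed bounds, which can only enlarge the candidate set and so does not lose results.

```python
import bisect
import math

def build_index(data, anchors, c):
    """Return (entries sorted by iDist key, per-anchor radii C_i)."""
    entries, radii = [], [0.0] * len(anchors)
    for x in data:
        dists = [math.dist(p, x) for p in anchors]
        i = min(range(len(anchors)), key=dists.__getitem__)
        radii[i] = max(radii[i], dists[i])
        entries.append((dists[i] + i * c, x))
    entries.sort()
    return entries, radii

def range_query(q, r, entries, radii, anchors, c):
    keys = [k for k, _ in entries]
    result = []
    for i, p in enumerate(anchors):
        dq = math.dist(p, q)
        if dq >= radii[i] + r:            # search circle misses this cluster
            continue
        lo = i * c + max(dq - r, 0.0)     # ring range mapped to iDist keys
        hi = i * c + min(dq + r, radii[i])
        start = bisect.bisect_left(keys, lo)
        stop = bisect.bisect_right(keys, hi)
        for _, x in entries[start:stop]:  # candidate set
            if math.dist(q, x) < r:       # step (2): exact distance check
                result.append(x)
    return result

data = [(1.0, 0.0), (0.0, 2.0), (9.0, 0.0), (10.0, 3.0)]
anchors = [(0.0, 0.0), (10.0, 0.0)]
entries, radii = build_index(data, anchors, c=100.0)
hits = range_query((1.0, 1.0), 1.5, entries, radii, anchors, c=100.0)
print(hits)
```

In this sample the second cluster is skipped by the intersection test alone, so no distance to its vectors is computed.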

By choosing anchor points, IDistance neatly reduces the indexing problem of high-dimensional vectors to one dimension and organizes the one-dimensional index with a B⁺-tree. It searches quickly and saves a large number of distance calculations.

BC (bit code)-IDistance adds, on top of IDistance, an orientation code relative to the coordinate axes. The code consists of binary digits and is called a bit code (BC) for short. Adding the BC code to the one-dimensional key value structure allows the candidate set to be filtered quickly, which avoids further distance calculations. As shown in Figure 2, the BC codes of the regions are 00, 01, 10 and 11, i.e. BC₀=00, BC₁=01, BC₂=10, BC₃=11, and a vector located in a BC code region carries the corresponding BC code.

BC-IDistance adds a filtering step (2.1) to retrieval step (2) of IDistance:

(2.1) Apply BC code filtering to each vector in the candidate set.

The filtering criterion is whether the search circle of q intersects the BC code region of an anchor point P_i: if it intersects, the region is not filtered; if it does not intersect, the region is filtered out.

Definition of the BC code distance lower bound:

Let q(q₁, q₂, …, q_d) be the query point and P_i(P_i1, P_i2, …, P_id) an anchor point; the distance lower bound from q to region k of anchor point P_i is defined as minBC(P_i, k, q).

where δ_j is the distance between the query point q and P_i in dimension j; q_j is the coordinate of q in dimension j; P_ij is the coordinate of anchor point P_i in dimension j; BC_j is the value of the BC code of region k in dimension j; and BC_qj is the value of the BC code of the region containing q in dimension j.

Taking the example in Figure 2, assume that the coordinates of P₀ are (1,1) and the coordinates of q are (5,0), so that the bit code of q is 10.

minBC(P₀, 0, q)=4；

minBC(P₀, 1, q)=(1²+4²)^(1/2)=17^(1/2)；

minBC(P₀, 2, q)=0；

minBC(P₀, 3, q)=1；

The condition for the search circle of q to intersect region k of anchor point P_i is:

minBC(P_i,k,q)<r

(2.2) Compute the distance between q and each vector in the filtered candidate set; a vector whose distance is less than r enters the final result set.
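The BC lower bound can be sketched as follows. The defining formula is not reproduced in this text, so the sketch assumes δ_j = |q_j - P_ij| when the bits of q and region k differ in dimension j, δ_j = 0 otherwise, and minBC equal to the square root of the sum of δ_j². It also assumes that a bit is 1 when the coordinate is not less than the anchor coordinate. Both assumptions are consistent with the bit code 10 of q and with the worked values 17^(1/2), 0 and 1 in the example. The function names are illustrative.

```python
import math

def bc_code(x, anchor):
    """One bit per dimension: 1 if x_j >= P_ij, else 0 (assumed convention)."""
    return tuple(int(xj >= pj) for xj, pj in zip(x, anchor))

def min_bc(anchor, region_code, q):
    """Lower bound on the distance from q to any point of the given BC region."""
    q_code = bc_code(q, anchor)
    total = 0.0
    for qj, pj, bj, bqj in zip(q, anchor, region_code, q_code):
        if bj != bqj:                  # region lies across the anchor's axis
            total += (qj - pj) ** 2
    return math.sqrt(total)

p0, q = (1, 1), (5, 0)
print(bc_code(q, p0))                  # bits of q in each dimension
for code in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(code, min_bc(p0, code, q))   # a region is kept only if the bound is < r
```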

Because BC codes record positional relationships, BC-IDistance can filter the candidate set quickly by comparing BC codes. However, the granularity of the BC-IDistance code is coarse: in each dimension it can filter out at most half of the data even in the best case, and a slightly larger query radius often makes the search circle intersect the axis of some dimension of the anchor point, so that this dimension loses its filtering effect.

Summary of the invention

To overcome the above shortcomings of the prior-art BC-IDistance, the present invention proposes a similarity retrieval method based on IDistance with fine-grained bit code filtering.

To solve the above technical problem, the technical solution of the present invention is as follows:

A similarity retrieval method based on IDistance with fine-grained bit code filtering, comprising the following steps:

S1. Establish the index structure of FGBC-IDistance;

S11. On both sides of the anchor point P_i(P_i1, P_i2, …, P_ij, …, P_id) in each dimension, select two further points as secondary anchor points, denoted ((L₁,R₁), (L₂,R₂), …, (L_j,R_j), …, (L_d,R_d)), with R_j > L_j and 1 ≤ j ≤ d, where P_ij is the value of anchor point P_i in the j-th dimension, and L_j and R_j are the two secondary anchor points of P_i in the j-th dimension;

S12. Fine-grained bit code FGBC: let the anchor point of the cluster subspace to which a vector S(S₁, S₂, …, S_d) belongs be P_i(P_i1, P_i2, …, P_ij, …, P_id). The FGBC code of S is denoted B_S(b_S11b_S12, b_S21b_S22, …, b_Sj1b_Sj2, …, b_Sd1b_Sd2), where b_Sj1b_Sj2 satisfies formula (1):

where b_Sj1b_Sj2 is the bit code of vector S in the j-th dimension of anchor point P_i, and S_j is the value of S in the j-th dimension;

S13. Establish the index structure diagram of FGBC-IDistance;

S2. Retrieve on the basis of the FGBC-IDistance index structure diagram; the retrieval process is:

S21. Compute the distance to each anchor point P_i to determine whether the search circle of the query point q intersects the vector subset of P_i;

The condition for intersection is: dist(q, P_i) < C_i + r

The condition for non-intersection is: dist(q, P_i) > C_i + r

where dist(q, P_i) denotes the distance from the query point q to anchor point P_i, C_i is the distance from P_i to the farthest vector in its vector subset, and r is the radius of the search circle of q;

where x denotes any vector;

and hence the search range of iDist is determined:

S22. Apply FGBC code filtering to each vector in the candidate set;

The filtering criterion is whether the search circle of the query point q intersects the FGBC code region of anchor point P_i: if it intersects, the region is not filtered; if it does not intersect, the region is filtered out;

S23. Compute the distance between q and each vector in the filtered candidate set; a vector whose distance is less than r enters the final result set.

Preferably, in step S12 each dimension of the cluster subspace of anchor point P_i is divided into four regions, so each dimension produces a bit code of length 2; d-dimensional data therefore produce a bit code of length 2d, and the bit code divides the whole cluster subspace into 2^(2d) small regions.

Preferably, in step S13 the index structure diagram of FGBC-IDistance is divided into a B⁺-tree layer and an FGBC code layer.

Preferably, when step S22 applies FGBC code filtering to each vector in the candidate set, the distance lower bound of the FGBC code must also be determined:

Let q(q₁, q₂, …, q_d) be the query point and P_i(P_i1, P_i2, …, P_id) an anchor point; the distance lower bound from q to region k of anchor point P_i is defined as minBC(P_i, k, q);

where δ_j is the distance between the query point q and P_i in dimension j;

q_j is the coordinate of the query point q in dimension j;

P_ij is the coordinate of anchor point P_i in dimension j;

L_j is the coordinate of one secondary anchor point of P_i in dimension j;

R_j is the coordinate of the other secondary anchor point of P_i in dimension j;

b_sj1b_sj2 is the value of the FGBC code of region k in dimension j;

b_qj1b_qj2 is the value of the FGBC code of the region containing the query point q in dimension j.

Compared with the prior art, the beneficial effect of the technical solution of the present invention is as follows: the similarity retrieval method based on IDistance with fine-grained bit code filtering (FGBC-IDistance) divides a two-dimensional space into 16 regions, each corresponding to one FGBC code. Because FGBC-IDistance partitions the space at a finer granularity than BC-IDistance, its filtering effect is better than that of BC-IDistance.

Description of the drawings

Fig. 1 is a schematic diagram of the IDistance index structure.

Fig. 2 is a schematic diagram of bit codes in two-dimensional space.

Fig. 3 is a schematic diagram of fine-grained bit codes in two-dimensional space.

Fig. 4 is the FGBC-IDistance index structure diagram.

Fig. 5 shows the filtering effect of IDistance in two-dimensional space.

Fig. 6 shows the filtering effect of BC-IDistance in two-dimensional space.

Fig. 7 shows the filtering effect of FGBC-IDistance in two-dimensional space.

Fig. 8 is a histogram of the measured numbers of distance calculations of the three methods IDistance, BC-IDistance and FGBC-IDistance.

Fig. 9 is the implementation flow chart of the present invention.

Specific embodiment

The drawings are for illustrative purposes only and shall not be construed as limiting this patent. To better illustrate this embodiment, some components in the drawings may be omitted, enlarged or reduced, and do not represent the dimensions of an actual product.

Those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings. The technical solution of the present invention is further described below with reference to the drawings and embodiments.

A similarity retrieval method based on IDistance with fine-grained bit code filtering, as shown in Fig. 9, comprises the following steps:

Step 1: Establish the FGBC index

Definition 1: Secondary anchor point

Secondary anchor points are defined as follows: on both sides of the anchor point P_i(P_i1, P_i2, …, P_id) in each dimension, two further points are selected, denoted ((L₁,R₁), (L₂,R₂), …, (L_d,R_d)), with R_j > L_j (1 ≤ j ≤ d), where j denotes the j-th dimension. As shown in Fig. 3(a), these three points divide each dimension into four parts. P₀ is the anchor point, and L₁ and R₁ are the two secondary anchor points of P₀ in the first dimension (j = 1).

Definition 2: Fine-grained bit code FGBC

Let the anchor point of the cluster subspace to which a vector S(S₁, S₂, …, S_d) belongs be P_i(P_i1, P_i2, …, P_id), and let L_j and R_j (1 ≤ j ≤ d) be the two secondary anchor points of P_i in the j-th dimension. The FGBC code of S is denoted B_S(b_S11b_S12, b_S21b_S22, …, b_Sj1b_Sj2, …, b_Sd1b_Sd2), where b_Sj1b_Sj2 satisfies formula (1):

where b_Sj1b_Sj2 is the bit code of vector S in the j-th dimension of anchor point P_i, and S_j is the value of S in the j-th dimension.

The FGBC code divides each dimension of the cluster subspace of anchor point P_i into four regions, so each dimension produces a bit code of length 2. d-dimensional data produce a bit code of length 2d, and the bit code divides the whole cluster subspace into 2^(2d) small regions. As shown in Fig. 3(a), P₀ is the center of the whole cluster subspace, and the dimension is divided into 4 subspaces whose bit codes are 00, 01, 10 and 11 respectively.

In Fig. 3(b), each dimension of the two-dimensional space has 4 bit codes. The bit codes shown in black are the codes in the first dimension, and the bit codes shown in red are the codes in the second dimension. These codes divide the cluster subspace into 16 regions, and the code of each region is its FGBC code.
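The encoding can be sketched as follows. Formula (1) is not reproduced in this text, so the sketch assumes L_j < P_ij < R_j and half-open intervals: 00 for S_j < L_j, 01 for L_j ≤ S_j < P_ij, 10 for P_ij ≤ S_j < R_j, and 11 for S_j ≥ R_j. This matches the inequalities used in the correctness proof for codes 00 and 11; the treatment of the boundary at P_ij is an assumption. The function name and sample values are illustrative.

```python
def fgbc_code(s, anchor, left, right):
    """Return the FGBC code of vector s as a tuple of 2-bit strings, one per dimension."""
    code = []
    for sj, pj, lj, rj in zip(s, anchor, left, right):
        if sj < lj:
            code.append("00")
        elif sj < pj:
            code.append("01")
        elif sj < rj:
            code.append("10")
        else:
            code.append("11")
    return tuple(code)

anchor = (0.0, 0.0)                        # hypothetical anchor point P_i
left, right = (-2.0, -2.0), (2.0, 2.0)     # hypothetical secondary anchors L_j, R_j
print(fgbc_code((3.0, -1.0), anchor, left, right))
print(fgbc_code((-5.0, 0.0), anchor, left, right))
```

With d = 2 the code has 2d = 4 bits, so there are 4² = 16 distinct codes, one per region of Fig. 3(b).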

Index structure diagram

The index structure diagram of FGBC-IDistance is shown in Fig. 4. As the diagram shows, the index structure of FGBC-IDistance is divided into a B⁺-tree layer and an FGBC code layer. I/O and Euclidean distance calculation are the two relatively time-consuming steps. The bit code layer occupies less space than the data layer, so filtering through the bit code layer serves two purposes: first, it reduces the amount of I/O access; second, it reduces the number of Euclidean distance calculations.
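One possible in-memory layout for the two layers is sketched below. It is an assumption-laden illustration: a sorted list stands in for the B⁺-tree, the FGBC code is packed into an integer with 2 bits per dimension, and the names `Entry`, `pack_fgbc` and `build_fgbc_index`, the secondary anchor coordinates and the sample data are all hypothetical. The text does not state how the secondary anchors are chosen.

```python
import math
from typing import NamedTuple

class Entry(NamedTuple):
    key: float    # iDist key (ordering of the B+-tree layer)
    code: int     # packed FGBC code, 2 bits per dimension (code layer)
    vec_id: int   # pointer into the data layer

def pack_fgbc(s, anchor, left, right):
    """Pack the per-dimension codes 00/01/10/11 into one integer (assumes L_j < P_ij < R_j)."""
    code = 0
    for sj, pj, lj, rj in zip(s, anchor, left, right):
        bits = (sj >= lj) + (sj >= pj) + (sj >= rj)   # 0..3
        code = (code << 2) | bits
    return code

def build_fgbc_index(data, anchors, lefts, rights, c):
    entries = []
    for vec_id, x in enumerate(data):
        dists = [math.dist(p, x) for p in anchors]
        i = min(range(len(anchors)), key=dists.__getitem__)
        code = pack_fgbc(x, anchors[i], lefts[i], rights[i])
        entries.append(Entry(dists[i] + i * c, code, vec_id))
    return sorted(entries)

data = [(3.0, -1.0), (9.0, 0.5)]
anchors = [(0.0, 0.0), (10.0, 0.0)]
lefts = [(-2.0, -2.0), (8.0, -2.0)]
rights = [(2.0, 2.0), (12.0, 2.0)]
index = build_fgbc_index(data, anchors, lefts, rights, c=100.0)
for e in index:
    print(e.vec_id, round(e.key, 3), format(e.code, "04b"))
```

Keeping only the key, the compact code and an identifier in each entry means the filter can run without reading the full vectors from the data layer.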

Step 2: Retrieval based on FGBC-IDistance

(1) Compute the distance to each anchor point P_i to determine whether the search circle of q intersects the vector subset of P_i.

The condition for intersection is: dist(q, P_i) < C_i + r

The condition for non-intersection is: dist(q, P_i) > C_i + r

(2) If they do not intersect, the vector subset of that anchor point contains no target points; if they intersect, determine the ring search range. The ring search range is:

{x ∈ P_i : max(dist(P_i, q) - r, 0) < dist(P_i, x) < min(dist(P_i, q) + r, C_i)}

(3) Determine the search range of iDist so that the B⁺-tree can be searched quickly; the vectors found enter the candidate set. The search range of iDist is:

{x ∈ P_i : i*c + max(dist(P_i, q) - r, 0) < iDist(x) < i*c + min(dist(P_i, q) + r, C_i)}

(4) Apply FGBC code filtering to each vector in the candidate set.

The filtering criterion is whether the search circle of q intersects the FGBC code region of an anchor point P_i: if it intersects, the region is not filtered; if it does not intersect, the region is filtered out.

Definition of the distance lower bound of the FGBC code:

Let q(q₁, q₂, …, q_d) be the query point and P_i(P_i1, P_i2, …, P_id) an anchor point; the distance lower bound from the query point q to region k of anchor point P_i is defined as minBC(P_i, k, q).

where δ_j is the distance between the query point q and P_i in dimension j;

q_j is the coordinate of the query point q in dimension j;

P_ij is the coordinate of anchor point P_i in dimension j;

L_j is the coordinate of one secondary anchor point of P_i in dimension j;

R_j is the coordinate of the other secondary anchor point of P_i in dimension j;

b_sj1b_sj2 is the value of the FGBC code of region k in dimension j;

The correctness of formula (2) and formula (3) is proved below:

1) When b_qj1b_qj2 = b_sj1b_sj2, q and S have the same FGBC code in the j-th dimension and lie in the same region in that dimension, so the distance lower bound is 0;

2) When b_qj1b_qj2 ≠ b_sj1b_sj2 and b_qj1 = b_sj1 = 1, the value of (b_qj1b_qj2, b_sj1b_sj2) is (10,11) or (11,10). As shown in Fig. 3(a), q_j and s_j must lie on opposite sides of R_j, so (q_j - s_j)² ≥ (q_j - R_j)². When b_sj1b_sj2 = 11 and b_qj1b_qj2 = 00, then q_j < L_j, s_j ≥ R_j and L_j < R_j, so s_j - q_j ≥ R_j - q_j and thus (q_j - s_j)² ≥ (q_j - R_j)²;

3) When b_qj1b_qj2 ≠ b_sj1b_sj2 and b_qj1 = b_sj1 = 0, the value of (b_qj1b_qj2, b_sj1b_sj2) is (00,01) or (01,00). As shown in Fig. 3(a), q_j and s_j must lie on opposite sides of L_j, so (q_j - s_j)² ≥ (q_j - L_j)². When b_sj1b_sj2 = 00 and b_qj1b_qj2 = 11, then q_j ≥ R_j, s_j < L_j and L_j < R_j, so q_j - s_j > q_j - L_j and thus (q_j - s_j)² > (q_j - L_j)²;

4) The remaining cases are (b_qj1b_qj2 = 01 and b_sj1b_sj2 = 10) or (b_qj1b_qj2 = 10 and b_sj1b_sj2 = 01). As shown in Fig. 3(a), q_j and s_j must lie on opposite sides of P_ij, so (q_j - s_j)² ≥ (q_j - P_ij)².

Since the FGBC codes of q and S differ, at least one of cases 2), 3) and 4) applies, which completes the proof.

The condition for the search circle of q to intersect region k of anchor point P_i is:

minBC(P_i, k, q) < r

The filtering effect is shown in Figure 7.

A further optimization can be applied on top of the above filtering: to decide whether a vector S in an intersecting region can be filtered out, only the FGBC code distance lower bound from the query point q to S needs to be computed, without performing the time-consuming distance calculation. If the query radius is less than the distance lower bound, S can be filtered out.

(5) Compute the distance between q and each vector in the filtered candidate set; a vector whose distance is less than r enters the final result set.
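Steps (4) and (5) can be sketched as follows. Formulas (2) and (3) are not reproduced in this text, so the sketch uses a per-dimension interval distance as the lower bound: the gap between q_j and the nearest edge of the interval covered by the region's 2-bit code. This is an assumption that agrees with the cases treated in the proof (R_j for 10/11, L_j for 00/01, P_ij for 01/10) but is not claimed to be the patent's exact formula. The names and sample values are illustrative.

```python
import math

def interval(code, pj, lj, rj):
    """Coordinate interval covered by a 2-bit code in one dimension (assumes L_j < P_ij < R_j)."""
    return {"00": (-math.inf, lj), "01": (lj, pj),
            "10": (pj, rj), "11": (rj, math.inf)}[code]

def min_fgbc(anchor, left, right, region_code, q):
    """Lower bound on the distance from q to any point of the FGBC region."""
    total = 0.0
    for qj, pj, lj, rj, code in zip(q, anchor, left, right, region_code):
        lo, hi = interval(code, pj, lj, rj)
        if qj < lo:
            total += (lo - qj) ** 2
        elif qj > hi:
            total += (qj - hi) ** 2
    return math.sqrt(total)

def fgbc_filter(candidates, anchor, left, right, q, r):
    """Step (4): keep (code, vector) pairs whose region can intersect the search circle."""
    return [(code, x) for code, x in candidates
            if min_fgbc(anchor, left, right, code, q) < r]

anchor, left, right = (0, 0), (-2, -2), (2, 2)   # hypothetical P_i, L_j, R_j
q, r = (3, -1), 1.5
candidates = [(("11", "01"), (2.5, -0.5)),
              (("10", "01"), (1.0, -1.0)),
              (("00", "01"), (-3.0, -1.0))]
kept = fgbc_filter(candidates, anchor, left, right, q, r)
result = [x for _, x in kept if math.dist(q, x) < r]   # step (5): exact check
print(kept)
print(result)
```

In this sample the region coded (00, 01) is discarded from its code alone, while the vector in region (10, 01) passes the filter and is rejected only by the exact distance check in step (5).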

Performance advantages of FGBC-IDistance

IDistance filters on the basis of the triangle inequality. As shown in Figure 5, the candidate set obtained by a range query consists of the vectors on the blue ring, and every vector in the ring requires a distance calculation against the query point. The ring has the largest area and therefore requires the most distance calculations.

To address this problem, BC-IDistance records the position of each data point with a BC code. As shown in Fig. 6, the two-dimensional space is divided into 4 regions, and during a range query the regions that do not intersect the query circle can be filtered out. Compared with IDistance, this removes part of the distance calculations.

In Fig. 7, FGBC-IDistance divides the two-dimensional space into 16 regions, each corresponding to one FGBC code. Because FGBC-IDistance partitions the space at a finer granularity than BC-IDistance, its filtering effect is better than that of BC-IDistance.

Finer partitioning of the space is not always better: too fine a partition increases both the additional computational complexity and the space complexity, and thereby degrades performance.

In all three methods, IDistance, BC-IDistance and FGBC-IDistance, the intersecting cluster subspaces are computed first and the B⁺-tree is then used to find the nodes within the key range. This step is identical in the three methods, and the number of nodes found is also the same. Euclidean distance calculation is a relatively time-consuming process. If one Euclidean distance calculation is regarded as a time-consuming atomic operation, the number of distance calculations of BC-IDistance can be reduced to as little as 1/2^d; in the worst case every node requires a distance calculation, which is the number of distance calculations of IDistance. Owing to its finer coding, the number of distance calculations of FGBC-IDistance can be reduced to as little as 1/2^(2d), and at the least to 1/2^d, i.e. the number of distance calculations of BC-IDistance. In terms of the number of distance calculations: FGBC-IDistance ≤ BC-IDistance ≤ IDistance. The measured numbers of distance calculations are shown in Figure 8.
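The region counts behind these fractions can be worked through as below. The best-case fractions assume, for illustration only, that candidates are spread evenly over the regions and that the search circle touches a single region; real data will generally fall between the best and worst cases. The function name is hypothetical.

```python
def region_counts(d):
    """Regions per cluster subspace and best-case fraction of candidates kept."""
    bc, fgbc = 2 ** d, 2 ** (2 * d)
    return {
        "bc_regions": bc,               # 1 bit per dimension
        "fgbc_regions": fgbc,           # 2 bits per dimension
        "bc_best_fraction": 1 / bc,
        "fgbc_best_fraction": 1 / fgbc,
        "fgbc_code_bits": 2 * d,
    }

for d in (2, 8):
    print(d, region_counts(d))
```

For d = 2 this gives the 4 regions of Fig. 6 and the 16 regions of Fig. 7. The code length grows only linearly (2d bits), while the number of regions grows as 2^(2d).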

The same or similar reference numerals correspond to the same or similar components; the positional relationships described in the drawings are for illustrative purposes only and shall not be construed as limiting this patent.

Obviously, the above embodiments of the present invention are merely examples given to illustrate the invention clearly and do not limit its embodiments. Those of ordinary skill in the art can make other variations or changes in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the scope of protection of the claims of the present invention.

Claims

1. A similarity retrieval method based on IDistance with fine-grained bit code filtering, characterized by comprising the following steps:

S1. Establish the index structure diagram of FGBC-IDistance;

S11. On both sides of the anchor point P_i(P_i1, P_i2, …, P_ij, …, P_id) in each dimension, select two further points as secondary anchor points, denoted ((L₁,R₁), (L₂,R₂), …, (L_j,R_j), …, (L_d,R_d)), with R_j > L_j and 1 ≤ j ≤ d, where P_ij is the value of anchor point P_i in the j-th dimension, and L_j and R_j are the two secondary anchor points of P_i in the j-th dimension;

S13. Establish the index structure diagram;

S21. Obtain the candidate set by IDistance retrieval:

Compute the distance to each anchor point P_i to determine whether the search circle of the query point q intersects the vector subset of P_i;

The condition for intersection is: dist(q, P_i) < C_i + r

The condition for non-intersection is: dist(q, P_i) > C_i + r

where dist(q, P_i) denotes the distance from the query point q to anchor point P_i, C_i is the distance from P_i to the farthest vector in its vector subset, and r is the radius of the search circle of q;

where x denotes any vector;

and hence the search range of iDist is determined:

{x ∈ P_i : i*c + max(dist(P_i, q) - r, 0) < iDist(x) < i*c + min(dist(P_i, q) + r, C_i)}; the set of vectors retrieved is the candidate set;

The filtering criterion is whether the search circle of the query point q intersects the FGBC code region of anchor point P_i: if it intersects, the region is not filtered; if it does not intersect, the region is filtered out;

The FGBC code regions are obtained as follows: the FGBC code divides each dimension of the cluster subspace of anchor point P_i into 4 regions, so each dimension produces a bit code of length 2; d-dimensional data therefore produce a bit code of length 2d, and the bit code divides the whole cluster subspace into 2^(2d) small regions;

S23. Compute the distance between the query point q and each vector in the filtered candidate set; a vector whose distance is less than r enters the final result set.

2. The method according to claim 1, characterized in that in step S12 each dimension of the cluster subspace of anchor point P_i is divided into 4 regions, so each dimension produces a bit code of length 2; d-dimensional data therefore produce a bit code of length 2d, and the bit code divides the whole cluster subspace into 2^(2d) small regions.

3. The method according to claim 1, characterized in that in step S13 the index structure diagram of FGBC-IDistance is divided into a B⁺-tree layer and an FGBC code layer.

4. The method according to claim 1, characterized in that when step S22 applies FGBC code filtering to each vector in the candidate set, the distance lower bound of the FGBC code must also be determined:

Let q(q₁, q₂, …, q_d) be the query point and P_i(P_i1, P_i2, …, P_id) an anchor point; the distance lower bound from q to region k of anchor point P_i is defined as minBC(P_i, k, q);

where δ_j is the distance between the query point q and P_i in dimension j;

q_j is the coordinate of the query point q in dimension j;

P_ij is the coordinate of anchor point P_i in dimension j;

L_j is the coordinate of one secondary anchor point of P_i in dimension j;

b_sj1b_sj2 is the value of the FGBC code of region k in dimension j;