CN103500165B

CN103500165B - High-dimensional vector retrieval method combining clustering and dual key values

Info

Publication number: CN103500165B
Application number: CN201310365592.2A
Authority: CN
Inventors: 吕锐; 杨丽芳; 曹学会; 黄祥林; 成鹏; 龚昊; 史欣萍
Original assignee: XINHUA NEWS AGENCY; Communication University of China
Current assignee: XINHUA NEWS AGENCY; Communication University of China
Priority date: 2013-08-21
Filing date: 2013-08-21
Publication date: 2016-08-31
Anticipated expiration: 2033-08-21
Also published as: CN103500165A

Abstract

The present invention is a high-dimensional vector retrieval method combining clustering and dual key values. It proposes a cluster-combined dual-key index structure, the CDKB-tree: a clustering algorithm first partitions the high-dimensional vector set into clusters, and a dual-key extended B⁺-tree is then built for each cluster, forming the CDKB-tree. During retrieval, only the clusters that intersect the query range need to be searched, so clustering provides a first filtering stage; the main key and the auxiliary key (the dual key values) then provide two further key-value filtering stages. Similarity matching against the query vector is computed only for those high-dimensional vectors whose main key and auxiliary key both fall within the search range. By relying on cluster filtering and simple comparisons of the two key values, the proposed index structure greatly reduces the amount of similarity-matching computation and greatly speeds up retrieval.

Description

A high-dimensional vector retrieval method combining clustering and dual key values

Technical field

The invention belongs to data processing fields such as multimedia information retrieval, intelligent information processing, and data mining, and in particular relates to a high-dimensional vector retrieval method combining clustering and dual key values.

Background art

With the development of computer and information technology, massive amounts of multimedia data have been produced, and quickly finding the required information in a massive multimedia database is one of the key problems in current multimedia database research. The traditional approach is to annotate the multimedia data manually and then realize multimedia information retrieval through text retrieval. Manual annotation, however, is labor-intensive and highly subjective; for explosively growing multimedia data, purely manual annotation is not feasible, so content-based multimedia information retrieval technology needs to be studied.

The technical route for content-based multimedia information retrieval is as follows: a feature transformation maps each multimedia object to a point (a feature vector) in a high-dimensional space, and these feature vectors describe the multimedia objects and make up the feature database; the feature vector of the query object is then extracted with the same feature transformation; finally, similarity search over multimedia information is realized through similarity matching between feature vectors. Similarity search over multimedia information is thus converted into the process of finding, in the high-dimensional feature space, the set of points nearest to a given query point.

To find the set of points closest to a given query point in a high-dimensional space, the simplest and most intuitive method is sequential scan: each feature (high-dimensional vector) in the feature database is matched against the query point in turn, and the best-matching feature points are returned as the retrieval result. The computation time of sequential scan grows linearly with the number of features in the feature database and with the feature dimension; when the feature database is very large, sequential scan cannot meet real-time requirements. The most common way to speed up retrieval is to use high-dimensional indexing techniques.
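As a baseline for what follows, sequential scan can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and Euclidean distance is assumed as the similarity measure.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sequential_scan(features, query, radius):
    """Compare the query with every stored vector; cost grows linearly
    with both the number of vectors and their dimension."""
    return [i for i, v in enumerate(features) if euclidean(v, query) <= radius]
```

Every vector in the feature database is touched once per query, which is the cost the index structures discussed below try to avoid.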

To manage massive numbers of high-dimensional vectors, researchers have proposed a large number of index structures, the most classical of which is the R-tree family, represented by the R-tree. The R-tree, proposed by Guttman in the 1980s, is an index structure designed for managing multidimensional spatial data. It is a height-balanced tree in which each node is represented by the minimum bounding rectangle (MBR) of all the data in that node, and the actual data appear only in the leaf nodes. With extensions, this index structure can also be used to manage data in high-dimensional spaces. During a query, the search proceeds downward from the root level to the leaf level; the minimum distance between the query vector and the MBR of each node is computed to judge whether the query range intersects that node, which enables pruning, so that only subtrees that may contain results are searched and retrieval is accelerated. This index structure allows nodes to overlap in space, which hurts its query efficiency. To improve on the R-tree, researchers have successively proposed index structures such as the R⁺-tree, R*-tree, SS-tree, SR-tree, X-tree, and A-tree. However, the query efficiency of these tree index structures drops sharply as the feature dimension increases and can even fall below that of sequential scan; this is the so-called "curse of dimensionality".

Besides tree structures, there are also index structures based on high-dimensional to one-dimensional transformation, such as the pyramid technique, NB-tree, iDistance, and iMinMax. Such a structure maps each high-dimensional vector to a one-dimensional value (called a key value) according to some rule and then manages these key values with a one-dimensional B⁺-tree, in which the key values are arranged in order at the leaf level. During a query, the query key value of the query vector is first computed with the same transformation rule; the start and end positions of the key values to be searched are then determined from the query range; the high-dimensional vectors corresponding to these key values are scanned in turn, the similarity between the query vector and each of them is computed, and the most similar high-dimensional vectors are returned as the retrieval result. It follows from this query process that an index structure based on high-dimensional to one-dimensional transformation performs at least as well as sequential scan in every case, and extensive experiments by earlier researchers show that the performance of this kind of index structure degrades only slowly as the dimension and the data volume increase.
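The single-key scheme can be illustrated with a minimal Python sketch. It is a simplification, not any of the named methods: the key is assumed to be the Euclidean distance to one reference point, a sorted list searched with `bisect` stands in for the B⁺-tree leaf level, and all function names are hypothetical. The search interval [qk - r, qk + r] follows from the triangle inequality, which is an interpretation of how the range is derived rather than something the text spells out.

```python
import bisect
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_single_key_index(vectors, ref):
    # Sorted (key, vector_id) pairs play the role of the ordered leaf level.
    return sorted((euclidean(v, ref), i) for i, v in enumerate(vectors))

def range_query_single_key(index, vectors, ref, query, radius):
    qk = euclidean(query, ref)  # query key value, same mapping rule
    # Any vector within `radius` of the query has a key in [qk - r, qk + r].
    lo = bisect.bisect_left(index, (qk - radius, -1))
    hi = bisect.bisect_right(index, (qk + radius, len(vectors)))
    candidates = [i for _, i in index[lo:hi]]
    hits = [i for i in candidates if euclidean(vectors[i], query) <= radius]
    return sorted(hits), len(candidates)
```

With vectors `(1, 0)`, `(0, 1)` and `(-1, 0)` and the origin as reference point, all three share the key 1.0, so a query near `(1, 0)` still has to compute three full distances to find one result.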

Index structures based on high-dimensional to one-dimensional transformation, such as the pyramid technique, NB-tree, iDistance, and iMinMax, filter and prune by simple comparisons of a single key value. They need no complex distance computation for filtering and therefore achieve relatively high retrieval efficiency, but the transformation loses a large amount of information, so different vectors may share the same one-dimensional key value. A single key value can filter out only a small fraction of the data, so the final similarity-matching stage still involves a large amount of computation and the query cost remains considerable.

Summary of the invention

The object of the invention is to propose a high-dimensional vector retrieval method combining clustering and dual key values. The method uses a clustering algorithm to partition the high-dimensional space into clusters and then maps each high-dimensional vector in each cluster to a pair of one-dimensional key values. During a query, clustering filters out the clusters that do not intersect the query region; within each remaining cluster, an additional key-value filtering layer applies a further simple key-value comparison to prune again. This considerably reduces the computation of the final vector similarity matching and significantly speeds up queries.

The overall idea of the invention is as follows. A clustering algorithm first partitions the high-dimensional vector set into clusters. Two reference points are then chosen for each cluster, and each high-dimensional vector in the cluster is mapped to a pair of one-dimensional key values, namely its distances to the cluster's two reference points. Within a cluster, the key value obtained from one reference point is uniformly taken as the main key and the key value obtained from the other as the auxiliary key. A B⁺-tree is then built for each cluster from its main keys; each main key at the leaf level of each B⁺-tree is bound to a pointer to its corresponding auxiliary key, and each auxiliary key is bound to a pointer to its corresponding high-dimensional vector. During retrieval, only the clusters that intersect the query range need to be searched. In each such cluster, the query vector is mapped to a query main key and a query auxiliary key using the same two reference points and the same mapping method; the query main key and the query range determine the main-key search range in that cluster, and the query auxiliary key and the query range determine the auxiliary-key search range. Finally, similarity matching against the query vector is computed only for those high-dimensional vectors that pass the main-key filter and whose auxiliary keys lie within the auxiliary-key search range; the most similar vectors are returned as the retrieval result.
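The dual-key mapping described above can be sketched as follows. This is a hedged illustration: the function names are hypothetical, Euclidean distance is assumed, and the filter interval of plus or minus the query radius around each query key is an assumption based on the triangle inequality.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dual_keys(vector, ref_main, ref_aux):
    """Map a vector to (main key, auxiliary key): its distances to the
    two reference points of its cluster."""
    return euclidean(vector, ref_main), euclidean(vector, ref_aux)

def passes_key_filters(keys, query_keys, radius):
    """Cheap scalar test applied before any full distance computation."""
    (mk, ak), (qmk, qak) = keys, query_keys
    return abs(mk - qmk) <= radius and abs(ak - qak) <= radius
```

For example, with reference points `(0, 0)` and `(2, 0)`, the vectors `(1, 0)`, `(0, 1)` and `(-1, 0)` all have main key 1.0 but auxiliary keys of about 1.0, 2.24 and 3.0, so the second key separates vectors that the first key cannot tell apart.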

The specific innovation is as follows: the high-dimensional space is partitioned into clusters, and two reference points are chosen in each cluster so that every high-dimensional vector receives a pair of one-dimensional key values. Cluster filtering and two simple key-value comparisons greatly reduce the number of high-dimensional vectors that finally take part in similarity matching and significantly speed up queries.

The specific steps of the method are as follows:

(1) A clustering algorithm partitions the high-dimensional vector set into clusters, yielding the cluster centre and cluster radius of each cluster.

(2) A dual-key extended B⁺-tree is built for each cluster as follows. Two reference points are first chosen for the cluster, and each high-dimensional vector in the cluster is mapped to a pair of one-dimensional key values, namely its distances to the two reference points. Within the cluster, the key value obtained from one reference point is uniformly taken as the main key and the other as the auxiliary key. A B⁺-tree is then built for the cluster from its main keys; each main key at the leaf level of this B⁺-tree is bound to a pointer to its corresponding auxiliary key, and each auxiliary key is bound to a pointer to its corresponding high-dimensional vector. All main keys at the leaf level of the B⁺-tree form the main key layer, and all auxiliary keys form the auxiliary key layer.

(3) The cluster centre and cluster radius of each cluster are bound to a pointer to the dual-key extended B⁺-tree built for that cluster, forming the CDKB-tree.

(4) During retrieval, the query range is used to filter out the clusters that do not intersect the query region, and each cluster that intersects the query range is searched. The search within a cluster proceeds as follows: the query vector is mapped to a query main key and a query auxiliary key using the same reference points and mapping method; the query main key and the query range determine the start and end positions of the search in the cluster's main key layer; the query auxiliary key and the query range determine the auxiliary-key search range in the cluster's auxiliary key layer; the main keys between the start and end positions of the main key layer are then scanned one by one, and for each main key it is judged whether its auxiliary key lies within the auxiliary-key search range; if it does, similarity matching is computed between the high-dimensional vector corresponding to that auxiliary key and the query vector. The high-dimensional vectors that satisfy the query range are returned as the retrieval result.
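Steps (1) to (4) can be put together in a compact Python sketch. Several things here are assumptions rather than part of the source: the cluster centres are taken as given (for example from a K-means run), the origin and the cluster centre are used as the two reference points, sorted lists searched with `bisect` stand in for each B⁺-tree, the cluster test is a ball-intersection check, and all class and function names are hypothetical.

```python
import bisect
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class Cluster:
    def __init__(self, centre):
        self.centre = centre
        self.radius = 0.0
        self.ref_main = (0.0,) * len(centre)  # assumed reference point 1: origin
        self.ref_aux = centre                 # assumed reference point 2: centre
        self.main_keys = []  # sorted main keys (stand-in for B+-tree leaves)
        self.entries = []    # parallel list: (auxiliary key, vector)

    def insert(self, vector):
        mk = euclidean(vector, self.ref_main)
        ak = euclidean(vector, self.ref_aux)
        pos = bisect.bisect_right(self.main_keys, mk)
        self.main_keys.insert(pos, mk)
        self.entries.insert(pos, (ak, vector))
        self.radius = max(self.radius, euclidean(vector, self.centre))

    def range_search(self, query, r):
        qmk = euclidean(query, self.ref_main)
        qak = euclidean(query, self.ref_aux)
        # Main-key filter: scan only main keys in [qmk - r, qmk + r].
        start = bisect.bisect_left(self.main_keys, qmk - r)
        end = bisect.bisect_right(self.main_keys, qmk + r)
        hits = []
        for ak, vec in self.entries[start:end]:
            # Auxiliary-key filter first, full distance only for survivors.
            if abs(ak - qak) <= r and euclidean(vec, query) <= r:
                hits.append(vec)
        return hits

def build_cdkb(vectors, centres):
    clusters = [Cluster(tuple(c)) for c in centres]
    for v in vectors:
        nearest = min(clusters, key=lambda c: euclidean(v, c.centre))
        nearest.insert(tuple(v))
    return clusters

def range_query(clusters, query, r):
    hits = []
    for c in clusters:
        # Cluster filter: skip clusters whose ball cannot meet the query ball.
        if euclidean(query, c.centre) <= c.radius + r:
            hits.extend(c.range_search(query, r))
    return hits
```

A production index would replace the sorted lists with a real B⁺-tree so that insertions do not shift whole arrays; the filtering order (cluster, main key, auxiliary key, full distance) is the point of the sketch.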

Further, the clustering algorithm in step (1) includes K-means clustering.

Further, choosing two reference points in step (2) includes choosing the origin and the cluster centre as the reference points.

Further, the distance from a high-dimensional vector to the two reference points in step (2) may be the Euclidean distance or the city block distance.
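The two distance options can be written as interchangeable functions. The sketch below is illustrative only, with hypothetical names; it assumes the same metric is used both for the key values and for the final similarity matching, since the key-range filtering relies on the triangle inequality of that metric.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def dual_keys(vector, ref_main, ref_aux, metric=euclidean):
    """Main and auxiliary key of `vector` under the chosen metric."""
    return metric(vector, ref_main), metric(vector, ref_aux)
```

For the vector `(3, 4)` with reference points `(0, 0)` and `(3, 0)`, the Euclidean keys are 5.0 and 4.0, while the city block keys are 7 and 4.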

Further, when a high-dimensional vector is inserted into the CDKB-tree of step (3), the distance from the vector to the cluster centre of each cluster is computed first, the cluster nearest to the vector is selected for the insertion, and the cluster radius is updated. The main key and auxiliary key of the vector to be inserted are then obtained from its distances to the two reference points of that cluster, and the main key value is used to locate the leaf node of the cluster's B⁺-tree into which it should be inserted. If this leaf node is not full, the main key value is inserted directly into the leaf node, the auxiliary key is inserted at the position corresponding to the main key, and the feature vector to be inserted is stored at the position corresponding to the auxiliary key; the main key is given a pointer to its auxiliary key, the auxiliary key is given a pointer to the inserted high-dimensional vector, and the corresponding key value in the leaf node's parent is updated. If the leaf node is already full, it is handled as follows:

1) If the left or right sibling of the leaf node is not full, the leaf node is combined with its siblings to insert the main key, the auxiliary key, and the high-dimensional vector, and the corresponding key value in the parent node is updated;

2) If the left and right siblings are both full, the leaf node is split directly, taking the main key value of the vector to be inserted into account; the leaf node newly produced by the split is inserted into the parent node, the auxiliary key and the high-dimensional vector are inserted at their corresponding storage positions, and the corresponding key value in the parent node is updated. If the parent node is also full, the split propagates upward and the corresponding key values are updated.
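The leaf-level cases of the insertion procedure can be modelled with a deliberately reduced Python sketch. It is an interpretation, not the patented procedure itself: a flat list of sorted leaves stands in for the B⁺-tree leaf level, "combining with a sibling" is read as moving one boundary entry into a sibling that has room, parent-node key updates and upward split propagation are not modelled, and the function name and return labels are hypothetical.

```python
import bisect

def insert_into_leaves(leaves, capacity, entry):
    """Insert entry = (main_key, aux_key, vector_id) into a list of sorted
    leaves and report which case of the procedure was taken."""
    # Locate the target leaf: the first leaf whose largest main key is not
    # smaller than the new main key, otherwise the last leaf.
    idx = next((i for i, leaf in enumerate(leaves)
                if leaf and leaf[-1][0] >= entry[0]), len(leaves) - 1)
    leaf = leaves[idx]
    bisect.insort(leaf, entry)
    if len(leaf) <= capacity:
        return "inserted"          # leaf was not full
    # Case 1: a sibling has room, so move one boundary entry across.
    if idx > 0 and len(leaves[idx - 1]) < capacity:
        leaves[idx - 1].append(leaf.pop(0))
        return "shifted-left"
    if idx + 1 < len(leaves) and len(leaves[idx + 1]) < capacity:
        leaves[idx + 1].insert(0, leaf.pop())
        return "shifted-right"
    # Case 2: both siblings are full (or absent), so split the leaf.
    mid = len(leaf) // 2
    leaves[idx:idx + 1] = [leaf[:mid], leaf[mid:]]
    return "split"
```

Starting from `leaves = [[]]` with a capacity of 2, inserting main keys 1.0, 2.0 and 3.0 fills the single leaf and then splits it; a following insertion of 2.5 overflows the right leaf and is resolved by moving one entry into the left sibling.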

Further, the retrieval in step (4) supports both range queries and k-nearest-neighbour queries.

Further, the query range in step (4) is determined by the query radius for a range query; for a k-nearest-neighbour query it is determined by a query radius that is increased by a fixed step until the distance from the k-th nearest neighbour to the query vector is less than the query radius.
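The k-nearest-neighbour strategy of growing the query radius can be sketched as a loop around a range query. In this illustration the range query is a brute-force stand-in for the index search, the names are hypothetical, and the early exit when fewer than k vectors exist is an added safeguard that the source does not mention.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(vectors, query, r):
    # Stand-in for the index-based range query.
    return [v for v in vectors if euclidean(v, query) <= r]

def knn_query(vectors, query, k, step):
    """Grow the radius by `step` until at least k vectors fall inside it."""
    assert step > 0
    r = step
    while True:
        hits = sorted(range_query(vectors, query, r),
                      key=lambda v: euclidean(v, query))
        if len(hits) >= k:
            return hits[:k]   # the k-th neighbour lies within radius r
        if len(hits) == len(vectors):
            return hits       # fewer than k vectors exist in total
        r += step
```

The step size trades off the two costs: a small step means many repeated range queries, while a large step can make the final range query examine far more candidates than the k that are needed.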

Brief description of the drawings

The accompanying drawings described here provide a further understanding of the invention and form a part of it. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:

Fig. 1(a) is a flow chart of the method of the invention.

Fig. 1(b) is an example diagram of the index structure of the invention.

Fig. 2 is a block diagram of a range query on the index structure of the invention.

Fig. 3 is a block diagram of a k-nearest-neighbour query on the index structure of the invention.

Detailed description of the invention

To make the technical problem to be solved and the technical solution of the invention clearer, specific embodiments of the invention are further described below with reference to the drawings and an embodiment.

Fig. 1(a) shows the flow chart for building the index structure of the high-dimensional vector retrieval method combining clustering and dual key values provided by this embodiment:

First, a clustering algorithm partitions the high-dimensional vector set into spatial clusters, yielding the high-dimensional data of each cluster. Second, the cluster centre and radius of each cluster are computed, and two reference points are chosen for each cluster. Next, the distances between each high-dimensional vector in each cluster and the cluster's two reference points are computed one by one, giving the pair of one-dimensional key values of each vector. Then, within each cluster, the key value obtained from one reference point is taken as the main key and the other as the auxiliary key; a B⁺-tree is built for the cluster from its main keys, and the auxiliary keys and high-dimensional vectors corresponding to the main keys are inserted at the corresponding auxiliary-key and vector storage positions. Each main key is bound to a pointer to its auxiliary key, and each auxiliary key is bound to a pointer to its high-dimensional vector, giving the dual-key extended B⁺-tree of each cluster. Finally, the cluster centre and cluster radius of each cluster are bound to the cluster's dual-key extended B⁺-tree, forming the CDKB-tree index structure. (As shown in Fig. 1(b), the upper layer is the cluster information layer; the middle layer consists of the B⁺-trees built from the main keys of each cluster; the bottom consists of the auxiliary key layer and the feature vector layer, which store the auxiliary keys and the high-dimensional vectors. Each main key at the leaf level of each B⁺-tree is bound to a pointer to its auxiliary key, and each auxiliary key is bound to a pointer to its high-dimensional vector.)
During retrieval, the query vector and the query range are used to judge whether each cluster intersects the query, and only the clusters that do are searched further. The search within a cluster proceeds as follows. First, using the same reference points and mapping rule, the distances between the query vector and the cluster's two reference points are computed, giving the query main key and the query auxiliary key for that cluster. Then, from the query main key and the query range, the main-key search range in the main key layer (that is, the B⁺-tree leaf level) of the cluster's dual-key extended B⁺-tree is determined, giving the scan start and end positions in the main key layer; from the query auxiliary key and the query range, the auxiliary-key search range in the auxiliary key layer is determined. Finally, the key values from the scan start position to the end position of the main key layer (the main-key search range) are scanned one by one; for each main key, it is judged whether its auxiliary key lies within the auxiliary-key search range, and if so, the distance between the corresponding high-dimensional vector and the query vector is computed. The high-dimensional vectors that satisfy the query are returned, giving the set of similar vectors.

The retrieval modes of the invention include range queries and k-nearest-neighbour queries. The flow chart of a range query is shown in Fig. 2, and that of a k-nearest-neighbour query in Fig. 3. As Fig. 3 shows, a k-nearest-neighbour query is implemented through range queries.

The high-dimensional vectors above may be feature vectors of images, video, or audio.

It should be understood that the above description of the embodiment is relatively specific and should not therefore be taken as limiting the scope of patent protection of the invention, which is defined by the claims.

Claims

1. A high-dimensional vector retrieval method combining clustering and dual key values, characterized in that the specific steps are as follows:

1) a clustering algorithm partitions the high-dimensional vector set into clusters, yielding the cluster centre and cluster radius of each cluster;

2) a dual-key extended B⁺-tree is built for each cluster, the process for each cluster being: two reference points are first chosen for the cluster, and each high-dimensional vector in the cluster is mapped to a pair of one-dimensional key values, namely its distances to the two reference points; within the cluster, the key value obtained from one reference point is uniformly taken as the main key and the other as the auxiliary key; a B⁺-tree is then built for the cluster from its main keys, each main key at the leaf level of this B⁺-tree being bound to a pointer to its corresponding auxiliary key and each auxiliary key being bound to a pointer to its corresponding high-dimensional vector; all main keys at the leaf level of the B⁺-tree form the main key layer, and all auxiliary keys form the auxiliary key layer;

3) the cluster centre and cluster radius of each cluster are bound to a pointer to the dual-key extended B⁺-tree built for that cluster, forming the CDKB-tree;

4) during retrieval, the query range is used to filter out the clusters that do not intersect the query region, and each cluster that intersects the query range is searched, the search within each intersecting cluster being: the query vector is mapped to a query main key and a query auxiliary key using the same reference points and mapping method; the query main key and the query range determine the start and end positions of the search in the cluster's main key layer; the query auxiliary key and the query range determine the auxiliary-key search range in the cluster's auxiliary key layer; the main keys between the start and end positions of the main key layer are scanned one by one, and for each main key it is judged whether its auxiliary key lies within the auxiliary-key search range; if it does, similarity matching is computed between the high-dimensional vector corresponding to that auxiliary key and the query vector; the high-dimensional vectors that satisfy the query range are returned as the retrieval result.

2. The method of claim 1, characterized in that the clustering algorithm in step 1 includes K-means clustering.

3. The method of claim 1, characterized in that choosing two reference points in step 2 includes choosing the origin and the cluster centre as the reference points.

4. The method of claim 1, characterized in that the distance from a high-dimensional vector to the two reference points in step 2 may be the Euclidean distance or the city block distance.

5. The method of claim 1, characterized in that, when a high-dimensional vector is inserted into the CDKB-tree of step 3, the distance from the vector to the cluster centre of each cluster is computed first, the cluster nearest to the vector is selected for the insertion, and the cluster radius is updated; the main key and auxiliary key of the vector to be inserted are then obtained from its distances to the two reference points of that cluster, and the main key value is used to locate the leaf node of the cluster's B⁺-tree into which it should be inserted; if this leaf node is not full, the main key value is inserted directly into the leaf node, the auxiliary key is inserted at the position corresponding to the main key, and the feature vector to be inserted is stored at the position corresponding to the auxiliary key, the main key being given a pointer to its auxiliary key and the auxiliary key being given a pointer to the inserted high-dimensional vector, and the corresponding key value in the leaf node's parent is updated; if the leaf node is already full, it is handled as follows:

Step 1: if the left or right sibling of the leaf node is not full, the leaf node is combined with its siblings to insert the main key, the auxiliary key, and the high-dimensional vector, and the corresponding key value in the parent node is updated;

Step 2: if the left and right siblings are both full, the leaf node is split directly, taking the main key value of the vector to be inserted into account; the leaf node newly produced by the split is inserted into the parent node, the auxiliary key and the high-dimensional vector are inserted at their corresponding storage positions, and the corresponding key value in the parent node is updated; if the parent node is also full, the split propagates upward and the corresponding key values are updated.

6. The method of claim 1, characterized in that the retrieval in step 4 supports both range queries and k-nearest-neighbour queries.

7. The method of claim 1, characterized in that the query range in step 4 is determined by the query radius for a range query, and for a k-nearest-neighbour query is determined by a query radius that is increased by a fixed step until the distance from the k-th nearest neighbour to the query vector is less than the query radius.