CN102999542A - Multimedia data high-dimensional indexing and k-nearest neighbor (kNN) searching method - Google Patents


Info

Publication number
CN102999542A
Authority
CN
China
Prior art keywords
node
data
distance
center
data object
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012102094945A
Other languages
Chinese (zh)
Other versions
CN102999542B (en
Inventor
杜小勇
张孝
王珊
李晖
Original Assignee
杜小勇
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 杜小勇 filed Critical 杜小勇
Priority to CN201210209494.5A priority Critical patent/CN102999542B/en
Publication of CN102999542A publication Critical patent/CN102999542A/en
Application granted granted Critical
Publication of CN102999542B publication Critical patent/CN102999542B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a multimedia data high-dimensional indexing and k-nearest neighbor (kNN) searching method, comprising the following steps: establishing a high-dimensional index of multiple multimedia data, wherein the high-dimensional index comprises multiple nodes and the data objects of the multimedia data, and each node stores the distance from its node center to the node center of each corresponding child node or to each corresponding data object; determining, according to a kNN searching algorithm, a search range for the data to be searched and a candidate index node set in the high-dimensional index; and pruning the candidate index node set according to the search range and the distances stored in each node of the candidate index node set, to obtain a search result for the data to be searched. Because the distance from each node center to the node center of the corresponding child node or to the data object is stored when the high-dimensional index is built, and the candidate index node set is pruned using the stored distances, the computational cost of pruning during the search is effectively reduced.

Description

Multimedia data high-dimensional indexing and kNN retrieval method
Technical field
The present invention relates to data indexing and retrieval technology, and in particular to a multimedia data high-dimensional indexing and kNN retrieval method.
Background
With the growing popularity of digital entertainment devices, the volume of multimedia data such as images, video, and audio has increased substantially, and content-based retrieval of multimedia data, especially kNN (k-Nearest Neighbor) retrieval, has become increasingly important. In image retrieval, kNN retrieval over a multimedia high-dimensional index can find the images most similar to an image to be retrieved. In video retrieval, a system can use the high-dimensional visual features and high-dimensional audio features of a video segment to be retrieved, together with a high-dimensional index and kNN retrieval, to quickly find the video segments most similar to it. High-dimensional indexing of multimedia data and the associated kNN retrieval method are therefore core technologies of multimedia data retrieval and have very broad application prospects.
At present, existing high-dimensional indexing and kNN retrieval techniques for multimedia data have certain limitations, mainly the following:
1. Existing high-dimensional indexing techniques usually consider only reducing the input/output (I/O) cost of kNN retrieval by building a hierarchical high-dimensional index, and do not address the large computational cost of high-dimensional kNN retrieval.
2. Existing pruning-based kNN retrieval methods, such as the Branch-and-Bound and INN methods, use complex pruning metrics with a very high computational cost: they use the shortest-distance algorithm to calculate the distance from every candidate node or data object to the data to be retrieved, and then filter and prune on that distance. The overall pruning process is therefore very expensive.
Summary of the invention
The invention provides a multimedia data high-dimensional indexing and kNN retrieval method to solve the problem of high computational cost in the prior art.
An embodiment of the invention provides a multimedia data high-dimensional indexing and kNN retrieval method, comprising:
building a high-dimensional index of a plurality of multimedia data, wherein each multimedia data comprises a plurality of data objects, the high-dimensional index comprises a plurality of nodes and the data objects of the plurality of multimedia data, and each node comprises the distance from the node center of the node to the node center of the child node corresponding to the node or to the data object;
determining, according to a kNN retrieval algorithm, a retrieval range for the data to be retrieved and a candidate index node set in the high-dimensional index;
pruning the candidate index node set according to the retrieval range and the distances comprised in each node of the candidate index node set, to obtain the retrieval result for the data to be retrieved.
As can be seen from the above, when building the high-dimensional index of the multimedia data, the embodiment of the invention stores in each node the distance from the node center of the node to the node center of the corresponding child node or to the data object, and uses the stored distances to prune the candidate index node set obtained by the kNN retrieval algorithm. This avoids the prior-art need to prune every candidate node or data object with the shortest-distance algorithm, and effectively reduces the computational cost of pruning during retrieval.
Description of drawings
Fig. 1 is a schematic flow chart of the multimedia data high-dimensional indexing and retrieval method provided by an embodiment of the invention;
Fig. 2 is a schematic diagram of a specific example of the multimedia data high-dimensional index provided by an embodiment of the invention;
Fig. 3 is an example diagram of pruning intermediate nodes using the lower-bound distance in the multimedia data high-dimensional indexing and retrieval method provided by an embodiment of the invention;
Fig. 4 is an example diagram of pruning intermediate nodes using the upper-bound distance in the multimedia data high-dimensional indexing and retrieval method provided by an embodiment of the invention;
Fig. 5 is an example diagram of pruning leaf nodes using the lower-bound distance in the multimedia data high-dimensional indexing and retrieval method provided by an embodiment of the invention.
Detailed description of the embodiments
Fig. 1 is a schematic flow chart of the multimedia data high-dimensional indexing and kNN retrieval method provided by an embodiment of the invention. The method of this embodiment comprises:
Step S1: build a high-dimensional index of a plurality of multimedia data, wherein each multimedia data comprises a plurality of data objects, the high-dimensional index comprises a plurality of nodes and the data objects of the plurality of multimedia data, and each node comprises the distance from the node center of the node to the node center of the child node corresponding to the node or to the data object.
The multimedia data comprises data such as images, audio, and video. Each multimedia data comprises a plurality of data objects. In practice, so that multimedia data can be retrieved quickly, it is usually represented as high-dimensional feature vectors, and a high-dimensional index is built from those vectors; the high-dimensional index comprises a plurality of nodes and the data objects of the plurality of multimedia data. As shown in Fig. 2, the high-dimensional index is hierarchical and comprises, from top to bottom, one root node layer, at least one intermediate node layer 1, one leaf node layer 2, and a data object layer 3. An upper-layer node is the parent of a lower-layer node, and a lower-layer node is the child of an upper-layer node. The example in Fig. 2 shows only one intermediate node layer 1, one leaf node layer 2, and one data object layer; the root node layer is not shown. In this embodiment, every node of the high-dimensional index stores the distance from its node center to the node center of each corresponding child node or to each corresponding data object. As shown in Fig. 2, root node R comprises the distance D_R2 from the node center of R to the node center of R2, an intermediate node in the next level below R. Intermediate node R1 comprises the distance D_R4 from the node center of R1 to the node center of its child node R4, and also the distance D_R5 from the node center of R1 to the node center of its child node R5. Similarly, leaf node R3 comprises the distance D_I from the node center of R3 to data object I. The distances stored in each node are used for pruning during later retrieval, which reduces the computational cost incurred in the prior art by pruning with the shortest-distance method alone.
Step S2: according to the kNN retrieval algorithm, determine the retrieval range of the data to be retrieved and the candidate index node set in the high-dimensional index.
Specifically, the retrieval algorithm may be kNN retrieval. The search engine first uses the kNN retrieval method to determine the retrieval radius for the data to be retrieved, and then determines the retrieval range as the region whose radius is the retrieval radius and whose center is the data to be retrieved. The nodes in the candidate index node set are the nodes that overlap the retrieval range.
Step S3: prune the candidate index node set according to the retrieval range and the distances comprised in each node of the candidate index node set, to obtain the retrieval result for the data to be retrieved.
The pruning process judges whether the child nodes or data objects corresponding to the nodes in the candidate index node set overlap the retrieval range, and cuts off those that do not. In practice, to obtain the retrieval result for the data to be retrieved, the search engine needs to prune the candidate index node set several times. The number of pruning passes determines the retrieval precision of the result.
In the multimedia data high-dimensional indexing and kNN retrieval method provided by this embodiment, building a high-dimensional index in which each node contains distance information reduces the computational cost of pruning during retrieval and effectively improves the retrieval efficiency for multimedia data.
Further, step S1 of the above embodiment, building the high-dimensional index of the multimedia data, may specifically be implemented by the following steps:
Step S101: partition the feature vector data according to a data partitioning strategy to generate the high-dimensional index.
Techniques that build a high-dimensional index of multimedia data with a data partitioning strategy fall into two broad classes. The first class is the Minimum Bounding Rectangle (MBR) partitioning strategy, which partitions the multimedia feature vector data into rectangles to form a hierarchical high-dimensional index. The second class is the Minimum Bounding Sphere (MBS) partitioning strategy, which partitions the multimedia feature vector data into spheres to form a hierarchical high-dimensional index. Because MBR partitions have a smaller volume in high-dimensional space, objects are usually less likely to overlap; therefore, when building the basic multimedia high-dimensional index, the MBR partitioning strategy is preferred.
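To make the MBR idea concrete, the following minimal Python sketch computes the bounding rectangle of a group of feature vectors. The function name `bounding_rectangle` is hypothetical, and taking the node center C as the midpoint of the rectangle is an assumption; the description does not state how the center is defined.

```python
from typing import Sequence, Tuple

Vector = Tuple[float, ...]


def bounding_rectangle(vectors: Sequence[Sequence[float]]) -> Tuple[Vector, Vector, Vector]:
    """Return (LB, RU, C) for the minimum bounding rectangle of the vectors.

    LB holds the per-dimension minimum and RU the per-dimension maximum.
    C is the midpoint of LB and RU (an assumption about the node center).
    """
    if not vectors:
        raise ValueError("at least one feature vector is required")
    dims = range(len(vectors[0]))
    lb = tuple(float(min(v[d] for v in vectors)) for d in dims)
    ru = tuple(float(max(v[d] for v in vectors)) for d in dims)
    center = tuple((lo + hi) / 2.0 for lo, hi in zip(lb, ru))
    return lb, ru, center
```

For three 2-D vectors (0, 0), (2, 4), and (1, 1), this yields LB = (0, 0), RU = (2, 4), and C = (1, 2).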
Step S102: calculate the distance from the node center of each node in the high-dimensional index to the node center of the child node corresponding to the node or to the data object, and store the distance in the node.
The high-dimensional index comprises one root node layer, at least one intermediate node layer, a leaf node layer, and a data object layer. Every node in the root node layer, intermediate node layers, and leaf node layer contains distance information. Specifically, after building the high-dimensional index, the index construction engine calculates the distance from the node center of the root node to the node center of each corresponding intermediate node in the next level, the distance from the node center of each intermediate node to the node center of each corresponding next-level intermediate node or leaf node, and the distance from the node center of each leaf node to each corresponding data object, and stores the calculated distances in the corresponding root node, intermediate node, or leaf node.
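Step S102 can be sketched as a single pass over an already built tree. The `Node` class and the `store_distances` function below are hypothetical names, and Euclidean distance is an assumption; the description does not fix a distance metric.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Vector = Tuple[float, ...]


@dataclass
class Node:
    center: Vector
    children: List["Node"] = field(default_factory=list)  # empty for leaf nodes
    objects: List[Vector] = field(default_factory=list)    # used by leaf nodes only
    dists: List[float] = field(default_factory=list)       # filled by store_distances


def store_distances(node: Node) -> None:
    """Compute and store center-to-child (or center-to-object) distances."""
    if node.children:
        # Root or intermediate node: one distance per child node center.
        node.dists = [math.dist(node.center, c.center) for c in node.children]
        for child in node.children:
            store_distances(child)
    else:
        # Leaf node: one distance per data object.
        node.dists = [math.dist(node.center, o) for o in node.objects]
```

The distances are computed once at build time, so retrieval can read them back instead of recomputing them.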
The high-dimensional index comprises a root node, intermediate nodes, leaf nodes, and data objects arranged in levels from top to bottom. The child nodes of the root node are intermediate nodes, the parent of a leaf node is an intermediate node, and the level below a leaf node consists of data objects. The root node, intermediate nodes, and leaf nodes each comprise their own node identifier, their own node center coordinates, the number of data objects they contain, and the distances described above. The number of data objects contained in a leaf node is the number of data objects corresponding to that leaf node; the number of data objects contained in a root node or intermediate node is the sum of the numbers of data objects contained in all of its child nodes. Specifically, as shown in Fig. 2, the storage area of root node R and of intermediate node R1 in the high-dimensional index holds the following content, in the following form:
(ID, C, LB, RU, #objects, #subregions, dists, D_fd, E_i)
E_i: (LB_i, RU_i, #objects_i, Pointer_i), where 1 ≤ i ≤ n, N_min ≤ n ≤ N_max, and n = #subregions.
Here, ID is the identifier of root node R or intermediate node R1. C is the center coordinate of root node R or intermediate node R1. LB is the first boundary of the rectangle of root node R or intermediate node R1 obtained under the MBR partitioning strategy, generally the lower-left corner coordinate of the rectangle. RU is the second boundary of that rectangle, generally the upper-right corner coordinate. #objects is the number of data objects contained in root node R or intermediate node R1. dists is the distance from the center C of root node R or intermediate node R1 to the node center of an intermediate node or leaf node in the next level. If root node R or intermediate node R1 has two or more intermediate nodes or leaf nodes in the next level, dists contains the distance from the node center of R or R1 to the node center of each of them. D_fd is the distance from the node center of root node R or intermediate node R1 to the data object farthest from that node center. E_i denotes the child nodes E_1, ..., E_n (N_min ≤ n ≤ N_max) of root node R or intermediate node R1, where N_max is the upper limit on the number of intermediate nodes or leaf nodes that root node R or intermediate node R1 may contain.
Figure BDA00001799506500051
M_N is the fan-out of root node R or intermediate node R1. Each tuple E_i corresponds to one node, which may be an intermediate node or a leaf node.
E_i consists of four parts: the first boundary LB_i of the rectangle of the corresponding next-level intermediate node or leaf node of root node R or intermediate node R1 obtained under the MBR partitioning strategy, i.e., the lower-left corner coordinate of the rectangle; the second boundary RU_i of that rectangle, i.e., the upper-right corner coordinate of the rectangle; the number of data objects #objects_i that the node contains; and the pointer Pointer_i to the child node.
As shown in Fig. 2, the storage area of leaf node R3 in the high-dimensional index holds the following content, in the following form:
(ID, C, LB, RU, #objects, dists, D_fd, E_i)
E_i: (Pointer_i), where 1 ≤ i ≤ n, L_min ≤ n ≤ L_max, and n = #objects.
Here, ID is the identifier of leaf node R3. C is the center coordinate of leaf node R3. LB is the first boundary of the rectangle of leaf node R3 obtained under the MBR partitioning strategy, i.e., the lower-left corner of the rectangle. RU is the second boundary of that rectangle, i.e., the upper-right corner. #objects is the number of data objects contained in leaf node R3. dists is the distance from the center C of leaf node R3 to the data object corresponding to leaf node R3. If leaf node R3 has two or more corresponding data objects, dists comprises the distance from the center C of leaf node R3 to each of them. E_i denotes the data objects E_1, ..., E_n (L_min ≤ n ≤ L_max) corresponding to leaf node R3, where L_max is the upper limit on the number of data objects that leaf node R3 may contain.
Figure BDA00001799506500061
M_L is the fan-out of leaf node R3. Each E_i contains a pointer to the corresponding data object.
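The two record layouts described above can be mirrored by plain Python data classes. This is only an illustrative sketch: the class names, the field names, and the use of integers for pointers are assumptions, and the fan-out bounds are left out because their defining formulas are not reproduced in this text.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vector = Tuple[float, ...]


@dataclass
class ChildEntry:
    """E_i of a root or intermediate node: (LB_i, RU_i, #objects_i, Pointer_i)."""
    lb: Vector
    ru: Vector
    n_objects: int
    pointer: int  # hypothetical: e.g. a page id locating the child node


@dataclass
class InternalNode:
    """(ID, C, LB, RU, #objects, #subregions, dists, D_fd, E_i)."""
    node_id: int
    center: Vector
    lb: Vector
    ru: Vector
    n_objects: int
    dists: List[float]         # center-to-child-center distances, one per entry
    d_fd: float                # distance from the center to the farthest data object
    entries: List[ChildEntry]

    @property
    def n_subregions(self) -> int:
        return len(self.entries)  # n = #subregions

    def is_consistent(self) -> bool:
        # One stored distance per child, and #objects is the sum over children.
        return (len(self.dists) == len(self.entries)
                and self.n_objects == sum(e.n_objects for e in self.entries))


@dataclass
class LeafNode:
    """(ID, C, LB, RU, #objects, dists, D_fd, E_i) with E_i = (Pointer_i)."""
    node_id: int
    center: Vector
    lb: Vector
    ru: Vector
    dists: List[float]         # center-to-object distances, one per object
    d_fd: float
    pointers: List[int]        # hypothetical: ids locating the data objects

    @property
    def n_objects(self) -> int:
        return len(self.pointers)  # n = #objects

    def is_consistent(self) -> bool:
        return len(self.dists) == len(self.pointers)
```

Deriving `n_subregions` and the leaf's `n_objects` from the entry lists keeps the counts from drifting out of sync with the entries they describe.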
Further, step S2 of the above embodiment, determining the retrieval range of the data to be retrieved and the candidate index node set in the high-dimensional index according to the kNN retrieval algorithm, may specifically be implemented by the following steps:
Step S201: using the shortest-distance algorithm, calculate the distance between the data to be retrieved and the node center of an intermediate node, where the intermediate node is any intermediate node in the level directly below the root node.
In actual retrieval, the search engine may select any child node of the root node in the high-dimensional index, i.e., an intermediate node, and calculate the distance from the center of the selected intermediate node to the data to be retrieved. The purpose of selecting an intermediate node in this step is to avoid missed results.
Step S202: determine the retrieval range according to the distance between the data to be retrieved and the node center of the intermediate node.
Specifically, using the distance calculated in the previous step, the search engine determines the retrieval range as the region centered on the data to be retrieved whose radius is the distance between the data to be retrieved and the node center of the intermediate node.
Step S203: judge whether each node in the memory of the retrieval server overlaps the initial retrieval range; if so, store the node in the candidate index node set.
Generally, the memory of the retrieval server holds the nodes loaded during the previous retrieval. There is at least one node in memory; in the worst case, memory holds only the root node. The search engine checks the nodes in memory first, which avoids the cost of fetching nodes from the high-dimensional index again and reduces the I/O cost of retrieval. Specifically, the search engine calculates the distance from the node center of each node in memory to the data to be retrieved; if the distance is less than the retrieval radius of the retrieval range, the node overlaps the retrieval range. Nodes that overlap the retrieval range are stored in the candidate index node set.
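Steps S201 to S203 amount to one distance computation for the radius and one comparison per in-memory node. The sketch below assumes Euclidean distance and uses hypothetical function names; it follows the overlap rule stated above (center distance strictly less than the retrieval radius).

```python
import math
from typing import Dict, List, Sequence


def initial_radius(query: Sequence[float], intermediate_center: Sequence[float]) -> float:
    """Steps S201-S202: the radius is the distance from the query to the chosen center."""
    return math.dist(query, intermediate_center)


def overlapping_nodes(query: Sequence[float], radius: float,
                      centers: Dict[str, Sequence[float]]) -> List[str]:
    """Step S203: keep the in-memory nodes whose center lies within the radius."""
    return [name for name, c in centers.items() if math.dist(query, c) < radius]
```

With the query at (0, 0) and the chosen intermediate node centered at (3, 4), the radius is 5; a node centered at (1, 1) is kept, while nodes centered at (6, 0) or exactly on the boundary at (3, 4) are not.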
Step S204: judge whether the total number of data objects contained in all nodes of the candidate index node set equals a preset value. If not, enlarge or shrink the retrieval range by a preset ratio, determine the nodes in the high-dimensional index that overlap the enlarged or shrunk retrieval range, and store the overlapping nodes in the candidate index node set in turn, until the total number of data objects contained in all nodes of the candidate index node set equals the preset value.
Specifically, using the number of data objects contained in each node of the candidate index node set, the search engine calculates the total number of data objects in the set. The search engine then judges whether this total equals the parameter K of the kNN retrieval. The parameter K is a preset value that can be set manually based on retrieval experience.
If the total number of data objects in the candidate index node set equals the parameter K, the retrieval range and the candidate index node set are the ones finally determined by the search engine.
If the total number of data objects in the candidate index node set does not equal the parameter K, there are two cases. Case 1: the total is less than K. The search engine enlarges the retrieval range by the preset ratio to determine a new retrieval range. It then applies the shortest-distance algorithm again, i.e., the MinDist calculation, to determine which nodes in the high-dimensional index have a center whose distance to the data to be retrieved is less than the radius of the enlarged retrieval range, stores those nodes in the candidate index node set, and continues with this step. Case 2: the total is greater than K. The search engine shrinks the retrieval range by the preset ratio to determine a new retrieval range. It then applies the shortest-distance algorithm again, i.e., the MinDist calculation, to determine which nodes in the high-dimensional index have a center whose distance to the data to be retrieved is less than the radius of the shrunk retrieval range, stores those nodes in the candidate index node set, and continues with this step. Through these steps, the search engine finally determines a candidate index node set whose total number of contained data objects equals the parameter K, together with the final retrieval range obtained after shrinking or enlarging.
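The adjustment loop of step S204 can be sketched as follows. The function names, the default ratio of 2.0, and the `max_rounds` cap are assumptions added here: with a fixed ratio the total may never hit K exactly, so the sketch stops after a bounded number of rounds instead of looping forever. The candidate set is recomputed from scratch for each radius, which is one possible reading of the step.

```python
import math
from typing import List, Sequence, Tuple

NodeInfo = Tuple[Sequence[float], int]  # (node center, number of data objects)


def select(query: Sequence[float], radius: float, nodes: List[NodeInfo]) -> List[NodeInfo]:
    """Nodes whose center is closer to the query than the current radius."""
    return [n for n in nodes if math.dist(query, n[0]) < radius]


def adjust_range(query: Sequence[float], radius: float, nodes: List[NodeInfo],
                 k: int, ratio: float = 2.0, max_rounds: int = 32):
    """Enlarge or shrink the radius until the candidates hold exactly k objects."""
    for _ in range(max_rounds):
        candidates = select(query, radius, nodes)
        total = sum(count for _, count in candidates)
        if total == k:
            return radius, candidates
        # Too few objects: enlarge the range; too many: shrink it.
        radius = radius * ratio if total < k else radius / ratio
    return radius, select(query, radius, nodes)  # cap reached without an exact match
```

Starting from radius 1 with nodes holding 2, 3, and 4 objects at distances 1, 3, and 10 from the query, and K = 5, the radius grows to 2 and then to 4, at which point the first two nodes together hold exactly 5 objects.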
Further, step S3 of the above embodiment, pruning the candidate index node set according to the retrieval range and the distances comprised in the nodes of the candidate index node set to obtain the retrieval result for the data to be retrieved, comprises:
Step S301: using the shortest-distance algorithm, calculate the distance from the node center of each node in the candidate index node set to the data to be retrieved.
Step S302: perform the first pruning on the candidate index node set according to the distance from the node center of each node to the data to be retrieved and the distances comprised in each node.
The first pruning of the candidate index node set comprises:
Step S3021: add the child nodes or data objects corresponding to each node in the set to the set.
Step S3022: according to the following formula, calculate the lower-bound distance $d_{low}$ from the data to be retrieved to the node center of the child node corresponding to each node, or to the corresponding data object.
$$d_{low} = d_{QC_R} - dist(C_R, C_{R_i})$$
where $d_{QC_R}$ is the distance from the data to be retrieved to the center of the node, and $dist(C_R, C_{R_i})$ is the distance, stored in the node, from the node center of the node to the node center of the corresponding child node or to the data object.
Step S3023: judge whether the lower-bound distance $d_{low}$ is greater than the sum of the radius of the retrieval range and the distance from the node center of the node to the child node center or data object farthest from that node center. If so, the child node or data object is not a candidate node or candidate data object, and it is deleted from the candidate index node set.
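The lower bound used in steps S3022 and S3023 follows from the triangle inequality. The short derivation below is a sketch rather than text from the patent: $d(Q, C_{R_i})$ is notation introduced here for the exact distance from the data to be retrieved to the child node center (or data object), and $T$ stands for the threshold that step S3023 compares against.

```latex
\begin{aligned}
d_{QC_R} &\le d(Q, C_{R_i}) + dist(C_R, C_{R_i}) \\
\Rightarrow\quad d(Q, C_{R_i}) &\ge d_{QC_R} - dist(C_R, C_{R_i}) = d_{low} \\
\text{so}\quad d_{low} > T &\;\Rightarrow\; d(Q, C_{R_i}) > T
\end{aligned}
```

Because $dist(C_R, C_{R_i})$ is read from the node and $d_{QC_R}$ is computed once per node, testing $d_{low}$ costs one subtraction per child rather than one high-dimensional distance computation.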
Further, if the node in the candidate index node set is a root node or an intermediate node, the first pruning of the candidate index node set also comprises:
Step S3023: according to the following formula, calculate the upper-bound distance $d_{up}$ from the data to be retrieved to the node center of the child node of the node.
$$d_{up} = d_{QC_R} + dist(C_R, C_{R_i})$$
Step S3024: judge whether the upper-bound distance is less than or equal to the sum of the radius of the retrieval range and the distance from the center of the node to the center of the child node farthest from the center of the node. If not, the child node is an uncertain candidate node, and the shortest-distance algorithm is used to calculate the distance from the node center of the uncertain candidate node to the data to be retrieved; if this distance is greater than the radius of the retrieval range, the uncertain candidate node is not a candidate node and is deleted from the candidate index node set.
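The upper-bound test can be sketched as a per-child check that falls back to an exact distance only when the bound is inconclusive. The function name `check_child` and the parameter `far_dist` (the farthest-distance term that step S3024 adds to the radius) are hypothetical, and Euclidean distance is an assumption.

```python
import math
from typing import Sequence, Tuple


def check_child(query: Sequence[float], radius: float, d_q_cr: float,
                stored_dist: float, far_dist: float,
                child_center: Sequence[float]) -> Tuple[bool, bool]:
    """Return (keep, used_exact_distance) for one child node."""
    d_up = d_q_cr + stored_dist  # upper bound, computed from stored values only
    if d_up <= radius + far_dist:
        return True, False       # accepted without an exact distance computation
    # Uncertain candidate: fall back to the exact center-to-query distance.
    exact = math.dist(query, child_center)
    return exact <= radius, True
```

The second element of the result shows whether the expensive computation was needed, which makes it easy to measure how many exact distances the bound saves on a given data set.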
In the prior art, data filtering and pruning require the computationally expensive shortest-distance algorithm to calculate the distance to the data to be retrieved from the node center of every child node, or from every data object, of all nodes in the candidate index node set. By contrast, the pruning provided by the embodiment of the invention is decided by low-cost formulas, so the computational cost is markedly lower and retrieval efficiency is effectively improved.
After the first pruning, the candidate index node set may still contain nodes that were not handled by the first pruning. Therefore, after the first pruning of the candidate index node set, the method also comprises:
Step S303: perform a second pruning on the candidate index node set.
The second pruning of the candidate index node set comprises:
Step S3031: using the shortest-distance algorithm, calculate the distance to the data to be retrieved from the node center of each child node, or from each data object, of the nodes in the candidate index node set.
Step S3032: if the distance from the node center of the child node, or from the data object, to the data to be retrieved is greater than the radius of the retrieval range, the child node or data object is not a candidate node or candidate data object, and it is deleted from the candidate index node set.
In actual retrieval, to further improve the accuracy of the search result so that it contains the N nodes or data objects nearest to the data to be retrieved (where N can be set manually), the search engine needs to perform the above pruning process on the candidate index node set multiple times. For example, suppose the high-dimensional index built over the multimedia data has 4 layers: one root node layer, one intermediate node layer, one leaf node layer, and one data object layer. The search engine determines the candidate index node set of the data to be retrieved in the high-dimensional index according to the kNN search algorithm. Suppose the candidate index node set contains multiple intermediate nodes. The search engine performs a first round of pruning on the candidate index node set according to the search range and the distances stored in the intermediate nodes of the set. The first round comprises the first pruning and the second pruning described in the above embodiment. After the first round, every intermediate node in the candidate index node set has been replaced by its corresponding leaf nodes in the leaf layer. The search engine then performs a second round of pruning on the candidate index node set. After the second round, every leaf node in the set has been replaced by its corresponding data objects in the data object layer. The candidate index node set obtained after these two rounds of pruning, which consists of data objects, is the search result for the data to be retrieved.
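The round-by-round descent described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: the names `descend`, `children_of`, and `keep` are hypothetical, and `keep` stands in for the combined first and second pruning of one round.

```python
def descend(candidates, children_of, keep, rounds):
    """Replace each candidate by its surviving children, once per round.

    candidates  -- node identifiers at the starting level (hypothetical)
    children_of -- dict mapping a node identifier to its child identifiers
    keep        -- predicate standing in for one round of pruning
    rounds      -- number of levels to descend
    """
    for _ in range(rounds):
        next_level = []
        for node in candidates:
            # Only children that survive pruning move to the next level.
            next_level.extend(c for c in children_of[node] if keep(c))
        candidates = next_level
    return candidates


# Two rounds: intermediate nodes -> leaf nodes -> data objects.
children_of = {
    "I1": ["L1", "L2"], "I2": ["L3"],
    "L1": ["P1", "P2"], "L2": ["P3"], "L3": ["P4"],
}
pruned = {"L2", "P2"}  # pretend these fail the pruning tests
result = descend(["I1", "I2"], children_of, lambda n: n not in pruned, 2)
```

With a 4-layer index and a candidate set of intermediate nodes, two rounds are enough to reach the data object layer.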
The process of pruning the candidate index node set in this embodiment is further described below with a concrete example.
1. The nodes in the candidate index node set are intermediate nodes.
First, the upper-bound distance and lower-bound distance from each child node of the node to the data to be retrieved are calculated.
Specifically, as shown in Figure 3, Q is the data to be retrieved and $Q_r$ is the retrieval radius of the search range of Q. R is the node, and the child nodes of R are $R_1$, $R_2$, $R_3$ and $R_4$. The distances stored in node R comprise: the distance from the center $C_R$ of node R to the center $C_{R_1}$ of $R_1$, the distance from $C_R$ to the center $C_{R_2}$ of $R_2$, the distance from $C_R$ to the center $C_{R_3}$ of $R_3$, and the distance from $C_R$ to the center $C_{R_4}$ of $R_4$. According to the shortest-distance algorithm, the distance $d_{QC_R}$ from Q to the center $C_R$ of node R is calculated.
Then, according to the triangle inequality, the upper-bound distance $d_{up}(Q, R_i)$ and the lower-bound distance $d_{low}(Q, R_i)$ from Q to $R_i$ are calculated, that is:
$$d_{low} = d_{QC_R} - \mathrm{dist}(C_R, C_{R_i})$$
$$d_{up} = d_{QC_R} + \mathrm{dist}(C_R, C_{R_i})$$
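As a rough illustration of the two formulas above, the following sketch computes both bounds for every child from a single distance calculation plus the distances already stored in the parent node. The function names and the use of Euclidean distance are assumptions made for the example; the text does not fix a particular metric.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def child_bounds(query, parent_center, stored_child_dists):
    """Return (d_low, d_up) for each child of a node.

    stored_child_dists holds dist(C_R, C_Ri) for each child, i.e. the
    distances saved in the node when the index was built.
    """
    d_q_cr = euclidean(query, parent_center)  # the only distance computed
    # d_low follows the formula in the text and may be negative; the
    # triangle inequality also permits abs(d_q_cr - d) as a lower bound.
    return [(d_q_cr - d, d_q_cr + d) for d in stored_child_dists]


bounds = child_bounds((0.0, 0.0), (3.0, 4.0), [2.0, 7.0])
```

Only one distance evaluation against the query is needed per parent node; the per-child work is a subtraction and an addition.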
Then, for each of the nodes $R_1$, $R_2$, $R_3$ and $R_4$, it is determined whether $d_{low}(Q, R_i)$ satisfies:
$$d_{low}(Q, R_i) > Q_r + D_{fdR_i}$$
where $D_{fdR_i}$ is the distance from the center $C_{R_i}$ of $R_i$ to the data object $P_{fd}$ farthest from $C_{R_i}$. If the condition is satisfied, even the minimum distance from Q to $R_i$ is greater than the sum of the query radius $Q_r$ and the distance from $C_{R_i}$ to its farthest data object $P_{fd}$; that is, $R_i$ does not overlap the search range of Q, and $R_i$ is deleted from the candidate index node set. By this test on $d_{low}(Q, R_i)$, $R_1$, $R_2$ and $R_4$ are found not to overlap the search range of Q, so the search engine determines at minimal computational cost that $R_1$, $R_2$ and $R_4$ are not candidate nodes. The search engine thereby avoids the prior-art overhead of performing a minimum-distance calculation for every node.
Subsequently, for the child nodes of the nodes remaining in the candidate index node set, it is determined whether $d_{up}$ satisfies:
$$d_{up}(Q, R_i) \le Q_r + D_{fdR_i}$$
If the condition is satisfied, even the maximum distance from Q to $R_i$ is less than or equal to the sum of the query radius $Q_r$ and the distance from the center $C_{R_i}$ of $R_i$ to the data object $P_{fd}$ farthest from $C_{R_i}$; that is, $R_i$ overlaps the search range of Q. The search engine thus determines at minimal computational cost that $R_i$ is a candidate node, and adds each $R_i$ so determined to the candidate index node set. As shown in Figure 4, this test on $d_{up}(Q, R_i)$ determines that $R_1$ and $R_3$ both overlap the search range of Q, so $R_1$ and $R_3$ are both added to the candidate index node set.
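The two tests can be combined into a single pass that sorts each child into one of three groups: pruned, accepted, or still undetermined. The sketch below is an assumption-laden illustration; the function name, the tuple layout, and the sample numbers are hypothetical and are not taken from the figures.

```python
def classify_children(d_q_cr, children, q_r):
    """Apply the lower-bound and upper-bound tests to each child.

    d_q_cr   -- distance from the query Q to the parent center C_R
    children -- list of (name, dist(C_R, C_Ri), D_fdRi) tuples
    q_r      -- retrieval radius Q_r
    """
    pruned, accepted, uncertain = [], [], []
    for name, d_center, d_fd in children:
        d_low = d_q_cr - d_center
        d_up = d_q_cr + d_center
        threshold = q_r + d_fd
        if d_low > threshold:
            pruned.append(name)      # cannot overlap the search range
        elif d_up <= threshold:
            accepted.append(name)    # overlaps the search range
        else:
            uncertain.append(name)   # left for the second pruning
    return pruned, accepted, uncertain


children = [("R1", 1.0, 3.0), ("R2", 4.0, 12.0), ("R3", 6.0, 5.0)]
pruned, accepted, uncertain = classify_children(10.0, children, 2.0)
```

Children that land in the third group are the ones that still need an explicit minimum-distance calculation.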
Finally, if intermediate nodes remain in the candidate index node set for which the above tests could not determine whether their next-level child nodes overlap the search range of the data to be retrieved, the second pruning must be performed on those intermediate nodes.
Specifically, the search engine uses the existing shortest-distance algorithm to calculate the MinDist distance from each child node of the intermediate node to the data to be retrieved. If the calculated MinDist(Q, $R_i$) is less than the retrieval radius $Q_r$ of the data to be retrieved, the child node $R_i$ overlaps the search range of the data to be retrieved. The search engine adds each child node $R_i$ that overlaps the search range to the candidate index node set.
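A minimal sketch of this second pruning is given below. The text does not specify how MinDist is computed, so the example assumes each child node is a sphere described by a center and a covering radius, for which MinDist is the center distance minus the radius, floored at zero. The function names and sample data are hypothetical.

```python
import math


def min_dist(query, center, radius):
    """MinDist from a query point to a spherical node (assumed shape)."""
    return max(0.0, math.dist(query, center) - radius)


def second_pruning(query, q_r, uncertain_children):
    """Keep the undetermined children whose MinDist is below Q_r.

    uncertain_children -- list of (name, center, radius) tuples
    """
    return [
        name
        for name, center, radius in uncertain_children
        if min_dist(query, center, radius) < q_r
    ]


kept = second_pruning(
    (0.0, 0.0),
    2.0,
    [("A", (3.0, 4.0), 4.0), ("B", (6.0, 8.0), 3.0)],
)
```

For rectangular node regions a different MinDist formula would be needed; the comparison against $Q_r$ stays the same.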
After the above pruning process, the candidate index node set that previously consisted of intermediate nodes has been turned into a candidate index node set consisting of the child nodes of those intermediate nodes. These child nodes may be intermediate nodes of the next level or leaf nodes.
Further, if the nodes in the candidate index node set are root nodes, the above steps can also be used for pruning. The pruning then yields a candidate index node set consisting of intermediate nodes.
2. The nodes in the candidate index node set are leaf nodes.
As in the pruning process for intermediate nodes above, the distance from the data to be retrieved to the leaf node is calculated first. Then, according to the triangle inequality, the lower-bound distance from the data to be retrieved to each data object corresponding to the leaf node is obtained. Note that, unlike the pruning of intermediate nodes, the first pruning of a leaf node only needs to check whether the lower-bound distance of each data object corresponding to the leaf node satisfies: lower-bound distance > retrieval radius of the data to be retrieved. If it does, the data object lies outside the search range of the data to be retrieved. As shown in Figure 5, data objects P2, P3, P4 and P6 all lie outside the search range of the data to be retrieved. By the lower-bound test, the search engine thus determines at minimal computational cost that P2, P3, P4 and P6 are not candidate data objects, and avoids the prior-art overhead of performing a minimum-distance calculation for every data object.
In this step, only the distance lower bound is applied when pruning leaf nodes, in order to filter out unnecessary input/output overhead. The reason is that even if the search engine applied the distance upper bound, it could only conclude that a given data object lies within the search range. Whether that data object is one of the k objects nearest to the data to be retrieved would still require calculating the distance from each data object to the data to be retrieved according to the shortest-distance algorithm. Applying the distance upper bound at the leaf node layer is therefore pointless, so only the lower-bound distance is used when pruning leaf nodes.
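The leaf-level filter can be sketched as below. This is an illustrative example only; the function name, the dictionary layout, and the numeric values are assumptions and do not reproduce Figure 5.

```python
def prune_leaf_objects(d_q_leaf, stored_object_dists, q_r):
    """Drop data objects whose lower-bound distance exceeds Q_r.

    d_q_leaf            -- distance from the query to the leaf center
    stored_object_dists -- dict: object id -> distance from the leaf
                           center to that object, stored in the leaf
    q_r                 -- retrieval radius
    """
    survivors = []
    for obj_id, d_center_obj in stored_object_dists.items():
        d_low = d_q_leaf - d_center_obj
        if d_low > q_r:
            continue  # certainly outside the search range, never loaded
        survivors.append(obj_id)
    return survivors


stored = {"P1": 8.0, "P2": 2.0, "P3": 6.5, "P5": 9.0}
survivors = prune_leaf_objects(10.0, stored, 3.0)
```

Objects filtered out here never need to be read or compared against the query, which is where the input/output saving comes from.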
Further, if after the above pruning process the number of data objects in the candidate index node set is still greater than the preset candidate value, the search engine needs to calculate the distance dist(Q, $P_i$) from the data to be retrieved to each data object $P_i$ in the candidate index node set, and determine whether $P_i$ lies within the search range of the data to be retrieved. If the distance from $P_i$ to the data to be retrieved is smaller than the distance from the data to be retrieved to the farthest data object currently in the candidate index node set, that farthest data object is replaced by $P_i$.
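One common way to realize this "replace the farthest candidate" step is a bounded max-heap, sketched below. The heap is a choice made for this example, not something the text prescribes, and the function name and sample points are hypothetical.

```python
import heapq
import math


def refine_knn(query, candidates, k):
    """Keep the k candidates nearest to the query.

    candidates -- iterable of (object id, point) pairs that survived pruning
    """
    heap = []  # max-heap emulated with negated distances
    for obj_id, point in candidates:
        d = math.dist(query, point)
        if len(heap) < k:
            heapq.heappush(heap, (-d, obj_id))
        elif d < -heap[0][0]:
            # Closer than the current farthest result: replace it.
            heapq.heapreplace(heap, (-d, obj_id))
    # Return (distance, id) pairs ordered from nearest to farthest.
    return sorted((-neg_d, obj_id) for neg_d, obj_id in heap)


candidates = [
    ("P1", (1.0, 0.0)),
    ("P5", (0.0, 3.0)),
    ("P7", (2.0, 0.0)),
    ("P8", (0.0, 0.5)),
]
nearest = refine_knn((0.0, 0.0), candidates, 2)
```

The heap root always holds the farthest of the current k results, so each new candidate costs one comparison and at most one logarithmic-time replacement.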
Through the above pruning, the search result for the data to be retrieved is obtained.
Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments. The storage medium includes various media capable of storing program code, such as ROM, RAM, magnetic disks, or optical discs.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (7)

1. A multimedia data high-dimensional indexing and kNN retrieval method, characterized by comprising:
building a high-dimensional index of a plurality of multimedia data, wherein each multimedia data comprises a plurality of data objects, the high-dimensional index comprises a plurality of nodes and the data objects of the plurality of multimedia data, and each node comprises the distance from the node center of the node to the node center of a child node corresponding to the node or to a data object corresponding to the node;
determining, according to a kNN search algorithm, a search range and a candidate index node set of data to be retrieved in the high-dimensional index;
pruning the candidate index node set according to the search range and the distances comprised by each node in the candidate index node set, to obtain a search result for the data to be retrieved.
2. The multimedia data high-dimensional indexing and kNN retrieval method according to claim 1, characterized in that building the high-dimensional index comprises:
partitioning the multimedia data according to a data partitioning strategy to generate the high-dimensional index;
calculating the distance from the node center of each node in the high-dimensional index to the node center of a child node corresponding to the node or to a data object corresponding to the node, and storing the distance in the node.
3. multi-medium data high dimensional indexing according to claim 1 and 2 and kNN search method is characterized in that, described high dimensional indexing comprises: root node, intermediate node, leaf node and the data object of level arrangement from top to bottom; Wherein,
The child node that described root node is corresponding is intermediate node, and the father node of described leaf node is described intermediate node, and next level that described leaf node is corresponding is data object;
Described root node, intermediate node and leaf node include: node identification separately, node center coordinate separately, the quantity of each self-contained data object and described distance, the quantity of the data object that leaf node comprises is the quantity of data object corresponding to described leaf node, the quantity sum of the data object that the quantity of the data object that root node, intermediate node comprise comprises for all child nodes.
4. The multimedia data high-dimensional indexing and kNN retrieval method according to claim 3, characterized in that determining, according to the kNN search algorithm, the search range and the candidate index node set of the data to be retrieved in the high-dimensional index comprises:
calculating, according to the shortest-distance algorithm, the distance between the data to be retrieved and the node center of an intermediate node, the intermediate node being any intermediate node in the level immediately below the root node;
determining the search range according to the distance between the data to be retrieved and the node center of the intermediate node;
determining whether a node in the memory of the retrieval server overlaps the search range and, if so, storing the node in the candidate index node set, wherein the memory of the retrieval server holds the nodes loaded during the previous retrieval process;
determining whether the total number of data objects contained in all nodes of the candidate index node set equals a preset value and, if not, enlarging or reducing the search range by a preset ratio, determining the nodes in the high-dimensional index that overlap the enlarged or reduced search range, and storing the overlapping nodes in the candidate index node set in turn, until the total number of data objects contained in all nodes of the candidate index node set equals the preset value.
5. The multimedia data high-dimensional indexing and kNN retrieval method according to claim 3, characterized in that pruning the candidate index node set according to the search range and the distances comprised by the nodes in the candidate index node set comprises:
calculating, according to the shortest-distance algorithm, the distance from the node center of each node in the candidate index node set to the data to be retrieved;
performing a first pruning on the candidate index node set according to the distance from the node center of each node to the data to be retrieved and the distances comprised by each node;
wherein performing the first pruning on the candidate index node set comprises:
adding each child node or data object corresponding to each node in the set to the set;
calculating, according to the following formula, the lower-bound distance $d_{low}$ from the data to be retrieved to each child node or data object corresponding to each node:
$$d_{low} = d_{QC_R} - \mathrm{dist}(C_R, C_{R_i})$$
where $d_{QC_R}$ is the distance from the data to be retrieved to the node center of the node, and $\mathrm{dist}(C_R, C_{R_i})$ is the distance, comprised by the node, from the node center of the node to the node center of the child node corresponding to the node or to the data object corresponding to the node;
determining whether the lower-bound distance $d_{low}$ is greater than the sum of the radius of the search range and the distance from the node center of the node to the node center of the child node, or to the data object, farthest from the node center of the node; if so, determining that the child node or data object is not a candidate node or candidate data object, and deleting the child node or data object from the candidate index node set.
6. The multimedia data high-dimensional indexing and kNN retrieval method according to claim 5, characterized in that performing the first pruning on the candidate index node set further comprises:
if a node in the candidate index node set is a root node or an intermediate node, then
calculating, according to the following formula, the upper-bound distance $d_{up}$ from the data to be retrieved to the node center of each child node of the node:
$$d_{up} = d_{QC_R} + \mathrm{dist}(C_R, C_{R_i})$$
determining whether the upper-bound distance is less than or equal to the sum of the radius of the search range and the distance from the center of the node to the center of the child node farthest from the center of the node; if not, the child node is an undetermined candidate node, and the shortest-distance algorithm is used to calculate the distance from the node center of the undetermined candidate node to the data to be retrieved; if this distance is greater than the radius of the search range, the undetermined candidate node is not a candidate node and is deleted from the candidate index node set.
7. The multimedia data high-dimensional indexing and kNN retrieval method according to claim 5 or 6, characterized in that, after the first pruning is performed on the candidate index node set according to the distance from the center of the node to the data to be retrieved and the distances comprised by the node, the method further comprises:
performing a second pruning on the candidate index node set obtained after the first pruning;
wherein performing the second pruning on the candidate index node set comprises:
calculating, according to the shortest-distance algorithm, the distance from the node center of each child node, or from each data object, of the nodes in the candidate index node set to the data to be retrieved;
if the distance from the node center of the child node, or from the data object, to the data to be retrieved is greater than the radius of the search range, the child node or data object is not a candidate node or candidate data object, and the child node or data object is deleted from the candidate index node set.
CN201210209494.5A 2012-06-21 2012-06-21 Multi-medium data high dimensional indexing and kNN search method Active CN102999542B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210209494.5A CN102999542B (en) 2012-06-21 2012-06-21 Multi-medium data high dimensional indexing and kNN search method

Publications (2)

Publication Number Publication Date
CN102999542A true CN102999542A (en) 2013-03-27
CN102999542B CN102999542B (en) 2015-12-16

Family

ID=47928115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210209494.5A Active CN102999542B (en) 2012-06-21 2012-06-21 Multi-medium data high dimensional indexing and kNN search method

Country Status (1)

Country Link
CN (1) CN102999542B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6389424B1 (en) * 1998-10-28 2002-05-14 Electronics And Telecommunications Research Institute Insertion method in a high-dimensional index structure for content-based image retrieval
US20090157624A1 (en) * 2007-12-17 2009-06-18 Electronic And Telecommunications Research Institute System and method for indexing high-dimensional data in cluster system
CN101853304A (en) * 2010-06-08 2010-10-06 河海大学 Remote sensing image retrieval method based on feature selection and semi-supervised learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
梁俊杰等: "BC-iDistance基于位码的优化高维索引", 《小型微型计算机系统》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105051725A (en) * 2013-12-30 2015-11-11 华为技术有限公司 Graph data query method and device
US10068033B2 (en) 2013-12-30 2018-09-04 Huawei Technologies Co., Ltd. Graph data query method and apparatus
CN105051725B (en) * 2013-12-30 2018-11-20 华为技术有限公司 A kind of graph data query method and device
CN107832456A (en) * 2017-11-24 2018-03-23 云南大学 A kind of parallel KNN file classification methods based on the division of critical Value Data
CN107832456B (en) * 2017-11-24 2021-11-26 云南大学 Parallel KNN text classification method based on critical value data division
CN108460123A (en) * 2018-02-24 2018-08-28 湖南视觉伟业智能科技有限公司 High dimensional data search method, computer equipment and storage medium
CN108460123B (en) * 2018-02-24 2020-09-08 湖南视觉伟业智能科技有限公司 High-dimensional data retrieval method, computer device, and storage medium

Also Published As

Publication number Publication date
CN102999542B (en) 2015-12-16

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant