CN102999542A - Multimedia data high-dimensional indexing and k-nearest neighbor (kNN) searching method

Info

Publication number: CN102999542A
Application number: CN2012102094945A
Authority: CN
Inventors: 杜小勇; 张孝; 王珊; 李晖
Original assignee: 杜小勇
Priority date: 2012-06-21
Filing date: 2012-06-21
Publication date: 2013-03-27
Anticipated expiration: 2032-06-21
Also published as: CN102999542B

Abstract

The invention provides a multimedia data high-dimensional indexing and k-nearest neighbor (kNN) search method, comprising the following steps: building a high-dimensional index over a plurality of multimedia data, wherein the high-dimensional index comprises a plurality of nodes and the data objects of the multimedia data, and each node stores the distance from its node center to the node center of each corresponding child node or to each corresponding data object; determining, according to a kNN search algorithm, a search range for the data to be searched and a candidate index node set in the high-dimensional index; and pruning the candidate index node set according to the search range and the distances stored in each node of the candidate index node set, to obtain the search result for the data to be searched. Because the distance from each node center to the node center of the corresponding child node or to the data object is stored when the high-dimensional index is built, and the candidate index node set is pruned using these stored distances, the computational overhead of pruning during the search is effectively reduced.

Description

Multimedia data high-dimensional indexing and kNN search method

Technical field

The present invention relates to data indexing and retrieval technology, and in particular to a multimedia data high-dimensional indexing and kNN search method.

Background technology

With the growing popularity of digital entertainment devices, the volume of multimedia data such as images, video, and audio has increased substantially, and content-based retrieval of multimedia data, especially kNN (k-Nearest Neighbor) retrieval, has become increasingly important. In image retrieval, the kNN retrieval technique uses a multimedia high-dimensional index to find the images most similar to an image to be retrieved. In video retrieval, a system can use the high-dimensional visual features and high-dimensional audio features of a video segment to be retrieved, together with a high-dimensional index and kNN retrieval, to quickly find the video segments most similar to it. High-dimensional indexing of multimedia data and the associated kNN search method are therefore core technologies of multimedia data retrieval and have broad application prospects.

At present, existing high-dimensional indexing and kNN retrieval techniques for multimedia data have certain limitations, mainly the following:

1. Existing high-dimensional indexing techniques usually only reduce the input/output (I/O) overhead of kNN retrieval by building a hierarchical high-dimensional index, and do not address the large computational overhead of high-dimensional kNN retrieval.

2. Existing pruning-based kNN search methods, such as the Branch-and-Bound and INN methods, use complex pruning metrics with very large computational overhead: the distance from every candidate node or data object to the data to be retrieved is computed with the shortest-distance algorithm in order to filter and prune data. The overall pruning process is therefore very expensive.

Summary of the invention

The invention provides a multimedia data high-dimensional indexing and kNN search method, to solve the problem of large computational overhead in the prior art.

An embodiment of the invention provides a multimedia data high-dimensional indexing and kNN search method, comprising:

building a high-dimensional index over a plurality of multimedia data, wherein each multimedia datum comprises a plurality of data objects, the high-dimensional index comprises a plurality of nodes and the data objects of the plurality of multimedia data, and each node stores the distance from the node center of that node to the node center of each child node corresponding to that node or to each corresponding data object;

determining, according to a kNN search algorithm, a search range and a candidate index node set for the data to be retrieved in the high-dimensional index; and

pruning the candidate index node set according to the search range and the distances stored in each node of the candidate index node set, to obtain the retrieval result for the data to be retrieved.

As can be seen from the above, the embodiment of the invention stores, when building the high-dimensional index of the multimedia data, the distance from the node center of each node to the node center of each corresponding child node or to each corresponding data object, and uses these stored distances to prune the candidate index node set obtained by the kNN search algorithm. This avoids the prior-art need to prune every candidate node or data object with the shortest-distance algorithm, and effectively reduces the computational overhead of pruning during retrieval.

Description of drawings

Fig. 1 is a schematic flowchart of the multimedia data high-dimensional indexing and kNN search method provided by an embodiment of the invention;

Fig. 2 is a schematic diagram of a concrete example of the multimedia data high-dimensional index provided by an embodiment of the invention;

Fig. 3 is an example diagram of pruning intermediate nodes using the lower-bound distance with the multimedia data high-dimensional indexing and search method provided by an embodiment of the invention;

Fig. 4 is an example diagram of pruning intermediate nodes using the upper-bound distance with the multimedia data high-dimensional indexing and search method provided by an embodiment of the invention;

Fig. 5 is an example diagram of pruning leaf nodes using the lower-bound distance with the multimedia data high-dimensional indexing and search method provided by an embodiment of the invention.

Embodiment

Fig. 1 is a schematic flowchart of the multimedia data high-dimensional indexing and kNN search method provided by an embodiment of the invention. The method of this embodiment comprises:

Step S1: build a high-dimensional index over a plurality of multimedia data, wherein each multimedia datum comprises a plurality of data objects, the high-dimensional index comprises a plurality of nodes and the data objects of the plurality of multimedia data, and each node stores the distance from its node center to the node center of each corresponding child node or to each corresponding data object.

The multimedia data include images, audio, video, and similar data. Each multimedia datum comprises a plurality of data objects. In practice, so that multimedia data can be retrieved quickly, the multimedia data are usually represented as high-dimensional feature vectors, and the high-dimensional index is built over these feature vectors; the index comprises a plurality of nodes and the data objects of the plurality of multimedia data. As shown in Fig. 2, the high-dimensional index is hierarchical and comprises, from top to bottom, one root node layer, at least one intermediate node layer 1, one leaf node layer 2, and a data object layer 3. A node in an upper layer is the parent of nodes in the layer below, and a node in a lower layer is a child of a node in the layer above. The example index in Fig. 2 shows only one intermediate node layer 1, one leaf node layer 2, and one data object layer; the root node layer is not shown. In this embodiment, every node of the high-dimensional index stores the distance from its node center to the node center of each corresponding child node or to each corresponding data object. As shown in Fig. 2, root node R stores the distance D_R2 from the node center of R to the node center of its next-level intermediate node R2. Intermediate node R1 stores the distance D_R4 from its node center to the node center of its child node R4, and also the distance D_R5 from its node center to the node center of its child node R5. Similarly, leaf node R3 stores the distance D_I from its node center to data object I. The distances stored in each node are used for pruning in the subsequent retrieval process, which reduces the computational overhead incurred in the prior art, where pruning relies solely on the shortest-distance method.

Step S2: according to a kNN search algorithm, determine in the high-dimensional index a search range and a candidate index node set for the data to be retrieved.

Specifically, the search algorithm may be kNN retrieval. The search engine first uses the kNN search method to determine a retrieval radius for the data to be retrieved, and then takes as the search range the region centered on the data to be retrieved with that retrieval radius. The nodes in the candidate index node set are the nodes that overlap the search range.

Step S3: prune the candidate index node set according to the search range and the distances stored in each node of the candidate index node set, to obtain the retrieval result for the data to be retrieved.

The pruning process determines whether the child nodes or data objects corresponding to the nodes in the candidate index node set overlap the search range, and cuts off the child nodes or data objects that do not overlap it. In practice, to obtain the retrieval result for the data to be retrieved, the search engine needs to prune the candidate index node set several times. The number of pruning rounds determines the retrieval precision of the result.

In the multimedia data high-dimensional indexing and kNN search method provided by this embodiment, building a high-dimensional index in which every node carries distance information reduces the computational overhead of pruning during retrieval and effectively improves the retrieval efficiency for multimedia data.

Further, step S1 of the above embodiment, building the high-dimensional index of the multimedia data, may be implemented by the following steps:

Step S101: partition the feature vector data according to a data partitioning strategy to generate the high-dimensional index.

Techniques for building a high-dimensional index of multimedia data with a data partitioning strategy fall into two broad classes. The first is the Minimum Bounding Rectangle (MBR) partitioning strategy, which partitions the multimedia feature vector data into rectangles to form a hierarchical high-dimensional index. The second is the Minimum Bounding Sphere (MBS) partitioning strategy, which partitions the multimedia feature vector data into spheres to form a hierarchical high-dimensional index. Because the regions produced by the MBR strategy have smaller volume in high-dimensional space, they usually have a lower chance of overlapping one another; therefore, when building the basic multimedia high-dimensional index, the MBR partitioning strategy is preferred.
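As an illustration of the MBR strategy, the following minimal sketch computes the two rectangle corners for a group of feature vectors. The function name `bounding_rectangle` is hypothetical, and taking the node center as the midpoint of the rectangle is an assumption; the description does not state how the center is obtained.

```python
from typing import Sequence, Tuple

Vector = Tuple[float, ...]


def bounding_rectangle(vectors: Sequence[Vector]) -> Tuple[Vector, Vector, Vector]:
    """Return (LB, RU, C) for a non-empty group of feature vectors.

    LB is the per-dimension minimum (lower-left corner) and RU the
    per-dimension maximum (upper-right corner). C is the midpoint of
    the rectangle, which is an assumption made for this sketch.
    """
    if not vectors:
        raise ValueError("need at least one vector")
    dims = range(len(vectors[0]))
    lb = tuple(min(v[d] for v in vectors) for d in dims)
    ru = tuple(max(v[d] for v in vectors) for d in dims)
    center = tuple((lo + hi) / 2.0 for lo, hi in zip(lb, ru))
    return lb, ru, center
```

The same routine applies in any number of dimensions, since each dimension is handled independently.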

Step S102: compute, for each node in the high-dimensional index, the distance from its node center to the node center of each corresponding child node or to each corresponding data object, and store the distance in the node.

The high-dimensional index comprises one root node layer, at least one intermediate node layer, a leaf node layer, and a data object layer. Every node in the root node layer, the intermediate node layers, and the leaf node layer contains distance information. Specifically, after building the high-dimensional index, the index building engine computes: the distance from the node center of the root node to the node center of each next-level intermediate node of the root node; the distance from the node center of each intermediate node to the node center of each of its next-level intermediate nodes or leaf nodes; and the distance from the node center of each leaf node to each data object of that leaf node. It then stores the computed distances in the corresponding root node, intermediate node, or leaf node.
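The precomputation in step S102 can be sketched as follows. This is a minimal illustration, not the index building engine itself: the function name `stored_distances` and the dictionary layout are hypothetical, and Euclidean distance is an assumption, since the description does not name the metric.

```python
import math
from typing import Dict, Tuple

Vector = Tuple[float, ...]


def stored_distances(center: Vector, children: Dict[str, Vector]) -> Dict[str, float]:
    """Distance from a node center to each child center (or data object).

    `children` maps a child identifier to its center coordinates, or to
    the feature vector itself when the child is a data object.
    Euclidean distance is assumed for this sketch.
    """
    return {child_id: math.dist(center, point) for child_id, point in children.items()}


# Example: intermediate node R1 with child nodes R4 and R5 (coordinates invented).
dists_r1 = stored_distances((0.0, 0.0), {"R4": (3.0, 4.0), "R5": (0.0, 2.0)})
```

These values are computed once at build time, so the search phase can reuse them without touching the child coordinates.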

The high-dimensional index comprises, arranged by level from top to bottom, a root node, intermediate nodes, leaf nodes, and data objects. The child nodes of the root node are intermediate nodes, the parent of a leaf node is an intermediate node, and the next level below a leaf node consists of data objects. The root node, intermediate nodes, and leaf nodes each contain their own node identifier, node center coordinates, number of contained data objects, and the distances described above. The number of data objects contained in a leaf node is the number of data objects corresponding to that leaf node; the number contained in the root node or an intermediate node is the sum of the numbers contained in all of its child nodes. Specifically, as shown in Fig. 2, the storage area of root node R and of intermediate node R1 in the high-dimensional index holds a record of the following form:

(ID, C, LB, RU, #objects, #subregions, dists, D_fd, E_i)

E_i: (LB_i, RU_i, #objects_i, Pointer_i), where 1 ≤ i ≤ n, N_min ≤ n ≤ N_max, and n = #subregions.

Here, ID is the identifier of root node R or intermediate node R1. C is the center coordinate of root node R or intermediate node R1. LB is the first boundary of the rectangle of root node R or intermediate node R1 obtained under the MBR partitioning strategy, generally the lower-left corner coordinate of the rectangle. RU is the second boundary of that rectangle, generally its upper-right corner coordinate. #objects is the number of data objects contained in root node R or intermediate node R1. dists is the distance from the center C of root node R or intermediate node R1 to the node center of its next-level intermediate node or leaf node; if R or R1 has two or more next-level intermediate nodes or leaf nodes, dists contains the distance from the node center of R or R1 to the node center of each of them. D_fd is the distance from the node center of root node R or intermediate node R1 to the data object farthest from that node center. E_i denotes the child nodes E_1, ..., E_n (N_min ≤ n ≤ N_max) of root node R or intermediate node R1, where N_max is the upper limit on the number of intermediate nodes or leaf nodes that R or R1 may contain.

M_N is the fan-out of root node R or intermediate node R1. Each tuple E_i corresponds to one node, which may be an intermediate node or a leaf node.

E_i consists of four parts: the first boundary LB_i of the rectangle of the corresponding next-level intermediate node or leaf node of root node R or intermediate node R1 obtained under the MBR partitioning strategy, i.e., the lower-left corner coordinate of the rectangle; the second boundary RU_i of that rectangle, i.e., its upper-right corner coordinate; the number #objects_i of data objects contained in the node; and the pointer Pointer_i to the child node.

As shown in Fig. 2, the storage area of leaf node R3 in the high-dimensional index holds a record of the following form:

(ID, C, LB, RU, #objects, dists, D_fd, E_i)

E_i: (Pointer_i), where 1 ≤ i ≤ n, L_min ≤ n ≤ L_max, and n = #objects.

Here, ID is the identifier of leaf node R3. C is the center coordinate of leaf node R3. LB is the first boundary of the rectangle of leaf node R3 obtained under the MBR partitioning strategy, i.e., the lower-left corner of the rectangle. RU is the second boundary of that rectangle, i.e., its upper-right corner. #objects is the number of data objects contained in leaf node R3. dists is the distance from the center C of leaf node R3 to the data object corresponding to leaf node R3; if leaf node R3 has two or more corresponding data objects, dists contains the distance from the center C of leaf node R3 to every one of them. E_i denotes the data objects E_1, ..., E_n (L_min ≤ n ≤ L_max) corresponding to leaf node R3, where L_max is the upper limit on the number of data objects that leaf node R3 may contain.

M_L is the fan-out of leaf node R3. Each E_i contains the pointer to the corresponding data object.
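The two record layouts can be mirrored in code roughly as follows. This is only a sketch of the fields listed above: the class names are hypothetical, and representing pointers and identifiers as integers is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vector = Tuple[float, ...]


@dataclass
class ChildEntry:
    """One E_i tuple of a root or intermediate node."""
    lb: Vector          # LB_i: lower-left corner of the child rectangle
    ru: Vector          # RU_i: upper-right corner of the child rectangle
    num_objects: int    # #objects_i
    pointer: int        # Pointer_i (an integer node id is an assumption)


@dataclass
class InternalNode:
    """Record of a root or intermediate node."""
    node_id: int        # ID
    center: Vector      # C
    lb: Vector          # LB
    ru: Vector          # RU
    num_objects: int    # #objects: sum over all child nodes
    dists: List[float]  # center-to-child-center distances, one per entry
    d_fd: float         # D_fd: distance to the farthest data object
    entries: List[ChildEntry] = field(default_factory=list)

    @property
    def num_subregions(self) -> int:
        # #subregions equals n, the number of E_i entries
        return len(self.entries)


@dataclass
class LeafNode:
    """Record of a leaf node; each entry points to one data object."""
    node_id: int
    center: Vector
    lb: Vector
    ru: Vector
    dists: List[float]  # center-to-object distances, one per entry
    d_fd: float
    entries: List[int] = field(default_factory=list)

    @property
    def num_objects(self) -> int:
        # for a leaf node, n = #objects
        return len(self.entries)
```

Deriving `#subregions` and the leaf `#objects` from the entry list keeps the counts consistent with the constraint n = #subregions (or n = #objects) stated for each record.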

Further, step S2 of the above embodiment, determining in the high-dimensional index a search range and a candidate index node set for the data to be retrieved according to a kNN search algorithm, may be implemented by the following steps:

Step S201: according to the shortest-distance algorithm, compute the distance between the data to be retrieved and the node center of an intermediate node, the intermediate node being any intermediate node in the level immediately below the root node.

In the actual retrieval process, the search engine selects any child node of the root node in the high-dimensional index, i.e., an intermediate node, and computes the distance from the center of the selected intermediate node to the data to be retrieved. The purpose of selecting an intermediate node in this step is to avoid missed results.

Step S202: determine the search range according to the distance between the data to be retrieved and the node center of the intermediate node.

Specifically, from the distance computed in the previous step, the search engine determines the search range centered on the data to be retrieved, with a radius equal to the distance between the data to be retrieved and the node center of the intermediate node.

Step S203: determine whether each node in the memory of the retrieval server overlaps the search range; if so, store the node in the candidate index node set.

In general, the memory of the retrieval server holds the nodes loaded during the previous retrieval. The memory holds at least one node; in the worst case it holds only the root node. By examining the nodes already in memory first, the search engine avoids the cost of fetching those nodes from the high-dimensional index again, which reduces the input overhead of retrieval. Specifically, the search engine computes the distance from the node center of each node in memory to the data to be retrieved; if the distance is less than the retrieval radius of the search range, the node overlaps the search range. Nodes that overlap the search range are stored in the candidate index node set.
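Steps S201 to S203 might be sketched as below. The function names and the dictionary of cached node centers are hypothetical, and Euclidean distance is assumed; the overlap test is the center-distance comparison described for step S203.

```python
import math
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]


def initial_radius(query: Vector, intermediate_center: Vector) -> float:
    """Steps S201-S202: the radius is the distance from the data to be
    retrieved to the center of the chosen intermediate node."""
    return math.dist(query, intermediate_center)


def overlapping_cached_nodes(query: Vector, radius: float,
                             cached_centers: Dict[str, Vector]) -> List[str]:
    """Step S203: keep the in-memory nodes whose center lies strictly
    within the retrieval radius of the data to be retrieved."""
    return [node_id for node_id, center in cached_centers.items()
            if math.dist(query, center) < radius]
```

Only nodes already held in memory are examined here, which matches the stated aim of avoiding repeated reads from the index.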

Step S204: determine whether the total number of data objects contained in all nodes of the candidate index node set equals a preset value; if not, enlarge or shrink the search range by a preset ratio, determine the nodes in the high-dimensional index that overlap the enlarged or shrunken search range, and store the overlapping nodes in the candidate index node set in turn, until the total number of data objects contained in all nodes of the candidate index node set equals the preset value.

Specifically, from the number of data objects contained in each node of the candidate index node set, the search engine computes the total number of data objects in the candidate index node set. The search engine then determines whether this total equals the parameter K of the kNN retrieval. The parameter K is a preset value that can be set manually based on retrieval experience.

If the total number of data objects in the candidate index node set equals K, the current search range and candidate index node set are the search range and candidate index node set finally determined by the search engine.

If the total number of data objects in the candidate index node set does not equal K, there are two cases. Case 1: the total is less than K. The search engine enlarges the search range by the preset ratio to obtain a new search range. Using the shortest-distance algorithm, i.e., the MinDist computation, it then determines which nodes in the high-dimensional index have a center whose distance to the data to be retrieved is less than the radius of the enlarged search range, stores those nodes in the candidate index node set, and repeats this step. Case 2: the total is greater than K. The search engine shrinks the search range by the preset ratio to obtain a new search range. Using the shortest-distance algorithm, i.e., the MinDist computation, it then determines which nodes in the high-dimensional index have a center whose distance to the data to be retrieved is less than the radius of the shrunken search range, stores those nodes in the candidate index node set, and repeats this step. Through these steps, the search engine eventually obtains a candidate index node set whose contained data objects total exactly K, together with the final search range obtained after shrinking or enlarging.
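The adjustment loop of step S204 can be sketched as follows, under several assumptions that go beyond the description: the candidate set is recomputed from scratch in every round rather than updated incrementally, the loop is capped at `max_rounds` so the sketch always terminates, and the names, the `ratio` default, and the `nodes` layout are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]


def candidates_within(query: Vector, radius: float,
                      nodes: Dict[str, Tuple[Vector, int]]) -> List[str]:
    """Nodes whose center is closer to the query than the radius."""
    return [nid for nid, (center, _count) in nodes.items()
            if math.dist(query, center) < radius]


def adjust_search_range(query: Vector, radius: float,
                        nodes: Dict[str, Tuple[Vector, int]],
                        k: int, ratio: float = 2.0,
                        max_rounds: int = 32) -> Tuple[float, List[str]]:
    """Enlarge or shrink the radius until the candidates hold exactly k objects.

    `nodes` maps a node id to (center, number of contained data objects).
    The round cap is an assumption added for this sketch.
    """
    for _ in range(max_rounds):
        candidates = candidates_within(query, radius, nodes)
        total = sum(nodes[nid][1] for nid in candidates)
        if total == k:
            return radius, candidates
        # Case 1: too few objects, enlarge; case 2: too many, shrink.
        radius = radius * ratio if total < k else radius / ratio
    return radius, candidates_within(query, radius, nodes)
```

With a fixed ratio the total may oscillate around K without ever matching it, which is why this sketch bounds the number of rounds.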

Further, step S3 of the above embodiment, pruning the candidate index node set according to the search range and the distances stored in the nodes of the candidate index node set to obtain the retrieval result for the data to be retrieved, comprises:

Step S301: according to the shortest-distance algorithm, compute the distance from the node center of each node in the candidate index node set to the data to be retrieved.

Step S302: perform a first pruning of the candidate index node set according to the distance from the node center of each node to the data to be retrieved and the distances stored in each node.

The first pruning of the candidate index node set comprises:

Step S3021: add the child nodes or data objects corresponding to each node in the set to the set.

Step S3022: compute, according to the following formula, the lower-bound distance d_low from the data to be retrieved to the node center of each child node, or to each data object, corresponding to each node:

d_{low} = d_{QC_{R}} - dist(C_{R}, C_{R_{i}})

where d_{QC_{R}} is the distance from the data to be retrieved to the center of the node, and dist(C_{R}, C_{R_{i}}) is the distance, stored in the node, from the node center of the node to the node center of the corresponding child node or to the corresponding data object.

Step S3023: determine whether the lower-bound distance d_low is greater than the sum of the radius of the search range and the distance from the node center of the node to the child-node center or data object farthest from that node center; if so, the child node or data object is not a candidate node or candidate data object, and it is deleted from the candidate index node set.
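The lower-bound test of steps S3022 and S3023 can be sketched as follows. The function names and the `children` layout are hypothetical. For the farthest-object term, the sketch uses the D_fd value of each child, following the worked example given for Fig. 3; that reading is an assumption.

```python
from typing import Dict, List, Tuple


def lower_bound(d_q_center: float, stored_dist: float) -> float:
    """d_low = d_QC_R - dist(C_R, C_Ri), using only the stored distance."""
    return d_q_center - stored_dist


def prune_by_lower_bound(d_q_center: float, radius: float,
                         children: Dict[str, Tuple[float, float]]) -> List[str]:
    """Return the ids of children that survive the lower-bound test.

    `children` maps a child id to (stored_dist, d_fd). A child is removed
    when d_low > radius + d_fd, i.e. when even the most optimistic
    distance estimate places it outside the search range.
    """
    survivors = []
    for child_id, (stored_dist, d_fd) in children.items():
        if lower_bound(d_q_center, stored_dist) > radius + d_fd:
            continue  # cannot overlap the search range: pruned
        survivors.append(child_id)
    return survivors
```

Each test costs one subtraction and one comparison, compared with a full high-dimensional distance computation per child in the shortest-distance approach.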

Further, if a node in the candidate index node set is the root node or an intermediate node, the first pruning of the candidate index node set also comprises:

Step S3023: compute, according to the following formula, the upper-bound distance d_up from the data to be retrieved to the node center of each child node of the node:

d_{up} = d_{QC_{R}} + dist(C_{R}, C_{R_{i}})

Step S3024: determine whether the upper-bound distance is less than or equal to the sum of the radius of the search range and the distance from the center of the node to the center of the child node farthest from it. If not, the child node is an uncertain candidate node, and the shortest-distance algorithm is used to compute the distance from the node center of the uncertain candidate node to the data to be retrieved; if this distance is greater than the radius of the search range, the uncertain candidate node is not a candidate node and is deleted from the candidate index node set.
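A minimal sketch of the upper-bound test with its fallback is given below. The function name is hypothetical, the farthest-object term is taken to be the D_fd value of the child (an assumption based on the worked example for Fig. 4), and the exact distance is passed as a callable so that it is evaluated only for uncertain candidate nodes.

```python
from typing import Callable


def is_candidate_child(d_q_center: float, stored_dist: float, d_fd: float,
                       radius: float, exact_distance: Callable[[], float]) -> bool:
    """Decide whether a child node stays in the candidate index node set.

    d_up = d_QC_R + stored_dist. If d_up <= radius + d_fd the child is
    accepted without any further distance computation. Otherwise it is an
    uncertain candidate: the exact distance is computed and the child is
    kept only if that distance does not exceed the radius.
    """
    d_up = d_q_center + stored_dist
    if d_up <= radius + d_fd:
        return True
    return exact_distance() <= radius
```

Passing the exact computation lazily makes the saving visible: confirmed children never trigger the expensive call.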

In the prior art, data filtering and pruning require the computationally expensive shortest-distance algorithm to compute the distance from the node center of every child node, or from every data object, of all nodes in the candidate index node set to the data to be retrieved. By contrast, the pruning provided by the embodiment of the invention is decided with low-overhead formulas, so the computational overhead is markedly lower and retrieval efficiency is effectively improved.

After the first pruning, the candidate index node set may still contain nodes that the first pruning did not process. Therefore, after the first pruning of the candidate index node set, the method further comprises:

Step S303: perform a second pruning of the candidate index node set.

The second pruning of the candidate index node set comprises:

Step S3031: according to the shortest-distance algorithm, compute the distance from the node center of each child node, or from each data object, of the nodes in the candidate index node set to the data to be retrieved.

Step S3032: if the distance from the node center of the child node, or from the data object, to the data to be retrieved is greater than the radius of the search range, the child node or data object is not a candidate node or candidate data object, and it is deleted from the candidate index node set.

In actual retrieval, to further improve the accuracy of the retrieval result so that it contains the N nodes or data objects nearest to the data to be retrieved (N can be set manually), the search engine needs to apply the above pruning process to the candidate index node set several times. For example, suppose the high-dimensional index built over the multimedia data has four layers: one root node layer, one intermediate node layer, one leaf node layer, and one data object layer. The search engine determines the candidate index node set for the data to be retrieved in the high-dimensional index according to the kNN search algorithm. Suppose the candidate index node set contains several intermediate nodes. The search engine performs a first round of pruning on the candidate index node set according to the search range and the distances stored in the intermediate nodes of the set. The first round comprises the first pruning and the second pruning of the above embodiment. After the first round, the intermediate nodes in the candidate index node set have all been replaced by leaf nodes in the leaf layer below them. The search engine then performs a second round of pruning on the candidate index node set. After the second round, the leaf nodes in the candidate index node set have all been replaced by data objects in the data object layer below them. The candidate index node set after these two rounds of pruning, which consists of data objects, is the retrieval result for the data to be retrieved.
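The round-by-round descent can be sketched as follows. This is a simplified illustration rather than the search engine of the embodiment: the `Node` class and function names are hypothetical, Euclidean distance is assumed, and data objects are modelled as nodes with `d_fd = 0`, so the upper-bound test is also applied to them as a cheap sufficient check even though the description applies it only to child nodes.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Vector = Tuple[float, ...]


@dataclass
class Node:
    center: Vector
    d_fd: float = 0.0                                  # 0 for a data object
    children: List["Node"] = field(default_factory=list)
    dists: List[float] = field(default_factory=list)   # center -> child center


def prune_round(query: Vector, radius: float, candidates: List[Node]) -> List[Node]:
    """Replace each candidate by those of its children that survive pruning."""
    survivors = []
    for node in candidates:
        d_qc = math.dist(query, node.center)   # one exact distance per parent
        for child, dist in zip(node.children, node.dists):
            if d_qc - dist > radius + child.d_fd:
                continue                       # lower-bound test: pruned
            if d_qc + dist <= radius + child.d_fd:
                survivors.append(child)        # upper-bound test: confirmed
            elif math.dist(query, child.center) <= radius:
                survivors.append(child)        # exact check for uncertain child
    return survivors


def search(query: Vector, radius: float, candidates: List[Node], rounds: int) -> List[Node]:
    """Apply one pruning round per index level below the candidates."""
    for _ in range(rounds):
        candidates = prune_round(query, radius, candidates)
    return candidates
```

Starting from intermediate nodes in the four-layer example, two rounds are needed: the first yields leaf nodes and the second yields data objects.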

The pruning of the candidate index node set in this embodiment is further described below with a concrete example.

Suppose the nodes in the candidate index node set are intermediate nodes.

First, compute the upper-bound distance and the lower-bound distance from each child node of the node to the data to be retrieved.

Particularly, as shown in Figure 3, Q is data to be retrieved, Q _rRetrieval radius for the range of search of Q.R is described node.The child node of R comprises R ₁, R ₂, R ₃And R ₄ Comprise: the node R center C _RTo R ₁Center C _R1Distance, the node R center C _RTo R ₂Center C _R2Distance, the node R center C _RTo R ₃Center C _R3Distance and node R center C _RTo R ₄Center C _R4Distance.According to described bee-line algorithm, calculate Q to the node R center C _RDistance

Then, according to the triangle inequality, the upper bound distance d_{up}(Q, R_i) and the lower bound distance d_{low}(Q, R_i) from Q to R_i are calculated, that is:

d_{low}(Q, R_i) = d_{QC_R} - dist(C_R, C_{R_i})

d_{up}(Q, R_i) = d_{QC_R} + dist(C_R, C_{R_i})
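
The two bounds above can be illustrated with a minimal sketch. The function names `dist` and `bounds` and the 2-D coordinates are hypothetical, chosen only for illustration, and Euclidean distance is assumed; the sketch also checks that the actual distance from Q to the child center falls between the two bounds.

```python
import math

def dist(a, b):
    """Euclidean distance between two points of equal dimension (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def bounds(d_q_cr, stored_dist):
    """Return (d_low, d_up) for one child R_i.

    d_q_cr      -- d_{QC_R}, computed once per parent node R
    stored_dist -- dist(C_R, C_{R_i}), read from node R, not recomputed
    """
    return d_q_cr - stored_dist, d_q_cr + stored_dist

# Hypothetical example values.
q = (0.0, 0.0)        # data to be retrieved, Q
c_r = (3.0, 4.0)      # center of node R
c_r1 = (3.0, 6.0)     # center of child R_1

d_q_cr = dist(q, c_r)          # the only distance computed at query time
stored = dist(c_r, c_r1)       # would be stored in R when the index is built
d_low, d_up = bounds(d_q_cr, stored)
actual = dist(q, c_r1)         # computed here only to check the bounds
```

In this sketch the per-child work at query time is one subtraction and one addition, which is the saving the stored distances are meant to provide.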

Then, for each of the nodes R_1, R_2, R_3 and R_4, it is judged whether d_{low}(Q, R_i) satisfies:

d_{low}(Q, R_i) > Q_r + D_{fdR_i}

Here, D_{fdR_i} is the distance from the center C_{R_i} of R_i to the data object P_{fd} that is farthest from C_{R_i}. If the condition is satisfied, even the minimum distance from Q to R_i is greater than the sum of the query radius Q_r and the distance from C_{R_i} to its farthest data object P_{fd}; that is, R_i does not overlap the search range of Q, and R_i is deleted from the candidate index node set. By this judgment based on d_{low}(Q, R_i), R_1, R_2 and R_4 are found not to overlap the search range of Q, so the search engine determines at minimal computational cost that R_1, R_2 and R_4 are not candidate nodes. The search engine thereby avoids the prior-art overhead of performing a minimum distance calculation for every node.

Subsequently, for the child nodes of the nodes remaining in the candidate index node set, it is judged whether d_{up} satisfies:

d_{up}(Q, R_i) ≤ Q_r + D_{fdR_i}

If the condition is satisfied, even the maximum distance from Q to R_i is less than or equal to the sum of the query radius Q_r and the distance from the center C_{R_i} of R_i to the data object P_{fd} farthest from C_{R_i}; that is, R_i overlaps the search range of Q. Through this judgment the search engine therefore determines, at minimal computational cost, that R_i is a candidate node, and adds R_i to the candidate index node set. As shown in Figure 4, by this judgment based on d_{up}(Q, R_i), R_1 and R_3 are determined to overlap the search range of Q, and R_1 and R_3 are therefore added to the candidate index node set.
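
The lower bound test and the upper bound test can be combined into a three-way classification of each child node, sketched below. The function name `classify_child`, the labels it returns and all numeric values are hypothetical; only the two inequalities are taken from the description.

```python
def classify_child(d_q_cr, stored_dist, q_r, d_fd):
    """Classify child R_i using only stored distances and d_{QC_R}.

    d_q_cr      -- d_{QC_R}, distance from Q to the center of the parent R
    stored_dist -- dist(C_R, C_{R_i}), stored in R
    q_r         -- retrieval radius Q_r
    d_fd        -- D_{fdR_i}, distance from C_{R_i} to its farthest data object
    """
    d_low = d_q_cr - stored_dist
    d_up = d_q_cr + stored_dist
    threshold = q_r + d_fd
    if d_low > threshold:
        return "pruned"        # cannot overlap the search range
    if d_up <= threshold:
        return "candidate"     # certainly overlaps the search range
    return "undetermined"      # left for the second pruning (MinDist)

# Hypothetical values: d_{QC_R} = 5.0 and Q_r = 2.0 for all children.
labels = [
    classify_child(5.0, 0.5, 2.0, 1.0),   # d_low = 4.5 > 3.0
    classify_child(5.0, 1.0, 2.0, 4.5),   # d_up = 6.0 <= 6.5
    classify_child(5.0, 3.0, 2.0, 1.5),   # d_low = 2.0, d_up = 8.0, threshold = 3.5
]
```

Only children that end up as "undetermined" require an explicit minimum distance calculation.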

Finally, if the candidate index node set still contains intermediate nodes for which the above judgments could not determine whether their next-level child nodes overlap the search range of the data to be retrieved, a second pruning must be performed on those intermediate nodes in the candidate index node set.

Specifically, the search engine uses an existing shortest distance algorithm to calculate the MinDist distance from each child node of the intermediate node to the data to be retrieved. If the calculated MinDist(Q, R_i) is less than the retrieval radius Q_r of the data to be retrieved, child node R_i overlaps the search range of the data to be retrieved, and the search engine adds R_i to the candidate index node set.
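
The second pruning can be sketched as follows. The description does not define MinDist, so this sketch assumes spherical nodes and takes MinDist(Q, R_i) as the distance from Q to the center of R_i minus the radius of R_i, floored at zero, with Euclidean distance. The names `min_dist` and `second_pruning`, the tuple layout and all values are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two points of equal dimension (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def min_dist(q, center, radius):
    """Assumed MinDist for a spherical node: 0 if Q lies inside the sphere."""
    return max(0.0, dist(q, center) - radius)

def second_pruning(q, q_r, children):
    """Keep the children whose MinDist to Q is less than the retrieval radius.

    children -- list of (name, center, radius) tuples for undetermined nodes
    """
    return [name for name, center, radius in children
            if min_dist(q, center, radius) < q_r]

# Hypothetical undetermined children of one intermediate node.
q = (0.0, 0.0)
q_r = 2.0
children = [
    ("A", (3.0, 4.0), 3.5),   # MinDist = 5.0 - 3.5 = 1.5, kept
    ("B", (6.0, 8.0), 1.0),   # MinDist = 10.0 - 1.0 = 9.0, dropped
]
kept = second_pruning(q, q_r, children)
```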

After the above pruning process, the candidate index node set that previously consisted of intermediate nodes has been converted into a candidate index node set consisting of the child nodes of those intermediate nodes. Such a child node may be an intermediate node in the next level below the intermediate node, or a leaf node.

Further, if the node in the candidate index node set is a root node, the above steps may likewise be used for pruning. After this pruning, a candidate index node set consisting of intermediate nodes is obtained.

(2) If the node in the candidate index node set is a leaf node.

As in the pruning process for intermediate nodes described above, the distance from the data to be retrieved to the leaf node is calculated first. Then, according to the triangle inequality, the lower bound distance from the data to be retrieved to each data object corresponding to the leaf node is obtained. It should be noted that, unlike the pruning process for intermediate nodes, the first pruning of a leaf node only needs to judge whether the lower bound distance of each data object corresponding to the leaf node satisfies: lower bound distance > retrieval radius of the data to be retrieved. If the condition is satisfied, the data object lies outside the search range of the data to be retrieved. As shown in Figure 5, data objects P2, P3, P4 and P6 all lie outside the search range of the data to be retrieved. Through the lower bound judgment, the search engine thus determines at minimal computational cost that P2, P3, P4 and P6 are not candidate data objects, and thereby avoids the prior-art overhead of performing a minimum distance calculation for every data object.

In this step, the pruning of leaf nodes applies only the lower bound distance to filter out unnecessary input/output overhead. The reason is that, even if the search engine applied the upper bound distance, it could only conclude that a given data object lies within the search range. To decide whether that data object is one of the k objects nearest to the data to be retrieved, the distance from each data object to the data to be retrieved would still have to be calculated according to the shortest distance algorithm. Applying the upper bound judgment at the leaf node layer therefore serves no purpose, and only the lower bound distance is used when pruning leaf nodes.
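
The lower-bound-only pruning of a leaf node can be sketched as below. The function name `prune_leaf`, the dictionary layout and all numeric values are hypothetical; the values were chosen so that the outcome mirrors the P2, P3, P4 and P6 example and are not taken from Figure 5.

```python
def prune_leaf(d_q_c, q_r, stored):
    """Return ids of data objects that survive the lower bound test.

    d_q_c  -- distance from Q to the center of the leaf node
    q_r    -- retrieval radius Q_r
    stored -- {object id: stored distance from the leaf center to the object}
    An object is dropped when d_q_c - stored distance > q_r.
    """
    return [oid for oid, d in stored.items() if d_q_c - d <= q_r]

# Hypothetical leaf node with six data objects.
stored = {"P1": 4.0, "P2": 1.0, "P3": 2.0, "P4": 0.5, "P5": 3.5, "P6": 2.5}
survivors = prune_leaf(5.0, 2.0, stored)
```

The survivors are not yet results: their actual distances to Q still have to be calculated, which is why an upper bound test would add nothing at this layer.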

Further, if after the above pruning process the number of data objects in the candidate index node set is greater than a preset candidate value, the search engine further calculates the distance dist(Q, P_i) from the data to be retrieved to each data object P_i in the candidate index node set and judges whether P_i is within the search range of the data to be retrieved. If the distance from P_i to the data to be retrieved is less than the distance from the farthest data object in the candidate index node set to the data to be retrieved, that farthest data object in the candidate index node set is replaced with P_i.
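
The replacement of the farthest data object can be sketched with a bounded max-heap. Using Python's `heapq` with negated distances is an implementation choice of this sketch, not something stated in the description; the function name `refine_k_nearest` and the example distances are hypothetical.

```python
import heapq

def refine_k_nearest(distances, k):
    """Keep the k objects nearest to Q.

    distances -- iterable of (object id, dist(Q, P_i)) pairs
    Returns a list of (distance, object id) sorted by increasing distance.
    """
    heap = []  # entries are (-distance, id), so heap[0] is the farthest kept object
    for oid, d in distances:
        if len(heap) < k:
            heapq.heappush(heap, (-d, oid))
        elif d < -heap[0][0]:
            # P_i is nearer than the farthest kept object: replace that object.
            heapq.heapreplace(heap, (-d, oid))
    return sorted((-neg_d, oid) for neg_d, oid in heap)

# Hypothetical exact distances of the objects that survived pruning.
result = refine_k_nearest(
    [("P1", 1.0), ("P5", 1.5), ("P7", 0.5), ("P8", 3.0)], k=2
)
```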

Through the above pruning process, the search result for the data to be retrieved is obtained.

A person of ordinary skill in the art will understand that all or some of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the above method embodiments. The storage medium includes various media capable of storing program code, such as ROM, RAM, a magnetic disk or an optical disc.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A multimedia data high-dimensional indexing and kNN search method, characterized by comprising:

2. The multimedia data high-dimensional indexing and kNN search method according to claim 1, characterized in that said building of the high-dimensional index of the retrieval data comprises:

Partitioning the multimedia data according to a data partitioning strategy to generate the high-dimensional index;

Calculating the distance from the node center of each node in the high-dimensional index to the node center of the child node, or to the data object, corresponding to said node, and storing said distance in said node.

3. The multimedia data high-dimensional indexing and kNN search method according to claim 1 or 2, characterized in that the high-dimensional index comprises a root node, intermediate nodes, leaf nodes and data objects arranged hierarchically from top to bottom; wherein,

The child nodes corresponding to the root node are intermediate nodes, the parent node of a leaf node is an intermediate node, and the next level corresponding to a leaf node consists of data objects;

The root node, the intermediate nodes and the leaf nodes each comprise: a node identifier, node center coordinates, the number of data objects contained, and said distance; the number of data objects contained in a leaf node is the number of data objects corresponding to that leaf node, and the number of data objects contained in the root node or an intermediate node is the sum of the numbers of data objects contained in all of its child nodes.

4. The multimedia data high-dimensional indexing and kNN search method according to claim 3, characterized in that said determining, according to the kNN search algorithm, of the search range and the candidate index node set of the data to be retrieved in the high-dimensional index comprises:

According to the shortest distance algorithm, calculating the distance between the data to be retrieved and the node center of an intermediate node, said intermediate node being any intermediate node in the next level below the root node;

Determining the search range according to the distance between the data to be retrieved and the node center of said intermediate node;

Judging whether a node in the memory of the retrieval server overlaps the search range and, if so, storing said node in the candidate index node set, wherein the memory of the retrieval server stores the nodes loaded during the previous retrieval process;

Judging whether the total number of data objects contained in all nodes in the candidate index node set equals a preset value and, if not, enlarging or reducing the search range according to a preset ratio, determining the nodes in the high-dimensional index that overlap the enlarged or reduced search range, and storing the overlapping nodes in the candidate index node set in turn, until the total number of data objects contained in all nodes in the candidate index node set equals the preset value.

5. The multimedia data high-dimensional indexing and kNN search method according to claim 3, characterized in that said pruning of the candidate index node set according to the search range and the distances contained in the nodes in the candidate index node set comprises:

According to the shortest distance algorithm, calculating the distance from the node center of each node in the candidate index node set to the data to be retrieved;

Performing a first pruning on the candidate index node set according to the distance from the node center of each node to the data to be retrieved and the distance contained in each node;

Wherein, performing the first pruning on the candidate index node set comprises:

Adding to the set each child node or data object corresponding to each node in the set;

According to the following formula, calculating the lower bound distance d_{low} from the data to be retrieved to the child node or data object corresponding to each node:

d_{low} = d_{QC_R} - dist(C_R, C_{R_i})

Wherein, d_{QC_R} is the distance from the data to be retrieved to the node center of said node, and dist(C_R, C_{R_i}) is the distance, contained in said node, from the node center of said node to the node center of the child node, or to the data object, corresponding to said node;

Judging whether the lower bound distance d_{low} is greater than the sum of the radius of the search range and the distance from the node center of said node to the node center of the child node, or to the data object, farthest from the node center of said node; if so, determining that said child node or said data object is not a candidate node or candidate data object, and deleting said child node or said data object from the candidate index node set.

6. The multimedia data high-dimensional indexing and kNN search method according to claim 5, characterized in that performing the first pruning on the candidate index node set further comprises:

If the node in the candidate index node set is a root node or an intermediate node, then

According to the following formula, calculating the upper bound distance d_{up} from the data to be retrieved to the node center of the child node of said node:

d_{up} = d_{QC_R} + dist(C_R, C_{R_i})

Judging whether the upper bound distance is less than or equal to the sum of the radius of the search range and the distance from the center of said node to the center of the child node farthest from the center of said node; if not, said child node is an undetermined candidate node, and the shortest distance algorithm is used to calculate the distance from the node center of the undetermined candidate node to the data to be retrieved; if this distance is greater than the radius of the search range, the undetermined candidate node is not a candidate node and is deleted from the candidate index node set.

7. The multimedia data high-dimensional indexing and kNN search method according to claim 5 or 6, characterized in that, after the first pruning is performed on the candidate index node set according to the distance from the center of said node to the data to be retrieved and the distance contained in said node, the method further comprises:

Performing a second pruning on the candidate index node set obtained after the first pruning;

Wherein, performing the second pruning on the candidate index node set comprises:

According to the shortest distance algorithm, calculating the distance from the node center of each child node of the nodes in the candidate index node set, or from each data object, to the data to be retrieved;

If the distance from the node center of said child node, or from said data object, to the data to be retrieved is greater than the radius of the search range, said child node or data object is not a candidate node or candidate data object, and said child node or data object is deleted from the candidate index node set.