CN102831225A - Multi-dimensional index structure under cloud environment, construction method thereof and similarity query method

Info

Publication number: CN102831225A
Application number: CN2012103076075A
Authority: CN
Inventors: 程春玲; 孙春菊; 张登银; 徐小龙
Original assignee: Nanjing Post and Telecommunication University
Current assignee: Nanjing Post and Telecommunication University; Nanjing University of Posts and Telecommunications
Priority date: 2012-08-27
Filing date: 2012-08-27
Publication date: 2012-12-19

Abstract

The invention discloses a multi-dimensional index structure for a cloud environment, a method for constructing it, and a similarity query method. The index structure comprises a global index and local indexes located on the respective storage nodes. The cloud environment uses an overlay network to organize the storage nodes. Each local index is the clustering result obtained by clustering the approximate vectors of all vector data on the storage node where it resides. The global index consists of the cluster center information of all local indexes, published to the whole overlay network together with the addresses of the storage nodes holding those cluster centers. The index structure reduces index storage space and resource consumption and effectively supports multi-dimensional data indexing and similarity queries in a cloud environment. Because the clustering information obtained from the approximate vectors serves as the local index, a query on a local index only needs to search the classes identified through the cluster center information rather than scan all approximate vectors, which improves query efficiency.

Description

Multi-dimensional index structure in a cloud environment, construction method thereof, and similarity query method

Technical field

The present invention relates to a multi-dimensional indexing method in a cloud computing environment, and in particular to a multi-dimensional indexing method that supports similarity queries in a cloud environment and a construction method thereof. It belongs to the technical field of computer information retrieval.

Background art

With the growing popularity of the Internet and the rapid development of information technology, the volume of Internet data has expanded sharply, and storing and managing massive data has become an urgent challenge. Cloud computing emerged in response and has brought a new service model to users and enterprises. Several cloud computing systems are already in successful use, such as Amazon's Elastic Compute Cloud (EC2), IBM's Blue Cloud, and Google's cloud computing platform. These systems comprise large numbers of compute nodes, store massive amounts of data, and support applications such as large-scale data processing and data retrieval.

In cloud computing systems, data storage mostly relies on an underlying distributed file system (DFS) to manage data, and massive data is processed in a key-value fashion. For example, in Google's YouTube, videos are stored as key-value pairs: the key is a unique video id, and the value includes the video name, upload time, view count, and so on. This storage scheme is well suited to lookups by key, but it cannot efficiently perform complex queries such as range queries or similarity queries, and therefore has difficulty supporting the personalized, on-demand retrieval expected of cloud computing. To address this problem, some publications propose building tree indexes to support range queries over data of different dimensionality.

Existing indexing methods for cloud environments can be divided into one-dimensional and multi-dimensional indexes according to the dimensionality of the data they target. For one-dimensional data, a one-dimensional index such as a B+ tree index or a hash index can be built. These methods all use a two-level index scheme: local indexes are built on the physical nodes that actually store the data, and a global index is then built on the server side. When a user issues a query, the server-side global index locates the relevant local index, and the query is then carried out on the local data. This significantly reduces query time, but these methods cannot index multi-dimensional data efficiently. Index structures for multi-dimensional data also use two-level indexes, mainly building a tree index on each local node and using a structured overlay network to organize the compute nodes.

The shortcoming of the above indexing methods is that they cannot support similarity queries efficiently. Moreover, as data dimensionality grows, the query performance of tree index structures degrades rapidly, a problem known as the "curse of dimensionality".

In traditional distributed computing, Weber R. et al. proposed quantizing and compressing high-dimensional data to improve retrieval efficiency, and implemented the VA-File (Vector Approximation File). The VA-File has two notable advantages: (1) the index file is much smaller than the original file, so the disk I/O cost of a sequential scan drops significantly; (2) computational complexity is reduced. Its shortcoming is that a query must first scan all of the quantized approximate vectors, which is relatively costly when the data volume is large. To address the curse of dimensionality, Dong Daoguo et al. proposed a new index structure, the VAR-Tree, which combines the VA-File with the R-Tree: the R-Tree manages and organizes the approximate data in the VA-File, and existing similarity search algorithms for the R-Tree family are used to query the VAR-Tree, improving retrieval performance. A Chinese invention patent (application number 03129687.4, granted publication number CN1477563A) discloses Ordered VA-File, a fast similarity retrieval method for high-dimensional vector data. It sorts and reorganizes the approximate vectors in the VA-File so that data clustered together in high-dimensional space are stored at adjacent positions in the file as far as possible, and it adaptively divides the Ordered VA-File into a number of classes according to the application, with the data of each class stored contiguously in the file. At query time, only the few classes nearest to the query vector are selected for query processing, which improves query efficiency. However, that method is computationally expensive, and its performance depends on the number of classes selected at query time, for which no concrete calculation is given.

Summary of the invention

The technical problem to be solved by the present invention is to overcome the inability of existing cloud-environment index structures to support similarity queries efficiently, by providing a multi-dimensional indexing method for a cloud environment and a construction method thereof that enable efficient similarity queries in a cloud environment.

The present invention adopts the following technical solution to solve the above technical problem:

A multi-dimensional index structure in a cloud environment comprises a global index and local indexes located on the respective storage nodes. The cloud environment uses an overlay network to organize the storage nodes. Each local index is the clustering result obtained by clustering the approximate vectors of all vector data on the storage node where it resides. The global index consists of the cluster center information of all local indexes, published to the whole overlay network together with the address of the storage node where each cluster center resides.

The method for constructing the multi-dimensional index structure in a cloud environment described above comprises the following steps:

Step 1: use the VA-File method to quantize and compress the original vector data stored on each storage node, obtaining the approximate vector set of each storage node;

Step 2: cluster the approximate vector set of each storage node separately; the clustering result of each storage node is the local index of that storage node;

Step 3: extract the cluster center information from all local indexes, and publish it to the whole overlay network together with the address of the storage node where each cluster center resides, forming the global index.
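The two-level structure produced by these three steps can be pictured with a small data-structure sketch. The class and function names below (`LocalIndex`, `GlobalEntry`, `build_global_index`) are illustrative assumptions, not names used in the patent, and the overlay network is replaced by a plain in-memory list.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]


@dataclass
class LocalIndex:
    """Clustering result held by one storage node (hypothetical layout)."""
    node_address: str
    # Maps each cluster center to the approximate vectors in its class.
    classes: Dict[Vector, List[Vector]] = field(default_factory=dict)


@dataclass(frozen=True)
class GlobalEntry:
    """One global-index entry: a cluster center and its storage node address."""
    node_address: str
    center: Vector


def build_global_index(local_indexes: List[LocalIndex]) -> List[GlobalEntry]:
    """Step 3: collect every cluster center together with its node address."""
    return [
        GlobalEntry(index.node_address, center)
        for index in local_indexes
        for center in index.classes
    ]
```

Keeping only the centers and node addresses in the global index is what keeps it small: the approximate vectors themselves never leave the storage node.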

A similarity query method in a cloud environment, where the cloud environment adopts the multi-dimensional index structure described above, comprises the following steps:

Step 1: use the VA-File method to quantize and compress the query vector data, obtaining the approximate vector of the query vector data;

Step 2: locate, in the global index, the cluster center with the smallest distance to the approximate vector of the query vector data;

Step 3: determine the class to which the cluster center located in step 2 belongs, and perform a KNN query on that class using the local index;

Step 4: determine whether the result of the KNN query satisfies the requirement. If so, output it as the final query result; if not, locate in the global index the cluster center with the next-smallest distance to the approximate vector of the query vector data, and return to step 3.

Compared with the prior art, the present invention has the following beneficial effects:

1. The present invention introduces quantization compression into the index structure in a cloud environment. On the one hand, this greatly reduces index storage space and resource consumption; on the other hand, it effectively supports multi-dimensional data indexing and similarity queries in a cloud environment.

2. The present invention uses the clustering information obtained by clustering all approximate vectors as the local index. A query on a local index only needs to search the corresponding classes identified through the cluster center information, without scanning all approximate vectors, which improves query efficiency.

Description of drawings

Fig. 1 is a schematic diagram of the multi-dimensional index structure in a cloud environment according to the present invention;

Fig. 2 is an example of quantization compression;

Fig. 3 is an example of clustering approximate vectors: Fig. 3(a) illustrates the clustering principle, and Fig. 3(b) shows the two classes obtained after clustering;

Fig. 4 is a flowchart of the similarity query method in a cloud environment according to the present invention.

Embodiment

The technical solution of the present invention is described in detail below with reference to the accompanying drawings.

The idea of the present invention is to bring the VA-File method from the traditional distributed computing environment into the cloud environment. The raw data on each storage node is quantized and compressed, and the resulting approximate vector sets are clustered separately, with each clustering result serving as a local index. The cluster center information of all local indexes, together with the address of the storage node where each cluster center resides, is published to the whole overlay network through the overlay network interface. When a similarity query is executed, only the class belonging to the cluster center nearest to the approximate vector of the query vector data needs to be searched; there is no need to scan all approximate vectors. This greatly narrows the query scope and improves query efficiency.

To aid understanding, the technical solution is described in detail below using the example of a CAN overlay network organizing the storage nodes in the cloud environment.

As shown in Fig. 1, in a cloud environment organized with CAN, data is stored in a distributed manner across different servers in the data center. Each server plays two roles: storage node and overlay network node. A storage node is a node of the distributed storage system of the cloud data center and stores data and index information. The storage nodes are logically organized according to CAN and jointly manage the global index. A storage node is therefore logically also an overlay network node, corresponding to one zone of the CAN.

First, the multi-dimensional index structure in a cloud environment according to the present invention is constructed, which comprises building the local index of each storage node and building the global index.

Building a local index comprises the following steps:

Step 1) Quantize and compress the original vector data on the storage node. The quantization compression method is the same as that of the VA-File (for details, see [Weber R, Schek H J, Blott S. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. Proceedings of the 24th VLDB Conference. New York, USA. 1998: 194-205]). Let the approximate vector set after quantization compression be $V = \{V_1, \ldots, V_m\}$. Fig. 2 shows an example of quantization compression: the left column of the figure lists the original vector data abstracted from multimedia data, with vector dimension n = 2 and 6 records in total, and the right column lists the approximate vector set obtained by quantizing and compressing these 6 original vectors;
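As a rough illustration of step 1), the sketch below replaces each coordinate by the index of the cell it falls into. It is a simplification under stated assumptions: the cells are of equal width and each dimension uses the same number of bits, whereas the VA-File paper chooses its partition points from the data distribution. The function name `quantize` and its parameters are hypothetical, and the values of Fig. 2 are not reproduced here.

```python
from typing import Sequence, Tuple


def quantize(
    vector: Sequence[float],
    lows: Sequence[float],
    highs: Sequence[float],
    bits: int = 2,
) -> Tuple[int, ...]:
    """Map each coordinate to the index of its cell (equal-width cells assumed)."""
    cells = 1 << bits  # 2**bits cells per dimension
    approx = []
    for x, lo, hi in zip(vector, lows, highs):
        # Relative position of x inside [lo, hi], scaled to the cell count.
        idx = int((x - lo) / (hi - lo) * cells)
        # Clamp so that x == hi lands in the last cell instead of overflowing.
        approx.append(min(max(idx, 0), cells - 1))
    return tuple(approx)


# With 2 bits per dimension on the unit square, each coordinate becomes 0..3.
approx = quantize((0.1, 0.9), lows=(0.0, 0.0), highs=(1.0, 1.0))  # (0, 3)
```

With 2 bits per dimension a 2-dimensional vector is stored in 4 bits, which is where the reduction in index size comes from.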

Step 2) Cluster the approximate vector set $V$. Any existing clustering algorithm can be used in this step; this embodiment uses the most common one, k-means, as follows:

Step 201) Arbitrarily select $k$ approximate vectors from the approximate vector set $V$ as the initial cluster centers: $u_1, u_2, \ldots, u_k \in V$;

Step 202) For the remaining $m-k$ approximate vectors in $V$, compute their distances to each of the $k$ cluster centers. This embodiment uses the Euclidean distance:

$$d_{ij} = \sqrt{\sum_{r=1}^{n} (V_{ir} - u_{jr})^{2}} \tag{1}$$

where $1 \le i \le m-k$, $1 \le j \le k$, $1 \le r \le n$; $d_{ij}$ is the distance between the $i$-th approximate vector and the $j$-th cluster center, $V_{ir}$ is the $r$-th dimension of the $i$-th approximate vector, and $u_{jr}$ is the $r$-th dimension of the $j$-th cluster center;

Step 203) Assign each approximate vector to the class of the cluster center nearest to it, that is, the one most similar to it. For each approximate vector $V_i$, compute the class it belongs to:

$$c_i = \arg\min_{j} d_{ij} \tag{2}$$

In formula (2), $1 \le i \le m-k$ and $1 \le j \le k$; $c_i$ is the class, among the $k$ classes, that is nearest to the approximate vector $V_i$;

Step 204) Update each cluster center $u_j$:

$$u_j = \frac{1}{N_j} \sum_{i=1}^{N_j} V_{ij} \tag{3}$$

where $1 \le i \le N_j$, $N_j$ is the number of approximate vectors in the $j$-th cluster, $u_j$ is the $j$-th cluster center, and $V_{ij}$ is the $i$-th approximate vector in the $j$-th cluster;

Step 205) Repeat steps 202)-203) until the criterion function converges. The mean square deviation is used here as the criterion function, computed as follows:

$$\sigma_j = \sqrt{\frac{1}{N_j} \sum_{i=1}^{N_j} (V_i - u_j)^{2}} \tag{4}$$

where $\sigma_j$ is the mean square deviation of the $j$-th cluster and $N_j$ is the number of approximate vectors in the $j$-th cluster.

The clustering process for the approximate vector set in Fig. 2 is shown in Fig. 3(a), and the example file after clustering is shown in Fig. 3(b).

The clustering result finally obtained is the local index of the storage node concerned.
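The clustering of steps 201) to 205) can be sketched in plain Python as follows. This is a minimal illustration under several assumptions rather than the patented implementation: the first $k$ vectors serve as the "arbitrary" initial centers, every vector (including the initial centers) is assigned in each pass, the loop repeats the update of step 204) along with steps 202) and 203) as standard k-means does, and the parameters `tol` and `max_iter` are hypothetical. The six sample vectors are invented so that the resulting centers match the two quoted in the text for Fig. 3(b); they are not taken from Fig. 2.

```python
import math
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]


def euclidean(a: Vector, b: Vector) -> float:
    """Formula (1): Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(vectors: List[Vector], k: int, tol: float = 1e-9,
           max_iter: int = 100) -> Dict[Vector, List[Vector]]:
    """Return a mapping from each cluster center to the vectors of its class."""
    # Step 201: pick k initial centers (here simply the first k vectors).
    centers = [tuple(float(x) for x in v) for v in vectors[:k]]
    clusters: List[List[Vector]] = [[] for _ in centers]
    prev_sigma = None
    for _ in range(max_iter):
        # Steps 202-203: assign each vector to its nearest center, formula (2).
        clusters = [[] for _ in centers]
        for v in vectors:
            j = min(range(len(centers)), key=lambda c: euclidean(v, centers[c]))
            clusters[j].append(v)
        # Step 204: move each center to the mean of its class, formula (3).
        # An empty class keeps its previous center.
        centers = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        # Step 205: mean square deviation per class, formula (4).
        sigma = [
            math.sqrt(sum(euclidean(v, c) ** 2 for v in members) / len(members))
            if members else 0.0
            for c, members in zip(centers, clusters)
        ]
        if prev_sigma is not None and all(
            abs(a - b) <= tol for a, b in zip(sigma, prev_sigma)
        ):
            break
        prev_sigma = sigma
    return dict(zip(centers, clusters))


# Hypothetical approximate vectors (n = 2, six records).
data = [(0, 2), (3, 0), (1, 3), (1, 2), (1, 3), (3, 1)]
local_index = kmeans(data, k=2)
# Centers found: (0.75, 2.5) with four members and (3.0, 0.5) with two.
```

The returned dictionary plays the role of the local index: the keys are the cluster centers that will later be published, and the values are the classes searched at query time.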

After the local indexes of all storage nodes have been built, each cluster center and the address of the storage node where it resides are published to the whole overlay network through the CAN interface, forming the global index.

In this embodiment, the global index is published as follows. For each storage node, following the node mapping algorithm of CAN, the key $u_j$ of each cluster center entry $(ip, u_j)$ on that node is mapped via the DHT to a point $P$ in the virtual coordinate space, and $(ip, u_j)$ is then stored on the CAN node that owns the zone containing $P$. Here $ip$ is the IP address of the storage node holding the local index, and $u_j$ is the $j$-th cluster center on that storage node. Because the clustering information of every storage node is published, any vector in a local index can be located through the global index.

Taking the two clusters shown in Fig. 3(b) as an example, the global index is published as follows: the first cluster center is <0.75, 2.5> and the second is <3.0, 0.5>, so the entries (ip1, <0.75, 2.5>) and (ip1, <3.0, 0.5>) are published to the global index nodes, where ip1 is the IP address of the storage node holding this data.
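The publication step can be imitated with a toy, single-process model of CAN. Everything specific in the sketch is an assumption made for illustration: the virtual coordinate space is the 2-dimensional unit square, the key is hashed with SHA-1 (the text does not say which hash function the DHT uses), and the four zones and node names are invented. Real CAN routing between neighbouring zones is not modelled.

```python
import hashlib
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]
Point = Tuple[float, float]
Zone = Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max


def key_to_point(center: Vector) -> Point:
    """Hash a cluster center to a point P in the unit square (assumed mapping)."""
    digest = hashlib.sha1(repr(center).encode("utf-8")).digest()
    x = int.from_bytes(digest[:4], "big") / 2**32
    y = int.from_bytes(digest[4:8], "big") / 2**32
    return (x, y)


def owner(point: Point, zones: Dict[str, Zone]) -> str:
    """Return the CAN node whose zone contains the point."""
    x, y = point
    for node, (x0, y0, x1, y1) in zones.items():
        if x0 <= x < x1 and y0 <= y < y1:
            return node
    raise ValueError("point lies outside every zone")


def publish(ip: str, centers: List[Vector], zones: Dict[str, Zone],
            store: Dict[str, List[Tuple[str, Vector]]]) -> None:
    """Store each entry (ip, u_j) on the node that owns the zone of P."""
    for u in centers:
        store.setdefault(owner(key_to_point(u), zones), []).append((ip, u))


# Four hypothetical CAN nodes, each owning one quadrant of the unit square.
zones = {
    "nodeA": (0.0, 0.0, 0.5, 0.5), "nodeB": (0.5, 0.0, 1.0, 0.5),
    "nodeC": (0.0, 0.5, 0.5, 1.0), "nodeD": (0.5, 0.5, 1.0, 1.0),
}
store: Dict[str, List[Tuple[str, Vector]]] = {}
publish("ip1", [(0.75, 2.5), (3.0, 0.5)], zones, store)
```

After the call, `store` holds the two entries from the Fig. 3(b) example, spread over whichever zones their hashed keys fall into.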

Once the index structure has been constructed, similarity queries can be executed on it. Given a user query condition <key, K>, that is, a request for the K data items most similar to the vector data key, the similarity query method of the present invention is shown in Fig. 4 and proceeds as follows:

Step 1) Quantize and compress the vector data key to be queried, obtaining its approximate vector $V'$;

Step 2) Query the global index nodes through the CAN routing mechanism, compute the distance $d$ from the approximate vector $V'$ to each cluster center in the global index $C(ip, u)$, select the cluster center $u_j$ with the smallest distance, and return the corresponding entry $C_i(ip, u_j)$;

Step 3) Using the $ip$ address and $u_j$ returned in step 2), locate the corresponding storage node and class through the local index, and perform a KNN (K-Nearest Neighbor) query within the class to which the cluster center $u_j$ belongs;

Step 4) If the number of data items $K'$ returned in step 3) is less than $K$, meaning that fewer data items have been found than were requested, update $C = C - C_i$ and $K = K - K'$ and jump to step 2) to continue the query; otherwise, the query ends.
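The query loop of steps 2) to 4) can be sketched as below, assuming the query has already been quantized (step 1). The sketch is illustrative only: the global index is a plain list of (ip, center) pairs instead of entries reached through CAN routing, local indexes are nested dictionaries, and neighbours are ranked by distance between approximate vectors without any refinement against the original data. The function name `similarity_query` and the sample data are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]
Entry = Tuple[str, Vector]  # (ip, cluster center)


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity_query(
    approx_key: Vector,
    k: int,
    global_index: List[Entry],
    local_indexes: Dict[str, Dict[Vector, List[Vector]]],
) -> List[Vector]:
    """Return up to k approximate vectors nearest to approx_key."""
    candidates = list(global_index)  # the set C of unvisited centers
    results: List[Vector] = []
    remaining = k
    while remaining > 0 and candidates:
        # Step 2: pick the unvisited center closest to the query.
        entry = min(candidates, key=lambda e: euclidean(approx_key, e[1]))
        ip, center = entry
        # Step 3: KNN inside the class of that center on the owning node.
        members = local_indexes[ip][center]
        nearest = sorted(members, key=lambda v: euclidean(approx_key, v))[:remaining]
        results.extend(nearest)
        # Step 4: C = C - C_i and K = K - K'; loop again if still short.
        candidates.remove(entry)
        remaining -= len(nearest)
    return results


# Hypothetical index matching the two centers quoted for Fig. 3(b).
local_indexes = {
    "ip1": {
        (0.75, 2.5): [(0, 2), (1, 3), (1, 2), (1, 3)],
        (3.0, 0.5): [(3, 0), (3, 1)],
    }
}
global_index = [("ip1", (0.75, 2.5)), ("ip1", (3.0, 0.5))]
top2 = similarity_query((1, 2), 2, global_index, local_indexes)
top5 = similarity_query((1, 2), 5, global_index, local_indexes)
```

For `top2` a single class suffices, so only one center is visited. For `top5` the first class yields four vectors, and the loop moves on to the next-nearest center for the fifth.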

Claims

1. A multi-dimensional index structure in a cloud environment, comprising a global index and local indexes located on the respective storage nodes, the cloud environment using an overlay network to organize the storage nodes, characterized in that each local index is the clustering result obtained by clustering the approximate vectors of all vector data on the storage node where it resides, and the global index consists of the cluster center information of all local indexes, published to the whole overlay network together with the address of the storage node where each cluster center resides.

2. A method for constructing the multi-dimensional index structure in a cloud environment according to claim 1, characterized in that it comprises the following steps:

3. The method for constructing the multi-dimensional index structure in a cloud environment according to claim 2, characterized in that the clustering uses the k-means clustering method.

4. A similarity query method in a cloud environment, the cloud environment adopting the multi-dimensional index structure according to claim 1, characterized in that it comprises the following steps: