CN102831225A - Multi-dimensional index structure under cloud environment, construction method thereof and similarity query method - Google Patents

Publication number
CN102831225A
CN102831225A CN2012103076075A CN201210307607A CN102831225A CN 102831225 A CN102831225 A CN 102831225A CN 2012103076075 A CN2012103076075 A CN 2012103076075A CN 201210307607 A CN201210307607 A CN 201210307607A CN 102831225 A CN102831225 A CN 102831225A
Authority
CN
China
Prior art keywords
storage node
cloud environment
cluster
approximate
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2012103076075A
Other languages
Chinese (zh)
Inventor
程春玲
孙春菊
张登银
徐小龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post and Telecommunication University
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing Post and Telecommunication University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2012-08-27
Filing date
2012-08-27
Publication date
2012-12-19
Application filed by Nanjing Post and Telecommunication University
Priority to CN2012103076075A
Publication of CN102831225A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-dimensional index structure under a cloud environment, a method for constructing it and a similarity query method. The index structure comprises a global index and local indexes located on the respective storage nodes, and the cloud environment uses an overlay network to organize the storage nodes. Each local index is the clustering result obtained by clustering the approximate vectors of all vector data in the storage node where it resides. The global index consists of the cluster center information of all local indexes, together with the addresses of the storage nodes where the cluster centers reside, published to the entire overlay network. The index structure reduces index storage space and resource consumption and effectively supports multi-dimensional data indexing and similarity queries in a cloud environment. Because the clustering information obtained by clustering all approximate vectors serves as the local index, a query on a local index only searches the corresponding classes, identified through the cluster center information, without scanning all approximate vectors, which improves query efficiency.

Description

Multi-dimensional index structure under a cloud environment, construction method thereof and similarity query method
Technical field
The present invention relates to a multi-dimensional indexing method for cloud computing environments, and in particular to a multi-dimensional index structure that supports similarity queries in a cloud environment and a method for constructing it. The invention belongs to the technical field of computer information retrieval.
Background
With the growing popularity of the Internet and the rapid development of information technology, the volume of Internet data has expanded sharply, and storing and managing massive data has become a pressing challenge. Cloud computing emerged in response, bringing new service models to users and enterprises. Several cloud computing systems have already been deployed successfully, such as Amazon's Elastic Compute Cloud (EC2), IBM's Blue Cloud and Google's cloud computing platform. These systems comprise a large number of computer nodes, store massive amounts of data, and support applications such as large-scale data processing and data retrieval.
In cloud computing systems, data storage mostly relies on an underlying distributed file system (DFS) to manage data, and massive data is processed in a key-value manner. For example, in Google's YouTube, videos are stored as key-value pairs: the key is a unique video id, and the value contains the video name, upload time, view count and so on. This storage model is well suited to lookups by key, but it cannot efficiently execute complex queries such as range queries or similarity queries, which makes it hard to support the personalized, on-demand retrieval expected of cloud computing. To address this problem, some publications propose building tree indexes to support range queries over data of different dimensionality.
Existing indexing methods for cloud environments can be divided, according to the dimensionality of the data they target, into one-dimensional indexes and multi-dimensional indexes. For one-dimensional data, a one-dimensional index such as a B+ tree index or a hash index can be built. These methods all use a two-level index scheme: a local index is built on each physical node that actually stores data, and a global index is then built on the server side. When a user issues a query, the server-side global index locates the relevant local index, and the query is then executed against the local data. This significantly reduces query time, but these methods cannot handle multi-dimensional data efficiently. Index structures for multi-dimensional data also use two-level indexing; they mainly build a tree index on each local node and use a structured overlay network to organize the computing nodes.
The shortcoming of the above indexing methods is that they cannot support similarity queries efficiently. As the data dimensionality increases, the query performance of tree index structures degrades rapidly; this is the so-called "curse of dimensionality".
In traditional distributed computing, Weber R. et al. proposed quantizing and compressing high-dimensional data to improve query efficiency, and implemented the VA-File (Vector Approximation File). The VA-File has two notable advantages: (1) the index file is much smaller than the original file, so the disk I/O cost of a sequential scan is greatly reduced; (2) the computational complexity is reduced. Its shortcoming is that every query must first scan all of the quantized approximate vectors, which is relatively costly when the data volume is large. To address the curse of dimensionality, Dong Daoguo et al. proposed a new index structure, the VAR-Tree, which combines the VA-File with the R-Tree: the approximation data in the VA-File is managed and organized by an R-Tree, and queries are answered on the VAR-Tree using the similarity search algorithms already proposed for the R-Tree family, which improves retrieval performance. A Chinese invention patent (application number 03129687.4, granted publication number CN1477563A) discloses Ordered VA-File, a fast similarity search method for high-dimensional vector data. It sorts and reorganizes the approximate vectors in the VA-File so that data clustered together in the high-dimensional space is stored at adjacent positions in the file wherever possible, and it adaptively divides the Ordered VA-File into a number of classes according to the application, with the data of each class stored contiguously in the file. At query time, only the few classes closest to the query vector are selected for query processing, which improves query efficiency. However, that method is computationally expensive, and its effectiveness depends on the number of classes selected at query time, for which no concrete calculation is given.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the inability of existing cloud-environment index structures to support similarity queries efficiently, by providing a multi-dimensional index structure for cloud environments and a method for constructing it, so that efficient similarity queries can be performed in a cloud environment.
The present invention adopts the following technical solution to solve the above technical problem:
A multi-dimensional index structure under a cloud environment comprises a global index and local indexes located on the respective storage nodes. The cloud environment uses an overlay network to organize the storage nodes. Each local index is the clustering result obtained by clustering the approximate vectors of all vector data in the storage node where it resides. The global index consists of the cluster center information of all local indexes, together with the addresses of the storage nodes where the cluster centers reside, published to the entire overlay network.
The method for constructing the multi-dimensional index structure described above comprises the following steps:
Step 1: use the VA-File method to quantize and compress the original vector data stored in each storage node, obtaining the set of approximate vectors of each storage node;
Step 2: cluster the set of approximate vectors of each storage node separately; the clustering result of each storage node is the local index of that storage node;
Step 3: extract the cluster center information from all local indexes, and publish it to the entire overlay network together with the address of the storage node where each cluster center resides, forming the global index.
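Step 1 of the construction method can be illustrated with a minimal Python sketch of the quantization idea behind VA-File: each coordinate is replaced by the index of the grid cell it falls in. The function name `quantize`, the parameters `bits`, `lo` and `hi`, and the use of equal-width cells are assumptions made for illustration; they are not taken from the patent.

```python
from typing import List, Sequence, Tuple


def quantize(vectors: Sequence[Sequence[float]], bits: int,
             lo: float, hi: float) -> List[Tuple[int, ...]]:
    """Replace every coordinate by the index of the grid cell it falls in.

    Assumption of this sketch: each dimension spans [lo, hi] and is cut
    into 2**bits equal-width cells.
    """
    cells = 1 << bits                      # number of cells per dimension
    width = (hi - lo) / cells              # width of one cell
    approx = []
    for vec in vectors:
        # Clamp so that values on or beyond the bounds stay in a valid cell.
        approx.append(tuple(
            min(cells - 1, max(0, int((x - lo) / width))) for x in vec))
    return approx


if __name__ == "__main__":
    # Hypothetical 2-dimensional records, 2 bits per dimension.
    print(quantize([(0.1, 0.9), (0.5, 0.26)], bits=2, lo=0.0, hi=1.0))
```

With `bits` bits per dimension, an approximate vector needs only `bits` times the number of dimensions in storage, which is where the reduction in index size comes from.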
A similarity query method under a cloud environment, the cloud environment adopting the multi-dimensional index structure described above, comprises the following steps:
Step 1: use the VA-File method to quantize and compress the query vector data, obtaining the approximate vector of the query vector;
Step 2: locate, in the global index, the cluster center with the smallest distance to the approximate vector of the query vector;
Step 3: determine the class to which the cluster center located in step 2 belongs, and perform a KNN query on that class according to the local index;
Step 4: judge whether the result of the KNN query meets the requirement; if so, output it as the final query result; if not, locate in the global index the cluster center with the next smallest distance to the approximate vector of the query vector, and go to step 3.
Compared with the prior art, the present invention has the following beneficial effects:
1. The present invention introduces quantization and compression into the index structure under a cloud environment. On the one hand, this greatly reduces the index storage space and the resource consumption; on the other hand, it effectively supports multi-dimensional data indexing and similarity queries in a cloud environment.
2. The present invention uses the clustering information obtained by clustering all approximate vectors as the local index. When a local index is queried, only the corresponding class needs to be searched, using the cluster center information, and not all approximate vectors need to be scanned, which improves query efficiency.
Description of the drawings
Fig. 1 is a schematic diagram of the multi-dimensional index structure under a cloud environment according to the present invention;
Fig. 2 is an example of quantization and compression;
Fig. 3 is an example of clustering approximate vectors: Fig. 3(a) is a schematic diagram of the clustering principle, and Fig. 3(b) shows the two classes obtained after clustering;
Fig. 4 is a flow chart of the similarity query method under a cloud environment according to the present invention.
Embodiment
The technical solution of the present invention is described in detail below with reference to the accompanying drawings:
The idea of the present invention is to bring the VA-File method of traditional distributed computing into the cloud environment. The raw data in each storage node is quantized and compressed, the resulting set of approximate vectors is clustered on each node, and the clustering result serves as the local index. The cluster center information of all local indexes, together with the address of the storage node where each cluster center resides, is published to the entire overlay network through the overlay network interface. In this way, a similarity query only needs to search the class of the cluster center nearest to the approximate vector of the query vector, and does not need to scan all approximate vectors, which greatly narrows the query scope and improves query efficiency.
To aid understanding, the technical solution of the present invention is described in detail below using a CAN (Content Addressable Network) overlay network to organize the storage nodes of the cloud environment as an example.
As shown in Fig. 1, in a cloud environment organized with CAN, the data is stored in a distributed manner on different servers of the data center. Each server plays two roles: storage node and overlay network node. A storage node is a node in the distributed storage system of the cloud data center and is used to store data and index information. The storage nodes are logically organized according to CAN and jointly manage the global index. A storage node is therefore logically also an overlay network node and corresponds to one zone of the CAN.
First, the multi-dimensional index structure under a cloud environment according to the present invention is constructed, which comprises establishing the local index of each storage node and establishing the global index.
The local index is established through the following steps:
Step 1) Quantize and compress the original vector data in the storage node. The quantization and compression method is the same as in VA-File (for details see [Weber R, Schek H J, Blott S. A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. Proceedings of the 24th VLDB Conference. New York, USA. 1998: 194-205]). Let the set of approximate vectors after quantization and compression be $V = \{V_1, \ldots, V_m\}$. Fig. 2 shows an example of quantization and compression: the left column of the figure lists the original vector data abstracted from multimedia data, with dimension n = 2 and 6 records in total, and the right column lists the set of approximate vectors obtained by quantizing and compressing these 6 original vectors;
Step 2) Cluster the set of approximate vectors V. Any existing clustering algorithm can be used in this step; this embodiment uses the widely used k-means clustering algorithm, which proceeds as follows:
Step 201) Arbitrarily select k approximate vectors from the set V as the initial cluster centers, denoted $u_1, u_2, \ldots, u_k \in V$;
Step 202) For the remaining m-k approximate vectors in V, compute the distance from each of them to each of the k cluster centers. This embodiment uses the Euclidean distance:
$$d_{ij} = \sqrt{\sum_{r=1}^{n} (V_{ir} - u_{jr})^2} \qquad (1)$$
where $1 \le i \le m-k$, $1 \le j \le k$, $1 \le r \le n$; $d_{ij}$ denotes the distance between the i-th approximate vector and the j-th cluster center, $V_{ir}$ denotes the r-th component of the i-th approximate vector, and $u_{jr}$ denotes the r-th component of the j-th cluster center;
Step 203) Assign each approximate vector to the class of the cluster center nearest to it, that is, the cluster center most similar to it. For each approximate vector $V_i$, compute the class it should belong to:
$$c_i = \arg\min_{j} \, d_{ij} \qquad (2)$$
In formula (2), $1 \le i \le m-k$, $1 \le j \le k$, and $c_i$ denotes the class, among the k classes, that is nearest to the approximate vector $V_i$;
Step 204) Update each cluster center $u_j$:
$$u_j = \frac{1}{N_j} \sum_{i=1}^{N_j} V_{ij} \qquad (3)$$
where $1 \le i \le N_j$, $N_j$ is the number of approximate vectors in the j-th cluster, $u_j$ denotes the j-th cluster center, and $V_{ij}$ is the i-th approximate vector in the j-th cluster;
Step 205) Repeat steps 202) to 204) until the criterion function converges. The mean square deviation is used as the criterion function, computed as follows:
$$\sigma_j = \sqrt{\frac{1}{N_j} \sum_{i=1}^{N_j} (V_i - u_j)^2} \qquad (4)$$
where $\sigma_j$ denotes the mean square deviation of the j-th cluster, $V_i$ ranges over the approximate vectors in the j-th cluster, and $N_j$ is the number of approximate vectors in the j-th cluster.
The clustering process for the set of approximate vectors in Fig. 2 is shown in Fig. 3(a), and the classes obtained after clustering are shown in Fig. 3(b).
The clustering result finally obtained is the local index of the storage node where it resides.
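The clustering steps 201) to 205) can be sketched in Python as follows. This is an illustrative sketch, not the patent's implementation: the names `kmeans` and `euclidean`, the choice of the first k vectors as initial centers, the reassignment of all vectors (including the initial centers) on every pass, and the summing of the per-cluster deviations of formula (4) into a single convergence value are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Formula (1): Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(approx: Sequence[Sequence[float]], k: int,
           max_iter: int = 100, tol: float = 1e-9):
    """Cluster approximate vectors; returns (centers, clusters)."""
    if not 0 < k <= len(approx):
        raise ValueError("k must be between 1 and the number of vectors")
    # Step 201: initial centers (assumption: simply the first k vectors).
    centers: List[Vector] = [tuple(float(x) for x in v) for v in approx[:k]]
    clusters: List[list] = [[] for _ in range(k)]
    previous = None
    for _ in range(max_iter):
        # Steps 202-203: assign each vector to its nearest center, formula (2).
        clusters = [[] for _ in range(k)]
        for v in approx:
            nearest = min(range(k), key=lambda j: euclidean(v, centers[j]))
            clusters[nearest].append(v)
        # Step 204: move each center to the mean of its members, formula (3).
        for j, members in enumerate(clusters):
            if members:  # an empty cluster keeps its old center
                centers[j] = tuple(
                    sum(col) / len(members) for col in zip(*members))
        # Step 205: formula (4) per cluster, summed into one value.
        sigma = sum(
            math.sqrt(sum(euclidean(v, centers[j]) ** 2 for v in members)
                      / len(members))
            for j, members in enumerate(clusters) if members)
        if previous is not None and abs(previous - sigma) <= tol:
            break
        previous = sigma
    return centers, clusters
```

The returned `centers` are what the global index later publishes, while `clusters` plays the role of the local index of the storage node.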
After the local indexes of all storage nodes have been established, each cluster center and the address of the storage node where it resides are published to the entire overlay network through the CAN interface, forming the global index.
In this embodiment, the global index is published as follows. For each storage node, following the node mapping algorithm of CAN, the key $u_j$ of each cluster center entry $(ip, u_j)$ on that node is mapped via the DHT to a point P in the virtual coordinate space, and $(ip, u_j)$ is then stored on the CAN node that owns the zone containing P, where ip is the IP address of the storage node holding the local index and $u_j$ is the j-th cluster center on that storage node. Because the clustering information of every storage node is published, any vector in any local index can be located through the global index.
Taking the two clusters shown in Fig. 3(b) as an example, the global index is published as follows: the first cluster center is <0.75, 2.5> and the second cluster center is <3.0, 0.5>, so the entries (ip1, <0.75, 2.5>) and (ip1, <3.0, 0.5>) are published to the global index nodes, where ip1 is the IP address of the storage node holding this data.
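The publication of the two example entries can be sketched with a toy model of a CAN coordinate space. Everything here is hypothetical and chosen only for illustration: the `Zone` class, the `publish` function, the four-zone layout in the usage note, and the assumption that the point P is the cluster center itself. A real CAN derives P with its own mapping function and reaches the zone owner by routing between neighbouring nodes, not by scanning a list.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


@dataclass
class Zone:
    """One CAN zone: an axis-aligned box of the virtual coordinate space."""
    owner: str                 # hypothetical name of the overlay node
    lo: Point                  # inclusive lower corner
    hi: Point                  # exclusive upper corner
    entries: List[Tuple[str, Point]] = field(default_factory=list)

    def contains(self, p: Point) -> bool:
        return all(l <= x < h for l, x, h in zip(self.lo, p, self.hi))


def publish(zones: Sequence[Zone], ip: str, center: Point) -> Zone:
    """Store the entry (ip, center) on the node whose zone contains P.

    Assumption: P is taken to be the cluster center itself.
    """
    for zone in zones:
        if zone.contains(center):
            zone.entries.append((ip, center))
            return zone
    raise ValueError("point lies outside the virtual coordinate space")
```

For instance, if the space [0, 4) x [0, 4) is split into four equal zones owned by nodes A (lower left), B (lower right), C (upper left) and D (upper right), then `publish(zones, "ip1", (0.75, 2.5))` stores the first entry on C and `publish(zones, "ip1", (3.0, 0.5))` stores the second on B.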
Once the index structure has been constructed, similarity queries can be carried out on it. Given a user query condition <key, K>, that is, a query for the K data items most similar to the vector data key, the similarity query method of the present invention, shown in Fig. 4, proceeds as follows:
Step 1) Quantize and compress the query vector key to obtain its approximate vector V';
Step 2) Query the global index nodes through the CAN routing mechanism, compute the distance d from the approximate vector V' to each cluster center in the global index C(ip, u), select the cluster center $u_j$ with the smallest distance, and return the corresponding entry $C_i(ip, u_j)$;
Step 3) Using the ip address and $u_j$ returned in step 2), locate the corresponding storage node and class through the local index, and perform a KNN (K-Nearest Neighbor) query inside the class to which the cluster center $u_j$ belongs;
Step 4) If the number of data items K' returned in step 3) is less than K, meaning that fewer items have been found than were requested, update $C = C - C_i$ and $K = K - K'$, and jump to step 2) to continue the query; otherwise, the query ends.
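The query loop of steps 2) to 4) can be sketched in Python as follows, assuming the query vector has already been quantized in step 1). The names `similarity_query`, `global_index` and `local_indexes`, and the in-memory dictionaries that stand in for the overlay network and the storage nodes, are hypothetical and serve only to make the control flow concrete.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Tuple[float, ...]
Entry = Tuple[str, Vector]                      # (ip, cluster center)


def dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance, as in formula (1)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity_query(q_approx: Vector, k: int,
                     global_index: Sequence[Entry],
                     local_indexes: Dict[str, Dict[Vector, List[Vector]]]
                     ) -> List[Vector]:
    """Return up to k approximate vectors, visiting the nearest classes first."""
    remaining = list(global_index)              # the set C of step 2
    results: List[Vector] = []
    while k > 0 and remaining:
        # Step 2: cluster center closest to the approximate query vector.
        entry = min(remaining, key=lambda e: dist(q_approx, e[1]))
        ip, center = entry
        # Step 3: KNN inside the class of that center on node ip.
        members = local_indexes[ip][center]
        nearest = sorted(members, key=lambda v: dist(q_approx, v))[:k]
        results.extend(nearest)
        # Step 4: C = C - C_i and K = K - K'; loop again if K is still > 0.
        remaining.remove(entry)
        k -= len(nearest)
    return results
```

Because classes are visited in order of center distance, the result is ordered class by class, not by global distance; the sketch returns approximate vectors and leaves any refinement against the original data outside its scope.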

Claims (4)

1. A multi-dimensional index structure under a cloud environment, comprising a global index and local indexes located on the respective storage nodes, the cloud environment using an overlay network to organize the storage nodes, characterized in that each local index is the clustering result obtained by clustering the approximate vectors of all vector data in the storage node where it resides, and the global index consists of the cluster center information of all local indexes, together with the addresses of the storage nodes where the cluster centers reside, published to the entire overlay network.
2. A method for constructing the multi-dimensional index structure under a cloud environment according to claim 1, characterized by comprising the following steps:
Step 1: use the VA-File method to quantize and compress the original vector data stored in each storage node, obtaining the set of approximate vectors of each storage node;
Step 2: cluster the set of approximate vectors of each storage node separately; the clustering result of each storage node is the local index of that storage node;
Step 3: extract the cluster center information from all local indexes, and publish it to the entire overlay network together with the address of the storage node where each cluster center resides, forming the global index.
3. The method for constructing the multi-dimensional index structure under a cloud environment according to claim 2, characterized in that the clustering uses the k-means clustering method.
4. A similarity query method under a cloud environment, the cloud environment adopting the multi-dimensional index structure according to claim 1, characterized by comprising the following steps:
Step 1: use the VA-File method to quantize and compress the query vector data, obtaining the approximate vector of the query vector;
Step 2: locate, in the global index, the cluster center with the smallest distance to the approximate vector of the query vector;
Step 3: determine the class to which the cluster center located in step 2 belongs, and perform a KNN query on that class according to the local index;
Step 4: judge whether the result of the KNN query meets the requirement; if so, output it as the final query result; if not, locate in the global index the cluster center with the next smallest distance to the approximate vector of the query vector, and go to step 3.
CN2012103076075A 2012-08-27 2012-08-27 Multi-dimensional index structure under cloud environment, construction method thereof and similarity query method Pending CN102831225A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2012103076075A CN102831225A (en) 2012-08-27 2012-08-27 Multi-dimensional index structure under cloud environment, construction method thereof and similarity query method

Publications (1)

Publication Number Publication Date
CN102831225A (en) 2012-12-19

Family

ID=47334360

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2012103076075A Pending CN102831225A (en) 2012-08-27 2012-08-27 Multi-dimensional index structure under cloud environment, construction method thereof and similarity query method

Country Status (1)

Country Link
CN (1) CN102831225A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1477563A (en) * 2003-07-03 2004-02-25 复旦大学 High-dimensional vector data quick similar search method
CN102063486A (en) * 2010-12-28 2011-05-18 东北大学 Multi-dimensional data management-oriented cloud computing query processing method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIANGYU ZHANG ETC: "An Efficient Multi-Dimensional Index for Cloud Data Management", 《IN PROCEEDINGS OF THE CIKM WORKSHOP ON CLOUD DATA MANAGEMENT(CLOUDDB2009)》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103914483B (en) * 2013-01-07 2018-09-25 深圳市腾讯计算机系统有限公司 File memory method, device and file reading, device
CN103914483A (en) * 2013-01-07 2014-07-09 深圳市腾讯计算机系统有限公司 File storage method and device and file reading method and device
CN103235825B (en) * 2013-05-08 2016-05-25 重庆大学 A kind of magnanimity face recognition search engine design method based on Hadoop cloud computing framework
CN103235825A (en) * 2013-05-08 2013-08-07 重庆大学 Method used for designing large-quantity face recognition search engine and based on Hadoop cloud computing frame
CN105550332A (en) * 2015-12-21 2016-05-04 河海大学 Dual-layer index structure based origin graph query method
CN105550332B (en) * 2015-12-21 2019-03-29 河海大学 A kind of provenance graph querying method based on the double-deck index structure
CN108090182A (en) * 2017-12-15 2018-05-29 清华大学 A kind of distributed index method and system of extensive high dimensional data
CN108090182B (en) * 2017-12-15 2018-10-30 清华大学 A kind of distributed index method and system of extensive high dimensional data
CN108241745A (en) * 2018-01-08 2018-07-03 阿里巴巴集团控股有限公司 The processing method and processing device of sample set, the querying method of sample and device
WO2019134567A1 (en) * 2018-01-08 2019-07-11 阿里巴巴集团控股有限公司 Sample set processing method and apparatus, and sample querying method and apparatus
CN108241745B (en) * 2018-01-08 2020-04-28 阿里巴巴集团控股有限公司 Sample set processing method and device and sample query method and device
TWI696081B (en) * 2018-01-08 2020-06-11 香港商阿里巴巴集團服務有限公司 Sample set processing method and device, sample query method and device
US10896164B2 (en) 2018-01-08 2021-01-19 Advanced New Technologies Co., Ltd. Sample set processing method and apparatus, and sample querying method and apparatus

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20121219