CN108090182B - Distributed indexing method and system for large-scale high-dimensional data - Google Patents

Info

Publication number
CN108090182B
Authority
CN
China
Prior art keywords
data
high dimensional
keyword
dimension
dimensional data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711349831.XA
Other languages
Chinese (zh)
Other versions
CN108090182A (en)
Inventor
王建民
龙明盛
文庆福
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201711349831.XA priority Critical patent/CN108090182B/en
Publication of CN108090182A publication Critical patent/CN108090182A/en
Publication of CN108090182B publication Critical patent/CN108090182B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • G06F16/319Inverted lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a distributed indexing method and system for large-scale high-dimensional data. All high-dimensional data are stored in a distributed manner on a cluster, and each high-dimensional data item is divided into multiple low-dimensional data items, so that each subspace stores one low-dimensional part of every high-dimensional data item. A distributed clustering algorithm yields multiple cluster centers for the low-dimensional data of each subspace; the cluster centers of the subspaces are combined to obtain multiple multi-dimensional keywords covering all high-dimensional data, and the high-dimensional data contained in each multi-dimensional keyword are computed, which gives a distributed index over all data. At query time, the multi-dimensional keywords matching the query data are found first, and the high-dimensional data contained in each of those keywords are then searched. By combining distributed clustering, distributed querying and an inverted index over multiple subspaces, the invention improves retrieval efficiency while preserving retrieval accuracy, and can be applied to the retrieval of large-scale distributed data.

Description

Distributed indexing method and system for large-scale high-dimensional data
Technical field
The present invention relates to the technical field of big data retrieval, and more particularly to a distributed indexing method and system for large-scale high-dimensional data.
Background
With the rapid development of information technology, unstructured data such as text, images, video and audio is growing exponentially. How to obtain the information a user wants quickly and accurately from massive Internet data is an important technical problem in the management and retrieval of unstructured big data. The text and image search services provided by Internet companies such as Google and Baidu have made it far more convenient for people to obtain information, and behind these services lies approximate nearest neighbor search. For large-scale high-dimensional data, exact nearest neighbor search consumes a large amount of storage and computing resources, takes too long per query and gives the indexing system too low a throughput, so its practical value is low. Approximate nearest neighbor search greatly shortens query time and reduces storage and computing overhead while keeping the results close to those of an exact search, and is therefore more practical. Besides information retrieval, similarity search is widely used in fields such as machine learning, data mining and multimedia management.
As data volumes keep growing, more and more applications are built on large-scale data stored in distributed systems, for example Internet text, image and video retrieval. Many existing indexing algorithms are implemented for a single machine, and in a distributed environment most existing indexing methods require all data to be migrated to one machine for centralized processing. This runs against the distributed storage of the data, incurs a very high data migration cost and places very high demands on the data processing performance of a single machine.
Summary of the invention
The present invention provides a distributed indexing method and system for large-scale high-dimensional data that overcomes the above problems or at least partially solves them.
According to one aspect of the present invention, a distributed indexing method for large-scale high-dimensional data is provided, comprising:
dividing each high-dimensional data item stored on each child node into m low-dimensional data items, and storing the low-dimensional data items of each high-dimensional data item in the corresponding m subspaces, where m is an integer greater than or equal to 2;
for the low-dimensional data of the same subspace on all child nodes, obtaining K cluster centers of each subspace using a distributed clustering algorithm, taking each cluster center as a one-dimensional keyword, and combining the K keywords of every subspace to obtain K^m m-dimensional keywords, where K is a positive integer;
computing, on each child node, the cluster center to which the low-dimensional data in each subspace of that child node belongs, thereby obtaining the m-dimensional keyword to which each high-dimensional data item belongs, and hence all high-dimensional data items contained in each m-dimensional keyword;
determining, among the K^m m-dimensional keywords, the multiple m-dimensional keywords that match the query data;
according to all high-dimensional data items contained in each m-dimensional keyword, determining, among the high-dimensional data items of each matching m-dimensional keyword, those that match the query data, and taking all high-dimensional data items found to match the query data as the query result.
On the basis of the above technical solution, the present invention can be further improved as follows.
Further, the method also comprises:
storing all high-dimensional data in a computer cluster composed of multiple child nodes, and selecting one of the child nodes as the master node of all child nodes.
Further, dividing each high-dimensional data item stored on each child node into m low-dimensional data items and storing the low-dimensional data items of each high-dimensional data item in the corresponding m subspaces specifically comprises:
on each child node, dividing each P-dimensional data item on that node along its dimensions into m data items of P/m dimensions, and storing each P/m-dimensional data item of each P-dimensional data item in the corresponding subspace, where P/m is an integer and the number of subspaces is m.
Further, obtaining the K cluster centers of each subspace using a distributed clustering algorithm for the low-dimensional data of the same subspace on all child nodes, taking each cluster center as a one-dimensional keyword, and combining the K keywords of every subspace to obtain K^m m-dimensional keywords specifically comprises:
clustering the P/m-dimensional data of the i-th subspace on all child nodes with a distributed K-Means clustering algorithm to obtain the K cluster centers of each subspace, denoted respectively:
U_1 = [u_{11}, u_{12}, ..., u_{1K}]
U_2 = [u_{21}, u_{22}, ..., u_{2K}]
...
U_i = [u_{i1}, u_{i2}, ..., u_{iK}]
...
U_m = [u_{m1}, u_{m2}, ..., u_{mK}]
where i = 1, 2, ..., m indexes the subspaces and u_{ik} denotes the k-th cluster center of the i-th subspace, k = 1, 2, ..., K;
taking each cluster center as a keyword and combining the K cluster centers of the m subspaces on the master node to obtain K^m m-dimensional keywords, each denoted U = [u_{1x}, u_{2y}, ..., u_{mw}], where x, y and w are integers with 0 < x, y, w ≤ K, U denotes an m-dimensional keyword, x indicates the x-th cluster center chosen from U_1 = [u_{11}, u_{12}, ..., u_{1K}], y indicates the y-th cluster center chosen from U_2 = [u_{21}, u_{22}, ..., u_{2K}], and w indicates the w-th cluster center chosen from U_m = [u_{m1}, u_{m2}, ..., u_{mK}].
Further, computing on each child node the cluster center to which the low-dimensional data in each subspace of that child node belongs, thereby obtaining the m-dimensional keyword to which each high-dimensional data item belongs and hence all high-dimensional data items contained in each m-dimensional keyword, specifically comprises:
on each child node, computing the Euclidean distance between each P/m-dimensional data item in each subspace of that child node and the K cluster centers of that subspace, and taking the nearest cluster center as the cluster center to which the P/m-dimensional data item belongs, thereby obtaining the cluster center of every P/m-dimensional data item in every subspace of that child node;
for each P-dimensional data item on each child node, merging the cluster centers to which its m P/m-dimensional data items belong to obtain the m-dimensional keyword of that P-dimensional data item, and hence all P-dimensional data items contained in each m-dimensional keyword.
Further, determining, among the K^m m-dimensional keywords, the multiple m-dimensional keywords that match the query data specifically comprises:
on the master node, dividing the P-dimensional query data into m data items of P/m dimensions, and computing the Euclidean distance between the P-dimensional query data and each of the K^m m-dimensional keywords;
taking the a m-dimensional keywords with the smallest Euclidean distance to the query data as the matching m-dimensional keywords; or taking the b m-dimensional keywords whose Euclidean distance to the query data is less than a preset distance as the matching m-dimensional keywords, where a and b are positive integers;
distributing, by the master node, the query data and all m-dimensional keywords matching the query data to each child node.
Further, determining, according to all high-dimensional data items contained in each m-dimensional keyword, the high-dimensional data items that match the query data among those of each matching m-dimensional keyword, and taking all high-dimensional data items found to match the query data as the query result, specifically comprises:
on each child node, computing the Euclidean distance between the query data and each high-dimensional data item contained in each matching m-dimensional keyword;
taking the e high-dimensional data items with the smallest Euclidean distance to the query data as the high-dimensional data items matching the query data; or taking the g high-dimensional data items whose Euclidean distance to the query data is less than a preset distance as the matching high-dimensional data items, where e and g are positive integers;
aggregating on the master node the high-dimensional data items matching the query data that were found on all child nodes, and returning all aggregated high-dimensional data items as the query result.
Further, the value of m is 2.
According to another aspect of the present invention, a distributed indexing system for large-scale high-dimensional data is provided, comprising:
a division module, configured to divide each high-dimensional data item stored on each child node into m low-dimensional data items and to store the low-dimensional data items of each high-dimensional data item in the corresponding m subspaces, where m is an integer greater than or equal to 2;
a clustering module, configured to obtain, for the low-dimensional data of the same subspace on all child nodes, K cluster centers of each subspace using a distributed clustering algorithm, to take each cluster center as a one-dimensional keyword, and to combine the K keywords of every subspace to obtain K^m m-dimensional keywords;
a computing module, configured to compute on each child node the cluster center to which the low-dimensional data in each subspace of that child node belongs, thereby obtaining the m-dimensional keyword to which each high-dimensional data item belongs, and hence all high-dimensional data items contained in each m-dimensional keyword;
a first determining module, configured to determine, among the K^m m-dimensional keywords, the multiple m-dimensional keywords that match the query data;
a second determining module, configured to determine, according to all high-dimensional data items contained in each m-dimensional keyword, the high-dimensional data items that match the query data among those of each matching m-dimensional keyword, and to take all high-dimensional data items found to match the query data as the query result.
According to a further aspect of the present invention, a non-transitory computer-readable storage medium is provided, characterized in that the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to execute the distributed indexing method for large-scale high-dimensional data.
The present invention provides a distributed indexing method and system for large-scale high-dimensional data. All high-dimensional data are stored in a distributed manner on a cluster, and each high-dimensional data item is divided into multiple low-dimensional data items, so that each subspace stores one low-dimensional part of every high-dimensional data item. A distributed clustering algorithm yields multiple cluster centers for the low-dimensional data of each subspace; the cluster centers of the subspaces are combined to obtain multiple multi-dimensional keywords covering all high-dimensional data, and the high-dimensional data contained in each multi-dimensional keyword are computed. At query time, the multi-dimensional keywords matching the query data are found first, and the high-dimensional data contained in each of those keywords are then searched. By combining distributed clustering, distributed querying and an inverted index over multiple subspaces, the invention improves retrieval efficiency while preserving retrieval accuracy; the method can be applied to the retrieval of large-scale distributed data and scales well.
Brief description of the drawings
Fig. 1 is a flowchart of the distributed indexing method for large-scale high-dimensional data according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the two-dimensional cluster centers constructed in an embodiment of the present invention;
Fig. 3 is a schematic diagram of finding the matching two-dimensional cluster centers for the query data in an embodiment of the present invention;
Fig. 4 is a connection block diagram of the distributed indexing system for large-scale high-dimensional data according to an embodiment of the present invention.
Detailed description
Specific implementations of the present invention are described in further detail below with reference to the drawings and embodiments. The following embodiments illustrate the present invention and do not limit its scope.
Referring to Fig. 1, an embodiment of the present invention provides a distributed indexing method for large-scale high-dimensional data that combines distributed processing with an inverted index and improves the efficiency of data retrieval while maintaining retrieval precision. The method comprises: dividing each high-dimensional data item stored on each child node into m low-dimensional data items, and storing the low-dimensional data items of each high-dimensional data item in the corresponding m subspaces, where m is an integer greater than or equal to 2; for the low-dimensional data of the same subspace on all child nodes, obtaining K cluster centers of each subspace using a distributed clustering algorithm, taking each cluster center as a one-dimensional keyword, and combining the K keywords of every subspace to obtain K^m m-dimensional keywords; computing, on each child node, the cluster center to which the low-dimensional data in each subspace of that child node belongs, thereby obtaining the m-dimensional keyword to which each high-dimensional data item belongs, and hence all high-dimensional data items contained in each m-dimensional keyword; determining, among the K^m m-dimensional keywords, the multiple m-dimensional keywords that match the query data; and, according to all high-dimensional data items contained in each m-dimensional keyword, determining the high-dimensional data items that match the query data among those of each matching m-dimensional keyword, and taking all high-dimensional data items found to match the query data as the query result.
This embodiment is mainly intended for the retrieval of large-scale high-dimensional data. Because the volume of such data is large, all high-dimensional data are first stored in a distributed manner in a computer cluster made up of multiple child nodes, and one child node is selected as the master node, which manages all child nodes and is responsible for collecting and distributing data across them. To support retrieval over large-scale high-dimensional data, the high-dimensional data on each child node are reduced in dimension: each high-dimensional data item is split into multiple low-dimensional data items. Each low-dimensional data item of a high-dimensional data item is stored in its corresponding subspace, so that each subspace holds one particular low-dimensional part of every high-dimensional data item. The high-dimensional data on all child nodes are processed in the same way.
The low-dimensional data belonging to one subspace are spread over different child nodes. To improve the speed and efficiency of clustering, this embodiment therefore clusters the low-dimensional data of the same subspace on all child nodes with a distributed clustering algorithm, obtaining K cluster centers for each subspace, and takes each cluster center of a subspace as a one-dimensional keyword. The cluster centers of the subspaces are combined to obtain all possible multi-dimensional keywords of the high-dimensional data. Each child node then computes the cluster center to which the low-dimensional data of each subspace belongs and combines these centers into the multi-dimensional keyword to which each high-dimensional data item belongs, which finally gives all high-dimensional data items contained in each multi-dimensional keyword.
At query time, the master node first finds, among all possible multi-dimensional keywords, the multiple multi-dimensional keywords that match the query data. Each child node then searches the high-dimensional data items contained in those keywords for the ones that match the query data. The matching high-dimensional data items found on all child nodes are aggregated to give the final result for the query.
This embodiment combines distributed clustering, distributed querying and an inverted index over multiple subspaces. It improves retrieval efficiency while preserving retrieval accuracy, can be applied to the retrieval of large-scale distributed data, and scales well.
On the basis of the above embodiment, in one embodiment of the present invention, dividing each high-dimensional data item stored on each child node into m low-dimensional data items and storing the low-dimensional data items of each high-dimensional data item in the corresponding m subspaces specifically comprises: on each child node, dividing each P-dimensional data item on that node along its dimensions into m data items of P/m dimensions, and storing each P/m-dimensional data item of each P-dimensional data item in the corresponding subspace, where P/m is an integer and the number of subspaces is m.
When reducing the dimension of the high-dimensional data, all high-dimensional data items are assumed to be P-dimensional, with P greater than or equal to 2. On each child node, each P-dimensional data item is divided into m low-dimensional data items of P/m dimensions, where P/m is an integer, that is, P is an integer multiple of m, and m is a positive integer. After a P-dimensional data item has been divided into m low-dimensional data items of P/m dimensions, each low-dimensional data item is stored in its corresponding subspace; there are m subspaces. For example, a 4-dimensional data item on a child node is divided into two 2-dimensional data items, which are stored in the two corresponding subspaces. The P-dimensional data on all child nodes undergo the same dimension reduction, and each resulting low-dimensional data item is stored in its corresponding subspace.
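The division step can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: it assumes the P dimensions are cut into m contiguous blocks (the text only requires m parts of P/m dimensions each), and the function name `split_into_subvectors` is hypothetical.

```python
def split_into_subvectors(vector, m):
    """Split a P-dimensional vector into m sub-vectors of P/m dimensions each.

    Assumption: the parts are contiguous blocks of coordinates.
    """
    p = len(vector)
    if m < 2 or p % m != 0:
        raise ValueError("m must be >= 2 and must divide the dimension P")
    width = p // m
    return [vector[i * width:(i + 1) * width] for i in range(m)]


# The 4-dimensional example from the text: two 2-dimensional sub-vectors.
parts = split_into_subvectors([0.1, 0.2, 0.3, 0.4], m=2)
# parts[0] goes to the first subspace, parts[1] to the second.
```

Because every child node applies the same split, the i-th part of every data item lands in the i-th subspace regardless of which node stores it.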
On the basis of the above embodiments, in one embodiment of the present invention, obtaining the K cluster centers of each subspace using a distributed clustering algorithm for the low-dimensional data of the same subspace on all child nodes, taking each cluster center as a one-dimensional keyword, and combining the K keywords of every subspace to obtain K^m m-dimensional keywords specifically comprises: clustering the P/m-dimensional data of the i-th subspace on all child nodes with a distributed K-Means clustering algorithm to obtain the K cluster centers of each subspace, denoted respectively:
U_1 = [u_{11}, u_{12}, ..., u_{1K}]
U_2 = [u_{21}, u_{22}, ..., u_{2K}]
...
U_i = [u_{i1}, u_{i2}, ..., u_{iK}]
...
U_m = [u_{m1}, u_{m2}, ..., u_{mK}]
where i = 1, 2, ..., m indexes the subspaces and u_{ik} denotes the k-th cluster center of the i-th subspace, k = 1, 2, ..., K. Each cluster center is taken as a keyword, and the K cluster centers of the m subspaces are combined on the master node to obtain K^m m-dimensional keywords, each denoted U = [u_{1x}, u_{2y}, ..., u_{mw}], where x, y and w are integers with 0 < x, y, w ≤ K, U denotes an m-dimensional keyword, x indicates the x-th cluster center chosen from U_1 = [u_{11}, u_{12}, ..., u_{1K}], y indicates the y-th cluster center chosen from U_2 = [u_{21}, u_{22}, ..., u_{2K}], and w indicates the w-th cluster center chosen from U_m = [u_{m1}, u_{m2}, ..., u_{mK}].
In the above embodiment, the P-dimensional data on each child node were reduced in dimension into m low-dimensional data items of P/m dimensions, and each low-dimensional data item was stored in its corresponding subspace. The low-dimensional data of one subspace are spread over different child nodes, so to improve the speed and efficiency of clustering this embodiment uses a distributed K-Means clustering algorithm to obtain the K cluster centers of the low-dimensional data of each subspace across all child nodes; that is, each of the m subspaces has its own K cluster centers. The distributed K-Means clustering algorithm is an existing distributed clustering algorithm and is not described further here. The K cluster centers of all subspaces are then combined on the master node to obtain all possible multi-dimensional keywords, K^m in total.
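The text leaves the distributed K-Means algorithm unspecified. As a rough sketch of one common formulation, assumed here and not taken from the patent, each child node can assign its local sub-vectors to the current centers and report per-center sums and counts, and the master node can merge these into updated centers. All function names, the fixed iteration count and the toy data are illustrative assumptions.

```python
def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def local_partial_sums(points, centers):
    """Runs on one child node: per-center coordinate sums and counts."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in points:
        k = min(range(len(centers)),
                key=lambda c: squared_distance(point, centers[c]))
        counts[k] += 1
        for d, value in enumerate(point):
            sums[k][d] += value
    return sums, counts


def merge_on_master(partials, old_centers):
    """Runs on the master node: combine partial results into new centers."""
    new_centers = []
    for k, old in enumerate(old_centers):
        total = sum(counts[k] for _, counts in partials)
        if total == 0:
            new_centers.append(list(old))  # keep a center with no points
            continue
        new_centers.append([sum(sums[k][d] for sums, _ in partials) / total
                            for d in range(len(old))])
    return new_centers


def distributed_kmeans(node_points, initial_centers, iterations=10):
    """node_points: one list of sub-vectors per child node, same subspace."""
    centers = [list(c) for c in initial_centers]
    for _ in range(iterations):
        partials = [local_partial_sums(pts, centers) for pts in node_points]
        centers = merge_on_master(partials, centers)
    return centers


# Toy data: two child nodes, each holding two 2-dimensional sub-vectors.
node_points = [[[0.0, 0.0], [0.0, 2.0]], [[10.0, 10.0], [10.0, 12.0]]]
centers = distributed_kmeans(node_points, [[0.0, 0.0], [10.0, 10.0]])
```

In this formulation only the small per-center sums and counts travel to the master node, so the raw data stays where it is stored. The routine would be run once per subspace to produce U_1 through U_m.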
On the basis of the above embodiments, in another embodiment of the present invention, computing on each child node the cluster center to which the low-dimensional data in each subspace of that child node belongs, thereby obtaining the m-dimensional keyword to which each high-dimensional data item belongs and hence all high-dimensional data items contained in each m-dimensional keyword, specifically comprises: on each child node, computing the Euclidean distance between each P/m-dimensional data item in each subspace of that child node and the K cluster centers of that subspace, and taking the nearest cluster center as the cluster center to which the P/m-dimensional data item belongs, thereby obtaining the cluster center of every P/m-dimensional data item in every subspace of that child node; and, for each P-dimensional data item on each child node, merging the cluster centers to which its m P/m-dimensional data items belong to obtain the m-dimensional keyword of that P-dimensional data item, and hence all P-dimensional data items contained in each m-dimensional keyword.
In the above embodiment, the K cluster centers of the low-dimensional data of each subspace across all child nodes were computed with the distributed K-Means clustering algorithm. In this embodiment, each child node computes the cluster center to which the low-dimensional data in each of its subspaces belongs. To do so, it computes the Euclidean distance between a low-dimensional data item in the subspace and each of the K cluster centers and takes the cluster center with the smallest distance as the one to which the low-dimensional data item belongs; this is done for every low-dimensional data item in the subspace. Because a high-dimensional data item is distributed over multiple subspaces, the cluster centers of its low-dimensional parts in those subspaces are combined to give the m-dimensional keyword to which the high-dimensional data item belongs. The data on every child node are processed in the same way, giving the m-dimensional keyword of each high-dimensional data item on each child node, from which all high-dimensional data items contained in each m-dimensional keyword can be collected.
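The assignment step amounts to building an inverted index from keywords to data items. The sketch below assumes that a keyword is represented as a tuple of zero-based center indices, one per subspace, that the sub-vectors are contiguous blocks, and that each data item carries an id; these representation choices and the names `keyword_of` and `build_local_inverted_index` are illustrative and not prescribed by the text.

```python
from collections import defaultdict


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def keyword_of(vector, codebooks):
    """Keyword of a vector: index of the nearest center in each subspace."""
    m = len(codebooks)
    width = len(vector) // m
    keyword = []
    for i, centers in enumerate(codebooks):
        sub = vector[i * width:(i + 1) * width]
        keyword.append(min(range(len(centers)),
                           key=lambda k: squared_distance(sub, centers[k])))
    return tuple(keyword)


def build_local_inverted_index(local_vectors, codebooks):
    """Runs on one child node: keyword -> ids of the local vectors it contains."""
    index = defaultdict(list)
    for vector_id, vector in local_vectors.items():
        index[keyword_of(vector, codebooks)].append(vector_id)
    return dict(index)


# Two subspaces (m = 2) with K = 2 centers each; values are made up.
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [5.0, 5.0]]]
local_vectors = {
    "a": [0.1, 0.0, 4.9, 5.2],
    "b": [0.9, 1.1, 0.2, 0.1],
    "c": [0.0, 0.2, 5.0, 5.0],
}
index = build_local_inverted_index(local_vectors, codebooks)
```

Here items "a" and "c" share the keyword (0, 1) while "b" falls under (1, 0). Ranking by squared distance gives the same nearest center as ranking by Euclidean distance, so the square root is skipped.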
On the basis of the above embodiments, in one embodiment of the present invention, determining, among the K^m m-dimensional keywords, the multiple m-dimensional keywords that match the query data specifically comprises: on the master node, dividing the P-dimensional query data into m data items of P/m dimensions, and computing the Euclidean distance between the P-dimensional query data and each of the K^m m-dimensional keywords; taking the a m-dimensional keywords with the smallest Euclidean distance to the query data as the matching m-dimensional keywords, or taking the b m-dimensional keywords whose Euclidean distance to the query data is less than a preset distance as the matching m-dimensional keywords; and distributing, by the master node, the query data and all m-dimensional keywords matching the query data to each child node.
During retrieval, the master node receives the query request, divides the P-dimensional query data into m low-dimensional data items of P/m dimensions, and computes the Euclidean distance between the P-dimensional query data and each of the K^m possible m-dimensional keywords. When computing the Euclidean distance, the P-dimensional query data can be regarded as a vector of m low-dimensional data items and each m-dimensional keyword as a vector of m components; the distance between these two vectors is the Euclidean distance between the query data and that m-dimensional keyword.
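One way to read this distance, offered here as an interpretation rather than a formula stated in the text, is to treat the keyword as the concatenation of its m cluster centers. Writing q_i for the i-th P/m-dimensional part of the query q and x_i for the index of the center chosen in subspace i (notation introduced only for this note), the squared distance splits into per-subspace terms:

```latex
d(q, U)^2 \;=\; \sum_{i=1}^{m} \bigl\lVert q_i - u_{i,x_i} \bigr\rVert^2,
\qquad U = [\,u_{1,x_1},\, u_{2,x_2},\, \ldots,\, u_{m,x_m}\,]
```

Under this reading, the m·K distances between each query part and each center of the corresponding subspace are enough to score all K^m keywords by summation.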
Based on the computed Euclidean distances between the query data and each m-dimensional keyword, the keywords are sorted by distance in ascending order, and the first a m-dimensional keywords are taken as the keywords matching the query data. Alternatively, a Euclidean distance threshold is set, and the b m-dimensional keywords whose Euclidean distance to the query data is less than the threshold are taken as the keywords matching the query data. At this point the master node has found all m-dimensional keywords close to the query data. The master node then distributes the query data and the matching m-dimensional keywords to each child node.
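The keyword-matching step on the master node might look like the following sketch. It assumes that a keyword is a tuple of zero-based center indices and that the query-to-keyword distance is the sum of per-subspace squared distances; the function name `match_keywords` and its parameters `a` and `max_distance` are illustrative, mirroring the "nearest a" and "within a preset distance" options in the text.

```python
import heapq
from itertools import product


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def match_keywords(query, codebooks, a=None, max_distance=None):
    """Return the nearest a keywords, or all keywords closer than max_distance."""
    if a is None and max_distance is None:
        raise ValueError("give either a or max_distance")
    m = len(codebooks)
    width = len(query) // m
    # table[i][k]: squared distance from the i-th query part to center k
    table = [[squared_distance(query[i * width:(i + 1) * width], center)
              for center in centers]
             for i, centers in enumerate(codebooks)]
    scored = ((sum(table[i][k] for i, k in enumerate(keyword)), keyword)
              for keyword in product(*(range(len(c)) for c in codebooks)))
    if a is not None:
        return [kw for _, kw in heapq.nsmallest(a, scored)]
    limit = max_distance ** 2  # compare squared values to avoid square roots
    return [kw for dist, kw in sorted(scored) if dist < limit]


codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [5.0, 5.0]]]
query = [0.1, 0.1, 4.0, 4.0]
nearest_two = match_keywords(query, codebooks, a=2)
within_radius = match_keywords(query, codebooks, max_distance=1.5)
```

This sketch enumerates all K^m keywords, which is fine for small K and m; the text does not say how the enumeration is organized for large K.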
On the basis of the various embodiments described above, in an alternative embodiment of the invention, tieing up keyword according to each m includes All high dimensional datas, determine in all high dimensional datas in each similar m dimension keyword with the inquiry data The high dimensional data matched finds all high dimensional datas with the inquiry Data Matching, is specifically included as query result:Every In one child node, each high dimensional data that the inquiry data are included with each matched m dimension keywords is calculated Euclidean distance;Using the preceding e high dimensional data nearest with the Euclidean distance of the inquiry data as with the inquiry Data Matching High dimensional data;Alternatively, being less than g high dimensional data of pre-determined distance as matched using with the Euclidean distance of the inquiry data High dimensional data;The high dimensional data with the inquiry Data Matching for summarizing and being found in all child nodes is collected on the primary node, it will All high dimensional datas summarized are returned as query result.
Each child node receives the inquiry data of host node distribution and is tieed up with multiple m similar in inquiry data crucial Word, since aforementioned each m that calculated in each child node ties up all high dimensional datas that keyword is included, because This, can find in each child node and tie up the high dimensional data that keyword is included with each m similar in inquiry data.It is right In all high dimensional datas that each m dimension keywords are included, calculates each high dimensional data and inquire the Europe between data Family name's distance obtains each high dimensional data and inquires the Euclidean distance between data, and according to Euclidean distance from small to large suitable Sequence is ranked up, using the preceding e high dimensional data nearest with the Euclidean distance of inquiry data as the higher-dimension with inquiry Data Matching Data;Alternatively, being less than g high dimensional data of pre-determined distance as matched high dimensional data using with the Euclidean distance of inquiry data.
Once the high-dimensional data matching the query data has been found on each child node, all of it is aggregated on the master node and returned as the query result, which completes the retrieval and query process.
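The child-node search and the master-node aggregation described above can be sketched as follows. This is a minimal single-process illustration, not the patented implementation: the function names `search_node` and `aggregate`, the dictionary layout of a bucket, and the toy 2-dimensional vectors are all assumptions made for the example.

```python
import heapq
import numpy as np

def search_node(query, bucket, e=None, radius=None):
    """Return (distance, item_id) pairs found on one child node.

    bucket maps item_id -> vector and holds the items of the matched keywords.
    Pass e for the "nearest e items" rule, or radius for the
    "closer than a preset distance" rule (hypothetical parameter names).
    """
    scored = [(float(np.linalg.norm(vec - query)), item_id)
              for item_id, vec in bucket.items()]
    if e is not None:
        return heapq.nsmallest(e, scored)          # already sorted ascending
    return sorted(pair for pair in scored if pair[0] < radius)

def aggregate(per_node_results):
    """Master-side step: pool the matches reported by every child node."""
    return sorted(pair for result in per_node_results for pair in result)

query = np.array([0.0, 0.0])
node_a = {"a1": np.array([1.0, 0.0]), "a2": np.array([3.0, 4.0])}
node_b = {"b1": np.array([0.0, 2.0]), "b2": np.array([6.0, 8.0])}

top = aggregate(search_node(query, n, e=1) for n in (node_a, node_b))
near = aggregate(search_node(query, n, radius=2.5) for n in (node_a, node_b))
```

Sorting the pooled list on the master node is a presentation choice of this sketch; the text only requires that the matches from all child nodes be collected and returned.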
It should be noted that extensive experiments show m = 2 to be a preferable choice. The distributed index method provided by the invention is therefore illustrated below with m = 2 and P = 1024.
On each child node, each 1024-dimensional data item is divided into two 512-dimensional low-dimensional data items, which are stored in the two corresponding subspaces. Each child node therefore has two subspaces: one stores one 512-dimensional part of every 1024-dimensional data item, and the other stores the remaining 512-dimensional part. In other words, the two subspaces are the same on every child node; only the data distributed in them differs from node to node.
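As an illustration of this division step, the sketch below splits the vectors held by one child node into two 512-dimensional parts. It assumes that the split is into contiguous, equal-width blocks of dimensions, which the text does not state explicitly; the function name `split_into_subspaces` and the random data are hypothetical.

```python
import numpy as np

def split_into_subspaces(data, m=2):
    """Split an (N, P) array into m arrays of shape (N, P // m).

    Assumption: the m parts are contiguous blocks of dimensions.
    """
    n_items, p = data.shape
    if p % m != 0:
        raise ValueError("P must be divisible by m")
    return np.split(data, m, axis=1)

rng = np.random.default_rng(0)
node_data = rng.normal(size=(5, 1024))    # 5 vectors held by one child node
first_half, second_half = split_into_subspaces(node_data, m=2)
```

Row r of `first_half` and row r of `second_half` belong to the same original vector, so the row number can serve as the link between the two subspaces.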
Referring to Fig. 2, for the 512-dimensional data in the two subspaces on all child nodes, a distributed K-Means clustering algorithm is used to find K cluster centres of all the 512-dimensional data in each subspace. The K cluster centres obtained in the two subspaces are denoted $U = [u_1, u_2, \ldots, u_K]$ and $V = [v_1, v_2, \ldots, v_K]$ respectively, where $u_1, u_2, \ldots, u_K$ and $v_1, v_2, \ldots, v_K$ are the K cluster centres of the two subspaces. Combining U and V yields all $K^2$ possible keywords.
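The text does not spell out how the distributed K-Means step is carried out. One common arrangement, shown here purely as an assumption, is for each child node to report per-centre sums and counts for its local data and for the master node to merge them into new centres. The names `local_stats` and `distributed_kmeans`, the iteration count, and the random initialisation are all choices of this sketch.

```python
import itertools
import numpy as np

def local_stats(points, centres):
    """Child-node step: per-centre sums and counts for the local points."""
    dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    k, dim = centres.shape
    sums = np.zeros((k, dim))
    counts = np.zeros(k, dtype=int)
    np.add.at(sums, labels, points)
    np.add.at(counts, labels, 1)
    return sums, counts

def distributed_kmeans(shards, k, iters=10, seed=0):
    """Master-side loop: merge per-node statistics and update the centres."""
    rng = np.random.default_rng(seed)
    # Simplification of this sketch: the data is pooled only to draw the
    # initial centres; a real cluster would sample them without pooling.
    pool = np.vstack(shards)
    centres = pool[rng.choice(len(pool), size=k, replace=False)].copy()
    for _ in range(iters):
        stats = [local_stats(shard, centres) for shard in shards]
        sums = sum(s for s, _ in stats)
        counts = sum(c for _, c in stats)
        nonempty = counts > 0                 # leave empty clusters unchanged
        centres[nonempty] = sums[nonempty] / counts[nonempty][:, None]
    return centres

rng = np.random.default_rng(1)
K = 4
# Two child nodes, each holding the 512-dimensional parts of its own vectors.
shards_first = [rng.normal(size=(50, 512)) for _ in range(2)]
shards_second = [rng.normal(size=(50, 512)) for _ in range(2)]
U = distributed_kmeans(shards_first, K)
V = distributed_kmeans(shards_second, K)
# The pair (i, j) stands for the keyword formed from u_i and v_j.
keywords = list(itertools.product(range(K), range(K)))
```

Representing a keyword by the index pair (i, j) rather than by the concatenated centre vectors keeps the list of $K^2$ keywords small; the centre vectors themselves stay in U and V.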
With the K cluster centres of each subspace found by clustering, each child node calculates the cluster centre to which each 512-dimensional data item in each subspace belongs. The cluster centres to which the two 512-dimensional parts belong are then merged to obtain the keyword to which each 1024-dimensional data item belongs, denoted $[u_i v_j]$, meaning that one 512-dimensional part of the item belongs to the i-th cluster centre and the other 512-dimensional part belongs to the j-th cluster centre. The keywords of all 1024-dimensional data items are obtained in the same way, from which all the 1024-dimensional data items contained in each keyword can be counted.
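The assignment step above amounts to building, on each child node, a table from keyword to the local items that belong to it. The sketch below shows one way to do that; `nearest_centre` and `build_index` are hypothetical names, and the toy example uses 2-dimensional parts instead of 512-dimensional ones so that the result can be checked by eye.

```python
from collections import defaultdict
import numpy as np

def nearest_centre(points, centres):
    """Index of the closest centre (Euclidean distance) for every row."""
    dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
    return dists.argmin(axis=1)

def build_index(first_half, second_half, U, V):
    """Map each keyword (i, j) to the local row ids that belong to it."""
    i_idx = nearest_centre(first_half, U)
    j_idx = nearest_centre(second_half, V)
    index = defaultdict(list)
    for row, (i, j) in enumerate(zip(i_idx, j_idx)):
        index[(int(i), int(j))].append(row)
    return dict(index)

# Toy centres and data: 2-dimensional parts stand in for 512-dimensional ones.
U = np.array([[0.0, 0.0], [10.0, 10.0]])
V = np.array([[0.0, 0.0], [5.0, 5.0]])
first_half = np.array([[0.1, 0.2], [9.0, 9.5], [0.3, 0.1]])
second_half = np.array([[4.8, 5.1], [0.2, 0.0], [5.2, 4.9]])
index = build_index(first_half, second_half, U, V)
```

Only keywords that actually contain data appear in `index`, so the table can be far smaller than the full set of $K^2$ combinations.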
During retrieval and query, the master node divides the 1024-dimensional query data q into two 512-dimensional parts and calculates the Euclidean distance between the query data and each of the $K^2$ keywords, in order to find multiple keywords similar to the query data (see Fig. 3). The master node then distributes the query data and the multiple similar keywords, each formed from 512-dimensional cluster centres, to each child node. On each child node, all the 1024-dimensional data items contained in each keyword similar to the query data are looked up, and the Euclidean distance between the query data and each of those 1024-dimensional data items is calculated. This yields, within each keyword, the 1024-dimensional data items similar to the query data, and hence all the 1024-dimensional data items on that child node that are similar to the query data. The similar 1024-dimensional data items found on all child nodes are then aggregated and returned as the query result.
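The keyword-matching step on the master node can be sketched as below. The sketch assumes that a keyword's vector is the concatenation of its two cluster centres, in which case the squared distance from the query to keyword $[u_i v_j]$ equals $\lVert q_1 - u_i \rVert^2 + \lVert q_2 - v_j \rVert^2$, so all $K^2$ distances follow from only 2K subspace distances. The function name `rank_keywords` and the parameter `a` (number of keywords kept) are illustrative.

```python
import numpy as np

def rank_keywords(query, U, V, a):
    """Return the a keywords (i, j, distance) closest to the query."""
    half = U.shape[1]
    q1, q2 = query[:half], query[half:]
    d_u = np.sum((U - q1) ** 2, axis=1)       # squared distances to each u_i
    d_v = np.sum((V - q2) ** 2, axis=1)       # squared distances to each v_j
    table = np.sqrt(d_u[:, None] + d_v[None, :])   # K x K keyword distances
    order = np.argsort(table, axis=None)[:a]        # a smallest, ascending
    rows, cols = np.unravel_index(order, table.shape)
    return [(int(i), int(j), float(table[i, j])) for i, j in zip(rows, cols)]

# Toy centres: 2-dimensional parts stand in for 512-dimensional ones.
U = np.array([[0.0, 0.0], [10.0, 10.0]])
V = np.array([[0.0, 0.0], [5.0, 5.0]])
query = np.array([1.0, 0.0, 5.0, 5.0])
matches = rank_keywords(query, U, V, a=2)
```

The alternative rule in the text, keeping every keyword closer than a preset distance, would replace the `argsort` line with a comparison of `table` against that threshold.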
Referring to Fig. 4, an embodiment of the invention provides a distributed index system for large-scale high-dimensional data, including a division module 21, a clustering module 22, a computing module 23, a first determining module 24 and a second determining module 25.
The division module 21 is configured to divide each high-dimensional data item stored on each child node into m low-dimensional data items and to store the low-dimensional data items of each high-dimensional data item in the corresponding m subspaces, where m is an integer greater than or equal to 2.
The clustering module 22 is configured to apply a distributed clustering algorithm to the low-dimensional data of the same subspace on all child nodes to obtain K cluster centres for each subspace, to treat each cluster centre as a one-dimensional keyword, and to combine the K keywords of each subspace to obtain $K^m$ m-dimensional keywords.
The computing module 23 is configured to calculate, on each child node, the cluster centre to which the low-dimensional data in each subspace of that child node belongs, and thereby to obtain the m-dimensional keyword to which each high-dimensional data item belongs, so as to obtain all the high-dimensional data contained in each m-dimensional keyword.
The first determining module 24 is configured to determine, among the $K^m$ m-dimensional keywords, multiple m-dimensional keywords that match the query data.
The second determining module 25 is configured to determine, according to all the high-dimensional data contained in each m-dimensional keyword, the high-dimensional data that matches the query data among all the high-dimensional data in each similar m-dimensional keyword, and to find all high-dimensional data matching the query data as the query result.
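For general m, the keyword space handled by these modules has $K^m$ entries. The short sketch below, with hypothetical helper names `keyword_space` and `keyword_of`, represents an m-dimensional keyword as a tuple of m cluster-centre indices and enumerates the space lazily; it is an illustration of the combination step, not a description of how the modules are implemented.

```python
import itertools

def keyword_space(k, m):
    """Lazily enumerate every m-dimensional keyword as a tuple of indices."""
    return itertools.product(range(k), repeat=m)

def keyword_of(assignments):
    """Keyword of one item, given its centre index in each of m subspaces."""
    return tuple(int(a) for a in assignments)

k, m = 3, 3
all_keywords = list(keyword_space(k, m))
example = keyword_of([2, 0, 1])
```

Because the count grows as $K^m$, materialising the full list is only practical for small m, which is consistent with the preference for m = 2 stated in the description.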
The invention also provides a non-transitory computer-readable storage medium that stores computer instructions. The computer instructions cause a computer to execute the distributed index method for large-scale high-dimensional data provided by the corresponding embodiments above, for example including: dividing each high-dimensional data item stored on each child node into m low-dimensional data items, and storing the low-dimensional data items of each high-dimensional data item in the corresponding m subspaces, where m is an integer greater than or equal to 2; for the low-dimensional data of the same subspace on all child nodes, obtaining K cluster centres for each subspace using a distributed clustering algorithm, treating each cluster centre as a one-dimensional keyword, and combining the K keywords of each subspace to obtain $K^m$ m-dimensional keywords; on each child node, calculating the cluster centre to which the low-dimensional data in each subspace of that child node belongs, thereby obtaining the m-dimensional keyword to which each high-dimensional data item belongs, so as to obtain all the high-dimensional data contained in each m-dimensional keyword; determining, among the $K^m$ m-dimensional keywords, multiple m-dimensional keywords that match the query data; and, according to all the high-dimensional data contained in each m-dimensional keyword, determining the high-dimensional data that matches the query data among all the high-dimensional data in each similar m-dimensional keyword, and finding all high-dimensional data matching the query data as the query result.
A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk or an optical disc.
The embodiments described above, such as the device for the distributed index method for large-scale high-dimensional data, are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.
From the above description of the embodiments, a person skilled in the art will clearly understand that each embodiment can be implemented by software together with a necessary general-purpose hardware platform, and of course also by hardware. On this understanding, the above technical solution, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods of the embodiments or of certain parts of the embodiments.
In the distributed index method and system for large-scale high-dimensional data provided by the invention, all high-dimensional data is stored in a distributed manner on a cluster. Each high-dimensional data item is divided into multiple low-dimensional data items, and each subspace stores one low-dimensional data item of every high-dimensional data item. A distributed clustering algorithm yields multiple cluster centres for all the low-dimensional data of each subspace. Each cluster centre is treated as a one-dimensional keyword, and the one-dimensional keywords of the subspaces are combined to obtain multiple multi-dimensional keywords for all the high-dimensional data; the high-dimensional data contained in each multi-dimensional keyword is then computed. At query time, the multi-dimensional keywords matching the query data are found first, and the high-dimensional data contained in each of those multi-dimensional keywords is then searched. By combining distributed clustering, distributed querying and a multi-dimensional index over multiple subspaces, the invention improves retrieval and query efficiency while maintaining retrieval accuracy. The method can be applied to retrieval over large-scale distributed data and scales well.
Finally, the methods of the present application are merely preferred embodiments and are not intended to limit the scope of protection of the invention. Any modification, equivalent replacement, improvement or the like made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (10)

1. A distributed index method for large-scale high-dimensional data, characterized by comprising:
dividing each high-dimensional data item stored on each child node into m low-dimensional data items, and storing the low-dimensional data items of each high-dimensional data item in the corresponding m subspaces, where m is an integer greater than or equal to 2;
for the low-dimensional data of the same subspace on all child nodes, obtaining K cluster centres for each subspace using a distributed clustering algorithm, treating each cluster centre as a one-dimensional keyword, and combining the K keywords of each subspace to obtain $K^m$ m-dimensional keywords, where K is a positive integer;
on each child node, calculating the cluster centre to which the low-dimensional data in each subspace of that child node belongs, thereby obtaining the m-dimensional keyword to which each high-dimensional data item belongs, so as to obtain all the high-dimensional data contained in each m-dimensional keyword;
determining, among the $K^m$ m-dimensional keywords, multiple m-dimensional keywords that match the query data;
according to all the high-dimensional data contained in each m-dimensional keyword, determining the high-dimensional data that matches the query data among all the high-dimensional data in each similar m-dimensional keyword, and finding all high-dimensional data matching the query data as the query result.
2. The distributed index method for large-scale high-dimensional data according to claim 1, characterized by further comprising:
storing all high-dimensional data in a computer cluster composed of multiple child nodes, and selecting one child node from the multiple child nodes as the master node of all the child nodes.
3. The distributed index method for large-scale high-dimensional data according to claim 2, characterized in that dividing each high-dimensional data item stored on each child node into m low-dimensional data items, and storing the low-dimensional data items of each high-dimensional data item in the corresponding m subspaces, specifically includes:
on each child node, dividing each P-dimensional data item on that node along its dimensions into m P/m-dimensional data items, and storing each P/m-dimensional data item of each P-dimensional data item in the corresponding subspace, where P/m is an integer and the number of subspaces is m.
4. The distributed index method for large-scale high-dimensional data according to claim 3, characterized in that, for the low-dimensional data of the same subspace on all child nodes, obtaining K cluster centres for each subspace using a distributed clustering algorithm, treating each cluster centre as a one-dimensional keyword, and combining the K keywords of each subspace to obtain $K^m$ m-dimensional keywords, specifically includes:
performing distributed clustering on the P/m-dimensional data of the i-th subspace on all child nodes using a distributed K-Means clustering algorithm to obtain K cluster centres for each subspace, denoted respectively as $U_i = [u_{i1}, u_{i2}, \ldots, u_{ik}]$,
where $i = 1, 2, \ldots, m$, m denotes the m-th subspace, and k denotes the k-th cluster centre;
treating each cluster centre as a keyword, and combining, on the master node, the K cluster centres of the m subspaces to obtain $K^m$ m-dimensional keywords, denoted $U = [u_{1x}, u_{2y}, \ldots, u_{mw}]$, where $0 < x, y, w \le k$ and x, y, w are integers; U denotes an m-dimensional keyword, x denotes the x-th cluster centre chosen from $U_1 = [u_{11}, u_{12}, \ldots, u_{1k}]$, y denotes the y-th cluster centre chosen from $U_2 = [u_{21}, u_{22}, \ldots, u_{2k}]$, and w denotes the w-th cluster centre chosen from $U_m = [u_{m1}, u_{m2}, \ldots, u_{mk}]$.
5. The distributed index method for large-scale high-dimensional data according to claim 4, characterized in that calculating, on each child node, the cluster centre to which the low-dimensional data in each subspace of that child node belongs, thereby obtaining the m-dimensional keyword to which each high-dimensional data item belongs, so as to obtain all the high-dimensional data contained in each m-dimensional keyword, specifically includes:
on each child node, calculating the Euclidean distance between each P/m-dimensional data item in each subspace of that child node and the K cluster centres of that subspace, and taking the nearest cluster centre as the cluster centre to which the P/m-dimensional data item belongs, thereby obtaining the cluster centre to which each P/m-dimensional data item in each subspace of that child node belongs;
for each P-dimensional data item on each child node, merging the cluster centres to which its m corresponding P/m-dimensional data items belong to obtain the m-dimensional keyword of that P-dimensional data item, thereby obtaining all the P-dimensional data items contained in each m-dimensional keyword.
6. The distributed index method for large-scale high-dimensional data according to claim 5, characterized in that determining, among the $K^m$ m-dimensional keywords, multiple m-dimensional keywords that match the query data specifically includes:
on the master node, dividing the P-dimensional query data into m P/m-dimensional data items, and calculating the Euclidean distance between the P-dimensional query data and each of the $K^m$ m-dimensional keywords;
taking the a m-dimensional keywords nearest to the query data in Euclidean distance as the matched m-dimensional keywords; or alternatively, taking the b m-dimensional keywords whose Euclidean distance to the query data is less than a preset distance as the matched m-dimensional keywords, where a and b are positive integers;
distributing, by the master node, the query data and all the m-dimensional keywords matching the query data to each child node.
7. The distributed index method for large-scale high-dimensional data according to claim 6, characterized in that determining, according to all the high-dimensional data contained in each m-dimensional keyword, the high-dimensional data that matches the query data among all the high-dimensional data in each similar m-dimensional keyword, and finding all high-dimensional data matching the query data as the query result, specifically includes:
on each child node, calculating the Euclidean distance between the query data and each high-dimensional data item contained in each matched m-dimensional keyword;
taking the e high-dimensional data items nearest to the query data in Euclidean distance as the high-dimensional data matching the query data; or alternatively, taking the g high-dimensional data items whose Euclidean distance to the query data is less than a preset distance as the matched high-dimensional data, where e and g are positive integers;
aggregating, on the master node, the high-dimensional data matching the query data found on all child nodes, and returning all the aggregated high-dimensional data as the query result.
8. The distributed index method for large-scale high-dimensional data according to any one of claims 1 to 7, characterized in that the value of m is 2.
9. A distributed index system for large-scale high-dimensional data, characterized by comprising:
a division module, configured to divide each high-dimensional data item stored on each child node into m low-dimensional data items and to store the low-dimensional data items of each high-dimensional data item in the corresponding m subspaces, where m is an integer greater than or equal to 2;
a clustering module, configured to obtain, for the low-dimensional data of the same subspace on all child nodes, K cluster centres for each subspace using a distributed clustering algorithm, to treat each cluster centre as a one-dimensional keyword, and to combine the K keywords of each subspace to obtain $K^m$ m-dimensional keywords;
a computing module, configured to calculate, on each child node, the cluster centre to which the low-dimensional data in each subspace of that child node belongs, and thereby to obtain the m-dimensional keyword to which each high-dimensional data item belongs, so as to obtain all the high-dimensional data contained in each m-dimensional keyword;
a first determining module, configured to determine, among the $K^m$ m-dimensional keywords, multiple m-dimensional keywords that match the query data;
a second determining module, configured to determine, according to all the high-dimensional data contained in each m-dimensional keyword, the high-dimensional data that matches the query data among all the high-dimensional data in each similar m-dimensional keyword, and to find all high-dimensional data matching the query data as the query result.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause the computer to execute the method according to any one of claims 1 to 8.
CN201711349831.XA 2017-12-15 2017-12-15 A kind of distributed index method and system of extensive high dimensional data Active CN108090182B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711349831.XA CN108090182B (en) 2017-12-15 2017-12-15 A kind of distributed index method and system of extensive high dimensional data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711349831.XA CN108090182B (en) 2017-12-15 2017-12-15 A kind of distributed index method and system of extensive high dimensional data

Publications (2)

Publication Number Publication Date
CN108090182A CN108090182A (en) 2018-05-29
CN108090182B CN108090182B (en) 2018-10-30

Family

ID=62176631

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711349831.XA Active CN108090182B (en) 2017-12-15 2017-12-15 A kind of distributed index method and system of extensive high dimensional data

Country Status (1)

Country Link
CN (1) CN108090182B (en)

Also Published As

Publication number Publication date
CN108090182A (en) 2018-05-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant