CN108090182B - Distributed indexing method and system for large-scale high-dimensional data - Google Patents
- Publication number: CN108090182B
- Application number: CN201711349831.XA
- Authority: CN (China)
- Prior art keywords: data, high dimensional, keyword, dimension, dimensional data
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/319—Inverted lists
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a distributed indexing method and system for large-scale high-dimensional data. All high-dimensional data are stored in a distributed manner on a cluster, and each high-dimensional data item is divided into multiple low-dimensional data items, with each subspace storing one low-dimensional data item of every high-dimensional data item. A distributed clustering algorithm obtains multiple cluster centers for all low-dimensional data of each subspace; the cluster centers of the subspaces are combined to obtain multiple multidimensional keywords for all high-dimensional data, and the high-dimensional data contained in each multidimensional keyword are computed, thereby building a distributed index over all the data. At query time, the multidimensional keywords matching the query data are found first, and the high-dimensional data contained in each such keyword are then searched. The invention combines distributed clustering, distributed query and an inverted index over multiple subspaces; it improves the efficiency of retrieval and query while maintaining accuracy, and can be applied to the retrieval of large-scale distributed data.
Description
Technical field
The present invention relates to the technical field of big-data retrieval and query, and more particularly to a distributed indexing method and system for large-scale high-dimensional data.
Background
With the rapid development of information technology, unstructured data such as text, images, video and audio is growing exponentially. How to obtain the information a user wants quickly and accurately from massive Internet data is an important technical problem in the management and retrieval of unstructured big data. The text and image search services provided by Internet companies such as Google and Baidu make it far more convenient for people to obtain information, and all of these search services rely on approximate nearest neighbor query technology. For large-scale high-dimensional data, exact nearest neighbor queries consume large amounts of storage and computing resources, take too long, and give the index system too little throughput, so their practical value is limited. Approximate nearest neighbor query technology can greatly shorten query time and reduce storage and computing overhead while keeping the query results close to the exact results, and is therefore more practical. Beyond information retrieval, similarity search is widely used in fields such as machine learning, data mining and multimedia management.
As data volumes keep growing, more and more applications are built on large-scale data stored in distributed systems, for example Internet text, image and video retrieval. Many existing indexing algorithms are implemented for a single machine. In a distributed environment, most existing indexing methods require all data to be migrated to one machine for centralized processing. This conflicts with the distributed storage of the data, incurs a high data migration cost, and places very high demands on the data processing performance of a single machine.
Summary of the invention
The present invention provides a distributed indexing method and system for large-scale high-dimensional data that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a distributed indexing method for large-scale high-dimensional data is provided, comprising:
dividing each high-dimensional data item stored on each child node into m low-dimensional data items, and storing the low-dimensional data items of each high-dimensional data item in the corresponding m subspaces, where m is an integer greater than or equal to 2;
for the low-dimensional data of the same subspace on all child nodes, obtaining K cluster centers of each subspace using a distributed clustering algorithm, taking each cluster center as a keyword in one dimension, and combining the K keywords of every subspace to obtain K^m m-dimensional keywords, where K is a positive integer;
computing, on each child node, the cluster center to which the low-dimensional data in each subspace of that child node belongs, thereby obtaining the m-dimensional keyword to which each high-dimensional data item belongs, and thus all high-dimensional data contained in each m-dimensional keyword;
determining, among the K^m m-dimensional keywords, a plurality of m-dimensional keywords that match the query data;
according to all high-dimensional data contained in each m-dimensional keyword, determining, among all high-dimensional data in each matching m-dimensional keyword, the high-dimensional data that match the query data, and taking all high-dimensional data found to match the query data as the query result.
On the basis of the above technical solution, the present invention may be further improved as follows.
Further, the method also comprises:
storing all high-dimensional data in a computer cluster composed of multiple child nodes, and selecting one child node from the multiple child nodes as the master node of all child nodes.
Further, dividing each high-dimensional data item stored on each child node into m low-dimensional data items and storing the low-dimensional data items of each high-dimensional data item in the corresponding m subspaces specifically comprises:
on each child node, dividing each P-dimensional data item on that node by dimension into m data items of P/m dimensions, and storing each P/m-dimensional data item of each P-dimensional data item in the corresponding subspace, where P/m is an integer and the number of subspaces is m.
Further, for the low-dimensional data of the same subspace on all child nodes, obtaining K cluster centers of each subspace using a distributed clustering algorithm, taking each cluster center as a keyword in one dimension, and combining the K keywords of every subspace to obtain K^m m-dimensional keywords specifically comprises:
performing distributed clustering on the P/m-dimensional data of the i-th subspace on all child nodes using a distributed K-Means clustering algorithm to obtain K cluster centers of each subspace, denoted respectively as:
U_1 = [u_{11}, u_{12}, ..., u_{1K}]
U_2 = [u_{21}, u_{22}, ..., u_{2K}]
...
U_i = [u_{i1}, u_{i2}, ..., u_{iK}]
...
U_m = [u_{m1}, u_{m2}, ..., u_{mK}]
where i = 1, 2, ..., m indexes the subspaces, U_m holds the cluster centers of the m-th subspace, and u_{iK} is the K-th cluster center of the i-th subspace;
taking each cluster center as a keyword, and combining the K cluster centers of the m subspaces on the master node to obtain K^m m-dimensional keywords, each denoted U = [u_{1x}, u_{2y}, ..., u_{mw}], where x, y and w are integers with 0 < x, y, w ≤ K, U denotes an m-dimensional keyword, x indicates the x-th cluster center chosen from U_1 = [u_{11}, u_{12}, ..., u_{1K}], y indicates the y-th cluster center chosen from U_2 = [u_{21}, u_{22}, ..., u_{2K}], and w indicates the w-th cluster center chosen from U_m = [u_{m1}, u_{m2}, ..., u_{mK}].
Further, computing on each child node the cluster center to which the low-dimensional data in each subspace of that child node belongs, thereby obtaining the m-dimensional keyword to which each high-dimensional data item belongs and thus all high-dimensional data contained in each m-dimensional keyword, specifically comprises:
on each child node, computing the Euclidean distance between each P/m-dimensional data item in each subspace of that child node and the K cluster centers of that subspace, and taking the nearest cluster center as the cluster center to which the P/m-dimensional data item belongs, thereby obtaining the cluster center to which every P/m-dimensional data item in every subspace of that child node belongs;
for each P-dimensional data item on each child node, merging the cluster centers to which its m corresponding P/m-dimensional data items belong to obtain the m-dimensional keyword of that P-dimensional data item, and thereby obtaining all P-dimensional data contained in each m-dimensional keyword.
Further, determining, among the K^m m-dimensional keywords, a plurality of m-dimensional keywords that match the query data specifically comprises:
on the master node, dividing the P-dimensional query data into m data items of P/m dimensions, and computing the Euclidean distance between the P-dimensional query data and each of the K^m m-dimensional keywords;
taking the a m-dimensional keywords with the smallest Euclidean distance to the query data as the matching m-dimensional keywords; or taking the b m-dimensional keywords whose Euclidean distance to the query data is less than a preset distance as the matching m-dimensional keywords, where a and b are positive integers;
distributing, by the master node, the query data and all m-dimensional keywords that match the query data to each child node.
Further, determining, according to all high-dimensional data contained in each m-dimensional keyword, the high-dimensional data that match the query data among all high-dimensional data in each matching m-dimensional keyword, and taking all high-dimensional data found to match the query data as the query result, specifically comprises:
on each child node, computing the Euclidean distance between the query data and each high-dimensional data item contained in each matching m-dimensional keyword;
taking the e high-dimensional data items with the smallest Euclidean distance to the query data as the high-dimensional data matching the query data; or taking the g high-dimensional data items whose Euclidean distance to the query data is less than a preset distance as the matching high-dimensional data, where e and g are positive integers;
aggregating, on the master node, the high-dimensional data matching the query data found on all child nodes, and returning all aggregated high-dimensional data as the query result.
Further, the value of m is 2.
According to another aspect of the present invention, a distributed indexing system for large-scale high-dimensional data is provided, comprising:
a division module, configured to divide each high-dimensional data item stored on each child node into m low-dimensional data items and to store the low-dimensional data items of each high-dimensional data item in the corresponding m subspaces, where m is an integer greater than or equal to 2;
a clustering module, configured, for the low-dimensional data of the same subspace on all child nodes, to obtain K cluster centers of each subspace using a distributed clustering algorithm, to take each cluster center as a keyword in one dimension, and to combine the K keywords of every subspace to obtain K^m m-dimensional keywords;
a computing module, configured to compute on each child node the cluster center to which the low-dimensional data in each subspace of that child node belongs, thereby obtaining the m-dimensional keyword to which each high-dimensional data item belongs, and thus all high-dimensional data contained in each m-dimensional keyword;
a first determining module, configured to determine, among the K^m m-dimensional keywords, a plurality of m-dimensional keywords that match the query data;
a second determining module, configured to determine, according to all high-dimensional data contained in each m-dimensional keyword, the high-dimensional data that match the query data among all high-dimensional data in each matching m-dimensional keyword, and to take all high-dimensional data found to match the query data as the query result.
According to a further aspect of the present invention, a non-transitory computer-readable storage medium is provided, characterized in that the non-transitory computer-readable storage medium stores computer instructions that cause a computer to execute the distributed indexing method for large-scale high-dimensional data.
The present invention provides a distributed indexing method and system for large-scale high-dimensional data. All high-dimensional data are stored in a distributed manner on a cluster; each high-dimensional data item is divided into multiple low-dimensional data items; each subspace stores one low-dimensional data item of every high-dimensional data item; a distributed clustering algorithm obtains multiple cluster centers for all low-dimensional data of each subspace; the cluster centers of the subspaces are combined to obtain multiple multidimensional keywords for all high-dimensional data; and the high-dimensional data contained in each multidimensional keyword are computed. At query time, the multidimensional keywords matching the query data are found first, and the high-dimensional data contained in each such keyword are then searched. The invention combines distributed clustering, distributed query and an inverted index over multiple subspaces; it improves the efficiency of retrieval and query while maintaining accuracy, can be applied to the retrieval of large-scale distributed data, and has good scalability.
Brief description of the drawings
Fig. 1 is a flow chart of the distributed indexing method for large-scale high-dimensional data according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the two-dimensional cluster centers constructed in an embodiment of the present invention;
Fig. 3 is a schematic diagram of finding matching two-dimensional cluster centers according to the query data in an embodiment of the present invention;
Fig. 4 is a connection block diagram of the distributed indexing system for large-scale high-dimensional data according to an embodiment of the present invention.
Detailed description
Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following examples illustrate the present invention and do not limit its scope.
Referring to Fig. 1, an embodiment of the present invention provides a distributed indexing method for large-scale high-dimensional data that combines distributed processing with an inverted index, improving the efficiency of data retrieval while maintaining retrieval precision. The method comprises: dividing each high-dimensional data item stored on each child node into m low-dimensional data items, and storing the low-dimensional data items of each high-dimensional data item in the corresponding m subspaces, where m is an integer greater than or equal to 2; for the low-dimensional data of the same subspace on all child nodes, obtaining K cluster centers of each subspace using a distributed clustering algorithm, taking each cluster center as a keyword in one dimension, and combining the K keywords of every subspace to obtain K^m m-dimensional keywords; computing, on each child node, the cluster center to which the low-dimensional data in each subspace of that child node belongs, thereby obtaining the m-dimensional keyword to which each high-dimensional data item belongs, and thus all high-dimensional data contained in each m-dimensional keyword; determining, among the K^m m-dimensional keywords, a plurality of m-dimensional keywords that match the query data; and, according to all high-dimensional data contained in each m-dimensional keyword, determining the high-dimensional data that match the query data among all high-dimensional data in each matching m-dimensional keyword, and taking all high-dimensional data found to match the query data as the query result.
This embodiment is mainly suited to the retrieval and query of large-scale high-dimensional data. Because the volume of high-dimensional data is large, all high-dimensional data are first stored in a distributed manner in a computer cluster composed of multiple child nodes, and one child node is selected as the master node of all child nodes. The master node manages all child nodes and is responsible for collecting and distributing the data on them. To support retrieval and query over large-scale high-dimensional data, the high-dimensional data on each child node are reduced in dimension: each high-dimensional data item is split into multiple low-dimensional data items. Each low-dimensional data item of each high-dimensional data item is stored in the corresponding subspace; that is, each subspace stores one of the low-dimensional data items obtained from every high-dimensional data item. The high-dimensional data on all child nodes are processed in the same way.
The low-dimensional data belonging to the same subspace are stored on different child nodes. To improve the speed and efficiency of clustering, this embodiment clusters the low-dimensional data of the same subspace on all child nodes with a distributed clustering algorithm, obtaining K cluster centers for each subspace, and takes each cluster center of a subspace as a keyword in one dimension. The cluster centers of the subspaces are then combined to obtain all possible multidimensional keywords of the high-dimensional data. On each child node, the cluster center to which each low-dimensional data item in each subspace belongs is computed, and these cluster centers are combined to obtain the multidimensional keyword to which each high-dimensional data item belongs, finally yielding all high-dimensional data contained in each multidimensional keyword.
At query time, the master node first finds, among all possible multidimensional keywords, the multidimensional keywords that match the query data. Then, on each child node, the high-dimensional data matching the query data are searched for among all high-dimensional data contained in those matching keywords. The matching high-dimensional data found on all child nodes are aggregated to give the final result for the query data.
This embodiment combines distributed clustering, distributed query and an inverted index over multiple subspaces. It improves the efficiency of retrieval and query while maintaining accuracy, can be applied to the retrieval of large-scale distributed data, and has good scalability.
On the basis of the above embodiment, in one embodiment of the present invention, dividing each high-dimensional data item stored on each child node into m low-dimensional data items and storing the low-dimensional data items of each high-dimensional data item in the corresponding m subspaces specifically comprises: on each child node, dividing each P-dimensional data item on that node by dimension into m data items of P/m dimensions, and storing each P/m-dimensional data item of each P-dimensional data item in the corresponding subspace, where P/m is an integer and the number of subspaces is m.
When reducing the dimension of each high-dimensional data item, assume that all high-dimensional data are P-dimensional, with P greater than or equal to 2. On each child node, each P-dimensional data item on that node is divided into m low-dimensional data items of P/m dimensions, where P/m is an integer, i.e. P is an integer multiple of m, and m is a positive integer. After each P-dimensional data item has been divided, each low-dimensional data item is stored in the corresponding subspace; there are m subspaces. For example, a 4-dimensional data item on a child node is divided into two 2-dimensional data items, which are stored in the two corresponding subspaces. The same dimension reduction is applied to the P-dimensional data on all child nodes, and each resulting low-dimensional data item is stored in the corresponding subspace.
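The division step can be sketched in a few lines. This is a minimal illustration, assuming that "dividing by dimension" means cutting the vector into m contiguous blocks; the function name `split_vector` is hypothetical and does not come from the patent.

```python
from typing import List, Sequence


def split_vector(vec: Sequence[float], m: int) -> List[List[float]]:
    """Split a P-dimensional vector into m contiguous sub-vectors of P/m dimensions.

    Assumption: the split is contiguous; the text only requires P/m to be an integer.
    """
    p = len(vec)
    if m < 2:
        raise ValueError("m must be an integer greater than or equal to 2")
    if p % m != 0:
        raise ValueError("P must be an integer multiple of m")
    step = p // m
    return [list(vec[i * step:(i + 1) * step]) for i in range(m)]


# The 4-dimensional example from the text: two 2-dimensional sub-vectors,
# one per subspace.
parts = split_vector([0.1, 0.2, 0.3, 0.4], m=2)
```

Sub-vector i would then be stored in subspace i on the same child node, so no data has to leave the node at this stage.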
On the basis of the above embodiments, in one embodiment of the present invention, for the low-dimensional data of the same subspace on all child nodes, obtaining K cluster centers of each subspace using a distributed clustering algorithm, taking each cluster center as a keyword in one dimension, and combining the K keywords of every subspace to obtain K^m m-dimensional keywords specifically comprises: performing distributed clustering on the P/m-dimensional data of the i-th subspace on all child nodes using a distributed K-Means clustering algorithm to obtain K cluster centers of each subspace, denoted respectively as:
U_1 = [u_{11}, u_{12}, ..., u_{1K}]
U_2 = [u_{21}, u_{22}, ..., u_{2K}]
...
U_i = [u_{i1}, u_{i2}, ..., u_{iK}]
...
U_m = [u_{m1}, u_{m2}, ..., u_{mK}]
where i = 1, 2, ..., m indexes the subspaces, U_m holds the cluster centers of the m-th subspace, and u_{iK} is the K-th cluster center of the i-th subspace; taking each cluster center as a keyword, and combining the K cluster centers of the m subspaces on the master node to obtain K^m m-dimensional keywords, each denoted U = [u_{1x}, u_{2y}, ..., u_{mw}], where x, y and w are integers with 0 < x, y, w ≤ K, U denotes an m-dimensional keyword, x indicates the x-th cluster center chosen from U_1 = [u_{11}, u_{12}, ..., u_{1K}], y indicates the y-th cluster center chosen from U_2 = [u_{21}, u_{22}, ..., u_{2K}], and w indicates the w-th cluster center chosen from U_m = [u_{m1}, u_{m2}, ..., u_{mK}].
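Combining the per-subspace cluster centers amounts to taking a Cartesian product of center indices. The sketch below is illustrative only: it represents a keyword as a tuple of indices, uses 0-based indices where the text uses 1-based x, y, ..., w, and `enumerate_keywords` is a hypothetical name.

```python
from itertools import product
from typing import List, Tuple


def enumerate_keywords(K: int, m: int) -> List[Tuple[int, ...]]:
    """Return all K**m keywords as tuples of cluster-center indices.

    Position i of a tuple holds the index of the center chosen in subspace i.
    """
    return list(product(range(K), repeat=m))


# With K = 3 centers per subspace and m = 2 subspaces there are 3**2 keywords.
keywords = enumerate_keywords(K=3, m=2)
```

Because the number of keywords grows as K^m, a real system would probably materialize only the keywords that actually contain data; that choice is not specified in the text.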
In the above embodiment, the P-dimensional data on each child node were reduced in dimension to form m low-dimensional data items of P/m dimensions, each stored in the corresponding subspace. In this embodiment, because the low-dimensional data of the same subspace are stored on different child nodes, a distributed K-Means clustering algorithm is used to improve the speed and efficiency of clustering. It yields K cluster centers for the low-dimensional data of the same subspace across all child nodes; that is, each of the m subspaces has K cluster centers. The distributed K-Means clustering algorithm is an existing distributed clustering algorithm and is not described further here. The K cluster centers of all subspaces are then combined on the master node to obtain all possible multidimensional keywords, K^m in total.
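The text treats distributed K-Means as an existing algorithm and gives no details. One common formulation, shown here purely as an assumption, has each child node compute per-center coordinate sums and counts for its local points, with the master node aggregating them into new centers. The sketch simulates the nodes in a single process; `local_stats` and `distributed_kmeans` are hypothetical names, and initial centers are passed in explicitly.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def local_stats(points: List[Vector], centers: List[Vector]) -> Tuple[List[Vector], List[int]]:
    """Work done on one child node: per-center coordinate sums and point counts."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p in points:
        c = min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += p[d]
    return sums, counts


def distributed_kmeans(node_points: List[List[Vector]], init_centers: List[Vector],
                       iters: int = 10) -> List[Vector]:
    """Master-side loop: gather local statistics and recompute the centers."""
    centers = [list(c) for c in init_centers]
    for _ in range(iters):
        stats = [local_stats(pts, centers) for pts in node_points]  # one call per node
        for c in range(len(centers)):
            total = sum(counts[c] for _, counts in stats)
            if total == 0:
                continue  # leave an empty cluster's center unchanged
            centers[c] = [sum(sums[c][d] for sums, _ in stats) / total
                          for d in range(len(centers[c]))]
    return centers


# Two simulated child nodes holding 2-dimensional sub-vectors of one subspace.
nodes = [[[0.0, 0.0], [0.0, 1.0]], [[10.0, 10.0], [10.0, 11.0]]]
centers = distributed_kmeans(nodes, init_centers=[[0.0, 0.0], [10.0, 10.0]])
```

Only the sums and counts travel to the master node in this formulation, which is consistent with the stated goal of avoiding migration of the raw data. The procedure would be run once per subspace.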
On the basis of the above embodiments, in another embodiment of the present invention, computing on each child node the cluster center to which the low-dimensional data in each subspace of that child node belongs, thereby obtaining the m-dimensional keyword to which each high-dimensional data item belongs and thus all high-dimensional data contained in each m-dimensional keyword, specifically comprises: on each child node, computing the Euclidean distance between each P/m-dimensional data item in each subspace of that child node and the K cluster centers of that subspace, and taking the nearest cluster center as the cluster center to which the P/m-dimensional data item belongs, thereby obtaining the cluster center to which every P/m-dimensional data item in every subspace of that child node belongs; and, for each P-dimensional data item on each child node, merging the cluster centers to which its m corresponding P/m-dimensional data items belong to obtain the m-dimensional keyword of that P-dimensional data item, thereby obtaining all P-dimensional data contained in each m-dimensional keyword.
In the above embodiment, the K cluster centers of the low-dimensional data of the same subspace across all child nodes were computed by the distributed K-Means clustering algorithm. In this embodiment, the cluster center to which each low-dimensional data item in each subspace belongs is computed on each child node. Specifically, the Euclidean distance between a low-dimensional data item in a subspace and each of the K cluster centers of that subspace is computed, and the cluster center nearest to the low-dimensional data item is taken as the cluster center to which it belongs; this is done for every low-dimensional data item in the subspace. Because a high-dimensional data item is distributed over multiple subspaces, the cluster centers to which its low-dimensional data items belong in those subspaces are combined to obtain the m-dimensional keyword to which the high-dimensional data item belongs. The data on every child node are processed in the same way, giving the m-dimensional keyword of every high-dimensional data item on every child node, from which all high-dimensional data contained in each m-dimensional keyword can be collected.
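The assignment step on one child node can be sketched as building a local inverted index from keyword to data identifiers. The names `assign_keyword` and `build_inverted_index`, the string identifiers, and the contiguous split are assumptions made for illustration; keywords are again tuples of 0-based center indices.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Keyword = Tuple[int, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance; it has the same nearest center as the Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_keyword(vec: Sequence[float], codebooks: List[List[List[float]]]) -> Keyword:
    """Map a P-dimensional vector to its keyword: the nearest center in each subspace."""
    m = len(codebooks)
    step = len(vec) // m
    key = []
    for i, centers in enumerate(codebooks):
        sub = vec[i * step:(i + 1) * step]
        key.append(min(range(len(centers)), key=lambda c: sq_dist(sub, centers[c])))
    return tuple(key)


def build_inverted_index(vectors: Dict[str, List[float]],
                         codebooks: List[List[List[float]]]) -> Dict[Keyword, List[str]]:
    """Inverted index on one child node: keyword -> ids of the local vectors it contains."""
    index: Dict[Keyword, List[str]] = defaultdict(list)
    for vid, vec in vectors.items():
        index[assign_keyword(vec, codebooks)].append(vid)
    return dict(index)


# m = 2 subspaces with K = 2 centers each; three local 4-dimensional vectors.
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 1.0]]]
vectors = {"a": [0.1, 0.0, 0.9, 1.0], "b": [1.0, 0.9, 0.0, 0.1], "c": [0.0, 0.1, 1.0, 1.0]}
index = build_inverted_index(vectors, codebooks)
```

Each child node indexes only its own data, so the full list of data for a keyword is the union of the per-node lists.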
On the basis of the above embodiments, in one embodiment of the present invention, determining, among the K^m m-dimensional keywords, a plurality of m-dimensional keywords that match the query data specifically comprises: on the master node, dividing the P-dimensional query data into m data items of P/m dimensions, and computing the Euclidean distance between the P-dimensional query data and each of the K^m m-dimensional keywords; taking the a m-dimensional keywords with the smallest Euclidean distance to the query data as the matching m-dimensional keywords, or taking the b m-dimensional keywords whose Euclidean distance to the query data is less than a preset distance as the matching m-dimensional keywords; and distributing, by the master node, the query data and all m-dimensional keywords that match the query data to each child node.
During retrieval and query, the master node receives the query request, divides the P-dimensional query data into m low-dimensional data items of P/m dimensions, and computes the Euclidean distance between the P-dimensional query data and each of the K^m possible m-dimensional keywords. To compute the Euclidean distance, the P-dimensional query data can be regarded as a vector of m low-dimensional data items and each m-dimensional keyword as a vector with m components; the distance between the two vectors is the Euclidean distance between the query data and that m-dimensional keyword.
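One way to read this computation, offered as an interpretation rather than something the text states, is that an m-dimensional keyword stands for the concatenation of its m cluster centers. Under that reading the squared Euclidean distance splits into one term per subspace. Writing q = (q_1, ..., q_m) for the divided query and x_i for the center index chosen in subspace i (symbols introduced here for illustration):

```latex
d(q, U)^2 = \sum_{i=1}^{m} \lVert q_i - u_{i x_i} \rVert^2,
\qquad U = [u_{1 x_1}, u_{2 x_2}, \ldots, u_{m x_m}]
```

Only the m·K sub-distances between each q_i and the K centers of subspace i need to be computed; the distance to any of the K^m keywords is then a sum of m of those values.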
Based on the computed Euclidean distances between the query data and each m-dimensional keyword, the keywords are sorted by distance in ascending order, and the first a m-dimensional keywords are taken as the m-dimensional keywords matching the query data. Alternatively, a Euclidean distance threshold is set, and the b m-dimensional keywords whose Euclidean distance to the query data is less than the threshold are taken as the matching m-dimensional keywords. At this point the master node has found all m-dimensional keywords close to the query data. The master node then distributes the query data and the m-dimensional keywords found to be close to it to each child node.
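The keyword matching on the master node can be sketched as follows, assuming a keyword is compared with the query as the concatenation of its cluster centers. The function `match_keywords` and its parameters `a` and `max_dist` are illustrative names; the text presents top-a and threshold selection as alternatives, and this sketch simply applies whichever is supplied.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple

Keyword = Tuple[int, ...]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_keywords(query: Sequence[float], codebooks: List[List[List[float]]],
                   a: Optional[int] = None, max_dist: Optional[float] = None) -> List[Keyword]:
    """Return keywords ordered by Euclidean distance to the query.

    a: keep only the a nearest keywords; max_dist: keep keywords closer than this.
    """
    m = len(codebooks)
    step = len(query) // m
    # table[i][c]: squared distance from the i-th query block to center c of subspace i
    table = [[sq_dist(query[i * step:(i + 1) * step], center) for center in centers]
             for i, centers in enumerate(codebooks)]
    scored = []
    for key in product(*(range(len(centers)) for centers in codebooks)):
        dist = sum(table[i][c] for i, c in enumerate(key)) ** 0.5
        scored.append((dist, key))
    scored.sort()
    if max_dist is not None:
        scored = [(d, k) for d, k in scored if d < max_dist]
    if a is not None:
        scored = scored[:a]
    return [k for _, k in scored]


codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 1.0]]]
query = [0.1, 0.0, 0.6, 0.6]
top2 = match_keywords(query, codebooks, a=2)           # the two nearest keywords
near = match_keywords(query, codebooks, max_dist=0.6)  # keywords within the threshold
```

The master node would then send `query` together with the selected keywords to every child node.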
On the basis of the above embodiments, in another embodiment of the present invention, determining, according to all high-dimensional data contained in each m-dimensional keyword, the high-dimensional data that match the query data among all high-dimensional data in each matching m-dimensional keyword, and taking all high-dimensional data found to match the query data as the query result, specifically comprises: on each child node, computing the Euclidean distance between the query data and each high-dimensional data item contained in each matching m-dimensional keyword; taking the e high-dimensional data items with the smallest Euclidean distance to the query data as the high-dimensional data matching the query data, or taking the g high-dimensional data items whose Euclidean distance to the query data is less than a preset distance as the matching high-dimensional data; and aggregating, on the master node, the high-dimensional data matching the query data found on all child nodes, and returning all aggregated high-dimensional data as the query result.
Each child node receives the query data and the m-dimensional keywords close to the query data distributed by the master node. Because all high-dimensional data contained in each m-dimensional keyword have already been computed on each child node, each child node can look up the high-dimensional data contained in each m-dimensional keyword close to the query data. For all high-dimensional data contained in those keywords, the Euclidean distance between each high-dimensional data item and the query data is computed and the items are sorted by distance in ascending order. The e high-dimensional data items nearest to the query data are taken as the high-dimensional data matching the query data; alternatively, the g high-dimensional data items whose Euclidean distance to the query data is less than a preset distance are taken as the matching high-dimensional data.
Once the high-dimensional data matching the query data have been found on each child node, all of them are aggregated on the master node and returned as the query result, which completes the retrieval and query process.
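The search on each child node and the aggregation on the master node can be sketched as below, with the nodes simulated in one process. `search_node`, `merge_results` and the optional `limit` argument are hypothetical; the text says only that the master node aggregates and returns everything the child nodes found, so sorting or trimming the merged list is an added assumption.

```python
import heapq
from typing import Dict, List, Optional, Sequence, Tuple

Keyword = Tuple[int, ...]
Hit = Tuple[float, str]  # (Euclidean distance to the query, data id)


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def search_node(query: Sequence[float], matched: List[Keyword],
                index: Dict[Keyword, List[str]], vectors: Dict[str, List[float]],
                e: int) -> List[Hit]:
    """On one child node: rank the vectors listed under the matched keywords, keep the e nearest."""
    candidates = [vid for key in matched for vid in index.get(key, [])]
    return heapq.nsmallest(e, ((euclidean(query, vectors[vid]), vid) for vid in candidates))


def merge_results(per_node: List[List[Hit]], limit: Optional[int] = None) -> List[Hit]:
    """On the master node: combine the hits from all child nodes, nearest first."""
    merged = sorted(hit for hits in per_node for hit in hits)
    return merged if limit is None else merged[:limit]


query = [0.0, 0.0, 0.0, 0.0]
matched = [(0, 1)]
node1 = search_node(query, matched, {(0, 1): ["a", "b"]},
                    {"a": [0.0, 0.0, 0.0, 0.0], "b": [3.0, 4.0, 0.0, 0.0]}, e=2)
node2 = search_node(query, matched, {(0, 1): ["c"], (1, 1): ["d"]},
                    {"c": [1.0, 0.0, 0.0, 0.0], "d": [9.0, 9.0, 9.0, 9.0]}, e=2)
result = merge_results([node1, node2])
```

Vector "d" is never compared with the query because its keyword was not matched, which is where the inverted index saves work.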
It should be noted that extensive experiments have shown m = 2 to be a preferable choice. The distributed indexing method provided by the invention is therefore illustrated below with m = 2 and P = 1024.
On each child node, each 1024-dimensional data item is divided into two 512-dimensional low-dimensional data items, which are stored in the two corresponding subspaces. Each child node therefore has two subspaces: one stores one 512-dimensional part of every 1024-dimensional data item, and the other stores the other 512-dimensional part. The data on every child node are distributed over two subspaces; in other words, the two subspaces are the same on every child node, and only the data stored in them differ.
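The division step can be illustrated with a short sketch. It assumes the split is into contiguous, equal-width blocks of dimensions; the function name `split_into_subspaces` is hypothetical.

```python
from typing import List, Sequence


def split_into_subspaces(vector: Sequence[float], m: int) -> List[List[float]]:
    """Split one P-dimensional vector into m contiguous P/m-dimensional parts."""
    p = len(vector)
    if m < 2 or p % m != 0:
        raise ValueError("m must be >= 2 and must divide the dimension P")
    width = p // m
    return [list(vector[i * width:(i + 1) * width]) for i in range(m)]


# Example with P = 1024 and m = 2, as in the text.
vector = [float(i) for i in range(1024)]
first_half, second_half = split_into_subspaces(vector, 2)
```

The i-th part of every vector on a node would then be stored in that node's i-th subspace.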
Referring to Fig. 2, for the 512-dimensional data in the two subspaces on all child nodes, a distributed K-Means clustering algorithm is used to find the K cluster centres of all 512-dimensional data in each subspace. The K cluster centres obtained in the two subspaces are denoted $U = [u_1, u_2, \dots, u_K]$ and $V = [v_1, v_2, \dots, v_K]$ respectively, where $u_1, u_2, \dots, u_K$ and $v_1, v_2, \dots, v_K$ are the K cluster centres of the respective subspaces. Combining $U$ and $V$ yields all $K^2$ possible keywords.
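The text does not spell out how the distributed K-Means step is carried out. One common pattern, shown here purely as an assumption, is for each child node to compute per-centre partial sums and counts over its local data, and for the master node to merge them into updated centres. The names `local_partial_sums` and `update_centres_on_master` are hypothetical, and the sketch covers a single iteration for one subspace.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Partial = Tuple[List[List[float]], List[int]]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def local_partial_sums(points: Sequence[Vector], centres: Sequence[Vector]) -> Partial:
    """On one child node: per-centre coordinate sums and point counts."""
    dim = len(centres[0])
    sums = [[0.0] * dim for _ in centres]
    counts = [0] * len(centres)
    for point in points:
        nearest = min(range(len(centres)),
                      key=lambda c: squared_distance(point, centres[c]))
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += point[d]
    return sums, counts


def update_centres_on_master(partials: Sequence[Partial],
                             centres: Sequence[Vector]) -> List[List[float]]:
    """On the master node: merge partial sums from all nodes into new centres."""
    dim = len(centres[0])
    new_centres = []
    for c in range(len(centres)):
        total = sum(counts[c] for _, counts in partials)
        if total == 0:
            new_centres.append(list(centres[c]))  # empty cluster: keep old centre
        else:
            new_centres.append(
                [sum(sums[c][d] for sums, _ in partials) / total for d in range(dim)]
            )
    return new_centres


# Toy run: two nodes, two centres, 2-dimensional points.
centres = [[1.0, 1.0], [9.0, 9.0]]
node_a = [[0.0, 0.0], [0.0, 2.0]]
node_b = [[10.0, 10.0], [10.0, 12.0]]
partials = [local_partial_sums(node_a, centres), local_partial_sums(node_b, centres)]
new_centres = update_centres_on_master(partials, centres)
```

In practice the two functions would be repeated until the centres stop moving, once per subspace.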
With the K cluster centres of each subspace found by clustering, each child node calculates the cluster centre to which each 512-dimensional data item in each of its subspaces belongs. The cluster centres to which the two 512-dimensional parts belong are then merged to obtain the keyword to which each 1024-dimensional data item belongs, denoted $[u_i, v_j]$: one 512-dimensional part of the item belongs to the i-th cluster centre and the other 512-dimensional part belongs to the j-th cluster centre. The keywords of all 1024-dimensional data items are obtained in the same way, after which all the 1024-dimensional data items contained in each keyword can be collected.
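The keyword assignment described above amounts to building an inverted index from keyword to data items. The following sketch assumes contiguous equal-width subspaces and represents a keyword as a tuple of cluster-centre indices; `build_inverted_index` and the toy centres are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Keyword = Tuple[int, ...]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centre(part: Vector, centres: Sequence[Vector]) -> int:
    """Index of the cluster centre closest to this low-dimensional part."""
    return min(range(len(centres)), key=lambda c: squared_distance(part, centres[c]))


def build_inverted_index(
    vectors: Sequence[Vector],
    centres_per_subspace: Sequence[Sequence[Vector]],
) -> Dict[Keyword, List[int]]:
    """Map each keyword (one centre index per subspace) to the vectors under it."""
    m = len(centres_per_subspace)
    index: Dict[Keyword, List[int]] = defaultdict(list)
    for vec_id, vec in enumerate(vectors):
        width = len(vec) // m
        keyword = tuple(
            nearest_centre(vec[s * width:(s + 1) * width], centres_per_subspace[s])
            for s in range(m)
        )
        index[keyword].append(vec_id)
    return dict(index)


# Toy data: 4-dimensional vectors, m = 2, K = 2 centres per subspace.
U = [[0.0, 0.0], [10.0, 10.0]]
V = [[0.0, 0.0], [5.0, 5.0]]
vectors = [[0.0, 1.0, 5.0, 4.0], [9.0, 10.0, 0.0, 0.0], [1.0, 0.0, 4.0, 5.0]]
index = build_inverted_index(vectors, [U, V])
```

Only keywords that actually contain data appear in the resulting dictionary, so it can be much smaller than the full set of $K^2$ combinations.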
During retrieval, the master node divides the 1024-dimensional query data q into two 512-dimensional parts, calculates the Euclidean distance between the query data and each of the $K^2$ keywords, and finds the multiple keywords similar to the query data; see Fig. 3. The master node then distributes the query data, together with the similar keywords found (each a combination of 512-dimensional cluster centres), to every child node. Each child node looks up all the 1024-dimensional data items contained in each similar keyword, calculates the Euclidean distance between the query data and each of these items, and finds the multiple 1024-dimensional data items similar to the query data within each keyword, thereby obtaining all similar 1024-dimensional data items on that child node. The similar 1024-dimensional data items found on all child nodes are then aggregated and returned as the query result.
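The keyword-matching step on the master node can be sketched as below. The sketch assumes that a keyword is treated as the concatenation of its cluster centres, in which case the squared Euclidean distance from the query to a keyword is the sum of the squared distances between each query part and the corresponding centre. The function name `rank_keywords` and the parameter `a` (number of keywords to keep) are illustrative.

```python
import heapq
from itertools import product
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Keyword = Tuple[int, ...]


def squared_distance(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def rank_keywords(
    query: Vector,
    centres_per_subspace: Sequence[Sequence[Vector]],
    a: int,
) -> List[Keyword]:
    """Return the a keywords nearest to the query, nearest first."""
    m = len(centres_per_subspace)
    width = len(query) // m
    # Distance from each query part to every centre of its subspace.
    part_dists = [
        [squared_distance(query[s * width:(s + 1) * width], centre)
         for centre in centres_per_subspace[s]]
        for s in range(m)
    ]
    # Score every combination of centre indices by summing the per-part distances.
    scored = (
        (sum(part_dists[s][idx] for s, idx in enumerate(keyword)), keyword)
        for keyword in product(*(range(len(c)) for c in centres_per_subspace))
    )
    return [keyword for _, keyword in heapq.nsmallest(a, scored)]


# Toy data: 4-dimensional query, m = 2, K = 2 centres per subspace.
U = [[0.0, 0.0], [10.0, 10.0]]
V = [[0.0, 0.0], [5.0, 5.0]]
top = rank_keywords([0.0, 0.0, 5.0, 5.0], [U, V], a=2)
```

Computing the per-part distances once and reusing them avoids a full-length distance calculation for each of the $K^2$ keywords.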
Referring to Fig. 4, a distributed indexing system for large-scale high-dimensional data according to an embodiment of the invention is provided, comprising a division module 21, a cluster module 22, a computing module 23, a first determining module 24 and a second determining module 25.
The division module 21 is configured to divide each high-dimensional data item stored on each child node into m low-dimensional data items and to store the low-dimensional data items of each high-dimensional data item in the corresponding m subspaces, where m is an integer greater than or equal to 2.
The cluster module 22 is configured to obtain, for the low-dimensional data of the same subspace on all child nodes, K cluster centres of each subspace using a distributed clustering algorithm, to take each cluster centre as a one-dimensional keyword, and to combine the K keywords of each subspace to obtain $K^m$ m-dimensional keywords.
The computing module 23 is configured to calculate on each child node the cluster centre to which the low-dimensional data in each subspace of that child node belong, thereby obtaining the m-dimensional keyword to which each high-dimensional data item belongs, so as to obtain all the high-dimensional data contained in each m-dimensional keyword.
The first determining module 24 is configured to determine, among the $K^m$ m-dimensional keywords, multiple m-dimensional keywords matching the query data.
The second determining module 25 is configured to determine, according to all the high-dimensional data contained in each m-dimensional keyword, the high-dimensional data matching the query data among all the high-dimensional data in each similar m-dimensional keyword, and to find all the high-dimensional data matching the query data as the query result.
The invention also provides a non-transitory computer-readable storage medium storing computer instructions that cause a computer to execute the distributed indexing method for large-scale high-dimensional data provided by the corresponding embodiments above, for example including: dividing each high-dimensional data item stored on each child node into m low-dimensional data items, and storing the low-dimensional data items of each high-dimensional data item in the corresponding m subspaces, where m is an integer greater than or equal to 2; for the low-dimensional data of the same subspace on all child nodes, obtaining K cluster centres of each subspace using a distributed clustering algorithm, taking each cluster centre as a one-dimensional keyword, and combining the K keywords of each subspace to obtain $K^m$ m-dimensional keywords; calculating on each child node the cluster centre to which the low-dimensional data in each subspace of that child node belong, thereby obtaining the m-dimensional keyword to which each high-dimensional data item belongs, so as to obtain all the high-dimensional data contained in each m-dimensional keyword; determining, among the $K^m$ m-dimensional keywords, multiple m-dimensional keywords matching the query data; and, according to all the high-dimensional data contained in each m-dimensional keyword, determining the high-dimensional data matching the query data among all the high-dimensional data in each similar m-dimensional keyword, and finding all the high-dimensional data matching the query data as the query result.
A person of ordinary skill in the art will appreciate that all or part of the steps of the above method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as ROM, RAM, a magnetic disk or an optical disc.
The embodiments described above, such as the device for the distributed indexing method for large-scale high-dimensional data, are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.
From the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus the necessary general-purpose hardware platform, or of course by hardware. On this understanding, the above technical solution, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods of the embodiments or of certain parts of the embodiments.
In the distributed indexing method and system for large-scale high-dimensional data provided by the invention, all high-dimensional data are stored in a distributed manner on a cluster; each high-dimensional data item is divided into multiple low-dimensional data items, and each subspace stores one low-dimensional data item of every high-dimensional data item; a distributed clustering algorithm obtains multiple cluster centres for all low-dimensional data in each subspace; each cluster centre is taken as a one-dimensional keyword, and the one-dimensional keywords of the subspaces are combined to obtain multiple multi-dimensional keywords for all high-dimensional data; and the high-dimensional data contained in each multi-dimensional keyword are calculated. At query time, the multi-dimensional keywords matching the query data are found first, and then the high-dimensional data contained in each of those keywords are searched. The invention combines distributed clustering, distributed querying and a multi-dimensional index over multiple subspaces, improving retrieval efficiency while maintaining retrieval accuracy. The method can be applied to the retrieval of large-scale distributed data and has good scalability.
Finally, the methods of the present application are merely preferred embodiments and are not intended to limit the scope of protection of the invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.
Claims (10)
1. A distributed indexing method for large-scale high-dimensional data, characterized by comprising:
dividing each high-dimensional data item stored on each child node into m low-dimensional data items, and storing the low-dimensional data items of each high-dimensional data item in the corresponding m subspaces, where m is an integer greater than or equal to 2;
for the low-dimensional data of the same subspace on all child nodes, obtaining K cluster centres of each subspace using a distributed clustering algorithm, taking each cluster centre as a one-dimensional keyword, and combining the K keywords of each subspace to obtain $K^m$ m-dimensional keywords, where K is a positive integer;
calculating on each child node the cluster centre to which the low-dimensional data in each subspace of that child node belong, thereby obtaining the m-dimensional keyword to which each high-dimensional data item belongs, so as to obtain all the high-dimensional data contained in each m-dimensional keyword;
determining, among the $K^m$ m-dimensional keywords, multiple m-dimensional keywords matching the query data;
according to all the high-dimensional data contained in each m-dimensional keyword, determining the high-dimensional data matching the query data among all the high-dimensional data in each similar m-dimensional keyword, and finding all the high-dimensional data matching the query data as the query result.
2. The distributed indexing method for large-scale high-dimensional data according to claim 1, characterized by further comprising:
storing all high-dimensional data in a computer cluster composed of multiple child nodes, and selecting one child node from the multiple child nodes as the master node of all child nodes.
3. The distributed indexing method for large-scale high-dimensional data according to claim 2, characterized in that dividing each high-dimensional data item stored on each child node into m low-dimensional data items and storing the low-dimensional data items of each high-dimensional data item in the corresponding m subspaces specifically comprises:
on each child node, dividing each P-dimensional data item on that node by dimension into m P/m-dimensional data items, and storing each P/m-dimensional data item of each P-dimensional data item in the corresponding subspace, where P/m is an integer and the number of subspaces is m.
4. The distributed indexing method for large-scale high-dimensional data according to claim 3, characterized in that, for the low-dimensional data of the same subspace on all child nodes, obtaining K cluster centres of each subspace using a distributed clustering algorithm, taking each cluster centre as a one-dimensional keyword, and combining the K keywords of each subspace to obtain $K^m$ m-dimensional keywords specifically comprises:
performing distributed clustering on the P/m-dimensional data of the i-th subspace on all child nodes using a distributed K-Means clustering algorithm to obtain the K cluster centres of each subspace, denoted respectively:
$U_i = [u_{i1}, u_{i2}, \dots, u_{ik}]$
where i = 1, 2, ..., m, m denotes the m-th subspace and k denotes the k-th cluster centre;
taking each cluster centre as a keyword, and combining on the master node the K cluster centres of the m subspaces to obtain $K^m$ m-dimensional keywords, denoted $U = [u_{1x}, u_{2y}, \dots, u_{mw}]$, where $0 < x, y, w \le k$ and x, y, w are integers; U denotes an m-dimensional keyword, x denotes the x-th cluster centre chosen from $U_1 = [u_{11}, u_{12}, \dots, u_{1k}]$, y denotes the y-th cluster centre chosen from $U_2 = [u_{21}, u_{22}, \dots, u_{2k}]$, and w denotes the w-th cluster centre chosen from $U_m = [u_{m1}, u_{m2}, \dots, u_{mk}]$.
5. The distributed indexing method for large-scale high-dimensional data according to claim 4, characterized in that calculating on each child node the cluster centre to which the low-dimensional data in each subspace of that child node belong, thereby obtaining the m-dimensional keyword to which each high-dimensional data item belongs, so as to obtain all the high-dimensional data contained in each m-dimensional keyword, specifically comprises:
calculating on each child node the Euclidean distance between each P/m-dimensional data item in each subspace of that child node and the K cluster centres of that subspace, and taking the nearest cluster centre as the cluster centre to which the P/m-dimensional data item belongs, thereby obtaining the cluster centre to which each P/m-dimensional data item in each subspace of that child node belongs;
for each P-dimensional data item on each child node, merging the cluster centres to which its m P/m-dimensional data items belong to obtain the m-dimensional keyword of the P-dimensional data item, thereby obtaining all the P-dimensional data items contained in each m-dimensional keyword.
6. The distributed indexing method for large-scale high-dimensional data according to claim 5, characterized in that determining, among the $K^m$ m-dimensional keywords, multiple m-dimensional keywords matching the query data specifically comprises:
on the master node, dividing the P-dimensional query data into m P/m-dimensional data items, and calculating the Euclidean distance between the P-dimensional query data and each of the $K^m$ m-dimensional keywords;
taking the a m-dimensional keywords nearest to the query data in Euclidean distance as the matching m-dimensional keywords; or alternatively, taking the b m-dimensional keywords whose Euclidean distance to the query data is less than a preset distance as the matching m-dimensional keywords, where a and b are positive integers;
distributing, by the master node, the query data and all the m-dimensional keywords matching the query data to each child node.
7. The distributed indexing method for large-scale high-dimensional data according to claim 6, characterized in that, according to all the high-dimensional data contained in each m-dimensional keyword, determining the high-dimensional data matching the query data among all the high-dimensional data in each similar m-dimensional keyword, and finding all the high-dimensional data matching the query data as the query result, specifically comprises:
on each child node, calculating the Euclidean distance between the query data and each high-dimensional data item contained in each matching m-dimensional keyword;
taking the e high-dimensional data items nearest to the query data in Euclidean distance as the high-dimensional data matching the query data; or alternatively, taking the g high-dimensional data items whose Euclidean distance to the query data is less than a preset distance as the matching high-dimensional data, where e and g are positive integers;
collecting on the master node the high-dimensional data matching the query data found on all child nodes, and returning all the collected high-dimensional data as the query result.
8. The distributed indexing method for large-scale high-dimensional data according to any one of claims 1-7, characterized in that the value of m is 2.
9. A distributed indexing system for large-scale high-dimensional data, characterized by comprising:
a division module, configured to divide each high-dimensional data item stored on each child node into m low-dimensional data items and to store the low-dimensional data items of each high-dimensional data item in the corresponding m subspaces, where m is an integer greater than or equal to 2;
a cluster module, configured to obtain, for the low-dimensional data of the same subspace on all child nodes, K cluster centres of each subspace using a distributed clustering algorithm, to take each cluster centre as a one-dimensional keyword, and to combine the K keywords of each subspace to obtain $K^m$ m-dimensional keywords;
a computing module, configured to calculate on each child node the cluster centre to which the low-dimensional data in each subspace of that child node belong, thereby obtaining the m-dimensional keyword to which each high-dimensional data item belongs, so as to obtain all the high-dimensional data contained in each m-dimensional keyword;
a first determining module, configured to determine, among the $K^m$ m-dimensional keywords, multiple m-dimensional keywords matching the query data;
a second determining module, configured to determine, according to all the high-dimensional data contained in each m-dimensional keyword, the high-dimensional data matching the query data among all the high-dimensional data in each similar m-dimensional keyword, and to find all the high-dimensional data matching the query data as the query result.
10. A non-transitory computer-readable storage medium, characterized in that the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions cause a computer to execute the method according to any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711349831.XA CN108090182B (en) | 2017-12-15 | 2017-12-15 | A kind of distributed index method and system of extensive high dimensional data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711349831.XA CN108090182B (en) | 2017-12-15 | 2017-12-15 | A kind of distributed index method and system of extensive high dimensional data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108090182A (en) | 2018-05-29 |
CN108090182B (en) | 2018-10-30 |
Family
ID=62176631
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711349831.XA Active CN108090182B (en) | 2017-12-15 | 2017-12-15 | A kind of distributed index method and system of extensive high dimensional data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108090182B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115357609B (en) * | 2022-10-24 | 2023-01-13 | 深圳比特微电子科技有限公司 | Method, device, equipment and medium for processing data of Internet of things |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102831225A (en) * | 2012-08-27 | 2012-12-19 | 南京邮电大学 | Multi-dimensional index structure under cloud environment, construction method thereof and similarity query method |
CN106909942A (en) * | 2017-02-28 | 2017-06-30 | 北京邮电大学 | A kind of Subspace clustering method and device towards high-dimensional big data |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6122628A (en) * | 1997-10-31 | 2000-09-19 | International Business Machines Corporation | Multidimensional data clustering and dimension reduction for indexing and searching |
KR101003842B1 (en) * | 2008-10-24 | 2010-12-23 | 연세대학교 산학협력단 | Method and system of clustering for multi-dimensional data streams |
US8560472B2 (en) * | 2010-09-30 | 2013-10-15 | The Aerospace Corporation | Systems and methods for supporting restricted search in high-dimensional spaces |
CN103678520B (en) * | 2013-11-29 | 2017-03-29 | 中国科学院计算技术研究所 | A kind of multi-dimensional interval query method and its system based on cloud computing |
CN107368599B (en) * | 2017-07-26 | 2020-06-23 | 中南大学 | Visual analysis method and system for high-dimensional data |
Non-Patent Citations (4)
Title |
---|
An intelligent Weighted Kernel K-Means algorithm for high dimension data; Abdolreza Rasouli Kenari et al.; 《2009 Second International Conference on the Applications of Digital Information and Web Technologies》; 20091002; pp. 829-831 * |
Clustering algorithm on high-dimension data partitional mended attribute; Tangsen Zhan et al.; 《2012 9th International Conference on Fuzzy Systems and Knowledge Discovery》; 20120709; pp. 676-678 * |
A subspace clustering algorithm for high-dimensional data sets; 乐耀佳 et al.; 《南京师范大学学报(工程技术版)》 (Journal of Nanjing Normal University, Engineering and Technology Edition); 20090930; pp. 55-63 * |
Subspace clustering algorithm based on k most similar clusters; 单世民 et al.; 《计算机工程》 (Computer Engineering); 20090731; pp. 4-6 * |
Also Published As
Publication number | Publication date |
---|---|
CN108090182A (en) | 2018-05-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||