Summary of the invention
The technical objective of the present invention is to address the above weaknesses by providing a practical KNN query method based on a combined-granularity distributed in-memory grid index.
A KNN query method based on a combined-granularity distributed in-memory grid index; its specific implementation process is:
Step one, data preprocessing: partition the overall data space based on grid and density, and obtain a summary estimate of the overall data distribution;
Step two, data query:
Establish the combined-granularity distributed in-memory grid index structure, i.e., a non-equal-width coarse-granularity grid index and an equal-width fine-granularity grid index;
Design a distributed KNN query algorithm on top of the above index structure to realize fast KNN queries over massive data: based on the non-equal-width coarse-granularity grid index, search the set of coarse grids adjacent to the coarse grid containing the query object, and determine the Slave node holding each coarse grid of that adjacent set; on those Slave nodes, based on the equal-width fine-granularity grid index, search the set of fine grids adjacent to the fine grid containing the query object, compare the distances between the query object and every object contained in each fine grid of that adjacent set, and thereby obtain the k nearest-neighbour objects of the query.
The detailed process of partitioning the overall data space is:
Each dimension of the overall data space is divided into equal-width fine partitions with a fixed step δ, forming a fine-granularity grid space;
Each data object p is mapped to its corresponding grid;
Each grid is represented by a feature vector g(gid, num) that records the grid number and the number of data objects it contains, where gid is the grid number (unique) and num is the number of data objects contained in the grid.
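As a minimal illustration of this preprocessing step (a sketch only; the patent does not prescribe an implementation, and the function names are illustrative), the following Python snippet maps each object to its equal-width fine cell of step δ and collects the feature vectors g(gid, num):

```python
from collections import Counter

def fine_grid_id(point, delta):
    """Equal-width grid number of a point: gid_i = floor(d_i / delta) in every dimension."""
    return tuple(int(d // delta) for d in point)

def distribution_summary(points, delta):
    """Feature vectors g(gid, num): number of data objects per non-empty fine grid."""
    return dict(Counter(fine_grid_id(p, delta) for p in points))

# With the first objects of the embodiment's data set and delta = 5:
print(distribution_summary([(12, 68), (31, 73), (58, 63)], 5))
# -> {(2, 13): 1, (6, 14): 1, (11, 12): 1}
```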
The process of establishing the combined-granularity distributed in-memory grid index structure is:
According to the summary estimate of the data distribution obtained in step one, the overall data space is partitioned into non-equal-width coarse grids, and the non-equal-width coarse-granularity grid index of the overall data space is established. The Master node of the main-memory cluster maintains the coarse-granularity distributed in-memory grid index structure CGGI of the overall data space and is responsible for distributing data to the Slave nodes of the cluster;
The sub-data space represented by each coarse grid of the above partition is further divided into equal-width fine grids, and a fine-granularity grid index is established for each sub-data space. Each Slave node of the main-memory cluster maintains the fine-granularity distributed in-memory grid index structure FGGI of one or several sub-data spaces; the fine-granularity grid indexes maintained by different Slave nodes do not overlap, i.e., the subspaces they maintain do not overlap.
The concrete process of establishing the non-equal-width coarse-granularity distributed in-memory grid index structure is:
Following the procedure of step one, count the number of data objects contained in each partition of each dimension;
Require each partition of every dimension to contain at least θ data objects: when the number of data objects in a partition is less than θ, merge it with the adjacent partition, until the merged partition contains at least θ data objects or the data space has no remaining partitions;
Through the above counting and merging, the overall data space is divided into a non-equal-width coarse-granularity grid space in which the number of data objects per coarse grid is roughly even;
Establish the coarse-granularity grid index CGGI of the overall data space. Each coarse grid of CGGI is represented by a triple <Cgid, Cgnum, SIP>, where Cgid is the number of the coarse grid, written as (<lb_1, ub_1>, <lb_2, ub_2>, ..., <lb_i, ub_i>, ..., <lb_n, ub_n>), with <lb_i, ub_i> denoting the lower and upper bounds of the grid's partition in the i-th dimension; Cgnum is the number of data objects in the coarse grid; and SIP is the address of the Slave node corresponding to the coarse grid.
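A sketch of the per-dimension merging rule described above, assuming the per-partition object counts from step one are given as a list (the counts in the example are those of the X dimension of the embodiment; helper names are illustrative):

```python
def merge_partitions(counts, theta, delta):
    """Merge adjacent equal-width partitions of one dimension (object counts in `counts`)
    until every resulting coarse partition holds at least `theta` objects; any leftover
    partitions at the end form the last coarse partition.
    Returns <lower bound, upper bound> pairs in the units of the original step `delta`."""
    bounds, start, acc = [], 0, 0
    for i, c in enumerate(counts):
        acc += c
        if acc >= theta:
            bounds.append((start * delta, (i + 1) * delta))
            start, acc = i + 1, 0
    if start < len(counts):            # remaining partitions form the last coarse partition
        bounds.append((start * delta, len(counts) * delta))
    return bounds

# X-dimension counts of the embodiment (16 partitions of width 5, theta = 10):
x_counts = [2, 6, 5, 2, 3, 5, 2, 2, 0, 1, 3, 7, 3, 4, 0, 0]
print(merge_partitions(x_counts, 10, 5))   # [(0, 15), (15, 30), (30, 60), (60, 80)]
```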
The process of establishing the equal-width fine-granularity distributed in-memory grid index structure is: based on the non-equal-width coarse-granularity distributed in-memory grid index structure already established, the sub-data space corresponding to each coarse grid <Cgid, Cgnum, SIP> is further subdivided, taking a fixed step λ as the partition granularity of every dimension, so that each coarse grid is divided into an equal-width fine-granularity grid space; the equal-width fine-granularity distributed in-memory grid index FGGI is then built on this fine-granularity grid space. Each fine grid of FGGI is represented by a triple <Fgid, Fgnum, List>, where Fgid is the number of the fine grid, written as <l_1, l_2, ..., l_n> and unique; Fgnum is the number of data objects in the fine grid; and List is the list of data objects contained in the fine grid.
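A sketch of how a Slave node could build the <Fgid, Fgnum, List> triples of one coarse grid with a fixed step λ (illustrative names; the per-dimension cell number ⌊d_i/λ⌋ is assumed, consistent with the embodiment below):

```python
from collections import defaultdict

def fine_cell(point, lam):
    """Fgid of a point under the equal-width fine partition of step lam: l_i = floor(d_i / lam)."""
    return tuple(int(d // lam) for d in point)

def build_fggi(points_in_coarse_grid, lam):
    """Fine-granularity index of one coarse grid: Fgid -> (Fgnum, List)."""
    cells = defaultdict(list)
    for p in points_in_coarse_grid:
        cells[fine_cell(p, lam)].append(p)
    return {fgid: (len(objs), objs) for fgid, objs in cells.items()}

# Coarse grid (<0,15>, <0,20>) of the embodiment holds four objects; lam = 5:
print(build_fggi([(2, 12), (8, 2), (9, 11), (11, 16)], 5))
# {(0, 2): (1, [(2, 12)]), (1, 0): (1, [(8, 2)]), (1, 2): (1, [(9, 11)]), (2, 3): (1, [(11, 16)])}
```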
Data objects can be inserted into and deleted from both CGGI and FGGI, wherein:
The insertion process of a CGGI data object is: for the insertion of a data object p(d_1, d_2, ..., d_n), the partition containing d_i (i = 1, 2, ..., n) is computed, which determines the coarse grid containing p; the Cgnum of that coarse grid is then updated, i.e., increased by 1;
Meanwhile, the fine-granularity grid index FGGI corresponding to that coarse grid is updated for the insertion: first, through the CGGI update, the Master node distributes data object p to the corresponding coarse grid and Slave node; second, for the insertion of p(d_1, d_2, ..., d_n), the fine-granularity cell of p in each dimension is computed, which determines the Fgid of the fine grid containing p; the Fgnum of that fine grid is updated, i.e., increased by 1, and at the same time p is inserted into List;
The deletion process of a CGGI data object is: for the deletion of a data object p(d_1, d_2, ..., d_n), the partition containing d_i (i = 1, 2, ..., n) is computed, which determines the coarse grid containing p; the Cgnum of that coarse grid is then updated, i.e., decreased by 1;
Meanwhile, the fine-granularity grid index FGGI corresponding to that coarse grid is updated for the deletion: first, through the CGGI deletion operation, the Master node finds the coarse grid and Slave node containing data object p; second, for the deletion of p(d_1, d_2, ..., d_n), the fine-granularity cell of p in each dimension is computed, which determines the Fgid of the fine grid containing p; the Fgnum of that fine grid is updated, i.e., decreased by 1, and at the same time p is removed from List.
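The two-level maintenance just described can be pictured with the following single-process sketch, in which one class stands in for the Master and Slave roles (assumed, simplified structures; not the patent's implementation):

```python
class CombinedGrainIndex:
    """Toy stand-in for CGGI + FGGI maintenance on one machine (illustrative only)."""

    def __init__(self, coarse_grids, lam):
        # coarse_grids: per-dimension (lower, upper) bound tuples, e.g. ((0, 15), (0, 20))
        self.coarse = {cg: 0 for cg in coarse_grids}      # Cgid -> Cgnum
        self.fine = {cg: {} for cg in coarse_grids}       # Cgid -> {Fgid: List}
        self.lam = lam

    def _coarse_of(self, p):
        for cg in self.coarse:
            if all(lb <= d < ub for d, (lb, ub) in zip(p, cg)):
                return cg
        raise ValueError("point outside the data space")

    def insert(self, p):
        cg = self._coarse_of(p)
        self.coarse[cg] += 1                              # Cgnum += 1
        fgid = tuple(int(d // self.lam) for d in p)
        self.fine[cg].setdefault(fgid, []).append(p)      # Fgnum += 1, append to List

    def delete(self, p):
        cg = self._coarse_of(p)
        self.coarse[cg] -= 1                              # Cgnum -= 1
        fgid = tuple(int(d // self.lam) for d in p)
        self.fine[cg][fgid].remove(p)                     # Fgnum -= 1, remove from List
```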
The KNN query process searches for the k nearest-neighbour objects of an object q; its concrete query process is:
First, the Master node runs the data-object mapping algorithm MOG, maps the query object onto the coarse-granularity grid index CGGI, and determines the coarse grid Cg_q of CGGI containing q;
Second, the adjacent-grid search algorithm SNNG is run to search the grids adjacent to the coarse grid, i.e., the grids adjacent to Cg_q; it is judged whether the total number of objects in Cg_q and its adjacent grids reaches k; if it is less than k, the grids adjacent to those adjacent grids are searched in turn, until the total number of objects reaches k or the whole coarse grid space has been searched; this finally yields the adjacent coarse grid set C_q of Cg_q, and the Slave nodes holding the coarse grids of C_q are determined;
The SDKNN algorithm is run on the Slave nodes that hold the coarse grids of C_q, and each Slave node outputs its query result;
The results output by the Slave nodes are gathered onto one Slave node to obtain the result set S; S is sorted in ascending order, and the first k objects are output as the final result.
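The final gather-and-sort step can be sketched as follows (assuming each Slave node has already returned its candidate set S_i as a list of objects; math.dist is used for the Euclidean distance):

```python
import math

def reduce_results(candidate_sets, q, k):
    """Gather the per-Slave candidate sets S_1..S_n, sort all candidates by distance to q
    in ascending order and keep the first k as the final answer."""
    merged = [p for s in candidate_sets for p in s]
    merged.sort(key=lambda p: math.dist(p, q))
    return merged[:k]

# With the embodiment's query q = (56, 43) and k = 2:
s1 = [(62, 36), (67, 43)]
s6 = [(58, 63), (59, 49)]
print(reduce_results([s1, s6], (56, 43), 2))   # [(59, 49), (62, 36)]
```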
The MOG algorithm runs on the Master node. Its specific implementation process is: given the query data object q(d_1, d_2, ..., d_n) and the coarse grid set C, determine the partition of the coarse grid containing q in every dimension, map q into CGGI, and determine the coarse grid Cg_q containing q.
The SNNG algorithm runs on the Master node. According to the definition of "adjacent grid", it computes the grids adjacent to the coarse grid Cg_q containing q and obtains the adjacent coarse grid set C_q; it counts the total number num of objects in the coarse grids of C_q; if num >= k it outputs C_q; otherwise it executes SNNG for each coarse grid in C_q, until the number of objects num in the coarse grids of C_q satisfies num >= k or the whole coarse grid space has been searched, and then outputs C_q.
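A compact sketch of this expansion loop, assuming the coarse grids are given with their object counts and an adjacency test implementing the "adjacent grid" definition is supplied (a sketch only, not the patent's listing):

```python
def snng(all_grids, cg_q, k, are_adjacent):
    """all_grids: {Cgid: Cgnum}; cg_q: coarse grid of the query object;
    are_adjacent(a, b): the "adjacent grid" test.  Starting from cg_q, repeatedly add
    adjacent coarse grids until the grids collected hold at least k objects or the
    whole coarse grid space has been visited."""
    collected = {cg_q}
    frontier = {cg_q}
    while sum(all_grids[g] for g in collected) < k and len(collected) < len(all_grids):
        frontier = {g for g in all_grids
                    for f in frontier
                    if g not in collected and are_adjacent(g, f)}
        if not frontier:
            break
        collected |= frontier
    return collected
```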
The SDKNN algorithm is a distributed, scalable KNN algorithm. It runs in a distributed manner on every Slave node that stores the coarse grid Cg_q containing the query object q or a coarse grid of C_q, and each Slave node returns the k objects of that node nearest to q;
Taking one Slave node as an example, the specific implementation process of the algorithm is:
Let Slave node 1 store a coarse grid Cg_j with Cg_j ∈ C_q, and let Fg_j be the fine-granularity grid index of Cg_j on this Slave node;
Execute the Circle-Traversal algorithm on Fg_j to obtain the adjacent fine grid set F_j that contains at least the k nearest-neighbour objects of q. Circle-Traversal is a lossless nearest-neighbour fine-grid search algorithm; its inputs are the fine-granularity grid index Fg_j, the fine grid Fg_q containing the query object q, the step λ of the fine-granularity grid partition, and the loop count i. Centred on Fg_q, it searches circle by circle for the fine grids that belong to Fg_j, and obtains the set F_j of fine grids belonging to Fg_j that surround the fine grid of q;
For every object p of any fine grid Fg in F_j, i.e., {p | p ∈ Fg, Fg ∈ F_j}, compute the distance dist(p, q) between p and q, sort by distance, and return the set S_1 of the k objects nearest to q;
Repeat the above steps to obtain the sets S_2, S_3, ..., S_n of the k objects nearest to q on the Slave nodes holding the other coarse grids of C_q; sort the objects of {S_1, S_2, S_3, ..., S_n} by their distance to q, and finally return the set S of the k objects nearest to q.
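The per-Slave distance step of SDKNN can be sketched as follows (assuming the neighbour fine-grid set F_j is given as a mapping from Fgid to object list; the figures reproduce the S_1 of the embodiment):

```python
import heapq
import math

def slave_local_knn(fine_cells, q, k):
    """fine_cells: the neighbour fine-granularity grid set F_j found by Circle-Traversal,
    given as {Fgid: list of objects}.  Return this Slave's k objects nearest to q,
    as (distance, object) pairs in ascending order of distance."""
    candidates = [(math.dist(p, q), p) for objs in fine_cells.values() for p in objs]
    return heapq.nsmallest(k, candidates)

# Slave holding coarse grid (<60,80>, <35,65>) of the embodiment, q = (56, 43), k = 2:
f_j = {(12, 7): [(62, 36)], (13, 8): [(67, 43)], (12, 8): [], (12, 9): []}
print(slave_local_knn(f_j, (56, 43), 2))
# [(9.219..., (62, 36)), (11.0, (67, 43))]
```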
The KNN query method based on a combined-granularity distributed in-memory grid index of the present invention has the following advantages. First, the overall data is analysed with a grid- and density-based method to obtain a summary estimate of the data distribution, so as to reduce data skew in the cluster as much as possible. Second, on the basis of this summary estimate, a coarse-fine combined-granularity distributed in-memory grid index structure is established, eliminating the bottleneck of single-machine processing capacity, improving data search efficiency and supporting distributed algorithms. Third, based on the fine-granularity grid index, a lossless nearest-neighbour fine-grid search algorithm is designed, which quickly and accurately locates the neighbouring fine grids of the query object. Finally, based on this distributed in-memory index structure and the nearest-neighbour fine-grid search algorithm, a scalable, distributed KNN query algorithm is designed, eliminating both the single-machine bottleneck of centralized KNN algorithms and the poor real-time performance of MapReduce-based KNN algorithms caused by writing intermediate results back to disk. The method is practical and easy to popularize.
Embodiment
The invention is further described below in conjunction with the drawings and specific embodiments.
The invention provides a KNN query method based on a combined-granularity distributed in-memory grid index. Aiming at the low search efficiency of traditional KNN query algorithms in a big-data environment, the overall data is analysed with a grid- and density-based method so as to reduce data skew as much as possible, and a coarse-fine combined-granularity distributed in-memory grid index structure is designed to improve data search efficiency and support distributed algorithms; on this basis, a scalable, distributed KNN query algorithm based on the distributed in-memory grid index is designed to realize fast queries over massive data.
The terms used in the method are explained as follows: an in-memory index is a data structure that organizes and orders one or several attribute values of in-memory data; a distributed in-memory index is an in-memory index that can be partitioned and deployed in a distributed manner onto the processing nodes of a main-memory cluster; a grid means that every dimension A_i (i = 1, 2, ..., d) of a d-dimensional data space A is divided into p_i intervals, and each grid cell g is composed of one interval c_i (c_i = 1, 2, ..., p_i) per dimension, written g = (c_1, c_2, ..., c_d); two grids are adjacent grids if, in every dimension i, their partitions share a lower bound, share an upper bound, or the upper bound of one equals the lower bound of the other; a KNN query (k-nearest-neighbour query) refers to the result set formed by the k objects nearest to a designated object q, i.e., letting the whole object set be O and the KNN query result set be O', for every p' in O' and every p in O outside O' there holds dist(p', q) ≤ dist(p, q), where dist(p, q) denotes the distance between objects p and q; the Master node is the master node of the cluster, in charge of distributed data distribution and the decomposition and execution of tasks; a Slave node is a slave node of the cluster, in charge of distributed data storage and task execution; the scalable, distributed KNN query algorithm is a KNN query algorithm that, based on the distributed in-memory index, can run in a distributed manner on the processing nodes of the main-memory cluster and performs query processing cooperatively.
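A sketch of the "adjacent grid" test for the non-equal-width coarse grids, as it is applied in the embodiment below (per-dimension bounds are compared; function names are illustrative):

```python
def dims_touch(interval_a, interval_b):
    """One-dimensional test of the "adjacent grid" definition: the two partitions share a
    lower bound, share an upper bound, or the upper bound of one equals the lower bound
    of the other."""
    (lb1, ub1), (lb2, ub2) = interval_a, interval_b
    return lb1 == lb2 or ub1 == ub2 or ub1 == lb2 or lb1 == ub2

def are_adjacent(grid_a, grid_b):
    """Two coarse grids are adjacent iff the one-dimensional test holds in every dimension."""
    return all(dims_touch(a, b) for a, b in zip(grid_a, grid_b))

# Embodiment check: Cg2 = (<15,30>, <20,35>) is adjacent to Cgq = (<30,60>, <35,65>):
print(are_adjacent(((15, 30), (20, 35)), ((30, 60), (35, 65))))   # True
print(are_adjacent(((0, 15), (0, 20)),  ((30, 60), (35, 65))))    # False
```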
As shown in Figs. 1 and 2, the present invention analyses the overall data with a grid- and density-based method and forms a summary estimate of the overall data distribution; on this basis, the combined-granularity distributed in-memory grid index structure is established: the Master node of the cluster builds and maintains the non-equal-width coarse-granularity grid index of the whole space and is responsible for distributing data to the Slave nodes of the cluster, each Slave node holds one or several coarse grids, refines each coarse grid into equal-width fine cells and builds the corresponding equal-width fine-granularity grid index, and the fine-granularity grid indexes maintained by different Slave nodes do not overlap, i.e., the subspaces they maintain do not overlap; finally, based on the coarse-fine combined-granularity distributed in-memory grid index structure and the lossless nearest-neighbour fine-grid search algorithm, a scalable, distributed KNN query algorithm is designed to realize fast KNN queries over massive data. Its specific implementation process is:
Step one, data preprocessing: partition the overall data space based on grid and density, and obtain a summary estimate of the overall data distribution;
Step two, data query: establish the combined-granularity distributed in-memory grid index structure, i.e., the non-equal-width coarse-granularity grid index and the equal-width fine-granularity grid index; design the distributed KNN query algorithm on the basis of the above index structure to realize fast KNN queries over massive data: based on the non-equal-width coarse-granularity grid index, search the set of coarse grids adjacent to the coarse grid containing the query object and determine the Slave node holding each coarse grid of that adjacent set; on those Slave nodes, based on the equal-width fine-granularity grid index, search the set of fine grids adjacent to the fine grid containing the query object, compare the distances between the query object and every object contained in each fine grid of that adjacent set, and thereby obtain the k nearest-neighbour objects of the query.
The detailed process of partitioning the overall data space is: each dimension of the overall data space is divided into equal-width fine partitions with a fixed step δ, forming a fine-granularity grid space; each data object p is mapped to its corresponding grid, for example a data object p(d_1, d_2, ..., d_n), with n the dimensionality of the overall data space, is mapped to the grid whose number in the i-th dimension is ⌊d_i/δ⌋; each grid is represented by a feature vector g(gid, num) that records the grid number and the number of data objects it contains, where gid is the grid number (unique) and num is the number of data objects contained in the grid.
The process of establishing the combined-granularity distributed in-memory grid index structure is: according to the summary estimate of the data distribution obtained in step one, the overall data space is partitioned into non-equal-width coarse grids, and the non-equal-width coarse-granularity grid index of the overall data space is established; the Master node of the main-memory cluster maintains the coarse-granularity distributed in-memory grid index structure CGGI (coarse-grained grid index) of the overall data space and is responsible for distributing data to the Slave nodes of the cluster; the sub-data space represented by each coarse grid of the above partition is further divided into equal-width fine grids, and a fine-granularity grid index is established for each sub-data space; each Slave node of the main-memory cluster maintains the fine-granularity distributed in-memory grid index structure FGGI (fine-grained grid index) of one or several sub-data spaces, and the fine-granularity grid indexes maintained by different Slave nodes do not overlap, i.e., the subspaces they maintain do not overlap.
The concrete process of establishing the non-equal-width coarse-granularity distributed in-memory grid index structure is: following the procedure of step one, count the number of data objects contained in each partition of each dimension; require each partition of every dimension to contain at least θ data objects, and when the number of data objects in a partition is less than θ, merge it with the adjacent partition, until the merged partition contains at least θ data objects or the data space has no remaining partitions; through the above counting and merging, the overall data space is divided into a non-equal-width coarse-granularity grid space in which the number of data objects per coarse grid is roughly even; establish the coarse-granularity grid index CGGI of the overall data space, in which each coarse grid is represented by a triple <Cgid, Cgnum, SIP>, where Cgid is the number of the coarse grid, written as (<lb_1, ub_1>, <lb_2, ub_2>, ..., <lb_i, ub_i>, ..., <lb_n, ub_n>) with <lb_i, ub_i> the lower and upper bounds of the grid's partition in the i-th dimension; Cgnum is the number of data objects in the coarse grid; and SIP is the address of the Slave node corresponding to the coarse grid.
The process of establishing the equal-width fine-granularity distributed in-memory grid index structure is: based on the non-equal-width coarse-granularity distributed in-memory grid index structure already established, the sub-data space corresponding to each coarse grid <Cgid, Cgnum, SIP> is further subdivided, taking a fixed step λ as the partition granularity of every dimension, so that each coarse grid is divided into an equal-width fine-granularity grid space; the equal-width fine-granularity distributed in-memory grid index FGGI is built on this fine-granularity grid space, in which each fine grid is represented by a triple <Fgid, Fgnum, List>, where Fgid is the number of the fine grid, written as <l_1, l_2, l_3, ..., l_n> and unique; Fgnum is the number of data objects in the fine grid; and List is the list of data objects contained in the fine grid.
Data objects can be inserted into and deleted from both CGGI and FGGI. The insertion process of a CGGI data object is: for the insertion of a data object p(d_1, d_2, ..., d_n), the partition containing d_i (i = 1, 2, ..., n) is computed, which determines the coarse grid containing p; the Cgnum of that coarse grid is updated, i.e., increased by 1; meanwhile, the fine-granularity grid index FGGI corresponding to that coarse grid is updated for the insertion: first, through the CGGI update, the Master node distributes data object p to the corresponding coarse grid and Slave node; second, for the insertion of p(d_1, d_2, ..., d_n), the fine-granularity cell of p in each dimension is computed, which determines the Fgid of the fine grid containing p; the Fgnum of that fine grid is updated, i.e., increased by 1, and at the same time p is inserted into List;
The deletion process of a CGGI data object is: for the deletion of a data object p(d_1, d_2, ..., d_n), the partition containing d_i (i = 1, 2, ..., n) is computed, which determines the coarse grid containing p; the Cgnum of that coarse grid is updated, i.e., decreased by 1; meanwhile, the fine-granularity grid index FGGI corresponding to that coarse grid is updated for the deletion: first, through the CGGI deletion operation, the Master node finds the coarse grid and Slave node containing data object p; second, for the deletion of p(d_1, d_2, ..., d_n), the fine-granularity cell of p in each dimension is computed, which determines the Fgid of the fine grid containing p; the Fgnum of that fine grid is updated, i.e., decreased by 1, and at the same time p is removed from List.
The KNN query process searches for the k nearest-neighbour objects of an object q; its concrete query process is: first, the Master node runs the data-object mapping algorithm MOG (map object to grid), maps the query object q onto the coarse-granularity grid index CGGI, and determines the coarse grid Cg_q of CGGI containing q; second, the adjacent-grid search algorithm SNNG (search nearest neighbor grid) is run to search the grids adjacent to the coarse grid, i.e., the grids adjacent to Cg_q; it is judged whether the total number of objects in Cg_q and its adjacent grids reaches k; if it is less than k, the grids adjacent to those adjacent grids are searched in turn, until the total number of objects reaches k or the whole coarse grid space has been searched, finally yielding the adjacent coarse grid set C_q of Cg_q, and the Slave nodes holding the coarse grids of C_q are determined; SDKNN (scalable distributed KNN algorithm) is run on the Slave nodes that hold the coarse grids of C_q, and each Slave node outputs its query result; the results output by the Slave nodes are gathered onto one Slave node to obtain the result set S; S is sorted in ascending order, and the first k objects are output as the final result.
The MOG algorithm runs on the Master node. Its specific implementation process is: given the query data object q(d_1, d_2, ..., d_n) and the coarse grid set C, determine the partition of the coarse grid containing q in every dimension, map q into CGGI, and determine the coarse grid Cg_q containing q. Taking n = 2 (i.e., a two-dimensional data space) as an example, the detailed process of the algorithm is:
The SNNG algorithm runs on the Master node. According to the definition of "adjacent grid", it computes the grids adjacent to the coarse grid Cg_q containing q and obtains the adjacent coarse grid set C_q; it counts the total number num of objects in the coarse grids of C_q; if num >= k it outputs C_q; otherwise it executes SNNG for each coarse grid in C_q, until the number of objects num in the coarse grids of C_q satisfies num >= k or the whole coarse grid space has been searched, and then outputs C_q. Its specific implementation process is:
The SDKNN algorithm is a distributed, scalable KNN algorithm. It runs in a distributed manner on every Slave node that stores the coarse grid Cg_q containing the query object q or a coarse grid of the adjacent grid set C_q of Cg_q, and each Slave node returns the k objects of that node nearest to q. Taking Slave node 1 as an example, the detailed process is:
Let Slave node 1 store a coarse grid Cg_j with Cg_j ∈ C_q, and let Fg_j be the fine-granularity grid index of Cg_j on Slave node 1;
Execute the Circle-Traversal algorithm on Fg_j to obtain the fine grid set F_j that contains at least the k nearest-neighbour objects of q. Circle-Traversal is a lossless adjacent fine-grid search algorithm; its inputs are the fine-granularity grid index Fg_j, the fine grid Fg_q containing the query object q, the step λ of the fine-granularity grid partition, and the loop count i. Centred on Fg_q, it searches circle by circle for the fine grids that belong to Fg_j and obtains the fine grid set F_j that contains at least the k nearest-neighbour objects of q;
For every data object p contained in any fine grid Fg of F_j, i.e., {p | p ∈ Fg, Fg ∈ F_j}, compute the distance dist(p, q) between p and q, sort by distance, and return the set S_1 of the k objects nearest to q.
In the same way, the sets S_2, S_3, ..., S_n of the k objects nearest to q on the Slave nodes holding the other coarse grids of C_q are obtained; the objects of {S_1, S_2, S_3, ..., S_n} are sorted by their distance to q, and the set S of the k objects nearest to q is finally returned.
The detailed process of the above Circle-Traversal algorithm is:
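The algorithm listing itself is not reproduced in this text; as an illustration only, the following simplified Python sketch conveys the idea under the stated inputs (it enumerates each ring at once instead of walking it cell by cell, as the embodiment below does):

```python
def circle_traversal(fg_j, fgid_q, k):
    """Simplified sketch of Circle-Traversal (not the patent's exact listing): collect the
    non-empty fine cells of this Slave's index fg_j ({Fgid: object list}) ring by ring
    around the fine cell of q; once at least k objects have been collected, traverse one
    further ring so that no nearer object can be missed."""
    def ring_of(cell):
        return max(abs(c - cq) for c, cq in zip(cell, fgid_q))

    max_ring = max((ring_of(c) for c in fg_j), default=0)
    collected, found, i = {}, 0, 0
    while i <= max_ring:
        for cell, objs in fg_j.items():
            if ring_of(cell) == i:
                collected[cell] = objs
                found += len(objs)
        if found >= k:               # expand exactly one more ring, then stop
            for cell, objs in fg_j.items():
                if ring_of(cell) == i + 1:
                    collected[cell] = objs
            break
        i += 1
    return collected

# Embodiment: Slave holding Cg (<60,80>, <35,65>), q in fine cell (11, 8), k = 2:
fg = {(12, 7): [(62, 36)], (13, 8): [(67, 43)]}
print(sorted(circle_traversal(fg, (11, 8), 2)))   # [(12, 7), (13, 8)]
```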
The SDKNN algorithm is:
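The SDKNN listing is likewise not reproduced here; a simplified slave-side sketch is given below (it reuses the circle_traversal and slave_local_knn sketches shown earlier in this description; structure names are illustrative):

```python
def sdknn_on_slave(fggi_of_coarse_grids, q, k, lam):
    """Slave-side SDKNN sketch: for every coarse grid of C_q stored on this Slave
    (one FGGI dict per coarse grid), collect the neighbour fine cells of q with
    circle_traversal, compute distances with slave_local_knn, and return this
    node's k nearest objects as (distance, object) pairs."""
    fgid_q = tuple(int(d // lam) for d in q)
    candidates = []
    for fg_j in fggi_of_coarse_grids:
        neighbour_cells = circle_traversal(fg_j, fgid_q, k)
        candidates.extend(slave_local_knn(neighbour_cells, q, k))
    candidates.sort()
    return candidates[:k]
```

The Master node then gathers the per-node results and applies the final ascending sort, as in the reduce_results sketch given earlier.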
Combining the characteristics of a main-memory cluster, the present invention proposes and establishes a coarse-fine combined-granularity distributed in-memory grid index structure and designs a distributed KNN query algorithm based on this index structure, extending the traditionally centralized KNN algorithm to a distributed in-memory cluster environment and alleviating the inefficiency, in a big-data environment, of both centralized KNN query algorithms and KNN query algorithms based on the MapReduce framework.
Specific embodiment:
1. Without loss of generality, a cluster of 6 servers is used as the experiment platform (1 server as the Master node and 5 as Slave nodes), and the technical solution of the present invention is described in detail for a KNN query over two-dimensional spatial data. The overall data is shown in the table below, and its spatial distribution is shown in Fig. 3.
(12,68) | (31,73) | (58,63) | (57,23) | (4,26)
(28,33) | (11,16) | (56,8) | (21,66) | (16,72)
(13,56) | (62,78) | (52,29) | (7,34) | (32,19)
(13,26) | (53,16) | (23,61) | (18,61) | (26,57)
(65,66) | (59,49) | (6,23) | (28,13) | (26,23)
(56,43) | (27,71) | (11,63) | (7,55) | (67,72)
(64,24) | (9,11) | (38,26) | (67,43) | (66,16)
(53,72) | (8,73) | (21,53) | (46,33) | (62,36)
(8,2) | (37,13) | (2,12) | (57,71) | (56,76)
2. The data in Table 1 is spatially partitioned with the grid- and density-based method. Taking a fixed step δ = 5, the data in Table 1 is mapped to the corresponding grids and the number gid of the grid to which each data object belongs is determined; the results are shown in the data-mapping table below. For example, for a data object p(d_1, d_2), the number of the grid to which p belongs is (⌊d_1/5⌋, ⌊d_2/5⌋), so p(12, 68) falls into grid <2,13>. From the data-mapping results, the grid- and density-based spatial partition of the overall data is obtained; each grid is represented by a feature vector g(gid, num), the feature vectors of all non-empty grids are shown in the table of non-empty grid feature vectors below, and the spatial partition and distribution of the overall data are shown in Fig. 4.
g(<2,13>,1) | g(<6,14>,1) | g(<11,12>,1) | g(<11,4>,1) | g(<0,5>,1)
g(<3,14>,1) | g(<2,11>,1) | g(<12,15>,1) | g(<10,5>,1) | g(<1,6>,1)
g(<3,12>,1) | g(<5,11>,1) | g(<13,13>,1) | g(<11,9>,1) | g(<1,4>,1)
g(<2,12>,1) | g(<1,11>,1) | g(<13,14>,1) | g(<12,4>,1) | g(<1,2>,1)
g(<1,14>,1) | g(<4,10>,1) | g(<9,6>,1) | g(<12,7>,1) | g(<1,0>,1)
g(<2,3>,1) | g(<2,5>,1) | g(<5,4>,1) | g(<13,8>,1) | g(<0,2>,1)
g(<10,3>,1) | g(<11,8>,1) | g(<13,3>,1) | g(<11,14>,1) | g(<4,13>,1)
g(<5,14>,1) | g(<10,14>,1) | g(<11,15>,1) | g(<5,6>,1) | g(<6,3>,1)
g(<5,2>,1) | g(<7,5>,1) | g(<7,2>,1) | g(<11,1>,1) | g(<4,12>,1)
3. Based on the grid feature vectors and the spatial partition of the overall data obtained in step 2 above, the non-equal-width coarse-granularity distributed in-memory grid index is established. The concrete steps are as follows.
1) Count the partitions of the X dimension of the space shown in Fig. 3; the concrete results are shown in the following tables.
(a)
0th partition | 1st partition | 2nd partition | 3rd partition
g(<0,2>,1) | g(<1,14>,1) | g(<2,13>,1) | g(<3,14>,1)
g(<0,5>,1) | g(<1,11>,1) | g(<2,12>,1) | g(<3,12>,1)
 | g(<1,6>,1) | g(<2,3>,1) | 
 | g(<1,4>,1) | g(<2,11>,1) | 
 | g(<1,2>,1) | g(<2,5>,1) | 
 | g(<1,0>,1) | | 
(b)
4th partition | 5th partition | 6th partition | 7th partition
g(<4,10>,1) | g(<5,14>,1) | g(<6,3>,1) | g(<7,5>,1)
g(<4,13>,1) | g(<5,2>,1) | g(<6,14>,1) | g(<7,2>,1)
g(<4,12>,1) | g(<5,4>,1) | | 
 | g(<5,6>,1) | | 
 | g(<5,11>,1) | | 
(c)
8th partition | 9th partition | 10th partition | 11th partition
 | g(<9,6>,1) | g(<10,5>,1) | g(<11,12>,1)
 | | g(<10,3>,1) | g(<11,4>,1)
 | | g(<10,14>,1) | g(<11,9>,1)
 | | | g(<11,8>,1)
 | | | g(<11,14>,1)
 | | | g(<11,15>,1)
 | | | g(<11,1>,1)
(d)
12th partition | 13th partition | 14th partition | 15th partition
g(<12,4>,1) | g(<13,14>,1) | | 
g(<12,7>,1) | g(<13,8>,1) | | 
g(<12,15>,1) | g(<13,13>,1) | | 
 | g(<13,3>,1) | | 
The partitions of the Y dimension of the space shown in Fig. 3 are counted in the same way; the concrete results are shown in the following Y-dimension partition statistics tables.
(a)
0th partition | 1st partition | 2nd partition | 3rd partition
g(<1,0>,1) | g(<11,1>,1) | g(<5,2>,1) | g(<2,3>,1)
 | | g(<1,2>,1) | g(<12,3>,1)
 | | g(<0,2>,1) | g(<13,3>,1)
 | | g(<7,2>,1) | g(<6,3>,1)
(b)
4th partition | 5th partition | 6th partition | 7th partition
g(<5,4>,1) | g(<2,5>,1) | g(<9,6>,1) | g(<12,7>,1)
g(<11,4>,1) | g(<7,5>,1) | g(<5,6>,1) | 
g(<12,4>,1) | g(<10,5>,1) | g(<1,6>,1) | 
g(<1,4>,1) | g(<0,5>,1) | | 
(c)
8th partition | 9th partition | 10th partition | 11th partition
g(<11,8>,1) | g(<11,9>,1) | g(<4,10>,1) | g(<2,11>,1)
g(<13,8>,1) | | | g(<5,11>,1)
 | | | g(<1,11>,1)
(d)
12th partition | 13th partition | 14th partition | 15th partition
g(<3,12>,1) | g(<2,13>,1) | g(<3,14>,1) | g(<12,15>,1)
g(<2,12>,1) | g(<13,13>,1) | g(<1,14>,1) | g(<11,15>,1)
g(<11,12>,1) | g(<4,13>,1) | g(<5,14>,1) | 
g(<4,12>,1) | | g(<6,14>,1) | 
 | | g(<10,14>,1) | 
 | | g(<13,14>,1) | 
 | | g(<11,14>,1) | 
2) Take the parameter θ = 10 and merge the partitions of the X dimension and the Y dimension respectively, so as to partition the overall data as evenly as possible.
First, the X dimension is scanned from low to high according to partition number. The 0th partition has 2 data objects, fewer than θ, so it is merged with the 1st partition, giving 8 data objects after merging, still fewer than θ, so it is merged with the 2nd partition, giving 13 data objects, more than θ, and the merging stops; thus the 0th, 1st and 2nd partitions of the X dimension are merged. The remaining partitions are scanned in the same way: the 3rd, 4th and 5th partitions of the X dimension are merged, the 6th to 11th partitions (six partitions) are merged, and the 12th to 15th partitions (four partitions) are merged; the result is shown in Fig. 5.
Second, the partitions of the Y dimension are merged in the same way: the 0th to 3rd partitions (four partitions) are merged, the 4th to 6th partitions (three partitions) are merged, the 7th to 12th partitions (six partitions) are merged, the 13th and 14th partitions are merged, and the 15th partition remains a partition of its own; the result is shown in Fig. 6.
3) Based on the result of merging the X- and Y-dimension partitions in 2), the coarse-granularity grid index CGGI shown in Fig. 7 is established. Each coarse grid is represented by a feature-vector triple <Cgid, Cgnum, SIP>, where Cgid is the number of the coarse grid, written as (<lb_1, ub_1>, <lb_2, ub_2>), with <lb_i, ub_i> the lower and upper bounds of the grid's partition in the i-th dimension; Cgnum is the number of data objects in the coarse grid; and SIP is the address of the Slave node corresponding to the coarse grid. Four Slave nodes, numbered 001, 002, 003 and 004, are used to store the coarse grids. All coarse grids of CGGI are scanned in turn and distributed to the 4 Slave nodes according to the following rule: first, scan the partitions of the X dimension from small to large, and within each X partition scan the partitions of the Y dimension from small to large; then, distribute each coarse grid to the Slave node that currently holds the fewest data objects. For example, initially the 4 Slave nodes each hold 0 data objects; the scan starts from Cgid (<0,15>, <0,20>), which is distributed to Slave node 001; (<0,15>, <20,35>) is distributed to Slave node 002, (<0,15>, <35,65>) to Slave node 003, and (<0,15>, <65,75>) to Slave node 004; at this point the object counts of 001, 002, 003 and 004 are 4, 4, 3 and 2 respectively, so (<0,15>, <75,80>) is distributed to node 004, and the remaining coarse grids are distributed to the corresponding Slave nodes in the same way. Finally, the storage nodes and feature vectors of all coarse grids of CGGI are shown in the following table of coarse-grid feature vectors.
Slave node 001 | Slave node 002 | Slave node 003 | Slave node 004
((<0,15>,<0,20>),4,001) | ((<0,15>,<20,35>),4,002) | ((<0,15>,<35,65>),3,003) | ((<0,15>,<65,75>),2,004)
((<15,30>,<65,75>),3,001) | ((<15,30>,<75,80>),0,002) | ((<15,30>,<20,35>),2,003) | ((<0,15>,<75,80>),0,004)
((<30,60>,<35,65>),3,001) | ((<30,60>,<0,20>),4,002) | ((<30,60>,<20,35>),4,003) | ((<15,30>,<0,20>),1,004)
((<60,80>,<35,65>),2,001) | ((<30,60>,<75,80>),1,002) | ((<60,80>,<20,35>),1,003) | ((<15,30>,<35,65>),4,004)
 | ((<60,80>,<0,20>),1,002) | ((<60,80>,<75,80>),1,003) | ((<30,60>,<65,75>),3,004)
 | ((<60,80>,<65,75>),2,002) | | 
4. For each coarse grid of the non-equal-width coarse-granularity distributed in-memory grid index structure established in step 3 above, the equal-width fine-granularity distributed in-memory grid index structure FGGI is established. Taking a fixed step λ = 5, each coarse grid is divided into fine cells, and each fine grid of FGGI is represented by a feature-vector triple <Fgid, Fgnum, List>, where Fgid is the number of the fine grid, written as <l_1, l_2> and unique; Fgnum is the number of data objects in the fine grid; and List is the list of data objects contained in the fine grid. For example, the fine-granularity grid index of the coarse grid Cg((<0,15>, <0,20>), 4, 001) is shown in Fig. 8; the feature vectors of its non-empty fine grids are (<0,2>, 1, (2,12)), (<1,0>, 1, (8,2)), (<1,2>, 1, (9,11)) and (<2,3>, 1, (11,16)). The fine-granularity grid indexes of all the other coarse grids are computed in the same way.
5. KNN query: find the 2 nearest-neighbour objects of the data object q(56, 43).
1) Based on the coarse-fine combined-granularity distributed in-memory grid index structures CGGI and FGGI established in steps 3 and 4, the MOG algorithm is run to determine the coarse grid Cg_q of CGGI containing q. The concrete operations of the algorithm are as follows: (1) initialize Cg_q as empty; (2) determine the X-dimension partition containing q: the X-dimension partitions of CGGI sorted in ascending order are (<0,15>, <15,30>, <30,60>, <60,80>), the value of q in the X dimension is 56, and by comparison the X-dimension partition of q is <30,60>; (3) determine the Y-dimension partition containing q: the Y-dimension partitions of CGGI sorted in ascending order are (<0,20>, <20,35>, <35,65>, <65,75>, <75,80>), the value of q in the Y dimension is 43, and by comparison the Y-dimension partition of q is <35,65>; (4) from (2) and (3), the Cgid of the coarse grid containing q is (<30,60>, <35,65>), and the final result Cg_q = ((<30,60>, <35,65>), 3, 001) is output.
2) Based on the coarse-fine combined-granularity distributed in-memory grid index structures CGGI and FGGI established in steps 3 and 4, the SNNG algorithm is run to find the adjacent coarse grid set C_q of Cg_q. The concrete operations of the algorithm are as follows: (1) let C be the set of all coarse grids of CGGI; for each coarse grid in C, compute in turn, according to the definition of "adjacent grid", whether it is adjacent to Cg_q, and delete it from C after the comparison, finishing when C is empty. For example, for the coarse grid Cg_1 = ((<0,15>, <0,20>), 4, 001), whether it is adjacent to Cg_q is computed according to the definition of "adjacent grid": first, adjacency in the X dimension is checked; the X-dimension partition of Cg_1 is <0,15> and that of Cg_q is <30,60>, and since 0 ≠ 60, 15 ≠ 30, 0 ≠ 30 and 15 ≠ 60, Cg_1 and Cg_q are not adjacent in the X dimension, so Cg_1 is not an adjacent grid of Cg_q. For the coarse grid Cg_2 = ((<15,30>, <20,35>), 2, 003), whether it is adjacent to Cg_q is likewise computed: first, adjacency in the X dimension is checked; the X-dimension partition of Cg_2 is <15,30> and that of Cg_q is <30,60>, and 30 == 30 (the upper bound of Cg_2 in the X dimension equals the lower bound of Cg_q in the X dimension), so Cg_2 and Cg_q are adjacent in the X dimension; second, adjacency in the Y dimension is checked; the Y-dimension partition of Cg_2 is <20,35> and that of Cg_q is <35,65>, and 35 == 35 (the upper bound of Cg_2 in the Y dimension equals the lower bound of Cg_q in the Y dimension), so Cg_2 and Cg_q are adjacent in the Y dimension; since Cg_2 and Cg_q are adjacent in both the X dimension and the Y dimension, Cg_2 is an adjacent grid of Cg_q. By this method the adjacent coarse grid set C_q of Cg_q is computed; the result is shown in the table of the adjacent coarse grid set C_q of Cg_q below. (2) Count the number of data objects in C_q by accumulating the object counts of the coarse grids in C_q, i.e., C_q.num = C_q.num + Cg.num (with C_q.num initially 0); since q itself must be excluded, the total number of data objects in C_q is 23. (3) Compare the total number of data objects in C_q with the number k of nearest-neighbour objects requested; since 23 > 2, C_q is output as the final result.
((<15,30>,<20,35>),2,003) | ((<30,60>,<20,35>),4,003) | ((<60,80>,<20,35>),1,003)
((<15,30>,<35,65>),4,004) | ((<30,60>,<35,65>),3,001) | ((<60,80>,<35,65>),2,001)
((<15,30>,<65,75>),3,001) | ((<30,60>,<65,75>),3,004) | ((<60,80>,<65,75>),2,002)
3) The SDKNN algorithm is run on the Slave node holding each coarse grid of C_q to compute the k (k = 2) nearest-neighbour objects of q in a distributed manner. From the table of step 3) above, the 9 coarse grids of C_q are stored across Slave nodes 001, 002, 003 and 004, and SDKNN is executed on these 4 nodes against the fine-granularity grid indexes corresponding to these 9 coarse grids. The concrete operations of the algorithm, taking Cg = ((<60,80>, <35,65>), 2, 001) as an example, are as follows. (1) Compute the fine grid Fg_q containing q: Fg_q = (<11,8>, 1, (56,43)). (2) Run the Circle-Traversal algorithm to search the nearest-neighbour grid set F_1 of Fg_q within the fine-granularity grid index Fg corresponding to Cg; all fine grids contained in the fine-granularity grid index Fg corresponding to Cg are shown in the table below. Taking (56,43) as the centre, the neighbouring grids around Fg_q at radius i × 5 are computed, and the set S_1 of the objects of this coarse grid nearest to q is returned. Specifically (the process is shown in Fig. 9): when i = 0, only the fine grid Fg_q containing (56,43) is considered, and since no object has yet been collected (0 < k), the search extends outward by one circle; when i = 1, the first circle is traversed: the starting cell Fg_start = <10,7> is not in Fg; searching upward, j = 1, <10,8> is not in Fg; searching upward, j = 2, <10,9> is not in Fg; now j = 3 > 2*i and Fg' ≠ Fg_start, so the direction changes to the right: j = 1, <11,9> is not in Fg; searching to the right, j = 2, <12,9> is in Fg, so F_1 = {Fg'} = {(<12,9>, 0, null)}; now j = 3 > 2*i and Fg' ≠ Fg_start, so the direction changes to downward: j = 1, <12,8> is in Fg, F_1 = F_1 ∪ {Fg'} = {(<12,9>, 0, null), (<12,8>, 0, null)}; searching downward, j = 2, <12,7> is in Fg, F_1 = F_1 ∪ {Fg'} = {(<12,9>, 0, null), (<12,8>, 0, null), (<12,7>, 1, (62,36))}; now j = 3 > 2*i and Fg' ≠ Fg_start, so the direction changes to the left: j = 1, <11,7> is not in Fg; searching to the left, j = 2, Fg' = Fg_start, and the first circle ends with F_1 = {(<12,9>, 0, null), (<12,8>, 0, null), (<12,7>, 1, (62,36))}. Since F_1.num = 1 < k = 2, the search extends outward by one more circle, i = 2, and the second circle is traversed in the same form as the first; after the second circle, F_1 = {(<12,9>, 0, null), (<12,8>, 0, null), (<12,7>, 1, (62,36)), (<12,10>, 0, null), (<13,10>, 0, null), (<13,9>, 0, null), (<13,8>, 1, (67,43)), (<13,7>, 0, null)} and F_1.num = 2. To guarantee that F_j contains at least the k objects nearest to q, one further circle is needed, so i = 3 and the third circle is traversed; after it ends, F_1 = {(<12,9>, 0, null), (<12,8>, 0, null), (<12,7>, 1, (62,36)), (<12,10>, 0, null), (<13,10>, 0, null), (<13,9>, 0, null), (<13,8>, 1, (67,43)), (<13,7>, 0, null), (<12,11>, 0, null), (<13,11>, 0, null), (<14,11>, 0, null), (<14,10>, 0, null), (<14,9>, 0, null), (<14,8>, 0, null), (<14,7>, 0, null)}. The distances between q and the objects in the fine grids of F_1 are then computed: dist(q, p_1) ≈ 9.2, where p_1 = (<12,7>, 1, (62,36)), and dist(q, p_2) = 11, where p_2 = (<13,8>, 1, (67,43)); by sorting, the 2 data objects with the smallest distances are taken, giving S_1 = (<9.2, (62,36)>, <11, (67,43)>). (3) In the same way, by running the SDKNN algorithm, the k objects of the other coarse grids of C_q nearest to q are computed, as follows:
For the coarse grid ((<15,30>, <65,75>), 3, 001), running the SDKNN algorithm yields at least k non-empty fine grids nearest to q, namely (<5,14>, 1, (27,71)), (<4,13>, 1, (21,66)) and (<3,14>, 1, (16,72)); by computing the distances and taking the k objects nearest to q, S_2 = (<40.3, (27,71)>, <41.9, (21,66)>).
For the coarse grid ((<15,30>, <35,65>), 4, 004), running the SDKNN algorithm yields at least k non-empty fine grids nearest to q, namely (<5,11>, 1, (26,57)), (<4,10>, 1, (21,53)), (<4,12>, 1, (23,61)) and (<3,12>, 1, (18,61)); by computing the distances and taking the k objects nearest to q, S_3 = (<33.1, (26,57)>, <36.4, (21,53)>).
For the coarse grid ((<15,30>, <20,35>), 2, 003), running the SDKNN algorithm yields at least k non-empty fine grids nearest to q, namely (<5,4>, 1, (26,23)) and (<5,6>, 1, (28,33)); by computing the distances and taking the k objects nearest to q, S_4 = (<36.1, (26,23)>, <29.7, (28,33)>).
For the coarse grid ((<30,60>, <65,75>), 3, 004), running the SDKNN algorithm yields at least k non-empty fine grids nearest to q, namely (<6,14>, 1, (31,73)), (<10,14>, 1, (53,72)) and (<11,14>, 1, (57,71)); by computing the distances and taking the k objects nearest to q, S_5 = (<29.2, (53,72)>, <28, (57,71)>).
For the coarse grid ((<30,60>, <35,65>), 3, 001), running the SDKNN algorithm yields at least k non-empty fine grids nearest to q, namely (<11,12>, 1, (58,63)) and (<11,9>, 1, (59,49)); by computing the distances and taking the k objects nearest to q, S_6 = (<20.1, (58,63)>, <6.7, (59,49)>).
For the coarse grid ((<30,60>, <20,35>), 4, 003), running the SDKNN algorithm yields at least k non-empty fine grids nearest to q, namely (<9,6>, 1, (46,33)), (<10,5>, 1, (52,29)), (<7,5>, 1, (38,26)) and (<11,4>, 1, (57,23)); by computing the distances and taking the k objects nearest to q, S_7 = (<14.1, (46,33)>, <14.6, (52,29)>).
For the coarse grid ((<60,80>, <65,75>), 2, 002), running the SDKNN algorithm yields at least k non-empty fine grids nearest to q, namely (<13,13>, 1, (65,66)) and (<13,14>, 1, (67,72)); by computing the distances and taking the k objects nearest to q, S_8 = (<24.7, (65,66)>, <31.02, (67,72)>).
For the coarse grid ((<60,80>, <20,35>), 1, 003), running the SDKNN algorithm yields at least k non-empty fine grids nearest to q, namely (<12,4>, 1, (64,24)); by computing the distance and taking the k objects nearest to q (all objects are taken when their number is less than k), S_9 = (<20.6, (64,24)>).
(<12,7>,1,(62,36)) | (<13,7>,0,null) | (<14,7>,0,null) | (<15,7>,0,null)
(<12,8>,0,null) | (<13,8>,1,(67,43)) | (<14,8>,0,null) | (<15,8>,0,null)
(<12,9>,0,null) | (<13,9>,0,null) | (<14,9>,0,null) | (<15,9>,0,null)
(<12,10>,0,null) | (<13,10>,0,null) | (<14,10>,0,null) | (<15,10>,0,null)
(<12,11>,0,null) | (<13,11>,0,null) | (<14,11>,0,null) | (<15,11>,0,null)
(<12,12>,0,null) | (<13,12>,0,null) | (<14,12>,0,null) | (<15,12>,0,null)
4) The results S_1, S_2, S_3, S_4, S_5, S_6, S_7, S_8 and S_9 obtained by running the SDKNN algorithm on Slave nodes 001, 002, 003 and 004 are gathered onto Slave node 005 and sorted in ascending order; the first 2 results after sorting are taken, giving the final result S = (<6.7, (59,49)>, <9.2, (62,36)>).
S is output as the final query result.
The present invention analyses the overall data with a grid- and density-based method to obtain a summary estimate of the data distribution, which lays the foundation for establishing the coarse-granularity grid and provides the basis for reducing data skew in the cluster; on the basis of this summary estimate, a combined-granularity distributed in-memory grid index structure composed of non-equal-width coarse grids and equal-width fine grids is established, which eliminates the bottleneck of single-machine processing capacity, improves data search efficiency and supports distributed algorithms, and is the core technique for designing an efficient, distributed KNN algorithm; based on the established equal-width fine-granularity grid index, a lossless nearest-neighbour fine-grid search algorithm is designed that can quickly and accurately locate the neighbouring fine grids of the query object; based on the coarse-fine combined-granularity distributed in-memory grid index structure and the lossless nearest-neighbour fine-grid search algorithm, a scalable, distributed KNN query algorithm is designed, which eliminates the single-machine bottleneck of centralized KNN query algorithms and the poor real-time performance of MapReduce-based KNN query algorithms caused by writing intermediate results back to disk, and realizes fast queries over massive data.
The above embodiment is only a concrete example of the present invention; the scope of patent protection of the present invention includes but is not limited to the above embodiment. Any appropriate change or replacement made by a person of ordinary skill in the art, in accordance with the claims of the KNN query method based on a combined-granularity distributed in-memory grid index described in the present invention, shall fall within the scope of patent protection of the present invention.