Summary of the invention
The technical objective of the present invention is to address the above weaknesses by providing a practical KNN query method based on a combined-granularity distributed in-memory grid index.
A KNN query method based on a combined-granularity distributed in-memory grid index; its specific implementation process is:
Step one, data preprocessing: partition the overall data space based on grid and density, and obtain a summary estimate of the overall data distribution;
Step two, data query:
Establish the combined-granularity distributed in-memory grid index structure, i.e., a non-equal-width coarse-granularity grid index and an equal-width fine-granularity grid index;
Design a distributed KNN query algorithm on top of the above index structure to realize fast KNN queries over massive data: based on the non-equal-width coarse-granularity grid index, search the set of coarse grids adjacent to the coarse grid containing the query object, and determine the Slave node holding each coarse grid of that adjacent set; on those Slave nodes, based on the equal-width fine-granularity grid index, search the set of fine grids adjacent to the fine grid containing the query object, compare the distances between the query object and every object contained in each fine grid of that adjacent set, and thereby obtain the k nearest-neighbour objects of the query.
The detailed process of partitioning the overall data space is:
Each dimension of the overall data space is divided into equal-width fine partitions with a fixed step δ, forming a fine-granularity grid space;
Each data object p is mapped to its corresponding grid;
Each grid is represented by a feature vector g(gid, num) that records the grid number and the number of data objects it contains, where gid is the grid number (unique) and num is the number of data objects contained in the grid.
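As a minimal illustration of this preprocessing step (a sketch only; the patent does not prescribe an implementation, and the function names are illustrative), the following Python snippet maps each object to its equal-width fine cell of step δ and collects the feature vectors g(gid, num):

```python
from collections import Counter

def fine_grid_id(point, delta):
    """Equal-width grid number of a point: gid_i = floor(d_i / delta) in every dimension."""
    return tuple(int(d // delta) for d in point)

def distribution_summary(points, delta):
    """Feature vectors g(gid, num): number of data objects per non-empty fine grid."""
    return dict(Counter(fine_grid_id(p, delta) for p in points))

# With the first objects of the embodiment's data set and delta = 5:
print(distribution_summary([(12, 68), (31, 73), (58, 63)], 5))
# -> {(2, 13): 1, (6, 14): 1, (11, 12): 1}
```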
The process of establishing the combined-granularity distributed in-memory grid index structure is:
According to the summary estimate of the data distribution obtained in step one, the overall data space is partitioned into non-equal-width coarse grids, and the non-equal-width coarse-granularity grid index of the overall data space is established. The Master node of the main-memory cluster maintains the coarse-granularity distributed in-memory grid index structure CGGI of the overall data space and is responsible for distributing data to the Slave nodes of the cluster;
The sub-data space represented by each coarse grid of the above partition is further divided into equal-width fine grids, and a fine-granularity grid index is established for each sub-data space. Each Slave node of the main-memory cluster maintains the fine-granularity distributed in-memory grid index structure FGGI of one or several sub-data spaces; the fine-granularity grid indexes maintained by different Slave nodes do not overlap, i.e., the subspaces they maintain do not overlap.
The concrete process of establishing the non-equal-width coarse-granularity distributed in-memory grid index structure is:
Following the procedure of step one, count the number of data objects contained in each partition of each dimension;
Require each partition of every dimension to contain at least θ data objects: when the number of data objects in a partition is less than θ, merge it with the adjacent partition, until the merged partition contains at least θ data objects or the data space has no remaining partitions;
Through the above counting and merging, the overall data space is divided into a non-equal-width coarse-granularity grid space in which the number of data objects per coarse grid is roughly even;
Establish the coarse-granularity grid index CGGI of the overall data space. Each coarse grid of CGGI is represented by a triple <Cgid, Cgnum, SIP>, where Cgid is the number of the coarse grid, written as (<lb_1, ub_1>, <lb_2, ub_2>, ..., <lb_i, ub_i>, ..., <lb_n, ub_n>), with <lb_i, ub_i> denoting the lower and upper bounds of the grid's partition in the i-th dimension; Cgnum is the number of data objects in the coarse grid; and SIP is the address of the Slave node corresponding to the coarse grid.
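A sketch of the per-dimension merging rule described above, assuming the per-partition object counts from step one are given as a list (the counts in the example are those of the X dimension of the embodiment; helper names are illustrative):

```python
def merge_partitions(counts, theta, delta):
    """Merge adjacent equal-width partitions of one dimension (object counts in `counts`)
    until every resulting coarse partition holds at least `theta` objects; any leftover
    partitions at the end form the last coarse partition.
    Returns <lower bound, upper bound> pairs in the units of the original step `delta`."""
    bounds, start, acc = [], 0, 0
    for i, c in enumerate(counts):
        acc += c
        if acc >= theta:
            bounds.append((start * delta, (i + 1) * delta))
            start, acc = i + 1, 0
    if start < len(counts):            # remaining partitions form the last coarse partition
        bounds.append((start * delta, len(counts) * delta))
    return bounds

# X-dimension counts of the embodiment (16 partitions of width 5, theta = 10):
x_counts = [2, 6, 5, 2, 3, 5, 2, 2, 0, 1, 3, 7, 3, 4, 0, 0]
print(merge_partitions(x_counts, 10, 5))   # [(0, 15), (15, 30), (30, 60), (60, 80)]
```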
The process of establishing the equal-width fine-granularity distributed in-memory grid index structure is: based on the non-equal-width coarse-granularity distributed in-memory grid index structure already established, the sub-data space corresponding to each coarse grid <Cgid, Cgnum, SIP> is further subdivided, taking a fixed step λ as the partition granularity of every dimension, so that each coarse grid is divided into an equal-width fine-granularity grid space; the equal-width fine-granularity distributed in-memory grid index FGGI is then built on this fine-granularity grid space. Each fine grid of FGGI is represented by a triple <Fgid, Fgnum, List>, where Fgid is the number of the fine grid, written as <l_1, l_2, ..., l_n> and unique; Fgnum is the number of data objects in the fine grid; and List is the list of data objects contained in the fine grid.
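A sketch of how a Slave node could build the <Fgid, Fgnum, List> triples of one coarse grid with a fixed step λ (illustrative names; the per-dimension cell number ⌊d_i/λ⌋ is assumed, consistent with the embodiment below):

```python
from collections import defaultdict

def fine_cell(point, lam):
    """Fgid of a point under the equal-width fine partition of step lam: l_i = floor(d_i / lam)."""
    return tuple(int(d // lam) for d in point)

def build_fggi(points_in_coarse_grid, lam):
    """Fine-granularity index of one coarse grid: Fgid -> (Fgnum, List)."""
    cells = defaultdict(list)
    for p in points_in_coarse_grid:
        cells[fine_cell(p, lam)].append(p)
    return {fgid: (len(objs), objs) for fgid, objs in cells.items()}

# Coarse grid (<0,15>, <0,20>) of the embodiment holds four objects; lam = 5:
print(build_fggi([(2, 12), (8, 2), (9, 11), (11, 16)], 5))
# {(0, 2): (1, [(2, 12)]), (1, 0): (1, [(8, 2)]), (1, 2): (1, [(9, 11)]), (2, 3): (1, [(11, 16)])}
```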
Data objects can be inserted into and deleted from both CGGI and FGGI, wherein:
The insertion process of a CGGI data object is: for the insertion of a data object p(d_1, d_2, ..., d_n), the partition containing d_i (i = 1, 2, ..., n) is computed, which determines the coarse grid containing p; the Cgnum of that coarse grid is then updated, i.e., increased by 1;
Meanwhile, the fine-granularity grid index FGGI corresponding to that coarse grid is updated for the insertion: first, through the CGGI update, the Master node distributes data object p to the corresponding coarse grid and Slave node; second, for the insertion of p(d_1, d_2, ..., d_n), the fine-granularity cell of p in each dimension is computed, which determines the Fgid of the fine grid containing p; the Fgnum of that fine grid is updated, i.e., increased by 1, and at the same time p is inserted into List;
The deletion process of a CGGI data object is: for the deletion of a data object p(d_1, d_2, ..., d_n), the partition containing d_i (i = 1, 2, ..., n) is computed, which determines the coarse grid containing p; the Cgnum of that coarse grid is then updated, i.e., decreased by 1;
Meanwhile, the fine-granularity grid index FGGI corresponding to that coarse grid is updated for the deletion: first, through the CGGI deletion operation, the Master node finds the coarse grid and Slave node containing data object p; second, for the deletion of p(d_1, d_2, ..., d_n), the fine-granularity cell of p in each dimension is computed, which determines the Fgid of the fine grid containing p; the Fgnum of that fine grid is updated, i.e., decreased by 1, and at the same time p is removed from List.
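The two-level maintenance just described can be pictured with the following single-process sketch, in which one class stands in for the Master and Slave roles (assumed, simplified structures; not the patent's implementation):

```python
class CombinedGrainIndex:
    """Toy stand-in for CGGI + FGGI maintenance on one machine (illustrative only)."""

    def __init__(self, coarse_grids, lam):
        # coarse_grids: per-dimension (lower, upper) bound tuples, e.g. ((0, 15), (0, 20))
        self.coarse = {cg: 0 for cg in coarse_grids}      # Cgid -> Cgnum
        self.fine = {cg: {} for cg in coarse_grids}       # Cgid -> {Fgid: List}
        self.lam = lam

    def _coarse_of(self, p):
        for cg in self.coarse:
            if all(lb <= d < ub for d, (lb, ub) in zip(p, cg)):
                return cg
        raise ValueError("point outside the data space")

    def insert(self, p):
        cg = self._coarse_of(p)
        self.coarse[cg] += 1                              # Cgnum += 1
        fgid = tuple(int(d // self.lam) for d in p)
        self.fine[cg].setdefault(fgid, []).append(p)      # Fgnum += 1, append to List

    def delete(self, p):
        cg = self._coarse_of(p)
        self.coarse[cg] -= 1                              # Cgnum -= 1
        fgid = tuple(int(d // self.lam) for d in p)
        self.fine[cg][fgid].remove(p)                     # Fgnum -= 1, remove from List
```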
The KNN query process searches for the k nearest-neighbour objects of an object q; its concrete query process is:
First, the Master node runs the data-object mapping algorithm MOG, maps the query object onto the coarse-granularity grid index CGGI, and determines the coarse grid Cg_q of CGGI containing q;
Second, the adjacent-grid search algorithm SNNG is run to search the grids adjacent to the coarse grid, i.e., the grids adjacent to Cg_q; it is judged whether the total number of objects in Cg_q and its adjacent grids reaches k; if it is less than k, the grids adjacent to those adjacent grids are searched in turn, until the total number of objects reaches k or the whole coarse grid space has been searched; this finally yields the adjacent coarse grid set C_q of Cg_q, and the Slave nodes holding the coarse grids of C_q are determined;
The SDKNN algorithm is run on the Slave nodes that hold the coarse grids of C_q, and each Slave node outputs its query result;
The results output by the Slave nodes are gathered onto one Slave node to obtain the result set S; S is sorted in ascending order, and the first k objects are output as the final result.
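The final gather-and-sort step can be sketched as follows (assuming each Slave node has already returned its candidate set S_i as a list of objects; math.dist is used for the Euclidean distance):

```python
import math

def reduce_results(candidate_sets, q, k):
    """Gather the per-Slave candidate sets S_1..S_n, sort all candidates by distance to q
    in ascending order and keep the first k as the final answer."""
    merged = [p for s in candidate_sets for p in s]
    merged.sort(key=lambda p: math.dist(p, q))
    return merged[:k]

# With the embodiment's query q = (56, 43) and k = 2:
s1 = [(62, 36), (67, 43)]
s6 = [(58, 63), (59, 49)]
print(reduce_results([s1, s6], (56, 43), 2))   # [(59, 49), (62, 36)]
```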
The MOG algorithm runs on the Master node. Its specific implementation process is: given the query data object q(d_1, d_2, ..., d_n) and the coarse grid set C, determine the partition of the coarse grid containing q in every dimension, map q into CGGI, and determine the coarse grid Cg_q containing q.
The SNNG algorithm runs on the Master node. According to the definition of "adjacent grid", it computes the grids adjacent to the coarse grid Cg_q containing q and obtains the adjacent coarse grid set C_q; it counts the total number num of objects in the coarse grids of C_q; if num >= k it outputs C_q; otherwise it executes SNNG for each coarse grid in C_q, until the number of objects num in the coarse grids of C_q satisfies num >= k or the whole coarse grid space has been searched, and then outputs C_q.
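A compact sketch of this expansion loop, assuming the coarse grids are given with their object counts and an adjacency test implementing the "adjacent grid" definition is supplied (a sketch only, not the patent's listing):

```python
def snng(all_grids, cg_q, k, are_adjacent):
    """all_grids: {Cgid: Cgnum}; cg_q: coarse grid of the query object;
    are_adjacent(a, b): the "adjacent grid" test.  Starting from cg_q, repeatedly add
    adjacent coarse grids until the grids collected hold at least k objects or the
    whole coarse grid space has been visited."""
    collected = {cg_q}
    frontier = {cg_q}
    while sum(all_grids[g] for g in collected) < k and len(collected) < len(all_grids):
        frontier = {g for g in all_grids
                    for f in frontier
                    if g not in collected and are_adjacent(g, f)}
        if not frontier:
            break
        collected |= frontier
    return collected
```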
The SDKNN algorithm is a distributed, scalable KNN algorithm. It runs in a distributed manner on every Slave node that stores the coarse grid Cg_q containing the query object q or a coarse grid of C_q, and each Slave node returns the k objects of that node nearest to q;
Taking one Slave node as an example, the specific implementation process of the algorithm is:
Let Slave node 1 store a coarse grid Cg_j with Cg_j ∈ C_q, and let Fg_j be the fine-granularity grid index of Cg_j on this Slave node;
Execute the Circle-Traversal algorithm on Fg_j to obtain the adjacent fine grid set F_j that contains at least the k nearest-neighbour objects of q. Circle-Traversal is a lossless nearest-neighbour fine-grid search algorithm; its inputs are the fine-granularity grid index Fg_j, the fine grid Fg_q containing the query object q, the step λ of the fine-granularity grid partition, and the loop count i. Centred on Fg_q, it searches circle by circle for the fine grids that belong to Fg_j, and obtains the set F_j of fine grids belonging to Fg_j that surround the fine grid of q;
For every object p of any fine grid Fg in F_j, i.e., {p | p ∈ Fg, Fg ∈ F_j}, compute the distance dist(p, q) between p and q, sort by distance, and return the set S_1 of the k objects nearest to q;
Repeat the above steps to obtain the sets S_2, S_3, ..., S_n of the k objects nearest to q on the Slave nodes holding the other coarse grids of C_q; sort the objects of {S_1, S_2, S_3, ..., S_n} by their distance to q, and finally return the set S of the k objects nearest to q.
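The per-Slave distance step of SDKNN can be sketched as follows (assuming the neighbour fine-grid set F_j is given as a mapping from Fgid to object list; the figures reproduce the S_1 of the embodiment):

```python
import heapq
import math

def slave_local_knn(fine_cells, q, k):
    """fine_cells: the neighbour fine-granularity grid set F_j found by Circle-Traversal,
    given as {Fgid: list of objects}.  Return this Slave's k objects nearest to q,
    as (distance, object) pairs in ascending order of distance."""
    candidates = [(math.dist(p, q), p) for objs in fine_cells.values() for p in objs]
    return heapq.nsmallest(k, candidates)

# Slave holding coarse grid (<60,80>, <35,65>) of the embodiment, q = (56, 43), k = 2:
f_j = {(12, 7): [(62, 36)], (13, 8): [(67, 43)], (12, 8): [], (12, 9): []}
print(slave_local_knn(f_j, (56, 43), 2))
# [(9.219..., (62, 36)), (11.0, (67, 43))]
```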
The KNN query method based on a combined-granularity distributed in-memory grid index of the present invention has the following advantages. First, the overall data is analysed with a grid- and density-based method to obtain a summary estimate of the data distribution, so as to reduce data skew in the cluster as much as possible. Second, on the basis of this summary estimate, a coarse-fine combined-granularity distributed in-memory grid index structure is established, eliminating the bottleneck of single-machine processing capacity, improving data search efficiency and supporting distributed algorithms. Third, based on the fine-granularity grid index, a lossless nearest-neighbour fine-grid search algorithm is designed, which quickly and accurately locates the neighbouring fine grids of the query object. Finally, based on this distributed in-memory index structure and the nearest-neighbour fine-grid search algorithm, a scalable, distributed KNN query algorithm is designed, eliminating both the single-machine bottleneck of centralized KNN algorithms and the poor real-time performance of MapReduce-based KNN algorithms caused by writing intermediate results back to disk. The method is practical and easy to popularize.
Embodiment
The invention is further described below in conjunction with the drawings and specific embodiments.
The invention provides a KNN query method based on a combined-granularity distributed in-memory grid index. Aiming at the low search efficiency of traditional KNN query algorithms in a big-data environment, the overall data is analysed with a grid- and density-based method so as to reduce data skew as much as possible, and a coarse-fine combined-granularity distributed in-memory grid index structure is designed to improve data search efficiency and support distributed algorithms; on this basis, a scalable, distributed KNN query algorithm based on the distributed in-memory grid index is designed to realize fast queries over massive data.
The terms used in the method are explained as follows: an in-memory index is a data structure that organizes and orders one or several attribute values of in-memory data; a distributed in-memory index is an in-memory index that can be partitioned and deployed in a distributed manner onto the processing nodes of a main-memory cluster; a grid means that every dimension A_i (i = 1, 2, ..., d) of a d-dimensional data space A is divided into p_i intervals, and each grid cell g is composed of one interval c_i (c_i = 1, 2, ..., p_i) per dimension, written g = (c_1, c_2, ..., c_d); two grids are adjacent grids if, in every dimension i, their partitions share a lower bound, share an upper bound, or the upper bound of one equals the lower bound of the other; a KNN query (k-nearest-neighbour query) refers to the result set formed by the k objects nearest to a designated object q, i.e., letting the whole object set be O and the KNN query result set be O', for every p' in O' and every p in O outside O' there holds dist(p', q) ≤ dist(p, q), where dist(p, q) denotes the distance between objects p and q; the Master node is the master node of the cluster, in charge of distributed data distribution and the decomposition and execution of tasks; a Slave node is a slave node of the cluster, in charge of distributed data storage and task execution; the scalable, distributed KNN query algorithm is a KNN query algorithm that, based on the distributed in-memory index, can run in a distributed manner on the processing nodes of the main-memory cluster and performs query processing cooperatively.
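A sketch of the "adjacent grid" test for the non-equal-width coarse grids, as it is applied in the embodiment below (per-dimension bounds are compared; function names are illustrative):

```python
def dims_touch(interval_a, interval_b):
    """One-dimensional test of the "adjacent grid" definition: the two partitions share a
    lower bound, share an upper bound, or the upper bound of one equals the lower bound
    of the other."""
    (lb1, ub1), (lb2, ub2) = interval_a, interval_b
    return lb1 == lb2 or ub1 == ub2 or ub1 == lb2 or lb1 == ub2

def are_adjacent(grid_a, grid_b):
    """Two coarse grids are adjacent iff the one-dimensional test holds in every dimension."""
    return all(dims_touch(a, b) for a, b in zip(grid_a, grid_b))

# Embodiment check: Cg2 = (<15,30>, <20,35>) is adjacent to Cgq = (<30,60>, <35,65>):
print(are_adjacent(((15, 30), (20, 35)), ((30, 60), (35, 65))))   # True
print(are_adjacent(((0, 15), (0, 20)),  ((30, 60), (35, 65))))    # False
```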
As shown in Figs. 1 and 2, the present invention analyses the overall data with a grid- and density-based method and forms a summary estimate of the overall data distribution; on this basis, the combined-granularity distributed in-memory grid index structure is established: the Master node of the cluster builds and maintains the non-equal-width coarse-granularity grid index of the whole space and is responsible for distributing data to the Slave nodes of the cluster, each Slave node holds one or several coarse grids, refines each coarse grid into equal-width fine cells and builds the corresponding equal-width fine-granularity grid index, and the fine-granularity grid indexes maintained by different Slave nodes do not overlap, i.e., the subspaces they maintain do not overlap; finally, based on the coarse-fine combined-granularity distributed in-memory grid index structure and the lossless nearest-neighbour fine-grid search algorithm, a scalable, distributed KNN query algorithm is designed to realize fast KNN queries over massive data. Its specific implementation process is:
Step one, data preprocessing: partition the overall data space based on grid and density, and obtain a summary estimate of the overall data distribution;
Step two, data query: establish the combined-granularity distributed in-memory grid index structure, i.e., the non-equal-width coarse-granularity grid index and the equal-width fine-granularity grid index; design the distributed KNN query algorithm on the basis of the above index structure to realize fast KNN queries over massive data: based on the non-equal-width coarse-granularity grid index, search the set of coarse grids adjacent to the coarse grid containing the query object and determine the Slave node holding each coarse grid of that adjacent set; on those Slave nodes, based on the equal-width fine-granularity grid index, search the set of fine grids adjacent to the fine grid containing the query object, compare the distances between the query object and every object contained in each fine grid of that adjacent set, and thereby obtain the k nearest-neighbour objects of the query.
The detailed process of partitioning the overall data space is: each dimension of the overall data space is divided into equal-width fine partitions with a fixed step δ, forming a fine-granularity grid space; each data object p is mapped to its corresponding grid, for example a data object p(d_1, d_2, ..., d_n), with n the dimensionality of the overall data space, is mapped to the grid whose number in the i-th dimension is ⌊d_i/δ⌋; each grid is represented by a feature vector g(gid, num) that records the grid number and the number of data objects it contains, where gid is the grid number (unique) and num is the number of data objects contained in the grid.
The process of establishing the combined-granularity distributed in-memory grid index structure is: according to the summary estimate of the data distribution obtained in step one, the overall data space is partitioned into non-equal-width coarse grids, and the non-equal-width coarse-granularity grid index of the overall data space is established; the Master node of the main-memory cluster maintains the coarse-granularity distributed in-memory grid index structure CGGI (coarse-grained grid index) of the overall data space and is responsible for distributing data to the Slave nodes of the cluster; the sub-data space represented by each coarse grid of the above partition is further divided into equal-width fine grids, and a fine-granularity grid index is established for each sub-data space; each Slave node of the main-memory cluster maintains the fine-granularity distributed in-memory grid index structure FGGI (fine-grained grid index) of one or several sub-data spaces, and the fine-granularity grid indexes maintained by different Slave nodes do not overlap, i.e., the subspaces they maintain do not overlap.
The concrete process of establishing the non-equal-width coarse-granularity distributed in-memory grid index structure is: following the procedure of step one, count the number of data objects contained in each partition of each dimension; require each partition of every dimension to contain at least θ data objects, and when the number of data objects in a partition is less than θ, merge it with the adjacent partition, until the merged partition contains at least θ data objects or the data space has no remaining partitions; through the above counting and merging, the overall data space is divided into a non-equal-width coarse-granularity grid space in which the number of data objects per coarse grid is roughly even; establish the coarse-granularity grid index CGGI of the overall data space, in which each coarse grid is represented by a triple <Cgid, Cgnum, SIP>, where Cgid is the number of the coarse grid, written as (<lb_1, ub_1>, <lb_2, ub_2>, ..., <lb_i, ub_i>, ..., <lb_n, ub_n>) with <lb_i, ub_i> the lower and upper bounds of the grid's partition in the i-th dimension; Cgnum is the number of data objects in the coarse grid; and SIP is the address of the Slave node corresponding to the coarse grid.
The process of establishing the equal-width fine-granularity distributed in-memory grid index structure is: based on the non-equal-width coarse-granularity distributed in-memory grid index structure already established, the sub-data space corresponding to each coarse grid <Cgid, Cgnum, SIP> is further subdivided, taking a fixed step λ as the partition granularity of every dimension, so that each coarse grid is divided into an equal-width fine-granularity grid space; the equal-width fine-granularity distributed in-memory grid index FGGI is built on this fine-granularity grid space, in which each fine grid is represented by a triple <Fgid, Fgnum, List>, where Fgid is the number of the fine grid, written as <l_1, l_2, l_3, ..., l_n> and unique; Fgnum is the number of data objects in the fine grid; and List is the list of data objects contained in the fine grid.
Data objects can be inserted into and deleted from both CGGI and FGGI. The insertion process of a CGGI data object is: for the insertion of a data object p(d_1, d_2, ..., d_n), the partition containing d_i (i = 1, 2, ..., n) is computed, which determines the coarse grid containing p; the Cgnum of that coarse grid is updated, i.e., increased by 1; meanwhile, the fine-granularity grid index FGGI corresponding to that coarse grid is updated for the insertion: first, through the CGGI update, the Master node distributes data object p to the corresponding coarse grid and Slave node; second, for the insertion of p(d_1, d_2, ..., d_n), the fine-granularity cell of p in each dimension is computed, which determines the Fgid of the fine grid containing p; the Fgnum of that fine grid is updated, i.e., increased by 1, and at the same time p is inserted into List;
The deletion process of a CGGI data object is: for the deletion of a data object p(d_1, d_2, ..., d_n), the partition containing d_i (i = 1, 2, ..., n) is computed, which determines the coarse grid containing p; the Cgnum of that coarse grid is updated, i.e., decreased by 1; meanwhile, the fine-granularity grid index FGGI corresponding to that coarse grid is updated for the deletion: first, through the CGGI deletion operation, the Master node finds the coarse grid and Slave node containing data object p; second, for the deletion of p(d_1, d_2, ..., d_n), the fine-granularity cell of p in each dimension is computed, which determines the Fgid of the fine grid containing p; the Fgnum of that fine grid is updated, i.e., decreased by 1, and at the same time p is removed from List.
The KNN query process searches for the k nearest-neighbour objects of an object q; its concrete query process is: first, the Master node runs the data-object mapping algorithm MOG (map object to grid), maps the query object q onto the coarse-granularity grid index CGGI, and determines the coarse grid Cg_q of CGGI containing q; second, the adjacent-grid search algorithm SNNG (search nearest neighbor grid) is run to search the grids adjacent to the coarse grid, i.e., the grids adjacent to Cg_q; it is judged whether the total number of objects in Cg_q and its adjacent grids reaches k; if it is less than k, the grids adjacent to those adjacent grids are searched in turn, until the total number of objects reaches k or the whole coarse grid space has been searched, finally yielding the adjacent coarse grid set C_q of Cg_q, and the Slave nodes holding the coarse grids of C_q are determined; SDKNN (scalable distributed KNN algorithm) is run on the Slave nodes that hold the coarse grids of C_q, and each Slave node outputs its query result; the results output by the Slave nodes are gathered onto one Slave node to obtain the result set S; S is sorted in ascending order, and the first k objects are output as the final result.
The MOG algorithm runs on the Master node. Its specific implementation process is: given the query data object q(d_1, d_2, ..., d_n) and the coarse grid set C, determine the partition of the coarse grid containing q in every dimension, map q into CGGI, and determine the coarse grid Cg_q containing q. Taking n = 2 (i.e., a two-dimensional data space) as an example, the detailed process of the algorithm is:
The SNNG algorithm runs on the Master node. According to the definition of "adjacent grid", it computes the grids adjacent to the coarse grid Cg_q containing q and obtains the adjacent coarse grid set C_q; it counts the total number num of objects in the coarse grids of C_q; if num >= k it outputs C_q; otherwise it executes SNNG for each coarse grid in C_q, until the number of objects num in the coarse grids of C_q satisfies num >= k or the whole coarse grid space has been searched, and then outputs C_q. Its specific implementation process is:
The SDKNN algorithm is a distributed, scalable KNN algorithm. It runs in a distributed manner on every Slave node that stores the coarse grid Cg_q containing the query object q or a coarse grid of the adjacent grid set C_q of Cg_q, and each Slave node returns the k objects of that node nearest to q. Taking Slave node 1 as an example, the detailed process is:
Let Slave node 1 store a coarse grid Cg_j with Cg_j ∈ C_q, and let Fg_j be the fine-granularity grid index of Cg_j on Slave node 1;
Execute the Circle-Traversal algorithm on Fg_j to obtain the fine grid set F_j that contains at least the k nearest-neighbour objects of q. Circle-Traversal is a lossless adjacent fine-grid search algorithm; its inputs are the fine-granularity grid index Fg_j, the fine grid Fg_q containing the query object q, the step λ of the fine-granularity grid partition, and the loop count i. Centred on Fg_q, it searches circle by circle for the fine grids that belong to Fg_j and obtains the fine grid set F_j that contains at least the k nearest-neighbour objects of q;
For every data object p contained in any fine grid Fg of F_j, i.e., {p | p ∈ Fg, Fg ∈ F_j}, compute the distance dist(p, q) between p and q, sort by distance, and return the set S_1 of the k objects nearest to q.
In the same way, the sets S_2, S_3, ..., S_n of the k objects nearest to q on the Slave nodes holding the other coarse grids of C_q are obtained; the objects of {S_1, S_2, S_3, ..., S_n} are sorted by their distance to q, and the set S of the k objects nearest to q is finally returned.
The detailed process of the above Circle-Traversal algorithm is:
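The algorithm listing itself is not reproduced in this text; as an illustration only, the following simplified Python sketch conveys the idea under the stated inputs (it enumerates each ring at once instead of walking it cell by cell, as the embodiment below does):

```python
def circle_traversal(fg_j, fgid_q, k):
    """Simplified sketch of Circle-Traversal (not the patent's exact listing): collect the
    non-empty fine cells of this Slave's index fg_j ({Fgid: object list}) ring by ring
    around the fine cell of q; once at least k objects have been collected, traverse one
    further ring so that no nearer object can be missed."""
    def ring_of(cell):
        return max(abs(c - cq) for c, cq in zip(cell, fgid_q))

    max_ring = max((ring_of(c) for c in fg_j), default=0)
    collected, found, i = {}, 0, 0
    while i <= max_ring:
        for cell, objs in fg_j.items():
            if ring_of(cell) == i:
                collected[cell] = objs
                found += len(objs)
        if found >= k:               # expand exactly one more ring, then stop
            for cell, objs in fg_j.items():
                if ring_of(cell) == i + 1:
                    collected[cell] = objs
            break
        i += 1
    return collected

# Embodiment: Slave holding Cg (<60,80>, <35,65>), q in fine cell (11, 8), k = 2:
fg = {(12, 7): [(62, 36)], (13, 8): [(67, 43)]}
print(sorted(circle_traversal(fg, (11, 8), 2)))   # [(12, 7), (13, 8)]
```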
The SDKNN algorithm is:
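The SDKNN listing is likewise not reproduced here; a simplified slave-side sketch is given below (it reuses the circle_traversal and slave_local_knn sketches shown earlier in this description; structure names are illustrative):

```python
def sdknn_on_slave(fggi_of_coarse_grids, q, k, lam):
    """Slave-side SDKNN sketch: for every coarse grid of C_q stored on this Slave
    (one FGGI dict per coarse grid), collect the neighbour fine cells of q with
    circle_traversal, compute distances with slave_local_knn, and return this
    node's k nearest objects as (distance, object) pairs."""
    fgid_q = tuple(int(d // lam) for d in q)
    candidates = []
    for fg_j in fggi_of_coarse_grids:
        neighbour_cells = circle_traversal(fg_j, fgid_q, k)
        candidates.extend(slave_local_knn(neighbour_cells, q, k))
    candidates.sort()
    return candidates[:k]
```

The Master node then gathers the per-node results and applies the final ascending sort, as in the reduce_results sketch given earlier.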
Combining the characteristics of a main-memory cluster, the present invention proposes and establishes a coarse-fine combined-granularity distributed in-memory grid index structure and designs a distributed KNN query algorithm based on this index structure, extending the traditionally centralized KNN algorithm to a distributed in-memory cluster environment and alleviating the inefficiency, in a big-data environment, of both centralized KNN query algorithms and KNN query algorithms based on the MapReduce framework.
Specific embodiment:
1. Without loss of generality, a cluster of 6 servers is used as the experiment platform (1 server as the Master node and 5 as Slave nodes), and the technical solution of the present invention is described in detail for a KNN query over two-dimensional spatial data. The overall data is shown in the table below, and its spatial distribution is shown in Fig. 3.
(12,68) | (31,73) | (58,63) | (57,23) | (4,26)
(28,33) | (11,16) | (56,8) | (21,66) | (16,72)
(13,56) | (62,78) | (52,29) | (7,34) | (32,19)
(13,26) | (53,16) | (23,61) | (18,61) | (26,57)
(65,66) | (59,49) | (6,23) | (28,13) | (26,23)
(56,43) | (27,71) | (11,63) | (7,55) | (67,72)
(64,24) | (9,11) | (38,26) | (67,43) | (66,16)
(53,72) | (8,73) | (21,53) | (46,33) | (62,36)
(8,2) | (37,13) | (2,12) | (57,71) | (56,76)
2. The data in Table 1 is spatially partitioned with the grid- and density-based method. Taking a fixed step δ = 5, the data in Table 1 is mapped to the corresponding grids and the number gid of the grid to which each data object belongs is determined; the results are shown in the data-mapping table below. For example, for a data object p(d_1, d_2), the number of the grid to which p belongs is (⌊d_1/5⌋, ⌊d_2/5⌋), so p(12, 68) falls into grid <2,13>. From the data-mapping results, the grid- and density-based spatial partition of the overall data is obtained; each grid is represented by a feature vector g(gid, num), the feature vectors of all non-empty grids are shown in the table of non-empty grid feature vectors below, and the spatial partition and distribution of the overall data are shown in Fig. 4.
g(<2,13>,1) | g(<6,14>,1) | g(<11,12>,1) | g(<11,4>,1) | g(<0,5>,1)
g(<3,14>,1) | g(<2,11>,1) | g(<12,15>,1) | g(<10,5>,1) | g(<1,6>,1)
g(<3,12>,1) | g(<5,11>,1) | g(<13,13>,1) | g(<11,9>,1) | g(<1,4>,1)
g(<2,12>,1) | g(<1,11>,1) | g(<13,14>,1) | g(<12,4>,1) | g(<1,2>,1)
g(<1,14>,1) | g(<4,10>,1) | g(<9,6>,1) | g(<12,7>,1) | g(<1,0>,1)
g(<2,3>,1) | g(<2,5>,1) | g(<5,4>,1) | g(<13,8>,1) | g(<0,2>,1)
g(<10,3>,1) | g(<11,8>,1) | g(<13,3>,1) | g(<11,14>,1) | g(<4,13>,1)
g(<5,14>,1) | g(<10,14>,1) | g(<11,15>,1) | g(<5,6>,1) | g(<6,3>,1)
g(<5,2>,1) | g(<7,5>,1) | g(<7,2>,1) | g(<11,1>,1) | g(<4,12>,1)
3. Based on the grid feature vectors and the spatial partition of the overall data obtained in step 2 above, the non-equal-width coarse-granularity distributed in-memory grid index is established. The concrete steps are as follows.
1) Count the partitions of the X dimension of the space shown in Fig. 3; the concrete results are shown in the following tables.
(a)
0th partition | 1st partition | 2nd partition | 3rd partition
g(<0,2>,1) | g(<1,14>,1) | g(<2,13>,1) | g(<3,14>,1)
g(<0,5>,1) | g(<1,11>,1) | g(<2,12>,1) | g(<3,12>,1)
 | g(<1,6>,1) | g(<2,3>,1) | 
 | g(<1,4>,1) | g(<2,11>,1) | 
 | g(<1,2>,1) | g(<2,5>,1) | 
 | g(<1,0>,1) | | 
(b)
4th partition | 5th partition | 6th partition | 7th partition
g(<4,10>,1) | g(<5,14>,1) | g(<6,3>,1) | g(<7,5>,1)
g(<4,13>,1) | g(<5,2>,1) | g(<6,14>,1) | g(<7,2>,1)
g(<4,12>,1) | g(<5,4>,1) | | 
 | g(<5,6>,1) | | 
 | g(<5,11>,1) | | 
(c)
8th partition | 9th partition | 10th partition | 11th partition
 | g(<9,6>,1) | g(<10,5>,1) | g(<11,12>,1)
 | | g(<10,3>,1) | g(<11,4>,1)
 | | g(<10,14>,1) | g(<11,9>,1)
 | | | g(<11,8>,1)
 | | | g(<11,14>,1)
 | | | g(<11,15>,1)
 | | | g(<11,1>,1)
(d)
12th partition | 13th partition | 14th partition | 15th partition
g(<12,4>,1) | g(<13,14>,1) | | 
g(<12,7>,1) | g(<13,8>,1) | | 
g(<12,15>,1) | g(<13,13>,1) | | 
 | g(<13,3>,1) | | 
The partitions of the Y dimension of the space shown in Fig. 3 are counted in the same way; the concrete results are shown in the following Y-dimension partition statistics tables.
(a)
0th partition | 1st partition | 2nd partition | 3rd partition
g(<1,0>,1) | g(<11,1>,1) | g(<5,2>,1) | g(<2,3>,1)
 | | g(<1,2>,1) | g(<12,3>,1)
 | | g(<0,2>,1) | g(<13,3>,1)
 | | g(<7,2>,1) | g(<6,3>,1)
(b)
4th partition | 5th partition | 6th partition | 7th partition
g(<5,4>,1) | g(<2,5>,1) | g(<9,6>,1) | g(<12,7>,1)
g(<11,4>,1) | g(<7,5>,1) | g(<5,6>,1) | 
g(<12,4>,1) | g(<10,5>,1) | g(<1,6>,1) | 
g(<1,4>,1) | g(<0,5>,1) | | 
(c)
8th partition | 9th partition | 10th partition | 11th partition
g(<11,8>,1) | g(<11,9>,1) | g(<4,10>,1) | g(<2,11>,1)
g(<13,8>,1) | | | g(<5,11>,1)
 | | | g(<1,11>,1)
(d)
12th partition | 13th partition | 14th partition | 15th partition
g(<3,12>,1) | g(<2,13>,1) | g(<3,14>,1) | g(<12,15>,1)
g(<2,12>,1) | g(<13,13>,1) | g(<1,14>,1) | g(<11,15>,1)
g(<11,12>,1) | g(<4,13>,1) | g(<5,14>,1) | 
g(<4,12>,1) | | g(<6,14>,1) | 
 | | g(<10,14>,1) | 
 | | g(<13,14>,1) | 
 | | g(<11,14>,1) | 
2) Take the parameter θ = 10 and merge the partitions of the X dimension and the Y dimension respectively, so as to partition the overall data as evenly as possible.
First, the X dimension is scanned from low to high according to partition number. The 0th partition has 2 data objects, fewer than θ, so it is merged with the 1st partition, giving 8 data objects after merging, still fewer than θ, so it is merged with the 2nd partition, giving 13 data objects, more than θ, and the merging stops; thus the 0th, 1st and 2nd partitions of the X dimension are merged. The remaining partitions are scanned in the same way: the 3rd, 4th and 5th partitions of the X dimension are merged, the 6th to 11th partitions (six partitions) are merged, and the 12th to 15th partitions (four partitions) are merged; the result is shown in Fig. 5.
Second, the partitions of the Y dimension are merged in the same way: the 0th to 3rd partitions (four partitions) are merged, the 4th to 6th partitions (three partitions) are merged, the 7th to 12th partitions (six partitions) are merged, the 13th and 14th partitions are merged, and the 15th partition remains a partition of its own; the result is shown in Fig. 6.
3) Based on the result of merging the X- and Y-dimension partitions in 2), the coarse-granularity grid index CGGI shown in Fig. 7 is established. Each coarse grid is represented by a feature-vector triple <Cgid, Cgnum, SIP>, where Cgid is the number of the coarse grid, written as (<lb_1, ub_1>, <lb_2, ub_2>), with <lb_i, ub_i> the lower and upper bounds of the grid's partition in the i-th dimension; Cgnum is the number of data objects in the coarse grid; and SIP is the address of the Slave node corresponding to the coarse grid. Four Slave nodes, numbered 001, 002, 003 and 004, are used to store the coarse grids. All coarse grids of CGGI are scanned in turn and distributed to the 4 Slave nodes according to the following rule: first, scan the partitions of the X dimension from small to large, and within each X partition scan the partitions of the Y dimension from small to large; then, distribute each coarse grid to the Slave node that currently holds the fewest data objects. For example, initially the 4 Slave nodes each hold 0 data objects; the scan starts from Cgid (<0,15>, <0,20>), which is distributed to Slave node 001; (<0,15>, <20,35>) is distributed to Slave node 002, (<0,15>, <35,65>) to Slave node 003, and (<0,15>, <65,75>) to Slave node 004; at this point the object counts of 001, 002, 003 and 004 are 4, 4, 3 and 2 respectively, so (<0,15>, <75,80>) is distributed to node 004, and the remaining coarse grids are distributed to the corresponding Slave nodes in the same way. Finally, the storage nodes and feature vectors of all coarse grids of CGGI are shown in the following table of coarse-grid feature vectors.
Slave node 001 | Slave node 002 | Slave node 003 | Slave node 004
((<0,15>,<0,20>),4,001) | ((<0,15>,<20,35>),4,002) | ((<0,15>,<35,65>),3,003) | ((<0,15>,<65,75>),2,004)
((<15,30>,<65,75>),3,001) | ((<15,30>,<75,80>),0,002) | ((<15,30>,<20,35>),2,003) | ((<0,15>,<75,80>),0,004)
((<30,60>,<35,65>),3,001) | ((<30,60>,<0,20>),4,002) | ((<30,60>,<20,35>),4,003) | ((<15,30>,<0,20>),1,004)
((<60,80>,<35,65>),2,001) | ((<30,60>,<75,80>),1,002) | ((<60,80>,<20,35>),1,003) | ((<15,30>,<35,65>),4,004)
 | ((<60,80>,<0,20>),1,002) | ((<60,80>,<75,80>),1,003) | ((<30,60>,<65,75>),3,004)
 | ((<60,80>,<65,75>),2,002) | | 
4. For each coarse grid of the non-equal-width coarse-granularity distributed in-memory grid index structure established in step 3 above, the equal-width fine-granularity distributed in-memory grid index structure FGGI is established. Taking a fixed step λ = 5, each coarse grid is divided into fine cells, and each fine grid of FGGI is represented by a feature-vector triple <Fgid, Fgnum, List>, where Fgid is the number of the fine grid, written as <l_1, l_2> and unique; Fgnum is the number of data objects in the fine grid; and List is the list of data objects contained in the fine grid. For example, the fine-granularity grid index of the coarse grid Cg((<0,15>, <0,20>), 4, 001) is shown in Fig. 8; the feature vectors of its non-empty fine grids are (<0,2>, 1, (2,12)), (<1,0>, 1, (8,2)), (<1,2>, 1, (9,11)) and (<2,3>, 1, (11,16)). The fine-granularity grid indexes of all the other coarse grids are computed in the same way.
5. KNN query: find the 2 nearest-neighbour objects of the data object q(56, 43).
1) Based on the coarse-fine combined-granularity distributed in-memory grid index structures CGGI and FGGI established in steps 3 and 4, the MOG algorithm is run to determine the coarse grid Cg_q of CGGI containing q. The concrete operations of the algorithm are as follows: (1) initialize Cg_q as empty; (2) determine the X-dimension partition containing q: the X-dimension partitions of CGGI sorted in ascending order are (<0,15>, <15,30>, <30,60>, <60,80>), the value of q in the X dimension is 56, and by comparison the X-dimension partition of q is <30,60>; (3) determine the Y-dimension partition containing q: the Y-dimension partitions of CGGI sorted in ascending order are (<0,20>, <20,35>, <35,65>, <65,75>, <75,80>), the value of q in the Y dimension is 43, and by comparison the Y-dimension partition of q is <35,65>; (4) from (2) and (3), the Cgid of the coarse grid containing q is (<30,60>, <35,65>), and the final result Cg_q = ((<30,60>, <35,65>), 3, 001) is output.
2) Based on the coarse-fine combined-granularity distributed in-memory grid index structures CGGI and FGGI established in steps 3 and 4, the SNNG algorithm is run to find the adjacent coarse grid set C_q of Cg_q. The concrete operations of the algorithm are as follows: (1) let C be the set of all coarse grids of CGGI; for each coarse grid in C, compute in turn, according to the definition of "adjacent grid", whether it is adjacent to Cg_q, and delete it from C after the comparison, finishing when C is empty. For example, for the coarse grid Cg_1 = ((<0,15>, <0,20>), 4, 001), whether it is adjacent to Cg_q is computed according to the definition of "adjacent grid": first, adjacency in the X dimension is checked; the X-dimension partition of Cg_1 is <0,15> and that of Cg_q is <30,60>, and since 0 ≠ 60, 15 ≠ 30, 0 ≠ 30 and 15 ≠ 60, Cg_1 and Cg_q are not adjacent in the X dimension, so Cg_1 is not an adjacent grid of Cg_q. For the coarse grid Cg_2 = ((<15,30>, <20,35>), 2, 003), whether it is adjacent to Cg_q is likewise computed: first, adjacency in the X dimension is checked; the X-dimension partition of Cg_2 is <15,30> and that of Cg_q is <30,60>, and 30 == 30 (the upper bound of Cg_2 in the X dimension equals the lower bound of Cg_q in the X dimension), so Cg_2 and Cg_q are adjacent in the X dimension; second, adjacency in the Y dimension is checked; the Y-dimension partition of Cg_2 is <20,35> and that of Cg_q is <35,65>, and 35 == 35 (the upper bound of Cg_2 in the Y dimension equals the lower bound of Cg_q in the Y dimension), so Cg_2 and Cg_q are adjacent in the Y dimension; since Cg_2 and Cg_q are adjacent in both the X dimension and the Y dimension, Cg_2 is an adjacent grid of Cg_q. By this method the adjacent coarse grid set C_q of Cg_q is computed; the result is shown in the table of the adjacent coarse grid set C_q of Cg_q below. (2) Count the number of data objects in C_q by accumulating the object counts of the coarse grids in C_q, i.e., C_q.num = C_q.num + Cg.num (with C_q.num initially 0); since q itself must be excluded, the total number of data objects in C_q is 23. (3) Compare the total number of data objects in C_q with the number k of nearest-neighbour objects requested; since 23 > 2, C_q is output as the final result.
((<15,30>,<20,35>),2,003) | ((<30,60>,<20,35>),4,003) | ((<60,80>,<20,35>),1,003)
((<15,30>,<35,65>),4,004) | ((<30,60>,<35,65>),3,001) | ((<60,80>,<35,65>),2,001)
((<15,30>,<65,75>),3,001) | ((<30,60>,<65,75>),3,004) | ((<60,80>,<65,75>),2,002)
3) The SDKNN algorithm is run on the Slave node holding each coarse grid of C_q to compute the k (k = 2) nearest-neighbour objects of q in a distributed manner. From the table of step 3) above, the 9 coarse grids of C_q are stored across Slave nodes 001, 002, 003 and 004, and SDKNN is executed on these 4 nodes against the fine-granularity grid indexes corresponding to these 9 coarse grids. The concrete operations of the algorithm, taking Cg = ((<60,80>, <35,65>), 2, 001) as an example, are as follows. (1) Compute the fine grid Fg_q containing q: Fg_q = (<11,8>, 1, (56,43)). (2) Run the Circle-Traversal algorithm to search the nearest-neighbour grid set F_1 of Fg_q within the fine-granularity grid index Fg corresponding to Cg; all fine grids contained in the fine-granularity grid index Fg corresponding to Cg are shown in the table below. Taking (56,43) as the centre, the neighbouring grids around Fg_q at radius i × 5 are computed, and the set S_1 of the objects of this coarse grid nearest to q is returned. Specifically (the process is shown in Fig. 9): when i = 0, only the fine grid Fg_q containing (56,43) is considered, and since no object has yet been collected (0 < k), the search extends outward by one circle; when i = 1, the first circle is traversed: the starting cell Fg_start = <10,7> is not in Fg; searching upward, j = 1, <10,8> is not in Fg; searching upward, j = 2, <10,9> is not in Fg; now j = 3 > 2*i and Fg' ≠ Fg_start, so the direction changes to the right: j = 1, <11,9> is not in Fg; searching to the right, j = 2, <12,9> is in Fg, so F_1 = {Fg'} = {(<12,9>, 0, null)}; now j = 3 > 2*i and Fg' ≠ Fg_start, so the direction changes to downward: j = 1, <12,8> is in Fg, F_1 = F_1 ∪ {Fg'} = {(<12,9>, 0, null), (<12,8>, 0, null)}; searching downward, j = 2, <12,7> is in Fg, F_1 = F_1 ∪ {Fg'} = {(<12,9>, 0, null), (<12,8>, 0, null), (<12,7>, 1, (62,36))}; now j = 3 > 2*i and Fg' ≠ Fg_start, so the direction changes to the left: j = 1, <11,7> is not in Fg; searching to the left, j = 2, Fg' = Fg_start, and the first circle ends with F_1 = {(<12,9>, 0, null), (<12,8>, 0, null), (<12,7>, 1, (62,36))}. Since F_1.num = 1 < k = 2, the search extends outward by one more circle, i = 2, and the second circle is traversed in the same form as the first; after the second circle, F_1 = {(<12,9>, 0, null), (<12,8>, 0, null), (<12,7>, 1, (62,36)), (<12,10>, 0, null), (<13,10>, 0, null), (<13,9>, 0, null), (<13,8>, 1, (67,43)), (<13,7>, 0, null)} and F_1.num = 2. To guarantee that F_j contains at least the k objects nearest to q, one further circle is needed, so i = 3 and the third circle is traversed; after it ends, F_1 = {(<12,9>, 0, null), (<12,8>, 0, null), (<12,7>, 1, (62,36)), (<12,10>, 0, null), (<13,10>, 0, null), (<13,9>, 0, null), (<13,8>, 1, (67,43)), (<13,7>, 0, null), (<12,11>, 0, null), (<13,11>, 0, null), (<14,11>, 0, null), (<14,10>, 0, null), (<14,9>, 0, null), (<14,8>, 0, null), (<14,7>, 0, null)}. The distances between q and the objects in the fine grids of F_1 are then computed: dist(q, p_1) ≈ 9.2, where p_1 = (<12,7>, 1, (62,36)), and dist(q, p_2) = 11, where p_2 = (<13,8>, 1, (67,43)); by sorting, the 2 data objects with the smallest distances are taken, giving S_1 = (<9.2, (62,36)>, <11, (67,43)>). (3) In the same way, by running the SDKNN algorithm, the k objects of the other coarse grids of C_q nearest to q are computed, as follows:
For the coarse grid ((<15,30>, <65,75>), 3, 001), running the SDKNN algorithm yields at least k non-empty fine grids nearest to q, namely (<5,14>, 1, (27,71)), (<4,13>, 1, (21,66)) and (<3,14>, 1, (16,72)); by computing the distances and taking the k objects nearest to q, S_2 = (<40.3, (27,71)>, <41.9, (21,66)>).
For the coarse grid ((<15,30>, <35,65>), 4, 004), running the SDKNN algorithm yields at least k non-empty fine grids nearest to q, namely (<5,11>, 1, (26,57)), (<4,10>, 1, (21,53)), (<4,12>, 1, (23,61)) and (<3,12>, 1, (18,61)); by computing the distances and taking the k objects nearest to q, S_3 = (<33.1, (26,57)>, <36.4, (21,53)>).
For the coarse grid ((<15,30>, <20,35>), 2, 003), running the SDKNN algorithm yields at least k non-empty fine grids nearest to q, namely (<5,4>, 1, (26,23)) and (<5,6>, 1, (28,33)); by computing the distances and taking the k objects nearest to q, S_4 = (<36.1, (26,23)>, <29.7, (28,33)>).
For the coarse grid ((<30,60>, <65,75>), 3, 004), running the SDKNN algorithm yields at least k non-empty fine grids nearest to q, namely (<6,14>, 1, (31,73)), (<10,14>, 1, (53,72)) and (<11,14>, 1, (57,71)); by computing the distances and taking the k objects nearest to q, S_5 = (<29.2, (53,72)>, <28, (57,71)>).
For the coarse grid ((<30,60>, <35,65>), 3, 001), running the SDKNN algorithm yields at least k non-empty fine grids nearest to q, namely (<11,12>, 1, (58,63)) and (<11,9>, 1, (59,49)); by computing the distances and taking the k objects nearest to q, S_6 = (<20.1, (58,63)>, <6.7, (59,49)>).
For the coarse grid ((<30,60>, <20,35>), 4, 003), running the SDKNN algorithm yields at least k non-empty fine grids nearest to q, namely (<9,6>, 1, (46,33)), (<10,5>, 1, (52,29)), (<7,5>, 1, (38,26)) and (<11,4>, 1, (57,23)); by computing the distances and taking the k objects nearest to q, S_7 = (<14.1, (46,33)>, <14.6, (52,29)>).
For the coarse grid ((<60,80>, <65,75>), 2, 002), running the SDKNN algorithm yields at least k non-empty fine grids nearest to q, namely (<13,13>, 1, (65,66)) and (<13,14>, 1, (67,72)); by computing the distances and taking the k objects nearest to q, S_8 = (<24.7, (65,66)>, <31.02, (67,72)>).
For the coarse grid ((<60,80>, <20,35>), 1, 003), running the SDKNN algorithm yields at least k non-empty fine grids nearest to q, namely (<12,4>, 1, (64,24)); by computing the distance and taking the k objects nearest to q (all objects are taken when their number is less than k), S_9 = (<20.6, (64,24)>).
(<12,7>,1,(62,36)) | (<13,7>,0,null) | (<14,7>,0,null) | (<15,7>,0,null)
(<12,8>,0,null) | (<13,8>,1,(67,43)) | (<14,8>,0,null) | (<15,8>,0,null)
(<12,9>,0,null) | (<13,9>,0,null) | (<14,9>,0,null) | (<15,9>,0,null)
(<12,10>,0,null) | (<13,10>,0,null) | (<14,10>,0,null) | (<15,10>,0,null)
(<12,11>,0,null) | (<13,11>,0,null) | (<14,11>,0,null) | (<15,11>,0,null)
(<12,12>,0,null) | (<13,12>,0,null) | (<14,12>,0,null) | (<15,12>,0,null)
4) The results S_1, S_2, S_3, S_4, S_5, S_6, S_7, S_8 and S_9 obtained by running the SDKNN algorithm on Slave nodes 001, 002, 003 and 004 are gathered onto Slave node 005 and sorted in ascending order; the first 2 results after sorting are taken, giving the final result S = (<6.7, (59,49)>, <9.2, (62,36)>).
S is output as the final query result.
The present invention analyses the overall data with a grid- and density-based method to obtain a summary estimate of the data distribution, which lays the foundation for establishing the coarse-granularity grid and provides the basis for reducing data skew in the cluster; on the basis of this summary estimate, a combined-granularity distributed in-memory grid index structure composed of non-equal-width coarse grids and equal-width fine grids is established, which eliminates the bottleneck of single-machine processing capacity, improves data search efficiency and supports distributed algorithms, and is the core technique for designing an efficient, distributed KNN algorithm; based on the established equal-width fine-granularity grid index, a lossless nearest-neighbour fine-grid search algorithm is designed that can quickly and accurately locate the neighbouring fine grids of the query object; based on the coarse-fine combined-granularity distributed in-memory grid index structure and the lossless nearest-neighbour fine-grid search algorithm, a scalable, distributed KNN query algorithm is designed, which eliminates the single-machine bottleneck of centralized KNN query algorithms and the poor real-time performance of MapReduce-based KNN query algorithms caused by writing intermediate results back to disk, and realizes fast queries over massive data.
The above embodiment is only a concrete example of the present invention; the scope of patent protection of the present invention includes but is not limited to the above embodiment. Any appropriate change or replacement made by a person of ordinary skill in the art, in accordance with the claims of the KNN query method based on a combined-granularity distributed in-memory grid index described in the present invention, shall fall within the scope of patent protection of the present invention.