CN105138607A - Hybrid-granularity distributed in-memory grid index-based KNN query method - Google Patents

Hybrid-granularity distributed in-memory grid index-based KNN query method

Info

Publication number
CN105138607A
CN105138607A (application no. CN201510481594.7A)
Authority
CN
China
Prior art keywords
grid
coarseness
data
fine granularity
index
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510481594.7A
Other languages
Chinese (zh)
Other versions
CN105138607B (en)
Inventor
蔡斌雷
朱世伟
郭芹
杨子江
于俊凤
魏墨济
李思思
徐蓓蓓
李晨
巴志超
鞠镁隆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Southern Power Grid Internet Service Co ltd
Jingchuang United Beijing Intellectual Property Service Co ltd
Original Assignee
INFORMATION RESEARCH INSTITUTE OF SHANDONG ACADEMY OF SCIENCES
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by INFORMATION RESEARCH INSTITUTE OF SHANDONG ACADEMY OF SCIENCES filed Critical INFORMATION RESEARCH INSTITUTE OF SHANDONG ACADEMY OF SCIENCES
Priority to CN201510481594.7A priority Critical patent/CN105138607B/en
Publication of CN105138607A publication Critical patent/CN105138607A/en
Application granted granted Critical
Publication of CN105138607B publication Critical patent/CN105138607B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1847 File system types specifically adapted to static storage, e.g. adapted to flash memory or SSD
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2282 Tablespace storage structures; Management thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a KNN query method based on a hybrid-granularity distributed in-memory grid index, realized by the following steps. Data pre-processing step: based on grids and density, partition the entire data space to obtain an overview estimate of the overall data distribution. Data query step: establish a hybrid-granularity distributed in-memory grid index structure, i.e. a non-uniform coarse-grained grid index and a uniform fine-grained grid index; on this basis, design a distributed KNN query algorithm to realize fast KNN queries over massive data. Compared with the prior art, the method reduces data skew in the cluster, improves data indexing efficiency, and supports distributed algorithms by designing and establishing the hybrid-granularity distributed in-memory grid index; by using a KNN query algorithm based on this index structure, it removes the single-machine performance bottleneck of centralized KNN query algorithms and the real-time degradation caused by writing intermediate results back to disk in KNN query algorithms built on the MapReduce framework.

Description

A KNN query method based on a hybrid-granularity distributed in-memory grid index
Technical field
The present invention relates to the field of information retrieval, and specifically to a practical KNN query method based on a hybrid-granularity distributed in-memory grid index.
Background art
With the rapid development of the Internet, the Internet of Things, big data and related technologies, the KNN query, as a basic operation, is widely used in various location-based applications. However, as data volumes keep growing, traditional centralized KNN queries and KNN query methods based on the MapReduce framework cannot process massive data quickly and effectively. Extending the traditional centralized KNN query algorithm under a big-data environment, by combining the characteristics of main-memory clusters, index technology and distributed in-memory computing, is the fundamental way to solve this problem.
Index technology is a key component of KNN query algorithms. Its basic idea is to organize data objects into an ordered index data structure using methods such as partitioning, locating and mapping; using an index can improve data retrieval efficiency. Indexes are applied extensively in KNN query algorithms. Under centralized environments: (1) index structures built on the properties of Voronoi diagrams are widely applied to KNN queries over spatial data; (2) KNN queries based on tree structures mainly include: mature KNN queries over spatial data using R-tree indexes, KNN queries over moving data objects using TPR-tree indexes, and KNN queries over spatio-temporal data using indexes that combine data partitioning with B+-trees; (3) KNN queries based on grid index structures mainly include: indexes built on equal-width grids widely used for KNN queries in wireless broadcast environments, indexes combining trees and grids applied to KNN queries over moving data objects, and main-memory grid indexes widely used for continuous KNN queries. Under distributed environments, KNN queries based on the MapReduce framework have also achieved some results, mainly including: parallel KNN queries based on R-trees, parallel KNN queries based on inverted indexes, and parallel KNN queries based on Voronoi diagrams.
Under a big-data environment, KNN query methods are rather inefficient. The lack of an effective index structure, and of KNN query algorithms supported by such a structure, is the key cause of this problem, mainly reflected in two points: (1) index structures under centralized environments have been studied relatively maturely, and many relatively efficient KNN algorithms have been proposed on top of them; but as data volumes grow explosively, single-machine processing performance becomes an insurmountable bottleneck in a centralized environment; (2) under distributed environments there has also been some research on MapReduce-based index structures and related KNN query algorithms, but because MapReduce is a batch-processing model, intermediate results must be written back to disk, which adds I/O and makes query latency poor; moreover, existing algorithms do not consider the data distribution, and easily cause data-skew problems in the cluster.
Aiming at the low KNN query efficiency under the current big-data environment and the shortcomings of the prior art, we propose a KNN query method based on a hybrid-granularity distributed in-memory grid index. Combining the characteristics of main-memory clusters, we first form an overview estimate of the overall data and establish a coarse-fine hybrid-granularity distributed in-memory grid index structure, to reduce data skew and improve data retrieval efficiency. We then design a lossless nearest-neighbour fine-grained grid search algorithm, to locate the neighbouring grids of a search object quickly and accurately. Finally, based on the established index structure and the lossless nearest-neighbour fine-grained grid search algorithm, combined with a distributed in-memory computation model, we extend the traditional centralized KNN algorithm to a distributed one, eliminating the single-machine performance bottleneck of centralized KNN algorithms and the I/O bottleneck of MapReduce-based KNN algorithms, and thus support fast KNN queries over massive data.
Summary of the invention
The technical task of the present invention, in view of the above shortcomings, is to provide a practical KNN query method based on a hybrid-granularity distributed in-memory grid index.
A KNN query method based on a hybrid-granularity distributed in-memory grid index is implemented by the following process:
1. Data pre-processing step: based on grids and density, partition the entire data space to obtain an overview estimate of the overall data distribution.
2. Data query step:
Establish the hybrid-granularity distributed in-memory grid index structure, i.e. a non-uniform coarse-grained grid index and a uniform fine-grained grid index.
Design a distributed KNN query algorithm on top of the above index structure to realize fast KNN queries over massive data: based on the non-uniform coarse-grained grid index, search the set of coarse-grained grids adjacent to the non-uniform coarse-grained grid containing the query object, and determine the Slave node holding each coarse-grained grid in that set; on those Slave nodes, based on the uniform fine-grained grid index, search the set of fine-grained grids adjacent to the uniform fine-grained grid containing the query object, compare the distance to the query object of each object contained in each fine-grained grid of that set, and thus obtain the k nearest-neighbour objects of the query.
The detailed process of the overall data-space partitioning is:
Partition each dimension of the entire data space into equal-width fine-grained divisions with a uniform step δ, forming a fine-grained grid space.
Map each data object p to its corresponding grid cell.
Represent each grid cell by a feature vector g(gid, num), recording the number of each cell and the count of data objects it contains, where gid is the cell's unique identifier and num is the number of data objects the cell contains.
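The pre-processing step above can be sketched as follows. This is a minimal illustration, not the patent's code: the function names, the tuple cell identifier, and the use of floor division by δ as the mapping rule are my assumptions.

```python
from collections import defaultdict

def cell_id(point, delta):
    # Uniform partitioning: the cell index along each dimension is
    # floor(coordinate / delta).
    return tuple(int(c // delta) for c in point)

def build_fine_grid(points, delta):
    # Return {gid: num}, i.e. the (gid, num) feature vectors of all
    # non-empty fine-grained cells.
    grid = defaultdict(int)
    for p in points:
        grid[cell_id(p, delta)] += 1
    return dict(grid)

counts = build_fine_grid([(0.2, 0.7), (0.3, 0.9), (1.4, 0.1)], delta=1.0)
# two objects fall in cell (0, 0), one in cell (1, 0)
```

With δ = 1.0, the three sample points yield the feature vectors {(0, 0): 2, (1, 0): 1}, which together form the overview estimate of the distribution.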
The establishment process of the hybrid-granularity distributed in-memory grid index structure is:
According to the overview estimate of the data distribution obtained in step 1, partition the entire data space into non-uniform coarse-grained grids and build the non-uniform coarse-grained grid index over the whole space. The Master node of the main-memory cluster maintains this coarse-grained distributed in-memory grid index structure CGGI of the whole data space and is responsible for distributing data to the Slave nodes of the cluster.
Partition the sub-space represented by each coarse-grained grid of the above division into uniform fine-grained cells, and build a fine-grained grid index for each sub-space. Each Slave node of the main-memory cluster maintains the fine-grained distributed in-memory grid index structure FGGI of one or several sub-spaces; the fine-grained indexes maintained by different Slave nodes do not overlap, i.e. the sub-spaces they maintain do not overlap.
The concrete establishment process of the non-uniform coarse-grained distributed in-memory grid index structure is:
Following the process of step 1, count the number of data objects contained in each division of each dimension.
Make each division of every dimension contain at least θ data objects: when the number of data objects in a division is less than θ, merge it with an adjacent division, until the merged division contains more than θ data objects or the data space has no remaining divisions.
After the above counting and merging, the entire data space is partitioned into a non-uniform coarse-grained grid space, in which the number of data objects contained in each coarse-grained grid is roughly even.
Build the coarse-grained grid index CGGI of the entire data space. Each coarse-grained grid of CGGI is represented by a triple <Cgid, Cgnum, SIP>, where Cgid is the number of the coarse-grained grid, represented as (<lb_1, ub_1>, <lb_2, ub_2>, ..., <lb_i, ub_i>, ..., <lb_n, ub_n>), with <lb_i, ub_i> the lower and upper bounds of the grid's division in the i-th dimension; Cgnum is the number of data objects in the grid; and SIP is the address of the Slave node corresponding to the grid.
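A hedged one-dimensional sketch of the merge rule above: the patent only requires merging a sparse division with an adjacent one, so folding it into the right-hand neighbour, and folding a leftover tail into the last coarse partition, are my simplifications.

```python
def merge_partitions(counts, theta):
    # counts: object counts of the equal-width divisions along one dimension.
    # Returns half-open index ranges [start, end) of the merged, non-uniform
    # coarse partitions, each holding at least theta objects where possible.
    bounds, acc, start = [], 0, 0
    for i, c in enumerate(counts):
        acc += c
        if acc >= theta:
            bounds.append((start, i + 1))
            start, acc = i + 1, 0
    if acc > 0 and bounds:       # leftover tail: fold into the last partition
        s, _ = bounds.pop()
        bounds.append((s, len(counts)))
    elif acc > 0:                # nothing reached theta: one single partition
        bounds.append((0, len(counts)))
    return bounds

# counts of six fine divisions along X, theta = 10
print(merge_partitions([3, 4, 5, 9, 1, 2], theta=10))  # [(0, 3), (3, 6)]
```

Applying the same merge independently to each dimension yields the non-uniform coarse-grained grid space with roughly even object counts per grid.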
The establishment process of the uniform fine-grained distributed in-memory grid index structure is: based on the established non-uniform coarse-grained distributed in-memory grid index structure, further subdivide the sub-space corresponding to each coarse-grained grid <Cgid, Cgnum, SIP>, taking a fixed step λ as the division granularity of each dimension; each coarse-grained grid is thus divided into a fine-grained grid space of uniform step, over which the uniform fine-grained distributed in-memory grid index FGGI is built. Each fine-grained grid of FGGI is represented by a triple <Fgid, Fgnum, List>, where Fgid is the number of the fine-grained grid, represented as <l_1, l_2, l_3, ..., l_n>, and is unique; Fgnum is the number of data objects in the grid; and List is the list of data objects the grid contains.
Data objects can be inserted into and deleted from both CGGI and FGGI, where:
The insertion process for a CGGI data object is: for the insertion of a data object p(d_1, d_2, ..., d_n), compute the division containing d_i (i = 1, 2, 3, ..., n) to determine the coarse-grained grid containing p, and update the Cgnum of that grid by increasing it by 1.
Meanwhile, the fine-grained grid index FGGI corresponding to that coarse-grained grid is updated for the insertion: first, through the update of CGGI, the Master node distributes the data object p to the corresponding coarse-grained grid and Slave node; second, for the insertion of p(d_1, d_2, ..., d_n), compute the Fgid of the fine-grained grid containing p, update the Fgnum of that fine-grained grid by increasing it by 1, and at the same time insert p into its List.
The deletion process for a CGGI data object is: for the deletion of a data object p(d_1, d_2, ..., d_n), compute the division containing d_i (i = 1, 2, 3, ..., n) to determine the coarse-grained grid containing p, and update the Cgnum of that grid by decreasing it by 1.
Meanwhile, the fine-grained grid index FGGI corresponding to that coarse-grained grid is updated for the deletion: first, through the deletion operation on CGGI, the Master node finds the coarse-grained grid and Slave node containing p; second, for the deletion of p(d_1, d_2, ..., d_n), compute the Fgid of the fine-grained grid containing p, update the Fgnum of that fine-grained grid by decreasing it by 1, and at the same time delete p from its List.
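A minimal sketch of this index maintenance, under an assumed in-memory layout that is mine, not the patent's: a dict of Cgnum counters for CGGI and a dict of (Fgnum, List) cells for FGGI, with the grid ids already computed.

```python
class FineCell:
    # One FGGI entry: Fgnum counter plus the List of contained objects.
    def __init__(self):
        self.num = 0
        self.objects = []

def insert(coarse_counts, fine_index, cgid, fgid, p):
    coarse_counts[cgid] = coarse_counts.get(cgid, 0) + 1  # Cgnum += 1
    cell = fine_index.setdefault(fgid, FineCell())
    cell.num += 1                                         # Fgnum += 1
    cell.objects.append(p)                                # insert p into List

def delete(coarse_counts, fine_index, cgid, fgid, p):
    coarse_counts[cgid] -= 1                              # Cgnum -= 1
    cell = fine_index[fgid]
    cell.num -= 1                                         # Fgnum -= 1
    cell.objects.remove(p)                                # delete p from List
```

In the patent's setting, the two updates happen on different machines (CGGI on the Master, FGGI on the owning Slave); the sketch collapses them into one process for clarity.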
The KNN query process searches for the k nearest-neighbour objects of a query object q; the concrete process is:
First, the Master node runs the data-object mapping algorithm MOG, maps the data object into the coarse-grained grid index CGGI, and determines the coarse-grained grid Cg_q containing q in CGGI.
Second, it runs the adjacent-grid search algorithm SNNG to search the grids adjacent to the coarse-grained grid, i.e. the grids adjacent to Cg_q, and checks whether the total number of objects in Cg_q and its adjacent grids exceeds k; if it is less than k, the search continues with the grids adjacent to those adjacent grids, until the total number of objects exceeds k or the whole coarse-grained grid space has been searched, finally obtaining the set C_q of coarse-grained grids adjacent to Cg_q and determining the Slave nodes holding the coarse-grained grids in C_q.
The SDKNN algorithm is then run on the Slave nodes holding the coarse-grained grids in C_q, and each Slave node outputs its query result.
The results output by the Slave nodes are merged onto one Slave node to obtain a result set S; S is sorted in ascending order, and the first k objects are output as the final result.
The MOG algorithm runs on the Master node; its concrete process is: input the query data object q(d_1, d_2, ..., d_n) and the coarse-grained grid set C, determine the division of the coarse-grained grid containing q in every dimension, map q into CGGI, and determine the coarse-grained grid Cg_q containing q.
The SNNG algorithm runs on the Master node. According to the definition of "adjacent grid", it computes the grids adjacent to the coarse-grained grid Cg_q containing q, obtaining the adjacent coarse-grained grid set C_q, and counts the total number num of objects in the coarse-grained grids of C_q; if num >= k it outputs C_q, otherwise it executes SNNG for each coarse-grained grid in C_q, until num >= k or the whole coarse-grained grid space has been searched, and then outputs C_q.
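The expanding search of SNNG can be sketched on a 2-D coarse grid as follows. The breadth-first frontier and the 8-neighbour reading of "adjacent grid" are illustrative assumptions, as are the function and variable names.

```python
def snng(counts, start, k):
    # counts: {(i, j): Cgnum} for the non-empty coarse grids.
    # start:  the coarse grid containing q.
    # Returns the set of visited coarse grids whose total object count
    # is >= k, or all reachable grids if the space is exhausted.
    visited, frontier = {start}, [start]
    total = counts.get(start, 0)
    while total < k and frontier:
        nxt = []
        for (i, j) in frontier:
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    c = (i + di, j + dj)
                    if c in counts and c not in visited:
                        visited.add(c)
                        total += counts[c]
                        nxt.append(c)
        frontier = nxt  # expand to the neighbours of the neighbours
    return visited, total
```

For example, with counts {(0, 0): 1, (0, 1): 2, (1, 0): 3} and k = 5, starting from (0, 0), one expansion round suffices and all three grids are returned with total 6.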
The SDKNN algorithm is a distributed, scalable KNN algorithm; it runs distributedly on every Slave node storing the coarse-grained grid Cg_q containing the query object q or a coarse-grained grid in C_q, and each Slave node returns the k nearest-neighbour objects of q on that node.
Taking one Slave node as an example, the concrete process of the algorithm is:
Let Slave node 1 store the coarse-grained grid Cg_j, with Cg_j ∈ C_q; on this Slave node, Cg_j is maintained as the fine-grained grid index Fg_j.
Execute the Circle-Traversal algorithm on Fg_j to obtain the adjacent fine-grained grid set F_j containing at least the k nearest-neighbour objects of q. Circle-Traversal is a lossless nearest-neighbour fine-grained grid search algorithm; its inputs are the fine-grained grid index Fg_j, the fine-grained grid Fg_q containing the query object q, the step λ of the fine-grained grid partition, and the loop iteration count i. Centred on Fg_q, it searches ring by ring for the fine-grained grids belonging to Fg_j, obtaining the set F_j of fine-grained grids around the grid containing q that belong to Fg_j.
For every object p of any fine-grained grid Fg in F_j, i.e. {p | p ∈ Fg, Fg ∈ F_j}, compute the distance dist(p, q) between p and q, sort by distance, and return the set S_1 of the k objects nearest to q.
Repeat the above steps to obtain the sets S_2, S_3, ..., S_n of the k objects nearest to q on the Slave nodes holding the other coarse-grained grids of C_q; sort the objects in {S_1, S_2, S_3, ..., S_n} by their distance to q, and finally return the set S of the k objects nearest to q.
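The final merge step, in which the per-node candidate sets S_1, ..., S_n are combined and the global k nearest objects are kept, might look like this; representing each candidate as a (distance, object) pair is my assumption.

```python
def merge_topk(partial_results, k):
    # partial_results: iterable of per-Slave lists of (dist, obj) pairs.
    # Returns the k objects with the smallest distance across all nodes,
    # i.e. sort the merged set S ascending and take the first k.
    all_pairs = [pair for result in partial_results for pair in result]
    return [obj for _, obj in sorted(all_pairs)[:k]]
```

Because every Slave already returns its local k nearest objects, the merged set is guaranteed to contain the global k nearest, so truncating the ascending sort at k is safe.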
The KNN query method based on a hybrid-granularity distributed in-memory grid index of the present invention has the following advantages. First, the overall data is analysed with a grid- and density-based method to obtain an overview estimate of the data distribution, reducing cluster data skew as much as possible. Second, based on this overview estimate, a coarse-fine hybrid-granularity distributed in-memory grid index structure is established, eliminating the bottleneck of single-machine processing performance, improving data retrieval efficiency and supporting distributed algorithms. Third, based on the established fine-grained grid index, a lossless nearest-neighbour fine-grained grid search algorithm is designed to locate the neighbouring fine-grained grids of a query object quickly and accurately. Finally, based on this distributed in-memory index structure and the nearest-neighbour fine-grained grid search algorithm, a scalable, distributed KNN query algorithm is designed, eliminating the single-machine performance bottleneck of centralized KNN query algorithms and the poor real-time performance caused by MapReduce-based KNN query algorithms writing intermediate results back to disk. The method is practical and easy to popularize.
Brief description of the drawings
Figure 1 is the overall implementation flow chart of the present invention.
Figure 2 is the structural diagram of the hybrid-granularity distributed in-memory grid index of the present invention.
Figure 3 is the spatial distribution map of the overall data of the present invention.
Figure 4 is the partition and distribution map of the overall data space of the present invention.
Figure 5 is a schematic diagram of the result after merging the divisions of the X dimension of the present invention.
Figure 6 is a schematic diagram of the result after merging the divisions of the Y dimension of the present invention.
Figure 7 is a schematic diagram of the non-uniform coarse-grained grid index CGGI of the present invention.
Figure 8 is a schematic diagram of the uniform fine-grained grid index FGGI of the present invention.
Figure 9 is a schematic diagram of the process of searching for fine-grained neighbour grids of the present invention.
Detailed description of the embodiments
The invention will be further described below with reference to the accompanying drawings and specific embodiments.
The invention provides a KNN query method based on a hybrid-granularity distributed in-memory grid index. Aiming at the low retrieval efficiency of traditional KNN query algorithms under a big-data environment, it analyses the overall data with a grid- and density-based method to reduce data skew as far as possible, and designs a coarse-fine hybrid-granularity distributed in-memory grid index structure to improve data retrieval efficiency and support distributed algorithms; on this basis, it designs a scalable, distributed KNN query algorithm based on the distributed in-memory grid index, to realize fast queries over massive data.
The terms involved in the method are explained as follows. An in-memory index is a data structure that organizes and orders one or several attribute values of in-memory data. A distributed in-memory index is an in-memory index that can be partitioned and deployed distributedly onto the processing nodes of a main-memory cluster. A grid divides each dimension A_i (i = 1, 2, ..., d) of a d-dimensional data space A into p_i intervals; each grid cell g is composed of one interval c_i (i = 1, 2, ..., d) per dimension, expressed as g = (c_1, c_2, ..., c_d). Adjacent grids are two grid cells g = (c_1, ..., c_d) and g' = (c'_1, ..., c'_d) such that, for every dimension i, c_i = c'_i, c_i = c'_i + 1 or c_i = c'_i - 1. A KNN query (k-nearest-neighbour query) searches for the result set formed by the k objects nearest to a given object q: letting the whole object set be O and the KNN result set be O', for every p' ∈ O' and every p ∈ O - O' it holds that dist(p', q) <= dist(p, q), where dist(p, q) is the distance between objects p and q. The Master node is the head node of the cluster, in charge of distributing data and decomposing tasks; a Slave node is a worker node of the cluster, responsible for distributed data storage and task execution. A scalable, distributed KNN query algorithm is a KNN query algorithm that, based on the distributed in-memory index, can run distributedly on the processing nodes of the main-memory cluster and perform query processing cooperatively.
As shown in Figures 1 and 2, the present invention analyses the overall data with a grid- and density-based method to form an overview estimate of the overall data distribution. On this basis, the hybrid-granularity distributed in-memory grid index structure is established: the Master node of the cluster builds and maintains the non-uniform coarse-grained grid index of the whole space and is responsible for distributing data to the Slave nodes of the cluster; each Slave node holds one or several coarse-grained grids, subdivides each of them uniformly, and builds the corresponding uniform fine-grained grid index, where the fine-grained indexes maintained by different Slave nodes do not overlap, i.e. the sub-spaces they maintain do not overlap. Finally, based on the coarse-fine hybrid-granularity distributed in-memory grid index structure and the lossless nearest-neighbour fine-grained grid search algorithm, a scalable, distributed KNN query algorithm is designed to realize fast KNN queries over massive data. The specific implementation process is:
1. Data pre-processing step: based on grids and density, partition the entire data space to obtain an overview estimate of the overall data distribution.
2. Data query step: establish the hybrid-granularity distributed in-memory grid index structure, i.e. a non-uniform coarse-grained grid index and a uniform fine-grained grid index; design a distributed KNN query algorithm on top of this index structure to realize fast KNN queries over massive data: based on the non-uniform coarse-grained grid index, search the set of coarse-grained grids adjacent to the non-uniform coarse-grained grid containing the query object, and determine the Slave node holding each coarse-grained grid in that set; on those Slave nodes, based on the uniform fine-grained grid index, search the set of fine-grained grids adjacent to the uniform fine-grained grid containing the query object, compare the distance to the query object of each object contained in each fine-grained grid of that set, and thus obtain the k nearest-neighbour objects of the query.
The detailed process of the overall data-space partitioning is: partition each dimension of the entire data space into equal-width fine-grained divisions with a uniform step δ, forming a fine-grained grid space; map each data object p to its corresponding grid cell, e.g. a data object p(d_1, d_2, ..., d_n), with n the dimensionality of the entire data space, is mapped to its cell; represent each cell by a feature vector g(gid, num), recording the number of each cell and the count of data objects it contains, where gid is the cell's unique identifier and num is the number of data objects the cell contains.
The establishment process of the hybrid-granularity distributed in-memory grid index structure is: according to the overview estimate of the data distribution obtained in step 1, partition the entire data space into non-uniform coarse-grained grids and build the non-uniform coarse-grained grid index over the whole space; the Master node of the main-memory cluster maintains the coarse-grained distributed in-memory grid index structure CGGI (coarse-grained grid index) of the whole data space and is responsible for distributing data to the Slave nodes of the cluster; partition the sub-space represented by each coarse-grained grid of the above division into uniform fine-grained cells, and build a fine-grained grid index for each sub-space; each Slave node of the main-memory cluster maintains the fine-grained distributed in-memory grid index structure FGGI (fine-grained grid index) of one or several sub-spaces, where the fine-grained indexes maintained by different Slave nodes do not overlap, i.e. the sub-spaces they maintain do not overlap.
The concrete establishment process of the non-uniform coarse-grained distributed in-memory grid index structure is: following the process of step 1, count the number of data objects contained in each division of each dimension; make each division of every dimension contain at least θ data objects: when the number of data objects in a division is less than θ, merge it with an adjacent division, until the merged division contains more than θ data objects or the data space has no remaining divisions; after the above counting and merging, the entire data space is partitioned into a non-uniform coarse-grained grid space, in which the number of data objects contained in each coarse-grained grid is roughly even; build the coarse-grained grid index CGGI of the entire data space, in which each coarse-grained grid is represented by a triple <Cgid, Cgnum, SIP>, where Cgid is the number of the coarse-grained grid, represented as (<lb_1, ub_1>, <lb_2, ub_2>, ..., <lb_i, ub_i>, ..., <lb_n, ub_n>), with <lb_i, ub_i> the lower and upper bounds of the grid's division in the i-th dimension; Cgnum is the number of data objects in the grid; and SIP is the address of the Slave node corresponding to the grid.
The establishment process of the uniform fine-grained distributed in-memory grid index structure is: based on the established non-uniform coarse-grained distributed in-memory grid index structure, further subdivide the sub-space corresponding to each coarse-grained grid <Cgid, Cgnum, SIP>, taking a fixed step λ as the division granularity of each dimension; each coarse-grained grid is thus divided into a fine-grained grid space of uniform step, over which the uniform fine-grained distributed in-memory grid index FGGI is built. Each fine-grained grid of FGGI is represented by a triple <Fgid, Fgnum, List>, where Fgid is the number of the fine-grained grid, represented as <l_1, l_2, l_3, ..., l_n>, and is unique; Fgnum is the number of data objects in the grid; and List is the list of data objects the grid contains.
Data objects can be inserted into and deleted from both CGGI and FGGI. The insertion process for a CGGI data object is: for the insertion of a data object p(d_1, d_2, ..., d_n), compute the division containing d_i (i = 1, 2, 3, ..., n) to determine the coarse-grained grid containing p, and update the Cgnum of that grid by increasing it by 1; meanwhile, the fine-grained grid index FGGI corresponding to that coarse-grained grid is updated for the insertion: first, through the update of CGGI, the Master node distributes the data object p to the corresponding coarse-grained grid and Slave node; second, for the insertion of p(d_1, d_2, ..., d_n), compute the Fgid of the fine-grained grid containing p, update the Fgnum of that fine-grained grid by increasing it by 1, and at the same time insert p into its List.
The deletion process for a CGGI data object is: for the deletion of a data object p(d_1, d_2, ..., d_n), compute the division containing d_i (i = 1, 2, 3, ..., n) to determine the coarse-grained grid containing p, and update the Cgnum of that grid by decreasing it by 1; meanwhile, the fine-grained grid index FGGI corresponding to that coarse-grained grid is updated for the deletion: first, through the deletion operation on CGGI, the Master node finds the coarse-grained grid and Slave node containing p; second, for the deletion of p(d_1, d_2, ..., d_n), compute the Fgid of the fine-grained grid containing p, update the Fgnum of that fine-grained grid by decreasing it by 1, and at the same time delete p from its List.
The KNN query process searches for the k nearest-neighbour objects of a query object q; the concrete process is: first, the Master node runs the data-object mapping algorithm MOG (map object to grid), maps the data object q into the coarse-grained grid index CGGI, and determines the coarse-grained grid Cg_q containing q in CGGI; second, it runs the adjacent-grid search algorithm SNNG (search nearest neighbour grid) to search the grids adjacent to the coarse-grained grid, i.e. the grids adjacent to Cg_q, and checks whether the total number of objects in Cg_q and its adjacent grids exceeds k; if it is less than k, the search continues with the grids adjacent to those adjacent grids, until the total number of objects exceeds k or the whole coarse-grained grid space has been searched, finally obtaining the set C_q of coarse-grained grids adjacent to Cg_q and determining the Slave nodes holding the coarse-grained grids in C_q; SDKNN (scalable distributed KNN algorithm) is then run on the Slave nodes holding the coarse-grained grids in C_q, and each Slave node outputs its query result; the results output by the Slave nodes are merged onto one Slave node to obtain a result set S; S is sorted in ascending order, and the first k objects are output as the final result.
The described MOG algorithm runs on the Master node. Its concrete implementation process is: input the data object to be queried q(d1, d2, …, dn) and the coarse-granularity grid set C; determine the partition containing q in every dimension, map q into CGGI, and determine the coarse-granularity grid Cgq containing q. Taking n = 2 (i.e., a two-dimensional data space) as an example, the detailed process of this algorithm is:
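The MOG pseudocode figure is not reproduced in this text. A minimal two-dimensional sketch, under the assumption that each dimension's non-uniform partitions are kept as a sorted list of boundaries (names are illustrative):

```python
import bisect

def mog(q, x_bounds, y_bounds):
    """Map object q to the coarse-granularity grid containing it.

    x_bounds / y_bounds are the sorted partition boundaries of each
    dimension, e.g. [0, 15, 30, 60, 80]; the returned grid id is the
    pair of <lower, upper> bounds, matching the Cgid form in the text.
    """
    i = bisect.bisect_right(x_bounds, q[0]) - 1
    j = bisect.bisect_right(y_bounds, q[1]) - 1
    return ((x_bounds[i], x_bounds[i + 1]), (y_bounds[j], y_bounds[j + 1]))

# The embodiment's query point q = (56, 43) falls in (<30,60>, <35,65>):
print(mog((56, 43), [0, 15, 30, 60, 80], [0, 20, 35, 65, 75, 80]))
```

Binary search over the boundary lists keeps the per-dimension lookup at O(log m) for m partitions, which matters only for high partition counts; the embodiment's comparison scan is equivalent.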
The described SNNG algorithm runs on the Master node. According to the definition of "adjacent grid", it computes the grids adjacent to the coarse-granularity grid Cgq containing q, obtaining the coarse-granularity adjacent grid set Cq; it counts the total number num of objects in the coarse-granularity grids of Cq; if num >= k, it outputs Cq; otherwise it executes SNNG on each coarse-granularity grid in Cq, until the object count num in the coarse-granularity grids of Cq satisfies num >= k or the entire coarse-granularity grid space has been searched, and then outputs Cq. Its concrete implementation process is:
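Following the adjacency test worked through in the embodiment (two grids are adjacent when, in every dimension, their partitions coincide or share a bound), the core check of SNNG can be sketched as follows; since a dimension's partitions are disjoint, interval equality covers the equal-lower-bound and equal-upper-bound comparisons of the embodiment:

```python
def dim_adjacent(a, b):
    """Intervals a = (lb, ub) and b are adjacent in one dimension if they
    coincide or one's upper bound equals the other's lower bound."""
    return a == b or a[1] == b[0] or a[0] == b[1]

def is_neighbor(cg1, cg2):
    """Coarse grids are 'adjacent grids' iff adjacent in every dimension."""
    return all(dim_adjacent(d1, d2) for d1, d2 in zip(cg1, cg2))

cgq = ((30, 60), (35, 65))
print(is_neighbor(((15, 30), (20, 35)), cgq))  # True: touches at a corner
print(is_neighbor(((0, 15), (0, 20)), cgq))    # False: not adjacent in X
```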
The described SDKNN algorithm is a distributed, scalable KNN algorithm. It runs in a distributed manner on each Slave node storing the coarse-granularity grid Cgq containing the object q to be queried and the coarse-granularity grids of Cgq's adjacent grid set Cq; each Slave node returns the k nearest-neighbor objects of q. Taking Slave node 1 as an example, the detailed process is:
Let Slave node 1 store the coarse-granularity grid Cgj, with Cgj ∈ Cq; on Slave node 1, Cgj is the fine-granularity grid index Fgj;
Execute the Circle-Traversal algorithm on Fgj to obtain the fine-granularity grid set Fj containing at least the k nearest-neighbor objects of q. The Circle-Traversal algorithm is a lossless adjacent fine-granularity grid search algorithm; its inputs are the fine-granularity grid index Fgj, the fine-granularity grid Fgq containing the object q to be queried, the step length λ of the fine-granularity grid division, and the loop-search count i. Centered on Fgq, it searches ring by ring the fine-granularity grids belonging to Fgj, obtaining the fine-granularity grid set Fj containing at least the k nearest-neighbor objects of q;
For every data object p contained in a fine-granularity grid Fg of Fj, i.e., {p | p ∈ Fg, Fg ∈ Fj}, compute the distance dist(p, q) between p and q, sort by distance, and return the set S1 of the k objects nearest to q.
Similarly, the sets S2, S3, …, Sn of the k objects nearest to q can be obtained from the Slave nodes holding the other coarse-granularity grids of Cq; the objects in {S1, S2, S3, …, Sn} are sorted by their distance to q, and finally the set S of the k objects nearest to q is returned.
The detailed process of the above Circle-Traversal algorithm is:
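The pseudocode figure for Circle-Traversal is not reproduced in this text. A minimal two-dimensional sketch of the ring-by-ring search it describes (hypothetical names; the ring is enumerated directly rather than walked up/right/down/left as in the embodiment, and the grid containing q is assumed to hold no object other than q itself):

```python
def circle_traversal(fg_index, fg_q, k, max_ring=10):
    """Ring-by-ring ('circle') search of the fine grids around fg_q.

    fg_index maps a fine-grid number to its List of points (empty grids
    may simply be absent); fg_q is the grid number containing q. The
    search widens one ring at a time until at least k objects have been
    collected, then visits one extra ring -- the lossless guarantee,
    mirroring the final i = 3 pass in the worked example below.
    """
    cx, cy = fg_q
    found, extra = [], False
    for i in range(max_ring + 1):
        for x in range(cx - i, cx + i + 1):
            for y in range(cy - i, cy + i + 1):
                if max(abs(x - cx), abs(y - cy)) == i:  # cells on ring i only
                    found.extend(fg_index.get((x, y), []))
        if extra:
            break
        if len(found) >= k:
            extra = True
    return found

# Non-empty fine grids of coarse grid (<60,80>,<35,65>) in the embodiment:
fg = {(12, 7): [(62, 36)], (13, 8): [(67, 43)]}
print(circle_traversal(fg, (11, 8), 2))  # both candidates found by ring 2
```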
The detailed process of the described SDKNN algorithm is:
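The SDKNN pseudocode is likewise not reproduced here. The per-Slave ranking step it describes — score the candidate objects gathered by Circle-Traversal against q and keep the k nearest — can be sketched as (illustrative names):

```python
import heapq
import math

def sdknn_local(points, q, k):
    """Per-Slave step of SDKNN: rank the candidates collected by
    Circle-Traversal by their Euclidean distance to q, keep k nearest."""
    def dist(p):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    return heapq.nsmallest(k, ((dist(p), p) for p in points))

# Candidates from coarse grid (<60,80>,<35,65>) in the worked example:
print(sdknn_local([(62, 36), (67, 43)], (56, 43), 2))
```

This reproduces the embodiment's S1 = (<9.2, (62,36)>, <11, (67,43)>) up to rounding of the distances.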
Combining the characteristics of a main-memory cluster, the present invention proposes and establishes a coarse-fine hybrid-granularity distributed memory grid index structure, and designs a distributed KNN search algorithm based on this index structure, extending the traditional centralized KNN algorithm to a distributed memory cluster environment and alleviating, under big-data conditions, the inefficiency of the centralized KNN search algorithm and of KNN search algorithms based on the MapReduce framework.
Specific embodiment:
1. Without loss of generality, a cluster composed of 6 servers is used as the experimental platform (1 serves as the Master node and 5 as Slave nodes), and the technical solution of the present invention is described in detail using a KNN query over two-dimensional spatial data. The overall data is shown in the table below, and its spatial distribution in Fig. 3.
(12,68) (31,73) (58,63) (57,23) (4,26) (28,33) (11,16) (56,8) (21,66)
(16,72) (13,56) (62,78) (52,29) (7,34) (32,19) (13,26) (53,16) (23,61)
(18,61) (26,57) (65,66) (59,49) (6,23) (28,13) (26,23) (56,43) (27,71)
(11,63) (7,55) (67,72) (64,24) (9,11) (38,26) (67,43) (66,16) (53,72)
(8,73) (21,53) (46,33) (62,36) (8,2) (37,13) (2,12) (57,71) (56,76)
2. Perform the grid-and-density-based spatial division on the data in Table 1. Take a fixed step length δ = 5, map the data in Table 1 to the corresponding grids, and determine the number gid of the grid containing each data object; the result is shown in the data-mapping result table below. For example, for a data object p(d1, d2), the number of the grid containing p is computed as gid = <⌊d1/δ⌋, ⌊d2/δ⌋>.
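The mapping gid = <⌊d1/δ⌋, ⌊d2/δ⌋> can be checked against the first row of Table 1 (a sketch, not the patent's code; counting per grid yields the g(gid, num) feature vectors of the next table):

```python
import math
from collections import Counter

def gid(p, delta=5):
    """Grid number of point p under fixed step delta."""
    return (int(math.floor(p[0] / delta)), int(math.floor(p[1] / delta)))

row1 = [(12, 68), (31, 73), (58, 63), (57, 23), (4, 26)]
counts = Counter(gid(p) for p in row1)
print(gid((12, 68)))  # (2, 13), matching g(<2,13>,1) in the table below
```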
According to the data-mapping result, the grid-and-density-based spatial division of the overall data is obtained. Each grid is represented by a feature vector g(gid, num); the feature vectors of all non-empty grids are shown in the non-empty-grid feature vector table below, and the spatial division and distribution of the overall data are shown in Fig. 4.
g(<2,13>,1) g(<6,14>,1) g(<11,12>,1) g(<11,4>,1) g(<0,5>,1)
g(<3,14>,1) g(<2,11>,1) g(<12,15>,1) g(<10,5>,1) g(<1,6>,1)
g(<3,12>,1) g(<5,11>,1) g(<13,13>,1) g(<11,9>,1) g(<1,4>,1)
g(<2,12>,1) g(<1,11>,1) g(<13,14>,1) g(<12,4>,1) g(<1,2>,1)
g(<1,14>,1) g(<4,10>,1) g(<9,6>,1) g(<12,7>,1) g(<1,0>,1)
g(<2,3>,1) g(<2,5>,1) g(<5,4>,1) g(<13,8>,1) g(<0,2>,1)
g(<10,3>,1) g(<11,8>,1) g(<13,3>,1) g(<11,14>,1) g(<4,13>,1)
g(<5,14>,1) g(<10,14>,1) g(<11,15>,1) g(<5,6>,1) g(<6,3>,1)
g(<5,2>,1) g(<7,5>,1) g(<7,2>,1) g(<11,1>,1) g(<4,12>,1)
3. Based on the grid feature vectors and the spatial division of the overall data obtained in step 2 above, establish the non-uniform coarse-granularity distributed memory grid index. The concrete steps are as follows.
1) Tally the partitions of the X dimension of the space shown in Fig. 3; the concrete results are shown in the table below.
(a)
Partition 0 Partition 1 Partition 2 Partition 3
g(<0,2>,1) g(<1,14>,1) g(<2,13>,1) g(<3,14>,1)
g(<0,5>,1) g(<1,11>,1) g(<2,12>,1) g(<3,12>,1)
g(<1,6>,1) g(<2,3>,1)
g(<1,4>,1) g(<2,11>,1)
g(<1,2>,1) g(<2,5>,1)
g(<1,0>,1)
(b)
Partition 4 Partition 5 Partition 6 Partition 7
g(<4,10>,1) g(<5,14>,1) g(<6,3>,1) g(<7,5>,1)
g(<4,13>,1) g(<5,2>,1) g(<6,14>,1) g(<7,2>,1)
g(<4,12>,1) g(<5,4>,1)
g(<5,6>,1)
g(<5,11>,1)
(c)
(d)
Partition 12 Partition 13 Partition 14 Partition 15
g(<12,4>,1) g(<13,14>,1)
g(<12,7>,1) g(<13,8>,1)
g(<12,15>,1) g(<13,13>,1)
g(<13,3>,1)
Similarly tally the partitions of the Y dimension of the space shown in Fig. 3; the concrete results are shown in the Y-dimension partition statistics tables below.
(a)
Partition 0 Partition 1 Partition 2 Partition 3
g(<1,0>,1) g(<11,1>,1) g(<5,2>,1) g(<2,3>,1)
g(<1,2>,1) g(<12,3>,1)
g(<0,2>,1) g(<13,3>,1)
g(<7,2>,1) g(<6,3>,1)
(b)
Partition 4 Partition 5 Partition 6 Partition 7
g(<5,4>,1) g(<2,5>,1) g(<9,6>,1) g(<12,7>,1)
g(<11,4>,1) g(<7,5>,1) g(<5,6>,1)
g(<12,4>,1) g(<10,5>,1) g(<1,6>,1)
g(<1,4>,1) g(<0,5>,1)
(c)
Partition 8 Partition 9 Partition 10 Partition 11
g(<11,8>,1) g(<11,9>,1) g(<4,10>,1) g(<2,11>,1)
g(<13,8>,1) g(<5,11>,1)
g(<1,11>,1)
(d)
Partition 12 Partition 13 Partition 14 Partition 15
g(<3,12>,1) g(<2,13>,1) g(<3,14>,1) g(<12,15>,1)
g(<2,12>,1) g(<13,13>,1) g(<1,14>,1) g(<11,15>,1)
g(<11,12>,1) g(<4,13>,1) g(<5,14>,1)
g(<4,12>,1) g(<6,14>,1)
g(<10,14>,1)
g(<13,14>,1)
g(<11,14>,1)
2) Take the parameter θ = 10 and merge the partitions of the X dimension and the Y dimension respectively, so as to partition the overall data as evenly as possible.
First scan the X dimension from low to high by partition number. Partition 0 contains 2 data objects, fewer than θ, so it must be merged with partition 1; after merging there are 8 data objects, still fewer than θ, so it is further merged with partition 2; after merging there are 13 data objects, more than θ, so merging stops. Thus partitions 0, 1 and 2 of the X dimension are merged. Scanning the remaining partitions in the same way, partitions 3, 4 and 5 of the X dimension are merged, the six partitions 6 through 11 are merged, and the four partitions 12 through 15 are merged; the result is shown in Fig. 5.
Next, merge the partitions of the Y dimension in the same manner: partitions 0, 1, 2 and 3 are merged, partitions 4, 5 and 6 are merged, the six partitions 7 through 12 are merged, partitions 13 and 14 are merged, and partition 15 remains a partition of its own; the result is shown in Fig. 6.
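The greedy merge of step 2) can be sketched as below (hypothetical function name). Note one detail: the prose says a group closes when its count is "greater than θ", but the embodiment's Y-dimension merge of partitions 0-3, which total exactly 10 objects, implies the stop condition is "at least θ"; the sketch follows the numbers.

```python
def merge_partitions(counts, theta):
    """Scan per-partition object counts from low to high, closing a
    group once its running total reaches theta; a final group with
    fewer than theta objects is kept as-is (no partitions remain to
    merge with). Returns (first, last, total) triples."""
    merged, start, total = [], 0, 0
    for i, c in enumerate(counts):
        total += c
        if total >= theta:
            merged.append((start, i, total))
            start, total = i + 1, 0
    if start < len(counts):
        merged.append((start, len(counts) - 1, total))
    return merged

# Y-dimension per-partition counts from the statistics tables above:
y_counts = [1, 1, 4, 4, 4, 4, 3, 1, 2, 1, 1, 3, 4, 3, 7, 2]
print(merge_partitions(y_counts, 10))
```

Run on the Y-dimension counts, this yields exactly the groups of the embodiment: (0-3), (4-6), (7-12), (13-14) and (15) on its own.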
3) Based on the merged X- and Y-dimension partitions of 2), establish the coarse-granularity grid index CGGI shown in Fig. 7. The feature vector of each coarse-granularity grid is represented by the triple <Cgid, Cgnum, SIP>: Cgid is the number of the coarse-granularity grid, written (<lb1, ub1>, <lb2, ub2>), where <lbi, ubi> is the lower and upper bound of the grid's partition in the i-th dimension; Cgnum is the number of data objects in the coarse-granularity grid; SIP is the address of the Slave node corresponding to the coarse-granularity grid. We store the coarse-granularity grids on 4 Slave nodes, numbered 001, 002, 003 and 004; all coarse-granularity grids of CGGI are scanned in turn and distributed to the 4 Slave nodes. The specific rule is: first, scan the partitions along the X dimension from small to large and, within each X-dimension partition, scan the partitions along the Y dimension from small to large; then assign each coarse-granularity grid to the Slave node currently holding the fewest data objects. For example, initially all 4 Slave nodes hold 0 data objects; scanning starts from Cgid (<0,15>, <0,20>), which is assigned to Slave node 001; (<0,15>, <20,35>) is assigned to Slave node 002, (<0,15>, <35,65>) to Slave node 003, and (<0,15>, <65,75>) to Slave node 004. At this point the data-object counts of 001, 002, 003 and 004 are 4, 4, 3 and 2 respectively, so (<0,15>, <75,80>) is assigned to node 004, and so on for the remaining coarse-granularity grids. Finally, the feature vectors of all coarse-granularity grids of CGGI and their corresponding storage nodes are shown in the table below.
Slave node 001 Slave node 002 Slave node 003 Slave node 004
((<0,15>,<0,20>),4,001) ((<0,15>,<20,35>),4,002) ((<0,15>,<35,65>),3,003) ((<0,15>,<65,75>),2,004)
((<15,30>,<65,75>),3,001) ((<15,30>,<75,80>),0,002) ((<15,30>,<20,35>),2,003) ((<0,15>,<75,80>),0,004)
((<30,60>,<35,65>),3,001) ((<30,60>,<0,20>),4,002) ((<30,60>,<20,35>),4,003) ((<15,30>,<0,20>),1,004)
((<60,80>,<35,65>),2,001) ((<30,60>,<75,80>),1,002) ((<60,80>,<20,35>),1,003) ((<15,30>,<35,65>),4,004)
((<60,80>,<0,20>),1,002) ((<60,80>,<75,80>),1,003) ((<30,60>,<65,75>),3,004)
((<60,80>,<65,75>),2,002)
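The least-loaded assignment rule of step 3) can be sketched as follows (illustrative names; grid ids are passed as opaque strings, ties broken by node number as in the embodiment's initial round):

```python
def assign_grids(grids, node_ids):
    """Scan coarse grids in order, giving each to the Slave node that
    currently holds the fewest data objects."""
    load = {n: 0 for n in node_ids}
    assignment = {}
    for cgid, cgnum in grids:
        node = min(node_ids, key=lambda n: load[n])
        assignment[cgid] = node
        load[node] += cgnum
    return assignment

# First five grids in the embodiment's scan order:
grids = [("(<0,15>,<0,20>)", 4), ("(<0,15>,<20,35>)", 4),
         ("(<0,15>,<35,65>)", 3), ("(<0,15>,<65,75>)", 2),
         ("(<0,15>,<75,80>)", 0)]
a = assign_grids(grids, ["001", "002", "003", "004"])
print(a["(<0,15>,<75,80>)"])  # "004", the least-loaded node at that point
```

This greedy balancing is what keeps the per-node object counts close to uniform despite the non-uniform coarse grids.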
4. For each coarse-granularity grid in the non-uniform coarse-granularity distributed memory grid index structure established in step 3 above, establish a uniform fine-granularity distributed memory grid index structure FGGI. Take a fixed step length λ = 5 and divide each coarse-granularity grid into fine granularities; each fine-granularity grid of FGGI is represented by the feature-vector triple <Fgid, Fgnum, List>, where Fgid is the number of the fine-granularity grid, written <l1, l2>, and is unique; Fgnum is the number of data objects in the fine-granularity grid; and List holds the data objects the fine-granularity grid contains. For example, the fine-granularity grid index of the coarse-granularity grid Cg((<0,15>, <0,20>), 4, 001) is shown in Fig. 8; the feature vectors of its non-empty fine-granularity grids are (<0,2>, 1, (2,12)), (<1,0>, 1, (8,2)), (<1,2>, 1, (9,11)) and (<2,3>, 1, (11,16)). The fine-granularity grid indexes of all other coarse-granularity grids are computed in the same way.
5. KNN query: search for the 2 nearest-neighbor objects of the data object q(56, 43).
1) Based on the coarse-fine hybrid-granularity distributed memory grid index structures CGGI and FGGI established in steps 3 and 4, run the MOG algorithm to determine the coarse-granularity grid Cgq of CGGI containing q. The concrete operations of the algorithm are as follows: (1) initialize Cgq to empty; (2) determine the partition containing q in the X dimension: the X-dimension partitions of CGGI sorted in ascending order are (<0,15>, <15,30>, <30,60>, <60,80>); the value of q in the X dimension is 56, so by comparison q's partition in the X dimension is <30,60>; (3) determine the partition containing q in the Y dimension: the Y-dimension partitions of CGGI sorted in ascending order are (<0,20>, <20,35>, <35,65>, <65,75>, <75,80>); the value of q in the Y dimension is 43, so by comparison q's partition in the Y dimension is <35,65>; (4) from (2) and (3), the Cgid of the coarse-granularity grid containing q is (<30,60>, <35,65>), and the final result Cgq((<30,60>, <35,65>), 3, 001) is output.
2) Based on the coarse-fine hybrid-granularity distributed memory grid index structures CGGI and FGGI established in steps 3 and 4, run the SNNG algorithm to find the coarse-granularity adjacent grid set Cq of Cgq. The concrete operations are as follows: (1) let C denote the set of all coarse-granularity grids of CGGI; for each coarse-granularity grid in C, determine in turn, according to the definition of "adjacent grid", whether it is an adjacent grid of Cgq; after the comparison, remove it from C, and stop when C is empty. For example, for the coarse-granularity grid Cg1((<0,15>, <0,20>), 4, 001), determine whether it is adjacent to Cgq according to the definition of "adjacent grid": first check whether the two are adjacent in the X dimension. The X-dimension partition of Cg1 is <0,15> and that of Cgq is <30,60>; since 0 ≠ 60, 15 ≠ 30, 0 ≠ 30 and 15 ≠ 60, Cg1 and Cgq are not adjacent in the X dimension, so Cg1 is not an adjacent grid of Cgq. For the coarse-granularity grid Cg2((<15,30>, <20,35>), 2, 003), again check adjacency in the X dimension: the X-dimension partition of Cg2 is <15,30> and that of Cgq is <30,60>, and 30 == 30 (the upper bound of Cg2 in the X dimension equals the lower bound of Cgq in the X dimension), so Cg2 and Cgq are adjacent in the X dimension; next check the Y dimension: the Y-dimension partition of Cg2 is <20,35> and that of Cgq is <35,65>, and 35 == 35 (the upper bound of Cg2 in the Y dimension equals the lower bound of Cgq in the Y dimension), so Cg2 and Cgq are adjacent in the Y dimension. Since Cg2 and Cgq are adjacent in both the X and Y dimensions, Cg2 is an adjacent grid of Cgq. In this way the coarse-granularity adjacent grid set Cq of Cgq is computed; the result is shown in the table of Cq below. (2) Count the number of data objects in Cq by accumulating the object counts of its coarse-granularity grids, i.e., Cq.num = Cq.num + Cg.num (with Cq.num initially 0); since q itself is excluded, the total number of data objects in Cq is 23. (3) Compare the total object count of Cq with the nearest-neighbor count k of the query; since 23 > 2, Cq is output as the final result.
((<15,30>,<20,35>),2,003) ((<30,60>,<20,35>),4,003) ((<60,80>,<20,35>),1,003)
((<15,30>,<35,65>),4,004) ((<30,60>,<35,65>),3,001) ((<60,80>,<35,65>),2,001)
((<15,30>,<65,75>),3,001) ((<30,60>,<65,75>),3,004) ((<60,80>,<65,75>),2,002)
3) Run the SDKNN algorithm on the Slave nodes holding the coarse-granularity grids of Cq to compute the k (k = 2) nearest-neighbor objects of q in a distributed manner. From the table of step 2), the 9 coarse-granularity grids of Cq are stored across Slave nodes 001, 002, 003 and 004; on these 4 nodes, SDKNN is executed against the fine-granularity grid indexes corresponding to these 9 coarse-granularity grids. Taking Cg((<60,80>, <35,65>), 2, 001) as an example, the concrete operations are as follows: (1) compute the fine-granularity grid Fgq containing q: Fgq = (<11,8>, 1, (56,43)). (2) Run the Circle-Traversal algorithm to search the nearest-neighbor grid set F1 of Fgq within the fine-granularity grid index Fg corresponding to Cg; all fine-granularity grids of Fg are listed in the table below. The neighbor grids of Fgq within circles of radius i × 5 centered at (56,43) are computed, and the set S1 of the objects of this coarse-granularity grid nearest to q is returned. Specifically (the process is shown in Fig. 9): when i = 0, only the fine-granularity grid Fgq containing (56,43) is visited; since it contains no object other than q, the search expands outward by one circle. When i = 1 the first circle is traversed: Fgstart is not in Fg; searching upward, j = 1, not in Fg; upward again, j = 2, not in Fg; now j = 3 > 2·i and Fg′ ≠ Fgstart, so the direction changes to rightward: j = 1, not in Fg; rightward, j = 2, in Fg, F1 = {Fg′} = {(<12,9>, 0, null)}; now j = 3 > 2·i and Fg′ ≠ Fgstart, so the direction changes to downward: j = 1, in Fg, F1 = F1 ∪ {Fg′} = {(<12,9>, 0, null), (<12,8>, 0, null)}; downward again, j = 2, in Fg, F1 = F1 ∪ {Fg′} = {(<12,9>, 0, null), (<12,8>, 0, null), (<12,7>, 1, (62,36))}; now j = 3 > 2·i and Fg′ ≠ Fgstart, so the direction changes to leftward: j = 1, not in Fg; leftward again, j = 2; now Fg′ = Fgstart, and the first-circle traversal ends with F1 = {(<12,9>, 0, null), (<12,8>, 0, null), (<12,7>, 1, (62,36))}. Since F1.num = 1 < k = 2,
the search must expand outward by another circle: i = 2, and the second circle is traversed in the same manner as the first. After the second circle, F1 = {(<12,9>, 0, null), (<12,8>, 0, null), (<12,7>, 1, (62,36)), (<12,10>, 0, null), (<13,10>, 0, null), (<13,9>, 0, null), (<13,8>, 1, (67,43)), (<13,7>, 0, null)} and F1.num = 2. To guarantee that Fj contains at least the k objects nearest to q, one further circle must be traversed: i = 3. After that traversal, F1 = {(<12,9>, 0, null), (<12,8>, 0, null), (<12,7>, 1, (62,36)), (<12,10>, 0, null), (<13,10>, 0, null), (<13,9>, 0, null), (<13,8>, 1, (67,43)), (<13,7>, 0, null), (<12,11>, 0, null), (<13,11>, 0, null), (<14,11>, 0, null), (<14,10>, 0, null), (<14,9>, 0, null), (<14,8>, 0, null), (<14,7>, 0, null)}. The distances between q and the objects in the fine-granularity grids of F1 are then computed: dist(q, p1) ≈ 9.2 with p1 = (<12,7>, 1, (62,36)), and dist(q, p2) = 11 with p2 = (<13,8>, 1, (67,43)); sorting and taking the 2 objects of minimum distance gives S1 = (<9.2, (62,36)>, <11, (67,43)>). (3) Likewise, by running SDKNN, the k objects of minimum distance to q in the other coarse-granularity grids of Cq are computed, as follows:
For the coarse-granularity grid ((<15,30>, <65,75>), 3, 001), running SDKNN yields the non-empty fine-granularity grids holding at least the k objects nearest to q: (<5,14>, 1, (27,71)), (<4,13>, 1, (21,66)), (<3,14>, 1, (16,72)); computing the distances and taking the k objects nearest to q gives S2 = (<40.3, (27,71)>, <41.9, (21,66)>).
For the coarse-granularity grid ((<15,30>, <35,65>), 2, 004), running SDKNN yields the non-empty fine-granularity grids holding at least the k objects nearest to q: (<5,11>, 1, (26,57)), (<4,10>, 1, (21,53)), (<4,12>, 1, (23,61)), (<3,12>, 1, (18,61)); computing the distances and taking the k objects nearest to q gives S3 = (<33.1, (26,57)>, <36.4, (21,53)>).
For the coarse-granularity grid ((<15,30>, <20,35>), 2, 003), running SDKNN yields the non-empty fine-granularity grids holding at least the k objects nearest to q: (<5,4>, 1, (26,23)), (<5,6>, 1, (28,33)); computing the distances and taking the k objects nearest to q gives S4 = (<29.7, (28,33)>, <36.1, (26,23)>).
For the coarse-granularity grid ((<30,60>, <65,75>), 3, 004), running SDKNN yields the non-empty fine-granularity grids holding at least the k objects nearest to q: (<6,14>, 1, (31,73)), (<10,14>, 1, (53,72)), (<11,14>, 1, (57,71)); computing the distances and taking the k objects nearest to q gives S5 = (<28, (57,71)>, <29.2, (53,72)>).
For the coarse-granularity grid ((<30,60>, <35,65>), 3, 001), running SDKNN yields the non-empty fine-granularity grids holding at least the k objects nearest to q: (<11,12>, 1, (58,63)), (<11,9>, 1, (59,49)); computing the distances and taking the k objects nearest to q gives S6 = (<6.7, (59,49)>, <20.1, (58,63)>).
For the coarse-granularity grid ((<30,60>, <20,35>), 4, 003), running SDKNN yields the non-empty fine-granularity grids holding at least the k objects nearest to q: (<9,6>, 1, (46,33)), (<10,5>, 1, (52,29)), (<7,5>, 1, (38,26)), (<11,4>, 1, (57,23)); computing the distances and taking the k objects nearest to q gives S7 = (<14.1, (46,33)>, <14.6, (52,29)>).
For the coarse-granularity grid ((<60,80>, <65,75>), 2, 002), running SDKNN yields the non-empty fine-granularity grids holding at least the k objects nearest to q: (<13,13>, 1, (65,66)), (<13,14>, 1, (67,72)); computing the distances and taking the k objects nearest to q gives S8 = (<24.7, (65,66)>, <31.02, (67,72)>).
For the coarse-granularity grid ((<60,80>, <20,35>), 1, 003), running SDKNN yields the non-empty fine-granularity grid holding at least the k objects nearest to q: (<12,4>, 1, (64,24)); computing the distance and taking the k objects nearest to q (all objects are taken when fewer than k are available) gives S9 = (<20.6, (64,24)>).
(<12,7>,1,(62,36)) (<13,7>,0,null) (<14,7>,0,null) (<15,7>,0,null)
(<12,8>,0,null) (<13,8>,1,(67,43)) (<14,8>,0,null) (<15,8>,0,null)
(<12,9>,0,null) (<13,9>,0,null) (<14,9>,0,null) (<15,9>,0,null)
(<12,10>,0,null) (<13,10>,0,null) (<14,10>,0,null) (<15,10>,0,null)
(<12,11>,0,null) (<13,11>,0,null) (<14,11>,0,null) (<15,11>,0,null)
(<12,12>,0,null) (<13,12>,0,null) (<14,12>,0,null) (<15,12>,0,null)
4) The results S1, S2, S3, S4, S5, S6, S7, S8 and S9 obtained by running the SDKNN algorithm on Slave nodes 001, 002, 003 and 004 are reduced to Slave node 005 and sorted in ascending order; taking the first 2 results after sorting gives the final result S = (<6.7, (59,49)>, <9.2, (62,36)>).
S is output as the final query result.
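The final reduce of step 4) can be sketched as follows (illustrative names; only four of the nine partial result sets are shown, which already contain the two global winners):

```python
import heapq

def global_reduce(partial_results, k):
    """Merge the per-Slave (distance, point) result sets and keep the
    k objects of smallest distance to q."""
    return heapq.nsmallest(k, (pair for s in partial_results for pair in s))

partials = [
    [(9.2, (62, 36)), (11.0, (67, 43))],    # S1
    [(40.3, (27, 71)), (41.9, (21, 66))],   # S2
    [(6.7, (59, 49)), (20.1, (58, 63))],    # S6
    [(20.6, (64, 24))],                     # S9
]
print(global_reduce(partials, 2))  # [(6.7, (59, 49)), (9.2, (62, 36))]
```

Because each partial set is already a local top-k, the reduce node only ever inspects at most 9·k pairs, independent of the data size.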
The present invention analyzes the overall data with a grid-and-density-based method to obtain a summary estimate of the data distribution, which lays the foundation for establishing the coarse-granularity grids and provides a basis for reducing data skew across the cluster. On the basis of this summary estimate, a hybrid-granularity distributed memory grid index structure combining non-uniform coarse granularity and uniform fine granularity is established; it eliminates the processing bottleneck of a single machine, improves data search efficiency and supports distributed algorithms, and is the core technique for designing an efficient, distributed KNN algorithm. Based on the established uniform fine-granularity grid index, a lossless neighbor fine-granularity grid search algorithm is designed, which can quickly and accurately locate the neighbor fine-granularity grids of a query object. Based on the coarse-fine hybrid-granularity distributed memory grid index structure and the lossless neighbor fine-granularity grid search algorithm, a scalable, distributed KNN search algorithm is designed, which eliminates the single-machine processing bottleneck of the centralized KNN search algorithm and the poor real-time performance of MapReduce-based KNN search algorithms caused by writing intermediate results back to disk, achieving fast queries over massive data.
The above embodiment is only a concrete case of the present invention; the scope of patent protection of the present invention includes but is not limited to the above embodiment. Any claims conforming to the KNN query method based on a hybrid-granularity distributed memory grid index of the present invention, and any appropriate changes or substitutions made to it by a person of ordinary skill in the relevant technical field, shall fall within the scope of patent protection of the present invention.

Claims (10)

1. A KNN query method based on a hybrid-granularity distributed memory grid index, characterized in that the specific implementation process is:
One, the data preprocessing step: partition the overall data spatially based on grid and density, obtaining a summary estimate of the overall data distribution;
Two, the data query step:
Establish the hybrid-granularity distributed memory grid index structure, i.e., establish a non-uniform coarse-granularity grid index and a uniform fine-granularity grid index;
Design a distributed KNN search algorithm based on the above index structures to achieve fast KNN queries over massive data: namely, based on the non-uniform coarse-granularity grid index, search the set of coarse-granularity grids adjacent to the non-uniform coarse-granularity grid containing the object to be queried, and determine the Slave node holding each coarse-granularity grid in the adjacent coarse-granularity grid set; on these Slave nodes, based on the uniform fine-granularity grid index, search the set of fine-granularity grids adjacent to the uniform fine-granularity grid containing the object to be queried, compare the distances between the object to be queried and each object contained in each fine-granularity grid of the adjacent fine-granularity grid set, and thereby obtain the several nearest-neighbor objects of the query.
2. The KNN query method based on a hybrid-granularity distributed memory grid index according to claim 1, characterized in that the detailed process of the spatial division of the overall data is:
Divide each dimension of the overall data space into uniform fine granularities with a fixed step length δ, forming a fine-granularity grid space;
Map each data object p to its corresponding grid;
Represent each grid by a feature vector g(gid, num), recording the number of each grid and the count of data objects it contains, where gid is the number of the grid and is unique, and num is the number of data objects the grid contains.
3. The KNN query method based on a hybrid-granularity distributed memory grid index according to claim 1, characterized in that the establishment process of the hybrid-granularity distributed memory grid index structure is:
According to the summary estimate of the data distribution obtained in step one, divide the overall data space into non-uniform coarse-granularity grids and establish the non-uniform coarse-granularity grid index over the overall data space; the Master node of the main-memory cluster maintains the coarse-granularity distributed memory grid index structure CGGI over the overall data space, and this Master node is responsible for distributing data to each Slave node of the cluster;
Divide the sub-data space represented by each coarse-granularity grid of the above division into uniform fine granularities and establish the fine-granularity grid index of each sub-data space; each Slave node of the main-memory cluster maintains the fine-granularity distributed memory grid index structure FGGI of one or several sub-data spaces, i.e., the fine-granularity grid indexes maintained by the Slave nodes do not overlap, and neither do the subspaces they maintain.
4. The KNN query method based on a hybrid-granularity distributed memory grid index according to claim 3, characterized in that the concrete establishment process of the non-uniform coarse-granularity distributed memory grid index structure is:
Following the process of step one, count the number of data objects contained in each partition of each dimension;
Make each partition of every dimension contain at least θ data objects: when the number of data objects in a partition is less than θ, merge it with an adjacent partition, until the number of data objects it contains is greater than θ or the data space has no remaining partition in that dimension;
Through the above counting and merging, the overall data space is divided into a non-uniform coarse-granularity grid space in which the number of data objects contained in each coarse-granularity grid is substantially even;
Establish the coarse-granularity grid index CGGI of the overall data space; each coarse-granularity grid of CGGI is represented by a triple <Cgid, Cgnum, SIP>, where Cgid is the number of the coarse-granularity grid, written (<lb1, ub1>, <lb2, ub2>, …, <lbi, ubi>, …, <lbn, ubn>), with <lbi, ubi> denoting the lower and upper bound of the grid's partition in the i-th dimension; Cgnum is the number of data objects in the coarse-granularity grid; and SIP is the address of the Slave node corresponding to the coarse-granularity grid.
5. The KNN query method based on a hybrid-granularity distributed memory grid index according to claim 3 or 4, characterized in that the establishment process of the uniform fine-granularity distributed memory grid index structure is: based on the established non-uniform coarse-granularity distributed memory grid index structure, further subdivide the sub-data space corresponding to each coarse-granularity grid <Cgid, Cgnum, SIP>, taking a fixed step length λ as the partition granularity of every dimension, so that each coarse-granularity grid is divided into a fine-granularity grid space of uniform step length; establish the uniform fine-granularity distributed memory grid index FGGI on this fine-granularity grid space, with each fine-granularity grid of FGGI represented by a triple <Fgid, Fgnum, List>, where Fgid is the number of the fine-granularity grid, written <l1, l2, l3, …, ln>, and is unique; Fgnum is the number of data objects in the fine-granularity grid; and List holds the data objects the fine-granularity grid contains.
6. The KNN query method based on a hybrid-granularity distributed memory grid index according to claim 5, characterized in that the data objects of both CGGI and FGGI can be inserted and deleted, wherein
The insertion process for a CGGI data object is: to insert a data object p(d1, d2, …, dn), compute the partition containing each di (i = 1, 2, 3, …, n); this determines the coarse-granularity grid containing p, whose Cgnum is updated (increased by 1);
Meanwhile, the fine-granularity grid index FGGI corresponding to this coarse-granularity grid is updated for the insertion: first, following the CGGI update, the Master node distributes the data object p to the corresponding coarse-granularity grid and Slave node; second, to insert p(d1, d2, …, dn), compute the Fgid of the fine-granularity grid containing p, update the Fgnum of that fine-granularity grid (increased by 1), and at the same time insert p into its List;
The deletion process for a CGGI data object is: to delete a data object p(d1, d2, …, dn), compute the partition containing each di (i = 1, 2, 3, …, n); this determines the coarse-granularity grid containing p, whose Cgnum is updated (decreased by 1);
Meanwhile, the fine-granularity grid index FGGI corresponding to this coarse-granularity grid is updated for the deletion: first, through the CGGI deletion operation, the Master node locates the coarse-granularity grid and Slave node containing the data object p; second, to delete p(d1, d2, …, dn), compute the Fgid of the fine-granularity grid containing p, update the Fgnum of that fine-granularity grid (decreased by 1), and at the same time delete p from its List.
7. The KNN query method based on a hybrid-granularity distributed memory grid index according to claim 1, characterized in that the KNN query process searches for the k nearest-neighbor objects of an object q, and its concrete query process is:
The Master node first runs the object-to-grid mapping algorithm MOG, which maps the object q to be queried into the coarse-granularity grid index CGGI and determines the coarse-granularity grid Cgq of CGGI containing q;
Second, it runs the adjacent-grid search algorithm SNNG to search the grids adjacent to the coarse-granularity grid Cgq, and judges whether the total number of objects in Cgq and its adjacent grids exceeds k; if it is less than k, the search continues with the adjacent grids of those adjacent grids, until the total number of objects exceeds k or the entire coarse-granularity grid space has been searched, finally yielding the set Cq of coarse-granularity grids adjacent to Cgq and determining the Slave nodes holding the coarse-granularity grids in Cq;
The SDKNN algorithm is run on the Slave nodes holding the coarse-granularity grids of Cq, and each Slave node outputs its query result;
The results output by the Slave nodes are reduced to a single Slave node, giving a result set S; the objects in S are sorted in ascending order of their distance to the object q to be queried, and the first k objects are output as the final result.
8. a kind of KNN querying method based on combination grain distributed memory grid index according to claim 7, is characterized in that, described MOG algorithm runs at Master node, and its specific implementation process is: input object q (d to be checked 1, d 2... ..., d n), coarseness grid set C, determine the division of the coarseness grid at the every one dimension place of q, q be mapped in CGGI, determine the coarseness grid Cg at q place q.
9. The KNN query method based on a hybrid-granularity distributed memory grid index according to claim 7, characterized in that the SNNG algorithm runs on the Master node: it computes the adjacent grids of the coarse-granularity grid Cg_q containing q according to the definition of an "adjacent grid" to obtain the adjacent coarse-granularity grid set C_q, and counts the total number num of objects in the coarse-granularity grids of C_q; if num >= k, C_q is output; otherwise the SNNG algorithm is executed for every coarse-granularity grid in C_q, until the object count num of the coarse-granularity grids in C_q satisfies num >= k or the entire coarse-granularity grid space has been searched, and C_q is output.
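A sketch of the SNNG expansion under the stated stopping rule, assuming a 2-D grid with 8-neighbour adjacency and a precomputed per-grid object count; both assumptions are illustrative, not taken from the patent:

```python
def snng(cg_q, k, grid_counts):
    """grid_counts: dict grid id -> number of objects stored in that grid.
    Expands outward from cg_q until >= k objects are covered or the space is exhausted."""
    def neighbours(g):
        x, y = g
        return [(x + dx, y + dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                if (dx, dy) != (0, 0) and (x + dx, y + dy) in grid_counts]

    visited, frontier = {cg_q}, [cg_q]
    num = grid_counts[cg_q]
    while num < k and frontier:
        nxt = []
        for g in frontier:                # expand the adjacent grids of the frontier
            for n in neighbours(g):
                if n not in visited:
                    visited.add(n)
                    nxt.append(n)
                    num += grid_counts[n]
        frontier = nxt
    return visited                        # the adjacent coarse-granularity grid set C_q

counts = {(0, 0): 1, (0, 1): 2, (1, 0): 0, (1, 1): 3}
c_q = snng((0, 0), 4, counts)
# all four grids are needed before the covered object count reaches k = 4
```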
10. The KNN query method based on a hybrid-granularity distributed memory grid index according to claim 7, characterized in that the SDKNN algorithm is a distributed, scalable KNN algorithm that runs in a distributed manner on each Slave node storing the coarse-granularity grid Cg_q containing the query object q or a coarse-granularity grid of C_q, each Slave node returning the k nearest-neighbor objects of q;
On one Slave node, the algorithm is implemented as follows:
Suppose Slave node 1 stores the coarse-granularity grid Cg_j, with Cg_j ∈ C_q; on this Slave node, Cg_j is organized as the fine-granularity grid index Fg_j;
Execute the Circle-Traversal algorithm on Fg_j to obtain the set F_j of adjacent fine-granularity grids containing at least the k nearest-neighbor objects of q. Circle-Traversal is a lossless nearest-neighbor fine-granularity grid search algorithm: its inputs are the fine-granularity grid index Fg_j, the fine-granularity grid Fg_q containing the query object q, the step length λ of the fine-granularity grid partition, and the loop-search count i; taking Fg_q as the center, it searches ring by ring for the fine-granularity grids belonging to Fg_j, and obtains the set F_j of fine-granularity grids of Fg_j surrounding the fine-granularity grid containing q;
For every object p of any fine-granularity grid Fg in F_j, i.e. {p | p ∈ Fg, Fg ∈ F_j}, compute the distance dist(p, q), sort the objects by this distance, and return the set S_1 of the k objects nearest to q;
Repeat the above steps to obtain the sets S_2, S_3, ..., S_n of the k objects nearest to q from the Slave nodes storing the other coarse-granularity grids in C_q; sort the objects in {S_1, S_2, S_3, ..., S_n} by their distance to q, and finally return the set S of the k objects nearest to q.
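The ring-by-ring traversal and the per-node sort described above can be sketched as follows, assuming 2-D fine-grid cell ids and a dictionary from cell id to object list; the `max_radius` bound stands in for "the fine-granularity grid space is exhausted", and all names and data are illustrative:

```python
import heapq
import math

def circle_traversal(fg_q, k, fgj, max_radius):
    """Collect fine-grid cells ring by ring around fg_q until >= k objects are covered."""
    cx, cy = fg_q
    covered, total = [], 0
    for radius in range(max_radius + 1):
        ring = [(cx + dx, cy + dy)
                for dx in range(-radius, radius + 1)
                for dy in range(-radius, radius + 1)
                if max(abs(dx), abs(dy)) == radius]   # cells exactly on this ring
        hits = [c for c in ring if c in fgj]          # keep cells belonging to Fg_j
        covered += hits
        total += sum(len(fgj[c]) for c in hits)
        if total >= k:
            break
    return covered

def local_knn(q, k, fg_q, fgj, max_radius):
    """What one Slave node returns: the k objects nearest to q in its covered cells."""
    cells = circle_traversal(fg_q, k, fgj, max_radius)
    objects = [p for c in cells for p in fgj[c]]
    return heapq.nsmallest(k, objects, key=lambda p: math.dist(p, q))

fgj = {(0, 0): [(0.5, 0.5)], (1, 1): [(1.5, 1.5)], (2, 2): [(2.5, 2.5)]}
s_1 = local_knn((0.4, 0.4), 2, (0, 0), fgj, max_radius=3)
# s_1 holds this node's two objects nearest to q, in ascending order of distance
```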
CN201510481594.7A 2015-08-03 2015-08-03 KNN query method based on a hybrid-granularity distributed memory grid index Active CN105138607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510481594.7A CN105138607B (en) KNN query method based on a hybrid-granularity distributed memory grid index

Publications (2)

Publication Number Publication Date
CN105138607A true CN105138607A (en) 2015-12-09
CN105138607B CN105138607B (en) 2018-07-17

Family

ID=54723955

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510481594.7A Active CN105138607B (en) KNN query method based on a hybrid-granularity distributed memory grid index

Country Status (1)

Country Link
CN (1) CN105138607B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070294223A1 (en) * 2006-06-16 2007-12-20 Technion Research And Development Foundation Ltd. Text Categorization Using External Knowledge
CN102073689A (en) * 2010-12-27 2011-05-25 东北大学 Dynamic nearest neighbour inquiry method on basis of regional coverage

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SONG Xiaoyu, et al.: "CYPK-KNN: An Improved KNN Query Algorithm for Moving Objects", Journal of Shenyang Jianzhu University (Natural Science Edition) *
DAI Jian, et al.: "A Fast kNN Join Method Based on MapReduce", Chinese Journal of Computers *
ZHAO Minchao, et al.: "A kNN Algorithm Based on MapReduce and a Double-Layer Inverted Grid Index", Journal of Zhejiang University (Science Edition) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105893605A (en) * 2016-04-25 2016-08-24 济南大学 Distributed calculating platform facing to spatio-temporal data k neighbor query and query method
CN105893605B (en) * 2016-04-25 2019-02-22 济南大学 Distributed Computing Platform and querying method towards space-time data k NN Query
CN106528773A (en) * 2016-11-07 2017-03-22 山东首讯信息技术有限公司 Spark platform supported spatial data management-based diagram calculation system and method
CN106528773B (en) * 2016-11-07 2020-06-26 山东联友通信科技发展有限公司 Map computing system and method based on Spark platform supporting spatial data management
CN107562872A (en) * 2017-08-31 2018-01-09 中国人民大学 Metric space data similarity search method and device based on SQL
CN107562872B (en) * 2017-08-31 2020-03-24 中国人民大学 SQL-based query method and device for measuring spatial data similarity
CN107832479A (en) * 2017-10-19 2018-03-23 大连大学 Medical aid request mobile calls method

Also Published As

Publication number Publication date
CN105138607B (en) 2018-07-17

Similar Documents

Publication Publication Date Title
CN106528773B (en) Map computing system and method based on Spark platform supporting spatial data management
CN102289466B (en) K-nearest neighbor searching method based on regional coverage
CN106777351B (en) Computing system and its method are stored based on ART tree distributed system figure
CN102663801B (en) Method for improving three-dimensional model rendering performance
CN109522428B (en) External memory access method of graph computing system based on index positioning
CN108228724A (en) Power grid GIS topology analyzing method and storage medium based on chart database
Hongchao et al. Distributed data organization and parallel data retrieval methods for huge laser scanner point clouds
CN105138607A (en) Hybrid granularity distributional memory grid index-based KNN query method
CN107015868B (en) Distributed parallel construction method of universal suffix tree
CN101299213A (en) N-dimension clustering order recording tree space index method
CN109492060A (en) A kind of map tile storage method based on MBTiles
CN109033340A (en) A kind of searching method and device of the point cloud K neighborhood based on Spark platform
Schlag et al. Scalable edge partitioning
CN101692230A (en) Three-dimensional R tree spacial index method considering levels of detail
Moutafis et al. Efficient processing of all-k-nearest-neighbor queries in the MapReduce programming framework
CN110838072A (en) Social network influence maximization method and system based on community discovery
CN112181991A (en) Earth simulation system grid remapping method based on rapid construction of KD tree
CN105205052A (en) Method and device for mining data
JP2023543004A (en) Merge update method, device, and medium for R-tree index based on Hilbert curve
CN110097581B (en) Method for constructing K-D tree based on point cloud registration ICP algorithm
Bakli et al. Distributed spatiotemporal trajectory query processing in SQL
CN102637227B (en) Land resource assessment factor scope dividing method based on shortest path
CN101515284A (en) Parallel space topology analyzing method based on discrete grid
CN103345509B (en) Obtain the level partition tree method and system of the most farthest multiple neighbours on road network
CN104598600B (en) A kind of parallel analysis of digital terrain optimization method based on distributed memory

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230105

Address after: Room 606-609, Compound Office Complex Building, No. 757, Dongfeng East Road, Yuexiu District, Guangzhou, Guangdong Province, 510699

Patentee after: China Southern Power Grid Internet Service Co.,Ltd.

Address before: Room 02A-084, Building C (Second Floor), No. 28, Xinxi Road, Haidian District, Beijing 100085

Patentee before: Jingchuang United (Beijing) Intellectual Property Service Co.,Ltd.

Effective date of registration: 20230105

Address after: Room 02A-084, Building C (Second Floor), No. 28, Xinxi Road, Haidian District, Beijing 100085

Patentee after: Jingchuang United (Beijing) Intellectual Property Service Co.,Ltd.

Address before: Information Room, Institute of Information, Shandong Academy of Sciences, No. 19, Keyuan Road, Jinan, Shandong 250014

Patentee before: INFORMATION Research Institute OF SHANDONG ACADEMY OF SCIENCES