CN106909639B - Chained multi-path space connection query processing method based on Spark - Google Patents

Chained multi-path space connection query processing method based on Spark

Info

Publication number
CN106909639B
Authority
CN
China
Prior art keywords
connection
data
rdd
spatial
new
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710083816.9A
Other languages
Chinese (zh)
Other versions
CN106909639A (en)
Inventor
乔百友
王秋杰
韩东红
王国仁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN201710083816.9A
Publication of CN106909639A
Application granted
Publication of CN106909639B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2264 Multidimensional index structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Spark-based chained multi-way spatial join query processing algorithm comprising the following steps. Step 1: divide the whole data space into a number of grid cells of equal size, and encode each grid cell using the Z-order space-filling curve. Step 2: project each spatial object in the m-way spatial join data sets onto the corresponding grid cells according to its position in the data space. Step 3: if the condition i < m is satisfied, execute the spatial join operation Overlap on the two data sets RDDresult_new and RDD_i. Step 4: set i = i + 1 and repeat step 3 until the condition i < m is no longer satisfied. Step 5: execute the final spatial join operation Overlap. The algorithm improves processing efficiency and reduces computation cost.

Description

Chained multi-path space connection query processing method based on Spark
Technical Field
The invention relates to the technical field of spatial data query processing, in particular to a Spark-based chained multi-path spatial connection query processing method.
Background
Spatial join queries are an important class of spatial data queries and are widely used in spatial data management; spatial join query processing is a research hotspot in spatial database management. A multi-way spatial join query is a common spatial join operation that retrieves, from multiple spatial data sets, all spatial objects satisfying a given spatial predicate (such as intersection or containment). It is one of the most time-consuming spatial operations, and its complexity and importance make it one of the key factors determining the overall performance of a spatial data management system, so improving the efficiency of multi-way spatial join query processing has long been a research focus in academia. In recent years in particular, with the rapid development and wide application of Internet of Things, earth observation, and location-based service technologies, the volume of spatial data has grown sharply, and spatial data has become an important kind of big data. Performing efficient multi-way spatial join query processing over such spatial big data has become an important challenge in spatial data management. Traditional processing techniques based on spatial databases scale poorly and therefore struggle to meet the need for fast query processing over spatial big data, whereas Spark, a distributed parallel processing platform for very large-scale data, has attracted wide attention and is a key technology for big data processing. Studying efficient multi-way spatial join query processing methods for spatial big data on top of the large-scale data processing capability of the Spark platform has therefore become an important means of addressing these challenges.
Existing methods for multi-way spatial join query processing have the following main problems: (1) traditional multi-way join processing methods based on spatial databases mainly use centralized processing, scale poorly, and struggle to meet the need for fast query processing over spatial big data; (2) most popular existing algorithms, such as dynamic programming and hybrid join algorithms, build their indexes centrally and are inefficient for join queries over massive data; (3) existing distributed processing methods are mainly based on the Hadoop platform and focus on optimizing general multi-way join query processing; they suffer from excessive data replication and weak filtering capability, which hurts query processing efficiency; (4) the most recent distributed multi-way spatial join algorithms are the two MapReduce-based algorithms proposed by Gupta et al., Controlled-Replicate and ε-Controlled-Replicate. Controlled-Replicate partitions the spatial objects of the join data sets and replicates them to all grid cells in the fourth quadrant, and then performs the multi-way join. This clearly replicates a large number of spatial objects, which reduces join efficiency. To address this, the same authors proposed an improved multi-way spatial join query processing algorithm, ε-Controlled-Replicate, which reduces data replication to some extent and improves query processing efficiency but still replicates too much data. On the Spark platform, excessive replication means too much data is loaded into memory at once, so Spark's advantage of in-memory computation cannot be fully exploited and query efficiency may suffer.
Once these problems have been studied in depth and a corresponding solution provided, the method can be applied in related fields such as join query processing over spatial big data. The invention therefore provides a multi-way spatial join query processing algorithm for the Spark platform, aimed mainly at chained multi-way spatial join queries. It partitions the data space with a grid-based method and uses Z-order encoding to partition and encode the data, and it projects and replicates data according to the spatial positions of the data objects. During join processing, the algorithm uses boundary filtering to discard useless join data, which reduces redundant computation in subsequent joins as well as redundant projection and replication of join objects. It also adopts a duplicate avoidance strategy to reduce the output of duplicate results, thereby lowering the cost of subsequent join computation and improving the efficiency of multi-way join query processing.
Disclosure of Invention
In view of the defects in the prior art, the present invention aims to provide a Spark-based chained multi-path spatial join query processing method, which mainly focuses on the problem of chained multi-path spatial join query processing, and focuses on reducing the amount of spatial data replication and calculation in the filtering stage, thereby reducing the subsequent join calculation cost, improving the query processing efficiency, and having good adaptability and expansibility.
In order to achieve the purpose, the technical scheme of the invention is as follows:
A Spark-based chained multi-way spatial join query processing method comprises the following steps:
Step 1: divide the whole data space into a number of grid cells of equal size using a grid partitioning method, and encode each grid cell using the Z-order space-filling curve;
Step 2: project each spatial object in the m-way (m > 2) spatial join data sets R_1, R_2, …, R_m onto the corresponding grid cells according to its position in the data space, forming a series of key-value pairs; store the projection results in the resilient distributed datasets RDD_1, RDD_2, …, RDD_m respectively; set the loop variable i = 2 and the intermediate result data set RDDresult_new = RDD_1;
Step 3: if the condition i < m is satisfied, perform the spatial join operation Overlap(RDDresult_new, RDD_i) on the two data sets RDDresult_new and RDD_i. During the computation, data aggregation, boundary filtering, spatial join computation, duplicate avoidance, data replication and other operations are carried out in sequence, finally forming the intermediate result data set RDDresult_new, that is, RDDresult_new = Overlap(RDDresult_new, RDD_i);
Step 4: set i = i + 1 and repeat step 3 until the condition i < m is no longer satisfied;
Step 5: perform the last spatial join operation Overlap(RDDresult_new, RDD_m). During the computation, data aggregation, boundary filtering, and spatial join computation are carried out in sequence, and the result is output directly to form the final spatial join result set, which is stored in the HDFS file system.
Further, the data partitioning and encoding method is as follows: the whole data space is divided into n grid cells of equal size using a grid-based partitioning method; the grid cells are encoded using the Z-order space-filling curve; spatial data objects are projected onto the grid cells according to their positions; and all grid cells are mapped by hashing onto multiple execution units of the Spark platform, so that the whole processing task is divided into multiple tasks processed in parallel.
Further, the spatial object projection is as follows: a spatial data object is mapped onto the corresponding grid cells according to its position. Let C = (c_1, c_2, …, c_n) denote a partition of the data space, where c_i denotes a grid cell and also serves as the Z-order code of that cell, and let R be a set of spatial objects to be joined. If the minimum bounding rectangle (MBR) of a spatial object u ∈ R overlaps grid cell c_i, then u is mapped to grid cell c_i and the corresponding key-value pair (c_i, u) is generated; if a spatial object overlaps multiple grid cells, multiple key-value pairs are generated accordingly.
Further, step 3 specifically includes the following steps:
Step 3-1: compute Overlap(RDDresult_new, RDD_i): perform a Cogroup operation on RDDresult_new and RDD_i by Key value, that is, aggregate the data in RDDresult_new and RDD_i by Key value to obtain RDD_new;
Step 3-2: filter RDD_new using the filtering strategy to remove data pairs that cannot produce a result, and then perform the actual spatial join operation;
Step 3-3: execute the duplicate avoidance strategy to form the intermediate join result, perform the data replication operation on the intermediate join result, and finally form the new intermediate join result data set RDDresult_new.
Further, step 5 comprises the following steps:
Step 5-1: compute Overlap(RDDresult_new, RDD_i), where i = m at this point: perform a Cogroup operation on RDDresult_new and RDD_i by Key value, that is, aggregate the data in RDDresult_new and RDD_i by Key value to obtain RDD_new;
Step 5-2: filter RDD_new using the filtering strategy to remove data pairs that cannot produce a result, and then perform the actual spatial join operation;
Step 5-3: execute the duplicate avoidance strategy to form the final join result data set RDDresult_new, which consists of tuple pairs, and save it to the HDFS file system.
Further, the data replication operation is as follows: let t be any tuple in the intermediate join result set T generated by the most recent spatial join operation on the current grid cell, and let t.s be the spatial object involved in the next spatial join operation. If t.s overlaps some grid cell c_i, the tuple t is copied to grid cell c_i and the corresponding key-value pair (c_i, t) is generated.
Further, the filtering strategy is as follows: in the process of executing the connection operation in parallel, a corresponding filtering strategy is adopted, tuples which can not generate the connection result are removed, and only tuples which can generate the connection result are copied.
Further, the filtering strategy comprises two parts:
Boundary filtering: before a join operation is performed, the boundary MBR of the spatial objects in the completed intermediate join results that take part in the subsequent spatial join is computed first, and this MBR is used to filter out spatial objects in the next data set to be joined that do not intersect it, thereby reducing the cost of the subsequent join computation;
Replication-phase filtering: in multi-way join query processing, the intermediate results of the first several joins must be replicated; they are copied only to those other grid cells that may produce join results, which avoids losing join results.
Further, the duplicate avoidance strategy is as follows: when two spatial objects that each span multiple grid cells are joined, only the grid cell containing the lower-left corner of the new object formed by their overlap is responsible for outputting the result.
Compared with the prior art, the invention has the following beneficial effects: the Spark-based chained multi-way spatial join query processing method partitions the data space using a grid partitioning method, projects and replicates data based on the positions of spatial objects, filters out useless join objects by boundary filtering during computation, and reduces data replication by narrowing the replication range. It thereby improves processing efficiency, reduces computation cost, and offers good adaptability and scalability.
Drawings
FIG. 1 is an exemplary diagram of Z-order curve coding in an embodiment of the present invention;
FIG. 2 is a diagram illustrating partitioning and task mapping of data according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an exemplary operation of projecting and copying data in accordance with an embodiment of the present invention;
FIG. 4 is an exemplary diagram of boundary filtering as described in the detailed description of the invention;
FIG. 5 is an exemplary illustration of the avoidance of repetition described in the detailed description of the invention;
fig. 6 is a schematic processing flow diagram of a Spark-based chained multi-path spatial join query processing method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.
The invention provides a Spark-based chained multi-path spatial connection query processing method, which mainly focuses on the problem of chained multi-path spatial connection query processing, and is mainly characterized by reducing the spatial data copying and calculation amount in the filtering stage, thereby reducing the subsequent connection calculation cost and improving the query processing efficiency.
Example 1:
a chain multi-path space connection query processing method based on Spark mainly comprises the following steps:
1) Data space partitioning and encoding: the whole space is divided into n grid cells of equal size using a grid partitioning method; the grid cells are encoded using the Z-order space-filling curve; spatial data objects are projected onto the grid cells according to their positions; and all grid cells are mapped by hashing onto multiple execution units of the Spark platform, so that the whole processing task is divided into multiple tasks processed in parallel.
As shown in FIG. 1, to preserve the spatial relationships between spatial objects, a space-filling curve is used to encode the grid cells; FIG. 1 shows the Z-order curve when the number of bits is 1 and 2. The Z-order technique uses bits to represent the attribute information of a spatial object and decomposes the data space iteratively; each resulting subspace is assigned a number, called the Z-order value of the subspace, which is used as the Key value of the data objects in that subspace.
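The encoding described above can be illustrated with a minimal Python sketch of bit interleaving (a Morton code). This is only an illustration under stated assumptions: the function names `z_order` and `cell_of`, the choice of putting column bits in the even positions, and the zero-based numbering are hypothetical and may differ from the numbering drawn in FIG. 1.

```python
def z_order(col: int, row: int, bits: int) -> int:
    """Interleave the low `bits` bits of col and row into one Z-order code.

    Assumption: column bits occupy even positions and row bits odd positions.
    """
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)
        code |= ((row >> b) & 1) << (2 * b + 1)
    return code


def cell_of(x: float, y: float, x_min: float, y_min: float,
            cell_w: float, cell_h: float, bits: int) -> int:
    """Return the Z-order code of the grid cell that contains point (x, y)."""
    side = 1 << bits  # the grid has side x side cells
    col = min(max(int((x - x_min) // cell_w), 0), side - 1)
    row = min(max(int((y - y_min) // cell_h), 0), side - 1)
    return z_order(col, row, bits)
```

With `bits = 1` the four cells receive the codes 0 to 3, and with `bits = 2` the sixteen cells receive the codes 0 to 15; nearby cells tend to receive nearby codes, which is the property the text relies on.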
FIG. 2 is an example of task partition mapping: each grid cell produced by the partitioning, together with the data assigned to it, is allocated by hash mapping to one of n execution units on the Spark platform for parallel execution.
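As a rough illustration of the hash mapping, the sketch below assigns integer cell codes to execution units by taking the code modulo the number of units. The function name `assign_cells` is hypothetical, and the text does not state which hash function is used; modulo hashing is assumed here because it is the simplest choice for integer keys.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def assign_cells(cell_codes: Iterable[int], num_units: int) -> Dict[int, List[int]]:
    """Group grid-cell codes by execution unit using modulo hashing (an assumption)."""
    assignment: Dict[int, List[int]] = defaultdict(list)
    for code in cell_codes:
        assignment[code % num_units].append(code)
    return dict(assignment)


# Sixteen cells (2-bit Z-order codes) spread over four hypothetical execution units.
units = assign_cells(range(16), 4)
```

Each unit then processes only the key-value pairs whose cell code was assigned to it, which is what allows the cells to be joined independently and in parallel.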
2) Spatial object projection: a spatial data object is mapped onto the corresponding grid cells according to its position. Let C = (c_1, c_2, …, c_n) denote a partition of the data space, where c_i denotes a grid cell and also serves as the Z-order code of that cell, and let R be a set of spatial objects to be joined. If the MBR of a spatial object u ∈ R overlaps grid cell c_i, then u is mapped to grid cell c_i and the corresponding key-value pair (c_i, u) is generated; if a spatial object overlaps multiple grid cells, multiple key-value pairs are formed accordingly. The projection operation can be expressed as:
Project(u, C) = {(c_i, u) | c_i ∈ C and MBR(u) overlaps c_i}
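The projection rule can be sketched in Python as follows. The sketch assumes a square grid of 2^bits by 2^bits cells with its origin at (0, 0) and a uniform `cell_size`; the names `Rect`, `z_order`, and `project` are hypothetical, and an MBR whose edge lies exactly on a cell boundary is also assigned to the neighbouring cell, which is a simplification.

```python
from typing import List, NamedTuple, Tuple


class Rect(NamedTuple):
    """Axis-aligned minimum bounding rectangle (MBR)."""
    x1: float
    y1: float
    x2: float
    y2: float


def z_order(col: int, row: int, bits: int) -> int:
    """Interleave column and row bits into a Z-order code (column bits even)."""
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)
        code |= ((row >> b) & 1) << (2 * b + 1)
    return code


def project(obj_id: str, mbr: Rect, cell_size: float,
            bits: int) -> List[Tuple[int, Tuple[str, Rect]]]:
    """Emit one (cell code, object) pair for every grid cell the MBR touches."""
    side = 1 << bits
    col_lo = max(int(mbr.x1 // cell_size), 0)
    col_hi = min(int(mbr.x2 // cell_size), side - 1)
    row_lo = max(int(mbr.y1 // cell_size), 0)
    row_hi = min(int(mbr.y2 // cell_size), side - 1)
    pairs = []
    for row in range(row_lo, row_hi + 1):
        for col in range(col_lo, col_hi + 1):
            pairs.append((z_order(col, row, bits), (obj_id, mbr)))
    return pairs
```

An object lying inside a single cell yields one pair, while an object straddling a cell border yields one pair per overlapped cell, matching the statement that multiple key-value pairs are formed.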
3) Data replication: multi-way join query processing requires several join operations among multiple data sets. Data replication concerns the intermediate join result set T generated by the most recent spatial join on the current grid cell: if t ∈ T is a tuple in the intermediate join result and t.s is the object that takes part in the subsequent spatial join, then whenever t.s overlaps some grid cell c_i, the tuple t is copied to grid cell c_i and the corresponding key-value pair (c_i, t) is generated. The copy operation may be represented as:
{(c_i, t) | c_i ∈ C and t.s overlaps c_i}
FIG. 3 is an example of the data projection and replication operations; it shows that a spatial object is projected onto every grid cell it overlaps. Object r_1 is projected onto grid cells 6 and 12, r_2 onto cells 9 and 12, and r_3 onto cells 9 and 11, that is, Project(r_1, C) = {(6, r_1), (12, r_1)}, Project(r_2, C) = {(9, r_2), (12, r_2)}, and Project(r_3, C) = {(9, r_3), (11, r_3)}. When r_1, r_2 and r_3 are joined in sequence, because r_2 also overlaps grid cell 9, the intermediate result (r_1, r_2) of joining r_1 and r_2 in grid cell 12 must be copied to grid cell 9, forming the key-value pair (9, (r_1, r_2)), so that the subsequent join with spatial object r_3 can be carried out in grid cell 9 and no join result is lost.
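The replication step in this example can be sketched as below. The cell rectangles and object coordinates are invented for illustration and do not reproduce the layout of FIG. 3; only the cell numbers 9 and 12 are borrowed from the text. The names `Rect`, `overlaps`, and `replicate` are hypothetical, and the last element of a tuple plays the role of t.s.

```python
from typing import Dict, List, NamedTuple, Tuple


class Rect(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float


def overlaps(a: Rect, b: Rect) -> bool:
    """True when the two rectangles share an area of positive size."""
    return a.x1 < b.x2 and b.x1 < a.x2 and a.y1 < b.y2 and b.y1 < a.y2


def replicate(t: Tuple[Tuple[str, Rect], ...],
              grid: Dict[int, Rect]) -> List[Tuple[int, tuple]]:
    """Copy tuple t to every cell overlapped by its last object (t.s)."""
    _, s_mbr = t[-1]
    return [(cell_id, t) for cell_id, cell in sorted(grid.items())
            if overlaps(s_mbr, cell)]


# Hypothetical layout: cells 9 and 12 are horizontal neighbours.
grid = {9: Rect(0.0, 0.0, 1.0, 1.0), 12: Rect(1.0, 0.0, 2.0, 1.0)}
r1 = ("r1", Rect(1.1, 0.3, 1.6, 0.9))   # lies only in cell 12
r2 = ("r2", Rect(0.8, 0.2, 1.3, 0.6))   # spans cells 9 and 12
copies = replicate((r1, r2), grid)
```

Here the intermediate tuple (r1, r2), found in cell 12, is emitted for both cell 9 and cell 12, so a later object that meets r2 only inside cell 9 can still be joined with it.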
4) Filtering strategy: while the join operations are executed in parallel, a filtering strategy is applied to remove tuples that cannot produce join results, and only tuples that may produce results are replicated, which greatly reduces the cost of storage and subsequent computation. It comprises the following two filtering strategies:
A: Boundary filtering: first compute the MBR of the relevant join objects in the previous join results, then use this MBR to filter out spatial objects in the data set to be joined that do not intersect it, thereby reducing the computation cost of the subsequent join.
FIG. 4 is an example of boundary filtering in which the three data sets R, S and T are joined in sequence, first R with S and then the result with T. The figure shows the spatial objects projected into grid cell 3. The results of the preceding join of R and S are (r_1, s_1), (r_1, s_2) and (r_1, s_3), so the S objects in the previous join result set are s_1, s_2 and s_3, and their boundary MBR is drawn with a dotted line in the figure. When these results are joined with the objects of data set T, the spatial objects t_1, t_4 and t_5, which are projected into grid cell 3 but do not intersect the boundary MBR, can be filtered out directly. This avoids joining them with s_1, s_2 and s_3 individually and greatly reduces the cost of subsequent computation.
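A minimal Python sketch of boundary filtering within one grid cell is given below. The coordinates are made up so that t1, t4 and t5 fall outside the boundary MBR as in the description; they are not taken from FIG. 4, and the names `Rect`, `overlaps`, `bounding_mbr`, and `boundary_filter` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Rect(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float


def overlaps(a: Rect, b: Rect) -> bool:
    return a.x1 < b.x2 and b.x1 < a.x2 and a.y1 < b.y2 and b.y1 < a.y2


def bounding_mbr(rects: List[Rect]) -> Rect:
    """Smallest rectangle enclosing all given rectangles."""
    return Rect(min(r.x1 for r in rects), min(r.y1 for r in rects),
                max(r.x2 for r in rects), max(r.y2 for r in rects))


def boundary_filter(prev_results: List[tuple],
                    candidates: List[Tuple[str, Rect]]) -> List[Tuple[str, Rect]]:
    """Keep only candidates that intersect the boundary MBR of the objects
    (the last element of each previous result tuple) used by the next join."""
    if not prev_results:
        return []
    bound = bounding_mbr([t[-1][1] for t in prev_results])
    return [c for c in candidates if overlaps(c[1], bound)]


r1 = ("r1", Rect(1.0, 1.0, 3.0, 3.0))
s = [("s1", Rect(1.0, 1.0, 2.0, 2.0)), ("s2", Rect(2.0, 2.0, 3.0, 3.0)),
     ("s3", Rect(1.5, 2.5, 2.5, 3.5))]
prev = [(r1, si) for si in s]
t_objs = [("t1", Rect(4.0, 4.0, 5.0, 5.0)), ("t2", Rect(2.5, 1.5, 3.5, 2.5)),
          ("t3", Rect(0.5, 3.0, 1.2, 4.0)), ("t4", Rect(0.0, 0.0, 0.5, 0.5)),
          ("t5", Rect(3.5, 0.0, 4.0, 0.8))]
kept = [oid for oid, _ in boundary_filter(prev, t_objs)]
```

In this made-up data, t3 survives the filter even though it overlaps none of s1, s2 or s3. The boundary MBR is a cheap necessary condition, not a sufficient one, so the exact join test still has to run on the survivors.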
B: Replication-phase filtering: in multi-way join query processing, the intermediate results of the first several joins must be replicated to the other grid cells that may produce join results, where the subsequent join operations are performed, so that no join results are lost. When intermediate join results are replicated, only those intermediate results that involve join objects spanning grid cells are copied, which avoids redundant replication.
5) A duplicate avoidance strategy: when two space objects which span a plurality of grid units are connected, only the grid unit where the intersection point of the left lower corner of the new object formed by overlapping the two space objects is located is responsible for outputting the result, namely only one grid unit is responsible for outputting the result, so that the repeated output of the result is avoided, and the processing cost is reduced.
FIG. 5 shows an example of duplicate avoidance. Object s_1 in set S is projected onto the grid cells it overlaps, 2, 3, 6, 8, 9 and 12; object r_1 in set R is projected onto grid cells 3, 6, 9 and 12; and object r_2 is projected onto the four grid cells 8, 9, 10 and 11. Without duplicate avoidance, grid cells 3, 6, 9 and 12 would all output the same join result (r_1, s_1) during join processing, and grid cells 8 and 9 would both output the same join result (r_2, s_1), which is clearly redundant. Under the proposed duplicate avoidance strategy, as shown in FIG. 5, the grid cell containing the lower-left corner (points P and Q in the figure) of the object formed by the overlapping part of the two objects is responsible for outputting the result: grid cell 3 processes and outputs the join result (r_1, s_1) of r_1 and s_1, and grid cell 8 processes and outputs the join result (r_2, s_1) of r_2 and s_1. The strategy avoids repeated processing and repeated output of results and reduces subsequent processing cost.
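The duplicate avoidance rule can be sketched as a small ownership test. The 2 by 2 grid, the cell identifiers 0 to 3, and the object coordinates below are hypothetical and do not reproduce FIG. 5; cells are treated as half-open so that the lower-left corner of the overlap falls into exactly one cell, which is an assumption of this sketch.

```python
from typing import NamedTuple, Optional


class Rect(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float


def intersection(a: Rect, b: Rect) -> Optional[Rect]:
    """Overlap region of two rectangles, or None if they do not overlap."""
    x1, y1 = max(a.x1, b.x1), max(a.y1, b.y1)
    x2, y2 = min(a.x2, b.x2), min(a.y2, b.y2)
    if x1 >= x2 or y1 >= y2:
        return None
    return Rect(x1, y1, x2, y2)


def should_output(a: Rect, b: Rect, cell: Rect) -> bool:
    """True only for the cell that contains the lower-left corner of the
    overlap of a and b (cells are half-open: [x1, x2) x [y1, y2))."""
    inter = intersection(a, b)
    if inter is None:
        return False
    return cell.x1 <= inter.x1 < cell.x2 and cell.y1 <= inter.y1 < cell.y2


cells = {0: Rect(0, 0, 1, 1), 1: Rect(1, 0, 2, 1),
         2: Rect(0, 1, 1, 2), 3: Rect(1, 1, 2, 2)}
a = Rect(0.5, 0.5, 1.5, 1.5)   # spans all four cells
b = Rect(0.8, 0.2, 1.8, 1.2)   # spans all four cells
owners = [cid for cid, cell in cells.items() if should_output(a, b, cell)]
```

Both objects are present in all four cells, so all four would find the pair, yet only one cell passes the test and emits it.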
The chained multi-way spatial join query Q_m = Overlap(R_1, R_2, R_3, …, R_m) can, by definition, be expressed as Q_m = Overlap(…Overlap(Overlap(R_1, R_2), R_3), …, R_m). The processing flow of the Spark-based chained multi-way spatial join query processing method provided by the invention is shown in FIG. 6 and mainly comprises the following steps:
A: Project the multi-way join data sets R_1, R_2, R_3, …, R_m according to the grid partitioning and encoding method, using the code value as the Key and each spatial object's identifier together with attribute information such as its MBR (minimum bounding rectangle) as the Value to form a series of key-value pairs, and put the projection results of the data sets R_1, R_2, R_3, …, R_m into the resilient distributed datasets RDD_1, RDD_2, RDD_3, …, RDD_m respectively;
B: Compute Overlap(R_1, R_2): perform a Cogroup operation on RDD_1 and RDD_2, aggregating their data by Key value to obtain RDD_new; filter RDD_new using the boundary filtering strategy to remove data objects that cannot produce results; then perform the actual spatial join operation and execute the duplicate avoidance strategy to form the intermediate join result; finally, perform the data replication operation on the intermediate join result to form the intermediate result data set RDDresult_new;
C: Using the same computation as in step B, compute the join between RDDresult_new and RDD_3 to obtain the latest intermediate join result RDDresult_new of R_1, R_2, R_3. Repeat the same computation in turn for RDDresult_new and RDD_4, RDD_5, …, RDD_{m-1}, finally obtaining the intermediate join result data set RDDresult_new of the data sets R_1, R_2, R_3, …, R_{m-1};
D: Perform a Cogroup operation on RDDresult_new and RDD_m to generate a new RDD_new, then carry out boundary filtering and the join operation on it and output the result directly, forming the final spatial join result data set RDDresult_new of the data sets R_1, R_2, R_3, …, R_m. Because this is the last spatial join operation, no replication is needed.
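Steps A to D can be simulated end to end in plain Python, with lists of (key, value) pairs standing in for RDDs and a dictionary standing in for Cogroup. This is a sketch under several assumptions, not the patented implementation: it uses row-major cell numbers instead of Z-order codes to stay short, all objects are assumed to lie inside a hypothetical 4 by 4 grid of unit cells, and every name (`cells_of`, `owner_cell`, `overlap_step`, `chained_join`) is invented for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Rect(NamedTuple):
    x1: float
    y1: float
    x2: float
    y2: float


CELL, SIDE = 1.0, 4  # hypothetical grid: 4 x 4 unit cells, origin (0, 0)


def overlaps(a, b):
    return a.x1 < b.x2 and b.x1 < a.x2 and a.y1 < b.y2 and b.y1 < a.y2


def cells_of(mbr):
    """Row-major ids of the cells an MBR touches (stand-in for Z-order codes)."""
    cols = range(max(int(mbr.x1 // CELL), 0), min(int(mbr.x2 // CELL), SIDE - 1) + 1)
    rows = range(max(int(mbr.y1 // CELL), 0), min(int(mbr.y2 // CELL), SIDE - 1) + 1)
    return [r * SIDE + c for r in rows for c in cols]


def owner_cell(a, b):
    """Cell holding the lower-left corner of the overlap of a and b."""
    x, y = max(a.x1, b.x1), max(a.y1, b.y1)
    return int(y // CELL) * SIDE + int(x // CELL)


def project(dataset):
    # Step A: each object becomes a 1-tuple keyed by every cell it touches.
    return [(k, ((oid, mbr),)) for oid, mbr in dataset for k in cells_of(mbr)]


def overlap_step(left, right, last):
    groups = defaultdict(lambda: ([], []))       # simulated Cogroup by cell key
    for k, t in left:
        groups[k][0].append(t)
    for k, t in right:
        groups[k][1].append(t)
    out = []
    for k, (ls, rs) in groups.items():
        if not ls or not rs:
            continue
        tails = [t[-1][1] for t in ls]           # objects taking part in this join
        bound = Rect(min(m.x1 for m in tails), min(m.y1 for m in tails),
                     max(m.x2 for m in tails), max(m.y2 for m in tails))
        rs = [u for u in rs if overlaps(u[0][1], bound)]   # boundary filtering
        for t in ls:
            for u in rs:
                a, b = t[-1][1], u[0][1]
                # spatial join test plus duplicate avoidance
                if overlaps(a, b) and owner_cell(a, b) == k:
                    joined = t + u
                    if last:
                        out.append((k, joined))  # step D: output directly
                    else:                        # steps B and C: replicate
                        out.extend((c, joined) for c in cells_of(b))
    return out


def chained_join(datasets):
    rdds = [project(d) for d in datasets]
    result = rdds[0]
    for i in range(1, len(rdds)):
        result = overlap_step(result, rdds[i], last=(i == len(rdds) - 1))
    return sorted(tuple(oid for oid, _ in t) for _, t in result)


R1 = [("r1", Rect(0.5, 0.5, 1.5, 1.5))]
R2 = [("s1", Rect(1.2, 1.2, 2.6, 1.8)), ("s2", Rect(3.2, 3.2, 3.8, 3.8))]
R3 = [("t1", Rect(2.4, 1.0, 2.9, 1.6)), ("t2", Rect(0.1, 0.1, 0.4, 0.4))]
answer = chained_join([R1, R2, R3])
```

In this made-up data, r1 and s1 meet in cell 5 while s1 and t1 meet in cell 6, so the chain (r1, s1, t1) is found only because the intermediate tuple (r1, s1) is replicated to every cell that s1 overlaps. On a real cluster, the grouping dictionary would correspond to Spark's `cogroup` on pair RDDs.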
Example 2:
A Spark-based chained multi-way spatial join query processing method comprises the following steps. Step 1: divide the whole data space into a number of grid cells of equal size, and encode each grid cell using the Z-order space-filling curve. Step 2: project each spatial object in the m-way (m > 2) spatial join data sets R_1, R_2, …, R_m onto the corresponding grid cells according to its position in the data space, and store the projection results in the resilient distributed datasets RDD_1, RDD_2, …, RDD_m; set the loop variable i = 2 and the intermediate result data set RDDresult_new = RDD_1. Step 3: if the condition i < m is satisfied, perform the spatial join operation Overlap(RDDresult_new, RDD_i) on the two data sets RDDresult_new and RDD_i; during the computation, data aggregation, boundary filtering, spatial join computation, duplicate avoidance, data replication and other operations are carried out in sequence, finally forming a new intermediate result data set RDDresult_new = Overlap(RDDresult_new, RDD_i). Step 4: set i = i + 1 and repeat step 3 until the condition i < m is no longer satisfied. Step 5: perform the last spatial join operation Overlap(RDDresult_new, RDD_m); during the computation, data aggregation, boundary filtering, and spatial join computation are carried out in sequence, and the result is output directly to form the final spatial join result set, which is stored in the HDFS file system. The method improves processing efficiency, reduces computation cost, and offers good adaptability and scalability.
Example 3:
A Spark-based chained multi-way spatial join query processing method comprises the following steps:
Step 1: divide the whole data space into a number of grid cells of equal size using a grid partitioning method, and encode each grid cell using the Z-order space-filling curve;
Step 2: project each spatial object in the m-way (m > 2) spatial join data sets R_1, R_2, …, R_m onto the corresponding grid cells according to its position in the data space, forming a series of key-value pairs; store the projection results in the resilient distributed datasets RDD_1, RDD_2, …, RDD_m respectively; set the loop variable i = 2 and the intermediate result data set RDDresult_new = RDD_1;
Step 3: if the condition i < m is satisfied, perform the spatial join operation Overlap(RDDresult_new, RDD_i) on the two data sets RDDresult_new and RDD_i. During the computation, data aggregation, boundary filtering, spatial join computation, duplicate avoidance, data replication and other operations are carried out in sequence, finally forming the intermediate result data set RDDresult_new, that is, RDDresult_new = Overlap(RDDresult_new, RDD_i);
Step 4: set i = i + 1 and repeat step 3 until the condition i < m is no longer satisfied;
Step 5: perform the last spatial join operation Overlap(RDDresult_new, RDD_m). During the computation, data aggregation, boundary filtering, and spatial join computation are carried out in sequence, and the result is output directly to form the final spatial join result set, which is stored in the HDFS file system.
Further, the data partitioning and encoding method is as follows: the whole data space is divided into n grid cells of equal size using a grid-based partitioning method; the grid cells are encoded using the Z-order space-filling curve; spatial data objects are projected onto the grid cells according to their positions; and all grid cells are mapped by hashing onto multiple execution units of the Spark platform, so that the whole processing task is divided into multiple tasks processed in parallel.
Further, the spatial object projection is as follows: a spatial data object is mapped onto the corresponding grid cells according to its position. Let C = (c_1, c_2, …, c_n) denote a partition of the data space, where c_i denotes a grid cell and also serves as the Z-order code of that cell, and let R be a set of spatial objects to be joined. If the MBR of a spatial object u ∈ R overlaps grid cell c_i, then u is mapped to grid cell c_i and the corresponding key-value pair (c_i, u) is generated; if a spatial object overlaps multiple grid cells, multiple key-value pairs are generated accordingly.
Further, step 3 specifically includes the following steps:
Step 3-1: to compute Overlap(RDDresult_new, RDD_i), execute a Cogroup operation on RDDresult_new and RDD_i by key, that is, gather the data in RDDresult_new and RDD_i together by key value to obtain RDD_new;
Step 3-2: filter RDD_new using the filtering strategy to remove data pairs that cannot produce a result, and then perform the actual spatial join operation;
Step 3-3: execute the duplicate avoidance strategy to form an intermediate join result, execute the data replication operation on that intermediate result, and finally form a new intermediate join result data set RDDresult_new.
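The aggregation in step 3-1 and the per-cell join in step 3-2 can be sketched without a cluster. Spark pair RDDs provide a `cogroup` transformation that groups the values of two data sets by key; the `cogroup` function below is a plain-Python imitation of that behaviour, not Spark code, and `join_groups` is a hypothetical nested-loop join that omits filtering and duplicate avoidance for brevity.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def cogroup(left: List[Tuple[int, Rect]],
            right: List[Tuple[int, Rect]]) -> Dict[int, Tuple[List[Rect], List[Rect]]]:
    """Gather the values of both inputs under their shared grid-cell key."""
    groups: Dict[int, Tuple[List[Rect], List[Rect]]] = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return dict(groups)


def join_groups(groups: Dict[int, Tuple[List[Rect], List[Rect]]]) -> List[Tuple[int, Tuple[Rect, Rect]]]:
    """Within each grid cell, pair every left object with every intersecting right object."""
    out = []
    for key, (lefts, rights) in groups.items():
        for l in lefts:
            for r in rights:
                if intersects(l, r):
                    out.append((key, (l, r)))
    return out


A, C, D = (0, 0, 3, 3), (2, 2, 5, 5), (8, 8, 9, 9)
grouped = cogroup([(0, A), (1, A)], [(1, C), (2, D)])
print(join_groups(grouped))
```

Because objects are compared only when they share a key, cells that hold data from just one side contribute nothing to the join.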
Further, step 5 comprises the following steps:
Step 5-1: to compute Overlap(RDDresult_new, RDD_i), execute a Cogroup operation on RDDresult_new and RDD_i by key, that is, gather the data in RDDresult_new and RDD_i together by key value to obtain RDD_new;
Step 5-2: filter RDD_new using the filtering strategy to remove data pairs that cannot produce a result, and then perform the actual spatial join operation;
Step 5-3: execute the duplicate avoidance strategy to form the final join result data set RDDresult_new, which consists of tuple pairs, and save it to the HDFS file system.
Further, the data replication operation is as follows: for any tuple t in the intermediate join result set T generated by the latest spatial join operation on the current grid cell, let t.s be the spatial object involved in the next spatial join operation; if t.s overlaps some grid cell c_i, the tuple t is copied to grid cell c_i and a corresponding key-value pair (c_i, t) is generated.
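The replication rule can be sketched as re-keying each intermediate tuple by the cells that its last object overlaps. In this illustration a tuple is a Python tuple of rectangles whose last element plays the role of t.s, the grid is given as an explicit dictionary from cell code to cell rectangle, and a strict overlap test is used so that merely touching a cell border does not count; all names are hypothetical.

```python
from typing import Dict, List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def overlaps(a: Rect, b: Rect) -> bool:
    """Strict overlap: rectangles sharing only a border are not considered overlapping."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def replicate(result_tuples: List[Tuple[Rect, ...]],
              cells: Dict[int, Rect]) -> List[Tuple[int, Tuple[Rect, ...]]]:
    """Emit (c_i, t) for every grid cell c_i overlapped by t.s, the last object of t."""
    out = []
    for t in result_tuples:
        s = t[-1]  # the object that takes part in the next join
        for code, cell_rect in cells.items():
            if overlaps(s, cell_rect):
                out.append((code, t))
    return out


# A 2 x 2 grid over [0, 4] x [0, 4], keyed by Z-order code.
cells = {0: (0, 0, 2, 2), 1: (2, 0, 4, 2), 2: (0, 2, 2, 4), 3: (2, 2, 4, 4)}
t = ((0.5, 0.2, 1.8, 0.9), (1.5, 0.5, 2.5, 1.0))
print(replicate([t], cells))
```

Note that the first object of the tuple lies entirely in cell 0, yet the tuple is also sent to cell 1 because its last object extends there.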
Further, the filtering strategy is as follows: while the join operations are executed in parallel, a corresponding filtering strategy is applied so that tuples that cannot produce a join result are removed and only tuples that may produce a join result are copied.
Further, the filtering strategy comprises two parts:
boundary filtering: before a join operation is performed, the bounding MBR of the spatial objects in the completed intermediate join result that are involved in the subsequent spatial join is first computed, and this MBR is used to filter out the spatial objects in the subsequent data set to be joined that do not intersect it, thereby reducing the cost of the subsequent join computation;
replication-phase filtering: in multi-path join query processing, the intermediate results of the earlier join stages must be replicated; they are copied only to the other grid cells that may produce join results, thereby avoiding the loss of join results.
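Boundary filtering as described above can be sketched in a few lines: compute one bounding MBR over the objects that take part in the next join, then discard every candidate that does not intersect it. The sketch assumes that the last element of each intermediate tuple is the object involved in the next join and that the MBR is computed over whatever collection is passed in; whether this is done per grid cell or globally is not decided here. The function names are hypothetical.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def bounding_mbr(rects: List[Rect]) -> Rect:
    """Smallest rectangle enclosing all given rectangles."""
    x0s, y0s, x1s, y1s = zip(*rects)
    return (min(x0s), min(y0s), max(x1s), max(y1s))


def boundary_filter(intermediate: List[Tuple[Rect, ...]],
                    next_dataset: List[Rect]) -> List[Rect]:
    """Keep only the objects of the next data set that intersect the bounding MBR."""
    if not intermediate:
        return []  # nothing to join against, so no candidate can produce a result
    bound = bounding_mbr([t[-1] for t in intermediate])
    return [obj for obj in next_dataset if intersects(obj, bound)]


intermediate = [((0, 0, 1, 1), (1, 1, 2, 2)), ((0, 0, 1, 1), (3, 0, 4, 1))]
candidates = [(3.5, 1.5, 5, 5), (6, 6, 7, 7), (0, 3, 1, 4)]
print(boundary_filter(intermediate, candidates))
```

The filter is conservative: an object that passes it may still fail the exact join test, but an object that is discarded cannot intersect any of the enclosed objects.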
Further, the duplicate avoidance strategy is as follows: when two spatial objects that each span multiple grid cells are joined, only the grid cell containing the lower-left corner of the new object formed by their overlap is responsible for outputting the result.
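The duplicate avoidance rule can be illustrated as a reference-point test: the lower-left corner of the overlap region of two MBRs falls in exactly one grid cell, and only that cell reports the pair. The sketch assumes a square grid of `n` by `n` cells and identifies cells by `(col, row)` rather than by Z-order code for brevity; `reference_cell` and `should_output` are hypothetical names.

```python
from typing import Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def reference_cell(a: Rect, b: Rect, extent: Rect, n: int) -> Tuple[int, int]:
    """Grid cell (col, row) containing the lower-left corner of the overlap of a and b."""
    xmin, ymin, xmax, ymax = extent
    ref_x = max(a[0], b[0])  # left edge of the overlap region
    ref_y = max(a[1], b[1])  # bottom edge of the overlap region
    col = min(int((ref_x - xmin) / (xmax - xmin) * n), n - 1)
    row = min(int((ref_y - ymin) / (ymax - ymin) * n), n - 1)
    return col, row


def should_output(a: Rect, b: Rect, current_cell: Tuple[int, int],
                  extent: Rect, n: int) -> bool:
    """True only in the single cell that owns the result pair (a, b)."""
    return reference_cell(a, b, extent, n) == current_cell


extent = (0.0, 0.0, 8.0, 8.0)
a, b = (1.0, 1.0, 5.0, 5.0), (3.0, 3.0, 7.0, 7.0)
owners = [(c, r) for r in range(4) for c in range(4)
          if should_output(a, b, (c, r), extent, 4)]
print(owners)
```

Both objects here share four grid cells, so without this test the same pair would be reported four times.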
Although specific embodiments of the present invention are described above, those skilled in the art should understand that these are merely examples. The present invention is a Spark-based chained multi-path spatial join query processing method, and the examples serve only to illustrate the core ideas of the filtering strategies, the duplicate avoidance strategy, the join processing procedure, and the like. Larger-scale experiments may subsequently be performed, the related methods may be further refined to improve the effect of data projection, replication, and filtering, and combination with indexing techniques may be considered to further improve the performance of the method, all without departing from the principle and essence of the invention. The scope of the invention is limited only by the appended claims.

Claims (9)

1. A Spark-based chained multi-path spatial join query processing method, characterized in that the method comprises the following steps:
Step 1: divide the whole data space into a plurality of grid cells of equal size using a grid partitioning method, and encode each grid cell using the Z-order space-filling curve technique;
Step 2: project each spatial object in the m spatial join data sets R_1, R_2, …, R_m to the corresponding grid cell according to its position in the data space, forming a series of key-value pairs, where m > 2; store the projection results in the resilient distributed datasets RDD_1, RDD_2, …, RDD_m respectively, set the loop variable i = 2, and set the intermediate result data set RDDresult_new = RDD_1;
Step 3: if the condition i < m is satisfied, perform the spatial join operation Overlap(RDDresult_new, RDD_i) on the two data sets RDDresult_new and RDD_i; during the computation, data aggregation, boundary filtering, spatial join computation, duplicate avoidance, and data replication are performed in sequence, finally forming the intermediate result data set RDDresult_new, that is, RDDresult_new = Overlap(RDDresult_new, RDD_i);
Step 4: set i = i + 1 and repeat step 3 until the condition i < m is no longer satisfied;
Step 5: perform the last spatial join operation Overlap(RDDresult_new, RDD_m); during the computation, data aggregation, boundary filtering, and spatial join computation are performed in sequence, and the result is output directly to form the final spatial join result set, which is stored in the HDFS file system.
2. The Spark-based chained multi-path spatial join query processing method according to claim 1, wherein the grid partitioning and encoding method is as follows: using a grid-based partitioning method, the whole data space is divided into n grid cells of equal size, and the grid cells are encoded with a Z-order space-filling curve; each spatial data object is projected to the grid cells according to its position, and all grid cells are mapped to a plurality of execution units by hashing, so that the whole processing task is divided into a plurality of parallel processing tasks.
3. The Spark-based chained multi-path spatial join query processing method according to claim 1, wherein the specific steps of projecting each spatial object in the m spatial join data sets R_1, R_2, …, R_m to the corresponding grid cell according to its position in the data space are as follows: each spatial data object is mapped to the corresponding grid cells according to its position; let C = (c_1, c_2, …, c_n) denote a partition of the data space, where each c_i denotes a grid cell, and let R be a set of spatial objects to be joined; if the MBR of a spatial object u ∈ R overlaps grid cell c_i, where c_i is the Z-order code of the grid cell, then u is mapped to grid cell c_i and a corresponding key-value pair (c_i, u) is generated; if a spatial object overlaps multiple grid cells, multiple key-value pairs are generated accordingly.
4. The Spark-based chained multi-path spatial join query processing method according to claim 1, wherein: the step 3 specifically comprises the following steps:
Step 3-1: to compute Overlap(RDDresult_new, RDD_i), execute a Cogroup operation on RDDresult_new and RDD_i by key, that is, gather the data in RDDresult_new and RDD_i together by key value to obtain RDD_new;
Step 3-2: filter RDD_new using the filtering strategy to remove data pairs that cannot produce a result, and then perform the actual spatial join operation;
Step 3-3: execute the duplicate avoidance strategy to form an intermediate join result, execute the data replication operation on that intermediate result, and finally form a new intermediate join result data set RDDresult_new.
5. The Spark-based chained multi-path spatial join query processing method according to claim 1, wherein: step 5 comprises the following steps:
Step 5-1: to compute Overlap(RDDresult_new, RDD_i), execute a Cogroup operation on RDDresult_new and RDD_i by key, that is, gather the data in RDDresult_new and RDD_i together by key value to obtain RDD_new;
Step 5-2: filter RDD_new using the filtering strategy to remove data pairs that cannot produce a result, and then perform the actual spatial join operation;
Step 5-3: execute the duplicate avoidance strategy to form the final join result data set RDDresult_new, which consists of tuple pairs, and save it to the HDFS file system.
6. The Spark-based chained multi-path spatial join query processing method according to claim 1, wherein the data replication operation is as follows: for any tuple t in the intermediate join result set T generated by the latest spatial join operation on the current grid cell, let t.s be the spatial object involved in the next spatial join operation; if t.s overlaps some grid cell c_i, the tuple t is copied to grid cell c_i and a corresponding key-value pair (c_i, t) is generated.
7. The Spark-based chained multi-path spatial join query processing method according to claim 4, wherein the filtering strategy is as follows: while the join operations are executed in parallel, a corresponding filtering strategy is applied so that tuples that cannot produce a join result are removed and only tuples that may produce a join result are copied.
8. The Spark-based chained multi-path spatial join query processing method according to claim 7, wherein the filtering strategy comprises two parts:
boundary filtering: before a join operation is performed, the bounding MBR of the spatial objects in the completed intermediate join result that are involved in the subsequent spatial join is first computed, and this MBR is used to filter out the spatial objects in the subsequent data set to be joined that do not intersect it, thereby reducing the cost of the subsequent join computation;
replication-phase filtering: in multi-path join query processing, the intermediate results of the earlier join stages must be replicated; they are copied only to the other grid cells that may produce join results, thereby avoiding the loss of join results.
9. The Spark-based chained multi-path spatial join query processing method according to claim 4, wherein the duplicate avoidance strategy is as follows: when two spatial objects that each span multiple grid cells are joined, only the grid cell containing the lower-left corner of the new object formed by their overlap is responsible for outputting the result.
CN201710083816.9A 2017-02-16 2017-02-16 Chained multi-path space connection query processing method based on Spark Active CN106909639B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710083816.9A CN106909639B (en) 2017-02-16 2017-02-16 Chained multi-path space connection query processing method based on Spark

Publications (2)

Publication Number Publication Date
CN106909639A CN106909639A (en) 2017-06-30
CN106909639B true CN106909639B (en) 2020-09-29

Family

ID=59209302

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710083816.9A Active CN106909639B (en) 2017-02-16 2017-02-16 Chained multi-path space connection query processing method based on Spark

Country Status (1)

Country Link
CN (1) CN106909639B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113722314B (en) * 2020-12-31 2024-04-16 京东城市(北京)数字科技有限公司 Space connection query method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101692230A (en) * 2009-07-28 2010-04-07 武汉大学 Three-dimensional R tree spacial index method considering levels of detail
CN104391679A (en) * 2014-11-18 2015-03-04 浪潮电子信息产业股份有限公司 GPU (graphics processing unit) processing method for high-dimensional data stream in irregular stream
CN106055563A (en) * 2016-05-19 2016-10-26 福建农林大学 Method for parallel space query based on grid division and system of same
CN106209989A (en) * 2016-06-29 2016-12-07 山东大学 Spatial data concurrent computational system based on spark platform and method thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9311380B2 (en) * 2013-03-29 2016-04-12 International Business Machines Corporation Processing spatial joins using a mapreduce framework
US9870397B2 (en) * 2014-08-19 2018-01-16 International Business Machines Corporation Processing multi-way theta join queries involving arithmetic operators on MapReduce

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
A Boundary Filtering Based Spatial Join Query Processing Optimization Algorithm;Baiyou Qiao等;《2015 12th International Conference on Fuzzy Systems and Knowledge Discovery(FSKD)》;20150817;第1764-1769页 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant