CN106909639B

CN106909639B - Spark-based chained multi-path spatial join query processing method

Info

Publication number: CN106909639B
Application number: CN201710083816.9A
Authority: CN
Inventors: 乔百友; 王秋杰; 韩东红; 王国仁
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2017-02-16
Filing date: 2017-02-16
Publication date: 2020-09-29
Anticipated expiration: 2037-02-16
Also published as: CN106909639A

Abstract

The invention discloses a Spark-based chained multi-path spatial join query processing algorithm comprising the following steps. Step 1: divide the whole data space into a plurality of grid cells of equal size and encode each grid cell using a Z-order space-filling curve. Step 2: project each spatial object in the m-way spatial join data sets onto the corresponding grid cells according to its position in the data space. Step 3: if the condition i < m is satisfied, execute the spatial join operation Overlap on the two data sets RDDresult_new and RDD_i. Step 4: set i = i + 1 and repeat step 3 until the condition i < m is no longer satisfied. Step 5: execute the final spatial join operation Overlap. The algorithm markedly improves processing efficiency and reduces computation cost.

Description

Spark-based chained multi-path spatial join query processing method

Technical Field

The invention relates to the technical field of spatial data query processing, in particular to a Spark-based chained multi-path spatial connection query processing method.

Background

Spatial join queries are an important class of spatial data queries and are widely used in spatial data management; spatial join query processing is a research focus in the field of spatial database management. A multi-path spatial join query is a common spatial join operation that retrieves, from a plurality of spatial data sets, all spatial objects satisfying a given spatial predicate (such as intersection or containment). It is one of the most time-consuming spatial operations, and its complexity and importance make it a key factor in the overall performance of a spatial data management system, so improving the processing efficiency of multi-path spatial join queries has long been an active research problem in academia. In recent years in particular, with the rapid development and wide application of Internet of Things, earth observation and location-based service technologies, the volume of spatial data has grown sharply and spatial data has become an important kind of big data. Performing efficient multi-path spatial join query processing on such spatial big data has become a major challenge in spatial data management. Traditional processing techniques based on spatial databases scale poorly and therefore struggle to meet the demand for fast querying and processing of spatial big data, whereas Spark, a distributed parallel processing platform for very large-scale data, has attracted wide attention and is a key technology in current big data processing. Studying efficient multi-path spatial join query processing methods for spatial big data that build on the large-scale data processing capability of the Spark platform has therefore become an important way to address this challenge.

Existing methods for multi-path spatial join query processing mainly have the following problems: (1) traditional multi-path join processing methods based on spatial databases mainly adopt a centralized processing mode, scale poorly, and struggle to meet the demand for fast query processing of spatial big data; (2) most popular existing algorithms, such as the dynamic programming algorithm and the hybrid join algorithm, construct indexes centrally and are inefficient for join queries over massive data; (3) existing distributed processing methods are mainly based on the Hadoop platform and focus on optimizing general multi-path join query processing, and they suffer from excessive data replication and weak filtering capability, which degrades query processing efficiency; (4) the most recent distributed multi-path spatial join algorithms are the two MapReduce-based multi-path spatial join query processing algorithms proposed by Gupta et al., All-Replicate and Controlled-Replicate. All-Replicate partitions the spatial objects of every join data set and replicates each of them to all grid cells in its fourth quadrant, and then performs the multi-path join operation. This obviously replicates a large number of spatial objects, which hurts join processing efficiency. The authors therefore proposed an improved multi-path spatial join query processing algorithm, Controlled-Replicate, which reduces data replication to some extent and improves query processing efficiency but still replicates too much data. On the Spark platform, excessive replication makes the amount of data loaded into memory at one time too large, so Spark's advantage of in-memory computation cannot be fully exploited and query efficiency suffers.

Once these problems have been studied in depth and a corresponding solution provided, the solution can be applied to related fields such as join query processing of spatial big data. The invention therefore provides a multi-path spatial join query processing algorithm for the Spark platform. It mainly targets chained multi-path spatial join queries, uses a grid-based data space partitioning method combined with Z-order coding to partition and encode the data, and projects and replicates data according to the spatial position of each data object. During the join process the algorithm uses boundary filtering to discard useless join data, thereby reducing redundant computation in subsequent joins and redundant projection and replication of join objects. It also adopts a duplicate avoidance strategy to reduce the output of duplicate results, which lowers the overall cost of subsequent join computation and improves the efficiency of multi-path join query processing.

Disclosure of Invention

In view of the defects in the prior art, the present invention aims to provide a Spark-based chained multi-path spatial join query processing method that addresses chained multi-path spatial join query processing and concentrates on reducing the amount of spatial data replication and computation in the filtering stage, thereby reducing subsequent join computation cost and improving query processing efficiency, with good adaptability and scalability.

In order to achieve the purpose, the technical scheme of the invention is as follows:

A Spark-based chained multi-path spatial join query processing method comprises the following steps:

Step 1: divide the whole data space into a plurality of grid cells of equal size using a grid partitioning method, and encode each grid cell using a Z-order space-filling curve;

Step 2: project each spatial object in the m-way (m > 2) spatial join data sets R₁, R₂, …, R_m onto the corresponding grid cells according to its position in the data space, forming a series of key-value pairs; store the projection results in the resilient distributed datasets RDD₁, RDD₂, …, RDD_m respectively; set the loop variable i to 2 and the intermediate result data set RDDresult_new = RDD₁;

Step 3: if the condition i < m is satisfied, perform the spatial join operation Overlap(RDDresult_new, RDD_i) on the two data sets RDDresult_new and RDD_i. During the computation, data aggregation, boundary filtering, spatial join computation, duplicate avoidance, data replication and other operations are carried out in sequence, finally forming the intermediate result data set RDDresult_new, that is, RDDresult_new = Overlap(RDDresult_new, RDD_i);

Step 4: set i = i + 1 and repeat step 3 until the condition i < m is no longer satisfied;

Step 5: perform the last spatial join operation Overlap(RDDresult_new, RDD_m). During the computation, data aggregation, boundary filtering and spatial join computation are carried out in sequence, and the result is output directly to form the final spatial join result set, which is stored in the HDFS file system.
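By way of non-limiting illustration, the control flow of steps 2 to 5 may be sketched in plain Python as follows. The function name chained_join and its two callable parameters are hypothetical and do not appear in the method itself; plain lists stand in for the RDDs, and the two callables stand in for the intermediate and final forms of the Overlap operation.

```python
def chained_join(datasets, overlap_intermediate, overlap_final):
    """Evaluate Overlap(...Overlap(Overlap(R1, R2), R3)..., Rm) left to right.

    datasets: the projected inputs RDD_1 .. RDD_m (plain objects in this sketch).
    overlap_intermediate: hypothetical callable for the joins of step 3.
    overlap_final: hypothetical callable for the last join of step 5.
    """
    m = len(datasets)
    if m <= 2:
        raise ValueError("the text assumes m > 2 data sets")
    result_new = datasets[0]            # step 2: RDDresult_new = RDD_1
    i = 2                               # step 2: loop variable, 1-based as in the text
    while i < m:                        # step 3: guard i < m
        result_new = overlap_intermediate(result_new, datasets[i - 1])
        i += 1                          # step 4: i = i + 1
    return overlap_final(result_new, datasets[m - 1])   # step 5: join with RDD_m


# Toy stand-ins that only record how the calls nest.
trace = chained_join(
    ["R1", "R2", "R3", "R4"],
    lambda a, b: f"I({a},{b})",
    lambda a, b: f"F({a},{b})",
)
```

With four inputs the sketch performs two intermediate joins followed by one final join, which is the left-deep nesting that the loop condition i < m implies.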

Further, the data partitioning and encoding method is as follows: the whole data space is divided into n grid cells of equal size using a grid-based partitioning method, the grid cells are encoded using a Z-order space-filling curve, each spatial data object is projected onto grid cells according to its position, and all grid cells are mapped to a plurality of execution units by hashing, so that the whole processing task is divided into a plurality of parallel processing tasks.

Further, the spatial object projection is as follows: each spatial data object is mapped to the corresponding grid cells according to its position. Let C = (c₁, c₂, …, c_n) denote a partition of the data space, where each c_i denotes a grid cell (c_i is also the Z-order code of that cell), and let R be a set of spatial objects to be joined. If the MBR (minimum bounding rectangle) of a spatial object u ∈ R overlaps grid cell c_i, then u is mapped to c_i and the corresponding key-value pair (c_i, u) is generated. If a spatial object overlaps multiple grid cells, multiple key-value pairs are generated accordingly.
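The projection described above may be sketched as follows, under stated assumptions. The names z_order and project are hypothetical. The sketch assumes a square grid of 2^bits by 2^bits cells, an MBR given as (xmin, ymin, xmax, ymax), closed cell boundaries, and a Z-order convention in which column bits occupy the even positions; the figures may use a different convention.

```python
def z_order(col, row, bits):
    """Interleave the bits of (col, row) into a Z-order code (assumed convention)."""
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)        # column bit -> even position
        code |= ((row >> b) & 1) << (2 * b + 1)    # row bit -> odd position
    return code


def project(obj_id, mbr, space, bits):
    """Return the key-value pairs (cell_code, obj_id) for every cell the MBR overlaps."""
    xmin, ymin, xmax, ymax = mbr
    sx0, sy0, sx1, sy1 = space
    n = 1 << bits                                  # cells per side
    cell_w = (sx1 - sx0) / n
    cell_h = (sy1 - sy0) / n

    def index(v, origin, size):
        # Clamp so that objects touching the outer border stay inside the grid.
        return min(n - 1, max(0, int((v - origin) // size)))

    c0, c1 = index(xmin, sx0, cell_w), index(xmax, sx0, cell_w)
    r0, r1 = index(ymin, sy0, cell_h), index(ymax, sy0, cell_h)
    return [(z_order(c, r, bits), obj_id)
            for r in range(r0, r1 + 1)
            for c in range(c0, c1 + 1)]


# Hypothetical object "u" straddling four unit cells of a 4 x 4 grid.
pairs = project("u", (1.5, 1.5, 2.5, 2.5), (0.0, 0.0, 4.0, 4.0), bits=2)
```

An object that lies inside a single cell yields exactly one pair, while an object spanning several cells yields one pair per overlapped cell, as the text requires.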

Further, step 3 specifically includes the following steps:

Step 3-1: compute Overlap(RDDresult_new, RDD_i): execute a Cogroup operation on RDDresult_new and RDD_i by Key value, that is, gather the data in RDDresult_new and RDD_i together by Key value to obtain RDD_new;

Step 3-2: filter RDD_new using the filtering strategy to remove data pairs that cannot produce a result, and then perform the actual spatial join operation;

Step 3-3: execute the duplicate avoidance strategy to form the intermediate join result, perform the data replication operation on the intermediate join result, and finally form the new intermediate join result data set RDDresult_new.
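The aggregation and join parts of step 3 may be illustrated with a plain-Python stand-in for a cogroup by key followed by a per-cell MBR intersection test. This is only a sketch: the helper names cogroup, intersects and join_cell are hypothetical, lists of (key, value) pairs stand in for RDDs, each object is assumed to be an (id, mbr) pair, and filtering, duplicate avoidance and replication are deliberately left out.

```python
from collections import defaultdict


def cogroup(left, right):
    """Group two lists of (key, value) pairs by key into key -> (left_values, right_values)."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return dict(groups)


def intersects(a, b):
    """True if two MBRs (xmin, ymin, xmax, ymax) overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def join_cell(tuples, candidates):
    """Extend each intermediate tuple with every candidate that meets its last object."""
    return [t + (c,)
            for t in tuples
            for c in candidates
            if intersects(t[-1][1], c[1])]


# Hypothetical data: one intermediate tuple in cell 3, three candidates in cells 3 and 5.
r1 = ("r1", (1.5, 1.5, 2.5, 2.5))
s1 = ("s1", (0.5, 1.2, 2.8, 2.6))
s2 = ("s2", (3.5, 3.5, 3.8, 3.8))
s3 = ("s3", (0.1, 0.1, 0.2, 0.2))
grouped = cogroup([(3, (r1,))], [(3, s1), (3, s2), (5, s3)])
joined = join_cell(*grouped[3])
```

Only cells that hold data from both sides can produce join results, so a cell such as 5 in the example, which has candidates but no intermediate tuples, contributes nothing.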

Further, step 5 comprises the following steps:

Step 5-1: compute Overlap(RDDresult_new, RDD_i), where i = m at this point: execute a Cogroup operation on RDDresult_new and RDD_i by Key value, that is, gather the data in RDDresult_new and RDD_i together by Key value to obtain RDD_new;

Step 5-2: filter RDD_new using the filtering strategy to remove data pairs that cannot produce a result, and then perform the actual spatial join operation;

Step 5-3: execute the duplicate avoidance strategy to form the final join result data set RDDresult_new, which consists of tuple pairs, and save it to the HDFS file system.

Further, the data replication operation is as follows: for any tuple t in the intermediate join result set T generated by the latest spatial join operation on the current grid cell, let t.s be the spatial object involved in the next spatial join operation; if t.s overlaps a grid cell c_i, the tuple t is copied to grid cell c_i and the corresponding key-value pair (c_i, t) is generated.
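A minimal sketch of this replication step is given below. It assumes, purely for illustration, that a tuple is a Python tuple of (id, mbr) objects whose last element plays the role of t.s, that the grid starts at the origin with square cells of side cell_size, and that cells are keyed by (column, row) rather than by Z-order code. The names overlapped_cells and replicate are hypothetical.

```python
def overlapped_cells(mbr, cell_size):
    """List the (col, row) cells that an MBR (xmin, ymin, xmax, ymax) overlaps."""
    xmin, ymin, xmax, ymax = mbr
    c0, c1 = int(xmin // cell_size), int(xmax // cell_size)
    r0, r1 = int(ymin // cell_size), int(ymax // cell_size)
    return [(c, r) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


def replicate(t, cell_size):
    """Copy tuple t to every cell overlapped by t.s, emitting (cell, t) pairs."""
    _, s_mbr = t[-1]                     # t.s: the object used by the next join
    return [(cell, t) for cell in overlapped_cells(s_mbr, cell_size)]


# Hypothetical intermediate result (r1, r2) in which r2 spans two unit cells.
t = (("r1", (0.2, 0.2, 0.8, 0.8)), ("r2", (0.5, 0.5, 1.5, 0.9)))
copies = replicate(t, cell_size=1.0)
```

Keying the copies by the cells of t.s, and not by the cells of the whole tuple, reflects the statement that only the object involved in the next join determines where the tuple is needed.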

Further, the filtering strategy is as follows: in the process of executing the connection operation in parallel, a corresponding filtering strategy is adopted, tuples which can not generate the connection result are removed, and only tuples which can generate the connection result are copied.

Further, the filtering strategy comprises two parts:

Boundary filtering: before a join operation is performed, the bounding MBR of the spatial objects in the completed intermediate join results that are involved in the subsequent spatial join is computed, and this MBR is used to filter out the spatial objects in the next data set to be joined that do not intersect it, thereby reducing the cost of the subsequent join computation;

Replication-phase filtering: in multi-path join query processing, the intermediate results of the earlier joins must be replicated; they are replicated only to those other grid cells that may produce join results, so that no join result is lost.
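The boundary filtering part may be sketched as follows. The helper names bounding_mbr, intersects and boundary_filter are hypothetical, objects are assumed to be (id, mbr) pairs with MBRs given as (xmin, ymin, xmax, ymax), and the coordinates in the example are invented for illustration only.

```python
def bounding_mbr(mbrs):
    """Smallest rectangle enclosing all given MBRs."""
    xmins, ymins, xmaxs, ymaxs = zip(*mbrs)
    return (min(xmins), min(ymins), max(xmaxs), max(ymaxs))


def intersects(a, b):
    """True if two MBRs overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def boundary_filter(intermediate, candidates):
    """Keep only candidates that meet the bounding MBR of the objects t.s."""
    if not intermediate:
        return []                        # nothing to join with in this cell
    bound = bounding_mbr([t[-1][1] for t in intermediate])
    return [c for c in candidates if intersects(c[1], bound)]


# Hypothetical cell contents: three intermediate tuples and five candidates.
r1 = ("r1", (1.0, 1.0, 3.0, 3.0))
s = [("s1", (1.0, 1.0, 2.0, 2.0)), ("s2", (2.0, 2.0, 3.0, 3.0)), ("s3", (1.5, 2.5, 2.5, 3.5))]
intermediate = [(r1, si) for si in s]
candidates = [
    ("t1", (4.0, 4.0, 5.0, 5.0)),
    ("t2", (2.5, 1.0, 3.5, 1.5)),
    ("t3", (0.5, 3.0, 1.2, 4.0)),
    ("t4", (0.0, 0.0, 0.5, 0.5)),
    ("t5", (3.2, 0.0, 4.0, 0.8)),
]
kept = boundary_filter(intermediate, candidates)
```

The filter is conservative: a candidate that meets the bounding MBR may still fail the exact join test, but a candidate that misses it cannot join with any of the enclosed objects, so discarding it loses no results.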

Further, the duplicate avoidance strategy is as follows: when two spatial objects that each span multiple grid cells are joined, only the grid cell containing the lower-left corner of the new object formed by the overlap of the two objects is responsible for outputting the result.
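One way to realize this rule is the reference-point test sketched below. The names z_order, reference_cell and should_emit are hypothetical, and the sketch assumes a grid anchored at the origin with square cells, two MBRs that are already known to intersect, and a Z-order convention with column bits on the even positions. The coordinates are invented so that the overlapped cells match the cell numbers quoted for FIG. 5 under that assumed convention.

```python
def z_order(col, row, bits):
    """Interleave the bits of (col, row) into a Z-order code (assumed convention)."""
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)
        code |= ((row >> b) & 1) << (2 * b + 1)
    return code


def reference_cell(a, b, cell_size, bits):
    """Code of the cell holding the lower-left corner of the overlap of MBRs a and b."""
    n = 1 << bits
    x, y = max(a[0], b[0]), max(a[1], b[1])      # lower-left corner of the overlap
    col = min(n - 1, int(x // cell_size))
    row = min(n - 1, int(y // cell_size))
    return z_order(col, row, bits)


def should_emit(current_cell, a, b, cell_size, bits):
    """Only the reference cell outputs the pair; every other cell stays silent."""
    return current_cell == reference_cell(a, b, cell_size, bits)


# Hypothetical MBRs on a 4 x 4 grid of unit cells.
r1 = (1.5, 1.5, 2.5, 2.5)    # overlaps cells 3, 6, 9, 12
r2 = (0.3, 2.3, 1.7, 3.5)    # overlaps cells 8, 9, 10, 11
s1 = (0.5, 1.2, 2.8, 2.6)    # overlaps cells 2, 3, 6, 8, 9, 12
emitters = [c for c in (3, 6, 9, 12) if should_emit(c, r1, s1, 1.0, 2)]
```

Because every cell that sees both objects computes the same reference cell independently, no coordination between cells is needed to decide which one outputs the pair.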

Compared with the prior art, the invention has the following beneficial effects: the Spark-based chained multi-path spatial join query processing method partitions the data space using a grid partitioning method, projects and replicates data based on the positions of the spatial objects, filters out useless join objects by boundary filtering during computation, and reduces data replication by narrowing the replication range. It thereby markedly improves processing efficiency and reduces computation cost, and it has good adaptability and scalability.

Drawings

FIG. 1 is an exemplary diagram of Z-order curve coding in an embodiment of the present invention;

FIG. 2 is a diagram illustrating partitioning and task mapping of data according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating an exemplary operation of projecting and copying data in accordance with an embodiment of the present invention;

FIG. 4 is an exemplary diagram of boundary filtering as described in the detailed description of the invention;

FIG. 5 is an exemplary illustration of the avoidance of repetition described in the detailed description of the invention;

fig. 6 is a schematic processing flow diagram of a Spark-based chained multi-path spatial join query processing method according to an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings.

The invention provides a Spark-based chained multi-path spatial connection query processing method, which mainly focuses on the problem of chained multi-path spatial connection query processing, and is mainly characterized by reducing the spatial data copying and calculation amount in the filtering stage, thereby reducing the subsequent connection calculation cost and improving the query processing efficiency.

Example 1:

A Spark-based chained multi-path spatial join query processing method mainly comprises the following:

1) Data space partitioning and encoding: the whole space is divided into n grid cells of equal size using a grid partitioning method, the grid cells are encoded using a Z-order space-filling curve, each spatial data object is projected onto grid cells according to its position, and all grid cells are mapped to a plurality of execution units by hashing, so that the whole processing task is divided into a plurality of parallel processing tasks.

As shown in FIG. 1, a space-filling curve is used to encode the grid cells in order to preserve the spatial relationships between spatial objects; FIG. 1 shows the Z-order curve when the number of bits is 1 and 2. The Z-order curve is a space-filling curve. The Z-order technique uses bits to represent the attribute information of a spatial object and then decomposes the data space repeatedly; each resulting subspace receives a number, called the Z-order value of the subspace, which is used as the Key value of the data objects in that subspace.

FIG. 2 shows an example of task partitioning and mapping: each grid cell, together with the data assigned to it, is allocated by hash mapping to one of the n execution units of the Spark platform, and the execution units run in parallel.
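The encoding and task mapping just described may be sketched as follows. The function names z_order and executor_for are hypothetical. Which coordinate supplies the low-order bit is an assumption, since the convention of FIG. 1 cannot be read from the text, and the modulo operation is only one simple choice of hash.

```python
def z_order(col, row, bits):
    """Interleave the bits of a cell's column and row index into a Z-order code."""
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)        # column bit -> even position (assumed)
        code |= ((row >> b) & 1) << (2 * b + 1)    # row bit -> odd position (assumed)
    return code


def executor_for(cell_code, num_executors):
    """Map a grid cell to an execution unit with a simple modulo hash."""
    return cell_code % num_executors


# Codes of a 2 x 2 grid (1 bit per axis) and of a 4 x 4 grid (2 bits per axis).
codes_1bit = [z_order(c, r, 1) for r in range(2) for c in range(2)]
codes_2bit = sorted(z_order(c, r, 2) for r in range(4) for c in range(4))
assignment = {code: executor_for(code, 4) for code in codes_2bit}
```

With 16 cells and 4 execution units, the modulo hash assigns exactly four cells to each unit; real data is rarely spread so evenly across cells, so the resulting workload per unit can still differ.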

2) Spatial object projection: each spatial data object is mapped to the corresponding grid cells according to its position. Let C = (c₁, c₂, …, c_n) denote a partition of the data space, where each c_i denotes a grid cell (c_i is also the Z-order code of that cell), and let R be a set of spatial objects to be joined. If the MBR of a spatial object u ∈ R overlaps grid cell c_i, then u is mapped to c_i and the corresponding key-value pair (c_i, u) is generated; if a spatial object overlaps multiple grid cells, multiple key-value pairs are formed accordingly. The projection operation can be expressed as: Project(u, C) = {(c_i, u) | c_i ∈ C and the MBR of u overlaps c_i}.

3) Data replication: multi-path join query processing requires several join operations among several data sets. Data replication takes the intermediate join result set T generated by the latest spatial join operation on the current grid cell; for a tuple t ∈ T, let t.s be the spatial object to be joined in the subsequent spatial join. If t.s overlaps a grid cell c_i, the tuple t is copied to grid cell c_i and the corresponding key-value pair (c_i, t) is generated. The replication operation can be expressed as the set {(c_i, t) | c_i ∈ C and t.s overlaps c_i}.

FIG. 3 is an example of data projection and replication, from which it can be seen that a spatial object is projected onto every grid cell that it overlaps. Object r₁ is projected onto grid cells 6 and 12, r₂ onto grid cells 9 and 12, and r₃ onto grid cells 9 and 11, i.e. Project(r₁, C) = {(6, r₁), (12, r₁)}, Project(r₂, C) = {(9, r₂), (12, r₂)} and Project(r₃, C) = {(9, r₃), (11, r₃)}. When r₁, r₂ and r₃ are joined in sequence, r₂ also overlaps grid cell 9, so the intermediate join result (r₁, r₂) of r₁ and r₂ in grid cell 12 must be copied to grid cell 9, forming the key-value pair (9, (r₁, r₂)). This allows the subsequent join with spatial object r₃ in grid cell 9 to be carried out and avoids losing the join result.

4) Filtering strategy: while the join operations are executed in parallel, a boundary filtering strategy is applied so that tuples that cannot produce a join result are removed and only tuples that may produce a result are replicated, which greatly reduces the cost of storage and subsequent computation. It comprises the following two filtering strategies:

A: Boundary filtering: first compute the bounding MBR of the relevant join objects in the preceding join results, then use this MBR to filter out the spatial objects in the data set to be joined that do not intersect it, so as to reduce the computation cost of the subsequent join.

FIG. 4 is an example of boundary filtering in which the three data sets R, S and T are joined in sequence in a three-way join. The figure shows the spatial objects projected into grid cell 3. The results of the preceding join are (r₁, s₁), (r₁, s₂) and (r₁, s₃), so the objects of set S in the preceding join result set are s₁, s₂ and s₃, whose bounding MBR is shown by the dotted line in the figure. When joining with the objects in data set T, the spatial objects t₁, t₄ and t₅, which are projected into grid cell 3 but do not intersect this MBR, can be filtered out directly, so that they are never joined against s₁, s₂ and s₃, which greatly reduces the cost of subsequent computation.

B: Replication-phase filtering: in multi-path join query processing, the intermediate results of the earlier joins must be replicated to the other grid cells that may produce join results, so that subsequent join operations can be performed there and no join result is lost. When intermediate join results are replicated, only those involving join objects that span grid cells are replicated, which avoids redundant replication.

5) Duplicate avoidance strategy: when two spatial objects that each span multiple grid cells are joined, only the grid cell containing the lower-left corner of the new object formed by the overlap of the two spatial objects is responsible for outputting the result. Because exactly one grid cell outputs each result, duplicate output is avoided and processing cost is reduced.

FIG. 5 shows an example of duplicate avoidance. Object s₁ in set S is projected onto the grid cells 2, 3, 6, 8, 9 and 12 that it overlaps; object r₁ in set R is projected onto grid cells 3, 6, 9 and 12; and object r₂ is projected onto the four grid cells 8, 9, 10 and 11. Without duplicate avoidance, grid cells 3, 6, 9 and 12 would each output the same join result (r₁, s₁) during the join, and grid cells 8 and 9 would each output the same join result (r₂, s₁), so the duplication is apparent. Under the proposed duplicate avoidance strategy, as shown in FIG. 5, the grid cell containing the lower-left corner of the object formed by the overlapping part of the two objects (indicated by points P and Q in the figure) is responsible for outputting the result: grid cell 3 outputs the join result (r₁, s₁) of r₁ and s₁, and grid cell 8 outputs the join result (r₂, s₁) of r₂ and s₁. The strategy avoids repeated processing and repeated output of results and reduces subsequent processing cost.

A chained multi-path spatial join query Q_m = Overlap(R₁, R₂, R₃, …, R_m) can, by definition, be expressed as Q_m = Overlap(…Overlap(Overlap(R₁, R₂), R₃), …, R_m). The processing flow of the Spark-based chained multi-path spatial join query processing method provided by the invention is shown in FIG. 6 and mainly comprises the following steps:

A: project the multi-path join data sets R₁, R₂, R₃, …, R_m according to the grid partitioning and coding method, using the cell code as the Key and the identifier of each spatial object together with attribute information such as its MBR (minimum bounding rectangle) as the Value, to form a series of key-value pairs, and put the projection results of the data sets R₁, R₂, R₃, …, R_m into the resilient distributed datasets RDD₁, RDD₂, RDD₃, …, RDD_m respectively;

B: compute Overlap(R₁, R₂): perform a Cogroup operation on RDD₁ and RDD₂ to gather their data together by Key value and obtain RDD_new; filter RDD_new using the boundary filtering strategy to remove data objects that cannot produce a result; then perform the actual spatial join operation and execute the duplicate avoidance strategy to form the intermediate join result; finally perform the data replication operation on the intermediate join result to form the intermediate result data set RDDresult_new;

C: using the same computation as in step B, compute the join between RDDresult_new and RDD₃ to obtain the latest intermediate join result RDDresult_new of R₁, R₂ and R₃. In the same way, compute in turn the joins between RDDresult_new and RDD₄, RDD₅, …, RDD_(m-1), finally obtaining the intermediate join result data set RDDresult_new of the data sets R₁, R₂, R₃, …, R_(m-1);

D: perform a Cogroup operation on RDDresult_new and RDD_m to generate a new RDD_new, apply boundary filtering and the join operation to it, and output the result directly to form the final spatial join data set RDDresult_new of the data sets R₁, R₂, R₃, …, R_m. Because this is the last spatial join operation, no replication is required.

Example 2:

A Spark-based chained multi-path spatial join query processing method comprises the following steps. Step 1: divide the whole data space into a plurality of grid cells of equal size and encode each grid cell using a Z-order space-filling curve. Step 2: project each spatial object in the m-way (m > 2) spatial join data sets R₁, R₂, …, R_m onto the corresponding grid cells according to its position in the data space, and store the projection results in the resilient distributed datasets RDD₁, RDD₂, …, RDD_m; set the loop variable i to 2 and the intermediate result data set RDDresult_new = RDD₁. Step 3: if the condition i < m is satisfied, perform the spatial join operation Overlap(RDDresult_new, RDD_i) on the two data sets RDDresult_new and RDD_i; during the computation, data aggregation, boundary filtering, spatial join computation, duplicate avoidance, data replication and other operations are carried out in sequence, finally forming a new intermediate result data set RDDresult_new = Overlap(RDDresult_new, RDD_i). Step 4: set i = i + 1 and repeat step 3 until the condition i < m is no longer satisfied. Step 5: perform the last spatial join operation Overlap(RDDresult_new, RDD_m); during the computation, data aggregation, boundary filtering and spatial join computation are carried out in sequence, and the result is output directly to form the final spatial join result set, which is stored in the HDFS file system. The method markedly improves processing efficiency, reduces computation cost, and has good adaptability and scalability.

Example 3:

Step 3: if the condition i < m is satisfied, perform the spatial join operation Overlap(RDDresult_new, RDD_i) on the two data sets RDDresult_new and RDD_i. During the computation, data aggregation, boundary filtering, spatial join computation, duplicate avoidance, data replication and other operations are carried out in sequence, finally forming the intermediate result data set RDDresult_new, that is, RDDresult_new = Overlap(RDDresult_new, RDD_i);

Step 4: set i = i + 1 and repeat step 3 until the condition i < m is no longer satisfied;

Further, the spatial object projection is as follows: each spatial data object is mapped to the corresponding grid cells according to its position. Let C = (c₁, c₂, …, c_n) denote a partition of the data space, where each c_i denotes a grid cell (c_i is also the Z-order code of that cell), and let R be a set of spatial objects to be joined. If the MBR of a spatial object u ∈ R overlaps grid cell c_i, then u is mapped to c_i and the corresponding key-value pair (c_i, u) is generated. If a spatial object overlaps multiple grid cells, multiple key-value pairs are generated accordingly.

Further, step 3 specifically includes the following steps:

Further, step 5 comprises the following steps:

Step 5-3: execute the duplicate avoidance strategy to form the final join result data set RDDresult_new, which consists of tuple pairs, and save it to the HDFS file system.

Further, the filtering strategy comprises two parts:

Although specific embodiments of the present invention are described above, those skilled in the art should understand that they are merely examples: the invention is a Spark-based chained multi-path spatial join query processing method, and the examples serve only to illustrate its core ideas, such as the filtering strategies, the duplicate avoidance strategy and the join processing procedure. Larger-scale experiments can be performed subsequently, the related methods can be further refined to improve the effect of data projection, replication and filtering, and combination with indexing techniques can be considered to further improve performance, all without departing from the principle and essence of the invention. The scope of the invention is limited only by the appended claims.

Claims

1. A Spark-based chained multi-path spatial join query processing method, characterized in that the method comprises the following steps:

Step 2: project each spatial object in the m-way spatial join data sets R₁, R₂, …, R_m onto the corresponding grid cells according to its position in the data space, forming a series of key-value pairs, where m > 2; store the projection results in the resilient distributed datasets RDD₁, RDD₂, …, RDD_m respectively; set the loop variable i to 2 and the intermediate result data set RDDresult_new = RDD₁;

Step 3: if the condition i < m is satisfied, perform the spatial join operation Overlap(RDDresult_new, RDD_i) on the two data sets RDDresult_new and RDD_i; during the computation, data aggregation, boundary filtering, spatial join computation, duplicate avoidance and data replication are carried out in sequence, finally forming the intermediate result data set RDDresult_new, that is, RDDresult_new = Overlap(RDDresult_new, RDD_i);

Step 4: set i = i + 1 and repeat step 3 until the condition i < m is no longer satisfied;

2. The Spark-based chained multi-path spatial join query processing method according to claim 1, wherein the grid partitioning and coding method is as follows: the whole data space is divided into n grid cells of equal size using a grid-based partitioning method, the grid cells are encoded using a Z-order space-filling curve, each spatial data object is projected onto grid cells according to its position, and all grid cells are mapped to a plurality of execution units by hashing, so that the whole processing task is divided into a plurality of parallel processing tasks.

3. The Spark-based chained multi-path spatial join query processing method according to claim 1, wherein the specific steps of projecting each spatial object in the m-way spatial join data sets R₁, R₂, …, R_m onto the corresponding grid cells according to its position in the data space are as follows: each spatial data object is mapped to the corresponding grid cells according to its position; let C = (c₁, c₂, …, c_n) denote a partition of the data space, where each c_i denotes a grid cell (c_i is also the Z-order code of that cell), and let R be a set of spatial objects to be joined; if the MBR of a spatial object u ∈ R overlaps grid cell c_i, then u is mapped to c_i and the corresponding key-value pair (c_i, u) is generated; if a spatial object overlaps multiple grid cells, multiple key-value pairs are generated accordingly.

4. The Spark-based chained multi-path spatial join query processing method according to claim 1, wherein: the step 3 specifically comprises the following steps:

5. The Spark-based chained multi-path spatial join query processing method according to claim 1, wherein: step 5 comprises the following steps:

6. The Spark-based chained multi-path spatial join query processing method according to claim 1, wherein the data replication operation is as follows: for any tuple t in the intermediate join result set T generated by the latest spatial join operation on the current grid cell, let t.s be the spatial object involved in the next spatial join operation; if t.s overlaps a grid cell c_i, the tuple t is copied to grid cell c_i and the corresponding key-value pair (c_i, t) is generated.

7. The Spark-based chained multi-way spatial join query processing method according to claim 4, wherein: the filtering strategy is as follows: in the process of executing the connection operation in parallel, a corresponding filtering strategy is adopted, tuples which can not generate the connection result are removed, and only tuples which can generate the connection result are copied.

8. The Spark-based chained multi-way spatial join query processing method according to claim 7, wherein: the filtering strategy comprises two parts:

9. The Spark-based chained multi-path spatial join query processing method according to claim 4, wherein the duplicate avoidance strategy is as follows: when two spatial objects that each span multiple grid cells are joined, only the grid cell containing the lower-left corner of the new object formed by the overlap of the two objects is responsible for outputting the result.