CN106909639A

CN106909639A - A Spark-based chained multi-way spatial join query processing algorithm

Info

Publication number: CN106909639A
Application number: CN201710083816.9A
Authority: CN
Inventors: 乔百友; 王秋杰; 韩东红; 王国仁
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2017-02-16
Filing date: 2017-02-16
Publication date: 2017-06-30
Anticipated expiration: 2037-02-16
Also published as: CN106909639B

Abstract

The invention discloses a Spark-based chained multi-way spatial join query processing algorithm comprising the following steps. Step 1: divide the entire data space into many grid cells of equal size and encode each grid cell using the Z-order space-filling curve. Step 2: project each spatial object in the m-way spatial join datasets onto the corresponding grid cells according to its position in the data space. Step 3: if the condition i<m holds, perform the spatial join operation Overlap on the two datasets RDDresult_new and RDD_i. Step 4: set i=i+1 and repeat step 3 until the condition i<m no longer holds. Step 5: perform the final spatial join operation Overlap. The algorithm significantly improves processing efficiency and reduces computation cost.

Description

A Chained Multi-way Spatial Join Query Processing Algorithm Based on Spark

Technical Field

The invention relates to the technical field of spatial data query processing, and in particular to a Spark-based chained multi-way spatial join query processing algorithm.

Background

The spatial join query is an important type of spatial data query that is widespread in spatial data management, and spatial join query processing has long been a research focus in spatial database management. The multi-way spatial join query is a commonly used spatial join operation: it retrieves, from multiple spatial datasets, all spatial objects that satisfy a given spatial predicate (such as intersection or containment). It is one of the most time-consuming spatial operations, and its complexity and importance make it one of the key factors that determine the overall performance of a spatial data management system. Improving the efficiency of multi-way spatial join query processing has therefore remained an active research problem. In recent years in particular, the rapid development and wide adoption of Internet of Things technology, earth observation technology and location-based services have caused the volume of spatial data to grow sharply, so that spatial data has become an important class of big data. Processing multi-way spatial join queries efficiently over such spatial big data is now a major challenge in spatial data management. Traditional processing techniques based on spatial databases scale poorly and therefore struggle to meet the requirements of fast query processing over spatial big data, whereas Spark, a distributed parallel processing platform for very large-scale data, has attracted wide attention and is a key technology in big data processing. Studying efficient multi-way spatial join query processing methods for spatial big data on top of the large-scale data processing capability of the Spark platform has therefore become an important way to address this challenge.

Existing methods for multi-way spatial join query processing have the following main problems: (1) traditional multi-way join processing methods based on spatial databases are mostly centralized, scale poorly, and struggle to meet the requirements of fast query processing over spatial big data; (2) some popular existing algorithms, such as the dynamic programming algorithm and the hybrid join algorithm, mostly build their indexes in a centralized manner and are relatively inefficient for join queries over massive data; (3) existing distributed processing methods are mainly based on the Hadoop platform and focus on optimizing general multi-way join query processing, and they suffer from excessive data replication and weak filtering capability, which reduces query processing efficiency; (4) the most recent distributed multi-way spatial join algorithms are the two MapReduce-based algorithms proposed by Gupta et al., Controlled-Replicate and ε-Controlled-Replicate. Controlled-Replicate partitions the spatial objects of the join datasets and replicates them to all grid cells in the fourth quadrant before performing the multi-way join. This approach clearly replicates a large number of spatial objects and hurts join processing efficiency. The authors therefore proposed the improved algorithm ε-Controlled-Replicate, which reduces data replication to some extent and improves query processing efficiency, but still replicates too much data. On the Spark platform, excessive replication means that too much data is loaded into memory at once, so that Spark's in-memory computing advantage cannot be fully exploited and query efficiency suffers.

Once these problems have been studied in depth and corresponding solutions proposed, the solutions can be applied to join query processing over spatial big data and related application areas. To this end, the invention proposes a spatial multi-way join query processing algorithm for the Spark platform. The algorithm targets chained multi-way spatial join queries. It uses a grid-based method to partition the data space, combined with Z-order encoding, to partition and encode the data, and it projects and replicates data according to the spatial location of each data object. During the join, the algorithm uses boundary filtering to discard data that cannot contribute to the join, which reduces redundant computation in subsequent joins as well as redundant projection and replication of join objects. It also uses a duplicate avoidance strategy to reduce the output of duplicate results, which lowers the overall cost of subsequent join computation and improves the efficiency of multi-way join query processing.

Summary of the Invention

In view of the defects of the prior art, the object of the invention is to provide a Spark-based chained multi-way spatial join query processing algorithm. The algorithm focuses on chained multi-way spatial join query processing, with emphasis on reducing spatial data replication and computation in the filtering stage, thereby reducing the cost of subsequent join computation and improving query processing efficiency. The algorithm also has good adaptability and scalability.

To achieve the above object, the technical solution of the invention is as follows:

A Spark-based chained multi-way spatial join query processing algorithm, comprising the following steps:

Step 1: Using a grid partitioning method, divide the entire data space into many grid cells of equal size, and encode each grid cell using the Z-order space-filling curve;

Step 2: Project each spatial object in the m-way (m>2) spatial join datasets R₁, R₂, …, R_m onto the corresponding grid cells according to its position in the data space, forming a series of key-value pairs, and store the projection results in the resilient distributed datasets RDD₁, RDD₂, …, RDD_m respectively. Set the loop variable i=2 and the intermediate result dataset RDDresult_new=RDD₁;

Step 3: If the condition i<m holds, perform the spatial join operation Overlap(RDDresult_new, RDD_i) on the two datasets RDDresult_new and RDD_i. During the computation, data aggregation, boundary filtering, spatial join computation, duplicate avoidance and data replication are performed in sequence, producing the intermediate result dataset RDDresult_new, that is, RDDresult_new=Overlap(RDDresult_new, RDD_i);

Step 4: Set i=i+1 and repeat step 3 until the condition i<m no longer holds;

Step 5: Perform the final spatial join operation Overlap(RDDresult_new, RDD_m). During the computation, data aggregation, boundary filtering and spatial join computation are performed in sequence, and the results are output directly to form the final spatial join result set, which is saved to the HDFS file system.
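The loop control in steps 2 to 5 can be traced with a minimal Python sketch. It only shows which joins run and which of them replicate their output; the function name `join_schedule` and the pair layout are illustrative assumptions, not part of the patent.

```python
def join_schedule(m):
    """List the joins of steps 3-5 for m datasets as (i, replicate) pairs.

    Each pair stands for Overlap(RDDresult_new, RDD_i); replicate is True
    when the round's output is copied to other cells for a later join.
    """
    if m <= 2:
        raise ValueError("the method is stated for m > 2")
    schedule = []
    i = 2                          # step 2: the loop variable starts at 2
    while i < m:                   # step 3: intermediate joins
        schedule.append((i, True))
        i += 1                     # step 4
    schedule.append((m, False))    # step 5: final join, output directly
    return schedule


print(join_schedule(4))  # [(2, True), (3, True), (4, False)]
```

For m datasets this gives m-1 joins in total, of which only the last one skips replication.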

Further, the data partitioning and encoding method is as follows: a grid-based partitioning method divides the entire data space into n grid cells of equal size, and the Z-order space-filling curve is used to encode the grid cells. Spatial data objects are projected onto the grid cells according to their positions, and all grid cells are mapped to multiple Executor execution units by hashing, so that the whole processing task is divided into multiple parallel processing tasks.

Further, the spatial object projection is as follows: each spatial data object is mapped to the grid cells corresponding to its location. Let C=(c₁, c₂, …, c_n) denote a partition of the data space, where c_i denotes a grid cell, and let R be a set of spatial objects to be joined. If the minimum bounding rectangle (MBR) of a spatial object u∈R overlaps grid cell c_i, where c_i is the Z-order code of the grid cell, then u is mapped to grid cell c_i and the key-value pair (c_i, u) is generated. If a spatial object overlaps several grid cells, several key-value pairs are generated accordingly.

Further, step 3 comprises the following steps:

Step 3-1: Compute Overlap(RDDresult_new, RDD_i): perform a Cogroup operation on RDDresult_new and RDD_i by Key, that is, gather the data in RDDresult_new and RDD_i that share a Key to obtain RDD_new;

Step 3-2: Filter RDD_new using the filtering strategy to remove data pairs that cannot produce a result, and then perform the actual spatial join computation;

Step 3-3: Apply the duplicate avoidance strategy to form the intermediate join result, and perform the data replication operation on it, finally forming the new intermediate join result dataset RDDresult_new.
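Step 3-1 relies on a cogroup, which gathers the values of both datasets that share a key (here, a grid cell code). A minimal pure-Python stand-in is sketched below, assuming in-memory lists of key-value pairs rather than real RDDs. The `cogroup` function here is a hypothetical helper that only imitates the grouping behaviour, and the sample keys are borrowed from the Figure 3 example in the detailed description.

```python
from collections import defaultdict


def cogroup(left, right):
    """Group two lists of (key, value) pairs by key.

    Returns {key: (left_values, right_values)}; a key present on only one
    side gets an empty list for the other side.
    """
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return dict(groups)


rdd_result = [(6, "r1"), (12, "r1")]   # intermediate result keyed by cell
rdd_i = [(9, "r2"), (12, "r2")]        # next dataset keyed by cell
grouped = cogroup(rdd_result, rdd_i)
# Only cells that hold data from both sides can produce join results.
joinable = {c: g for c, g in grouped.items() if g[0] and g[1]}
print(joinable)  # {12: (['r1'], ['r2'])}
```

Grouping by cell code is what lets each cell be joined independently and in parallel.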

Further, step 5 comprises the following steps:

Step 5-1: Compute Overlap(RDDresult_new, RDD_m): perform a Cogroup operation on RDDresult_new and RDD_m by Key, that is, gather the data in RDDresult_new and RDD_m that share a Key to obtain RDD_new;

Step 5-2: Filter RDD_new using the filtering strategy to remove data pairs that cannot produce a result, and then perform the actual spatial join computation;

Step 5-3: Apply the duplicate avoidance strategy to form the final join result dataset RDDresult_new, which consists of tuples, and save it to the HDFS file system.

Further, the data replication operation is as follows: for any tuple t in the intermediate join result set T produced by the most recent spatial join on the current grid cell, let t.s be the spatial object involved in the next spatial join; if t.s overlaps a grid cell c_i, then the tuple t is copied to grid cell c_i and the key-value pair (c_i, t) is generated.

Further, the filtering strategy is as follows: while the join operations are executed in parallel, the filtering strategy removes tuples that cannot produce a join result, and only tuples that may produce a join result are replicated.

Further, the filtering strategy has two parts:

Boundary filtering: before a join operation is performed, the bounding MBR of the spatial objects in the completed intermediate join result that are involved in the subsequent spatial join is computed, and this MBR is used to filter out the spatial objects in the next dataset to be joined that do not intersect it, thereby reducing the cost of the subsequent join computation;

Replication-stage filtering: during multi-way join query processing, the intermediate results of the preceding joins must be replicated, and they are copied only to those other grid cells that may produce join results, so that no join result is lost. When intermediate join results are replicated, only intermediate results that contain join objects spanning grid cells are copied.

Further, the duplicate avoidance strategy is as follows: when two spatial objects that each span multiple grid cells are joined, only the grid cell containing the lower-left corner of the new object formed by their overlap is responsible for outputting the result.

Compared with the prior art, the invention has the following beneficial effects. The invention is a Spark-based chained multi-way spatial join query processing algorithm that partitions the data space with a grid partitioning method and projects and replicates data according to the location of each spatial object. During computation it uses boundary filtering to discard join objects that cannot contribute, and it reduces data replication by narrowing the replication range. It significantly improves processing efficiency and reduces computation cost, and it has good adaptability and scalability.

Description of the Drawings

Fig. 1 is an example of Z-order curve encoding in a specific embodiment of the invention;

Fig. 2 is a schematic diagram of data partitioning and task mapping in a specific embodiment of the invention;

Fig. 3 is an example of the data projection and replication operations in a specific embodiment of the invention;

Fig. 4 is an example of the boundary filtering described in a specific embodiment of the invention;

Fig. 5 is an example of the duplicate avoidance described in a specific embodiment of the invention;

Fig. 6 is a schematic diagram of the processing flow of the Spark-based chained multi-way spatial join query processing algorithm in a specific embodiment of the invention.

Detailed Description

To make the object, technical solution and advantages of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings.

The invention proposes a Spark-based chained multi-way spatial join query processing algorithm. The algorithm focuses on chained multi-way spatial join query processing, with emphasis on reducing spatial data replication and computation in the filtering stage, thereby reducing the cost of subsequent join computation and improving query processing efficiency. The algorithm also has good adaptability and scalability.

Example 1:

A Spark-based chained multi-way spatial join query processing algorithm, whose key techniques comprise the following parts:

1) Data space partitioning and encoding: a grid partitioning method divides the entire space into n grid cells of equal size, and the Z-order space-filling curve is used to encode the grid cells. Spatial data objects are projected onto the grid cells according to their positions, and all grid cells are mapped to multiple Executor execution units by hashing, so that the whole processing task is divided into multiple parallel computing tasks and the execution performance of the algorithm is improved through parallelism.

As shown in Figure 1, a space-filling curve is used to encode the grid cells in order to preserve the spatial relationships between spatial objects. Figure 1 shows the Z-order curve for bit counts of 1 and 2. The Z-order curve is a space-filling curve. The Z-order (z-ordering) technique uses bits to represent the attribute information of spatial objects and then decomposes the data space iteratively. Each resulting subspace receives a number, called the z-order value of that subspace, which serves as the Key of the data objects in that subspace.
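A Z-order code can be computed by interleaving the bits of a cell's column and row indices. The sketch below is one common way to do this, not necessarily the exact scheme of Figure 1: placing column bits in the even positions, and counting rows from the bottom, are assumptions, chosen because they reproduce the cell numbers quoted in the Figure 3 and Figure 5 examples. The name `z_encode` is illustrative.

```python
def z_encode(col, row, bits):
    """Interleave the bits of (col, row) into a Z-order cell code."""
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)       # column bit -> even position
        code |= ((row >> b) & 1) << (2 * b + 1)   # row bit -> odd position
    return code


# Codes of a 4x4 grid (bits = 2), printed with row 3 at the top.
for row in range(3, -1, -1):
    print([z_encode(col, row, 2) for col in range(4)])
# [10, 11, 14, 15]
# [8, 9, 12, 13]
# [2, 3, 6, 7]
# [0, 1, 4, 5]
```

Cells that are close in the grid tend to receive close codes, which is the locality property the description relies on.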

Figure 2 shows an example of task partitioning and mapping. As the figure shows, the grid cells produced by the partitioning, together with the data assigned to them, are distributed by hash mapping to n Executor execution units on the Spark platform, which process them in parallel.

2) Spatial object projection: each spatial data object is mapped to the grid cells corresponding to its location. Let C=(c₁, c₂, …, c_n) denote a partition of the data space, where c_i denotes a grid cell, and let R be a set of spatial objects to be joined. If the MBR of a spatial object u∈R overlaps grid cell c_i (c_i being the Z-order code of the grid cell), then u is mapped to grid cell c_i and the key-value pair (c_i, u) is generated. If a spatial object overlaps several grid cells, several key-value pairs are formed accordingly. The projection of u onto C, written Project(u, C), is the set of all such key-value pairs.
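The projection can be sketched in Python as follows. This is a minimal illustration under stated assumptions: MBRs are `(xmin, ymin, xmax, ymax)` tuples, the grid has square cells of side `cell_size` with its origin at the lower left, and the coordinates of r1 are hypothetical values chosen so that the object falls on cells 6 and 12 as in the Figure 3 example. The helper names are not from the patent.

```python
def z_encode(col, row, bits):
    """Interleave column and row bits (column bits in even positions)."""
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)
        code |= ((row >> b) & 1) << (2 * b + 1)
    return code


def project(obj_id, mbr, cell_size, bits):
    """Return sorted (cell_code, obj_id) pairs for every cell the MBR overlaps."""
    xmin, ymin, xmax, ymax = mbr
    last = (1 << bits) - 1                       # highest column/row index

    def index(v):
        # Cell index of coordinate v, clamped to the grid.
        return max(0, min(last, int(v // cell_size)))

    cols = range(index(xmin), index(xmax) + 1)
    rows = range(index(ymin), index(ymax) + 1)
    return sorted((z_encode(c, r, bits), obj_id) for c in cols for r in rows)


# Hypothetical MBR for r1 on a 4x4 grid of unit cells.
print(project("r1", (2.2, 1.3, 2.8, 2.6), 1.0, 2))  # [(6, 'r1'), (12, 'r1')]
```

An object that lies inside a single cell yields exactly one pair, so only objects crossing cell boundaries are duplicated.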

3) Data replication: multi-way join query processing requires several join operations across multiple datasets. Data replication copies the intermediate result of the most recent spatial join on the current grid cell to the other relevant grid cells so that the subsequent join can be performed. As with projection, the result is a series of key-value pairs. If t∈T is a tuple in the intermediate join result and t.s is the object that takes part in the next spatial join, then for every grid cell c_i that t.s overlaps, the tuple t is copied to grid cell c_i and the key-value pair (c_i, t) is generated. The replication operation therefore yields the set of all such pairs (c_i, t).
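Replication can be sketched the same way as projection, except that the value carried is the whole intermediate tuple and the cells are determined by t.s alone. The sketch below assumes the same hypothetical grid and MBR layout as before; `replicate` and `cells_of` are illustrative names, and the coordinates of r2 are invented so that it spans cells 9 and 12 as in the Figure 3 example.

```python
def z_encode(col, row, bits):
    """Interleave column and row bits (column bits in even positions)."""
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)
        code |= ((row >> b) & 1) << (2 * b + 1)
    return code


def cells_of(mbr, cell_size, bits):
    """Sorted Z-order codes of all grid cells overlapped by the MBR."""
    last = (1 << bits) - 1

    def index(v):
        return max(0, min(last, int(v // cell_size)))

    cols = range(index(mbr[0]), index(mbr[2]) + 1)
    rows = range(index(mbr[1]), index(mbr[3]) + 1)
    return sorted(z_encode(c, r, bits) for c in cols for r in rows)


def replicate(t, next_mbr, cell_size, bits):
    """Key the intermediate tuple t by every cell that t.s (next_mbr) overlaps."""
    return [(cell, t) for cell in cells_of(next_mbr, cell_size, bits)]


# The intermediate tuple (r1, r2) is keyed by the cells that r2 overlaps.
r2_mbr = (1.5, 2.2, 2.5, 2.7)   # hypothetical coordinates
print(replicate(("r1", "r2"), r2_mbr, 1.0, 2))
# [(9, ('r1', 'r2')), (12, ('r1', 'r2'))]
```

Only the MBR of the last joined object matters here, because in a chained join the next dataset is compared against that object only.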

Figure 3 is an example of the data projection and replication operations. It shows that spatial objects are projected onto the grid cells they overlap. Object r₁ is projected onto grid cells 6 and 12, r₂ onto cells 9 and 12, and r₃ onto cells 9 and 11, that is, Project(r₁,C)={(6,r₁),(12,r₁)}, Project(r₂,C)={(9,r₂),(12,r₂)}, and Project(r₃,C)={(9,r₃),(11,r₃)}. When r₁, r₂ and r₃ are joined in sequence in a multi-way join, r₂ overlaps grid cell 9, so the intermediate join result (r₁,r₂) of r₁ and r₂ in grid cell 12 must be copied to grid cell 9, forming the key-value pair (9,(r₁,r₂)). This makes the subsequent join with the spatial object r₃ in grid cell 9 possible and prevents the join result from being lost.

4) Filtering strategy: while the join operations are executed in parallel, a boundary filtering strategy removes tuples that cannot produce a join result, and only tuples that may produce a result are replicated, which greatly reduces the cost of storage and subsequent computation. Two filtering strategies are used:

A: Boundary filtering: first compute the bounding MBR of the relevant join objects in the results of the joins completed so far, then use this MBR to filter out the spatial objects in the next dataset to be joined that do not intersect it, thereby reducing the cost of the subsequent join computation.

Figure 4 is an example of boundary filtering, in which three datasets R, S and T are joined in sequence in a three-way join. The spatial objects projected into grid cell 3 are as shown in the figure. The join of R and S yields (r₁,s₁), (r₁,s₂) and (r₁,s₃), so the objects of S that appear in the previous join result are s₁, s₂ and s₃, and their bounding MBR is shown by the dashed line in the figure. When joining with the objects of dataset T, the spatial objects t₁, t₄ and t₅ that are projected into grid cell 3 but do not intersect this MBR can be filtered out directly. This avoids joining each of these objects with s₁, s₂ and s₃ and substantially reduces the cost of the subsequent computation.
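Boundary filtering amounts to one bounding-box union followed by one intersection test per candidate. The Python sketch below illustrates this under assumptions: MBRs are `(xmin, ymin, xmax, ymax)` tuples, and all coordinates are hypothetical, picked so that t1, t4 and t5 are discarded as in the Figure 4 narrative. The function names are illustrative.

```python
def union_mbr(mbrs):
    """Smallest rectangle enclosing all the given MBRs."""
    xmins, ymins, xmaxs, ymaxs = zip(*mbrs)
    return (min(xmins), min(ymins), max(xmaxs), max(ymaxs))


def intersects(a, b):
    """True if rectangles a and b share at least one point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def boundary_filter(joined_mbrs, candidates):
    """Keep only the candidates whose MBR intersects the bounding MBR."""
    bound = union_mbr(joined_mbrs)
    return {oid: mbr for oid, mbr in candidates.items() if intersects(mbr, bound)}


# Hypothetical MBRs of s1, s2, s3 (the S objects in the R-S join result).
s_objects = [(1, 1, 2, 2), (2, 3, 3, 4), (3, 1.5, 4, 2.5)]
# Hypothetical MBRs of the T objects projected into the same cell.
t_objects = {
    "t1": (0, 5, 0.5, 6),
    "t2": (1.5, 1.5, 2.5, 2.5),
    "t3": (3.5, 3.5, 4.5, 4.5),
    "t4": (5, 0, 6, 0.5),
    "t5": (0, 0, 0.5, 0.5),
}
print(sorted(boundary_filter(s_objects, t_objects)))  # ['t2', 't3']
```

The filter is conservative: t3 survives because it touches the bounding MBR even though it intersects none of the individual S objects, so the exact join test is still needed afterwards.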

B: Replication-stage filtering: during multi-way join query processing, the intermediate results of the preceding joins must be replicated to the other grid cells that may produce join results, so that the subsequent join can be performed there and no join result is lost. When intermediate join results are replicated, only intermediate results that involve join objects spanning grid cells are copied, which avoids redundant replication.
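One possible reading of replication-stage filtering is sketched below in Python: an intermediate tuple is copied only to cells other than the one it is already in, so a tuple whose next-join object lies entirely inside the current cell produces no copies at all. This interpretation, the helper names, and the sample coordinates are assumptions; the tuple is taken to remain available in its current cell.

```python
def z_encode(col, row, bits):
    """Interleave column and row bits (column bits in even positions)."""
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)
        code |= ((row >> b) & 1) << (2 * b + 1)
    return code


def cells_of(mbr, cell_size, bits):
    """Sorted Z-order codes of all grid cells overlapped by the MBR."""
    last = (1 << bits) - 1

    def index(v):
        return max(0, min(last, int(v // cell_size)))

    cols = range(index(mbr[0]), index(mbr[2]) + 1)
    rows = range(index(mbr[1]), index(mbr[3]) + 1)
    return sorted(z_encode(c, r, bits) for c in cols for r in rows)


def copies_needed(t, next_mbr, current_cell, cell_size, bits):
    """Copies of t for the other cells that its next-join object overlaps."""
    return [(c, t) for c in cells_of(next_mbr, cell_size, bits) if c != current_cell]


# r2 spans cells 9 and 12, so the tuple held in cell 12 is copied to cell 9.
print(copies_needed(("r1", "r2"), (1.5, 2.2, 2.5, 2.7), 12, 1.0, 2))
# [(9, ('r1', 'r2'))]
# An object fully inside cell 12 needs no copy.
print(copies_needed(("r1", "x"), (2.1, 2.1, 2.9, 2.9), 12, 1.0, 2))  # []
```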

5) Duplicate avoidance strategy: when two spatial objects that each span multiple grid cells are joined, only the grid cell containing the lower-left corner of the new object formed by their overlap is responsible for outputting the result. In other words, exactly one grid cell outputs each result, which avoids duplicate output and reduces processing cost.

Figure 5 shows an example of duplicate avoidance. Object s₁ of set S is projected onto the grid cells it overlaps, namely 2, 3, 6, 8, 9 and 12. Object r₁ of set R is projected onto grid cells 3, 6, 9 and 12, and object r₂ is projected onto the four grid cells 8, 9, 10 and 11. Without duplicate avoidance, grid cells 3, 6, 9 and 12 would all output the same join result (r₁,s₁) during join processing, and grid cells 8 and 9 would both output the same join result (r₂,s₁), which is clearly redundant. Under the proposed duplicate avoidance strategy, as shown in Figure 5, the grid cell containing the lower-left corner of the object formed by the overlapping region (points P and Q in the figure) is responsible for outputting the result. Grid cell 3 thus handles and outputs the join result (r₁,s₁) of r₁ and s₁, and grid cell 8 handles and outputs the join result (r₂,s₁) of r₂ and s₁. This strategy avoids repeated processing and duplicate output and lowers the cost of subsequent processing.
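The owner cell of a join pair can be computed directly from the two MBRs, as the Python sketch below illustrates. It assumes `(xmin, ymin, xmax, ymax)` MBRs on a 4x4 grid of unit cells with a lower-left origin; the coordinates of s1, r1 and r2 are hypothetical, chosen so that the objects cover the cells listed for Figure 5. The names `owner_cell` and `z_encode` are illustrative.

```python
def z_encode(col, row, bits):
    """Interleave column and row bits (column bits in even positions)."""
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)
        code |= ((row >> b) & 1) << (2 * b + 1)
    return code


def owner_cell(a, b, cell_size, bits):
    """Code of the cell holding the lower-left corner of the overlap of a and b.

    Returns None when the two rectangles do not intersect.
    """
    x, y = max(a[0], b[0]), max(a[1], b[1])          # lower-left of the overlap
    if x > min(a[2], b[2]) or y > min(a[3], b[3]):
        return None
    return z_encode(int(x // cell_size), int(y // cell_size), bits)


s1 = (0.5, 1.5, 2.5, 2.8)   # covers cells 2, 3, 6, 8, 9, 12
r1 = (1.2, 1.3, 2.6, 2.4)   # covers cells 3, 6, 9, 12
r2 = (0.4, 2.6, 1.4, 3.4)   # covers cells 8, 9, 10, 11

print(owner_cell(r1, s1, 1.0, 2))  # 3
print(owner_cell(r2, s1, 1.0, 2))  # 8
```

A cell then emits a pair only if its own code equals the pair's owner cell, so each result is produced exactly once without any communication between cells.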

A chained multi-way spatial join query Q_m=Overlap(R₁,R₂,R₃,...,R_m) can, by definition, be written as Q_m=Overlap(…Overlap(Overlap(R₁,R₂),R₃),…,R_m). The processing flow of the Spark-based chained multi-way spatial join query processing algorithm proposed by the invention is shown in Figure 6 and comprises the following main steps:

A: Project the multi-way join datasets R₁,R₂,R₃,…,R_m according to the grid partitioning and encoding method. The cell code is used as the Key, and the identifier of each spatial object together with attribute information such as its MBR is used as the Value, forming a series of key-value pairs. The projection results of the datasets R₁,R₂,R₃,…,R_m are stored in the resilient distributed datasets RDD₁,RDD₂,RDD₃,…,RDD_m respectively;

B: Compute Overlap(R₁,R₂): perform a Cogroup operation on RDD₁ and RDD₂, gathering the data in RDD₁ and RDD₂ that share a Key to obtain RDD_new. Filter RDD_new with the boundary filtering strategy to remove data objects that cannot produce a result, perform the actual spatial join computation, apply the duplicate avoidance strategy, and form the intermediate join result. Then perform the data replication operation on the intermediate join result to form the intermediate result dataset RDDresult_new;

C: Using the same computation as in step B, compute the join between RDDresult_new and RDD₃ to obtain the latest intermediate join result RDDresult_new for R₁,R₂,R₃. In the same way, compute in turn the join of RDDresult_new with RDD₄, with RDD₅, …, and with RDD_{m-1}, finally obtaining the intermediate join result dataset RDDresult_new for the datasets R₁,R₂,R₃,…,R_{m-1};

D: Perform a Cogroup operation on RDDresult_new and RDD_m to generate a new RDD_new, then perform boundary filtering and join computation on it, and output the results directly to form the final spatial join dataset RDDresult_new for the datasets R₁,R₂,R₃,…,R_m, which is saved to the HDFS file system. Because this is the last spatial join operation, no replication is needed.
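Steps A to D can be put together in a small single-process Python sketch. It is an illustration under several assumptions, not the patented implementation: plain lists stand in for RDDs, a dictionary grouping stands in for Cogroup, the grid is a hypothetical 4x4 grid of unit cells, duplicate avoidance is applied in every round, and the object coordinates are invented so that they land on the cells of the Figure 3 example.

```python
from collections import defaultdict
from itertools import product

BITS, CELL = 2, 1.0   # hypothetical 4x4 grid of unit cells


def z_encode(col, row):
    code = 0
    for b in range(BITS):
        code |= ((col >> b) & 1) << (2 * b)
        code |= ((row >> b) & 1) << (2 * b + 1)
    return code


def index(v):
    """Column or row index of coordinate v, clamped to the grid."""
    return max(0, min((1 << BITS) - 1, int(v // CELL)))


def cells_of(mbr):
    cols = range(index(mbr[0]), index(mbr[2]) + 1)
    rows = range(index(mbr[1]), index(mbr[3]) + 1)
    return sorted(z_encode(c, r) for c in cols for r in rows)


def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def overlap(left, right, last_round):
    """One join round: left holds (cell, tuple), right holds (cell, (id, mbr))."""
    groups = defaultdict(lambda: ([], []))          # data aggregation
    for cell, t in left:
        groups[cell][0].append(t)
    for cell, u in right:
        groups[cell][1].append(u)
    out = []
    for cell, (tuples, objects) in groups.items():
        if not tuples or not objects:
            continue
        tails = [t[-1][1] for t in tuples]          # MBRs of the last joined objects
        bound = (min(m[0] for m in tails), min(m[1] for m in tails),
                 max(m[2] for m in tails), max(m[3] for m in tails))
        objects = [u for u in objects if intersects(u[1], bound)]  # boundary filter
        for t, u in product(tuples, objects):
            s = t[-1][1]
            if not intersects(s, u[1]):             # actual join test on MBRs
                continue
            corner = (max(s[0], u[1][0]), max(s[1], u[1][1]))
            if z_encode(index(corner[0]), index(corner[1])) != cell:
                continue                            # duplicate avoidance
            joined = t + (u,)
            if last_round:
                out.append(joined)                  # final result, no replication
            else:
                out.extend((c, joined) for c in cells_of(u[1]))  # replication
    return out


def chained_join(datasets):
    projected = [[(c, (oid, mbr)) for oid, mbr in ds.items() for c in cells_of(mbr)]
                 for ds in datasets]
    result = [(c, (u,)) for c, u in projected[0]]
    for i in range(1, len(projected)):
        result = overlap(result, projected[i], last_round=(i == len(projected) - 1))
    return result


R1 = {"r1": (2.2, 1.3, 2.8, 2.6)}   # cells 6 and 12
R2 = {"r2": (1.5, 2.2, 2.5, 2.7)}   # cells 9 and 12
R3 = {"r3": (1.2, 2.4, 1.8, 3.3)}   # cells 9 and 11
ids = [tuple(oid for oid, _ in t) for t in chained_join([R1, R2, R3])]
print(ids)  # [('r1', 'r2', 'r3')]
```

In this run r₁ and r₂ meet only in cell 12, the pair is then copied to cell 9, and that copy is what allows the join with r₃ to be found, mirroring the Figure 3 narrative.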

实施例2：Example 2:

A Spark-based chained multi-way spatial join query processing algorithm comprises the following steps:

Step 1: Divide the entire data space into grid units of equal size, and encode each grid unit using the Z-order space-filling curve;

Step 2: Project each spatial object in the m-way (m>2) spatial join data sets R₁, R₂, …, R_m onto the corresponding grid units according to its position in the data space, and store the projection results in the resilient distributed datasets RDD₁, RDD₂, …, RDD_m. Set the loop variable i=2 and the intermediate result data set RDDresult_new=RDD₁;

Step 3: If the condition i<m is satisfied, perform the spatial join operation Overlap(RDDresult_new, RDD_i) on the two data sets RDDresult_new and RDD_i. During the computation, data aggregation, boundary filtering, spatial join computation, duplicate avoidance, and data replication are performed in sequence, finally forming a new intermediate result data set RDDresult_new, that is, RDDresult_new=Overlap(RDDresult_new, RDD_i);

Step 4: Set i=i+1 and repeat step 3 until the condition i<m is no longer satisfied;

Step 5: Perform the last spatial join operation Overlap(RDDresult_new, RDD_m). During the computation, data aggregation, boundary filtering, and spatial join computation are performed in sequence, and the results are output directly to form the final spatial join result set, which is saved to the HDFS file system.

The invention is a Spark-based chained multi-way spatial join query processing algorithm that significantly improves processing efficiency and reduces computation cost, and has good adaptability and scalability.
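The control flow of steps 2 to 5 can be sketched as a left-to-right fold over the m data sets. The following is a minimal in-memory sketch in plain Python, not the Spark implementation: the names `chain_join`, `toy_overlap`, and `intervals_overlap` are hypothetical, spatial objects are simplified to 1-D intervals, and the per-join work (aggregation, filtering, duplicate avoidance, replication) is hidden behind the `overlap` callables.

```python
def intervals_overlap(a, b):
    """Closed 1-D intervals (lo, hi) overlap when neither lies entirely before the other."""
    return a[0] <= b[1] and b[0] <= a[1]


def toy_overlap(tuples, objects):
    """Stand-in for Overlap(RDDresult_new, RDD_i): extend each tuple with every
    object that overlaps the tuple's last element (the chain predicate)."""
    return [t + (o,) for t in tuples for o in objects
            if intervals_overlap(t[-1], o)]


def chain_join(datasets, overlap, overlap_last):
    """Steps 2-5: datasets[0] holds 1-tuples, the others hold plain objects."""
    m = len(datasets)
    if m <= 2:
        raise ValueError("the text assumes m > 2")
    result = datasets[0]          # RDDresult_new = RDD_1
    i = 2                         # loop variable, 1-based as in the text
    while i < m:                  # steps 3 and 4
        result = overlap(result, datasets[i - 1])
        i += 1
    # Step 5: last join, where no replication would be needed.
    return overlap_last(result, datasets[m - 1])
```

Passing two callables mirrors the distinction the text draws between the intermediate joins, which replicate their results, and the final join, which outputs directly; in this toy version both can be the same function.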

Example 3:

Although specific embodiments of the present invention have been described above, those skilled in the art should understand that they are only illustrative. The present invention is a Spark-based chained multi-way spatial join query processing algorithm, and the examples serve only to illustrate the core ideas of the filtering strategy, the duplicate avoidance strategy, the join processing flow, and so on. Larger-scale experiments may be carried out later, and the related algorithms may be further improved to enhance the effect of data projection, replication, and filtering; indexing techniques may also be combined to further improve the performance of the algorithm, all without departing from the principle and essence of the present invention. The scope of the invention is limited only by the appended claims.

Claims

1. A chained multi-way spatial join query processing algorithm based on Spark, characterized in that it comprises the following steps:

Step 1: Use the grid division method to divide the entire data space into grid units of equal size, and encode each grid unit using the Z-order space-filling curve;

Step 2: Project each spatial object in the m-way (m>2) spatial join data sets R₁, R₂, …, R_m onto the corresponding grid units according to its position in the data space, forming a series of key-value pairs; store the projection results in the resilient distributed datasets RDD₁, RDD₂, …, RDD_m; set the loop variable i=2 and the intermediate result data set RDDresult_new=RDD₁;

Step 3: If the condition i<m is satisfied, perform the spatial join operation Overlap(RDDresult_new, RDD_i) on the two data sets RDDresult_new and RDD_i. During the computation, data aggregation, boundary filtering, spatial join computation, duplicate avoidance, and data replication are performed in sequence, finally forming an intermediate result data set RDDresult_new, that is, RDDresult_new=Overlap(RDDresult_new, RDD_i);

Step 4: Set i=i+1 and repeat step 3 until the condition i<m is no longer satisfied;

Step 5: Perform the last spatial join operation Overlap(RDDresult_new, RDD_m). During the computation, data aggregation, boundary filtering, and spatial join computation are performed in sequence, and the results are output directly to form the final spatial join result set, which is saved to the HDFS file system.

2. The chained multi-way spatial join query processing algorithm based on Spark according to claim 1, characterized in that the data division and encoding method is: a grid-based division method is used to divide the entire data space into n grid units of equal size, the Z-order space-filling curve is used to encode the grid units, the spatial data objects are projected onto the grid units according to their positions, and all grid units are mapped to multiple Executor execution units by a hash method, so that the entire processing task is divided into multiple parallel processing tasks.
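As an illustration of the encoding described above, a Z-order (Morton) code can be obtained by interleaving the bits of a grid unit's column and row indices. This is a minimal sketch under assumptions that the text does not fix: the function names `z_order` and `executor_for` are hypothetical, column bits are placed at even positions and row bits at odd positions, and a plain modulo stands in for the hash method that maps grid units to Executors.

```python
def z_order(col, row, bits=16):
    """Interleave the low `bits` bits of col and row into one Morton code.

    Assumed convention: col bits occupy even positions, row bits odd positions.
    """
    code = 0
    for b in range(bits):
        code |= ((col >> b) & 1) << (2 * b)
        code |= ((row >> b) & 1) << (2 * b + 1)
    return code


def executor_for(cell_code, num_executors):
    """Assumed stand-in for the hash method: map a cell code to an executor index."""
    return cell_code % num_executors
```

With this convention the four units of a 2x2 grid receive codes 0 to 3 in the familiar Z pattern, so units that are close in space tend to receive nearby codes.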

3. The chained multi-way spatial join query processing algorithm based on Spark according to claim 1, characterized in that the spatial object projection is: each spatial data object is mapped to the corresponding grid units according to its position. Let C=(c₁,c₂,…,c_n) denote a division of the data space, where c_i denotes a grid unit; let R be a set of spatial objects to be joined. If the MBR of a spatial object u∈R overlaps grid unit c_i, where c_i is the Z-order code of the grid unit, then object u is mapped to grid unit c_i and the corresponding key-value pair (c_i,u) is generated; if a spatial object overlaps multiple grid units, multiple key-value pairs are generated accordingly.
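The projection can be sketched as enumerating every grid unit that an object's MBR touches and emitting one key-value pair per unit. The sketch below is illustrative only: `project` and its parameters are hypothetical names, the MBR is assumed to be a tuple (xmin, ymin, xmax, ymax) with non-negative coordinates, cells are squares of side `cell_size` anchored at the origin, and `encode` is whatever function turns a (col, row) index into the unit's key, for example a Z-order encoder.

```python
def project(obj, mbr, cell_size, encode):
    """Return one (key, obj) pair for every grid unit overlapped by mbr.

    Boundaries are treated as closed, so an MBR that ends exactly on a
    cell edge is also assigned to the neighbouring cell (an assumption).
    """
    xmin, ymin, xmax, ymax = mbr
    first_col, last_col = int(xmin // cell_size), int(xmax // cell_size)
    first_row, last_row = int(ymin // cell_size), int(ymax // cell_size)
    return [(encode(col, row), obj)
            for row in range(first_row, last_row + 1)
            for col in range(first_col, last_col + 1)]
```

An object that lies inside a single unit yields exactly one pair, while an object that straddles a unit boundary yields one pair per overlapped unit, which is what later makes duplicate avoidance necessary.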

4. The chained multi-way spatial join query processing algorithm based on Spark according to claim 1, characterized in that step 3 specifically comprises the following steps:

Step 3-1: Compute Overlap(RDDresult_new, RDD_i): perform the Cogroup operation on RDDresult_new and RDD_i by Key value, that is, gather the data in RDDresult_new and RDD_i that share the same Key value, to obtain RDD_new;

Step 3-2: Apply the filtering strategy to RDD_new to remove the data pairs that cannot produce results, and then perform the actual spatial join operation;

Step 3-3: Execute the duplicate avoidance strategy to form the intermediate join result, perform the data copy operation on the intermediate join result, and finally form a new intermediate join result data set RDDresult_new.
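The aggregation and join parts of these steps can be sketched in plain Python. The `cogroup` function below only emulates the grouping behaviour of a cogroup on key-value data in memory and is not Spark code; `mbr_overlap` and `join_cells` are hypothetical names, MBRs are assumed to be (xmin, ymin, xmax, ymax) tuples, and skipping cells with an empty side is merely an illustrative pre-check rather than the filtering strategy of the invention. Duplicate avoidance and replication are omitted here.

```python
from collections import defaultdict


def cogroup(left, right):
    """Group two lists of (key, value) pairs by key: key -> (left values, right values)."""
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return dict(groups)


def mbr_overlap(a, b):
    """Closed rectangles overlap when they overlap on both axes."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def join_cells(left, right):
    """Per-cell join: extend each tuple with every object overlapping its last element."""
    joined = {}
    for key, (tuples, objects) in cogroup(left, right).items():
        if not tuples or not objects:   # nothing can match in this cell
            continue
        matches = [t + (o,) for t in tuples for o in objects
                   if mbr_overlap(t[-1], o)]
        if matches:
            joined[key] = matches
    return joined
```

Because each key corresponds to one grid unit, the per-key work is independent, which is what allows the join to run in parallel across units.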

5. The chained multi-way spatial join query processing algorithm based on Spark according to claim 1, characterized in that step 5 comprises the following steps:

Step 5-1: Compute Overlap(RDDresult_new, RDD_i): perform the Cogroup operation on RDDresult_new and RDD_i by Key value, that is, gather the data in RDDresult_new and RDD_i that share the same Key value, to obtain RDD_new;

Step 5-2: Apply the filtering strategy to RDD_new to remove the data pairs that cannot produce results, and then perform the actual spatial join operation;

Step 5-3: Execute the duplicate avoidance strategy to form the final join result data set RDDresult_new, which consists of tuples, and save it to the HDFS file system.

6. The chained multi-way spatial join query processing algorithm based on Spark according to claim 1, characterized in that the data copy operation is: for any tuple t in the intermediate join result set T generated on the current grid unit by the most recent spatial join operation, let ts be the spatial object in t that is related to the next spatial join operation; if ts overlaps a grid unit c_i, then the tuple t is copied to grid unit c_i and the corresponding key-value pair (c_i,t) is generated.
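The copy operation can be sketched as follows. This is a hedged illustration rather than the claimed implementation: `cells_overlapping` and `replicate` are hypothetical names, the object ts is assumed to be the last element of the tuple, MBRs are (xmin, ymin, xmax, ymax) tuples with non-negative coordinates, and a (col, row) index stands in for the unit's Z-order code c_i.

```python
def cells_overlapping(mbr, cell_size):
    """Yield the (col, row) index of every grid unit that the MBR touches."""
    xmin, ymin, xmax, ymax = mbr
    for row in range(int(ymin // cell_size), int(ymax // cell_size) + 1):
        for col in range(int(xmin // cell_size), int(xmax // cell_size) + 1):
            yield (col, row)


def replicate(t, cell_size):
    """Emit (c_i, t) for every grid unit c_i overlapped by ts, the object in t
    that takes part in the next join (assumed here to be t's last element)."""
    ts = t[-1]
    return [(cell, t) for cell in cells_overlapping(ts, cell_size)]
```

Only ts determines where the tuple goes: earlier objects in the tuple no longer influence partitioning, because the next join in the chain compares only ts against the next data set.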

7. The chained multi-way spatial join query processing algorithm based on Spark according to claim 4, characterized in that the filtering strategy is: while the join operations are executed in parallel, a corresponding filtering strategy is applied to remove the tuples that cannot produce join results, and only the tuples that may produce join results are copied.

8. The chained multi-way spatial join query processing algorithm based on Spark according to claim 7, characterized in that the filtering strategy comprises two parts:

Boundary filtering: before performing a join operation, first compute the bounding MBR of the spatial objects in the completed intermediate join result that are related to the subsequent spatial join, and use this MBR to filter out the spatial objects in the subsequent data set to be joined whose MBRs are disjoint from it, thereby reducing the computation cost of subsequent joins;

Filtering in the copy stage: during multi-way join query processing, the intermediate results produced by the earlier joins must be copied, and they are copied only to the other grid units that may produce join results, so as to avoid losing join results; when copying intermediate join results, only the intermediate results that contain objects spanning multiple grid units are copied.
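Boundary filtering can be sketched as computing one enclosing rectangle over the join-relevant objects of the intermediate result and discarding every candidate that is disjoint from it. The sketch assumes MBRs are (xmin, ymin, xmax, ymax) tuples and that the join-relevant object is the last element of each intermediate tuple; `bounding_mbr`, `mbr_overlap`, and `boundary_filter` are hypothetical names.

```python
def bounding_mbr(mbrs):
    """Smallest rectangle enclosing all given MBRs."""
    xmins, ymins, xmaxs, ymaxs = zip(*mbrs)
    return (min(xmins), min(ymins), max(xmaxs), max(ymaxs))


def mbr_overlap(a, b):
    """Closed rectangles overlap when they overlap on both axes."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def boundary_filter(intermediate, next_objects):
    """Keep only the objects of the next data set that can still join."""
    if not intermediate:
        return []                      # nothing left to join against
    bound = bounding_mbr([t[-1] for t in intermediate])
    return [o for o in next_objects if mbr_overlap(bound, o)]
```

The filter is conservative: an object that passes may still fail the exact join, but an object that is dropped cannot overlap any join-relevant object, so no result is lost.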

9. The chained multi-way spatial join query processing algorithm based on Spark according to claim 4, characterized in that the duplicate avoidance strategy is: when two spatial objects that span multiple grid units are joined, only the grid unit that contains the lower-left corner point of the new object formed by the overlap of the two objects is responsible for outputting the result.
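This rule can be sketched as a reference-point test: the lower-left corner of the intersection of the two MBRs falls in exactly one grid unit, and only that unit reports the pair. The sketch assumes MBRs are (xmin, ymin, xmax, ymax) tuples that already overlap, square cells of side `cell_size` anchored at the origin, and a (col, row) index as the unit identifier; `reference_cell` and `should_output` are hypothetical names.

```python
def reference_cell(a, b, cell_size):
    """Grid unit containing the lower-left corner of the intersection of a and b."""
    x = max(a[0], b[0])   # left edge of the intersection
    y = max(a[1], b[1])   # bottom edge of the intersection
    return (int(x // cell_size), int(y // cell_size))


def should_output(a, b, current_cell, cell_size):
    """True only in the single grid unit responsible for reporting the pair (a, b)."""
    return reference_cell(a, b, cell_size) == current_cell
```

Since both objects contain the intersection's lower-left corner, the unit holding that point is guaranteed to have received both objects during projection, so the pair is reported exactly once and never missed.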