CN115391351A

CN115391351A - Multi-layer vector data superposition method based on Spark big data

Info

Publication number: CN115391351A
Application number: CN202211113903.1A
Authority: CN
Inventors: 陈涌嘉; 邹一荣
Original assignee: China Unicom Guangdong Industrial Internet Co Ltd
Current assignee: China Unicom Guangdong Industrial Internet Co Ltd
Priority date: 2022-09-14
Filing date: 2022-09-14
Publication date: 2022-11-25

Abstract

The invention provides a Spark big data-based multi-layer vector data superposition method. Each layer is encoded with a quadtree, the optimal quadtree among the layers is selected as the standard quadtree for all layers, the layers are cut according to that quadtree, the cut layers are distributed to Spark executors for superposition calculation, and the resulting graphics are finally fused. Compared with the prior art, the method exploits Spark's distributed computation and combines it with the quadtree to optimize how Spark distributes the data, so that vector data of multiple layers can be superposed efficiently.

Description

Spark big data-based multi-layer vector data superposition method

Technical Field

The invention relates to the field of vector data superposition calculation, in particular to a Spark big data-based multilayer vector data superposition method.

Background

As land resources receive growing attention and the processing and analysis of vector data develop, vector data processing becomes increasingly important, the amount of data to be processed keeps growing, and the superposition (overlay) of vector data from multiple layers becomes necessary. When multiple layers with large data volumes are analyzed, obtaining results often takes a long time, depending on the volume of the vector data, so efficient overlay analysis of multi-layer vector data cannot be achieved. Traditional overlay analysis of massive vector data usually uses ArcGIS as the tool, but ArcGIS performs overlay analysis on a single computer, and the computing power of a single computer is insufficient to perform the analysis efficiently.

To meet this requirement, massive multi-layer vector data can be analyzed using the high-performance, scalable computing capacity of Spark. The invention patent application with publication number CN 11156081A discloses a vector element parallel computing method, device, storage medium and terminal. The method comprises: constructing a distributed element data set model according to the Spark computing framework; reading external data according to the distributed element data set model; repartitioning the read data; combining a quadtree index and a binary-tree index to create a local spatial index for the repartitioned data; and processing and analyzing the data after the local spatial index is created. Extension development based on Spark modules allows users to freely combine the various interfaces, and the processing and analysis interfaces of the distributed element data set model may include filtering, obtaining geographic and time ranges, clipping, spatial query, attribute summary, grid overlay, polygon overlay, column extraction, column addition and the like.

Although the above patent discloses that a large amount of vector data can be processed and analyzed based on Spark, it does not disclose how to superpose vector data of multiple layers, nor does it describe the vector data processing method in detail.

Disclosure of Invention

The invention aims to overcome at least one defect of the prior art, and provides a Spark big data-based multi-layer vector data superposition method, which is used for solving the problem of superposition analysis and calculation of massive multi-layer vector data.

The technical scheme adopted by the invention is as follows:

The invention provides a Spark big data-based multi-layer vector data superposition method, which comprises the following steps:

S1: merging the layers of the multi-layer vector graphics to be processed, and generating a distributed data set RDD1 in Spark;

S2: establishing a minimum containing rectangle z, which is a rectangle containing all vector graphics in RDD1;

S3: traversing all layers in RDD1, generating a shortest quadtree for each layer with the minimum containing rectangle z as the boundary, and recording the maximum coding length len of the shortest quadtree;

S4: among the maximum lengths len recorded in S3, selecting the length shared by the most layers as the standard quadtree coding length len1;

S5: dividing the minimum containing rectangle z according to the standard quadtree coding length len1 to generate a grid structure, and generating RDD2 with the quadtree code as the index value;

S6: repartitioning the data in RDD2 so that the data are redistributed across the Spark partitions, and generating RDD3;

S7: decoding the quadtree codes of the vector graphics in RDD3, obtaining the specific positions of the grids within the vector graphics, cutting the vector graphics by grid, and generating RDD4 from the data in the grids;

S8: superposing the vector graphics in RDD4 by the grid regions of the quadtree codes, then disassembling the superposed data, and generating RDD5 from the disassembled data;

S9: fusing the vector graphics layer data in RDD5, and generating RDD6 from the fused data.

Spark stores the vector data of the multiple layers in partitions, splitting the massive data and reducing the data volume of each partition; performing overlay analysis on this basis improves calculation efficiency. Quadtree coding is introduced to select the optimal partitioning: the vector data are divided into grids and cut according to the quadtree codes, so that each grid has a corresponding quadtree code, a large amount of calculation is spread across the grids, and the calculation load of each grid partition is as uniform as possible. Because the data are distributed to the grids and superposition and disassembly are performed grid by grid, the data volume of each grid partition is small, and so is the calculation required to superpose and disassemble multi-layer vector data within each grid. The vector data can therefore be superposed and fused efficiently, and massive multi-layer vector data can be analyzed efficiently.

Further, the data format of the RDD1 is: (vector graphic element ID, layer code, graphic WKT, minimum coordinate x, maximum coordinate x₁, minimum coordinate y, maximum coordinate y₁, original element JSON information).

Each layer is coded, and the code is recorded as the layer code. Each vector graphic in all layers is assigned an ID, recorded as the vector graphic element ID, so that each vector graphic corresponds to one vector graphic element ID. A unified coordinate system is established for all layers, and the minimum coordinate x, maximum coordinate x₁, minimum coordinate y and maximum coordinate y₁ of each vector graphic in that coordinate system are recorded. The original element JSON information of each vector graphic is generated, and the data are recorded in RDD1 with the vector graphic element ID as the index value. Fixing the record format of the vector graphics data facilitates subsequent traversal of the data and the various operations on it.
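The record layout above can be sketched in plain Python as follows. This is only an illustration: the names `Rdd1Record` and `make_record` are hypothetical, the graphic is assumed to be a simple polygon given as a list of (x, y) vertices, and a real implementation would build the records inside a Spark job rather than in local code.

```python
import json
from typing import Dict, NamedTuple, Sequence, Tuple

class Rdd1Record(NamedTuple):
    """One RDD1 element, in the field order given in the text."""
    element_id: str    # vector graphic element ID
    layer_code: str    # layer code
    wkt: str           # graphic WKT
    min_x: float       # minimum coordinate x
    max_x: float       # maximum coordinate x1
    min_y: float       # minimum coordinate y
    max_y: float       # maximum coordinate y1
    element_json: str  # original element JSON information

def make_record(element_id: str, layer_code: str,
                ring: Sequence[Tuple[float, float]],
                properties: Dict[str, object]) -> Rdd1Record:
    """Build a record from a polygon ring (assumed input shape, not from the source)."""
    xs = [p[0] for p in ring]
    ys = [p[1] for p in ring]
    closed = list(ring) + [ring[0]]  # WKT polygons repeat the first vertex at the end
    wkt = "POLYGON ((" + ", ".join(f"{x:g} {y:g}" for x, y in closed) + "))"
    return Rdd1Record(element_id, layer_code, wkt,
                      min(xs), max(xs), min(ys), max(ys),
                      json.dumps(properties, sort_keys=True))
```

Keeping the bounding coordinates next to the WKT means later steps can compare extents without parsing the geometry again.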

Further, the step S4 of selecting the standard quadtree coding length len1 further includes:

when several maximum lengths occur the same number of times, the longest of them is selected as the standard quadtree coding length len1.

In step S4, the maximum length len that occurs most often among the generated quadtree codes is selected, so that the grids generated on the minimum containing rectangle z enclose the vector graphics as completely as possible and grids containing homogeneous data are not divided further, which reduces the amount of calculation for cutting. When different lengths occur the same number of times, the longest is taken as the standard quadtree coding length len1; provided the grid division remains reasonable, this increases the number of grid division levels as much as possible and prevents a large number of vector graphics from gathering in one grid and causing data skew.
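The selection rule of step S4 (most frequent length, ties broken in favor of the longer length) can be illustrated with a few lines of Python. The function name `choose_standard_length` is hypothetical, and the input is assumed to be the list of maximum coding lengths len recorded in S3.

```python
from collections import Counter
from typing import Iterable

def choose_standard_length(max_lengths: Iterable[int]) -> int:
    """Return len1: the most frequent length, preferring the longer one on ties."""
    counts = Counter(max_lengths)
    if not counts:
        raise ValueError("at least one coding length is required")
    # Compare by (occurrence count, length) so that ties go to the longer code.
    return max(counts, key=lambda length: (counts[length], length))
```

For example, the lengths 2, 2, 3, 1 give len1 = 2, while 2, 3, 3, 2 give len1 = 3 because both lengths occur twice and 3 is longer.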

Further, the data format of the RDD2 is: (quad-tree coding, (vector graphic element ID, layer coding, graphic WKT, original element json)).

The minimum containing rectangle z is quadtree-coded according to the standard quadtree coding length len1, which divides z into grids; z is cut according to this grid structure, and each grid corresponds to one quadtree code. A new RDD2 is generated with the quadtree code as the spatial index value. Because z is the minimum containing rectangle of the vector data of all layers, the spatial index value identifies the data of the vector graphics of every layer within the corresponding grid of z, which facilitates data lookup and superposition.
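A minimal sketch of how records might be keyed by quadtree code is given here. Several points are assumptions rather than statements from the text: each level contributes one digit equal to 2 × row bit + column bit with rows counted from the bottom, a graphic is assigned to every grid its bounding box reaches, and the helper names (`cell_code`, `codes_for_bbox`, `to_rdd2`) are hypothetical. A plain list stands in for the Spark RDD.

```python
from typing import Iterable, List, Tuple

BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)

def cell_code(col: int, row: int, depth: int) -> str:
    """Quadtree code of grid cell (col, row); one digit per level, coarsest first."""
    digits = []
    for level in range(depth - 1, -1, -1):
        col_bit = (col >> level) & 1
        row_bit = (row >> level) & 1
        digits.append(str(2 * row_bit + col_bit))  # assumed digit convention
    return "".join(digits)

def codes_for_bbox(bbox: BBox, z: BBox, depth: int) -> List[str]:
    """Codes of all cells at the given depth that the bounding box reaches."""
    n = 1 << depth                      # cells per side
    cell_w = (z[2] - z[0]) / n
    cell_h = (z[3] - z[1]) / n

    def index(value: float, origin: float, size: float) -> int:
        # Clamp so that coordinates on the outer edge of z stay inside the grid.
        return max(0, min(n - 1, int((value - origin) // size)))

    c0, c1 = index(bbox[0], z[0], cell_w), index(bbox[2], z[0], cell_w)
    r0, r1 = index(bbox[1], z[1], cell_h), index(bbox[3], z[1], cell_h)
    return [cell_code(c, r, depth)
            for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]

def to_rdd2(records: Iterable[Tuple[str, str, BBox]], z: BBox, depth: int):
    """Flat-map (element_id, layer_code, bbox) records into (code, record) pairs."""
    return [(code, rec) for rec in records
            for code in codes_for_bbox(rec[2], z, depth)]
```

With z = (0, 0, 4, 4) and len1 = 2, a graphic whose bounding box is (0.5, 0.5, 1.5, 0.8) is keyed under the codes "00" and "01" in this convention.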

Further, in step S6, repartitioning the data in RDD2, redistributing the data across the Spark partitions, and generating RDD3 specifically comprises:

traversing RDD2, adding a random number to each index value in RDD2 to generate a new index value, and repartitioning RDD2 according to the new index values to generate RDD3;

the data format of the RDD3 is: (quadtree code plus random number, (vector graphic element ID, layer code, graphic WKT, original element JSON information)).

Spark may suffer data skew when creating partitions or performing transformations; that is, more data may be concentrated on the same executor, which causes uneven resource allocation and reduces calculation efficiency. Specifically, the number of layers with vector graphics differs from grid to grid, so repartitioning by quadtree code alone may give the partitions different data volumes, and data skew may occur when the partition tasks are allocated to the Spark executors. A random number is therefore added to the index value in RDD2. In one embodiment, a random function draws digits from 0 to 9 to construct a four-digit random number, and the random number together with a one-character connecting identifier is appended to the quadtree code to form a new index value. RDD2 is then repartitioned according to the new index value. Because the numbers are random, each value in the defined range occurs about equally often. After repartitioning, the data in each partition share the same index value, that is, the same quadtree code and the same random number; since the random numbers are evenly distributed, the data volume of each partition is roughly equal after the vector graphics data are repartitioned, so that after the shuffle the RDD data in Spark are distributed to the executors as evenly as possible. Moreover, for vector graphics data the cost of the shuffle in Spark is lower than that of spatial calculations such as cropping, so the shuffle is chosen to distribute the data evenly to the executors and balance the calculation load.
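The index-salting idea can be sketched without Spark as follows. The four random digits and the one-character separator follow the embodiment; everything else is an assumption made for illustration, including the function names and the use of CRC32 modulo the partition count as a stand-in for whatever partitioner the actual job uses.

```python
import random
import zlib
from collections import Counter
from typing import Iterable

SEPARATOR = "_"  # one-character separation identifier

def add_salt(code: str, rng: random.Random, digits: int = 4) -> str:
    """Append the separator and a random digit string to a quadtree code."""
    salt = "".join(str(rng.randint(0, 9)) for _ in range(digits))
    return f"{code}{SEPARATOR}{salt}"

def strip_salt(key: str) -> str:
    """Recover the quadtree code from a salted index value."""
    return key.split(SEPARATOR, 1)[0]

def partition_of(key: str, num_partitions: int) -> int:
    """Hypothetical stand-in for hash partitioning (deterministic CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def partition_sizes(codes: Iterable[str], num_partitions: int, seed: int = 0) -> Counter:
    """Count how many salted records land in each partition."""
    rng = random.Random(seed)
    return Counter(partition_of(add_salt(c, rng), num_partitions) for c in codes)
```

Without the salt, every record keyed "03" would hash to one partition; with it, records from a crowded grid are spread over several partitions, and `strip_salt` restores the quadtree code afterwards.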

Further, in step S7, decoding the quadtree codes of the vector graphics in RDD3, obtaining the positions of the codes within the vector graphics, generating the specific grids of the vector graphics, and cutting them to generate RDD4 specifically comprises:

traversing RDD3, removing the random number from each index value to restore the quadtree code, decoding the quadtree code to generate the specific grid graphic of each vector graphic, cutting the graphic WKT of the vector graphic according to the grids to form a cropped graphic WKT for each grid, and generating RDD4 with the quadtree code as the index value;

the data format of the RDD4 is: (quadtree code, (layer code, vector graphic element ID, cropped graphic WKT, original element JSON information)).

At this point, the vector graphics of each layer in RDD3 have been cut into individual grids. The specific grid of a vector graphic can be determined from the quadtree code and the layer code, and the data in that grid can be obtained, so that the vector graphics of the layers can conveniently be superposed grid by grid.
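Decoding a quadtree code back into its grid rectangle can be sketched as below. The digit convention (bit 0 selects the right half, bit 1 selects the upper half) is an assumption, since the text does not specify it, and `clip_bbox` clips only a bounding box as a simple stand-in for cutting the real WKT geometry, which would normally be delegated to a geometry library.

```python
from typing import Optional, Tuple

BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)

def decode_cell(code: str, z: BBox) -> BBox:
    """Rectangle of the grid cell addressed by a quadtree code inside z."""
    min_x, min_y, max_x, max_y = z
    for digit in code:
        d = int(digit)
        if d not in (0, 1, 2, 3):
            raise ValueError(f"invalid quadtree digit: {digit!r}")
        mid_x = (min_x + max_x) / 2
        mid_y = (min_y + max_y) / 2
        if d & 1:            # right half (assumed convention)
            min_x = mid_x
        else:                # left half
            max_x = mid_x
        if d & 2:            # upper half (assumed convention)
            min_y = mid_y
        else:                # lower half
            max_y = mid_y
    return (min_x, min_y, max_x, max_y)

def clip_bbox(bbox: BBox, cell: BBox) -> Optional[BBox]:
    """Intersection of a bounding box with a cell, or None if they do not overlap."""
    min_x, min_y = max(bbox[0], cell[0]), max(bbox[1], cell[1])
    max_x, max_y = min(bbox[2], cell[2]), min(bbox[3], cell[3])
    if min_x >= max_x or min_y >= max_y:
        return None
    return (min_x, min_y, max_x, max_y)
```

Under this convention and z = (0, 0, 4, 4), the code "03" decodes to the cell (1, 1, 2, 2) and "20" to (0, 2, 1, 3).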

Further, in step S8, superposing and disassembling the vector graphics layer data of each layer according to the grids of S7 specifically comprises:

grouping RDD4 by quadtree code, superposing the vector graphics under the same grid, disassembling the superposed vector graphics into overlapping and non-overlapping parts, generating fusion element codes, and generating RDD5 from the disassembled vector graphics;

the data format of the RDD5 is: (fusion element code, cropped graphic WKT, JSON information).

The specific steps of superposition and disassembly are as follows:

grouping the vector graphics under the same grid by layer code;

when only one group appears, performing no operation;

when several groups appear, traversing all objects under the grid, sorting them by layer code, combining them in pairs in that order, superposing the vector graphics that overlap, cutting out the overlapping parts, and merging the layer codes; this step is repeated until all layer codes are the same.

Specifically, when layer codes are merged, the layer code of the topmost of the merged layers is taken as the merged layer code.

Further, the fusion element codes are divided into overlapping part codes and non-overlapping part codes;

the overlapped part codes are the connected combination of the vector graphic element IDs of all the overlapped parts;

the non-overlapping portion is encoded as a vector graphic element ID of a single non-overlapping portion;

the JSON information is a combination of multiple overlapping original element JSON information or single non-overlapping original element JSON information.

Specifically, the superposed data are divided into overlapping and non-overlapping parts, which are distinguished by their fusion element codes. Subsequent operations group the data by fusion element code, which makes the data easier to operate on and improves processing efficiency.
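One pairwise superposition-and-disassembly step, together with the resulting fusion element codes, might look like the sketch below. To stay self-contained it models each graphic as a set of unit cells instead of a WKT polygon, so set intersection and difference stand in for real polygon operations. The `Piece` type, the "_" used to connect element IDs, and the rule that the larger layer code is the topmost layer are all assumptions made for illustration.

```python
from typing import FrozenSet, List, NamedTuple, Tuple

Cell = Tuple[int, int]

class Piece(NamedTuple):
    fusion_code: str             # fusion element code
    layer_code: str              # layer code (merged for overlapping parts)
    cells: FrozenSet[Cell]       # stand-in for the cropped graphic WKT
    json_info: Tuple[str, ...]   # original element JSON of each contributor

def split_pair(a: Piece, b: Piece) -> List[Piece]:
    """Disassemble two graphics into one overlapping and up to two other parts."""
    lower, upper = sorted((a, b), key=lambda p: p.layer_code)
    common = lower.cells & upper.cells
    if not common:
        return [a, b]  # nothing overlaps: leave both graphics untouched
    pieces = [Piece(
        fusion_code=lower.fusion_code + "_" + upper.fusion_code,  # connected IDs
        layer_code=upper.layer_code,         # topmost layer code (assumed ordering)
        cells=common,
        json_info=lower.json_info + upper.json_info,
    )]
    for source in (lower, upper):
        rest = source.cells - common         # non-overlapping remainder
        if rest:
            pieces.append(Piece(source.fusion_code, source.layer_code,
                                rest, source.json_info))
    return pieces
```

Two graphics "A" (layer "1") and "B" (layer "2") that share one cell therefore yield three pieces coded "A_B", "A" and "B", matching the one overlapping part and two non-overlapping parts described in the text.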

Further, in step S9, fusing the vector graphics layer data in RDD5 and generating RDD6 from the fused data specifically comprises:

traversing the vector graphics layer data in the RDD5, grouping according to the fusion element codes, judging whether the data in the groups are connected, if so, fusing the vector graphics layer data to generate a fusion graphics WKT and fusion JSON information, and generating RDD6 from the fused data;

the data format of the RDD6 is (fusion element coding, fusion graph WKT and fusion JSON information).

Specifically, in one embodiment, a touch function may be used to determine whether the vector data are connected: if the vector graphics are connected, they are fused; if not, no operation is performed.
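The fusion step can be sketched in the same simplified setting, with graphics modeled as sets of unit cells. Here `touches` is a hypothetical stand-in for the touch predicate of a geometry library: two pieces count as connected when they share a cell or contain edge-adjacent cells. Pieces are grouped by fusion element code, and connected pieces within a group are merged.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple

Cell = Tuple[int, int]
NEIGHBOURS = ((0, 0), (1, 0), (-1, 0), (0, 1), (0, -1))

def touches(a: Iterable[Cell], b: Set[Cell]) -> bool:
    """True when the two cell sets share a cell or have edge-adjacent cells."""
    return any((x + dx, y + dy) in b for x, y in a for dx, dy in NEIGHBOURS)

def fuse_group(shapes: Iterable[FrozenSet[Cell]]) -> List[FrozenSet[Cell]]:
    """Merge every set of mutually connected shapes into one shape."""
    merged: List[Set[Cell]] = []
    for shape in shapes:
        current = set(shape)
        untouched = []
        for other in merged:
            if touches(current, other):
                current |= other          # connected: fuse into the new shape
            else:
                untouched.append(other)   # not connected: keep as it is
        untouched.append(current)
        merged = untouched
    return [frozenset(s) for s in merged]

def fuse_by_code(pieces: Iterable[Tuple[str, FrozenSet[Cell]]]
                 ) -> Dict[str, List[FrozenSet[Cell]]]:
    """Group (fusion_code, shape) pairs by code and fuse connected shapes."""
    groups: Dict[str, List[FrozenSet[Cell]]] = defaultdict(list)
    for code, shape in pieces:
        groups[code].append(shape)
    return {code: fuse_group(shapes) for code, shapes in groups.items()}
```

Pieces with the same fusion element code that ended up in neighbouring grids are rejoined this way, while pieces that are not connected stay separate.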

Compared with the prior art, the invention has the following beneficial effects:

1. The method superposes the vector data of multiple layers and can process multi-layer vector data efficiently.

2. The optimal grid division is determined by means of a quadtree, and the vector graphics data are distributed to the Spark executors according to the grids in combination with random numbers, which improves the efficiency of processing massive data.

3. The vector data of multiple layers are superposed according to the grid structure, which reduces the calculation required for superposition, improves superposition efficiency and makes the vector data easier to process.

Drawings

Fig. 1 is a schematic flowchart of step S5 according to an embodiment of the present invention.

Fig. 2 is a detailed flowchart of step S7 according to an embodiment of the present invention.

Fig. 3 is a detailed flowchart of step S8 according to an embodiment of the present invention.

Fig. 4 is a detailed flowchart of step S9 according to an embodiment of the present invention.

Detailed description of the preferred embodiment

The drawings are only for purposes of illustration and are not to be construed as limiting the invention. For the purpose of better illustrating the following embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.

Examples

As shown in fig. 1, in the present embodiment, the outer frame rectangle is a minimum containing rectangle z, and two vector graphics are superimposed.

Specifically, multi-layer vector graphics data in the same coordinate system are first read. Each layer is coded, and the code is recorded as the layer code. Each vector graphic in all layers is assigned an ID, recorded as the vector graphic element ID, so that each vector graphic corresponds to one vector graphic element ID. A unified coordinate system is established for all layers, and the minimum coordinate x, maximum coordinate x₁, minimum coordinate y and maximum coordinate y₁ of each vector graphic in that coordinate system are recorded. The original element JSON information of each vector graphic is generated. The layers are then merged, and a distributed data set RDD1 containing all layers in a single data format is generated in Spark. The data format of RDD1 is (vector graphic element ID, layer code, graphic WKT, minimum coordinate x, maximum coordinate x₁, minimum coordinate y, maximum coordinate y₁, original element JSON information).

For the vector data of RDD1, the spatial range, i.e. the range over which the quadtree spatial cutting is performed, is then determined. Specifically, all vector graphics in all layers are traversed to obtain the maximum coordinates x₁, y₁ and the minimum coordinates x, y of each vector graphic. By comparing the maximum coordinates x₁, y₁ and minimum coordinates x, y of the vector graphics within the same layer, a layer minimum rectangle z₁ that contains all vector graphics of that layer is obtained. The rectangles z₁ of all layers are then compared to determine the rectangle that contains every z₁; this rectangle is the minimum containing rectangle z.
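The two-stage comparison described above (per-layer rectangle z₁ first, then the overall rectangle z) can be sketched as follows. The function names are hypothetical, and each input record is assumed to be a pair of layer code and (min x, max x₁, min y, max y₁), in the same coordinate order as RDD1.

```python
from typing import Dict, Iterable, Tuple

Rect = Tuple[float, float, float, float]  # (min_x, max_x, min_y, max_y)

def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), max(a[3], b[3]))

def layer_rectangles(records: Iterable[Tuple[str, Rect]]) -> Dict[str, Rect]:
    """Layer minimum rectangle z1 for every layer code."""
    rects: Dict[str, Rect] = {}
    for layer_code, rect in records:
        rects[layer_code] = union(rects[layer_code], rect) if layer_code in rects else rect
    return rects

def minimum_containing_rectangle(records: Iterable[Tuple[str, Rect]]) -> Rect:
    """Minimum containing rectangle z over the z1 rectangles of all layers."""
    rects = list(layer_rectangles(records).values())
    if not rects:
        raise ValueError("at least one vector graphic is required")
    z = rects[0]
    for rect in rects[1:]:
        z = union(z, rect)
    return z
```

Because taking the union of rectangles is associative, the same reduction could be run per partition and then combined, which is why it suits a distributed data set.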

After the minimum containing rectangle z is determined, all vector graphics within it are traversed, and a shortest-quadtree code is generated for the vector graphics of each layer with respect to z. The shortest quadtree is the quadtree of minimum depth, i.e. one in which no region needs to be divided further. The maximum coding length len of each vector graphic is recorded; as shown in fig. 1, the grid "00" in the figure has the quadtree code "00", whose coding length is 2. From the recorded maximum coding lengths len, the length that occurs most often is extracted; if several lengths occur the same number of times, the longest of them is selected as the standard quadtree coding length len1 of the minimum containing rectangle z.
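The text does not spell out how the code of a graphic is derived. One possible reading, assumed here purely for illustration, is to descend the quadtree for as long as the bounding box of the graphic fits entirely inside a single quadrant and to take the resulting code length as len. The function name, the digit convention (2 × row + column, rows counted from the bottom) and the `max_depth` cap are all hypothetical.

```python
from typing import Tuple

BBox = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)

def containing_code(bbox: BBox, z: BBox, max_depth: int = 16) -> str:
    """Code of the deepest quadtree cell of z that fully contains bbox."""
    min_x, min_y, max_x, max_y = z
    bx0, by0, bx1, by1 = bbox
    digits = []
    for _ in range(max_depth):           # cap the depth for very small graphics
        mid_x = (min_x + max_x) / 2
        mid_y = (min_y + max_y) / 2
        if bx1 <= mid_x:
            col = 0
        elif bx0 >= mid_x:
            col = 1
        else:
            break                        # box straddles the vertical split
        if by1 <= mid_y:
            row = 0
        elif by0 >= mid_y:
            row = 1
        else:
            break                        # box straddles the horizontal split
        digits.append(str(2 * row + col))
        if col:
            min_x = mid_x
        else:
            max_x = mid_x
        if row:
            min_y = mid_y
        else:
            max_y = mid_y
    return "".join(digits)
```

Under this reading, a graphic lying entirely inside the lower-left sixteenth of z receives a code starting with "00", so its coding length is at least 2, in line with the grid "00" mentioned for fig. 1.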

As shown in fig. 1, after the minimum containing rectangle z is quadtree-coded according to the standard quadtree coding length len1, a grid structure of z is generated and z is cut according to it, so that the vector graphics data of each layer are likewise divided by grid. A new RDD2 can then be generated with the quadtree code as the index value; the data format of RDD2 is (quadtree code, (vector graphic element ID, layer code, graphic WKT, original element JSON)). In this embodiment, the index value "03" retrieves the data, such as the graphic WKT, corresponding to the two vector graphics in that grid.

To distribute the RDD data evenly across the executors, a random number is added to the index value in RDD2. In this embodiment, the data in RDD2 are traversed; a random function draws digits from 0 to 9 to construct a 4-digit random number such as "6120"; a one-character separation identifier such as "_" is set; and the separator and random number are appended to the quadtree code in RDD2 to form a new index value such as "03_6120". The data are then repartitioned according to the new index value, so that all data with the index value "03_6120" fall into the same partition. Because of the random number, data whose quadtree code is "03" but whose random number differs fall into other partitions, and because the random numbers are evenly distributed over their range, the numbers of records in the partitions are relatively even. If no random number were introduced and the quadtree code alone were used as the index value, all data with the index value "03" would fall into one partition and all data with the index value "20" into another; since different index values may correspond to different data volumes, the calculation loads would differ when the partition data are distributed to the executors, and data skew would easily occur. Spark then shuffles the data to generate RDD3, whose data format is (quadtree code + random number, (vector graphic element ID, layer code, graphic WKT, original element JSON)).

RDD3 is traversed, and the separation identifier and the random number are removed from each index value to restore the quadtree code. As shown in fig. 2, because the data stored in RDD3 are the graphic WKT, the vector graphic indexed by a quadtree code and vector graphic element ID is actually the WKT data of the entire vector graphic (the first graphic from the left in fig. 2). Decoding the quadtree codes generates the specific grid graphic of each layer's vector graphic, with each grid corresponding to one quadtree code (the second graphic from the left in fig. 2). The vector graphic is then cut according to the grids; specifically, a cropped graphic WKT is generated from the graphic WKT for each grid (the third graphic from the left in fig. 2), and RDD4 is generated from the cropped graphic WKT. The data format of RDD4 is (quadtree code, (layer code, vector graphic element ID, cropped graphic WKT, original element JSON information)). Because the shuffle was already performed in step S6, the data index within each partition does not change here; only the data content changes from the graphic WKT to the cropped graphic WKT (the fourth graphic from the left in fig. 2).

After this operation, a number of grids containing the vector graphics data of multiple layers are obtained. The grids are grouped by quadtree code using Spark's groupByKey, and the vector data of the multiple layers under the same grid are superposed and disassembled.

Superposition and disassembly proceed as follows. Under the same grid, the data are grouped by layer code. If only one group appears, no operation is performed. If several groups appear, all objects in the groups are traversed, their vector graphics are sorted and then combined in pairs, and each pair is checked for overlap. If the graphics do not overlap, the procedure moves on to the next pair; if they overlap, the graphics are superposed, the overlapping part is cut out, and the layer codes are merged. The procedure is repeated on the processed result until no graphics overlap, and the fusion element codes are generated.

The vector graphics can be divided into two parts through the superposition and disassembly operations, wherein one part is a graphics layer overlapped part, and the other part is a graphics non-overlapped part. The fusion element coding is also divided into an overlapped part coding and a non-overlapped part coding, wherein the overlapped part coding is a vector graphic element ID connection combination of all overlapped parts, the non-overlapped part coding is a vector graphic element ID of a single non-overlapped part, JSON information of the overlapped part is a combination of a plurality of JSON information of original elements, and the JSON information of the non-overlapped part is single JSON information of the original elements.

In this embodiment, as shown in fig. 3, the grid with the quadtree code "20" is superposed and disassembled in task1 of Spark. Two layer codes are found, so the two graphics are superposed and the overlapping part is cut out, dividing the graphics into one overlapping part and two non-overlapping parts. The other grids are superposed and disassembled in the same way on other executors, and RDD5 is generated from the disassembled vector graphics data. The data format of RDD5 is (fusion element code, cropped graphic WKT, JSON information). The fusion element code of an overlapping part connects the IDs of the two superposed vector graphics in layer-code order, and the layer codes are merged; specifically, the layer code of the layer containing the uppermost vector graphic is taken as the merged layer code.

If data with different layer codes still remain in the grid corresponding to the quadtree code "20" at this point, superposition and disassembly are performed again until the layer codes of all the data are the same.

Traversing the vector graphics layer data in the RDD5, grouping according to the fusion element codes, judging whether the data in the groups are connected, and fusing the vector graphics layer data if the data in the groups are connected.

Specifically, in this embodiment, as shown in fig. 4, after superposition and disassembly the vector graphics remain distributed across the executors by grid. They are grouped by fusion element code, so that vector graphics with the same superposition relationship fall into one group; the touch function determines whether the vector graphics in a group are connected, and those that are connected are fused.

It should be understood that the above-mentioned embodiments of the present invention are only examples for clearly illustrating the technical solutions of the present invention, and are not intended to limit the specific embodiments of the present invention. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention claims should be included in the protection scope of the present invention claims.

Claims

1. A method for superposing multi-layer vector data based on Spark big data is characterized by comprising the following steps:

S1: merging the layers of the multi-layer vector graphics to be processed, and generating a distributed data set RDD1 in Spark;

S6: repartitioning the data in RDD2 using random numbers, and generating RDD3 from the repartitioned data;

S7: decoding the quadtree codes of the vector graphics in RDD3, obtaining the specific positions of the grids within the vector graphics, cutting the vector graphics by grid, and generating RDD4 from the data in the grids;

S8: superposing and disassembling the vector graphics in RDD4 according to the grid regions of the quadtree codes, and generating RDD5 from the disassembled data;

S9: fusing the vector graphics layer data in RDD5, and generating RDD6 from the fused data.

2. The Spark big data-based multilayer vector data superposition method according to claim 1, wherein the data format of the RDD1 is: (vector graphic element ID, layer code, graphic WKT, minimum coordinate x, maximum coordinate x₁, minimum coordinate y, maximum coordinate y₁, original element JSON information).

3. The method for overlaying multilayer vector data based on Spark big data according to claim 2, wherein the step S4 of selecting the standard quadtree coding length len1 further comprises:

4. The method as claimed in claim 1, wherein the data format of RDD2 is as follows: (quad-tree coding, (vector graphic element ID, layer coding, graphic WKT, original element json)).

5. The method as claimed in claim 4, wherein step S6 uses a random number to repartition the data in RDD2 and generates RDD3 from the repartitioned data, specifically:

6. The method as claimed in claim 5, wherein step S7 decodes the quadtree codes of the vector graphics in RDD3, obtains the positions of the codes within the vector graphics, generates the specific grids of the vector graphics, and cuts them to generate RDD4, specifically:

7. The method according to claim 6, wherein in step S8 the vector graphics in RDD4 are superposed and disassembled according to the grid regions of the quadtree codes, and RDD5 is generated from the disassembled data, specifically:

grouping the RDD4 according to the quad-tree codes, superposing vector graphics under the same grid, disassembling the superposed vector graphics into an overlapped part and a non-overlapped part, generating a fusion element code, and generating an RDD5 by the disassembled vector graphics;

8. The Spark big data-based multilayer vector data superposition method according to claim 7, wherein the steps of superposition and disassembly comprise:

grouping vector graphics under the same grid according to layer coding;

when only one group appears, no operation is performed;

9. The method for overlaying vector data of multiple image layers based on Spark big data according to claim 7, wherein said fusion element coding is divided into an overlapped part coding and a non-overlapped part coding;

the non-overlapping portions are encoded as vector graphic element IDs of a single non-overlapping portion;

10. The method for superimposing multilayer vector data based on Spark big data according to claim 9, wherein in step S9, vector graphics layer data in RDD5 are fused, and the fused data is used to generate RDD6, specifically:

traversing the vector graphics layer data in the RDD5, grouping according to the fusion element codes, judging whether the data in the grouping are connected, if so, fusing the vector graphics layer data to generate a fusion graphics WKT and fusion JSON information, and generating RDD6 from the fused data;