CN106528773B - Map computing system and method based on Spark platform supporting spatial data management - Google Patents

Publication number
CN106528773B
CN106528773B CN201610975847.0A CN201610975847A CN106528773B CN 106528773 B CN106528773 B CN 106528773B CN 201610975847 A CN201610975847 A CN 201610975847A CN 106528773 B CN106528773 B CN 106528773B
Authority
CN
China
Prior art keywords
data
graph
spatial
query
range
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610975847.0A
Other languages
Chinese (zh)
Other versions
CN106528773A (en)
Inventor
郭山清
王昌圆
韩艳祥
张艮
赵凯
张学圣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANDONG LIANYOU COMMUNICATION TECHNOLOGY DEVELOPMENT Co.,Ltd.
Original Assignee
Shandong Lianyou Communication Technology Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong Lianyou Communication Technology Development Co ltd
Priority to CN201610975847.0A
Publication of CN106528773A
Application granted
Publication of CN106528773B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29: Geographical information databases

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Remote Sensing (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a graph computing system and method supporting spatial data management based on a Spark platform, which comprises the steps of receiving spatial data, dividing the spatial range of the spatial data into a plurality of rectangular areas according to geographical position information, distributing the data in each rectangle to different partitions, dividing the partitions into grids, sequencing the grids, performing spatial mapping of the data, and establishing a quadtree index; receiving a query request of a graph, converting the query request into a data query, and searching in an index of a data storage layer in the graph range of the request; and according to the fed-back query result, if a plurality of graph data exist, carrying out space connection of a plurality of graphs, and distributing edges, which are within a set range, in the query result to the same partition by a graph partition strategy based on the position to realize local graph construction. The invention realizes direct space range query and space connection operation on the graph by expanding graph query, and meets the requirements of a plurality of scenes.

Description

Map computing system and method based on Spark platform supporting spatial data management
Technical Field
The invention relates to a graph computing system and method that support spatial data management on the Spark platform.
Background
With the spread of information technology, large amounts of data carrying location attributes are generated at every moment, and such data is gradually becoming an essential part of people's digital lives. Data with a geographic location attribute is called spatial data. Many valuable relationships are hidden in it, and a graph is the most intuitive and widely used tool for showing them. More and more researchers are focusing on graph computation and graph analysis of spatial data. Many real-life scenarios, such as investigating a major explosion in New York, only require the data of a local area within a region. For the computation and analysis of large-scale local spatial data, the prior art has proposed SpatialGraphx, a framework built on GraphX (the graph-parallel computing framework of Spark, a fast general-purpose engine for big data processing) that supports rapid and direct construction and analysis of regional subgraphs. SpatialGraphx constructs regional subgraphs and optimizes subgraph analysis by using a quadtree index and a novel graph partitioning strategy; it was published at the COMPSAC 2016 conference.
However, further research shows that the SpatialGraphx framework has the following two important defects, which limit its performance in graph processing.
First, SpatialGraphx does not account well for the load imbalance caused by the different sizes of the graphs maintained by each slave node when performing spatial graph queries. Because regions differ in economic level, traffic development, and so on, the amount of spatial data generated in different regions differs in practice. SpatialGraphx, however, partitions the overall data uniformly by region (as in fig. 1), so the amount of data inside each region differs, and so does the size of the graph maintained by each slave node. When a user queries the subgraph of the left rectangular portion shown in the figure, only slave node A in the SpatialGraphx cluster is working. This causes a serious load imbalance in the cluster, both in graph storage and in graph query operations, and reduces query efficiency.
Second, the graph's data-reading interface does not meet the requirements of real scenarios, such as analyzing local data within a general data set and cross-analyzing the same region across two spatial data sets. For example, for a major explosion in New York City, it is necessary to query the range covering all relevant data of New York City and then analyze the data in that range; likewise, for an event such as an IS terrorist attack in a region, data from different times must be used to cross-analyze the region in order to study the connections between the data in that region.
The important components for supporting spatial graph operations in a Spark cluster are GraphX and Spark SQL (Spark's module for processing structured data). GraphX is a graph computation framework for large-scale graph computation and analysis, which it performs in parallel over each vertex and each edge of the graph. In a real graph, however, some regions are sparse and others dense, so the distribution of edges is not uniform. For network data, two connected vertices communicate with each other, and communication between vertices that are close together is frequent; when the vertices of a cluster are distributed to different partitions, the distance between two vertices determines the communication cost to a certain extent. GraphX stores data in Hadoop's HDFS (Hadoop Distributed File System), and all data is held in the form of RDDs (resilient distributed datasets). GraphX provides many graph operations, such as subgraph, map, and reduceByKey. These operations target a complete large graph and must traverse every edge and every vertex in it. Operations on a local graph are therefore built on the subgraph query: to operate on a local graph of a large graph, the large graph must first be traversed to extract the required small graph.
The GraphX framework has four graph partitioning strategies. GraphLab 2.0 proposed and demonstrated that storing a graph by vertex-cut performs better than storing it by edge-cut, and that vertex-cut can minimize communication across edges. The four strategies in GraphX all use vertex-cut: RandomVertexCut, CanonicalRandomVertexCut, EdgePartition1D, and EdgePartition2D. The first hashes the ids of the source and destination vertices and uses the hash value as the partition id of the edge connecting them. The second ignores the direction of the edge: it hashes the two vertex ids in canonical order, with the smaller id first, and uses the result as the partition id of the edge. The third considers only the source vertex when partitioning edges. The fourth is the most complicated and determines the partition id of an edge from a connection (adjacency) matrix of source and destination vertex ids. These methods assign edge partitions to different machine nodes, and even edges that communicate with each other relatively frequently may be placed on different machines, which makes communication across nodes relatively expensive.
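The four strategies can be illustrated with a minimal Python sketch. It is not GraphX's code: Python's built-in `hash` stands in for Scala's `hashCode`, the 2D variant omits GraphX's handling of partition counts that are not perfect squares, and the function names are hypothetical, so the concrete partition ids will differ from those GraphX produces.

```python
import math

# Large prime used to spread consecutive vertex ids across partitions.
MIXING_PRIME = 1125899906842597

def random_vertex_cut(src, dst, num_parts):
    """Hash the ordered (src, dst) pair; direction matters."""
    return abs(hash((src, dst))) % num_parts

def canonical_random_vertex_cut(src, dst, num_parts):
    """Ignore direction by ordering the ids (smaller first) before hashing."""
    lo, hi = (src, dst) if src < dst else (dst, src)
    return abs(hash((lo, hi))) % num_parts

def edge_partition_1d(src, dst, num_parts):
    """Use only the source vertex, so all out-edges of a vertex are co-located."""
    return abs(src * MIXING_PRIME) % num_parts

def edge_partition_2d(src, dst, num_parts):
    """Simplified 2D scheme: pick a cell of a side x side grid laid over the
    adjacency matrix, column from the source id and row from the destination id."""
    side = math.ceil(math.sqrt(num_parts))
    col = abs(src * MIXING_PRIME) % side
    row = abs(dst * MIXING_PRIME) % side
    return (col * side + row) % num_parts
```

None of these functions looks at where a vertex is located, which is why edges between geographically close vertices can still land on different machines.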
Spark SQL is Spark's module for structured data processing. It provides the DataFrame programming abstraction, which is somewhat similar to an RDD, except that the data set represented by a DataFrame is organized in named columns, whereas an RDD is a collection of rows (records). DataFrame operations resemble SQL and include "where", "select", and so on. A Spark SQL query must also traverse all of the DataFrame data and compare the relevant fields to find the desired result; this process involves many comparisons with irrelevant data, which hurts query performance. Spark SQL sits on top of Apache Spark, provides the DataFrame API for relational operations, and simplifies large-scale data processing in Spark. It also provides a language layer for interactive SQL queries. When running a query, Spark SQL translates the SQL query into RDD operations and then executes them, which takes a long time when the data is large. In addition, Spark SQL does not support spatial data types or spatial operations, so spatial data is processed in the same way as ordinary data and its spatial attributes are not exploited.
Disclosure of Invention
In order to solve the problems, the invention provides a graph computing system and method supporting spatial data management based on a Spark platform.
In order to achieve the purpose, the invention adopts the following technical scheme:
a graph computation system supporting spatial data management based on a Spark platform comprises a data storage layer, a spatial query layer and a graph computation layer, wherein:
the data storage layer receives spatial data, divides the spatial range of the spatial data into a plurality of rectangular areas according to geographical position information, distributes the data in each rectangle to different partitions, divides the partitions into grids, sorts the grids, performs spatial mapping of data and establishes a quadtree index;
the space query layer receives a query request of the graph calculation layer, converts the query request into data query, searches in the index of the data storage layer in the graph range of the request, and uploads a query result;
the graph calculation layer sends a space operation request, receives a feedback query result, and distributes edges, with the distance within a set range, in the query result to the same partition based on a graph partition strategy of the position to realize local graph construction.
A graph computing method supporting spatial data management based on a Spark platform comprises the following steps:
(1) receiving spatial data, dividing a spatial range of the spatial data into a plurality of rectangular areas according to geographical position information, distributing the data in each rectangle to different partitions, dividing the partitions into grids, sequencing the grids, performing spatial mapping on the data, and establishing a quadtree index;
(2) receiving a query request of a graph, converting the query request into a data query, and searching in an index of a data storage layer in the graph range of the request;
(3) and according to the fed-back query result, if a plurality of graph data exist, carrying out space connection of a plurality of graphs, and distributing edges, which are within a set range, in the query result to the same partition by a graph partition strategy based on the position to realize local graph construction.
In the step (1), the spatial data set is divided into n rectangular areas according to the geographical position information, the size of n is determined by the number of cluster nodes, and the division of the areas is adjusted according to the density of the spatial data, so that the data volume contained in each area is ensured to be as uniform as possible.
In the step (1), the data set is divided into unequal rectangles according to the spatial range of the data set, and the size of the rectangular area changes with the change of the data volume.
In the step (1), the data in each rectangular region is divided into n × n grids, and the grids are sorted by using a Z-order curve and labeled sequentially, so that two-dimensional spatial data containing position information in the rectangular region is mapped to a one-dimensional space; the data mapped to the one-dimensional space is stored in a file, and an array records, for each grid, the position of that grid's data in the file.
In the step (1), a quadtree index is established for the data in each grid, and the data volume maintained by the bottommost leaves is homogenized and balanced by a density cutting method based on a Voronoi diagram.
In the step (2), a graph query request is received and converted into data types of point, line or polygon data, and the relationship between the data types is determined.
In the step (2), the master node in the cluster converts the request into a data query and sends it to each slave node; the slave nodes search their local index trees in parallel over the data range of the polygon until the leaf nodes are reached, and the data points within the polygon range are taken out during the search to obtain the result of the range query.
In the step (3), the step of spatially connecting includes:
(3-1) calculating the grid id of each space record in two space data sets of the RDD file;
(3-2) denoting the computational load on node i by L_i, where every L_i is 0 at the start of graph computation;
(3-3) calculating, for each grid, the amount of data located in it, and if the loads of the two partitions performing the connection operation are the same, transferring the smaller data set to the partition holding the larger data set;
and (3-4) connecting the two data sets in the same partition.
In the step (3-3), if the loads of the two partitions performing the connection operation differ, the time required for communication between the two partitions is calculated, and the times are compared to decide how to transfer the data sets in the two partitions.
In the step (3), the graph partitioning policy is to allocate edges, which are within a set range, in the query result to the same partition.
The invention has the beneficial effects that:
(1) the invention is based on the Spark system and achieves highly distributed computation through iterative in-memory computation, so that large-scale graphs can be computed and analyzed and the computation speed is improved;
(2) according to the method, a new data partitioning strategy Z Curve Hashing is adopted to balance the data load in different partitions, local indexes are respectively established for the graph data at each node by utilizing a Voronoi density-based partitioning mode, a good QuadTree index structure is established for the graph data to manage the data, the load balance of graph storage is realized, the system parallelization during graph calculation is maximized, and the graph processing speed is improved;
(3) the invention uses a new region-based graph partitioning strategy to allocate edges that are close to one another and communicate frequently to the same partition as far as possible, thereby reducing edge communication between partitions, shortening graph construction time, and reducing the communication cost of later graph processing;
(4) according to the method, the QuadTree index structure can be searched on the basis of the graph data to find the graph data to be inquired, so that the whole graph is prevented from being traversed, and quick and direct subgraph construction is realized;
(5) the invention realizes direct space range query and space connection operation on the graph by expanding graph query, and meets the requirements of a plurality of scenes.
Drawings
FIG. 1 is a system architecture diagram of the present invention;
FIG. 2 is a schematic diagram of data region division according to the present invention;
FIG. 3(a) is a schematic view of the ZCH partition strategy of the present invention;
FIG. 3(b) is a schematic diagram of cluster space data partitioning according to the present invention;
FIG. 4 is a schematic view of Voronoi density-based data segmentation in accordance with the present invention;
FIG. 5 is a schematic diagram of a spatial operation tree according to the present invention;
FIG. 6 is a schematic view of the operation of the spatial join of the present invention;
FIG. 7 is a schematic diagram of a graph partitioning strategy based on regions according to the present invention.
The specific implementation mode is as follows:
the invention is further described with reference to the following figures and examples.
A graph computation framework supporting spatial data management based on the SpatialGraphx platform comprises three layers in addition to the bottommost data source:
1) the data management layer provides a sound spatial data management mechanism by using the ZCH (Z Curve Hashing) data partitioning method and building a QuadTree index over the underlying spatial data, thereby balancing the data load;
2) the spatial operation layer extends the DataFrame of Spark SQL to add operations on spatial data such as range query and spatial join;
3) the graph computation layer adopts a location-based graph partitioning strategy that allocates edges that are close to one another to the same partition as far as possible, improving the efficiency of local graph construction and graph computation.
The GeoGraphx framework system is implemented from top to bottom as follows:
1. graph computation layer
Firstly, constructing a complete large graph when a data source is read into a system, and carrying out space operation on the basis of the large graph, such as graph range query, multi-graph space connection and the like;
transmitting the space operation request to a lower space query layer for service logic processing, and returning the space operation result to the graph calculation layer;
constructing a subgraph by utilizing a position-based graph partitioning strategy (distributing data with closer spatial data geographic positions to the same partition to construct a graph) aiming at the returned spatial operation result;
performing other space operations and graph analysis on the basis of the constructed subgraph;
2. spatial query layer
The spatial query layer receives the spatial query request from the graph computation layer and uses the DataFrame to automatically convert the graph query request into the Point, Line, and Polygon data types and the 'in', 'overlaps', and 'intersects' relationships among them.
A range query is a relationship between the Point data type and the Polygon data type: it queries the data points located inside a Polygon. When the graph computation layer submits a graph range query request, the master node in the cluster converts it into a data query and sends it to each slave node. The slave nodes search their local index trees in parallel over the polygonal data range until the leaf nodes are reached. During the search, the data points inside the Polygon are taken out to obtain the query result.
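The Point-in-Polygon test that underlies a range query can be sketched as follows. This is a minimal illustration using the standard ray-casting rule, not the invention's implementation; the function name `point_in_polygon` is hypothetical, the polygon is assumed to be a simple polygon given as a list of (x, y) vertices, and points lying exactly on the boundary are not treated specially.

```python
def point_in_polygon(point, polygon):
    """Return True if `point` lies inside `polygon` (ray-casting rule).

    A horizontal ray is cast from the point towards +x; the point is inside
    when the ray crosses the polygon boundary an odd number of times.
    """
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

square = [(0, 0), (4, 0), (4, 4), (0, 4)]
hits = [p for p in [(2, 2), (5, 2), (1, 3)] if point_in_polygon(p, square)]
```

In a slave node, such a test would only need to be applied to the points of the leaf nodes that the index search has not already pruned.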
A spatial join receives two spatial data sets R and S, converts the operation received from the graph computation layer into a relationship θ between the data sets (θ is one of in, overlaps, and intersects), and returns the set of data pairs <a, b> with a ∈ R and b ∈ S that satisfy θ, where a and b can each be a Point, a Line, or a Polygon.
The main steps for realizing the spatial connection are as follows:
1) calculating the grid id of each space record in the RDD files R and S;
2) denoting the computational load on node i by L_i, where every L_i is 0 at the start of graph computation;
3) calculating, for each grid, the amount of data located in it; if the loads of the two partitions performing the join operation are the same, transferring the smaller data set to the partition holding the larger data set; if the loads of the two partitions differ, calculating the time required for communication between the two partitions and comparing the times to decide how to transfer the data sets in the two partitions;
4) the two data sets are connected in the same partition.
3. Management layer of spatial data
Reading a spatial data set to be processed by utilizing SparkContext in a Spark cluster configured with the GeoGraphx framework;
dividing a space range of a space data set into n rectangular areas (the number of cluster nodes is n) according to geographical position information, and adjusting the division of the rectangular areas according to the density of the space data to ensure that the data volume contained in each area is uniform as much as possible;
then, distributing the data in each rectangle to different partitions, namely, respectively maintaining and managing the data in the n rectangular areas by the n nodes;
and respectively establishing local indexes of the spatial data in each slave node. Each node partitions the data of its rectangle into n × n grids, sorts the grids by using Z Curve Hashing, and maps the two-dimensional spatial data containing position information to a one-dimensional space; the data mapped to the one-dimensional space is stored in a file, and an array records the position in the file of the data of each grid (that is, which grid each record belongs to); a QuadTree index is then established for the data in each grid, and the amount of data maintained by the bottom-level leaves is evened out by a density-based cutting method using a Voronoi diagram.
In the embodiment, mobile call data is taken as an object for explanation, and the Spark cluster node takes 4 nodes as an example.
First, fig. 1 shows the overall framework of the GeoGraphx system: the source data is at the bottom layer, followed from bottom to top by data storage, spatial query, and graph computation.
1. A data storage layer:
(1) when spatio-temporal data are transmitted into the system, the spatio-temporal data are subjected to unequal rectangular division according to the spatial range in the data set, in the embodiment, since the cluster nodes are 4, the data set is divided into 4 rectangular areas, wherein the size of each rectangular area changes along with the change of the data volume. As shown in fig. 2, the area of the rectangular block with sparse data amount is relatively large, and the data amount in each rectangular area is kept equal as much as possible.
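One way to obtain rectangles of unequal size but roughly equal data volume is to split recursively at the median, as in the following sketch. The text does not state how the division is computed, so the median rule, the choice to split along the longer side, the restriction to a power-of-two number of areas, and the name `split_balanced` are all assumptions made for illustration.

```python
def split_balanced(points, bounds, parts):
    """Split `bounds` = (xmin, ymin, xmax, ymax) into `parts` rectangles
    (a power of two) that each hold about the same number of points."""
    if parts == 1:
        return [(bounds, list(points))]
    xmin, ymin, xmax, ymax = bounds
    # Split along the longer side of the current rectangle.
    axis = 0 if (xmax - xmin) >= (ymax - ymin) else 1
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    lo, hi = (xmin, xmax) if axis == 0 else (ymin, ymax)
    # Cut at the median coordinate; fall back to the midpoint if there is no data.
    cut = pts[mid][axis] if pts else (lo + hi) / 2
    if axis == 0:
        first, second = (xmin, ymin, cut, ymax), (cut, ymin, xmax, ymax)
    else:
        first, second = (xmin, ymin, xmax, cut), (xmin, cut, xmax, ymax)
    return (split_balanced(pts[:mid], first, parts // 2)
            + split_balanced(pts[mid:], second, parts // 2))

# Six records crowded near the left edge and two far away.
calls = [(0.1, 0.1), (0.2, 0.5), (0.3, 0.9), (0.4, 0.2),
         (0.5, 0.7), (0.6, 0.3), (5, 5), (9, 9)]
regions = split_balanced(calls, (0, 0, 10, 10), 4)
sizes = [len(p) for _, p in regions]
```

With this sample, each of the four rectangles receives two records, and the rectangles covering the sparse part of the range are much larger than those covering the dense part.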
(2) And partitioning the data set in the rectangular block by using a Z Curve Hashing data partitioning mode.
The data in the rectangular block is first divided into n × n grids, as shown in fig. 3(a). The grids are then sorted using a Z-order curve, also shown in fig. 3(a), and the sorting result is marked with labels, so that the two-dimensional spatial data in the rectangle is mapped to one dimension. All the grids are numbered in order from 0 to n × n - 1.
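The Z-order numbering can be sketched by interleaving the bits of the row and column indices of a grid cell. The sketch assumes that n is a power of two and that column bits occupy the even positions; which axis comes first is a convention that the text does not fix, and the name `z_order_id` is hypothetical.

```python
def z_order_id(row, col, bits):
    """Interleave the low `bits` bits of row and col into one Z-order id."""
    z = 0
    for i in range(bits):
        z |= ((col >> i) & 1) << (2 * i)       # column bits go to even positions
        z |= ((row >> i) & 1) << (2 * i + 1)   # row bits go to odd positions
    return z

n = 4  # a 4 x 4 grid needs 2 bits per axis
grid = [[z_order_id(r, c, 2) for c in range(n)] for r in range(n)]
# grid[0] is [0, 1, 4, 5] and grid[1] is [2, 3, 6, 7]: the ids follow a "Z" shape
```

Cells that are adjacent in two dimensions tend to receive nearby ids, which is what makes the curve useful for mapping spatial data to one dimension.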
The grid id is then reduced modulo the number of nodes by a hash function, h(key) = key mod p, where key is the grid id and p is the number of nodes in the cluster; in this embodiment p = 4. The data in the grid whose id is key is allocated to node h(key), and h(key) takes the values 0, 1, 2, and 3. After the hash mapping, the data in the grids with ids 0, 4, 8, and 12 shown in fig. 3(b) is allocated to cluster node 0.
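The allocation for the embodiment (16 grids, p = 4) can be checked with a few lines of Python; the helper name `node_for_grid` is hypothetical.

```python
def node_for_grid(grid_id, num_nodes):
    """h(key) = key mod p: the node that stores the data of grid `grid_id`."""
    return grid_id % num_nodes

assignment = {}
for gid in range(16):  # grid ids 0..15 of a 4 x 4 grid
    assignment.setdefault(node_for_grid(gid, 4), []).append(gid)
# assignment[0] is [0, 4, 8, 12], matching the grids allocated to node 0
```

Because consecutive Z-order ids go to different nodes, the grids of any compact query region are spread across the cluster instead of falling on a single node.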
(3) And for each node which is distributed to the data, establishing a local QuadTree index for the data set locally managed by the machine by using a Voronoi density segmentation-based mode.
In the mobile communication data used in the embodiment, the position information in the data record is determined based on the base station, and a Voronoi diagram of the data can be obtained through the position distribution of the base station. Each polygon in fig. 4 represents an area that can be covered by one base station, and the distribution of the base stations is related to economic development and is non-uniformly distributed.
Then, a local index tree is established, each polygon in the Voronoi diagram is a leaf node in the tree, and the upper layer maintains a wider range of data positions. The size of the amount of data maintained by almost every leaf node in the local index tree is uniform.
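Since a Voronoi cell is the set of locations closer to its site than to any other site, assigning a record to a leaf amounts to finding the nearest base station. The following sketch illustrates this with made-up station coordinates and records; the names `nearest_station`, `stations`, and `records` are hypothetical and the data is not from the embodiment.

```python
import math
from collections import Counter

def nearest_station(point, stations):
    """Return the id of the base station whose Voronoi cell contains `point`."""
    return min(stations, key=lambda sid: math.dist(point, stations[sid]))

# Hypothetical base stations (the Voronoi sites) and call records.
stations = {"A": (0.0, 0.0), "B": (10.0, 0.0), "C": (5.0, 8.0)}
records = [(1, 1), (9, 1), (5, 7), (2, 0), (6, 6)]

# Number of records maintained by each leaf (one leaf per Voronoi cell).
leaf_sizes = Counter(nearest_station(p, stations) for p in records)
```

Counting records per cell in this way shows how evenly the leaves are loaded.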
2. Spatial query layer
The spatial operation submitted by a user on the graph is converted into an operation aiming at the RDD at a spatial query layer, and all data structures and operations are at the RDD level. In the conversion process, a plurality of items of new spatial data types and spatial relations are expanded in the invention.
Spatial data types
Three data types, Point, Line, and Polygon, are added by extending the user-defined types of the Spark SQL framework.
Spatial data relationships
For the three newly added data types, three new data relationships, in, overlaps, and intersects, are added in the invention by extending the user-defined functions (UDF) of the Spark SQL framework. Here, in means that a Point is located in a Line or Polygon; overlaps means that two Points, two Lines, or two Polygons overlap; and intersects means that they cross.
Spatial operational transformation
For a spatial query request passed by the graph computation layer, a spatial operation is first converted to an operation for RDD at the spatial query layer.
For spatial operation, a corresponding operation tree is established from bottom to top, as shown in fig. 5;
Spark SQL uses a pattern-matching function to recurse over all nodes of the tree and converts each DataFrame function into the corresponding RDD operation, that is, the SQL syntax tree is converted into an RDD implementation tree;
and finally, the sparkSQL sequentially traverses the RDD realization tree to obtain the RDD operation request corresponding to the space operation request.
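The conversion of an operation tree into RDD-level steps can be sketched as a recursive function with one case per node type. This is a toy model, not Spark SQL's planner: the node classes `Scan`, `RangeQuery`, and `SpatialJoin`, the function `to_rdd_plan`, and the textual step descriptions are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Scan:
    name: str            # data set to read

@dataclass
class RangeQuery:
    child: object        # input of the range query
    region: str          # description of the query polygon

@dataclass
class SpatialJoin:
    left: object
    right: object
    relation: str        # in, overlaps, or intersects

def to_rdd_plan(node):
    """Translate an operation tree bottom-up into an ordered list of RDD steps."""
    if isinstance(node, Scan):
        return [f"load RDD {node.name}"]
    if isinstance(node, RangeQuery):
        return to_rdd_plan(node.child) + [f"filter by index within {node.region}"]
    if isinstance(node, SpatialJoin):
        return (to_rdd_plan(node.left) + to_rdd_plan(node.right)
                + [f"join on {node.relation}"])
    raise TypeError(f"unsupported node: {node!r}")

tree = SpatialJoin(RangeQuery(Scan("R"), "polygon P"), Scan("S"), "overlaps")
plan = to_rdd_plan(tree)
```

Supporting a further spatial operation in this model means adding one node class and one matching case, which mirrors the extension procedure described in the text.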
To add the two spatial operations, range query and spatial join, to Spark SQL, corresponding abstract classes must be added to the logical abstract classes, corresponding implementation classes must be added to the execution classes, and corresponding cases must be added to the pattern-matching function. Further spatial operations can be added to the API layer in the same way.
(2) Spatial range query
After a range query request transmitted by the upper graph computation layer is obtained, the master node in the GeoGraphx cluster converts the request and sends it to all slave nodes;
after each slave node receives a query signal issued by a master, an index part is taken out from data RDD stored locally, the indexed QuadTree is searched, and all leaf nodes meeting the query condition are obtained by comparing a space range maintained by a tree node with the query condition and searching from a root node to the leaf nodes;
and according to the obtained index result, taking out the data corresponding to the index from the data RDD, and transmitting the data back to the graph calculation layer.
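The root-to-leaf search described above can be sketched with a small point quadtree. The sketch makes several simplifying assumptions: the query range is an axis-aligned rectangle rather than a general polygon, leaves split at a fixed capacity instead of by Voronoi density, and the class name `QuadTree` and its methods are hypothetical.

```python
MAX_DEPTH = 16  # stops splitting when many points share one location

class QuadTree:
    def __init__(self, bounds, capacity=2, depth=0):
        self.bounds = bounds          # (xmin, ymin, xmax, ymax)
        self.capacity = capacity
        self.depth = depth
        self.points = []
        self.children = None          # None while this node is a leaf

    def _child_for(self, p):
        xmin, ymin, xmax, ymax = self.bounds
        cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2
        return self.children[(1 if p[0] >= cx else 0) + (2 if p[1] >= cy else 0)]

    def insert(self, p):
        if self.children is not None:
            self._child_for(p).insert(p)
            return
        self.points.append(p)
        if len(self.points) > self.capacity and self.depth < MAX_DEPTH:
            xmin, ymin, xmax, ymax = self.bounds
            cx, cy = (xmin + xmax) / 2, (ymin + ymax) / 2
            quads = ((xmin, ymin, cx, cy), (cx, ymin, xmax, cy),
                     (xmin, cy, cx, ymax), (cx, cy, xmax, ymax))
            self.children = [QuadTree(q, self.capacity, self.depth + 1) for q in quads]
            pts, self.points = self.points, []
            for q in pts:
                self._child_for(q).insert(q)

    def query(self, rect):
        """Return the stored points inside rect = (xmin, ymin, xmax, ymax)."""
        qx0, qy0, qx1, qy1 = rect
        xmin, ymin, xmax, ymax = self.bounds
        if qx1 < xmin or qx0 > xmax or qy1 < ymin or qy0 > ymax:
            return []                 # prune: this subtree cannot match
        if self.children is None:
            return [p for p in self.points
                    if qx0 <= p[0] <= qx1 and qy0 <= p[1] <= qy1]
        return [p for c in self.children for p in c.query(rect)]

tree = QuadTree((0, 0, 16, 16))
for pt in [(1, 1), (2, 3), (5, 5), (9, 9), (12, 3), (15, 15), (3, 12)]:
    tree.insert(pt)
found = sorted(tree.query((0, 0, 6, 6)))
```

Subtrees whose spatial range does not meet the query rectangle are skipped entirely, so only part of the data is compared against the query condition.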
(3) Space joining operation
① Denote the two RDDs to be spatially joined as rdd1 and rdd2, and first obtain their indexes index1 and index2;
② Find the index pairs <n1, n2> of the two indexes that satisfy the following conditions: n1 is a leaf node of index1, n2 is a leaf node of index2, and n1 and n2 satisfy the join condition. Record the size of the data corresponding to n1 and n2 in each index pair;
③ In each index pair, denote the index with the larger amount of data as ni and the other index, with the smaller amount of data, as nj; the partition where the data of ni is located is pi, so a record <pi, nj> is derived for each such index pair;
④ Filter the data of rdd1 and rdd2 to keep the data in the node pairs from ②, and redistribute it according to the records <pi, nj> from ③ to obtain rdd3 and rdd4;
⑤ Perform the join operation on the resulting rdd3 and rdd4 to obtain the final join result.
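Steps ② and ③ can be sketched as follows, operating on plain descriptions of leaf nodes rather than on RDDs. The sketch assumes that the join condition is an intersection of the leaves' bounding rectangles and that each leaf is a dictionary with the keys `id`, `bounds`, `size`, and `partition`; these names and the function `plan_join` are hypothetical.

```python
def rects_intersect(a, b):
    """True if rectangles a and b, each (xmin, ymin, xmax, ymax), share any point."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def plan_join(leaves1, leaves2):
    """For each matching leaf pair, emit (target_partition, leaf_to_move):
    the smaller leaf is shipped to the partition that holds the larger one."""
    moves = []
    for n1 in leaves1:
        for n2 in leaves2:
            if not rects_intersect(n1["bounds"], n2["bounds"]):
                continue              # pair cannot satisfy the join condition
            big, small = (n1, n2) if n1["size"] >= n2["size"] else (n2, n1)
            moves.append((big["partition"], small["id"]))
    return moves

leaves1 = [{"id": "a1", "bounds": (0, 0, 4, 4), "size": 100, "partition": 0},
           {"id": "a2", "bounds": (5, 0, 8, 4), "size": 10, "partition": 1}]
leaves2 = [{"id": "b1", "bounds": (1, 1, 3, 3), "size": 20, "partition": 2},
           {"id": "b2", "bounds": (6, 1, 7, 3), "size": 50, "partition": 3}]
moves = plan_join(leaves1, leaves2)
```

Moving the smaller side of each pair keeps the amount of data sent over the network low, and leaf pairs that cannot match are never shipped at all.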
3. Graph computation layer
The graph computation layer mainly interacts with the spatial query layer, and a user performs a series of operations on the graph at the graph computation layer. When spatial data enters the GeoGraphx framework, a big graph G is constructed for the spatial data in addition to managing the data.
First, the graph is constructed using the new graph partitioning strategy; during construction, the data is divided into a plurality of grids according to geographical position, as shown in fig. 7. In the embodiment, with 4 nodes in the cluster, the location-based graph partitioning strategy divides the location range of the RDD data into 4 areas: the area labeled 0 is placed on node 0, the area labeled 1 on node 1, and so on, so that edges that are close to one another in the resulting large graph are located on the same partition node.
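A location-based edge partitioner for the 4-node embodiment can be sketched as follows. The text does not say which point of an edge decides its area, so placing an edge by the midpoint of its two endpoints is an assumption, as are the 2 × 2 layout of the areas and the names `region_of` and `partition_edges`.

```python
def region_of(pos, bounds, rows, cols):
    """Label of the area (numbered row by row from 0) that contains `pos`."""
    xmin, ymin, xmax, ymax = bounds
    c = min(int((pos[0] - xmin) / (xmax - xmin) * cols), cols - 1)
    r = min(int((pos[1] - ymin) / (ymax - ymin) * rows), rows - 1)
    return r * cols + c

def partition_edges(edges, positions, bounds, rows=2, cols=2):
    """Group edges by the area containing the midpoint of their endpoints."""
    parts = {}
    for src, dst in edges:
        mid = ((positions[src][0] + positions[dst][0]) / 2,
               (positions[src][1] + positions[dst][1]) / 2)
        parts.setdefault(region_of(mid, bounds, rows, cols), []).append((src, dst))
    return parts

# Hypothetical vertex locations and edges.
positions = {1: (1, 1), 2: (2, 2), 3: (8, 8), 4: (9, 7), 5: (1, 9)}
edges = [(1, 2), (3, 4), (2, 5)]
parts = partition_edges(edges, positions, (0, 0, 10, 10))
```

Unlike the hash-based strategies, the partition id here depends only on geography, so edges in the same neighbourhood share a node and a regional subgraph touches few partitions.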
The user performs spatial and other operations on a local graph on the basis of the large graph G. When the user chooses to perform a range query on the local graph, the request is transmitted to the spatial query layer, which handles the business logic and returns the processing result to the graph computation layer;
two large graphs corresponding to two RDD data are constructed in the system, when space join operation is carried out on the two large graphs, a request is transmitted to a space query layer, the join operation on the graphs is converted into the join operation of the RDD data, and an operation result is returned to a graph calculation layer;
and after obtaining the operation result, the graph calculation layer constructs a result small graph by using the graph partition strategy based on the region again, and displays the result graph to the user.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, this is not intended to limit the scope of the present invention, and those skilled in the art should understand that various modifications and variations can be made, without inventive effort, on the basis of the technical solution of the present invention.

Claims (5)

1. A graph computing system supporting spatial data management based on the Spark platform, characterized in that the system comprises a data storage layer, a spatial query layer and a graph computation layer, wherein:
the data storage layer receives the spatial data, divides the spatial range of the spatial data into a plurality of rectangular areas according to the geographical position information, adjusts the division of the areas according to the density of the spatial data, distributes the data in each rectangle to different partitions, divides the partitions into grids, orders the grids, performs spatial mapping on the data, and establishes a quadtree index;
the data in each rectangular area is divided into n × n grids; the grids are ordered using a Z-order curve and labeled sequentially; the two-dimensional spatial data containing position information in the rectangular area is mapped to a one-dimensional space; the data mapped to the one-dimensional space is stored in a file, and an array simultaneously records the position within the file of the data of each grid;
a quadtree index is established for the data in each grid, and the data volume held by the bottom-level leaves is evened out and balanced using a density cutting method based on Voronoi diagrams;
the spatial query layer receives a query request from the graph computation layer, converts the query request into a data query, searches the index of the data storage layer within the requested graph range, and passes the query result back up;
the data is converted into point, line or polygon data types and the relationships among the data types are determined; several new spatial data types and spatial relationships are added during the conversion by extending the Spark SQL framework with three data types, namely points, lines and polygons; for the three newly added data types, three new data relationships are added to the Spark SQL framework, namely a point lying within a line or a polygon, and two points, two lines or two polygons overlapping or intersecting;
to add the two spatial operations of range query and spatial join to the Spark SQL framework, a corresponding abstract class is added to the local abstract class, a corresponding implementation class is added in execution, and a corresponding case is added to the pattern matching function;
the master node in the cluster converts the request into a data query and sends the data query to each slave node; a synchronous parallel search over the polygonal data range is carried out in the local index tree of each slave node until a leaf node is found, and the data points within the polygonal range are extracted during the search to obtain the query range; the graph computation layer sends a spatial operation request, receives the returned query result, and, using a location-based graph partitioning strategy, distributes edges of the query result whose distance is within a set range to the same partition to construct the local graph;
according to the returned query result, if there are multiple graph data, a spatial join of the multiple graphs is performed,
the spatial join comprising:
calculating the grid id of each spatial record in the two spatial data sets of the RDD file;
letting L_i denote the computational load on node i, where every L_i is 0 at the start of graph computation;
determining the computing location for each grid: if the loads of the two partitions performing the join operation are the same, the smaller data set is transferred to the partition holding the larger data set;
joining the two data sets within the same partition.
2. A graph computation method supporting spatial data management based on the Spark platform, characterized in that the method comprises the following steps:
(1) receiving spatial data, dividing the spatial range of the spatial data into a plurality of rectangular areas according to geographical position information, adjusting the division of the areas according to the density of the spatial data, distributing the data in each rectangle to different partitions, dividing the partitions into grids, ordering the grids, performing spatial mapping on the data, and establishing a quadtree index;
dividing the data in each rectangular area into n × n grids, ordering the grids using a Z-order curve and labeling them sequentially, mapping the two-dimensional spatial data containing position information in the rectangular area to a one-dimensional space, storing the data mapped to the one-dimensional space in a file, and using an array to simultaneously record the position within the file of the data of each grid;
establishing a quadtree index for the data in each grid, and evening out and balancing the data volume held by the bottom-level leaves using a density cutting method based on Voronoi diagrams;
(2) receiving a query request for a graph, converting the query request into a data query, and searching the index of the data storage layer within the requested graph range; converting the data into point, line or polygon data types and determining the relationships among the data types; adding several new spatial data types and spatial relationships during the conversion by extending the Spark SQL framework with three data types, namely points, lines and polygons; for the three newly added data types, adding three new data relationships to the Spark SQL framework, namely a point lying within a line or a polygon, and two points, two lines or two polygons overlapping or intersecting;
to add the two spatial operations of range query and spatial join to the Spark SQL framework, adding a corresponding abstract class to the local abstract class, adding a corresponding implementation class in execution, and adding a corresponding case to the pattern matching function;
the master node in the cluster converting the request into a data query and sending the data query to each slave node, a synchronous parallel search over the polygonal data range being carried out in the local index tree of each slave node until a leaf node is found, and the data points within the polygonal range being extracted during the search to obtain the query range;
(3) according to the returned query result, if there are multiple graph data, performing a spatial join of the multiple graphs, and, using a location-based graph partitioning strategy, distributing edges of the query result whose distance is within a set range to the same partition to construct the local graph;
in the step (3), the spatial join comprises:
calculating the grid id of each spatial record in the two spatial data sets of the RDD file;
letting L_i denote the computational load on node i, where every L_i is 0 at the start of graph computation;
determining the computing location for each grid: if the loads of the two partitions performing the join operation are the same, the smaller data set is transferred to the partition holding the larger data set;
joining the two data sets within the same partition.
3. The graph computation method supporting spatial data management based on the Spark platform as claimed in claim 2, wherein: in the step (1), the spatial data set is divided into n rectangular regions according to the geographical location information, the value of n being determined by the number of cluster nodes, and the division of the regions is adjusted according to the density of the spatial data so that the data volume contained in each region is uniform.
4. The graph computation method supporting spatial data management based on the Spark platform as claimed in claim 2, wherein: in the step (1), the data set is divided into rectangles of unequal size according to its spatial range, and the size of each rectangular region changes as the data volume changes.
5. The graph computation method supporting spatial data management based on the Spark platform as claimed in claim 2, wherein: in the step (3), if the loads of the two partitions performing the join operation are different, the time required for communication between the two partitions is calculated and compared to decide how to transfer the data sets in the two partitions.
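For illustration only, the grid ordering and the position array recited in claims 1 and 2 can be sketched in Python as below. The sketch assumes that n is a power of two, that the Z-order code interleaves the column bit below the row bit, and that records are given as hypothetical `(col, row, payload)` tuples. None of these details are fixed by the claims.

```python
def z_order(col, row, bits):
    """Interleave the bits of col and row into one Z-order code (col bit lowest)."""
    code = 0
    for i in range(bits):
        code |= ((col >> i) & 1) << (2 * i)
        code |= ((row >> i) & 1) << (2 * i + 1)
    return code

def layout(records, n):
    """Sort records by the Z-order code of their grid and build an offset array.

    offsets[z] is the position in the sorted sequence where the data of the
    grid with code z starts; offsets[z + 1] is where it ends.
    """
    bits = n.bit_length() - 1  # assumes n is a power of two
    ordered = sorted(records, key=lambda r: z_order(r[0], r[1], bits))
    offsets = [0] * (n * n + 1)
    for col, row, _ in ordered:
        offsets[z_order(col, row, bits) + 1] += 1
    for z in range(n * n):
        offsets[z + 1] += offsets[z]
    return ordered, offsets

# Toy data on a 2 x 2 grid.
records = [(1, 1, "d"), (0, 0, "a"), (1, 0, "b"), (1, 1, "e")]
ordered, offsets = layout(records, 2)
grid3 = ordered[offsets[3]:offsets[4]]  # all records of the grid with code 3
```

Storing the records in Z-order keeps spatially adjacent grids close together in the one-dimensional sequence, and the offset array lets the data of any single grid be located without scanning the file.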
CN201610975847.0A 2016-11-07 2016-11-07 Map computing system and method based on Spark platform supporting spatial data management Active CN106528773B (en)

Publications (2)

Publication Number Publication Date
CN106528773A CN106528773A (en) 2017-03-22
CN106528773B (en) 2020-06-26

Family

ID=58350019

Country Status (1)

Country Link
CN (1) CN106528773B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107423368B (en) * 2017-06-29 2020-07-17 中国测绘科学研究院 Spatio-temporal data indexing method in non-relational database
CN107562872B (en) * 2017-08-31 2020-03-24 中国人民大学 SQL-based query method and device for measuring spatial data similarity
CN107644086B (en) * 2017-09-25 2019-05-10 咪咕文化科技有限公司 The location mode of spatial data
CN107818147A (en) * 2017-10-19 2018-03-20 大连大学 Distributed temporal index system based on Voronoi diagram
CN109033234B (en) * 2018-07-04 2021-09-14 中国科学院软件研究所 Streaming graph calculation method and system based on state update propagation
JP7195073B2 (en) * 2018-07-10 2022-12-23 古野電気株式会社 graph generator
CN108959653B (en) * 2018-08-06 2021-06-01 桂林电子科技大学 Graph data representation method based on dense grid recombination and K2 tree
CN109783240B (en) * 2019-01-27 2020-08-25 中国人民解放军国防科技大学 Local optimization structured grid parallel computing load balancing method based on MINMAX
CN109947889A (en) * 2019-03-21 2019-06-28 佳都新太科技股份有限公司 Spatial data management method, apparatus, equipment and storage medium
CN109947778B (en) * 2019-03-27 2022-04-19 联想(北京)有限公司 Spark storage method and system
CN110110108B (en) * 2019-04-09 2021-03-30 苏宁易购集团股份有限公司 Data importing method and device of graph database
CN110674134B (en) * 2019-09-16 2024-02-13 腾讯大地通途(北京)科技有限公司 Geographic information data storage method, query method and device
CN111913965B (en) * 2020-08-03 2024-02-27 北京吉威空间信息股份有限公司 Space big data buffer area analysis-oriented method
CN112765295B (en) * 2021-01-13 2022-12-20 华能新能源股份有限公司 Regional meteorological data splicing system
CN112925789B (en) * 2021-02-24 2022-12-20 东北林业大学 Spark-based space vector data memory storage query method and system
CN115840752B (en) * 2023-02-24 2023-05-02 西安索格亚航空科技有限公司 Global aviation navigation data storage and query method
CN116737392B (en) * 2023-08-11 2023-11-10 北京智网易联科技有限公司 Non-vector data processing method and device and computing equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102289466A (en) * 2011-07-21 2011-12-21 东北大学 K-nearest neighbor searching method based on regional coverage
CN103678550A (en) * 2013-09-09 2014-03-26 南京邮电大学 Mass data real-time query method based on dynamic index structure
CN105138607A (en) * 2015-08-03 2015-12-09 山东省科学院情报研究所 Hybrid granularity distributional memory grid index-based KNN query method
CN105138560A (en) * 2015-07-23 2015-12-09 北京天耀宏图科技有限公司 Multilevel spatial index technology based distributed space vector data management method
CN105589951A (en) * 2015-12-18 2016-05-18 中国科学院计算机网络信息中心 Distributed type storage method and parallel query method for mass remote-sensing image metadata

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Changyuan Wang, et al. SpatialGraphx: A Distributed Graph Computing Framework for Spatial and Temporal Data at Scale. 2016 IEEE 40th Annual Computer Software and Applications Conference. 2016, pp. 609, 610, 611, Figs. 1 and 2. *
SpatialGraphx: A Distributed Graph Computing Framework for Spatial and Temporal Data at Scale; Changyuan Wang, et al.; 2016 IEEE 40th Annual Computer Software and Applications Conference; 2016-08-25; pp. 609-611, Figs. 1 and 2 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200526

Address after: Room 1208, donglingshang block, 102 Gongye South Road, Lixia District, Jinan City, Shandong Province

Applicant after: SHANDONG LIANYOU COMMUNICATION TECHNOLOGY DEVELOPMENT Co.,Ltd.

Address before: 528, room five, building 2766, North Tower, No. 250061, show Road, Ji'nan, Shandong

Applicant before: SHANDONG SHOUXUN INFORMATION TECHNOLOGY Co.,Ltd.

Applicant before: SHANDONG University

GR01 Patent grant