CN116955510B

CN116955510B - Space data versioning management method based on data lake

Info

Publication number: CN116955510B
Application number: CN202310664995.0A
Authority: CN
Inventors: 王瑾晖; 赵慧慧; 黄超; 郭彪; 熊肖
Original assignee: Yizhirui Information Technology Co ltd
Current assignee: Yizhirui Information Technology Co ltd
Priority date: 2023-06-06
Filing date: 2023-06-06
Publication date: 2024-05-14
Anticipated expiration: 2043-06-06
Also published as: CN116955510A

Abstract

The embodiment of the application provides a spatial data versioning management method based on a data lake, which comprises the following steps: writing the processed heterogeneous spatial data into a data lake based on the Spark DataFrameWriter; writing Parquet files of the spatial data into distributed storage, and writing metadata of the spatial data into a transaction log in the distributed storage; recording the log number of the transaction log and the timestamp at which the metadata of the spatial data was written into the transaction log; and reading a historical snapshot of the spatial data based on the log number and the timestamp to realize data restoration and historical data tracking. The application can manage spatial data on HDFS in the form of tables, and realizes multi-version management of the spatial data through record-level update operations and transactional operations on the spatial data.

Description

Space data versioning management method based on data lake

Technical Field

The embodiment of the application relates to the technical field of data processing, in particular to a spatial data versioning management method based on a data lake.

Background

At present, multi-version management schemes for spatial data fall into two groups: traditional schemes that store and spatially analyze the data in a database such as PostGIS, MySQL or Oracle, or in Shapefile files; and schemes that store the spatial data on HDFS.

Traditional systems that store and compute spatial data and provide data services on top of PostGIS, MySQL, Oracle or Shapefile files suffer from single points of storage failure, high storage consumption, scattered spatial data, high maintenance cost, and slow spatial computation. HDFS-based storage of spatial data overcomes these defects, but has problems of its own: spatial data imported into the file system in batches lacks a strict global schema; writes carry no ACID guarantee, so data in an intermediate state may be read; users cannot efficiently upsert or delete historical data; and once a Parquet file has been written to HDFS, changing part of its data requires rewriting the data in full, which is costly.

Disclosure of Invention

In order to solve the problems that spatial data on HDFS does not support transactions and cannot be operated on at the record level, the embodiment of the application provides a spatial data versioning management method based on a data lake.

A space data versioning management method based on a data lake comprises the following steps:

writing the processed heterogeneous spatial data into a data lake based on the Spark DataFrameWriter;

writing Parquet files of the spatial data into distributed storage, and writing metadata of the spatial data into a transaction log in the distributed storage;

recording the log number of the transaction log and the timestamp at which the metadata of the spatial data was written into the transaction log in the distributed storage;

and reading a historical snapshot of the spatial data based on the log number and the timestamp to realize data restoration and historical data tracking.

In one possible implementation, the versioning management of the spatial data is implemented in a delta expansion module.

In one possible implementation manner, after the historical snapshot of the spatial data is read based on the log number and the timestamp to realize data restoration and historical data tracking, the method further comprises performing analysis and calculation on the spatial data.

In one possible implementation, the performing analysis and calculation on the spatial data includes:

Configuring a delta expansion module;

serializing the spatial data into spatial objects and spatial indexes through the Kryo serialization library for Apache Sedona spatial objects;

adding calculation functions that operate on the spatial data by registering the spatial function library of Sedona;

creating a DataFrame dataset based on the configured calculation parameters and the spatial data to be calculated;

and sequentially calling the spatial analysis operator implementation classes based on the SPI mechanism to complete the analysis and calculation of the spatial data.

In one possible implementation manner, completing the analysis and calculation of the spatial data by a single spatial analysis operator implementation class includes:

defining the functions required for the spatial analysis and calculation based on the analysis and calculation requirements of the current operator;

analyzing and calculating the spatial data by invoking, through SQL statements, the spatial functions in the Spark SQL spatial computation function library, and/or by converting an RDD into a SpatialRDD.

In one possible implementation, the analysis and calculation functions of the spatial data include one or more of buffer analysis, intersection calculation, vector clipping, data screening, partitioning by attribute, spatial join, fusion boundary, data spatialization, and area calculation.

In one possible implementation, the required functions include one or more of a spatial vector data initialization function, a spatial vector data calculation function, and a spatial vector data connection aggregation function.

In one possible implementation manner, after the analysis and calculation of the spatial data by converting an RDD into a SpatialRDD, the method further includes:

constructing a directed acyclic graph according to the dependencies of the SpatialRDD, and submitting the directed acyclic graph to a DAG scheduler;

the DAG scheduler parses the directed acyclic graph, divides it into step groups, and submits the step groups to the task scheduler of the Spark cluster;

the Spark task scheduler sends the step groups to the Executors of Spark;

the Executors execute the step groups item by item;

when execution of the step groups is finished, the Executors write the execution result into the SpatialRDD;

and reading the execution result from the SpatialRDD, converting its data format into spatial data readable by a service system, and storing the spatial data.

In one possible implementation manner, before converting the data format into spatial data readable by the service system and storing it, the method further includes:

judging whether to cache the execution result according to the analysis and calculation requirements of the different spatial analysis operators;

and if not, performing the analysis and calculation on the spatial data again.

In one possible implementation, the storing of the spatial data includes storing to a spatial database and/or storing in GeoJson format.

In summary, the application has the following beneficial technical effects:

1. The processed heterogeneous spatial data is written into a data lake based on the Spark DataFrameWriter; Parquet files of the spatial data are written into distributed storage, and metadata of the spatial data is written into a transaction log in the distributed storage; the log number of the transaction log and the timestamp at which the metadata was written into the transaction log are recorded; and a historical snapshot of the spatial data is read based on the log number and the timestamp to realize data restoration and historical data tracking. The application can perform record-level update operations, manage spatial data on HDFS in the form of tables, support transactional operations on the spatial data, guarantee the atomicity, consistency, isolation and durability of the spatial data, and support a multi-version management mechanism for the spatial data.

2. The analysis and calculation flow for spatial data in the data lake can be set up by configuration according to the analysis and calculation requirements, and the spatial data is analyzed and calculated quickly and efficiently through Spark's concurrent processing mechanism and distributed to the service consumers.

It should be understood that the description in this summary is not intended to identify key or essential features of the embodiments of the application, nor is it intended to limit the scope of the application. Other features of the present application will become apparent from the description that follows.

Drawings

The above and other features, advantages and aspects of embodiments of the present application will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements.

Fig. 1 shows a schematic diagram of a spatial data versioning management method based on a data lake according to an embodiment of the present application.

FIG. 2 is a schematic diagram of a spatial data analysis calculation according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application.

In order to facilitate understanding of the embodiments of the present application, some terms related to the embodiments of the present application will be explained first.

A data lake (Delta Lake) is a repository or system that stores data in its raw format without requiring the data to be structured in advance. A data lake may store structured data (e.g., tables in a relational database), semi-structured data (e.g., CSV, logs, XML, JSON), unstructured data (e.g., emails, documents, PDFs), and binary data (e.g., images, audio, video).

Version management is an important function of an engineering database management system. A version is a snapshot recording one of the possible states of a specific object. The task of version management is to record and maintain the historical evolution of objects and to select a suitable topology among versions according to the application background. It comprises at least the following functions: generating new versions; managing all versions in a unified and coordinated way; effectively recording the evolution of the different versions; and recording each version with as little data redundancy as possible. At the same time, the logical consistency and relative independence of the different versions must be ensured, so that creating or deleting one version does not affect the content of the others. When versions are switched and a new current version is specified, the view of the object must remain consistent with the specified version.

Fig. 1 shows a flowchart of a spatial data versioning management method based on a data lake according to an embodiment of the present application, referring to fig. 1, the method includes the following steps:

Step 101, writing the processed heterogeneous spatial data into a data lake based on the Spark DataFrameWriter.

In some embodiments, spatial data is obtained from spatial databases such as PostGIS, PostgreSQL SDE and MySQL, and/or from spatial data files such as Shapefiles and GeoJSON-format JSON files. In other embodiments, spatial data is obtained from a variety of data sources such as S3, FTP/SFTP, and network access interfaces.

The processing of the spatial data comprises spatial data analysis, abnormal data processing, spatial conversion and the like.

In particular, spatial data parsing supports multiple spatial data formats, including GeoJSON, WKT (well-known text)/WKB (well-known binary), ESRI Shapefile, GML (Geography Markup Language), and KML.

GeoJSON is a format for encoding a variety of geographic data structures, ESRI Shapefile is a simple, non-topological format for storing the geometric location and attribute information of geographic features, and GML is a geographic markup language.
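As a non-limiting illustration of the parsing step, the sketch below converts a WKT POINT or single-ring POLYGON string into a GeoJSON-style geometry dictionary. It uses only the Python standard library; the function name `wkt_to_geojson` is hypothetical and is not part of the described system, and a production pipeline would normally rely on a geometry library rather than a hand-written parser.

```python
import re


def wkt_to_geojson(wkt: str) -> dict:
    """Convert a WKT POINT or single-ring POLYGON into a GeoJSON-style dict."""
    match = re.fullmatch(
        r"\s*(POINT|POLYGON)\s*\((.*)\)\s*", wkt, re.IGNORECASE | re.DOTALL
    )
    if match is None:
        raise ValueError(f"unsupported WKT: {wkt!r}")
    kind, body = match.group(1).upper(), match.group(2).strip()

    def parse_coords(text: str) -> list:
        # "x y, x y, ..." -> [[x, y], [x, y], ...]
        return [[float(v) for v in pair.split()] for pair in text.split(",")]

    if kind == "POINT":
        return {"type": "Point", "coordinates": parse_coords(body)[0]}

    # A POLYGON body looks like "(x y, x y, ...)"; strip the ring's parentheses.
    if not (body.startswith("(") and body.endswith(")")) or "(" in body[1:]:
        raise ValueError("only single-ring polygons are supported in this sketch")
    return {"type": "Polygon", "coordinates": [parse_coords(body[1:-1])]}
```

For example, `wkt_to_geojson("POINT (1 2)")` yields a `Point` geometry whose coordinates are `[1.0, 2.0]`.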

Step 102, writing Parquet files of the spatial data into the distributed storage, and writing metadata of the spatial data into a transaction log in the distributed storage.

Parquet is a columnar storage format that can efficiently store nested data. The transaction log guarantees the atomicity, consistency and isolation of spatial data transactions. Writing the metadata of a piece of data into the transaction log in distributed storage is equivalent to one transaction commit in a database.

Specifically, atomicity means that the multiple database operations making up a transaction form an indivisible unit: the transaction is committed only if all operations succeed. If any operation in the transaction fails, every operation already performed must be undone, returning the database to its original state. Consistency means that after a transaction succeeds, the state of the database still conforms to the business rules, i.e. the data is not corrupted. For example, if account A transfers 100 yuan to account B, the sum of the deposits in accounts A and B is unchanged whether or not the operation succeeds. Isolation means that transactions operating on data concurrently each have their own data space and do not interfere with one another. Strictly speaking, complete freedom from interference is not required: the database defines several transaction isolation levels, each permitting a different degree of interference. The higher the isolation level, the better the data consistency but the weaker the concurrency.
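The transfer example can be reproduced with any transactional database. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in, not the data lake itself; the table name `account` and the helpers `transfer` and `total` are hypothetical. The credit is applied before the debit so that a failed debit shows the earlier credit being rolled back.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move `amount` from src to dst; undo every step if any step fails."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            # The CHECK constraint rejects a negative balance, aborting the transfer.
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def total(conn: sqlite3.Connection) -> int:
    """Sum of all balances; a correct transfer never changes it."""
    return conn.execute("SELECT SUM(balance) FROM account").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 150), ("B", 50)])
conn.commit()
```

A first call `transfer(conn, "A", "B", 100)` succeeds; a second identical call would overdraw account A, so it returns `False` and leaves both balances as they were. In both cases `total(conn)` is unchanged.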

In the embodiment of the application, writing the transaction log entry is itself the commit: if the log entry exists, all of its updates are visible, and if it does not exist, none of them are. Parquet files that are not listed in any transaction log entry are never read by a read operation, which guarantees the atomicity of spatial data storage.
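The "log entry is the commit" rule can be sketched with a toy table on the local file system. This is an illustrative model only: JSON files stand in for Parquet files, and the `_log` directory layout and the function names are assumptions made for the example, not the storage format of the described system.

```python
import json
import os
import tempfile


def write_data_file(table: str, name: str, rows: list) -> None:
    """Step 1: write a data file. It stays invisible until a log entry lists it."""
    with open(os.path.join(table, name), "w") as f:
        json.dump(rows, f)


def commit(table: str, version: int, added_files: list) -> None:
    """Step 2: publish the files by creating log entry <version>.json exactly once."""
    log_dir = os.path.join(table, "_log")
    os.makedirs(log_dir, exist_ok=True)
    # Mode "x" fails if the entry already exists, so a version is never overwritten.
    with open(os.path.join(log_dir, f"{version:020d}.json"), "x") as f:
        json.dump({"add": added_files}, f)


def read_table(table: str) -> list:
    """Readers only open data files that some committed log entry lists."""
    log_dir = os.path.join(table, "_log")
    if not os.path.isdir(log_dir):
        return []
    rows = []
    for entry in sorted(os.listdir(log_dir)):
        with open(os.path.join(log_dir, entry)) as f:
            for name in json.load(f)["add"]:
                with open(os.path.join(table, name)) as data:
                    rows.extend(json.load(data))
    return rows


table = tempfile.mkdtemp()
write_data_file(table, "part-0.json", [{"id": 1, "wkt": "POINT (0 0)"}])
commit(table, 0, ["part-0.json"])
# part-1.json is written but not committed, so readers do not see it yet.
write_data_file(table, "part-1.json", [{"id": 2, "wkt": "POINT (1 1)"}])
```

At this point `read_table(table)` returns only the row from `part-0.json`; the second file becomes visible only after `commit(table, 1, ["part-1.json"])`.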

Further, consistency of spatial data storage is ensured by checking the metadata pending on the transaction log for inconsistencies. If the pending metadata does not match the schema stored in the table metadata, the update is rejected, and this mechanism maintains the integrity of the data set. Columns may still be added to the schema, and partition columns added, as long as there is no conflict.
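A minimal sketch of such a schema check follows, assuming schemas are represented as plain dictionaries from column name to type name. The function `merge_schema` and the type names are hypothetical; the sketch only mirrors the rule stated above (reject conflicting columns, accept new non-conflicting ones).

```python
def merge_schema(table_schema: dict, incoming: dict) -> dict:
    """Return the schema after a write, or raise if an existing column's type changes."""
    for column, dtype in incoming.items():
        if column in table_schema and table_schema[column] != dtype:
            raise TypeError(
                f"column {column!r}: {dtype} conflicts with {table_schema[column]}"
            )
    # New, non-conflicting columns are appended to the table schema.
    return {**table_schema, **incoming}
```

With a table schema of `{"id": "long", "geometry": "binary"}`, a write carrying an extra `name` column is accepted and widens the schema, whereas a write that declares `id` as `string` is rejected.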

Further, multi-version concurrency control (MVCC) and optimistic concurrency control are used to guarantee isolation of spatial data. For a read operation, Spark retains the version number in use when reading, so the same version of the spatial data set (a data snapshot) can continue to be read even if another transaction commits an update. For a write operation, the latest version of the data set at the start of the transaction is noted; if a transaction log entry for (that version + 1) already exists at commit time, a conflict with another transaction has occurred.
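The read and write rules above can be sketched with an in-memory log in which the list index of an entry is its version number. `ToyLog` and `ConflictError` are hypothetical names used for illustration; a real implementation would rely on an atomic create-if-absent operation of the storage layer instead of a Python list.

```python
class ConflictError(Exception):
    """Raised when another writer has already committed the next version."""


class ToyLog:
    def __init__(self):
        self.entries = []  # index in this list == version number

    def latest_version(self) -> int:
        return len(self.entries) - 1  # -1 means the table is still empty

    def snapshot(self, version: int) -> list:
        """Files visible at `version`; later commits never change this result."""
        files = []
        for entry in self.entries[: version + 1]:
            files.extend(entry)
        return files

    def try_commit(self, read_version: int, added_files: list) -> int:
        """Commit as read_version + 1, failing if the log has moved on since the read."""
        if self.latest_version() != read_version:
            raise ConflictError(
                f"expected latest version {read_version}, found {self.latest_version()}"
            )
        self.entries.append(list(added_files))
        return self.latest_version()
```

If two writers both start from version 0, the first `try_commit(0, ...)` creates version 1 and the second raises `ConflictError`, while a reader pinned to version 0 keeps getting the same `snapshot(0)`.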

In embodiments of the present application, when spatial data is stored in a data lake, the data lake stores the transaction log and metadata directly in an object store and uses a set of protocols over the object store operations to achieve serializability. The spatial data and attributes are stored in Parquet format, so any software that already supports Parquet can access the data, provided it implements a minimal connector that finds the set of objects to read.

Step 103, recording the log number of the transaction log and the timestamp at which the metadata of the spatial data was written into the transaction log in distributed storage.

Each transaction log entry carries a log number, which serves as the version of the data set. A timestamp is a way of recording time, generally used to record when an event occurred or when a file was created or modified.

In the embodiment of the application, when spatial data is read, the transaction log is read first to determine the required Parquet files, the log number used for the read is recorded, and the required Parquet files are then read to obtain the spatial data.

Within a single read transaction, the Parquet files are always read using the recorded log number, which ensures data consistency.

Step 104, reading a historical snapshot of the spatial data based on the log number and the timestamp to realize data restoration and historical data tracking.

In the embodiment of the application, a historical snapshot of the spatial data is queried by log number and timestamp, so the data can be restored to any recorded point, realizing data restoration and historical data tracking. Because the underlying data objects and the transaction log are immutable, reading a historical snapshot according to the log number recorded at storage time realizes spatial data versioning management.
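Snapshot lookup by log number or by timestamp can be sketched as a replay of the log. The commit list below is made-up sample data, and the function names are hypothetical; the point is only that a version resolves to a fixed set of files, and a timestamp resolves to the latest version committed at or before it.

```python
from bisect import bisect_right

# Each commit: (version, commit time in epoch seconds, files added, files removed).
# All values are made up for illustration.
COMMITS = [
    (0, 1000, {"part-0"}, set()),
    (1, 2000, {"part-1"}, set()),
    (2, 3000, {"part-2"}, {"part-0"}),  # e.g. an update that rewrote part-0
]


def files_at_version(version: int) -> set:
    """Replay the log up to `version` to get the files in that snapshot."""
    if not 0 <= version < len(COMMITS):
        raise ValueError(f"no such version: {version}")
    live = set()
    for _, _, added, removed in COMMITS[: version + 1]:
        live = (live - removed) | added
    return live


def version_at_timestamp(ts: int) -> int:
    """Latest version committed at or before `ts`."""
    index = bisect_right([commit[1] for commit in COMMITS], ts) - 1
    if index < 0:
        raise ValueError("timestamp is earlier than the first commit")
    return COMMITS[index][0]
```

Restoring the data to an earlier point then amounts to reading `files_at_version(version_at_timestamp(ts))` instead of the latest snapshot.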

Further, the spatial data is analyzed and calculated.

Specifically, referring to fig. 2, a delta expansion module is first configured.

The columnar storage of spatial data, transaction support for spatial data, record-level operations, and multi-version management are all realized in the delta expansion module.

In the embodiment of the application, the spatial data and its attributes are managed as tables.

Further, the spatial data is serialized into spatial objects and spatial indexes through the Kryo serialization library for Apache Sedona spatial objects.

Kryo is a serialization framework for Java objects. Objects that the Spark framework transfers over the network or caches in memory or on disk must be serialized before being distributed to tasks on the Executors, and the Executors compute on the serialized spatial data objects. The serialized spatial types fall into two major classes, spatial objects and spatial indexes: the spatial objects include Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, GeometryCollection, Circle and Envelope, and the spatial indexes include Quadtree and STRtree.
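Kryo itself is a Java library, so it is not reproduced here. As a language-neutral illustration of what serializing a spatial object into bytes involves, the sketch below encodes a two-dimensional point in the well-known binary (WKB) layout mentioned earlier: one byte-order byte, a 32-bit geometry type (1 for a point), and two 64-bit coordinates. The function names are hypothetical, and this is not the wire format Kryo produces.

```python
import struct


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB: byte order, geometry type 1, x, y."""
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(data: bytes) -> tuple:
    """Decode the bytes produced by point_to_wkb back into (x, y)."""
    byte_order, geom_type, x, y = struct.unpack("<BIdd", data)
    if byte_order != 1 or geom_type != 1:
        raise ValueError("only little-endian WKB points are supported in this sketch")
    return (x, y)
```

The encoded point occupies 21 bytes, and decoding it returns the original coordinates unchanged.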

Further, calculation functions that operate on spatial data are added by registering the spatial function library of Sedona.

In the embodiment of the application, the SQL functions that ship with Spark include none that operate on spatial data, so calculation functions for spatial data are added by registering the spatial function library of Sedona, providing the basic capability for spatial analysis and calculation. The spatial processing functions are packaged on top of a spatial distributed dataset that extends Spark's own, so spatial data can be conveniently analyzed and calculated through Spark SQL.

The required functions include one or more of a spatial vector data initialization function, a spatial vector data calculation function, and a spatial vector data connection aggregation function.

Specifically, the spatial vector data initialization functions convert conventional spatial data storage formats into spatial objects. In Spark SQL, the spatial vector data calculation functions perform spatial analysis on spatial vector objects, such as extracting and converting the points, lines and polygons of spatial data, creating buffers, and converting coordinate systems. The spatial vector data connection aggregation functions determine the spatial relationships among multiple spatial data sets, and include spatial operations such as join, overlay, aggregation and intersection.

Further, a DataFrame dataset is created based on the configured calculation parameters and the spatial data to be calculated.

In the embodiment of the application, the configured spatial analysis DAG is decomposed into a sequence of independently callable spatial analysis operators, the configured calculation parameters are read, the spatial data to be analyzed is read from the data lake, and the spatial data is defined as a distributed DataFrame dataset that can be operated on in Spark.
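Decomposing a configured DAG into a callable operator sequence is essentially a topological sort. The sketch below assumes a hypothetical JSON configuration with `nodes` and `edges` keys; the node names, operator names and parameters are invented for illustration. It uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
import json
from graphlib import TopologicalSorter

# Hypothetical task configuration; node names and parameters are illustrative.
CONFIG = json.loads("""
{
  "nodes": {
    "load_parcels": {"op": "read",   "params": {"table": "parcels"}},
    "buffer_100m":  {"op": "buffer", "params": {"distance": 100}},
    "clip":         {"op": "clip",   "params": {"mask": "district"}},
    "area":         {"op": "area",   "params": {}}
  },
  "edges": [["load_parcels", "buffer_100m"], ["buffer_100m", "clip"], ["clip", "area"]]
}
""")


def operator_sequence(config: dict) -> list:
    """Order the nodes so every operator runs after the operators it depends on."""
    sorter = TopologicalSorter({name: set() for name in config["nodes"]})
    for upstream, downstream in config["edges"]:
        sorter.add(downstream, upstream)  # downstream depends on upstream
    return list(sorter.static_order())  # raises graphlib.CycleError on a cycle
```

Each name in the returned sequence can then be looked up in `config["nodes"]` to obtain the operator type and its calculation parameters.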

Further, the spatial analysis operator implementation classes are called in sequence based on the SPI mechanism to complete the analysis and calculation of the spatial data.

Specifically, the functions required for the spatial analysis and calculation are defined based on the analysis and calculation requirements of the current operator; the spatial data is then analyzed and calculated by invoking, through SQL statements, the spatial functions in the Spark SQL spatial computation function library, and/or by converting an RDD into a SpatialRDD.
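The SPI mechanism is a Java facility for discovering implementation classes of an interface at run time. The sketch below imitates the idea in Python with an explicit registry: each operator class registers itself under a name, and the pipeline looks implementations up by name and calls them in order. All class, function and operator names are hypothetical, and rows are plain dictionaries rather than Spark DataFrames.

```python
from abc import ABC, abstractmethod

OPERATORS = {}  # operator name -> implementation class (stands in for SPI discovery)


class SpatialOperator(ABC):
    @abstractmethod
    def run(self, rows: list, params: dict) -> list:
        """Transform the input rows and return the result rows."""


def register(name: str):
    def decorator(cls):
        OPERATORS[name] = cls
        return cls
    return decorator


@register("filter")
class FilterOperator(SpatialOperator):
    def run(self, rows, params):
        # Keep rows whose attribute equals the configured value.
        return [r for r in rows if r[params["field"]] == params["value"]]


@register("bbox_area")
class BBoxAreaOperator(SpatialOperator):
    def run(self, rows, params):
        # Add the area of each row's bounding box (xmin, ymin, xmax, ymax).
        return [
            {**r, "area": (r["bbox"][2] - r["bbox"][0]) * (r["bbox"][3] - r["bbox"][1])}
            for r in rows
        ]


def run_pipeline(rows: list, steps: list) -> list:
    """Look up each step's implementation by name and call them in order."""
    for step in steps:
        rows = OPERATORS[step["op"]]().run(rows, step.get("params", {}))
    return rows
```

Adding a new analysis function then only requires registering another class; the pipeline driver itself does not change.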

The analysis and calculation functions of the space data comprise one or more of buffer area analysis, intersection calculation, vector clipping, data screening, attribute-based segmentation, space connection, fusion boundary, data spatialization and area calculation.

In some embodiments, the spatial functions in the Spark SQL spatial function library are called through SQL statements to perform spatial operations such as spatial overlay and spatial aggregation on the spatial data objects. In other embodiments, for complex spatial computations that spatial SQL cannot provide, the resilient distributed dataset (RDD) is converted into a spatial resilient distributed dataset (SpatialRDD) dedicated to the analysis and calculation of spatial data, and complex spatial operations are performed on the partitioned spatial data using the SpatialRDD.

Specifically, buffer analysis creates buffer polygons within a specified distance around the input features. Intersection calculation computes the intersection of a source data set (the data set to be intersected) and an overlay data set (the intersecting data set). Vector clipping clips a vector data set, and includes internal clipping and external clipping. Data screening extracts features from an input feature class or input feature layer (typically using a selection or a Structured Query Language (SQL) expression) and stores them in an output feature class. Partitioning by attribute splits the input data set by unique attribute values. Spatial join joins the attributes of one feature class to another according to their spatial relationship, and writes the target features together with the joined attributes into the output feature class. Fusion boundary merges the objects in line and polygon data that meet given conditions into a single object. Data spatialization includes the spatialization of point, line and polygon features. Area calculation computes the area of a vector polygon data set.
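Two of these operations are simple enough to sketch in plain Python under planar (non-geodesic) assumptions: area calculation for a simple polygon using the shoelace formula, and intersection calculation restricted to axis-aligned rectangles. The function names are hypothetical, and general polygon intersection would require a geometry library.

```python
def polygon_area(ring: list) -> float:
    """Planar area of a simple polygon ring (list of (x, y)) via the shoelace formula."""
    total = 0.0
    # Pair each vertex with the next one, wrapping around to the first vertex.
    for (x1, y1), (x2, y2) in zip(ring, ring[1:] + ring[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def rect_intersection(a: tuple, b: tuple):
    """Intersection of two (xmin, ymin, xmax, ymax) rectangles, or None.

    Rectangles that merely touch along an edge or corner count as disjoint here.
    """
    xmin, ymin = max(a[0], b[0]), max(a[1], b[1])
    xmax, ymax = min(a[2], b[2]), min(a[3], b[3])
    if xmin >= xmax or ymin >= ymax:
        return None
    return (xmin, ymin, xmax, ymax)
```

The ring may be given open or closed (with the first vertex repeated at the end); the repeated vertex contributes nothing to the sum.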

In the embodiment of the application, a web-based spatial data analysis and calculation function is provided: the spatial analysis flow is configured as a DAG (directed acyclic graph), the spatial data sources to be analyzed are selected, the spatial analysis method and the corresponding calculation parameters of each node are set, and finally the parameters are generated uniformly in JSON format and submitted for task scheduling.

Further, a directed acyclic graph is constructed according to the dependencies of the SpatialRDD and submitted to the DAG scheduler. The DAG scheduler parses the directed acyclic graph, divides it into step groups (stages), and submits the step groups to the task scheduler of the Spark cluster. The Spark task scheduler sends the step groups to the Executors of Spark, and the Executors execute the step groups item by item. When execution of the step groups is finished, the Executors write the execution result into the SpatialRDD. Finally, the execution result is read from the SpatialRDD, its data format is converted into spatial data readable by the service system, and the spatial data is stored.
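The scheduling flow can be imitated on a single machine as follows. This is a toy model under stated assumptions: tasks are grouped by dependency depth rather than by Spark's own stage-splitting rules, a thread pool stands in for the Executors, and the function names `step_groups` and `run` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def step_groups(deps: dict) -> list:
    """Split a DAG {task: set of prerequisite tasks} into groups that run in order."""
    remaining = {task: set(pre) for task, pre in deps.items()}
    groups = []
    while remaining:
        ready = sorted(task for task, pre in remaining.items() if not pre)
        if not ready:
            raise ValueError("dependency cycle detected")
        groups.append(ready)
        for task in ready:
            del remaining[task]
        for pre in remaining.values():
            pre.difference_update(ready)
    return groups


def run(deps: dict, actions: dict, workers: int = 2) -> dict:
    """Run each group on a worker pool; a group starts only after the previous one ends."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for group in step_groups(deps):
            # Each task receives a copy of the results produced by earlier groups.
            futures = {task: pool.submit(actions[task], dict(results)) for task in group}
            results.update({task: future.result() for task, future in futures.items()})
    return results
```

Tasks within one group have no dependencies on each other, so they may run concurrently; the final entry in the returned dictionary plays the role of the execution result that is read back and converted for the service system.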

Wherein the storing of the spatial data includes storing to a spatial database and/or storing in GeoJson format.

In some embodiments, the data is stored through Spark's DataFrame (a resilient distributed data set with metadata information) into PostGIS, PostgreSQL SDE, or another database that can store spatial data, and is provided for subsequent calls by business systems or published as spatial services. In other embodiments, the data is output in GeoJSON format for web page display and the like.
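For the GeoJSON output path, a minimal sketch is given below. It assumes each result row is a dictionary holding a GeoJSON geometry under the key `geometry` plus arbitrary attribute columns; that row shape and the function name `to_feature_collection` are assumptions made for the example.

```python
import json


def to_feature_collection(rows: list) -> str:
    """Serialize rows of {"geometry": <GeoJSON geometry>, ...attributes} as GeoJSON."""
    features = [
        {
            "type": "Feature",
            "geometry": row["geometry"],
            # Every column other than the geometry becomes a feature property.
            "properties": {k: v for k, v in row.items() if k != "geometry"},
        }
        for row in rows
    ]
    return json.dumps({"type": "FeatureCollection", "features": features})
```

The resulting string can be handed directly to a web map client that accepts GeoJSON.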

In the embodiment of the application, for the execution result in each SpatialRDD, whether to cache the execution result is judged according to the analysis and calculation requirements of the different spatial analysis operators; if the result is not cached, the analysis and calculation of the spatial data is performed again.
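One way to picture this per-operator caching decision is sketched below. The class `ResultCache`, the set of cacheable operator names, and the cache key are all hypothetical; the sketch only shows that results of operators flagged as cacheable are reused, while all other results are recomputed on every request.

```python
class ResultCache:
    """Caches operator results only for operators flagged as worth caching."""

    def __init__(self, cacheable_ops: set):
        self.cacheable_ops = cacheable_ops
        self.store = {}
        self.computations = 0  # counts how often a result was actually computed

    def get_or_compute(self, op: str, key, compute):
        """Return a cached result if allowed and present; otherwise compute it."""
        if op in self.cacheable_ops and (op, key) in self.store:
            return self.store[(op, key)]
        self.computations += 1
        result = compute()
        if op in self.cacheable_ops:
            self.store[(op, key)] = result
        return result
```

Here `compute` is any zero-argument callable that performs the analysis, and `key` identifies the input (for example, a data set version together with the operator parameters).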

According to the embodiment of the disclosure, the following technical effects are achieved:

It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are alternative embodiments, and that the acts and modules referred to are not necessarily required for the present application.

The above description of the method embodiments further describes the solution of the present application by means of device embodiments.

It should be understood that references herein to "at least one" mean one or more, and "a plurality" means two or more. In the description of the embodiments of the present application, unless otherwise indicated, "/" means "or"; for example, A/B may represent A or B. "And/or" herein merely describes an association between associated objects and means that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. In addition, to describe the technical solutions of the embodiments clearly, the words "first", "second", etc. are used to distinguish identical or similar items having substantially the same function and effect. Those skilled in the art will appreciate that the words "first", "second", and the like do not limit quantity or order of execution, and that items so labeled are not necessarily different.

The above-mentioned exemplary embodiments of the present application are not intended to limit the embodiments of the present application, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the embodiments of the present application should be included in the protection scope of the present application.

Claims

1. The space data versioning management method based on the data lake is characterized by comprising the following steps of:

writing the processed heterogeneous spatial data into a data lake based on the Spark DataFrameWriter, wherein the processing of the heterogeneous spatial data comprises spatial data parsing, abnormal data processing and spatial conversion;

writing Parquet files of the heterogeneous spatial data into distributed storage, and writing metadata of the heterogeneous spatial data into a transaction log in the distributed storage;

recording the log number of the transaction log and the timestamp at which the metadata of the heterogeneous spatial data was written into the transaction log in the distributed storage;

and reading a historical snapshot of the heterogeneous spatial data based on the log number and the timestamp to realize data restoration and historical data tracking.

2. The method of claim 1, wherein versioning management of the heterogeneous spatial data is implemented in a delta expansion module.

3. The method of claim 1, wherein, after the reading of the historical snapshot of the heterogeneous spatial data based on the log number and the timestamp to realize data restoration and historical data tracking, the method further comprises performing analysis and calculation on the heterogeneous spatial data.

4. A method according to claim 3, wherein said analyzing and computing said heterogeneous spatial data comprises:

Configuring a delta expansion module;

serializing the heterogeneous spatial data into spatial objects and spatial indexes through the Kryo serialization library for Apache Sedona spatial objects;

adding calculation functions that operate on the heterogeneous spatial data by registering the spatial function library of Sedona;

creating a DataFrame dataset based on the configured calculation parameters and the heterogeneous spatial data to be calculated;

and sequentially calling the spatial analysis operator implementation classes based on the SPI mechanism to complete the analysis and calculation of the heterogeneous spatial data.

5. The method of claim 4, wherein completing the analysis and calculation of the heterogeneous spatial data by a single spatial analysis operator implementation class comprises:

defining the functions required for the spatial analysis and calculation based on the analysis and calculation requirements of the current operator for the heterogeneous spatial data;

analyzing and calculating the heterogeneous spatial data by invoking, through SQL statements, the spatial functions in the Spark SQL spatial computation function library, and/or by converting an RDD into a SpatialRDD.

6. The method of claim 5, wherein the analysis computation functions of the heterogeneous spatial data include one or more of buffer analysis, intersection computation, vector clipping, data screening, partitioning by attribute, spatial join, fusion boundary, data spatialization, and area computation.

7. The method of claim 5, wherein the required functions include one or more of a spatial vector data initialization function, a spatial vector data calculation function, and a spatial vector data connection aggregation function.

8. The method of claim 5, wherein said analyzing and calculating of said heterogeneous spatial data by converting an RDD into a SpatialRDD further comprises:

the Spark task scheduler sends the step groups to the Executors of Spark;

the Executors execute the step groups item by item;

and reading the execution result from the SpatialRDD, converting its data format into heterogeneous spatial data readable by a service system, and storing the heterogeneous spatial data.

9. The method of claim 8, wherein converting the data format into heterogeneous spatial data readable by the business system and storing the heterogeneous spatial data, further comprises:

And if not, carrying out analysis and calculation on the heterogeneous space data again.

10. The method of claim 8, wherein the storing of the heterogeneous spatial data comprises storing to a spatial database and/or storing in GeoJson format.