CN113934750A

CN113934750A - Data blood relationship analysis method based on compiling mode

Info

Publication number: CN113934750A
Application number: CN202111246040.0A
Authority: CN
Inventors: 黎哲
Original assignee: Shanghai Zezi Information Technology Co ltd
Current assignee: Shanghai Zezi Information Technology Co ltd
Priority date: 2021-10-26
Filing date: 2021-10-26
Publication date: 2022-01-14
Anticipated expiration: 2041-10-26
Also published as: CN113934750B

Abstract

The invention discloses a data blood relationship analysis method based on a compiling mode, which comprises the following steps: s1) collecting SQL scripts and metadata on the data management platform to form a uniform format; s2) analyzing the collected SQL script to form an abstract syntax tree, and performing semantic analysis or SQL interpretation according to metadata; s3) constructing a data blood relationship based on the SQL syntax tree with semantic information and the data blood relationship model; s4) according to the data blood relationship, carrying out the position layout of graphic elements, and carrying out the visualization display and the data tracing operation; s5) determining a unique identifier for each data transition in the data context, and persisting the data context. The invention can obtain correct tracing results, realizes good data blood relationship analysis support for a plurality of databases, and is easy to support for a data management platform emerging later.

Description

Data blood relationship analysis method based on compiling mode

Technical Field

The invention relates to a data blood relationship analysis method, in particular to a data blood relationship analysis method based on a compiling mode.

Background

With the rapid development and application of the technology in the field of big data, data blood relationship analysis becomes an important component of data management, and the data blood relationship analysis technology becomes an important means for data source tracing, data value evaluation, data compliance proof and data quality evaluation. The current data blood relationship analysis technology has three main types:

1. a pattern matching based approach;

2. a tag technology based approach;

3. based on the manner of code parsing technique.

The code analysis technology-based mode can completely track the flow of data by performing algorithm analysis on codes for data processing, conversion and transmission, and know where the data comes from and goes to and what changes occur in the flow process.

In the field of big data, the SQL language is widely used for transmission, conversion and loading of data. Therefore, the SQL statement codes are analyzed by using the compiling technology, the data conversion and flow between the database table and the field can be quickly, automatically and timely combed, a complete data consanguinity relationship is formed, important metadata is provided for data management, the data use efficiency of developers and business personnel is improved, and the potential value of the data is fully mined.

The prior art mainly has the following potential problems:

first, most of the data blood margin analysis tools commonly used in the market are based on the data blood margin analysis tools provided by the data governance platform, and then are improved and integrated. For example, secondary development is performed by using data margin analysis capability of Hive.

Potential problems with the use of such tools:

1. the data margin analysis tool is bound to a particular platform and only supports that platform.

2. The discovered problems can only be solved by depending on the provider of the data governance platform.

3. Data blooding margins of multiple data platforms cannot be analyzed in an integrated manner.

For example, for the big data analysis tool Google BigQuery, it is obviously not feasible to use Hadoop Hive's blood margin analysis tool to analyze its data blood margin relationship.

The data blood margin analysis tool developed based on the invention realizes good data blood margin analysis support for a plurality of databases and is easy to support for a data governance platform emerging in the future.

The existing data blood relationship development tools lack comprehensive and systematic description of data blood relationship, do not have a unified data blood relationship model, and are not beneficial to managing data among multiple data management platforms.

And thirdly, no clear algorithm is provided for tracing and persisting the data by using the data blood relationship.

Fourthly, the SQL analysis-based data blood relationship analysis method has the important precondition that SQL grammars of different database manufacturers can be completely analyzed. However, the existing data blood relationship analysis tools generally use the SQL parser provided by the data management platform itself, and cannot completely parse SQL statements of other manufacturers, and the cross-platform performance is lost.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a data blood relationship analysis method based on a compiling mode, which can obtain a correct tracing result and is suitable for different data management platforms.

The technical scheme adopted by the invention for solving the technical problems is to provide a data blood relationship analysis method based on a compiling mode, which comprises the following steps: s1) collecting SQL scripts and metadata on the data management platform to form a uniform format; s2) analyzing the collected SQL script to form an abstract syntax tree, and performing semantic analysis or SQL interpretation according to metadata; s3) constructing a data blood relationship according to the data blood relationship model based on the SQL syntax tree with semantic information; s4) according to the data blood relationship, carrying out the position layout of graphic elements, and carrying out the visualization display and the data tracing operation; s5) determining a unique identifier for each data transition in the data context, and persisting the data context.

Further, the SQL script and the metadata in step S1 are obtained as follows: for the distributed file system, a file copying mode is adopted for direct acquisition; for the database, the JDBC driver is connected with the database to read the data in the table; for a source code version control server, acquiring a file by using a source code version control tool client interface; for a dedicated system, the data is acquired in real time using the interface provided by the dedicated system.

Further, the step S2 includes: s21) establishing corresponding lexical and syntactic analyzers based on SQL grammars of different databases, and carrying out lexical and syntactic analysis on SQL sentences; s22) constructing an SQL statement abstract syntax tree with a unified structure; s23) performing semantic analysis based on the abstract syntax tree, and establishing correct corresponding relation of tables and fields; s24) traversing the abstract syntax tree by combining the externally provided metadata, and eliminating the ambiguous relation of the table and the field; s25) interpreting and executing the dynamic SQL statement to further refine the correct correspondence of tables and fields.

Further, the elements in the data relationship model in the step S3 include entities and relationships, each entity includes one or more attributes; data between entities are converted and flowed through various processing processes; each data transformation comprises two entities and a relationship, and a plurality of data transformations are combined to form a data consanguinity relationship; setting this direct data flow to the fdd relationship if the data of the destination field comes from the source field through some transformation; if the data for the target field is not from the source field, but the data for the source field has an effect on the target field value, then this indirect data effect is set to the fdr relationship.

Further, the data blood relationship model comprises the following three levels: and (4) task layer: a task for processing data in the data management platform; and (3) a data layer: the method comprises the steps of including a table in a database, a directory and a file in the HDFS; a field layer: the columns in the data table are indexed.

Further, the field-level data tracing process in step S4 is as follows: setting the fdd relation to 1 and setting the fdr relation to 0; performing logical AND operation on all relations on one link; if the result is 1, the data flow from the source data to the target data is an actual data flow; otherwise, there is only a data impact relationship from the source data to the target data.

Further, the table-level data tracing process in step S4 is as follows: firstly, tracing the source of the table-level data one by one according to the blood relationship of the field-level data; in the same source data and target data, if a plurality of data blood margin links exist, all blood margin links containing the fdr relation are removed; and finally, carrying out data backtracking on the tables according to the blood relationship of the field-level data, and if the blood relationship exists in any field-level data of the two tables, judging that the two tables have the blood relationship.

Further, the step S4 further includes: constructing various table and field mapping relations by adopting Map, and merging the table and field mapping relations; performing two-level caching of an internal memory and a file level on the data blood relationship result, wherein the internal memory level stores the processing results of a plurality of SQL sentences and adopts an LRU expiration strategy; the file cache stores the processing result of the whole database batch SQL script, and simultaneously, the ProtoBuf mode is adopted for the file cache to carry out serialization or deserialization on the data.

Further, the step S4 adopts an adapter design mode to access various graphic layout algorithms, performs graphic layout calculation of data blood relationship, and performs visual display based on the layout result.

Further, the step S5 includes: s51) naming the entity by using the uniform resource identifier, wherein different entity types can adopt different URIs; s52) solving a hash value of the URI of the entity to obtain a unique identifier of the entity, and avoiding overlong character strings; s53) the unique identifier of the relationship depends on the SQL statement and the corresponding conversion code, and is usually an expression; s54) generalizing the SQL statement where the relation is located, and removing the constant in the where statement; s55) solving a hash value of the generalized SQL statement text; s56) solving a hash value of the conversion code of the relation; s57) concatenating the hash values in S55, S56 as the unique identifier of the relationship; s58) concatenating the hash values of the source entity, the target entity, and the relationship as a unique identifier of the "data transformation".

Compared with the prior art, the invention has the following beneficial effects: the compiling-mode-based data blood relationship analysis method provided by the invention can obtain correct tracing results, and realizes good data blood relationship analysis support for AWS athena, Google bigquery, corehbase, dax, db2, greenplus, SAP Handa, live, inpela, informix, mdx, mysql, netezza, odbc, openedge, oracle, postgresql, presto, redshift, snowfly, sparksql, sqlserver, sybase, terada and verttica, and is easy to support for a data governance platform which appears later.

Drawings

FIG. 1 is a flow chart of data relationship analysis based on compiling mode according to the present invention;

FIG. 2 is a diagram of a data session including a data transformation performed by SQL of the present invention;

FIG. 3 is a visualized blood relationship diagram according to the present invention.

Detailed Description

The invention is further described below with reference to the figures and examples.

FIG. 1 is a flow chart of the data relationship analysis based on compiling method of the present invention.

Referring to fig. 1, the main process of the data relationship analysis method based on compiling mode provided by the present invention includes:

s1) collecting the SQL script and the metadata on the data governance platform to form a uniform format.

S2) parsing the collected SQL script to form an abstract syntax tree, performing semantic analysis based on metadata, or interpreting and executing SQL.

S3) constructing the data blood relationship based on the SQL syntax tree with semantic information and the data blood relationship model.

S4) according to the data blood relationship, carrying out position layout of graphic elements, carrying out visual display, carrying out data tracing and other operations.

S5) determining a unique identifier for each data transition in the data context, and persisting the data context.

First, collect SQL script and metadata on data governance platform

SQL script and metadata storage location

1. And the file system comprises an HDFS distributed file system.

2. A database.

3. The source code version controls the server.

4. A dedicated system.

SQL script and metadata storage format

1. A text file containing only SQL scripts.

2. Formatted files, including SQL scripts and metadata, such as CSV files.

3. A database table.

SQL script and metadata acquisition mode

1. And copying the file.

2. And connecting the database through JDBC drive, and reading data in the table.

3. And acquiring the file by using a client interface of the source code version control tool.

4. The data is acquired in real time through an interface of a dedicated system, such as hook by Hive.

Second, data blood relationship model

Data blood relationship plays an increasingly important role in data governance of data governance platforms. The hive of the Hadoop platform supports the analysis of the data blood relationship, but lacks the complete definition and description of the data blood relationship, is limited to the use of the Hadoop platform, and cannot be used for uniformly representing the data blood relationship under the condition of mixing a plurality of data platforms.

The invention provides a data blood relationship model, and the data blood relationship can be modeled, calculated, stored and displayed based on the model.

2.1 hierarchy of data relationship

In general, data-based genetic analysis comprises the following three levels:

1. task layer

The task layer refers to a task for processing data in the data governance platform, typically a job in Spark/Hive.

2. Data layer

The data layer contains tables in the database, directories, files in the HDFS, etc. The objects of this layer are the subjects of data consanguinity.

3. Field layer

A field generally refers to a column in a data table. The data relationship established in the layer can provide the finest and accurate analysis result for the user, and the data relationship of other layers is deduced through the data relationship of the layer.

2.2 entities and relationships in data context

All objects in the data relationship hierarchy are collectively referred to as entities (entries), each of which contains one or more attributes (attributes).

The jobin Spark/Hive of the task layer is an entity, and the entity has attributes such as name, running time and the like.

The most important entity in the data layer is a table (table), and the attribute in the table is called field (column) in the invention.

Data between entities is transformed and streamed through various processes, and such transformation and streaming is called relationships between entities. There are two types of relationships:

1. direct data flow: the data of the destination field is converted from the source field by some kind, and the relationship of the blood relationship is referred to as fdd.

2. Indirect data influence, the data of the target field is not from the source field, but the data of the source field has influence on the formation of the value of the target field, and the relationship is called fdr.

To each relation, other information such as the SQL statement type for performing data conversion, the code indicating the data conversion rule, and the number of data conversions is attached. Through the data conversion rule, the invention can evaluate the quality of the data. By counting the data conversion times, the invention can know the use frequency of the data.

The data relation between the fields is the finest granularity relation, and the data blood relation between the upper-layer entities can be calculated through the field blood relation.

2.3 atomic units of data kindred relationship

Using the entities and relationships defined above, the present invention can give the atomic units of the kindred relationship: data transform (data transform). The whole data blood relationship is the set of all atomic units.

Each data transformation contains two entities and a relationship, the two entities being objects in the same data layer, the relationship describing the data flow between the entities.

Take the following SQL statement as an example:

SELECT FirstName as Name from Person

this SQL-formed data lineage contains one data transformation, as shown in fig. 2.

The data representing the Name field in the target entity data set select list RS-1 comes from the FirstName field in the source entity table Person.

2.4 data relationship by data transformation

The data blooding relationship graph can be formed by combining a plurality of data conversions. Take this create view SQL statement as an example

create view v(myname)

as SELECT p.FirstName as name

from Person AS p

A data blood relationship is formed by converting two data with the relationship type of fdd.

Person.FirstName->fdd->RS-1.name->fdd->v.myname

The visualized blood relationship graph is shown in FIG. 3; the actual data-blood relationship is far more complex than this, but the rationale is the same.

Third, construction of data relationship of blood relationship

With the data blood relationship model introduced above, the invention can perform actual SQL script blood relationship analysis work.

3.1 parsing SQL scripts

In order to support SQL grammars of different databases and avoid dependence on SQL analysis tools of any database manufacturers, a general SQL parser is provided, which can completely support the SQL grammars of the different databases and comprises a storage process, so that data blood-related analysis across database platforms becomes possible. Running the generic SQL parser does not rely on any third party libraries. The general SQL parser is implemented as follows:

1. and establishing a corresponding lexical and syntactic analyzer based on SQL grammars of different databases, and performing lexical and syntactic analysis on the SQL sentences.

2. And constructing an SQL statement abstract syntax tree with a unified structure.

3. Semantic analysis is performed based on the abstract syntax tree, and correct corresponding relations between tables and fields are established.

4. Traversing the abstract syntax tree in conjunction with externally provided metadata disambiguates certain tables and fields.

5. And interpreting and executing the dynamic SQL statement to further perfect the correct corresponding relation of the table and the field.

3.2 construction of the relationship of the blood relationship of SQL statement data

After the analysis of the SQL script is completed and the syntax tree with semantic information is obtained, the invention has complete information of the source data table and the fields and the corresponding target data table and the fields. How to establish data lineage relationships for different types of SQL statements is described in detail below.

1, SELECT SQL statement

The SELECT statement is the most commonly used statement in SQL, and is also commonly used as a component of other SQL statements.

i.select list

SELECT p.FirstName as name

from Person AS p

Establishing an fdd type data relationship from a source field to a target field:

Person.FirstName->fdd->resultSet.name

ii.where clause

SELECT p.FirstName as name

from Person AS p

where city＝'NY'

a data lineage relationship of fdr type is established indicating that the number of rows in the result set produced by select list is affected by person.

Person.city->fdr->resultSet.rows

iii.group by

SELECT dept,sum(sal)salary

from scott.emp

group by dept

An fdr-type data context is established, indicating that the result value produced by the aggregation function sum () is affected by person.

Person.dept->fdr->resultSet.salary

At the same time, a data lineage relationship of the fdd type is also established, indicating that the result value produced by the aggregation function sum () comes from person.

Person.sal->fdd->resultSet.salary

UPDATE SQL statement

UPDATE Per

SET Per.PersonCityName＝Addr.City

FROM Persons Per

INNER JOIN AddressList Addr

ON Per.PersonId＝Addr.PersonId

Establish a data lineage relationship of fdd type, indicating that persons. personCityName value comes from addressList. City.

AddressList.City->fdd->Persons.PersonCityName

3, INSERT SQL statement

INSERT INTO Customers(CustomerName,City,Country)

SELECT SupplierName,City,Country FROM Suppliers；

Establishing three data blood relationship of the fdd type, which are respectively as follows:

Suppliers.SupplierName->fdd->Customers.CustomerName

Suppliers.City->fdd->Customers.City

Suppliers.Country->fdd->Customers.Country

4. more SQL statements

The data blood relationship model can represent the data blood relationship generated by all SQL sentences, and is not repeated herein.

3.3 method for tracing field level data

The construction of the data blood relationship is completed, and the data tracing can be performed according to the invention.

One data may have multiple data sources and may be processed from multiple processes. In a big data environment, the general user is concerned with the relationship that there is actual data flow, i.e., the fdd type of relationship, and the pair fdr relationship is used when troubleshooting more detailed problems.

Within the same source and target data, there are typically multiple data lineage links representing different data processing. The algorithm provided by the invention can be used for quickly and conveniently finding out the blood margin link with actual data flow, removing the blood margin link containing the fdr relation and facilitating the user to quickly position and check the blood margin relation of the data.

Column1 has the following data bloodline links:

data blood margin link:

1:tableA.column1->fdd->tableC.column->fdd->tableD.column->fdd->tableB.column1

2:tableA.column1->fdd->tableE.column->fdr->tableF.column->fdd->tableB.column1

3:tableA.column1->fdd->tableG.column->fdd->tableH.column->fdr->tableB.column1

the user wants to know whether the target data tableb. column1 has actual data from tablea. column1, the present invention performs the following calculation:

1. the fdd relation is set to 1, and the fdr relation is set to 0.

2. All relationships on one link are logically anded.

3. If the result is 1, it indicates that the data flow from the source data to the target data is actual data flow. Otherwise, there is only a data impact relationship from the source data to the target data.

According to the above algorithm:

the final blood relationship for link 1 is: 1&1&1 ═ 1, which is the actual data stream.

The final blood relationship for link 2 is: 1&0&1 ═ 0, to influence the data stream.

The final blood relationship for link 3 is: 1&1&0 ═ 0, to influence the data stream.

0604. Method for tracing table-level data

The flow of data generally occurs at the field level, so table-level data tracing must be based on field-level data consanguinity. If the source is directly traced from the blood-related relationship between tables, an erroneous tracing result may be generated.

Taking this SQL script as an example:

insert into B(a1,a2,a3)select a1,a2,a3 from A1；

insert into B(a1,a2,a3,a4)select a1,a2,a3,a4 from A2；

UPDATE C1 SET a2＝(SELECT a2 FROM B WHERE B.a1＝C1.a1)；

UPDATE C2 SET a4＝(SELECT a4 FROM B WHERE B.a3＝C2.a3)；

if only the data blood relationship of the table level is concerned, the invention obtains the following data blood relationship:

A1->fdd->B

A2->fdd->B

B->fdd->C1

B->fdd->C2

according to the data backtracking of the tables C1 and C2 based on the blood relationship, the invention can obtain:

c1 dependence on A1

C1 dependence on A2

C2 dependence on A1

C2 dependence on A2

Looking at SQL scripts, the present invention can find that C2 is not dependent on A1, so this result is erroneous. The reason is that the data consanguinity relationship at the table level does not represent complete data flow information.

The following are the bloody relationships at the SQL statement field level, with the characters in parentheses representing the field name

A1(a2)->fdd->B(a2)

A2(a2)->fdd->B(a2)

A2(a4)->fdd->B(a4)

B(a2)->fdd->C1(a2)

B(a4)->fdd->C2(a4)

According to the relationship of the blood relationship, the invention can obtain the correct tracing result:

c1 dependence on A1

C1 dependence on A2

C2 is dependent on A2.

Through investigation on a large number of enterprise data management platforms, the invention analyzes and obtains an empirical formula of data quantity calculation of data blood relationship in an enterprise-level data platform:

count(relation)＝(count(table)+count(view))*10*2；

relationship represents the single atom transform in the data edge, count (relationship) represents the total number of atom transforms in the entire data edge analysis result, count (table) represents the number of all tables, and count (view) represents the number of all views.

A data governance platform of a general enterprise may include ten thousand tables and views, where 4 ten thousand tables and views are taken as examples, and count (relationship) ═ 40000+40000 × 10 × 2 ═ 1600000.

Therefore, the data blood margin analysis result generates million-level relations, the data blood margin of the magnitude is constructed, traced and stored, the tracing time complexity of the algorithm is O (n), and the results are obtained in 10 seconds through the query of the table and the fields. In order to meet the data blood relationship analysis requirements of most enterprises and further improve the data utilization efficiency of the enterprises, the invention further adopts the following optimization algorithm for the data blood relationship:

1. and constructing various tables and field mapping relations by adopting Map, wherein the time complexity of the search is O (1).

2. And merging according to the table and field mapping relation, wherein the time complexity of the merging result is O (n).

3. And performing two-level caching of a memory and a file level on the data blood relationship result.

4. The memory level mainly stores some small SQL processing results, and adopts an LRU (Least Recently Used) expiration strategy, which is a commonly Used page replacement algorithm.

5. And the file cache stores the processing result of the whole database batch SQL script.

6. The ProtoBuf mode is adopted for file caching to serialize/deserialize data, and the deserialization time of million-level data can be compressed to be within 5 seconds. Protobuf is a set of methods developed by Google to serialize data structures, and can be used as a communication protocol, a data storage format, and the like.

Fourth, visualization of data blood relationship

When the data relationship is calculated, in order to present the data relationship to the user in a reasonable and beautiful graphic manner, the entity elements and the relationship in the relationship need to be graphically laid out, and the coordinate position of each element in the graphic needs to be calculated.

The invention adopts an adapter design mode, and a visualization module of the data blood relationship adopts the adapter design mode to access various graphic layout algorithms, perform graphic layout calculation of the data blood relationship, and then perform visualization display based on a layout result.

The calculation of the graphical layout algorithm may be done at the user terminal and then displayed visually. Or the method can be finished at a server side, and the layout coordinate data of the graph is sent to the user terminal, and in this way, the user terminal only carries out visual display.

The invention does not provide a graphic layout algorithm, and developers can adopt commercial, open source or self-developed graphic layout algorithms by themselves.

Fifth, persistence of data relationship

Data governance is a system engineering, and analyzing data blood relationship is an important link. After the data relationship analysis is completed, the data relationship analysis is usually required to be in butt joint with a data directory (data catalog) of a data management platform, the data relationship is stored in the data directory as metadata, persistence is completed, the data relationship is accumulated, and query, development and management of subsequent services are facilitated. The data consanguinity relationships may also be stored in a relational or graphical database.

The data relationship is inserted, updated and deleted into the data directory by taking 'data conversion' as a unit. To perform these operations correctly, the problem of unique identification of each "data transformation" needs to be solved.

The present invention recognizes that a "data transformation" comprises two entities and a relationship, and by uniquely identifying the two entities and the relationship, the present invention can uniquely identify the "data transformation". The specific implementation algorithm is as follows:

1. entities are named with Uniform Resource Identifiers (URIs). Different entity types may employ different URIs, which for a field take the form of: table column @ cluster.

2. And solving a hash value of the URI of the entity to obtain the unique identifier of the entity, and simultaneously avoiding overlong character strings.

3. The unique identifier of the relationship depends on the SQL statement in which it resides, and the corresponding translation code, usually an expression.

4. And generalizing the SQL statement where the relation is located, and removing the constant in the where statement.

5. And solving a hash value of the generalized SQL statement text.

6. And solving a hash value of the conversion code of the relation.

7. The hash value in the concatenation 5, 6 serves as the unique identifier for the relationship.

8. And splicing the hash values of the source entity, the target entity and the relationship to serve as the unique identifier of the data conversion.

The procedure for writing data context is as follows:

1. a unique identifier for each "data conversion" is generated.

2. If the "data transition" is not found in the data directory, a "data transition" record is inserted.

3. If the "data transitions" already exist in the data directory, the number of "data transitions" is updated.

The invention can accurately acquire the quality and the heat of the data, and can investigate the quality and the heat of the data by analyzing the data blood relationship in the data catalogue. Looking at the conversion rule codes contained in the data conversion relation, the conversion process of the data is known, and therefore the quality of the target data is known. The number of updates of each "data conversion" is checked to see the heat of the data.

Although the present invention has been described with respect to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A data blood relationship analysis method based on a compiling mode is characterized by comprising the following steps:

s1) collecting SQL scripts and metadata on the data management platform to form a uniform format;

s2) analyzing the collected SQL script to form an abstract syntax tree, and performing semantic analysis or SQL interpretation according to metadata;

s3) constructing a data blood relationship based on the SQL syntax tree with semantic information and the data blood relationship model;

s4) according to the data blood relationship, carrying out the position layout of graphic elements, and carrying out the visualization display and the data tracing operation;

2. The compiling-mode-based data consanguinity relationship analysis method according to claim 1, wherein the SQL script and the metadata are obtained in the following manner in step S1:

for the distributed file system, a file copying mode is adopted for direct acquisition;

for the database, the JDBC driver is connected with the database to read the data in the table;

for a source code version control server, acquiring a file by using a source code version control tool client interface;

for a dedicated system, the data is acquired in real time using the interface provided by the dedicated system.

3. The compiling method-based data consanguinity relationship analysis method of claim 1, wherein the step S2 includes:

s21) establishing corresponding lexical and syntactic analyzers based on SQL grammars of different databases, and carrying out lexical and syntactic analysis on SQL sentences;

s22) constructing an SQL statement abstract syntax tree with a unified structure;

s23) performing semantic analysis based on the abstract syntax tree, and establishing correct corresponding relation of tables and fields;

s24) traversing the abstract syntax tree by combining the externally provided metadata, and eliminating the ambiguous relation of the table and the field;

s25) interpreting and executing the dynamic SQL statement to further refine the correct correspondence of tables and fields.

4. The compilation-based data relationship analysis method according to claim 1, wherein the elements in the data relationship model in the step S3 include entities and relationships, each entity including one or more attributes; data between entities are converted and flowed through various processing processes; each data transformation comprises two entities and a relationship, and a plurality of data transformations are combined to form a data consanguinity relationship; setting this direct data flow to the fdd relationship if the data of the destination field comes from the source field through some transformation; if the data for the target field is not from the source field, but the data for the source field has an effect on the target field value, then this indirect data effect is set to the fdr relationship.

5. The compilation-based data relationship analysis method of claim 4, wherein the data relationship model comprises the following three layers:

and (4) task layer: a task for processing data in the data management platform;

and (3) a data layer: the method comprises the steps of including a table in a database, a directory and a file in the HDFS;

a field layer: the columns in the data table are indexed.

6. The compiling-mode-based data consanguinity relationship analysis method according to claim 4, wherein the field-level data tracing process in the step S4 is as follows:

setting the fdd relation to 1 and setting the fdr relation to 0;

performing logical AND operation on all relations on one link;

if the result is 1, the data flow from the source data to the target data is an actual data flow; otherwise, there is only a data impact relationship from the source data to the target data.

7. The compiling-mode-based data blooding source relation analyzing method according to claim 6, wherein the tracing process of the table data in the step S4 is as follows:

firstly, tracing the source of the table-level data one by one according to the blood relationship of the field-level data;

in the same source data and target data, if a plurality of data blood margin links exist, all blood margin links containing the fdr relation are removed;

and finally, carrying out data backtracking on the tables according to the blood relationship of the field-level data, and if the blood relationship exists in any field-level data of the two tables, judging that the two tables have the blood relationship.

8. The compiling method-based data consanguinity analysis method of claim 1, wherein the step S4 further comprises:

constructing various table and field mapping relations by adopting Map, and merging the table and field mapping relations;

performing two-level caching of an internal memory and a file level on the data blood relationship result, wherein the internal memory level stores the processing results of a plurality of SQL sentences and adopts an LRU expiration strategy; the file cache stores the processing result of the whole database batch SQL script, and simultaneously, the ProtoBuf mode is adopted for the file cache to carry out serialization or deserialization on the data.

9. The compiling method based data relationship analysis method according to claim 1, wherein the step S4 employs an adapter design mode to access various graphic layout algorithms, performs graphic layout calculation of data relationship, and performs visual display based on the layout result.

10. The compiling method-based data consanguinity relationship analysis method of claim 1, wherein the step S5 includes:

s51) naming the entity by using the uniform resource identifier, wherein different entity types can adopt different URIs;

s52) solving a hash value of the URI of the entity to obtain a unique identifier of the entity, and avoiding overlong character strings;

s53) the unique identifier of the relationship depends on the SQL statement and the corresponding conversion code, and is usually an expression;

s54) generalizing the SQL statement where the relation is located, and removing the constant in the where statement;

s55) solving a hash value of the generalized SQL statement text;

s56) solving a hash value of the conversion code of the relation;

s57) concatenating the hash values in S55, S56 as the unique identifier of the relationship;

s58) concatenating the hash values of the source entity, the target entity, and the relationship as a unique identifier of the "data transformation".