CN113934750A - Data blood relationship analysis method based on compiling mode - Google Patents
Data blood relationship analysis method based on compiling mode Download PDFInfo
- Publication number
- CN113934750A CN113934750A CN202111246040.0A CN202111246040A CN113934750A CN 113934750 A CN113934750 A CN 113934750A CN 202111246040 A CN202111246040 A CN 202111246040A CN 113934750 A CN113934750 A CN 113934750A
- Authority
- CN
- China
- Prior art keywords
- data
- relationship
- sql
- field
- blood
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 239000008280 blood Substances 0.000 title claims abstract description 101
- 210000004369 blood Anatomy 0.000 title claims abstract description 100
- 238000004458 analytical method Methods 0.000 title claims abstract description 58
- 238000013515 script Methods 0.000 claims abstract description 24
- 238000013523 data management Methods 0.000 claims abstract description 13
- 230000007704 transition Effects 0.000 claims abstract description 8
- 238000012800 visualization Methods 0.000 claims abstract description 6
- 230000002085 persistent effect Effects 0.000 claims abstract description 5
- 238000000034 method Methods 0.000 claims description 26
- 238000006243 chemical reaction Methods 0.000 claims description 21
- 238000004422 calculation algorithm Methods 0.000 claims description 14
- 238000013501 data transformation Methods 0.000 claims description 13
- 238000012545 processing Methods 0.000 claims description 13
- 230000008569 process Effects 0.000 claims description 12
- 238000004364 calculation method Methods 0.000 claims description 6
- 238000013507 mapping Methods 0.000 claims description 6
- 238000013461 design Methods 0.000 claims description 4
- 230000000694 effects Effects 0.000 claims description 4
- 230000000007 visual effect Effects 0.000 claims description 4
- 230000009466 transformation Effects 0.000 claims description 3
- 238000005516 engineering process Methods 0.000 description 6
- 238000011161 development Methods 0.000 description 4
- 238000010276 construction Methods 0.000 description 3
- 230000002776 aggregation Effects 0.000 description 2
- 238000004220 aggregation Methods 0.000 description 2
- 238000013459 approach Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 2
- 230000005540 biological transmission Effects 0.000 description 2
- 230000001419 dependent effect Effects 0.000 description 2
- 238000010586 diagram Methods 0.000 description 2
- 230000006870 function Effects 0.000 description 2
- 230000002688 persistence Effects 0.000 description 2
- 241001124156 Mecoptera Species 0.000 description 1
- 230000015572 biosynthetic process Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 238000007405 data analysis Methods 0.000 description 1
- 238000013500 data storage Methods 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 238000012252 genetic analysis Methods 0.000 description 1
- 238000011835 investigation Methods 0.000 description 1
- 210000001503 joint Anatomy 0.000 description 1
- 238000007726 management method Methods 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000013441 quality evaluation Methods 0.000 description 1
- 230000009897 systematic effect Effects 0.000 description 1
- 238000013519 translation Methods 0.000 description 1
- 238000013024 troubleshooting Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data blood relationship analysis method based on a compiling mode, which comprises the following steps: s1) collecting SQL scripts and metadata on the data management platform to form a uniform format; s2) analyzing the collected SQL script to form an abstract syntax tree, and performing semantic analysis or SQL interpretation according to metadata; s3) constructing a data blood relationship based on the SQL syntax tree with semantic information and the data blood relationship model; s4) according to the data blood relationship, carrying out the position layout of graphic elements, and carrying out the visualization display and the data tracing operation; s5) determining a unique identifier for each data transition in the data context, and persisting the data context. The invention can obtain correct tracing results, realizes good data blood relationship analysis support for a plurality of databases, and is easy to support for a data management platform emerging later.
Description
Technical Field
The invention relates to a data blood relationship analysis method, in particular to a data blood relationship analysis method based on a compiling mode.
Background
With the rapid development and application of the technology in the field of big data, data blood relationship analysis becomes an important component of data management, and the data blood relationship analysis technology becomes an important means for data source tracing, data value evaluation, data compliance proof and data quality evaluation. The current data blood relationship analysis technology has three main types:
1. a pattern matching based approach;
2. a tag technology based approach;
3. based on the manner of code parsing technique.
The code analysis technology-based mode can completely track the flow of data by performing algorithm analysis on codes for data processing, conversion and transmission, and know where the data comes from and goes to and what changes occur in the flow process.
In the field of big data, the SQL language is widely used for transmission, conversion and loading of data. Therefore, the SQL statement codes are analyzed by using the compiling technology, the data conversion and flow between the database table and the field can be quickly, automatically and timely combed, a complete data consanguinity relationship is formed, important metadata is provided for data management, the data use efficiency of developers and business personnel is improved, and the potential value of the data is fully mined.
The prior art mainly has the following potential problems:
first, most of the data blood margin analysis tools commonly used in the market are based on the data blood margin analysis tools provided by the data governance platform, and then are improved and integrated. For example, secondary development is performed by using data margin analysis capability of Hive.
Potential problems with the use of such tools:
1. the data margin analysis tool is bound to a particular platform and only supports that platform.
2. The discovered problems can only be solved by depending on the provider of the data governance platform.
3. Data blooding margins of multiple data platforms cannot be analyzed in an integrated manner.
For example, for the big data analysis tool Google BigQuery, it is obviously not feasible to use Hadoop Hive's blood margin analysis tool to analyze its data blood margin relationship.
The data blood margin analysis tool developed based on the invention realizes good data blood margin analysis support for a plurality of databases and is easy to support for a data governance platform emerging in the future.
The existing data blood relationship development tools lack comprehensive and systematic description of data blood relationship, do not have a unified data blood relationship model, and are not beneficial to managing data among multiple data management platforms.
And thirdly, no clear algorithm is provided for tracing and persisting the data by using the data blood relationship.
Fourthly, the SQL analysis-based data blood relationship analysis method has the important precondition that SQL grammars of different database manufacturers can be completely analyzed. However, the existing data blood relationship analysis tools generally use the SQL parser provided by the data management platform itself, and cannot completely parse SQL statements of other manufacturers, and the cross-platform performance is lost.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a data blood relationship analysis method based on a compiling mode, which can obtain a correct tracing result and is suitable for different data management platforms.
The technical scheme adopted by the invention for solving the technical problems is to provide a data blood relationship analysis method based on a compiling mode, which comprises the following steps: s1) collecting SQL scripts and metadata on the data management platform to form a uniform format; s2) analyzing the collected SQL script to form an abstract syntax tree, and performing semantic analysis or SQL interpretation according to metadata; s3) constructing a data blood relationship according to the data blood relationship model based on the SQL syntax tree with semantic information; s4) according to the data blood relationship, carrying out the position layout of graphic elements, and carrying out the visualization display and the data tracing operation; s5) determining a unique identifier for each data transition in the data context, and persisting the data context.
Further, the SQL script and the metadata in step S1 are obtained as follows: for the distributed file system, a file copying mode is adopted for direct acquisition; for the database, the JDBC driver is connected with the database to read the data in the table; for a source code version control server, acquiring a file by using a source code version control tool client interface; for a dedicated system, the data is acquired in real time using the interface provided by the dedicated system.
Further, the step S2 includes: s21) establishing corresponding lexical and syntactic analyzers based on SQL grammars of different databases, and carrying out lexical and syntactic analysis on SQL sentences; s22) constructing an SQL statement abstract syntax tree with a unified structure; s23) performing semantic analysis based on the abstract syntax tree, and establishing correct corresponding relation of tables and fields; s24) traversing the abstract syntax tree by combining the externally provided metadata, and eliminating the ambiguous relation of the table and the field; s25) interpreting and executing the dynamic SQL statement to further refine the correct correspondence of tables and fields.
Further, the elements in the data relationship model in the step S3 include entities and relationships, each entity includes one or more attributes; data between entities are converted and flowed through various processing processes; each data transformation comprises two entities and a relationship, and a plurality of data transformations are combined to form a data consanguinity relationship; setting this direct data flow to the fdd relationship if the data of the destination field comes from the source field through some transformation; if the data for the target field is not from the source field, but the data for the source field has an effect on the target field value, then this indirect data effect is set to the fdr relationship.
Further, the data blood relationship model comprises the following three levels: and (4) task layer: a task for processing data in the data management platform; and (3) a data layer: the method comprises the steps of including a table in a database, a directory and a file in the HDFS; a field layer: the columns in the data table are indexed.
Further, the field-level data tracing process in step S4 is as follows: setting the fdd relation to 1 and setting the fdr relation to 0; performing logical AND operation on all relations on one link; if the result is 1, the data flow from the source data to the target data is an actual data flow; otherwise, there is only a data impact relationship from the source data to the target data.
Further, the table-level data tracing process in step S4 is as follows: firstly, tracing the source of the table-level data one by one according to the blood relationship of the field-level data; in the same source data and target data, if a plurality of data blood margin links exist, all blood margin links containing the fdr relation are removed; and finally, carrying out data backtracking on the tables according to the blood relationship of the field-level data, and if the blood relationship exists in any field-level data of the two tables, judging that the two tables have the blood relationship.
Further, the step S4 further includes: constructing various table and field mapping relations by adopting Map, and merging the table and field mapping relations; performing two-level caching of an internal memory and a file level on the data blood relationship result, wherein the internal memory level stores the processing results of a plurality of SQL sentences and adopts an LRU expiration strategy; the file cache stores the processing result of the whole database batch SQL script, and simultaneously, the ProtoBuf mode is adopted for the file cache to carry out serialization or deserialization on the data.
Further, the step S4 adopts an adapter design mode to access various graphic layout algorithms, performs graphic layout calculation of data blood relationship, and performs visual display based on the layout result.
Further, the step S5 includes: s51) naming the entity by using the uniform resource identifier, wherein different entity types can adopt different URIs; s52) solving a hash value of the URI of the entity to obtain a unique identifier of the entity, and avoiding overlong character strings; s53) the unique identifier of the relationship depends on the SQL statement and the corresponding conversion code, and is usually an expression; s54) generalizing the SQL statement where the relation is located, and removing the constant in the where statement; s55) solving a hash value of the generalized SQL statement text; s56) solving a hash value of the conversion code of the relation; s57) concatenating the hash values in S55, S56 as the unique identifier of the relationship; s58) concatenating the hash values of the source entity, the target entity, and the relationship as a unique identifier of the "data transformation".
Compared with the prior art, the invention has the following beneficial effects: the compiling-mode-based data blood relationship analysis method provided by the invention can obtain correct tracing results, and realizes good data blood relationship analysis support for AWS athena, Google bigquery, corehbase, dax, db2, greenplus, SAP Handa, live, inpela, informix, mdx, mysql, netezza, odbc, openedge, oracle, postgresql, presto, redshift, snowfly, sparksql, sqlserver, sybase, terada and verttica, and is easy to support for a data governance platform which appears later.
Drawings
FIG. 1 is a flow chart of data relationship analysis based on compiling mode according to the present invention;
FIG. 2 is a diagram of a data session including a data transformation performed by SQL of the present invention;
FIG. 3 is a visualized blood relationship diagram according to the present invention.
Detailed Description
The invention is further described below with reference to the figures and examples.
FIG. 1 is a flow chart of the data relationship analysis based on compiling method of the present invention.
Referring to fig. 1, the main process of the data relationship analysis method based on compiling mode provided by the present invention includes:
s1) collecting the SQL script and the metadata on the data governance platform to form a uniform format.
S2) parsing the collected SQL script to form an abstract syntax tree, performing semantic analysis based on metadata, or interpreting and executing SQL.
S3) constructing the data blood relationship based on the SQL syntax tree with semantic information and the data blood relationship model.
S4) according to the data blood relationship, carrying out position layout of graphic elements, carrying out visual display, carrying out data tracing and other operations.
S5) determining a unique identifier for each data transition in the data context, and persisting the data context.
First, collect SQL script and metadata on data governance platform
SQL script and metadata storage location
1. And the file system comprises an HDFS distributed file system.
2. A database.
3. The source code version controls the server.
4. A dedicated system.
SQL script and metadata storage format
1. A text file containing only SQL scripts.
2. Formatted files, including SQL scripts and metadata, such as CSV files.
3. A database table.
SQL script and metadata acquisition mode
1. And copying the file.
2. And connecting the database through JDBC drive, and reading data in the table.
3. And acquiring the file by using a client interface of the source code version control tool.
4. The data is acquired in real time through an interface of a dedicated system, such as hook by Hive.
Second, data blood relationship model
Data blood relationship plays an increasingly important role in data governance of data governance platforms. The hive of the Hadoop platform supports the analysis of the data blood relationship, but lacks the complete definition and description of the data blood relationship, is limited to the use of the Hadoop platform, and cannot be used for uniformly representing the data blood relationship under the condition of mixing a plurality of data platforms.
The invention provides a data blood relationship model, and the data blood relationship can be modeled, calculated, stored and displayed based on the model.
2.1 hierarchy of data relationship
In general, data-based genetic analysis comprises the following three levels:
1. task layer
The task layer refers to a task for processing data in the data governance platform, typically a job in Spark/Hive.
2. Data layer
The data layer contains tables in the database, directories, files in the HDFS, etc. The objects of this layer are the subjects of data consanguinity.
3. Field layer
A field generally refers to a column in a data table. The data relationship established in the layer can provide the finest and accurate analysis result for the user, and the data relationship of other layers is deduced through the data relationship of the layer.
2.2 entities and relationships in data context
All objects in the data relationship hierarchy are collectively referred to as entities (entries), each of which contains one or more attributes (attributes).
The jobin Spark/Hive of the task layer is an entity, and the entity has attributes such as name, running time and the like.
The most important entity in the data layer is a table (table), and the attribute in the table is called field (column) in the invention.
Data between entities is transformed and streamed through various processes, and such transformation and streaming is called relationships between entities. There are two types of relationships:
1. direct data flow: the data of the destination field is converted from the source field by some kind, and the relationship of the blood relationship is referred to as fdd.
2. Indirect data influence, the data of the target field is not from the source field, but the data of the source field has influence on the formation of the value of the target field, and the relationship is called fdr.
To each relation, other information such as the SQL statement type for performing data conversion, the code indicating the data conversion rule, and the number of data conversions is attached. Through the data conversion rule, the invention can evaluate the quality of the data. By counting the data conversion times, the invention can know the use frequency of the data.
The data relation between the fields is the finest granularity relation, and the data blood relation between the upper-layer entities can be calculated through the field blood relation.
2.3 atomic units of data kindred relationship
Using the entities and relationships defined above, the present invention can give the atomic units of the kindred relationship: data transform (data transform). The whole data blood relationship is the set of all atomic units.
Each data transformation contains two entities and a relationship, the two entities being objects in the same data layer, the relationship describing the data flow between the entities.
Take the following SQL statement as an example:
SELECT FirstName as Name from Person
this SQL-formed data lineage contains one data transformation, as shown in fig. 2.
The data representing the Name field in the target entity data set select list RS-1 comes from the FirstName field in the source entity table Person.
2.4 data relationship by data transformation
The data blooding relationship graph can be formed by combining a plurality of data conversions. Take this create view SQL statement as an example
create view v(myname)
as SELECT p.FirstName as name
from Person AS p
A data blood relationship is formed by converting two data with the relationship type of fdd.
Person.FirstName->fdd->RS-1.name->fdd->v.myname
The visualized blood relationship graph is shown in FIG. 3; the actual data-blood relationship is far more complex than this, but the rationale is the same.
Third, construction of data relationship of blood relationship
With the data blood relationship model introduced above, the invention can perform actual SQL script blood relationship analysis work.
3.1 parsing SQL scripts
In order to support SQL grammars of different databases and avoid dependence on SQL analysis tools of any database manufacturers, a general SQL parser is provided, which can completely support the SQL grammars of the different databases and comprises a storage process, so that data blood-related analysis across database platforms becomes possible. Running the generic SQL parser does not rely on any third party libraries. The general SQL parser is implemented as follows:
1. and establishing a corresponding lexical and syntactic analyzer based on SQL grammars of different databases, and performing lexical and syntactic analysis on the SQL sentences.
2. And constructing an SQL statement abstract syntax tree with a unified structure.
3. Semantic analysis is performed based on the abstract syntax tree, and correct corresponding relations between tables and fields are established.
4. Traversing the abstract syntax tree in conjunction with externally provided metadata disambiguates certain tables and fields.
5. And interpreting and executing the dynamic SQL statement to further perfect the correct corresponding relation of the table and the field.
3.2 construction of the relationship of the blood relationship of SQL statement data
After the analysis of the SQL script is completed and the syntax tree with semantic information is obtained, the invention has complete information of the source data table and the fields and the corresponding target data table and the fields. How to establish data lineage relationships for different types of SQL statements is described in detail below.
1, SELECT SQL statement
The SELECT statement is the most commonly used statement in SQL, and is also commonly used as a component of other SQL statements.
i.select list
SELECT p.FirstName as name
from Person AS p
Establishing an fdd type data relationship from a source field to a target field:
Person.FirstName->fdd->resultSet.name
ii.where clause
SELECT p.FirstName as name
from Person AS p
where city='NY'
a data lineage relationship of fdr type is established indicating that the number of rows in the result set produced by select list is affected by person.
Person.city->fdr->resultSet.rows
iii.group by
SELECT dept,sum(sal)salary
from scott.emp
group by dept
An fdr-type data context is established, indicating that the result value produced by the aggregation function sum () is affected by person.
Person.dept->fdr->resultSet.salary
At the same time, a data lineage relationship of the fdd type is also established, indicating that the result value produced by the aggregation function sum () comes from person.
Person.sal->fdd->resultSet.salary
UPDATE SQL statement
UPDATE Per
SET Per.PersonCityName=Addr.City
FROM Persons Per
INNER JOIN AddressList Addr
ON Per.PersonId=Addr.PersonId
Establish a data lineage relationship of fdd type, indicating that persons. personCityName value comes from addressList. City.
AddressList.City->fdd->Persons.PersonCityName
3, INSERT SQL statement
INSERT INTO Customers(CustomerName,City,Country)
SELECT SupplierName,City,Country FROM Suppliers;
Establishing three data blood relationship of the fdd type, which are respectively as follows:
Suppliers.SupplierName->fdd->Customers.CustomerName
Suppliers.City->fdd->Customers.City
Suppliers.Country->fdd->Customers.Country
4. more SQL statements
The data blood relationship model can represent the data blood relationship generated by all SQL sentences, and is not repeated herein.
3.3 method for tracing field level data
The construction of the data blood relationship is completed, and the data tracing can be performed according to the invention.
One data may have multiple data sources and may be processed from multiple processes. In a big data environment, the general user is concerned with the relationship that there is actual data flow, i.e., the fdd type of relationship, and the pair fdr relationship is used when troubleshooting more detailed problems.
Within the same source and target data, there are typically multiple data lineage links representing different data processing. The algorithm provided by the invention can be used for quickly and conveniently finding out the blood margin link with actual data flow, removing the blood margin link containing the fdr relation and facilitating the user to quickly position and check the blood margin relation of the data.
Column1 has the following data bloodline links:
data blood margin link:
1:tableA.column1->fdd->tableC.column->fdd->tableD.column->fdd->tableB.column1
2:tableA.column1->fdd->tableE.column->fdr->tableF.column->fdd->tableB.column1
3:tableA.column1->fdd->tableG.column->fdd->tableH.column->fdr->tableB.column1
the user wants to know whether the target data tableb. column1 has actual data from tablea. column1, the present invention performs the following calculation:
1. the fdd relation is set to 1, and the fdr relation is set to 0.
2. All relationships on one link are logically anded.
3. If the result is 1, it indicates that the data flow from the source data to the target data is actual data flow. Otherwise, there is only a data impact relationship from the source data to the target data.
According to the above algorithm:
the final blood relationship for link 1 is: 1&1&1 ═ 1, which is the actual data stream.
The final blood relationship for link 2 is: 1&0&1 ═ 0, to influence the data stream.
The final blood relationship for link 3 is: 1&1&0 ═ 0, to influence the data stream.
0604. Method for tracing table-level data
The flow of data generally occurs at the field level, so table-level data tracing must be based on field-level data consanguinity. If the source is directly traced from the blood-related relationship between tables, an erroneous tracing result may be generated.
Taking this SQL script as an example:
insert into B(a1,a2,a3)select a1,a2,a3 from A1;
insert into B(a1,a2,a3,a4)select a1,a2,a3,a4 from A2;
UPDATE C1 SET a2=(SELECT a2 FROM B WHERE B.a1=C1.a1);
UPDATE C2 SET a4=(SELECT a4 FROM B WHERE B.a3=C2.a3);
if only the data blood relationship of the table level is concerned, the invention obtains the following data blood relationship:
A1->fdd->B
A2->fdd->B
B->fdd->C1
B->fdd->C2
according to the data backtracking of the tables C1 and C2 based on the blood relationship, the invention can obtain:
c1 dependence on A1
C1 dependence on A2
C2 dependence on A1
C2 dependence on A2
Looking at SQL scripts, the present invention can find that C2 is not dependent on A1, so this result is erroneous. The reason is that the data consanguinity relationship at the table level does not represent complete data flow information.
The following are the bloody relationships at the SQL statement field level, with the characters in parentheses representing the field name
A1(a2)->fdd->B(a2)
A2(a2)->fdd->B(a2)
A2(a4)->fdd->B(a4)
B(a2)->fdd->C1(a2)
B(a4)->fdd->C2(a4)
According to the relationship of the blood relationship, the invention can obtain the correct tracing result:
c1 dependence on A1
C1 dependence on A2
C2 is dependent on A2.
Through investigation on a large number of enterprise data management platforms, the invention analyzes and obtains an empirical formula of data quantity calculation of data blood relationship in an enterprise-level data platform:
count(relation)=(count(table)+count(view))*10*2;
relationship represents the single atom transform in the data edge, count (relationship) represents the total number of atom transforms in the entire data edge analysis result, count (table) represents the number of all tables, and count (view) represents the number of all views.
A data governance platform of a general enterprise may include ten thousand tables and views, where 4 ten thousand tables and views are taken as examples, and count (relationship) ═ 40000+40000 × 10 × 2 ═ 1600000.
Therefore, the data blood margin analysis result generates million-level relations, the data blood margin of the magnitude is constructed, traced and stored, the tracing time complexity of the algorithm is O (n), and the results are obtained in 10 seconds through the query of the table and the fields. In order to meet the data blood relationship analysis requirements of most enterprises and further improve the data utilization efficiency of the enterprises, the invention further adopts the following optimization algorithm for the data blood relationship:
1. and constructing various tables and field mapping relations by adopting Map, wherein the time complexity of the search is O (1).
2. And merging according to the table and field mapping relation, wherein the time complexity of the merging result is O (n).
3. And performing two-level caching of a memory and a file level on the data blood relationship result.
4. The memory level mainly stores some small SQL processing results, and adopts an LRU (Least Recently Used) expiration strategy, which is a commonly Used page replacement algorithm.
5. And the file cache stores the processing result of the whole database batch SQL script.
6. The ProtoBuf mode is adopted for file caching to serialize/deserialize data, and the deserialization time of million-level data can be compressed to be within 5 seconds. Protobuf is a set of methods developed by Google to serialize data structures, and can be used as a communication protocol, a data storage format, and the like.
Fourth, visualization of data blood relationship
When the data relationship is calculated, in order to present the data relationship to the user in a reasonable and beautiful graphic manner, the entity elements and the relationship in the relationship need to be graphically laid out, and the coordinate position of each element in the graphic needs to be calculated.
The invention adopts an adapter design mode, and a visualization module of the data blood relationship adopts the adapter design mode to access various graphic layout algorithms, perform graphic layout calculation of the data blood relationship, and then perform visualization display based on a layout result.
The calculation of the graphical layout algorithm may be done at the user terminal and then displayed visually. Or the method can be finished at a server side, and the layout coordinate data of the graph is sent to the user terminal, and in this way, the user terminal only carries out visual display.
The invention does not provide a graphic layout algorithm, and developers can adopt commercial, open source or self-developed graphic layout algorithms by themselves.
Fifth, persistence of data relationship
Data governance is a system engineering, and analyzing data blood relationship is an important link. After the data relationship analysis is completed, the data relationship analysis is usually required to be in butt joint with a data directory (data catalog) of a data management platform, the data relationship is stored in the data directory as metadata, persistence is completed, the data relationship is accumulated, and query, development and management of subsequent services are facilitated. The data consanguinity relationships may also be stored in a relational or graphical database.
The data relationship is inserted, updated and deleted into the data directory by taking 'data conversion' as a unit. To perform these operations correctly, the problem of unique identification of each "data transformation" needs to be solved.
The present invention recognizes that a "data transformation" comprises two entities and a relationship, and by uniquely identifying the two entities and the relationship, the present invention can uniquely identify the "data transformation". The specific implementation algorithm is as follows:
1. entities are named with Uniform Resource Identifiers (URIs). Different entity types may employ different URIs, which for a field take the form of: table column @ cluster.
2. And solving a hash value of the URI of the entity to obtain the unique identifier of the entity, and simultaneously avoiding overlong character strings.
3. The unique identifier of the relationship depends on the SQL statement in which it resides, and the corresponding translation code, usually an expression.
4. And generalizing the SQL statement where the relation is located, and removing the constant in the where statement.
5. And solving a hash value of the generalized SQL statement text.
6. And solving a hash value of the conversion code of the relation.
7. The hash value in the concatenation 5, 6 serves as the unique identifier for the relationship.
8. And splicing the hash values of the source entity, the target entity and the relationship to serve as the unique identifier of the data conversion.
The procedure for writing data context is as follows:
1. a unique identifier for each "data conversion" is generated.
2. If the "data transition" is not found in the data directory, a "data transition" record is inserted.
3. If the "data transitions" already exist in the data directory, the number of "data transitions" is updated.
The invention can accurately acquire the quality and the heat of the data, and can investigate the quality and the heat of the data by analyzing the data blood relationship in the data catalogue. Looking at the conversion rule codes contained in the data conversion relation, the conversion process of the data is known, and therefore the quality of the target data is known. The number of updates of each "data conversion" is checked to see the heat of the data.
Although the present invention has been described with respect to the preferred embodiments, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (10)
1. A data blood relationship analysis method based on a compiling mode is characterized by comprising the following steps:
s1) collecting SQL scripts and metadata on the data management platform to form a uniform format;
s2) analyzing the collected SQL script to form an abstract syntax tree, and performing semantic analysis or SQL interpretation according to metadata;
s3) constructing a data blood relationship based on the SQL syntax tree with semantic information and the data blood relationship model;
s4) according to the data blood relationship, carrying out the position layout of graphic elements, and carrying out the visualization display and the data tracing operation;
s5) determining a unique identifier for each data transition in the data context, and persisting the data context.
2. The compiling-mode-based data consanguinity relationship analysis method according to claim 1, wherein the SQL script and the metadata are obtained in the following manner in step S1:
for the distributed file system, a file copying mode is adopted for direct acquisition;
for the database, the JDBC driver is connected with the database to read the data in the table;
for a source code version control server, acquiring a file by using a source code version control tool client interface;
for a dedicated system, the data is acquired in real time using the interface provided by the dedicated system.
3. The compiling method-based data consanguinity relationship analysis method of claim 1, wherein the step S2 includes:
s21) establishing corresponding lexical and syntactic analyzers based on SQL grammars of different databases, and carrying out lexical and syntactic analysis on SQL sentences;
s22) constructing an SQL statement abstract syntax tree with a unified structure;
s23) performing semantic analysis based on the abstract syntax tree, and establishing correct corresponding relation of tables and fields;
s24) traversing the abstract syntax tree by combining the externally provided metadata, and eliminating the ambiguous relation of the table and the field;
s25) interpreting and executing the dynamic SQL statement to further refine the correct correspondence of tables and fields.
4. The compilation-based data relationship analysis method according to claim 1, wherein the elements in the data relationship model in the step S3 include entities and relationships, each entity including one or more attributes; data between entities are converted and flowed through various processing processes; each data transformation comprises two entities and a relationship, and a plurality of data transformations are combined to form a data consanguinity relationship; setting this direct data flow to the fdd relationship if the data of the destination field comes from the source field through some transformation; if the data for the target field is not from the source field, but the data for the source field has an effect on the target field value, then this indirect data effect is set to the fdr relationship.
5. The compilation-based data relationship analysis method of claim 4, wherein the data relationship model comprises the following three layers:
and (4) task layer: a task for processing data in the data management platform;
and (3) a data layer: the method comprises the steps of including a table in a database, a directory and a file in the HDFS;
a field layer: the columns in the data table are indexed.
6. The compiling-mode-based data consanguinity relationship analysis method according to claim 4, wherein the field-level data tracing process in the step S4 is as follows:
setting the fdd relation to 1 and setting the fdr relation to 0;
performing logical AND operation on all relations on one link;
if the result is 1, the data flow from the source data to the target data is an actual data flow; otherwise, there is only a data impact relationship from the source data to the target data.
7. The compiling-mode-based data blooding source relation analyzing method according to claim 6, wherein the tracing process of the table data in the step S4 is as follows:
firstly, tracing the source of the table-level data one by one according to the blood relationship of the field-level data;
in the same source data and target data, if a plurality of data blood margin links exist, all blood margin links containing the fdr relation are removed;
and finally, carrying out data backtracking on the tables according to the blood relationship of the field-level data, and if the blood relationship exists in any field-level data of the two tables, judging that the two tables have the blood relationship.
8. The compiling method-based data consanguinity analysis method of claim 1, wherein the step S4 further comprises:
constructing various table and field mapping relations by adopting Map, and merging the table and field mapping relations;
performing two-level caching of an internal memory and a file level on the data blood relationship result, wherein the internal memory level stores the processing results of a plurality of SQL sentences and adopts an LRU expiration strategy; the file cache stores the processing result of the whole database batch SQL script, and simultaneously, the ProtoBuf mode is adopted for the file cache to carry out serialization or deserialization on the data.
9. The compiling method based data relationship analysis method according to claim 1, wherein the step S4 employs an adapter design mode to access various graphic layout algorithms, performs graphic layout calculation of data relationship, and performs visual display based on the layout result.
10. The compiling method-based data consanguinity relationship analysis method of claim 1, wherein the step S5 includes:
s51) naming the entity by using the uniform resource identifier, wherein different entity types can adopt different URIs;
s52) solving a hash value of the URI of the entity to obtain a unique identifier of the entity, and avoiding overlong character strings;
s53) the unique identifier of the relationship depends on the SQL statement and the corresponding conversion code, and is usually an expression;
s54) generalizing the SQL statement where the relation is located, and removing the constant in the where statement;
s55) solving a hash value of the generalized SQL statement text;
s56) solving a hash value of the conversion code of the relation;
s57) concatenating the hash values in S55, S56 as the unique identifier of the relationship;
s58) concatenating the hash values of the source entity, the target entity, and the relationship as a unique identifier of the "data transformation".
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111246040.0A CN113934750B (en) | 2021-10-26 | 2021-10-26 | Compiling mode-based data blood relationship analysis method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111246040.0A CN113934750B (en) | 2021-10-26 | 2021-10-26 | Compiling mode-based data blood relationship analysis method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113934750A true CN113934750A (en) | 2022-01-14 |
CN113934750B CN113934750B (en) | 2024-10-01 |
Family
ID=79284440
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111246040.0A Active CN113934750B (en) | 2021-10-26 | 2021-10-26 | Compiling mode-based data blood relationship analysis method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113934750B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114116856A (en) * | 2022-01-25 | 2022-03-01 | 中电云数智科技有限公司 | Field level blood relationship analysis method based on data management full link |
CN115687309A (en) * | 2022-12-30 | 2023-02-03 | 浙江大学 | Non-invasive cigarette warehouse-in and warehouse-out full-process data blood margin construction method and device |
CN118012964A (en) * | 2024-04-08 | 2024-05-10 | 天津南大通用数据技术股份有限公司 | Blood relationship generation method based on relational algebra |
CN118069701A (en) * | 2024-04-25 | 2024-05-24 | 深圳市万物云科技有限公司 | Reverse query link construction method, reverse query link construction device, computer equipment and storage medium |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040210607A1 (en) * | 2003-04-17 | 2004-10-21 | Oracle International Corporation | Metamodel-based metadata change management |
CN109684402A (en) * | 2018-12-21 | 2019-04-26 | 福建南威软件有限公司 | One kind being based on big data platform metadata genetic connection implementation method |
CN110232056A (en) * | 2019-05-21 | 2019-09-13 | 苏宁云计算有限公司 | A kind of the blood relationship analytic method and its tool of structured query language |
CN110807026A (en) * | 2019-10-24 | 2020-02-18 | 北京中科捷信信息技术有限公司 | Automatic capture system for analyzing financial big data blood relationship |
CN111627552A (en) * | 2020-04-08 | 2020-09-04 | 湖南长城医疗科技有限公司 | Medical streaming data blood relationship analysis and storage method and device |
WO2020238597A1 (en) * | 2019-05-27 | 2020-12-03 | 深圳前海微众银行股份有限公司 | Hadoop-based data updating method, device, system and medium |
CN113326286A (en) * | 2021-08-03 | 2021-08-31 | 杭州量之智能科技有限公司 | Semantic analysis method supporting dialect SQL blood margin analysis |
WO2021174945A1 (en) * | 2020-10-21 | 2021-09-10 | 平安科技(深圳)有限公司 | Data cost calculation method, system, computer device, and storage medium |
-
2021
- 2021-10-26 CN CN202111246040.0A patent/CN113934750B/en active Active
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20040210607A1 (en) * | 2003-04-17 | 2004-10-21 | Oracle International Corporation | Metamodel-based metadata change management |
CN109684402A (en) * | 2018-12-21 | 2019-04-26 | 福建南威软件有限公司 | One kind being based on big data platform metadata genetic connection implementation method |
CN110232056A (en) * | 2019-05-21 | 2019-09-13 | 苏宁云计算有限公司 | A kind of the blood relationship analytic method and its tool of structured query language |
WO2020238597A1 (en) * | 2019-05-27 | 2020-12-03 | 深圳前海微众银行股份有限公司 | Hadoop-based data updating method, device, system and medium |
CN110807026A (en) * | 2019-10-24 | 2020-02-18 | 北京中科捷信信息技术有限公司 | Automatic capture system for analyzing financial big data blood relationship |
CN111627552A (en) * | 2020-04-08 | 2020-09-04 | 湖南长城医疗科技有限公司 | Medical streaming data blood relationship analysis and storage method and device |
WO2021174945A1 (en) * | 2020-10-21 | 2021-09-10 | 平安科技(深圳)有限公司 | Data cost calculation method, system, computer device, and storage medium |
CN113326286A (en) * | 2021-08-03 | 2021-08-31 | 杭州量之智能科技有限公司 | Semantic analysis method supporting dialect SQL blood margin analysis |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114116856A (en) * | 2022-01-25 | 2022-03-01 | 中电云数智科技有限公司 | Field level blood relationship analysis method based on data management full link |
CN115687309A (en) * | 2022-12-30 | 2023-02-03 | 浙江大学 | Non-invasive cigarette warehouse-in and warehouse-out full-process data blood margin construction method and device |
CN115687309B (en) * | 2022-12-30 | 2023-04-18 | 浙江大学 | Non-invasive cigarette warehouse-in and warehouse-out full-process data blood margin construction method and device |
CN118012964A (en) * | 2024-04-08 | 2024-05-10 | 天津南大通用数据技术股份有限公司 | Blood relationship generation method based on relational algebra |
CN118069701A (en) * | 2024-04-25 | 2024-05-24 | 深圳市万物云科技有限公司 | Reverse query link construction method, reverse query link construction device, computer equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN113934750B (en) | 2024-10-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11907247B2 (en) | Metadata hub for metadata models of database objects | |
US8131744B2 (en) | Well organized query result sets | |
CN113934750B (en) | Compiling mode-based data blood relationship analysis method | |
US8037108B1 (en) | Conversion of relational databases into triplestores | |
US8527502B2 (en) | Method, system and computer-readable media for software object relationship traversal for object-relational query binding | |
US7689555B2 (en) | Context insensitive model entity searching | |
CN105989150B (en) | A kind of data query method and device based on big data environment | |
US20100017395A1 (en) | Apparatus and methods for transforming relational queries into multi-dimensional queries | |
CA2429910A1 (en) | System and method of query transformation | |
JP3914662B2 (en) | Database processing method and apparatus, and medium storing the processing program | |
US7792851B2 (en) | Mechanism for defining queries in terms of data objects | |
US11455283B2 (en) | Candidate element selection using significance metric values | |
US11341142B2 (en) | Framework and metadata artefacts for updating data artefacts | |
JP5927886B2 (en) | Query system and computer program | |
CN117076742A (en) | Data blood edge tracking method and device and electronic equipment | |
Elamparithi et al. | A Review on Database Migration Strategies, Techniques and Tools | |
US9703825B2 (en) | Maintenance of a pre-computed result set | |
US12093289B2 (en) | Relationship-based display of computer-implemented documents | |
Cybula et al. | Query optimization through cached queries for object-oriented query language SBQL | |
KR101162468B1 (en) | Automatic data store architecture detection | |
Leeka et al. | RQ-RDF-3X: going beyond triplestores | |
Vidal et al. | Publication and maintenance of RDB2RDF views externally materialized in enterprise knowledge graphs | |
US11250010B2 (en) | Data access generation providing enhanced search models | |
US20240134849A1 (en) | Virtual access to parameterized data objects | |
Montolalu et al. | SQL And NoSQL Object Database Mapping to Support CRUD Operation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |