CN117235037A

CN117235037A - Data entity-based data lineage (blood relationship) tracing method

Info

Publication number: CN117235037A
Application number: CN202311025166.4A
Authority: CN
Inventors: 刘静涛; 岳丽军; 周万宁; 管东林; 屈峰; 王一; 苏思; 张建延; 张新建; 陈单英; 彦世兵; 杨春
Original assignee: Unit 91977 of PLA
Current assignee: Unit 91977 of PLA
Priority date: 2023-08-15
Filing date: 2023-08-15
Publication date: 2023-12-15

Abstract

The invention relates to a data lineage (blood relationship) tracing method based on data entities, comprising the following steps: establishing a mapping with the data entity, obtaining mapping association data, collecting the lineage records of the data entity's metadata, and generating traceability data; defining the lineage data and setting the granularity of the lineage relationship at the table level and the field level; parsing the mapping data to resolve the field-level and table-level lineage of the data entity; storing the data entity mapping relationships and inter-table relationships according to the structure of the field-level and table-level lineage, and obtaining the lineage levels and the mapping relationship data set corresponding to the data entity ID; and traversing the lineage data set with a recursive loop and a breadth-first algorithm to analyze, organize, and visually display the lineage data. The method can display the upstream and downstream context of data objects on a data management platform and supports rapid analysis of the impact among data objects.

Description

Data entity-based data lineage tracing method

Technical Field

The invention relates to the technical field of data processing, and in particular to a data lineage tracing method based on data entities.

Background

Mainstream methods of data provenance tracing and tracking include the annotation method, the reverse query method, the data tracking method, the bidirectional pointer tracking method, and tracking with a dedicated query language based on graph theory. The annotation method is simple and convenient, but it needs extra storage space for the annotation information and is poorly suited to fine-grained data, especially in large data sets. The reverse query method needs less storage space than the annotation method, but it is complex to implement, is limited in its application scenarios, and, when used alone, can hardly provide effective lineage tracing at the table and field level of data entities. The data lineage tracing described here adopts the reverse query method and combines it with data relationship mapping and a recursive breadth-first traversal algorithm to collect, query, and display data lineage. The method can satisfy lineage tracing at the table level and field level of data entities.

The principle of data lineage tracing is shown in FIG. 1.

Data lineage tracing establishes a mapping relationship with a data entity during the data life cycle, then collects the data relationships and the evolution of the data entity's metadata across stages such as collection, storage, processing, transmission, exchange, and archiving, generates a data tracing link, and stores the data (including data mapping relationships, inter-table relationships, and the like) according to the lineage data structure.

When a data entity is selected by query for lineage display, the data mappings corresponding to the data entity ID are matched and queried by association according to that ID, the lineage tracing algorithm traverses the data to form a data set, and a graph drawing algorithm then renders the data set as a visualization of the lineage across nodes of different levels.

Disclosure of Invention

The invention aims to provide a data entity-based data lineage tracing method that realizes table-level and field-level lineage tracing.

To achieve this purpose, the data entity-based data lineage tracing method of the invention includes:

step 1: establish a mapping with the data entity to obtain mapping association data, collect the lineage records of the data entity's metadata according to the mapping relationship, annotate the data, and generate traceability data;

step 2: define the lineage data according to the tracing requirements for the data entity, and set the coarse or fine granularity of the lineage at the table level and the field level;

step 3: parse the mapping data with a lineage analysis algorithm to resolve the field-level and table-level lineage of the data entity;

step 4: store the data entity mapping relationships and inter-table relationships according to the structure of the field-level and table-level lineage, using a relational database or a graph database as the storage database;

step 5: query the data entity for lineage tracing, and obtain the lineage levels and the mapping relationship data set corresponding to the data entity ID;

step 6: traverse the lineage data set with a recursive loop and a breadth-first algorithm, and analyze, organize, and visually display the lineage data.
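The traversal in step 6 can be illustrated with a minimal sketch. It assumes the stored lineage has already been loaded as a list of source-to-target edges keyed by data entity ID; the names `LineageTraversal`, `Edge`, and `traceUpstream` are hypothetical and do not come from the patent. The sketch uses an iterative queue for the breadth-first walk rather than the recursive loop the text mentions, and it records the lineage level of each upstream entity.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Hypothetical sketch: breadth-first upstream tracing over stored lineage edges. */
final class LineageTraversal {
    /** One stored mapping: data flows from the source entity to the target entity. */
    record Edge(String sourceId, String targetId) {}

    /**
     * Returns every upstream entity reachable from startId together with its
     * lineage level (1 = direct source, 2 = source of a source, and so on).
     */
    static Map<String, Integer> traceUpstream(List<Edge> edges, String startId) {
        // Index edges by target so the sources of an entity can be looked up directly.
        Map<String, List<String>> sourcesByTarget = new LinkedHashMap<>();
        for (Edge e : edges) {
            sourcesByTarget.computeIfAbsent(e.targetId(), k -> new ArrayList<>()).add(e.sourceId());
        }
        Map<String, Integer> levels = new LinkedHashMap<>();
        Queue<String> queue = new ArrayDeque<>();
        queue.add(startId);
        levels.put(startId, 0);
        while (!queue.isEmpty()) {
            String current = queue.remove();
            int nextLevel = levels.get(current) + 1;
            for (String source : sourcesByTarget.getOrDefault(current, List.of())) {
                // Skip entities that were already reached (shared sources or cycles).
                if (!levels.containsKey(source)) {
                    levels.put(source, nextLevel);
                    queue.add(source);
                }
            }
        }
        levels.remove(startId); // report only the upstream entities
        return levels;
    }
}
```

The level map is the kind of structure a drawing routine could use to place nodes in layers; how the patent's visualization actually consumes the data set is not specified in the text.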

Further, step 2 includes creating an ANTLR grammar file.

Further, step 3 includes generating lexical and syntactic analysis classes.

Further, step 3 includes tree parsing.

Further, the tree parsing includes parsing of field-table relationships.

Further, the tree parsing step includes: 1) parsing the INSERT; 2) parsing the SELECT; 3) handling cases containing an asterisk: if no insert field is specified in the insert statement, the field aliases of the first-layer query statement are used as the insert fields, and if a field has no alias, its field name is used as the insert field; if the insert field list is empty and the query field is an asterisk, this type of query is not supported for the time being; when the query contains an asterisk, if the first-layer query field is or contains an asterisk, the asterisk is replaced with the complement of the query field list with respect to the insert field list; if the last-layer query field contains an asterisk, the asterisk is replaced with the complement of that layer's query fields with respect to the parent query fields; if a middle-layer query contains an asterisk, the asterisk is replaced with the complement of that layer's query field list with respect to the sub-query field list; the unsupported cases are a UNION containing an asterisk and parent-child queries containing an asterisk; 4) parsing the association relationships.

The method of the invention has the following advantages:

The method generates the lineage of data objects directly through lexical analysis and visual data processing, clearly displays the upstream and downstream context of the data objects on the data management platform, and uses the lineage to quickly analyze the impact among data objects along a data flow link.

Drawings

FIG. 1 is a schematic diagram of the data lineage tracing principle;

FIG. 2 shows the lineage tracing implementation process.

Detailed Description

The technical solution of the present invention is described clearly and completely below in conjunction with specific embodiments. Those skilled in the art will understand that the embodiments described below only illustrate the present invention and should not be construed as limiting its scope. All other embodiments that a person of ordinary skill in the art can obtain from these embodiments without creative effort fall within the scope of the present invention.

Related technical terms of the invention:

Data mapping association: establishes an association mapping with the data entity object so that data lineage entity IDs can be matched.

Data definition: marks, on the data association mapping, the granularity at which lineage is recorded, such as table level, field level, or table-record level.

Lineage analysis: parses data records such as entity metadata mapping relationships and inter-table relationships according to the data definition.

Lineage storage: stores the data produced by lineage analysis. The stored content includes data entity metadata, lineage mapping relationships, inter-table relationships, and the like.

Lineage query: queries the lineage of a selected data entity, which can be selected or imported through a visual interface.

Lineage traversal analysis: analyzes and resolves the lineage according to a specified model or method.

Lineage display: presents the lineage visually according to the lineage levels and distribution of the data entity.

Tracing application: built on the basic tracing data and tracing access; application directions include data quality, audit trail, data re-derivation, data analysis, and other scenarios.

Data lineage reflects the origin and evolution of data. It helps track where data comes from and how it is processed, and it presents the source, transformation, storage, and other stages of the data as a lineage visualization graph. Lineage starts from a given entity and traces its processing backward until it reaches the data source interface of the data system. Different types of entities may involve different types of transformation. For a base-layer warehouse entity, the ETL process is involved. For a warehouse summary table, both ETL and the warehouse summarization process may be involved. For an indicator, the indicator generation process is involved in addition to the above. The data source interface entity is provided by the source system as the data input to the data system, and every other data entity has undergone one or more types of processing. Lineage analysis lets the user inspect, as needed, the individual processes and the inputs and outputs of each process.

Table-level and field-level lineage tracing is realized through data lineage analysis, and a data map can be formed on the basis of the data lineage.

The lineage tracing implementation process is shown in FIG. 2.

Data lineage tracing can be realized across the acquisition, processing, storage, application, and other stages of the data life cycle of data entity metadata. The specific design and implementation process is as follows:

Step 1: establish a mapping with the data entity to obtain mapping association data, collect the lineage records of the data entity's metadata according to the mapping relationship, annotate the data, and generate traceability data.

Step 2: define the lineage data according to the tracing requirements for the data entity, and set the coarse or fine granularity of the lineage at the table level and the field level.

Step 3: parse the mapping data with a lineage analysis algorithm to resolve the field-level and table-level lineage of the data entity.

Step 4: store data such as the data entity mapping relationships and inter-table relationships according to the structure of the field-level and table-level lineage; a relational database or a graph database can be used as the storage database.

Step 5: query the data entity for lineage, and obtain the lineage levels and the mapping relationship data set corresponding to the data entity ID.
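To make the two granularities in steps 2 to 4 concrete, the following sketch models field-level and table-level lineage records and derives the table-level edges from the field-level ones. The record layout and the roll-up rule are assumptions made for illustration; the patent does not specify the stored structure, and the names `LineageModel`, `FieldEdge`, `TableEdge`, and `rollUp` are hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of lineage records at field and table granularity. */
final class LineageModel {
    /** Field-level mapping: targetTable.targetField is derived from sourceTable.sourceField. */
    record FieldEdge(String sourceTable, String sourceField,
                     String targetTable, String targetField) {}

    /** Table-level mapping: targetTable is derived from sourceTable. */
    record TableEdge(String sourceTable, String targetTable) {}

    /**
     * Derives the table-level edges implied by a list of field-level edges.
     * Several field mappings between the same pair of tables collapse into one edge.
     */
    static Set<TableEdge> rollUp(List<FieldEdge> fieldEdges) {
        Set<TableEdge> tableEdges = new LinkedHashSet<>();
        for (FieldEdge e : fieldEdges) {
            tableEdges.add(new TableEdge(e.sourceTable(), e.targetTable()));
        }
        return tableEdges;
    }
}
```

In a relational store these two records would correspond to two tables of edges; in a graph database they would correspond to two relationship types. Either choice is compatible with step 4 as written.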

Specific Embodiment

1. Establishing an ANTLR grammar file

Creating tokens

The INSERT_SELECT_STATEMENT token represents the entire INSERT ... SELECT statement, and INSERT_COL_LIST represents the insert columns. Example grammar file:

2. Generating lexical and syntactic analysis classes

The ANTLRWorks tool generates the SqlLexer.java and SqlParser.java classes from the lexical rules, tokens, and grammar file.

SqlLexer.java: analyzes the input stream according to the SQL lexical rules. For each lexical rule, a method named m followed by the rule name is generated and used to split the input stream into tokens.

SqlParser.java: constructs a parser on top of the lexical analysis class, applies the grammar rules to the token stream, and generates a tree for each grammar rule in the grammar file.

3. Tree parsing

1) Specific parsing process of the tree

Root node

The root node name is the token INSERT_STATEMENT.

First-layer child nodes

The first-layer child nodes fall into two main categories: those related to the INSERT, and the whole SELECT statement (including nested subqueries).

INSERT node

The child nodes of the INSERT node are TABLE_NAME, TABLE_ALIAS, and INSERT_COL_LIST, where TABLE_NAME is the name of the table inserted into, TABLE_ALIAS is the table alias, and INSERT_COL_LIST holds the insert fields. Each subtree is traversed in turn to obtain the table name, the table alias, and the table field names.
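A minimal sketch of reading the INSERT node follows. It uses a small stand-in `Node` record rather than the tree classes ANTLR generates, and it assumes that each TABLE_NAME or TABLE_ALIAS node carries its value as a first child and that INSERT_COL_LIST has one child per column. These layout assumptions, and the names `InsertNodeReader` and `InsertTarget`, are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: extracting table name, alias, and insert fields from an INSERT node. */
final class InsertNodeReader {
    /** Minimal stand-in for a parse-tree node; this is not the ANTLR tree API. */
    record Node(String text, List<Node> children) {
        static Node leaf(String text) {
            return new Node(text, List.of());
        }
    }

    /** Result of reading the INSERT node; alias may be null, columns may be empty. */
    record InsertTarget(String tableName, String tableAlias, List<String> columns) {}

    static InsertTarget read(Node insertNode) {
        String name = null;
        String alias = null;
        List<String> columns = new ArrayList<>();
        for (Node child : insertNode.children()) {
            switch (child.text()) {
                case "TABLE_NAME" -> name = firstChildText(child);
                case "TABLE_ALIAS" -> alias = firstChildText(child);
                case "INSERT_COL_LIST" -> child.children().forEach(c -> columns.add(c.text()));
                default -> { } // other node types are ignored in this sketch
            }
        }
        return new InsertTarget(name, alias, columns);
    }

    /** Returns the text of the first child, or null when the node has no children. */
    private static String firstChildText(Node node) {
        return node.children().isEmpty() ? null : node.children().get(0).text();
    }
}
```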

INSERT_SELECT_STATEMENT node

The root node name of this subtree is INSERT_SELECT_STATEMENT. Its child nodes are SELECT_COL_EXPRS, SELECT_TAB_EXPRS, SELECT_WITH_EXPRS, SELECT_UNION, SELECT_JOIN, and SELECT_DBLINK.

SELECT_COL_EXPRS node

The root node name is SELECT_COL_EXPRS, and its main child node is COLUMN_EXPR, the column syntax token. The main child nodes of COLUMN_EXPR are ATOM_EXPR and SELECT_COL_ALIAS. ATOM_EXPR is the column name node, and its main child nodes are FUNCTION_EXPR and SQL_IDENTIFIER.

SELECT_TAB_EXPRS node

This node is the queried-table node. The root node name is SELECT_TAB_EXPRS, and its main child nodes are SELECT_TAB (the queried table), SELECT_TAB_ALIAS (the table alias), SELECT_UNION (a UNION or UNION ALL node), SELECT_JOIN (the JOIN node), and SELECT_STATEMENT. Each node is traversed in turn to obtain the table name and the table alias. Note that the table name obtained for a subquery is empty.

SELECT_TAB_ALIAS node

This node is the table alias node; the value of its child node is the table alias.

SELECT_UNION node

The root node name is SELECT_UNION. Its main child nodes are SELECT_COL_EXPRS, SELECT_COL_ALIAS, SELECT_TAB_EXPRS, SELECT_TAB_ALIAS, and SELECT_WITH_EXPRS. The child nodes are traversed in turn to obtain the UNION node information.

SELECT_JOIN node

The root node name is SELECT_JOIN. Its main child nodes are SELECT_TABLE, SELECT_ALIAS, and ON_EXPRS. The JOIN node information is obtained by traversing each child node in turn.

SELECT_DBLINK node

The root node name is SELECT_DBLINK, and its child nodes are AT_SIGN and SQL_IDENTIFIER.

2) The field-table relationship parsing process is as follows:

Because the tree structure is complex and nests recursively downward, the relationship between the target fields of the target table and the source fields of the source tables cannot be resolved by a single logical test. Auxiliary classes are therefore used for the analysis.

Introduction to the auxiliary classes

The QueryEntity class records the parsing result of each query statement. If a subquery exists, the whole subquery is represented by a table whose name is empty.

Table 1 QueryEntity class attributes

The TableEntity class records the information of a queried table.

Table 2 TableEntity class attributes

The FieldEntity class records the information of a field.

Table 3 FieldEntity class attributes

The UnionQueryEntity class records the information of a UNION query.

Table 4 UnionQueryEntity class attributes

The relation entity class is the entity class that holds the parsing result.

Table 5 Relation entity class attributes

QueryComparator class

This class implements the Comparator interface and is used to sort a List<QueryEntity> by layer: Collections.sort(List<QueryEntity> list, new QueryComparator()).
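A comparator of this kind might look like the sketch below. The attributes of QueryEntity are not reproduced in the text, so the reduced `QueryEntity` here carries only a hypothetical `layer` number and a `label`; ascending order (outermost layer first) is also an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: sorting query entities by nesting layer. */
final class QuerySortDemo {
    /** Reduced stand-in for QueryEntity; only the layer and a label are modelled. */
    static final class QueryEntity {
        final int layer;
        final String label;

        QueryEntity(int layer, String label) {
            this.layer = layer;
            this.label = label;
        }
    }

    /** Orders query entities by layer, smallest layer number first (assumed direction). */
    static final class QueryComparator implements Comparator<QueryEntity> {
        @Override
        public int compare(QueryEntity a, QueryEntity b) {
            return Integer.compare(a.layer, b.layer);
        }
    }

    /** Sorts a copy of the list and returns the labels in sorted order. */
    static List<String> sortedLabels(List<QueryEntity> queries) {
        List<QueryEntity> copy = new ArrayList<>(queries);
        Collections.sort(copy, new QueryComparator());
        List<String> labels = new ArrayList<>();
        for (QueryEntity q : copy) {
            labels.add(q.label);
        }
        return labels;
    }
}
```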

3) Overview of the tree parsing steps:

(1) Parsing INSERT

The inserted table name, its alias, and the insert field names are extracted from the first-layer child nodes of the main tree. If no insert field names are given, the field name list is set to empty.

(2) Parsing SELECT

The SELECT is parsed layer by layer. A QueryEntity object is constructed for each layer, a FieldEntity object for each query column, and a TableEntity object for each queried table; if a SELECT_UNION exists, a UnionQueryEntity object is constructed as well. The relationships between these objects are then established.

(3) Handling cases containing asterisks

No insert field specified

If no insert field is specified in the insert statement, the field aliases of the first-layer query statement are used as the insert fields; if a field has no alias, its field name is used as the insert field. If the insert field list is empty and the query field is an asterisk, this type of SQL query is not supported for the time being.
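The defaulting rule above can be sketched as follows. `QueryColumn` is a hypothetical reduced form of a first-layer query column, with a null alias meaning that the column has no alias; the unsupported case is signalled with an exception purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: deriving insert fields when the INSERT names none. */
final class InsertFieldDefaults {
    /** A first-layer query column; alias is null when the column has no alias. */
    record QueryColumn(String name, String alias) {}

    static List<String> resolve(List<String> declaredInsertFields,
                                List<QueryColumn> firstLayerColumns) {
        if (!declaredInsertFields.isEmpty()) {
            return List.copyOf(declaredInsertFields); // explicit insert fields win
        }
        List<String> derived = new ArrayList<>();
        for (QueryColumn column : firstLayerColumns) {
            if ("*".equals(column.name())) {
                // Empty insert field list combined with an asterisk: unsupported in the source text.
                throw new UnsupportedOperationException("no insert fields and an asterisk query field");
            }
            // Prefer the alias; fall back to the field name when there is no alias.
            derived.add(column.alias() != null ? column.alias() : column.name());
        }
        return derived;
    }
}
```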

Handling queries that contain an asterisk

If the first-layer query field is or contains an asterisk, the asterisk is replaced with the complement of the query field list with respect to the insert field list. If the last-layer query fields contain an asterisk, the asterisk is replaced with the complement of that layer's query fields with respect to the parent query fields.

If a middle-layer query contains an asterisk, the asterisk is replaced with the complement of that layer's query field list with respect to the sub-query field list.
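One reading of the "complement" rule is a set difference: the asterisk stands for every field of a reference list that the layer does not already name explicitly. The sketch below implements that reading, which is an interpretation rather than the patent's stated algorithm. The reference list would be the insert field list for the first layer, the parent query fields for the last layer, and the sub-query field list for a middle layer. The names `AsteriskExpansion` and `expand` are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch: replacing an asterisk with the fields it stands for. */
final class AsteriskExpansion {
    /**
     * Replaces each "*" in queryFields with the reference fields that are
     * not already named explicitly in queryFields (the complement).
     */
    static List<String> expand(List<String> queryFields, List<String> referenceFields) {
        // Fields named explicitly in this layer.
        Set<String> explicit = new LinkedHashSet<>(queryFields);
        explicit.remove("*");

        // Complement: reference fields that are not named explicitly.
        List<String> complement = new ArrayList<>();
        for (String field : referenceFields) {
            if (!explicit.contains(field)) {
                complement.add(field);
            }
        }

        // Rebuild the field list, substituting the complement for each asterisk.
        List<String> expanded = new ArrayList<>();
        for (String field : queryFields) {
            if ("*".equals(field)) {
                expanded.addAll(complement);
            } else {
                expanded.add(field);
            }
        }
        return expanded;
    }
}
```

For instance, with insert fields `id, name, age` and first-layer query fields `id, *`, the asterisk expands to `name, age` under this reading.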

Unsupported cases

A UNION that contains an asterisk is not supported, and parent-child queries that contain an asterisk are not supported.

(4) Parsing the association relationships

Detailed algorithm

First, traverse the table entity list of a query entity. When the table name referenced by a field equals a table alias: if the table type is join, construct a relation entity object and put it into the result list; if the table type is ordinary and the table name is not empty, likewise put a relation entity object into the result list; if the table name is empty, jump to the next layer of query entity and repeat the above steps.

For each layer of query entity, if its List<UnionQueryEntity> is not empty, traverse that list, find the insert field corresponding to the query field index, construct a relation entity object, and put it into the result list.

Parsing result

The parsing result is placed in the relation entity list. One insert field may be derived from multiple tables and multiple fields.
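The first part of the algorithm (resolving each insert field to its physical source tables through nested queries) can be sketched as below. The attributes of the auxiliary classes are not reproduced in the text, so every record here is a hypothetical reduction; UNION handling is omitted, and a subquery is assumed to be attached to the empty-named table that stands for it.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: resolving insert fields to source tables through nested queries. */
final class AssociationAnalysis {
    enum TableType { JOIN, ORDINARY }

    /** name is empty when the table stands for a subquery; subQuery is then non-null. */
    record TableEntity(String name, String alias, TableType type, QueryEntity subQuery) {}

    /** tableRef is the alias qualifying the column; outputName is its alias, or its name if none. */
    record FieldEntity(String tableRef, String name, String outputName) {}

    record QueryEntity(List<FieldEntity> fields, List<TableEntity> tables) {}

    record RelationEntity(String targetField, String sourceTable, String sourceField) {}

    /** Pairs insert fields with first-layer query fields by index and resolves each one. */
    static List<RelationEntity> analyze(List<String> insertFields, QueryEntity top) {
        List<RelationEntity> result = new ArrayList<>();
        for (int i = 0; i < insertFields.size() && i < top.fields().size(); i++) {
            resolve(insertFields.get(i), top.fields().get(i), top, result);
        }
        return result;
    }

    private static void resolve(String target, FieldEntity field, QueryEntity query,
                                List<RelationEntity> out) {
        for (TableEntity table : query.tables()) {
            if (!field.tableRef().equals(table.alias())) {
                continue; // the field does not come from this table
            }
            if (table.type() == TableType.JOIN || !table.name().isEmpty()) {
                // Join table, or ordinary table with a real name: record the relation.
                out.add(new RelationEntity(target, table.name(), field.name()));
            } else {
                // Empty table name: descend into the subquery and repeat.
                for (FieldEntity inner : table.subQuery().fields()) {
                    if (inner.outputName().equals(field.name())) {
                        resolve(target, inner, table.subQuery(), out);
                    }
                }
            }
        }
    }
}
```

As an example, for `INSERT INTO t(a, b) SELECT s.x, u.y FROM (SELECT o.x AS x FROM orders o) s JOIN users u ...`, the sketch maps `a` to `orders.x` and `b` to `users.y`.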

While the invention has been described in detail in the foregoing general description and specific examples, it will be apparent to those skilled in the art that modifications and improvements can be made thereto. Accordingly, such modifications or improvements may be made without departing from the spirit of the invention and are intended to be within the scope of the invention as claimed.

Claims

1. A data entity-based data lineage tracing method, comprising the following steps:

2. The method according to claim 1, wherein step 2 includes creating an ANTLR grammar file.

3. The method according to claim 1, wherein step 3 includes generating lexical and syntactic analysis classes.

4. The data entity-based data lineage tracing method according to claim 3, wherein step 3 comprises tree parsing.

5. The data entity-based data lineage tracing method according to claim 4, wherein the tree parsing includes field-table relationship parsing.

6. The data entity-based data lineage tracing method according to claim 4, wherein the tree parsing step comprises: 1) parsing the INSERT; 2) parsing the SELECT; 3) handling cases containing an asterisk: if no insert field is specified in the insert statement, the field aliases of the first-layer query statement are used as the insert fields, and if a field has no alias, its field name is used as the insert field; if the insert field list is empty and the query field is an asterisk, this type of query is not supported for the time being; when the query contains an asterisk, if the first-layer query field is or contains an asterisk, the asterisk is replaced with the complement of the query field list with respect to the insert field list; if the last-layer query field contains an asterisk, the asterisk is replaced with the complement of that layer's query fields with respect to the parent query fields; if a middle-layer query contains an asterisk, the asterisk is replaced with the complement of that layer's query field list with respect to the sub-query field list; the unsupported cases are a UNION containing an asterisk and parent-child queries containing an asterisk; 4) parsing the association relationships.