CN117076742A

CN117076742A - Data lineage tracking method, device, and electronic equipment

Info

Publication number: CN117076742A
Application number: CN202311034552.XA
Authority: CN
Inventors: 肖云鹤; 刘亚军; 张俊; 代庆国
Original assignee: Beijing Xinge Technology Co ltd
Current assignee: Beijing Xinge Technology Co ltd
Priority date: 2023-08-16
Filing date: 2023-08-16
Publication date: 2023-11-17

Abstract

The application relates to a data lineage tracking method, a device, and electronic equipment, belonging to the technical field of data security. The method comprises: obtaining metadata of a target database, and marking the metadata according to its hierarchical dimensions to obtain marked data, wherein the hierarchical dimensions comprise library, table, and field; parsing each operation statement sent to the target database, and constructing a data stream log of the target database based on the parsing results and the marked data, wherein the data stream log carries record data representing data flow information in the target database; and performing data lineage tracking on target data based on the data stream log. With this technical scheme, field-level data lineage tracking can be realized, and the specific parsing configuration allows adaptation to different database types.

Description

Data lineage tracking method, device, and electronic equipment

Technical Field

The application belongs to the technical field of data security, and particularly relates to a data lineage tracking method, a device, and electronic equipment.

Background

Lineage tracking (or lineage analysis) is a technical means of comprehensively tracking a data processing process, so as to find all metadata objects related to a given data object taken as the starting point, together with the relationships among those metadata objects. In the current environment, each enterprise's data is circulated, fused, cleaned, and otherwise processed as it interacts with the data of other enterprises, and databases generate new data in the process. Against this background, data lineage comprises the process chain that leads from the creation of a table, through a series of operations, to the new tables derived from it, and the relationship map formed by the data that is directly or indirectly related to the table. Data analysis, such as tracing where data came from and why it has its current form, therefore cannot be separated from lineage analysis between tables and between table fields.

Current lineage tracking implementations mainly include reverse deduction from a scheduler and lineage tracking interfaces provided by computing engine systems. Reverse deduction from a scheduler is highly feasible and low in cost, but cannot accurately track data at the field level. The lineage tracking interfaces provided by big data computing engines (e.g., Hive) can reach the field level, but they are too engine-specific to be applied to other database types.

Therefore, how to provide a data lineage tracking method that works at the field level across common database types is a technical problem to be solved.

The foregoing is provided merely for the purpose of facilitating understanding of the technical solutions of the present application and is not intended to represent an admission that the foregoing is prior art.

Disclosure of Invention

In order to overcome the problems in the related art at least to a certain extent, the application provides a data lineage tracking method, a device, and electronic equipment, which address the technical problem of how to realize field-level data lineage tracking across common database types.

In order to achieve the above purpose, the application adopts the following technical scheme:

in a first aspect of the present application,

the application provides a data lineage tracking method, which comprises the following steps:

obtaining metadata of a target database, and marking the metadata according to the hierarchical dimension of the metadata to obtain marked data, wherein the hierarchical dimension comprises a library, a table and a field;

parsing each operation statement sent to the target database, and constructing a data stream log of the target database based on the parsing results and the marked data, wherein the data stream log carries record data representing data flow information in the target database;

and performing data lineage tracking on target data based on the data stream log.

Optionally, the obtaining metadata of the target database, and performing a marking process on the metadata according to a hierarchical dimension of the metadata, to obtain marked data, includes:

loading a driver of the target database, and acquiring the metadata of the target database through a JDBC interface;

marking the metadata according to the hierarchical dimension of the metadata, and warehousing the obtained marked data;

wherein the categories of the metadata include: catalog, schema, table, and column.

Optionally, the operation statement is an SQL statement; parsing each operation statement sent to the target database, and constructing the data stream log of the target database based on the parsing results and the marked data, includes:

calling a general SQL parser to parse each SQL statement and generate the abstract syntax tree corresponding to the statement;

determining the execution action corresponding to the statement according to the abstract syntax tree, and analyzing the execution action based on the marked data to obtain the data operation corresponding to the statement;

and aggregating and classifying the data operations corresponding to the SQL statements, and constructing the data stream log in the time order of the statements based on the aggregation and classification result.

Optionally, the general SQL parser includes a Druid parser or an ANTLR4 parser.

Optionally, the record data includes metadata change record data; performing data lineage tracking on the target data based on the data stream log specifically comprises:

sorting the metadata change record information carried by the metadata change record data in time order to obtain a data processing link for the data at each hierarchical dimension, and constructing from these links a data processing link set for data lineage tracking;

and querying and matching the target data against the data processing link set to obtain the dependency relationships between the metadata of the target data and other related metadata, thereby obtaining a data lineage relationship map of the target data.

Optionally, performing data lineage tracking on the target data based on the data stream log further includes:

querying and matching the target data against the data processing link set to obtain the iteration information of the target data from its initial state to its current state, thereby obtaining a traceability map of the target data.

In a second aspect of the present application,

the application provides a data lineage tracking device, which comprises:

the marking processing module is used for acquiring metadata of the target database, and marking the metadata according to the hierarchical dimension of the metadata to obtain marked data, wherein the hierarchical dimension comprises a library, a table and a field;

the parsing and construction module is used for parsing each database operation statement sent to the target database, and constructing a data stream log of the target database based on the parsing results and the marked data, wherein the data stream log carries record data representing data flow information in the target database;

and the tracking implementation module is used for performing data lineage tracking on the target data based on the data stream log.

In a third aspect of the present application,

the present application provides an electronic device including:

a memory having an executable program stored thereon;

and a processor for executing the executable program in the memory to implement the steps of the method described above.

By adopting the above technical scheme, the application has at least the following beneficial effects:

the data lineage tracking method comprises: obtaining metadata of a target database, and marking the metadata according to its hierarchical dimensions to obtain marked data, wherein the hierarchical dimensions comprise library, table, and field; parsing each operation statement sent to the target database, and constructing a data stream log of the target database based on the parsing results and the marked data, wherein the data stream log carries record data representing data flow information in the target database; and performing data lineage tracking on the target data based on the data stream log. With this specific configuration, the metadata of the target database is marked down to the field as the finest hierarchical dimension, a data stream log representing the direction of data flow in the target database is constructed in combination with the parsing of each operation statement, and data lineage tracking is then performed on the target data based on the constructed data stream log.

Additional advantages, objects, and features of the application will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the application.

Drawings

The accompanying drawings are included to provide a further understanding of the technical solutions of the present application or of the prior art, and are incorporated in and constitute a part of this specification. The drawings illustrate the technical scheme of the present application and do not limit it.

FIG. 1 is a flow chart of a data lineage tracking method according to an embodiment of the present application;

FIG. 2 is a schematic illustration of an implementation configuration of a data lineage tracking method according to another embodiment of the present application;

FIG. 3 is a schematic diagram of a data lineage tracking device according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions, and advantages of the present application more apparent, the technical solutions of the present application will be described in detail below. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments derived from the embodiments herein fall within the scope of the application as defined by the claims.

As described in the background, current lineage tracking implementations mainly include reverse deduction from a scheduler and lineage tracking interfaces provided by computing engine systems. Reverse deduction from a scheduler is highly feasible and low in cost, but cannot accurately track data at the field level. The lineage tracking interfaces provided by big data computing engines (e.g., Hive) can reach the field level, but they are too engine-specific to be applied to other database types. In addition, for non-database data, lineage relationships are judged by data feature recognition or manual identification; this approach places high demands on business knowledge, must be dynamically re-evaluated and adjusted whenever the specific business changes, and is therefore quite limited.

In view of the above, the application provides a data lineage tracking method to address the technical problem of how to realize field-level data lineage tracking across common database types.

As shown in FIG. 1, in an embodiment, the data lineage tracking method provided by the present application includes:

step S110, obtaining metadata of a target database, and marking the metadata according to the hierarchical dimension of the metadata to obtain marked data, wherein the hierarchical dimension comprises a library, a table and a field;

the target database refers to the business database in an actual application scenario, for example, the MySQL database used as the back end of a book management system;

the marking process here refers to arranging the acquired metadata information into normalized marks. For example, field a of table t1 may be marked "/10.10.10.10:3306/mysql/mysql01/t1/a", the library mysql may be marked "/10.10.10.10:3306/mysql/", the schema mysql01 (a level between the library and the table) may be marked "/10.10.10.10:3306/mysql/mysql01", table t1 may be marked "/10.10.10.10:3306/mysql/mysql01/t1", and so on;

in this step, the marking makes it convenient to later search the lineage of metadata at different levels. For example, to find the lineage relationships of table t1, only the data whose mark has the prefix /10.10.10.10:3306/mysql/mysql01/t1 needs to be searched, which is much more efficient than searching across all marked data. Because the finest hierarchical dimension of the marks is the field, field-level data lineage tracking can be realized subsequently.
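The marking and prefix search described above can be sketched as follows. This is a minimal illustration only: the function names `build_mark` and `marks_under` are hypothetical, and the sketch omits the trailing slash that the library-level example mark carries.

```python
from typing import List, Optional


def build_mark(endpoint: str, catalog: str, schema: Optional[str] = None,
               table: Optional[str] = None, column: Optional[str] = None) -> str:
    """Join the hierarchy levels into a path-style mark, stopping at the first missing level."""
    parts = [endpoint, catalog]
    for level in (schema, table, column):
        if level is None:
            break
        parts.append(level)
    return "/" + "/".join(parts)


def marks_under(marks: List[str], prefix: str) -> List[str]:
    """Return the marks equal to `prefix` or nested below it."""
    base = prefix.rstrip("/")
    # Requiring a "/" after the prefix keeps table t10 out of a search for table t1.
    return [m for m in marks if m == base or m.startswith(base + "/")]


marks = [
    build_mark("10.10.10.10:3306", "mysql"),
    build_mark("10.10.10.10:3306", "mysql", "mysql01"),
    build_mark("10.10.10.10:3306", "mysql", "mysql01", "t1"),
    build_mark("10.10.10.10:3306", "mysql", "mysql01", "t1", "a"),
    build_mark("10.10.10.10:3306", "mysql", "mysql01", "t10", "a"),
]
t1_marks = marks_under(marks, "/10.10.10.10:3306/mysql/mysql01/t1")
```

Because every mark embeds its parent levels, a search for one table reduces to a string-prefix filter over the stored marks.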

After step S110, as shown in FIG. 2, step S120 is performed: each operation statement sent to the target database (in this embodiment, the operation statements are SQL statements) is parsed, and a data stream log of the target database is constructed based on the parsing results and the marked data, where the data stream log carries record data representing data flow information in the target database;

it should be noted that, in actual implementation, step S110 is a static process, that is, it is performed once for the business database in the specific scenario. In contrast, step S120 is a dynamic process: each operation statement sent to the target database is parsed, and the data stream log is then constructed based on the parsing results (those of a certain time period) and the marked data;

specifically, in this embodiment, parsing each operation statement sent to the target database, and constructing the data stream log of the target database based on the parsing results and the marked data, includes:

invoking a general SQL parser to parse each SQL statement and generate the abstract syntax tree (AST) corresponding to the statement, wherein the general SQL parser may be a Druid parser, an ANTLR4 parser, or another type of parser;

determining the execution action corresponding to the statement according to the abstract syntax tree (for example, the categories of execution action include drop, del, add, update, etc.), and analyzing the execution action based on the marked data obtained in step S110 to obtain the data operation corresponding to the statement;

it should be noted that, in the present application, a data operation is the combination of a data object and an execution action. For example, the categories of data operation include adding a field column, deleting a field column, creating a table, deleting a table, and the like; the statement ALTER TABLE user ADD account INT NULL COMMENT 'account' will ultimately be parsed into an add-field operation;

after the statements in a certain time period have been processed, the data operations corresponding to the operation statements are aggregated and classified, and the data stream log is constructed in the time order of the statements based on the aggregation and classification result; the constructed data stream log contains metadata change data, change data, and the like.

After the data stream log is obtained, step S130 may be performed to carry out data lineage tracking on the target data based on the data stream log.
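The parsing and log construction steps above can be illustrated with a small sketch. It is not the embodiment's parser: a few regular expressions stand in for the AST produced by a general SQL parser, and the names `SCHEMA_MARK`, `DataOperation`, `parse_statement`, and `build_flow_log`, as well as the integer timestamps, are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical mark of the schema holding the tables (same format as the marking examples).
SCHEMA_MARK = "/10.10.10.10:3306/mysql/mysql01"


@dataclass(frozen=True)
class DataOperation:
    timestamp: int  # when the statement was observed (assumed integer clock)
    operation: str  # data object plus execution action, e.g. "add_column"
    mark: str       # mark of the affected metadata object


# Toy patterns standing in for an AST walk; a real deployment would use a full SQL parser.
PATTERNS = [
    ("add_column", re.compile(r"ALTER\s+TABLE\s+(\w+)\s+ADD\s+(?:COLUMN\s+)?(\w+)", re.I)),
    ("drop_column", re.compile(r"ALTER\s+TABLE\s+(\w+)\s+DROP\s+(?:COLUMN\s+)?(\w+)", re.I)),
    ("create_table", re.compile(r"CREATE\s+TABLE\s+(\w+)", re.I)),
    ("drop_table", re.compile(r"DROP\s+TABLE\s+(\w+)", re.I)),
]


def parse_statement(timestamp: int, sql: str) -> Optional[DataOperation]:
    """Map one statement to a data operation, or None if no pattern applies."""
    for name, pattern in PATTERNS:
        found = pattern.match(sql.strip())
        if found:
            mark = SCHEMA_MARK + "/" + "/".join(found.groups())
            return DataOperation(timestamp, name, mark)
    return None


def build_flow_log(statements):
    """Classify parsed operations by category, then emit them in statement time order."""
    by_category = {}
    for timestamp, sql in statements:
        op = parse_statement(timestamp, sql)
        if op is not None:  # statements matching no pattern are skipped in this sketch
            by_category.setdefault(op.operation, []).append(op)
    merged = [op for ops in by_category.values() for op in ops]
    return sorted(merged, key=lambda op: op.timestamp)


statements = [
    (3, "DROP TABLE t2"),
    (1, "CREATE TABLE user (id INT)"),
    (2, "ALTER TABLE user ADD account INT NULL COMMENT 'account'"),
    (4, "SELECT 1"),
]
flow_log = build_flow_log(statements)
```

Here the ALTER TABLE statement from the text becomes an `add_column` operation on the mark of the `account` field, and the log lists the operations by statement time regardless of arrival order.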

Specifically, in this embodiment, the metadata change record information carried by the metadata change data in the data stream log is sorted in time order to obtain a data processing link for the data at each hierarchical dimension, and a data processing link set for data lineage tracking is constructed from the obtained links;

for example, suppose the actual database operations are: the column1 field of table A is inserted into table B as column1; the column1 field of table B is then deleted; and the column1 field data of table C is then inserted into table B, again as column1, together with its data. There are three steps in this sequence; the metadata change information of each step is recorded in the data stream log, and this change information is sorted in order into a processing link for the data.

After the data processing link set is obtained, the target data is queried and matched against the data processing link set to obtain the dependency relationships between the metadata of the target data and other related metadata, yielding a data lineage relationship map of the target data. The data lineage relationship map shows the association relationships between metadata of the target database and other metadata, and is a common requirement in practical data lineage tracking applications.

Continuing the previous example, the column1 field of the final table B is the same as the column1 field of table C, and their data sources are the same, so a simple field-based relationship network, B-column1-C, can be obtained. This means that the field is equivalent in B and C, and that, for this field, B and C both stand as parent nodes of column1. Such a lineage relationship map can therefore reflect the relationships between fields and tables and between tables and databases.

With the above technical scheme and its specific configuration, the metadata of the target database is marked down to the field as the finest hierarchical dimension, a data stream log representing the direction of data flow in the target database is constructed in combination with the parsing of each operation statement, and data lineage tracking is then performed on the target data based on the constructed data stream log.
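The three-step example can be replayed in a short sketch that builds the processing links and then derives the current dependency of B's column1. The record shape (`time`, `action`, `target`, `source`), the shortened marks such as "B/column1", and the function names are assumptions for illustration, not the log format of the embodiment.

```python
from collections import defaultdict

# Metadata change records as they might appear in the data stream log (hypothetical shape).
change_records = [
    {"time": 2, "action": "drop_column", "target": "B/column1", "source": None},
    {"time": 3, "action": "add_column", "target": "B/column1", "source": "C/column1"},
    {"time": 1, "action": "add_column", "target": "B/column1", "source": "A/column1"},
]


def build_links(records):
    """Group change records by target mark, with each group ordered by time."""
    links = defaultdict(list)
    for record in sorted(records, key=lambda r: r["time"]):
        links[record["target"]].append(record)
    return dict(links)


def current_sources(links, target):
    """Replay one link in order and return the sources the target currently depends on."""
    sources = []
    for record in links.get(target, []):
        if record["action"] == "drop_column":
            sources.clear()  # dropping the field discards its earlier dependencies
        elif record["source"] is not None:
            sources.append(record["source"])
    return sources


links = build_links(change_records)
lineage = {target: current_sources(links, target) for target in links}
```

After the replay, `lineage` relates B's column1 only to C's column1, which corresponds to the B-column1-C relationship in the text; the earlier dependency on table A disappears with the drop step.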

In order to facilitate understanding of the technical solution of the present application, another embodiment of the present application is described below.

FIG. 2 is a schematic illustration of the implementation configuration of the data lineage tracking method in this embodiment.

In this embodiment, as shown in FIG. 2, the business system in the application scenario includes an application front end, an application server, and a business database (which may be MySQL, Oracle, SQL Server, Hive, DB2, H2, etc.). The user carries out a specific business operation flow by accessing the application front end deployed on the application server, and during the business flow the business database stores and manages the relevant business data.

In order to realize the technical scheme of the application, an analysis server and an analysis database are added in the existing service system architecture;

based on the specific configuration of the analysis server, metadata of the target database (the business database in FIG. 2) is first obtained and marked to obtain marked data, and the marked data is stored (in the analysis database);

specifically, in this embodiment, the driver of the target database is loaded, the metadata of the target database is obtained through the JDBC interface, the metadata is marked according to its hierarchical dimensions, and the resulting marked data is stored, where the categories of metadata include catalog, schema, table, column, etc.
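The acquire-then-mark flow can be sketched without a JDBC driver. The example below deliberately swaps in Python's built-in sqlite3 module as a stand-in for the JDBC interface, so the metadata queries are SQLite-specific and are not what the embodiment uses. The endpoint, catalog, and schema values are illustrative assumptions (SQLite itself has no such levels), and `collect_marks` is a hypothetical name.

```python
import sqlite3


def collect_marks(conn, endpoint, catalog, schema):
    """Read table and column metadata and return one mark per hierarchy level."""
    base = f"/{endpoint}/{catalog}/{schema}"
    marks = [f"/{endpoint}/{catalog}", base]
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        marks.append(f"{base}/{table}")
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        for column in conn.execute(f'PRAGMA table_info("{table}")'):
            marks.append(f"{base}/{table}/{column[1]}")
    return marks


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (a INTEGER, b TEXT)")
marks = collect_marks(conn, "10.10.10.10:3306", "mysql", "mysql01")
conn.close()
```

In this sketch the returned list plays the role of the marked data that the embodiment stores in the analysis database.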

Then, as shown in FIG. 2, based on the specific configuration of the analysis server, the database operation statements sent to the business database are filtered and monitored, each operation statement sent to the target database is parsed, and a data stream log of the target database is constructed based on the parsing results and the marked data and stored (in the analysis database). The parsing and construction processes are described above and are not repeated here.

After the data stream log is obtained, data lineage tracking can be performed on the target data based on it. In this embodiment, in addition to what is described in the foregoing embodiment, the data lineage tracking further includes:

querying and matching the target data (specified by user input) against the data processing link set, and obtaining, based on the change records of the data marks, the iteration information of the target data from its initial state to its current state, thereby obtaining a traceability map of the target data. The traceability map may also be stored (in the analysis database) according to actual traceability requirements.

It should be noted that the traceability map focuses on displaying the change process of the metadata itself, whereas the data lineage relationship map mentioned in the foregoing embodiment focuses on the association relationships between the metadata and other metadata.

In this embodiment, metadata information is acquired through the universal interface of the JDBC standard, so that common databases are supported and the method is not limited to a specified database type, making it better suited to practical application. In addition, hierarchical marks are applied to the metadata, so that lineage relationship maps at different levels can be conveniently displayed and classified by level. SQL parsing is used to map data flow and its direction, converting the processing into a visual presentation of relational data. In tracking, the data lineage relationship map can be generated from the metadata changes and their change records, and the data lineage traceability map is generated by aggregating the links formed by the change records of the metadata and of the metadata marks, so that changes to the target data at any stage of processing can be found.
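To make the distinction concrete, a traceability map for a single mark can be sketched as the ordered list of its own state transitions. The record shape, the state labels, and the name `trace_history` are assumptions for illustration; the embodiment does not specify how states are represented.

```python
# Change records for one field, in arbitrary arrival order (hypothetical shape).
change_records = [
    {"time": 3, "action": "add_column", "target": "B/column1", "source": "C/column1"},
    {"time": 1, "action": "add_column", "target": "B/column1", "source": "A/column1"},
    {"time": 2, "action": "drop_column", "target": "B/column1", "source": None},
]


def trace_history(records, mark):
    """Return (time, previous state, new state) for every recorded change of one mark."""
    history = []
    state = "absent"  # assumed initial state before any change is recorded
    for record in sorted(records, key=lambda r: r["time"]):
        if record["target"] != mark:
            continue
        if record["action"] == "drop_column":
            new_state = "absent"
        else:
            new_state = "sourced from " + str(record["source"])
        history.append((record["time"], state, new_state))
        state = new_state
    return history


history = trace_history(change_records, "B/column1")
```

The last entry of `history` gives the current state, while the full list shows how the field reached it; the relationships to other metadata are left to the lineage relationship map.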

FIG. 3 is a schematic structural diagram of a data lineage tracking device according to an embodiment of the present application. As shown in FIG. 3, the data lineage tracking device 300 includes:

the marking processing module 301 is configured to obtain metadata of the target database, and mark the metadata according to its hierarchical dimensions to obtain marked data, where the hierarchical dimensions include library, table, and field;

the parsing and construction module 302 is configured to parse each database operation statement sent to the target database, and construct a data stream log of the target database based on the parsing results and the marked data, where the data stream log carries record data representing data flow information in the target database;

the tracking implementation module 303 is configured to perform data lineage tracking on the target data based on the data stream log.

The specific manner in which each module of the data lineage tracking device 300 in the above embodiment performs its operations has been described in detail in the method embodiments and is not repeated here.

FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in FIG. 4, the electronic device 400 includes:

a memory 401 on which an executable program is stored;

a processor 402 for executing an executable program in the memory 401 to implement the steps of the above method.

The specific manner in which the processor 402 executes the program in the memory 401 of the electronic device 400 in the above embodiment has been described in detail in the embodiment concerning the method, and will not be described in detail here.

The present application is not limited to the above-mentioned embodiments, and any changes or substitutions that can be easily understood by those skilled in the art within the scope of the present application are intended to be included in the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims

1. A data lineage tracking method, comprising:

2. The method according to claim 1, wherein the obtaining metadata of the target database and performing a marking process on the metadata according to a hierarchical dimension of the metadata to obtain marked data includes:

3. The data lineage tracking method according to claim 2, wherein the operation statement is an SQL statement; parsing each operation statement sent to the target database, and constructing the data stream log of the target database based on the parsing results and the marked data, comprises:

4. The data lineage tracking method according to claim 3, wherein the general SQL parser includes a Druid parser or an ANTLR4 parser.

5. The data lineage tracking method according to claim 1, wherein the record data includes metadata change record data; performing data lineage tracking on the target data based on the data stream log specifically comprises:

6. The method of claim 5, wherein the performing data lineage tracking on target data based on the data flow log, further comprises:

7. A data lineage tracking device, comprising:

8. An electronic device, comprising:

a memory having an executable program stored thereon;

a processor for executing the executable program in the memory to implement the steps of the method of any one of claims 1-6.