CN112783857A

CN112783857A - Data lineage management method and device, electronic equipment and storage medium

Info

Publication number: CN112783857A
Application number: CN202011623179.8A
Authority: CN
Inventors: 任亮; 傅雨梅; 杨飞; 文齐辉
Original assignee: Beijing Zhiyin Intelligent Technology Co ltd
Current assignee: Beijing Zhiyin Intelligent Technology Co ltd
Priority date: 2020-12-31
Filing date: 2020-12-31
Publication date: 2021-05-11
Anticipated expiration: 2040-12-31
Also published as: CN112783857B

Abstract

The application provides a data lineage management method and device, an electronic device and a storage medium. The data lineage management method comprises the following steps: acquiring a workflow of target metadata, and storing the workflow to a target node, the workflow including at least one flow component; parsing the flow components in the workflow, and determining a source table and a target table of each flow component and an association relationship between a source field in the source table and a target field in the target table; and managing the data lineage through the attribute information of the target node and the branch graph. By recording and managing the workflow as a whole during data processing, the application records, across the full process, the table lineage and field lineage of each node and each flow component in the workflow, so that data problems in a data warehouse can be located, tracked and backtracked in a reasonable manner.

Description

Data lineage management method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of data management technologies, and in particular, to a data lineage management method and apparatus, an electronic device, and a storage medium.

Background

At present, much of the data lineage monitoring on the market is derived from the lineage of a single component such as hive, from individual data tables, or from a single data flow direction. There is no complete, end-to-end lineage monitoring, from data flow to data table to data field, that covers the whole of the metadata and the data processing process. Therefore, in daily data processing and in managing all the data tables and data processing flows in a data warehouse, when a data problem occurs or the data needs to be managed, the lineage solutions currently on the market cannot locate, track and backtrack the problem in a reasonable manner.

Disclosure of Invention

In view of this, an object of the present application is to provide a data lineage management method, apparatus, electronic device and storage medium which, by recording and managing the workflow as a whole during data processing, record the table lineage and field lineage of each node and each flow component in the workflow across the full process, so that data problems in a data warehouse can be located, tracked and backtracked in a reasonable manner.

The application mainly comprises the following aspects:

In a first aspect, an embodiment of the present application provides a data lineage management method, where the data lineage management method includes:

acquiring a workflow of target metadata, and storing the workflow to a target node; the workflow comprises at least one flow component;

parsing the flow components in the workflow, determining a source table and a target table of each flow component and an association relationship between a source field in the source table and a target field in the target table, and displaying the determined source table, the determined target table and the association relationship between the source field and the target field in a branch graph form;

and managing the data lineage through the attribute information of the target node and the branch graph.

In a possible implementation manner, the flow component specifically includes: a data exchange flow component, a data development flow component, a data quality verification flow component and a data visualization flow component.

In a possible embodiment, the parsing the flow components in the workflow, determining a source table and a target table of each flow component and an association relationship between a source field in the source table and a target field in the target table, and displaying the determined source table, the target table, and the association relationship between the source field and the target field in a branch graph form includes:

parsing the synchronization script of the data exchange flow component in the workflow, and determining a source table and a target table of the data exchange flow component;

parsing the corresponding fields of the source data corresponding to the metadata in the source table and the target data corresponding to the metadata in the target table, determining an association relationship between the source fields in the source table and the target fields in the target table, and displaying the determined source table, the target table and the association relationship between the source fields and the target fields in a branch graph form.

In a possible implementation manner, the parsing the flow components in the workflow, determining a source table and a target table of each flow component and an association relationship between a source field in the source table and a target field in the target table, and displaying the determined source table, the target table, and the association relationship between the source field and the target field in a branch graph form further includes:

parsing the script of the data development flow component in the workflow through custom code, and synchronously parsing specific fields in the metadata.

In one possible implementation, the attribute information of the target node includes: the creation time of each node, the capacity of each node, the internal execution directory of each node, and the execution server corresponding to each node.

In a second aspect, an embodiment of the present application further provides a data lineage management device, including:

a first obtaining module, configured to obtain a workflow of target metadata and store the workflow to a target node; the workflow comprises at least one flow component;

a parsing module, configured to parse the flow components in the workflow, determine a source table and a target table of each flow component and an association relationship between a source field in the source table and a target field in the target table, and display the determined source table, the determined target table and the association relationship between the source field and the target field in a branch graph form;

and a management module, configured to manage the data lineage through the attribute information of the target node and the branch graph.

In a possible implementation manner, the flow component in the first obtaining module specifically includes: a data exchange flow component, a data development flow component, a data quality verification flow component and a data visualization flow component.

In a possible implementation, the parsing module includes:

a first data exchange flow parsing unit, configured to parse the synchronization script of the data exchange flow component in the workflow, and determine a source table and a target table of the data exchange flow component;

and a second data exchange flow parsing unit, configured to parse the corresponding fields of the source data corresponding to the metadata in the source table and the target data corresponding to the metadata in the target table, determine an association relationship between the source fields in the source table and the target fields in the target table, and display the determined source table, the target table and the association relationship between the source fields and the target fields in a branch graph form.

In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the steps of the data lineage management method as described above.

In a fourth aspect, the present application further provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, performs the steps of the data lineage management method as described above.

Compared with data lineage management methods in the prior art, the data lineage management method and device provided by the embodiments of the application record and manage the workflow as a whole during data processing, thereby recording, across the full process, the table lineage and field lineage of each node and each flow component in the workflow, so that data problems in a data warehouse can be located, tracked and backtracked in a reasonable manner.

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.

Fig. 1 is a flowchart illustrating a data lineage management method provided by an embodiment of the present application;

Fig. 2 is a flowchart illustrating another data lineage management method provided by an embodiment of the present application;

Fig. 3 is a schematic structural diagram illustrating a data lineage management device provided by an embodiment of the present application;

Fig. 4 is a schematic structural diagram of another data lineage management device provided by an embodiment of the present application;

Fig. 5 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;

Fig. 6 is a flowchart illustrating a workflow in a data lineage management method provided by an embodiment of the present application.

In the figure:

300-data lineage management device; 310-first obtaining module; 320-parsing module; 321-first data exchange flow parsing unit; 322-second data exchange flow parsing unit; 323-data development flow parsing unit; 330-management module; 500-electronic device; 510-processor; 520-memory; 530-bus.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. Every other embodiment that can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present application falls within the protection scope of the present application.

Research shows that much of the data lineage monitoring on the market is derived from the lineage of a single component such as hive, from individual data tables, or from a single data flow direction. There is no complete, end-to-end lineage monitoring, from data flow to data table to data field, that covers the whole of the metadata and the data processing process. Therefore, in daily data processing and in managing all the data tables and data processing flows in a data warehouse, when a data problem occurs or the data needs to be managed, the lineage solutions currently on the market cannot locate, track and backtrack the problem in a reasonable manner.

Based on this, embodiments of the present application provide a data lineage management method and apparatus, an electronic device, and a storage medium which, by recording and managing the workflow as a whole during data processing, record the table lineage and field lineage of each node and each flow component in the workflow across the full process, so that data problems in a data warehouse can be located, tracked and backtracked in a reasonable manner.

Referring to fig. 1, fig. 1 is a flowchart illustrating a data lineage management method according to an embodiment of the present disclosure. As shown in fig. 1, a data lineage management method provided by an embodiment of the present application includes the following steps:

S101, obtaining a workflow of target metadata, and storing the workflow to a target node; the workflow includes at least one flow component.

In this step, the workflows of the target metadata are obtained, the content of each flow component is recorded according to the content of each workflow, and the content recorded for each flow component is stored in the target node, thereby recording the lineage of the workflows.

Here, the flow of a certain workflow is exemplified, as shown in fig. 6: data exchange, cleaning layer data processing, data quality verification, fusion layer data processing, data quality verification, subject layer data processing, data quality verification and data synchronization.

Specifically, the workflow may be: data exchange (sqoop) -> hql (cleaning layer hive data processing) -> qualitis (quality inspection) -> hql (fusion layer hive data processing) -> qualitis (quality inspection) -> hql (topic layer hive data processing) -> qualitis (quality inspection) -> shell (waterdrop synchronizing hive data to clickhouse).
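As an illustration only, the example workflow above could be recorded as an ordered list of flow components. This is a minimal sketch; the names `FlowComponent`, `WORKFLOW` and `upstream_of` are hypothetical and do not come from the application.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FlowComponent:
    """One step of a workflow (hypothetical record layout)."""
    kind: str  # component type, e.g. "sqoop", "hql", "qualitis", "shell"
    role: str  # what the step does in this workflow


# The example workflow from the description, in execution order.
WORKFLOW: List[FlowComponent] = [
    FlowComponent("sqoop", "data exchange"),
    FlowComponent("hql", "cleaning layer hive data processing"),
    FlowComponent("qualitis", "quality inspection"),
    FlowComponent("hql", "fusion layer hive data processing"),
    FlowComponent("qualitis", "quality inspection"),
    FlowComponent("hql", "topic layer hive data processing"),
    FlowComponent("qualitis", "quality inspection"),
    FlowComponent("shell", "waterdrop synchronizing hive data to clickhouse"),
]


def upstream_of(index: int) -> List[FlowComponent]:
    """Return every component that runs before the one at `index`."""
    return WORKFLOW[:index]
```

Keeping the steps in execution order means that everything upstream of a failing step can be found by slicing the list.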

Further, the flow component specifically includes: a data exchange flow component, a data development flow component, a data quality verification flow component and a data visualization flow component.

Here, the data exchange flow components include sqoop flow components and datax flow components; the data development flow components include shell flow components, hql flow components, sql flow components, spark flow components, python flow components, and the like; the data quality verification flow components are used for verifying data problems; and the data visualization flow components are used for visually displaying the data in each flow component during workflow processing, which facilitates the display of certain indicators.
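The grouping of component types into categories can be sketched as a lookup table. The mapping below covers only the component types named in the description (with qualitis taken from the example workflow); the names `CATEGORY_BY_KIND` and `categorize` are hypothetical.

```python
from typing import Dict

# Component type -> flow component category, as listed in the description.
CATEGORY_BY_KIND: Dict[str, str] = {
    "sqoop": "data exchange",
    "datax": "data exchange",
    "shell": "data development",
    "hql": "data development",
    "sql": "data development",
    "spark": "data development",
    "python": "data development",
    "qualitis": "data quality verification",
}


def categorize(kind: str) -> str:
    """Return the category of a component type, or "unknown" if unlisted."""
    return CATEGORY_BY_KIND.get(kind.lower(), "unknown")
```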

Further, the attribute information of the target node includes: the creation time of each node, the capacity of each node, the internal execution directory of each node, and the execution server corresponding to each node.

Here, the target nodes are provided so that, when a data problem occurs during data processing flow management or an error occurs in a certain target node, the corresponding data source can be quickly located, tracked and backtracked according to the type and the attribute information of the target node, and the data source can be clearly displayed.
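The node attributes listed above could be held in a small record type, which makes it straightforward to narrow down the nodes involved when an error is reported. This is a sketch under assumptions: the class `TargetNode`, its field names and the helper `nodes_on_server` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class TargetNode:
    """Attribute information of one target node (hypothetical field names)."""
    name: str
    created_at: datetime  # creation time of the node
    capacity_bytes: int   # capacity of the node
    exec_dir: str         # internal execution directory of the node
    exec_server: str      # execution server corresponding to the node


def nodes_on_server(nodes: List[TargetNode], server: str) -> List[TargetNode]:
    """Return the nodes executed on `server`, e.g. to locate a failing run."""
    return [node for node in nodes if node.exec_server == server]
```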

S102, parsing the flow components in the workflow, determining a source table and a target table of each flow component and an association relationship between a source field in the source table and a target field in the target table, and displaying the determined source table, the determined target table and the association relationship between the source field and the target field in a branch graph form.

In this step, at least one workflow may be generated in a data processing flow, and at least one flow component may be used in each workflow, so each flow component needs to be parsed accordingly to determine the source table and the target table of each flow component, and the association relationship between the source field in the source table and the target field in the target table.

In order to display the source table, the target table, and the association relationship between the source field and the target field more intuitively and clearly, the determined source table, target table, and association relationship between the source field and the target field may be displayed in the form of a branch graph, where the branch graph is a knowledge graph structure.
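One way to picture the branch graph is as a directed graph whose edges run from source tables to target tables and from source fields to target fields. The following is a minimal in-memory sketch, not the application's implementation; the class `BranchGraph` and its methods are hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple

Field = Tuple[str, str]  # (table name, field name)


class BranchGraph:
    """Hypothetical in-memory branch graph of table and field lineage."""

    def __init__(self) -> None:
        # target table -> source tables that feed it
        self.table_parents: Dict[str, Set[str]] = defaultdict(set)
        # target field -> source fields that feed it
        self.field_parents: Dict[Field, Set[Field]] = defaultdict(set)

    def add_mapping(self, source_table: str, target_table: str,
                    field_pairs: Iterable[Tuple[str, str]]) -> None:
        """Record one flow component's table edge and its field edges."""
        self.table_parents[target_table].add(source_table)
        for source_field, target_field in field_pairs:
            self.field_parents[(target_table, target_field)].add(
                (source_table, source_field))

    @staticmethod
    def _walk(start, parents):
        """Breadth-first walk upstream from `start`, nearest ancestors first."""
        seen, order, queue = {start}, [], deque([start])
        while queue:
            for parent in sorted(parents.get(queue.popleft(), ())):
                if parent not in seen:
                    seen.add(parent)
                    order.append(parent)
                    queue.append(parent)
        return order

    def trace_table(self, table: str) -> List[str]:
        """Return all upstream tables of `table`."""
        return self._walk(table, self.table_parents)

    def trace_field(self, table: str, field: str) -> List[Field]:
        """Return all upstream fields of `table.field`."""
        return self._walk((table, field), self.field_parents)
```

Each parsed flow component contributes one `add_mapping` call; backtracking a data problem then amounts to calling `trace_table` or `trace_field` on the affected table or field.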

S103, managing the data lineage through the attribute information of the target node and the branch graph.

In this step, the attribute information of the target node is combined with the source table, the target table and the association relationship between the source field and the target field, and is presented and displayed in a centralized manner, driven by the branch graph.

Compared with data lineage management methods in the prior art, the data lineage management method provided by the embodiment of the application records and manages the workflow as a whole during data processing, thereby recording, across the full process, the table lineage and field lineage of each node and each flow component in the workflow, so that data problems in a data warehouse can be located, tracked and backtracked in a reasonable manner.

Referring to fig. 2, fig. 2 is a flowchart illustrating a data lineage management method according to another embodiment of the present application. As shown in fig. 2, a data lineage management method provided in an embodiment of the present application includes:

S201, obtaining a workflow of target metadata, and storing the workflow to a target node; the workflow includes at least one flow component.

S202, parsing the synchronization script of the data exchange flow component in the workflow, and determining a source table and a target table of the data exchange flow component.

In this step, the flow components in the workflow specifically include: a data exchange flow component, a data development flow component, a data quality verification flow component and a data visualization flow component, wherein the synchronization script of the data exchange flow component needs to be parsed to determine the source table and the target table of the data exchange flow component.

S203, parsing the corresponding fields of the source data corresponding to the metadata in the source table and the target data corresponding to the metadata in the target table, determining an association relationship between the source fields in the source table and the target fields in the target table, and displaying the determined source table, the target table and the association relationship between the source fields and the target fields in a branch graph form.

The data exchange flow component sqoop is parsed as follows: the synchronization script is parsed, the source table and the target table are obtained according to the table and the hive table in the script, and the correspondence between the source fields in the source table and the target fields in the target table is parsed in combination with mysql/oracle and the like.
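To make the sqoop case concrete, the sketch below reads a single-line `sqoop import` command and takes the source table from `--table` and the target table from `--hive-table`, which are standard Sqoop import arguments. It assumes the synchronization script is one such command; the function name `parse_sqoop_import` and the returned dictionary keys are hypothetical. When `--columns` is absent, the field list would have to come from the source database metadata (mysql/oracle), which is not shown here.

```python
import shlex
from typing import Dict, List


def parse_sqoop_import(command: str) -> Dict[str, object]:
    """Extract source table, target hive table and columns from a sqoop command."""
    tokens = shlex.split(command)
    options: Dict[str, str] = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("--")
        if token.startswith("--") and has_value:
            options[token] = tokens[i + 1]  # option followed by its value
            i += 2
        else:
            i += 1  # bare word or flag without a value, e.g. --hive-import
    columns: List[str] = [
        name.strip()
        for name in options.get("--columns", "").split(",")
        if name.strip()
    ]
    return {
        "source_table": options.get("--table"),
        "target_table": options.get("--hive-table"),
        "columns": columns,
    }
```

For example, a command containing `--table orders --hive-table ods.orders` yields `orders` as the source table and `ods.orders` as the target table.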

For the data exchange flow component hive, analysis of the data lineage can be achieved by hive.

Further, the parsing the flow components in the workflow, determining a source table and a target table of each flow component and an association relationship between a source field in the source table and a target field in the target table, and displaying the determined source table, the target table, and the association relationship between the source field and the target field in a branch graph form further includes:

For the data development flow component spark, synchronous parsing of specific fields in the metadata can be realized by customizing org.

For the data development flow component hadoop, synchronous parsing of specific fields in the metadata can be realized by customizing org.

Here, the parsing of the correspondence between the source field and the target field may specifically be:

For hadoop, custom code can be parsed in org.

For elasticsearch, when hive pushes data to ES through a mapping table, the table creation statement of the mapping table is parsed to obtain the source fields and the target fields; when a spark program is used, the corresponding fields are parsed according to the sql fields specified in the program.
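As a hedged illustration of parsing a mapping table's creation statement, the sketch below assumes the mapping table is declared with the elasticsearch-hadoop Hive storage handler and its `es.resource` and `es.mapping.names` table properties; the application does not state which properties are used. It also assumes simple column types without parentheses. The function name `parse_es_mapping_ddl` and the result keys are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def parse_es_mapping_ddl(ddl: str) -> Dict[str, object]:
    """Extract hive columns and their ES field names from a mapping-table DDL."""
    match = re.search(
        r"CREATE\s+EXTERNAL\s+TABLE\s+([\w.]+)\s*\((.*?)\)\s*STORED\s+BY",
        ddl, re.IGNORECASE | re.DOTALL)
    if not match:
        raise ValueError("not a recognised mapping-table DDL")
    hive_table = match.group(1)
    # Each column definition is "<name> <type>"; keep only the name.
    columns = [part.split()[0] for part in match.group(2).split(",") if part.strip()]
    # Collect 'key' = 'value' pairs from TBLPROPERTIES.
    props = dict(re.findall(r"'([^']+)'\s*=\s*'([^']*)'", ddl))
    # es.mapping.names holds "hive_column:es_field" pairs separated by commas.
    renames: Dict[str, str] = {}
    for pair in props.get("es.mapping.names", "").split(","):
        if ":" in pair:
            hive_column, es_field = pair.split(":", 1)
            renames[hive_column.strip()] = es_field.strip()
    field_pairs: List[Tuple[str, str]] = [
        (column, renames.get(column, column)) for column in columns
    ]
    return {
        "source_table": hive_table,
        "target_index": props.get("es.resource"),
        "field_pairs": field_pairs,
    }
```

Columns without an entry in `es.mapping.names` are assumed to keep the same name on the ES side.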

For clickhouse, the target table and the source table are parsed according to the waterdrop synchronization script, and the source fields and the target fields are obtained according to the sql statement in the script and the metadata of clickhouse.
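The field-level part of this step can be sketched by reading the select list of the sql statement taken from the synchronization script. The sketch handles only a flat `SELECT column [AS alias], ... FROM table` statement, with no functions or nested queries, and does not read the waterdrop configuration or the clickhouse metadata; the function name `parse_select_fields` is hypothetical.

```python
import re
from typing import List, Tuple


def parse_select_fields(sql: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Return (source table, [(source field, target field), ...]) for a flat SELECT."""
    match = re.search(r"select\s+(.*?)\s+from\s+([\w.]+)", sql,
                      re.IGNORECASE | re.DOTALL)
    if not match:
        raise ValueError("unsupported SQL statement")
    pairs: List[Tuple[str, str]] = []
    for item in match.group(1).split(","):
        # "column AS alias" maps column -> alias; a bare column maps to itself.
        parts = re.split(r"\s+as\s+", item.strip(), flags=re.IGNORECASE)
        source_field = parts[0].strip()
        target_field = parts[1].strip() if len(parts) > 1 else source_field
        pairs.append((source_field, target_field))
    return match.group(2), pairs
```

In a fuller implementation the target field names would be checked against the clickhouse table metadata, as the description indicates.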

S204, managing the data lineage through the attribute information of the target node and the branch graph.

The description of S201 may refer to the description of S101, and the same technical effect may be achieved, which is not described in detail herein.

Referring to fig. 3 and 4, fig. 3 is a schematic structural diagram of a data lineage management device according to an embodiment of the present disclosure, and fig. 4 is a schematic structural diagram of another data lineage management device according to an embodiment of the present disclosure. As shown in fig. 3, the data lineage management device 300 includes:

a first obtaining module 310, configured to obtain a workflow of target metadata, and store the workflow to a target node; the workflow includes at least one flow component.

Further, the flow components in the first obtaining module 310 specifically include: a data exchange flow component, a data development flow component, a data quality verification flow component and a data visualization flow component.

Further, the attribute information of the target node in the first obtaining module 310 includes: the creation time of each node, the capacity of each node, the internal execution directory of each node, and the execution server corresponding to each node.

The parsing module 320 is configured to parse the flow components in the workflow, determine a source table and a target table of each flow component and an association relationship between a source field in the source table and a target field in the target table, and display the determined source table, the target table, and the association relationship between the source field and the target field in a branch graph form.

The management module 330 is configured to manage the data lineage through the attribute information of the target node and the branch graph.

Compared with data lineage management devices in the prior art, the data lineage management device provided by the embodiment of the application records and manages the workflow as a whole during data processing, thereby recording, across the full process, the table lineage and field lineage of each node and each flow component in the workflow, so that data problems in a data warehouse can be located, tracked and backtracked in a reasonable manner.

As shown in fig. 4, the data lineage management device 300 includes:

Further, the parsing module 320 includes:

A first data exchange flow parsing unit 321, configured to parse the synchronization script of the data exchange flow component in the workflow, and determine a source table and a target table of the data exchange flow component.

A second data exchange flow parsing unit 322, configured to parse the corresponding fields of the source data corresponding to the metadata in the source table and the target data corresponding to the metadata in the target table, determine an association relationship between a source field in the source table and a target field in the target table, and display the determined source table, the target table, and the association relationship between the source field and the target field in a branch graph form.

A data development flow parsing unit 323, configured to parse the script of the data development flow component in the workflow through custom code and synchronously parse specific fields in the metadata.

Referring to fig. 5, fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 5, the electronic device 500 includes a processor 510, a memory 520, and a bus 530.

The memory 520 stores machine-readable instructions executable by the processor 510, the processor 510 and the memory 520 communicate via the bus 530 when the electronic device 500 is running, and the machine-readable instructions, when executed by the processor 510, perform the steps of the data lineage management method in the method embodiments of fig. 1 and 2.

In particular, the machine readable instructions, when executed by the processor 510, may perform the following:

acquiring a workflow of target metadata, and storing the workflow to a target node; the workflow includes at least one flow component.

Parsing the flow components in the workflow, determining a source table and a target table of each flow component and an association relationship between a source field in the source table and a target field in the target table, and displaying the determined source table, the determined target table and the association relationship between the source field and the target field in a branch graph form.

By recording and managing the workflow as a whole during data processing, the application records, across the full process, the table lineage and field lineage of each node and each flow component in the workflow, so that data problems in a data warehouse can be located, tracked and backtracked in a reasonable manner.

An embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the data lineage management method in the method embodiments shown in fig. 1 and fig. 2 may be executed.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application, or the portion thereof that contributes over the prior art, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present application, and are used for illustrating the technical solutions of the present application, but not limiting the same, and the scope of the present application is not limited thereto, and although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the spirit and scope of the exemplary embodiments of the present application, and are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A data lineage management method, the method comprising:

2. The data lineage management method according to claim 1, wherein the flow component specifically includes: a data exchange flow component, a data development flow component, a data quality verification flow component and a data visualization flow component.

3. The method according to claim 2, wherein the parsing the flow components in the workflow, determining the source table and the target table of each flow component and the association relationship between the source field in the source table and the target field in the target table, and displaying the determined source table, target table and the association relationship between the source field and the target field in a branch graph form comprises:

4. The method according to claim 2, wherein the parsing the flow components in the workflow, determining the source table and the target table of each flow component and the association relationship between the source field in the source table and the target field in the target table, and displaying the determined source table, target table and the association relationship between the source field and the target field in a branch graph form further comprises:

5. The data lineage management method according to claim 1, wherein the attribute information of the target node includes: the creation time of each node, the capacity of each node, the internal execution directory of each node, and the execution server corresponding to each node.

6. A data lineage management device, comprising:

7. The data lineage management device according to claim 6, wherein the flow components in the first obtaining module specifically include: a data exchange flow component, a data development flow component, a data quality verification flow component and a data visualization flow component.

8. The data lineage management device according to claim 6, wherein the parsing module includes:

9. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating via the bus when the electronic device is operated, the machine-readable instructions being executable by the processor to perform the steps of the data lineage management method according to any one of claims 1 to 5.

10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the data lineage management method according to any one of claims 1 to 5.