CN116662441A - Distributed data lineage construction and display method

Info

Publication number: CN116662441A
Application number: CN202310238130.8A
Authority: CN
Inventors: 严浩; 周晓磊; 范强; 王芳潇; 张骁雄
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2023-03-13
Filing date: 2023-03-13
Publication date: 2023-08-29

Abstract

The application discloses a distributed data lineage construction and display method, and belongs to the technical field of visual analysis and display of data resources. The method comprises the following steps. Step SS1, data lineage construction: lineage relationship data is generated through distributed metadata collection, lineage acquisition from data processing, object parsing in data access middleware, and monitoring of data storage access. Step SS2: a data parsing and processing module parses the lineage relationship data and integrates the lineage relationships to generate the data lineage. Step SS3: the data lineage is stored in a graph database for user query and visual display. While the data center is running, the application builds data lineage through metadata collection, automatic data collection and parsing, and manual data collection. It also provides a relationship exploration technique for discovering potential associations among data resources, which supports data flow management.

Description

Distributed data lineage construction and display method

Technical Field

The application relates to a distributed data lineage construction and display method, and belongs to the technical field of visual analysis and display of data resources.

Background

All data, from creation through ETL processing, fusion and circulation to final retirement, naturally forms relationships with other data. By analogy with kinship in human society, these relationships are called the lineage of the data. Data lineage is a logical concept in data management, used to find the relationships among related data when tracing data back to its origin. Lineage analysis is a part of data management and a means of safeguarding data fusion: it makes the fusion process traceable. The lineage of big data is the chain along which the data is produced, and it records which processes and stages the data has passed through.

A typical lineage analysis method formulates different analysis schemes for lineage of different granularity and displays the data flow graphically to help users understand complex lineage relationships, thereby covering lineage collection, lineage analysis and lineage display.

The disadvantages of the prior art are as follows. (1) The metadata collection step in current lineage analysis techniques is mainly oriented to the data warehouse, and metadata in the data warehouse is collected through a direct API connection. With the rapid development of business, the demands of data operation and cost management keep growing, and metadata collection needs to cover the full life cycle of data, including databases, offline computing services, online computing services, data center components, computing tasks and the like. (2) Current lineage analysis mainly targets the data of a data center. In a large distributed system, however, the data is stored in a distributed manner on the individual nodes and, because of constraints such as network and permissions, cannot be physically gathered in one place, which makes lineage analysis difficult.

Disclosure of Invention

The application aims to overcome the above defects of the prior art and to provide a distributed data lineage construction and display method.

The application adopts the following technical scheme: a distributed data lineage construction and display method, comprising the following steps:

step SS1, data lineage construction: generating lineage relationship data through distributed metadata collection, lineage acquisition from data processing, object parsing in data access middleware, and monitoring of data storage access;

step SS2: parsing the lineage relationship data and integrating the lineage relationships by a data parsing and processing module to generate the data lineage;

step SS3: storing the data lineage in a graph database for user query and visual display.

As a preferred embodiment, the distributed metadata collection in step SS1 includes: each node takes the metadata produced across the data life cycle as collection objects and creates data sources for them, the collection objects comprising databases, offline computing services, online computing services, data center components and computing tasks; a collection task is configured for each data source and metadata collection is then executed; the collected metadata is tagged and written into a database; and cross-node metadata aggregation, deduplication and fusion are performed.

As a preferred embodiment, the distributed metadata collection in step SS1 further includes: for a database as the collection object, the table name, comments, field list, primary key, foreign keys, table size, number of rows, number of files, number of partitions, and the upstream and downstream dependencies of tables/fields are collected by accessing the database.

As a preferred embodiment, the distributed metadata collection in step SS1 further includes: for an offline computing service as the collection object, metadata of Hive/RDS tables is collected by calling the computing service interface, the table metadata comprising trend data on file status, number of files, file size and data update time; for an online computing service as the collection object, metadata of the Flume/HBase/Kafka components is collected by accessing the table data that the service persists to disk, which yields the basic metadata of the computing topics.

As a preferred embodiment, the distributed metadata collection in step SS1 further includes: for a data center component as the collection object, lineage data of the BI report system, the indicator library and the OneServer service is collected by synchronizing the component data to a database and extracting the metadata offline.

As a preferred embodiment, the distributed metadata collection in step SS1 further includes: for a computing task as the collection object, the name, owner, deadline alarm time, script and task configuration information of offline/real-time computing tasks are collected by parsing the task's input/output dependency configuration or by parsing the table/field lineage in the computing script.

As a preferred embodiment, writing the metadata into the database and performing cross-node metadata aggregation, deduplication and fusion specifically comprises: adopting a centralized metadata aggregation mode, in which a central node acts as the master node and the other data nodes act as slave nodes; after metadata collection is completed, the slave nodes send their metadata to the master node through periodic metadata synchronization tasks; after receiving the metadata, the master node first stores it in a database as raw data, and then deduplicates and fuses the metadata by means of the metadata tags, thereby obtaining a metadata set covering the global data.
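The master-node behaviour described above (store the received metadata as raw data, then deduplicate and fuse by tag) can be sketched roughly as follows. This is an illustrative sketch only, not the patented implementation: the `MetadataRecord` fields, the tag format and the "newer value wins" fusion rule are all assumptions introduced for the example, and in-memory containers stand in for the database.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    node: str          # slave node that collected the record (hypothetical field)
    tag: str           # metadata tag used as the deduplication key
    attributes: dict   # collected attributes, e.g. row count or comments
    collected_at: int  # collection time, epoch seconds (hypothetical field)


@dataclass
class MasterNode:
    raw_store: list = field(default_factory=list)   # stands in for the raw-data table
    global_set: dict = field(default_factory=dict)  # tag -> fused metadata

    def receive(self, records):
        """Keep the incoming batch as raw data, then deduplicate and fuse by tag."""
        self.raw_store.extend(records)
        # Within a batch, apply records oldest first so newer values win
        # (an assumed fusion rule; the source does not specify one).
        for rec in sorted(records, key=lambda r: r.collected_at):
            fused = self.global_set.setdefault(
                rec.tag, {"sources": [], "attributes": {}}
            )
            if rec.node not in fused["sources"]:
                fused["sources"].append(rec.node)
            fused["attributes"].update(rec.attributes)


master = MasterNode()
master.receive([
    MetadataRecord("node-a", "sales.orders", {"rows": 100}, 1),
    MetadataRecord("node-b", "sales.orders", {"rows": 120, "comment": "orders"}, 2),
])
print(master.global_set["sales.orders"])
```

Two records with the same tag collapse into one entry in `global_set`, while both remain available in `raw_store` for auditing.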

As a preferred embodiment, the lineage acquisition from data processing in step SS1 includes: first, data extraction, in which the scattered data produced by each business system and data source is comprehensively surveyed, the required data sources are specified and defined, the data sources on which operations can actually be performed are selected, and the rules for incremental extraction are determined; second, data transformation, in which data is converted from the business model to the analysis model by transformation measures that provide the basic operations of selection, splitting/merging, conversion and summarization; third, data loading, in which the transformed data is loaded into a database by direct loading or through a database connection; and finally, the processing information of the source data is extracted from the transformation process to obtain the field-level lineage of the source data, and the transformed data and the field-level lineage of the source data are stored in a database.

As a preferred embodiment, the data lineage relationship has the hierarchy owner, database, table, field.

As a preferred embodiment, the parsing of the data lineage in step SS2 includes: obtaining detailed information about the input tables and output tables through a HiveHook plug-in, sending this information asynchronously to Kafka, and, after parsing, writing the data into a graph database to provide metadata system display and a REST API service, and persisting it to a Hive relationship table for user query and visual display.

The application has the following beneficial effects. (1) The application provides a data lineage construction method for a data center, comprising four processes for acquiring data lineage: metadata collection, lineage acquisition from data processing, object parsing in data access middleware, and monitoring of data storage access. The four processes share a unified data parsing and processing module that parses the data and integrates the lineage relationships, and the result is finally stored in a graph database.

(2) The application provides a metadata collection method covering the full life cycle of data, which collects metadata from databases, offline computing services, online computing services, data center components, computing tasks and the like, organizes the metadata and writes it into a database. Further, the application lists the collection content and collection method for each type of metadata collection object.

(3) The application provides a lineage acquisition method for data processing. The data processing first extracts the relevant data from the data source, then transforms it according to the determined transformation requirements, and then loads the reasonably standardized data into a data warehouse. Lineage acquisition collects the modifications made to the data during this processing and thereby generates the lineage between the new data and the old data.

(4) The application provides a lineage analysis and display method, including a hierarchy of lineage relationships for structured data stored in a database. During lineage parsing, detailed information such as the input tables and output tables is obtained through a HiveHook plug-in and sent asynchronously to Kafka; after parsing, the data is written into a graph database to provide metadata system display and a REST API service, and is persisted to a Hive relationship table for user query and visual display. The lineage visualization displays rules and the distribution of flow directions at different positions on the graph, and thereby supports data tracing, data value evaluation and data quality evaluation.

(5) Data lineage is acquired by different technical means, in both dynamic and static modes. Lineage reflecting the dynamic logical associations of data is obtained through data processing tasks, data storage and access monitoring, object parsing in data access middleware and similar means, so that the data flow paths among entity objects such as tables, files, fields and tasks are clearly reflected, the data flow links of the whole system are shown to users intuitively, impact analysis and hot/cold data analysis are supported, and users are helped to understand complex lineage relationships.

Drawings

FIG. 1 is a schematic topology diagram of a preferred embodiment of the distributed data lineage construction and display method of the present application;

FIG. 2 is a schematic topology diagram of the distributed metadata collection of the present application;

FIG. 3 is a schematic topology diagram of the lineage acquisition from data processing of the present application;

FIG. 4 is a schematic diagram of the data lineage hierarchy of the present application;

FIG. 5 is a schematic diagram of the lineage parsing process of the present application;

FIG. 6 is a schematic diagram of the data lineage visualization of the present application.

Detailed Description

The application is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present application, and are not intended to limit the scope of the present application.

Example 1: as shown in FIG. 1, the application provides a distributed data lineage construction and display method. While the data center is running, the method builds data lineage through metadata collection, automatic data collection, and data parsing and collection, and provides a relationship exploration technique for discovering potential associations among data resources.

FIG. 1 shows the four processes used in the present application to acquire data lineage: distributed metadata collection, lineage acquisition from data processing, object parsing in data access middleware, and monitoring of data storage access. The four processes share a unified data parsing and processing module that parses the data and integrates the lineage relationships, and the result is finally stored in a graph database.

As shown in FIG. 2, as a preferred embodiment, the distributed metadata collection in step SS1 includes: each node takes the metadata produced across the data life cycle as collection objects and creates data sources for them, the collection objects comprising databases, offline computing services, online computing services, data center components and computing tasks; a collection task is configured for each data source and metadata collection is then executed; the collected metadata is tagged and written into a database; and cross-node metadata aggregation, deduplication and fusion are performed. The way metadata is collected differs from source to source. Metadata of the data dictionaries of both structured and unstructured data is collected, and it is stored in a database once collection is complete. Metadata is data that describes data: apart from the data that business logic reads, writes and processes directly, all other information needed to keep the whole system running can be called metadata. Examples include the schema, table and column information of a database, the lineage of tasks, and the permission mappings between users and scripts/tasks.
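For the database collection object, the collect-and-tag step above can be illustrated with a small sketch. SQLite is used here only because it ships with Python; it is a stand-in for whatever database a node actually exposes. The function name, the record layout and the `node/table` tag format are assumptions made for the example.

```python
import sqlite3


def collect_table_metadata(conn, node_id):
    """Collect table name, field list, primary key and row count for each table."""
    records = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        records.append({
            "tag": f"{node_id}/{table}",   # hypothetical tag format
            "object_type": "database",
            "table": table,
            "fields": [col[1] for col in columns],
            "primary_key": [col[1] for col in columns if col[5] > 0],
            "row_count": row_count,
        })
    return records


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("INSERT INTO orders (amount) VALUES (9.5)")
print(collect_table_metadata(conn, "node-a"))
```

The tag attached to each record is what a later aggregation step could use to recognise the same table reported by several nodes.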

The metadata collection content and collection method for each type of collection object are listed in Table 1.

Table 1. Metadata collection content and methods

In a preferred embodiment, the data processing extracts the relevant data from the data source according to unified rules, then transforms it according to the determined transformation requirements, and finally loads the reasonably standardized data into the data warehouse. Lineage acquisition from data processing collects the modifications made to the data during this processing and thereby generates the lineage between the new data and the old data. As shown in FIG. 3, the lineage acquisition from data processing in step SS1 includes: first, data extraction, in which the scattered data produced by each business system and data source is comprehensively surveyed, the required data sources are specified and defined, the data sources on which operations can actually be performed are selected, and the rules for incremental extraction are determined; second, data transformation, in which data is converted from the business model to the analysis model by transformation measures that provide the basic operations of selection, splitting/merging, conversion and summarization; third, data loading, in which the transformed data is loaded into a database by direct loading or through a database connection; and finally, the processing information of the source data is extracted from the transformation process to obtain the field-level lineage of the source data, and the transformed data and the field-level lineage of the source data are stored in the target database.
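One way to obtain field-level lineage as a by-product of the transformation step is to describe each target field declaratively and derive both the converted rows and the lineage edges from the same description. The sketch below assumes such a declarative mapping; the function name, the mapping layout and the table names are hypothetical and are not taken from the application.

```python
def transform_with_lineage(rows, mapping, source_table, target_table):
    """Apply field mappings to rows and record field-level lineage.

    mapping: target field -> (list of source fields, function of those values)
    """
    converted = [
        {
            target: fn(*(row[s] for s in sources))
            for target, (sources, fn) in mapping.items()
        }
        for row in rows
    ]
    # One lineage edge per (source field, target field) pair in the mapping.
    lineage = [
        (f"{source_table}.{s}", f"{target_table}.{target}")
        for target, (sources, _) in mapping.items()
        for s in sources
    ]
    return converted, lineage


rows = [{"first": "Ada", "last": "Lovelace", "price": 2, "qty": 3}]
mapping = {
    "full_name": (["first", "last"], lambda f, l: f"{f} {l}"),  # merge
    "total": (["price", "qty"], lambda p, q: p * q),            # conversion
}
converted, lineage = transform_with_lineage(rows, mapping, "src.orders", "dw.orders")
print(converted)
print(lineage)
```

Because the lineage is derived from the mapping rather than from the data, it can be stored alongside the loaded rows without inspecting them again.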

As a preferred embodiment, FIG. 4 depicts the hierarchy of lineage relationships for structured data stored in a database. The hierarchy differs slightly between types of data. When data circulates and is fused among different owners, the data links the owners to one another, and this owner-to-owner relationship is itself one kind of data lineage. The data lineage relationship has the hierarchy owner, database, table, field.
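The four-level hierarchy suggests that field-level lineage edges can be rolled up to table, database or owner level. The following sketch shows one possible representation; the `FieldRef` type, the `/`-separated qualified names and the `roll_up` helper are assumptions for illustration.

```python
from typing import NamedTuple


class FieldRef(NamedTuple):
    owner: str
    database: str
    table: str
    field: str

    def levels(self):
        """Return the qualified name of this field at each hierarchy level."""
        parts = list(self)
        return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def roll_up(edges, depth):
    """Aggregate field-level edges to a coarser level (1 = owner ... 4 = field)."""
    rolled = set()
    for src, dst in edges:
        a, b = src.levels()[depth - 1], dst.levels()[depth - 1]
        if a != b:  # drop edges that stay inside one node at this level
            rolled.add((a, b))
    return sorted(rolled)


edges = [
    (FieldRef("teamA", "crm", "users", "id"),
     FieldRef("teamB", "dw", "dim_user", "user_id")),
    (FieldRef("teamA", "crm", "users", "name"),
     FieldRef("teamB", "dw", "dim_user", "user_name")),
]
print(roll_up(edges, 3))  # table-level lineage
print(roll_up(edges, 1))  # owner-level lineage
```

Rolling up to depth 1 yields the owner-to-owner relationship that the paragraph above describes as one kind of data lineage.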

As a preferred embodiment, as shown in FIG. 5, the parsing of the data lineage in step SS2 includes: obtaining detailed information about the input tables and output tables through a HiveHook plug-in, sending this information asynchronously to Kafka, and, after parsing, writing the data into a graph database to provide metadata system display and a REST API service, and persisting it to a Hive relationship table for user query and visual display.
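The parse-and-store half of this flow can be sketched without any external services. In the sketch below a `queue.Queue` stands in for the Kafka topic and a dictionary of adjacency sets stands in for the graph database; the JSON message layout (`inputs`, `outputs`) is a hypothetical format, not the actual payload emitted by a Hive hook.

```python
import json
import queue


def parse_lineage_event(message):
    """Turn one hook message into (input_table, output_table) edges."""
    event = json.loads(message)
    return [(src, dst) for dst in event["outputs"] for src in event["inputs"]]


def consume(events, graph):
    """Drain the queue and add each edge to an adjacency-set graph."""
    while True:
        try:
            message = events.get_nowait()
        except queue.Empty:
            break
        for src, dst in parse_lineage_event(message):
            graph.setdefault(src, set()).add(dst)
            graph.setdefault(dst, set())  # make sure sink tables are nodes too
    return graph


events = queue.Queue()
events.put(json.dumps({
    "inputs": ["ods.orders", "ods.users"],
    "outputs": ["dw.order_wide"],
}))
print(consume(events, {}))
```

Decoupling the producer (the hook) from the consumer (the parser) through a queue is what allows the hook to send its information asynchronously without slowing down the query that triggered it.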

FIG. 6 depicts the content presented by the lineage visualization. The visualization displays rules and the distribution of flow directions at different positions on the graph, and thereby serves several purposes:

1) Data tracing: when data is abnormal, the lineage helps to track down the cause of the abnormality, and also helps to trace the source of the data and the processing it has undergone;

2) Data value evaluation: the value of data needs to be assessed, and data lineage provides a basis for the assessment in terms of data audience, data update volume, data update frequency and the like;

3) Data quality evaluation: the list of data cleaning standards can be read directly from the lineage graph, which reflects the requirements placed on data quality.
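Data tracing and impact analysis both reduce to graph traversal over the stored lineage: walking edges backwards finds the sources of an abnormal table, and walking them forwards finds everything affected by it. A minimal sketch over an in-memory adjacency-set graph (the table names are invented for the example, and a real deployment would issue the equivalent query to the graph database):

```python
from collections import deque


def reachable(graph, start):
    """Breadth-first search: every node reachable from start, excluding start."""
    seen, pending = set(), deque([start])
    while pending:
        node = pending.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen and nxt != start:
                seen.add(nxt)
                pending.append(nxt)
    return seen


def reverse(graph):
    """Flip every edge so that downstream lookups become upstream lookups."""
    flipped = {node: set() for node in graph}
    for src, targets in graph.items():
        for dst in targets:
            flipped.setdefault(dst, set()).add(src)
    return flipped


lineage = {
    "ods.orders": {"dw.order_wide"},
    "ods.users": {"dw.order_wide"},
    "dw.order_wide": {"ads.report"},
    "ads.report": set(),
}
print(reachable(lineage, "ods.orders"))           # impact analysis (downstream)
print(reachable(reverse(lineage), "ads.report"))  # data tracing (upstream)
```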

The application supports the systematic integration of scattered, non-standardized, low-availability data, and forms unified, high-quality and highly trustworthy data assets through the construction of a data warehouse. Through lineage collection and analysis, the application efficiently obtains data on complex lineage relationships and presents it to users visually on a lineage graph, which improves both the efficiency and the experience of user analysis and decision-making.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present application and not for limiting the same, and although the present application has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the application without departing from the spirit and scope of the application, which is intended to be covered by the claims.

Claims

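1. A distributed data lineage construction and display method, characterized by comprising the following steps:

step SS1, data lineage construction: generating lineage relationship data through distributed metadata collection, lineage acquisition from data processing, object parsing in data access middleware, and monitoring of data storage access;

step SS2: parsing the lineage relationship data and integrating the lineage relationships by a data parsing and processing module to generate the data lineage;

step SS3: storing the data lineage in a graph database for user query and visual display.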

2. The distributed data lineage construction and display method according to claim 1, wherein the distributed metadata collection in step SS1 comprises: each node takes the metadata produced across the data life cycle as collection objects and creates data sources for them, the collection objects comprising databases, offline computing services, online computing services, data center components and computing tasks; a collection task is configured for each data source and metadata collection is then executed; the collected metadata is tagged and written into a database; and cross-node metadata aggregation, deduplication and fusion are performed.

3. The distributed data lineage construction and display method according to claim 2, wherein the distributed metadata collection in step SS1 further comprises: for a database as the collection object, the table name, comments, field list, primary key, foreign keys, table size, number of rows, number of files, number of partitions, and the upstream and downstream dependencies of tables/fields are collected by accessing the database.

4. The distributed data lineage construction and display method according to claim 2, wherein the distributed metadata collection in step SS1 further comprises: for an offline computing service as the collection object, metadata of Hive/RDS tables is collected by calling the computing service interface, the table metadata comprising trend data on file status, number of files, file size and data update time; for an online computing service as the collection object, metadata of the Flume/HBase/Kafka components is collected by accessing the table data that the service persists to disk, which yields the basic metadata of the computing topics.

5. The distributed data lineage construction and display method according to claim 2, wherein the distributed metadata collection in step SS1 further comprises: for a data center component as the collection object, lineage data of the BI report system, the indicator library and the OneServer service is collected by synchronizing the component data to a database and extracting the metadata offline.

6. The distributed data lineage construction and display method according to claim 2, wherein the distributed metadata collection in step SS1 further comprises: for a computing task as the collection object, the name, owner, deadline alarm time, script and task configuration information of offline/real-time computing tasks are collected by parsing the task's input/output dependency configuration or by parsing the table/field lineage in the computing script.

7. The distributed data lineage construction and display method according to claim 2, wherein writing the metadata into the database and performing cross-node metadata aggregation, deduplication and fusion specifically comprises: adopting a centralized metadata aggregation mode, in which a central node acts as the master node and the other data nodes act as slave nodes; after metadata collection is completed, the slave nodes send their metadata to the master node through periodic metadata synchronization tasks; after receiving the metadata, the master node first stores it in a database as raw data, and then deduplicates and fuses the metadata by means of the metadata tags, thereby obtaining a metadata set covering the global data.

8. The distributed data lineage construction and display method according to claim 2, wherein the lineage acquisition from data processing in step SS1 comprises: first, data extraction, in which the scattered data produced by each business system and data source is comprehensively surveyed, the required data sources are specified and defined, the data sources on which operations can actually be performed are selected, and the rules for incremental extraction are determined; second, data transformation, in which data is converted from the business model to the analysis model by transformation measures that provide the basic operations of selection, splitting/merging, conversion and summarization; third, data loading, in which the transformed data is loaded into a database by direct loading or through a database connection; and finally, the processing information of the source data is extracted from the transformation process to obtain the field-level lineage of the source data, and the transformed data and the field-level lineage of the source data are stored in a database.

9. The distributed data lineage construction and display method according to claim 1, wherein the data lineage relationship has the hierarchy owner, database, table, field.

10. The distributed data lineage construction and display method according to claim 1, wherein the parsing of the data lineage in step SS2 comprises: obtaining detailed information about the input tables and output tables through a HiveHook plug-in, sending this information asynchronously to Kafka, and, after parsing, writing the data into a graph database to provide metadata system display and a REST API service, and persisting it to a Hive relationship table for user query and visual display.