CN116662441A - Distributed data lineage construction and display method


Info

Publication number
CN116662441A
Authority
CN
China
Prior art keywords
data
blood
metadata
edge
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310238130.8A
Other languages
Chinese (zh)
Inventor
严浩
周晓磊
范强
王芳潇
张骁雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology
Priority to CN202310238130.8A
Publication of CN116662441A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/445 Program loading or initiating
    • G06F9/44521 Dynamic linking or loading; Link editing at or after load time, e.g. Java class loading
    • G06F9/44526 Plug-ins; Add-ons
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)

Abstract

The application discloses a distributed data lineage construction and display method, belonging to the technical field of visual analysis and display of data resources. The method comprises: step SS1, a data lineage construction step: generating lineage relationship data through distributed metadata acquisition, data-processing lineage acquisition, data access middleware object parsing, and data storage access monitoring; step SS2: parsing the lineage relationship data and integrating the lineage relationships by a data parsing and processing module to generate the data lineage relationships; step SS3: storing the data lineage relationships in a graph database for user query and visual display. During operation of the data center, the application constructs data lineage relationships through metadata acquisition, automatic data acquisition and parsing, and manual data collection, and also provides a relationship exploration technique for discovering potential associations among data resources and supporting data flow management.

Description

Distributed data lineage construction and display method
Technical Field
The application relates to a distributed data lineage construction and display method, and belongs to the technical field of visual analysis and display of data resources.
Background
Any data, from creation through ETL processing, fusion, and circulation to final retirement, naturally forms relationships with other data. By analogy with kinship among people in human society, these relationships between data are called data lineage (literally, the "blood relationship" of data). Data lineage is a logical concept in data management, used to identify the relationships between related data when tracing data. Lineage analysis is part of data management and a means of safeguarding data fusion: it makes the data fusion process traceable. The lineage of big data is the chain along which data is produced; it records which processes and stages the data passes through and how.
A typical lineage analysis method formulates different lineage parsing schemes for lineage analysis at different granularities and displays the data flow graphically to help users understand complex lineage relationships, thereby covering data lineage collection, data lineage analysis, and data lineage display.
The disadvantages of the prior art are: (1) the metadata acquisition step in current lineage analysis technology is mainly oriented to the data warehouse, whose metadata is acquired through direct API connections. With the rapid development of business, the demands of data operations and cost management keep growing, and metadata acquisition needs to cover the full life cycle of data, including databases, offline computing services, online computing services, data middle-platform components, computing tasks, and the like; (2) current lineage analysis mainly targets the data of a single data center, whereas in a large distributed system the data is stored across many nodes and, limited by factors such as network and permissions, cannot be physically gathered in one place, which makes lineage analysis difficult.
Disclosure of Invention
The application aims to overcome the deficiencies of the prior art described above by providing a distributed data lineage construction and display method.
The application adopts the following technical solution: a distributed data lineage construction and display method, comprising the following steps:
step SS1: a data lineage construction step, comprising: generating lineage relationship data through distributed metadata acquisition, data-processing lineage acquisition, data access middleware object parsing, and data storage access monitoring;
step SS2: parsing the lineage relationship data and integrating the lineage relationships by a data parsing and processing module to generate the data lineage relationships;
step SS3: storing the data lineage relationships in a graph database for user query and visual display.
As a preferred embodiment, the distributed metadata acquisition in step SS1 includes: each node creates data sources for the acquisition objects whose metadata is to be acquired across the data life cycle, the acquisition objects including databases, offline computing services, online computing services, data middle-platform components, and computing tasks; an acquisition task is configured for each data source and metadata acquisition is then executed; the acquired metadata is tagged and written into a database, and cross-node metadata aggregation and deduplication fusion are performed.
As a preferred embodiment, the distributed metadata acquisition in step SS1 specifically further includes: for a database as the acquisition object, acquiring, by means of database access, the table name, remarks, field list, primary key, foreign key, table size, row count, number of files, number of partitions, and upstream/downstream dependencies of tables/fields.
As a preferred embodiment, the distributed metadata acquisition in step SS1 specifically further includes: for an offline computing service as the acquisition object, acquiring, by calling the computing service interface, Hive/RDS table metadata, including trend data on file status, number of files, file size, and data update time; for an online computing service as the acquisition object, acquiring, by accessing the work-order data that the service persists to disk, the basic metadata of computing topics and the metadata of Flume/HBase/Kafka components.
As a preferred embodiment, the distributed metadata acquisition in step SS1 specifically further includes: for a data middle-platform component as the acquisition object, acquiring, by synchronizing the component data to a database and extracting metadata offline, the lineage data of the BI report system, the metrics library, and the OneServer service.
As a preferred embodiment, the distributed metadata acquisition in step SS1 specifically further includes: for a computing task as the acquisition object, acquiring, by parsing the task's input/output dependency configuration or by parsing the table/field lineage in the computing script, the name, owner, readline alarm time, script, and task configuration information of offline/real-time computing tasks.
As a preferred embodiment, writing the metadata into a database and performing cross-node metadata aggregation and deduplication fusion specifically comprises: adopting a centralized metadata aggregation mode, in which a central node serves as the master node and the other data nodes serve as slave nodes; after metadata acquisition is completed, the slave nodes send their metadata to the master node through periodic metadata synchronization tasks; after receiving the metadata, the master node first stores it in a database unchanged as the raw data, and then deduplicates and fuses the metadata by means of the metadata tags, obtaining a metadata set for the global data.
As a preferred embodiment, the data-processing lineage acquisition in step SS1 includes: first, data extraction: comprehensively identifying the scattered data generated by each business system and data source, specifying and defining the required data sources, selecting the data sources on which operations can be performed, and defining incremental extraction; next, data transformation: converting the data from the business model to the analysis model through transformation operations, providing the basic tasks of selection, splitting/merging, conversion, and summarization; then, data loading: loading the transformed data into a database by direct loading or through a database connection; and finally, extracting the processing information of the source data during data transformation to obtain the field-level lineage of the source data, and storing the transformed data together with the field-level lineage of the source data in a database.
As a preferred embodiment, the data lineage relationship has the hierarchy owner-database-table-field.
As a preferred embodiment, the parsing of the data lineage relationship in step SS2 includes: obtaining rich information such as the input tables and output tables through a Hive hook plug-in and sending it asynchronously to Kafka; after parsing, the data is written into a graph database, which provides metadata system display and a REST API service, and is also persisted to a Hive relationship table for user query and visual display.
The application has the following beneficial effects: (1) The application provides a data lineage construction method for a data center, comprising four processes for acquiring data lineage: metadata acquisition, data-processing lineage acquisition, data access middleware object parsing, and data storage access monitoring. The four processes parse the data and integrate the lineage relationships through a unified data parsing and processing module, and the result is finally stored in a graph database.
(2) The application provides a metadata acquisition method covering the full life cycle of data, which acquires metadata from databases, offline computing services, online computing services, middle-platform components, computing tasks, and the like, then organizes the metadata and writes it into a database. The application further tabulates the different metadata acquisition objects together with their acquisition content and acquisition methods.
(3) The application provides a lineage acquisition method for data processing. Data processing first extracts the relevant data from the data sources, then transforms it according to the determined transformation requirements, and then loads the relatively standardized data into a data warehouse; data-processing lineage acquisition collects the modification information of the data during processing, thereby generating the lineage relationship between the new and old data.
(4) The application provides a lineage analysis and display method, including the hierarchy of lineage relationships for structured data stored in a database; during lineage analysis, rich information such as the input tables and output tables is obtained through a Hive hook plug-in and sent asynchronously to Kafka, and after parsing the data is written into a graph database, which provides metadata system display and a REST API service, and is also persisted to a Hive relationship table for user query and visual display; the lineage visualization displays patterns and flow-direction distributions at different positions on the graph, providing data tracing, data value evaluation, and data quality assessment capabilities.
(5) Data lineage is acquired by different technical means depending on whether the data is handled dynamically or statically. Dynamic, logically associated data lineage is obtained through data processing tasks, data storage and access monitoring, data access middleware object parsing, and similar means, so that the data flow paths among entity objects such as tables, files, fields, and tasks are clearly reflected and the data flow links of the whole system are displayed intuitively to users; this supports impact analysis and hot/cold data analysis and helps users understand complex lineage relationships.
Drawings
FIG. 1 is a schematic topology diagram of a preferred embodiment of the distributed data lineage construction and display method of the present application;
FIG. 2 is a schematic topology diagram of the distributed metadata acquisition of the present application;
FIG. 3 is a schematic topology diagram of the data-processing lineage acquisition of the present application;
FIG. 4 is a schematic diagram of the hierarchy of data lineage relationships of the present application;
FIG. 5 is a schematic diagram of the lineage analysis procedure of the present application;
FIG. 6 is a schematic diagram of the data lineage visualization of the present application.
Detailed Description
The application is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present application, and are not intended to limit the scope of the present application.
Example 1: as shown in FIG. 1, the application provides a distributed data lineage construction and display method which, during operation of the data center, constructs data lineage relationships through metadata acquisition, automatic data acquisition and parsing, and manual data collection, and provides a relationship exploration technique for discovering potential associations among data resources. The method comprises the following steps:
step SS1: a data lineage construction step, comprising: generating lineage relationship data through distributed metadata acquisition, data-processing lineage acquisition, data access middleware object parsing, and data storage access monitoring;
step SS2: parsing the lineage relationship data and integrating the lineage relationships by a data parsing and processing module to generate the data lineage relationships;
step SS3: storing the data lineage relationships in a graph database for user query and visual display.
FIG. 1 shows the four processes the application uses to acquire data lineage: distributed metadata acquisition, data-processing lineage acquisition, data access middleware object parsing, and data storage access monitoring. The four processes parse the data and integrate the lineage relationships through a unified data parsing and processing module, and the result is finally stored in a graph database.
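The flow from the four acquisition processes into a unified parsing step can be sketched as follows. This is only an illustrative sketch, not the implementation of the application: the collector functions, the `LineageRecord` type, and the table names are all hypothetical, and a plain dictionary stands in for the graph database.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LineageRecord:
    source: str   # upstream entity, e.g. "ods.orders"
    target: str   # downstream entity, e.g. "dw.order_summary"
    channel: str  # acquisition process that produced the record


# Hypothetical collectors, one per acquisition process named in the text.
def collect_metadata() -> list[LineageRecord]:
    return [LineageRecord("ods.orders", "dw.order_summary", "metadata")]


def collect_processing() -> list[LineageRecord]:
    return [
        LineageRecord("ods.orders", "dw.order_summary", "processing"),
        LineageRecord("dw.order_summary", "ads.daily_report", "processing"),
    ]


def collect_middleware() -> list[LineageRecord]:
    return [LineageRecord("ads.daily_report", "bi.dashboard", "middleware")]


def collect_storage_access() -> list[LineageRecord]:
    return []


def integrate(records: list[LineageRecord]) -> dict[tuple[str, str], set[str]]:
    """Unified parsing step: merge records from all channels into one edge map."""
    edges: dict[tuple[str, str], set[str]] = {}
    for record in records:
        edges.setdefault((record.source, record.target), set()).add(record.channel)
    return edges


def build_lineage() -> dict[tuple[str, str], set[str]]:
    collectors = [collect_metadata, collect_processing,
                  collect_middleware, collect_storage_access]
    records = [record for collect in collectors for record in collect()]
    return integrate(records)


# Stand-in for the graph database: edge -> channels that reported it.
lineage_graph = build_lineage()
```

Keeping the channel on each edge is a design choice of this sketch: it lets an edge reported by several acquisition processes be stored once while still recording where the evidence came from.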
As shown in FIG. 2, as a preferred embodiment, the distributed metadata acquisition in step SS1 includes: each node creates data sources for the acquisition objects whose metadata is to be acquired across the data life cycle, the acquisition objects including databases, offline computing services, online computing services, data middle-platform components, and computing tasks; an acquisition task is configured for each data source and metadata acquisition is then executed; the acquired metadata is tagged and written into a database, and cross-node metadata aggregation and deduplication fusion are performed. Metadata acquisition methods differ from source to source: the data dictionary metadata of both structured and unstructured data is acquired and, once acquisition is complete, stored in a database. Metadata is data that describes data; apart from the data that business logic reads and writes directly, all other information needed to keep the whole system running can be called metadata, for example the schema, table, and column information of a database, the lineage of tasks, and the permission mappings between users and scripts/tasks.
As a preferred embodiment, the distributed metadata acquisition in step SS1 specifically further includes: for a database as the acquisition object, acquiring, by means of database access, the table name, remarks, field list, primary key, foreign key, table size, row count, number of files, number of partitions, and upstream/downstream dependencies of tables/fields.
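Acquisition by database access can be illustrated with a small example. The sketch below assumes an SQLite database purely for convenience (the application does not name a particular database); it reads the field list, primary key, foreign keys, and row count of one table through SQLite's catalog pragmas. The function name `collect_table_metadata` and the sample tables are hypothetical.

```python
import sqlite3


def collect_table_metadata(conn: sqlite3.Connection, table: str) -> dict:
    """Collect field list, primary key, foreign keys and row count for one table."""
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    columns = conn.execute(f"PRAGMA table_info('{table}')").fetchall()
    # PRAGMA foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
    foreign_keys = conn.execute(f"PRAGMA foreign_key_list('{table}')").fetchall()
    row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    return {
        "table": table,
        "fields": [col[1] for col in columns],
        "primary_key": [col[1] for col in columns if col[5] > 0],
        "foreign_keys": [
            {"field": fk[3], "ref_table": fk[2], "ref_field": fk[4]}
            for fk in foreign_keys
        ],
        "row_count": row_count,
    }


# Sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    amount REAL
);
INSERT INTO customers VALUES (1, 'a');
INSERT INTO orders VALUES (10, 1, 9.5), (11, 1, 3.0);
""")
meta = collect_table_metadata(conn, "orders")
```

The foreign keys collected this way already give one upstream/downstream dependency between tables (here, `orders` depends on `customers`).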
As a preferred embodiment, the distributed metadata acquisition in step SS1 specifically further includes: for an offline computing service as the acquisition object, acquiring, by calling the computing service interface, Hive/RDS table metadata, including trend data on file status, number of files, file size, and data update time; for an online computing service as the acquisition object, acquiring, by accessing the work-order data that the service persists to disk, the basic metadata of computing topics and the metadata of Flume/HBase/Kafka components.
As a preferred embodiment, the distributed metadata acquisition in step SS1 specifically further includes: for a data middle-platform component as the acquisition object, acquiring, by synchronizing the component data to a database and extracting metadata offline, the lineage data of the BI report system, the metrics library, and the OneServer service.
As a preferred embodiment, the distributed metadata acquisition in step SS1 specifically further includes: for a computing task as the acquisition object, acquiring, by parsing the task's input/output dependency configuration or by parsing the table/field lineage in the computing script, the name, owner, readline alarm time, script, and task configuration information of offline/real-time computing tasks.
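Parsing table lineage out of a computing script can be sketched as below. This is a deliberately simplified, regex-based illustration that handles only a single `INSERT ... SELECT` statement with plain table names; a production parser would work on the SQL syntax tree. The function name `table_lineage` and the sample script are hypothetical.

```python
import re

# Matches the table written by INSERT INTO / INSERT OVERWRITE [TABLE].
_TARGET = re.compile(r"\binsert\s+(?:overwrite|into)\s+(?:table\s+)?([\w.]+)", re.IGNORECASE)
# Matches tables read after FROM or JOIN (subqueries are not handled).
_SOURCE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(sql: str) -> list[tuple[str, str]]:
    """Return (source_table, target_table) pairs for one INSERT ... SELECT statement."""
    target = _TARGET.search(sql)
    if target is None:
        return []
    sources = sorted(set(_SOURCE.findall(sql)))
    return [(source, target.group(1)) for source in sources]


script = """
INSERT OVERWRITE TABLE dw.order_summary
SELECT o.customer_id, c.name, SUM(o.amount)
FROM ods.orders o
JOIN ods.customers c ON o.customer_id = c.id
GROUP BY o.customer_id, c.name
"""
edges = table_lineage(script)
```

Each returned pair is one table-level lineage edge from an input table of the task to its output table.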
The application describes the metadata acquisition content and acquisition method for each type of metadata acquisition object, as shown in Table 1.
Table 1. Metadata acquisition content and methods
As a preferred embodiment, writing the metadata into a database and performing cross-node metadata aggregation and deduplication fusion specifically comprises: adopting a centralized metadata aggregation mode, in which a central node serves as the master node and the other data nodes serve as slave nodes; after metadata acquisition is completed, the slave nodes send their metadata to the master node through periodic metadata synchronization tasks; after receiving the metadata, the master node first stores it in a database unchanged as the raw data, and then deduplicates and fuses the metadata by means of the metadata tags, obtaining a metadata set for the global data.
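The master-side handling described above (keep the raw records, then deduplicate and fuse by tag) can be sketched as follows. The tag format is an assumption made for illustration: here a tag is simply the pair (database, table), so records from different nodes that describe the same table are fused into one entry. The names `make_tag` and `aggregate` are hypothetical.

```python
def make_tag(record: dict) -> tuple:
    # Hypothetical metadata tag: identifies the same logical table across nodes.
    return (record["database"], record["table"])


def aggregate(slave_batches: dict) -> tuple:
    """Master-side aggregation: store raw records, then deduplicate and fuse by tag."""
    raw_store = []  # originals kept unchanged, annotated with the sending node
    fused = {}      # tag -> fused metadata entry
    for node, batch in slave_batches.items():
        for record in batch:
            raw_store.append({"node": node, **record})
            entry = fused.setdefault(make_tag(record), {
                "database": record["database"],
                "table": record["table"],
                "fields": [],
                "nodes": [],
            })
            for field in record["fields"]:
                if field not in entry["fields"]:
                    entry["fields"].append(field)
            if node not in entry["nodes"]:
                entry["nodes"].append(node)
    return raw_store, fused


# Metadata as it might arrive from two slave nodes in one synchronization round.
batches = {
    "node_a": [{"database": "dw", "table": "orders", "fields": ["id", "amount"]}],
    "node_b": [
        {"database": "dw", "table": "orders", "fields": ["id", "customer_id"]},
        {"database": "dw", "table": "users", "fields": ["id"]},
    ],
}
raw, global_metadata = aggregate(batches)
```

Storing the raw records before fusing means the fusion rule can later be changed and re-run without asking the slave nodes to resend their metadata.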
In a preferred embodiment, data processing first extracts the relevant data from the data sources according to unified rules, then transforms it according to the determined transformation requirements, and finally loads the relatively standardized data into the data warehouse. Data-processing lineage acquisition collects the modification information of the data during processing, thereby generating the lineage relationship between the new and old data. As shown in FIG. 3, the data-processing lineage acquisition in step SS1 includes: first, data extraction: comprehensively identifying the scattered data generated by each business system and data source, specifying and defining the required data sources, selecting the data sources on which operations can be performed, and defining incremental extraction; next, data transformation: converting the data from the business model to the analysis model through transformation operations, providing the basic tasks of selection, splitting/merging, conversion, and summarization; then, data loading: loading the transformed data into a database by direct loading or through a database connection; and finally, extracting the processing information of the source data during data transformation to obtain the field-level lineage of the source data, and storing the transformed data together with the field-level lineage of the source data in the target database.
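One way to obtain field-level lineage as a by-product of the transformation step is to declare each target field together with the source fields it is computed from. The sketch below assumes such a declarative mapping; the function `run_etl`, the `src.`/`dst.` prefixes, and the sample fields are hypothetical and only illustrate the idea.

```python
def run_etl(rows: list, mapping: dict) -> tuple:
    """Transform rows and derive field-level lineage from the same mapping.

    mapping: target_field -> (list of source fields, transform function)
    """
    transformed = [
        {target: fn(*[row[src] for src in sources])
         for target, (sources, fn) in mapping.items()}
        for row in rows
    ]
    # Every (source field, target field) pair used by a transform is a lineage edge.
    lineage = [
        (f"src.{src}", f"dst.{target}")
        for target, (sources, _) in mapping.items()
        for src in sources
    ]
    return transformed, lineage


source_rows = [{"first": "Ada", "last": "Lovelace", "price": 2.0, "qty": 3}]
mapping = {
    "full_name": (["first", "last"], lambda a, b: f"{a} {b}"),  # merge
    "total": (["price", "qty"], lambda p, q: p * q),            # conversion
}
loaded, field_lineage = run_etl(source_rows, mapping)
```

Because the lineage is read from the same mapping that drives the transformation, the recorded field-level edges cannot drift away from what the transformation actually did.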
As a preferred embodiment, FIG. 4 depicts the hierarchy of lineage relationships for structured data stored in a database. The hierarchy differs slightly between types of data. When data circulates and is fused among different owners, the data connection forms a relationship among those owners, which is itself one kind of data lineage relationship. The data lineage relationship has the hierarchy owner-database-table-field.
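The owner-database-table-field hierarchy means that lineage recorded at field level can be rolled up to coarser levels, up to the owner-to-owner relationship mentioned above. A minimal sketch follows; the `FieldRef` type, the `roll_up` function, and the sample names are hypothetical.

```python
from dataclasses import dataclass

_DEPTH = {"owner": 1, "database": 2, "table": 3, "field": 4}


@dataclass(frozen=True)
class FieldRef:
    owner: str
    database: str
    table: str
    field: str

    def path(self, level: str = "field") -> str:
        """Path of this field truncated to the given hierarchy level."""
        parts = [self.owner, self.database, self.table, self.field]
        return "/".join(parts[:_DEPTH[level]])


def roll_up(edges: list, level: str) -> list:
    """Collapse field-level edges to a coarser level, dropping self-loops."""
    return sorted({
        (src.path(level), dst.path(level))
        for src, dst in edges
        if src.path(level) != dst.path(level)
    })


field_edges = [
    (FieldRef("sales", "ods", "orders", "amount"), FieldRef("finance", "dw", "revenue", "total")),
    (FieldRef("sales", "ods", "orders", "qty"), FieldRef("finance", "dw", "revenue", "total")),
    (FieldRef("finance", "dw", "revenue", "total"), FieldRef("finance", "ads", "report", "revenue")),
]
table_edges = roll_up(field_edges, "table")
owner_edges = roll_up(field_edges, "owner")
```

In this example the three field-level edges collapse into two table-level edges and a single owner-level edge from `sales` to `finance`.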
As a preferred embodiment, as shown in FIG. 5, the parsing of the data lineage relationship in step SS2 includes: obtaining rich information such as the input tables and output tables through a Hive hook plug-in and sending it asynchronously to Kafka; after parsing, the data is written into a graph database, which provides metadata system display and a REST API service, and is also persisted to a Hive relationship table for user query and visual display.
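The asynchronous hand-off from the hook to the parser can be illustrated with standard-library stand-ins. In the sketch below a `queue.Queue` plays the role of the Kafka topic and a dictionary plays the role of the graph database; no real Hive, Kafka, or graph database API is used, and the callback name `on_query_finished` is hypothetical.

```python
import json
import queue
import threading

events = queue.Queue()  # stand-in for a Kafka topic
lineage_graph = {}      # stand-in for a graph database: table -> downstream tables


def on_query_finished(inputs: list, outputs: list) -> None:
    """Hypothetical hook callback: publish the tables a query read and wrote.

    Publishing to the queue returns immediately, so the query is not held up
    by lineage parsing.
    """
    events.put(json.dumps({"inputs": inputs, "outputs": outputs}))


def consume() -> None:
    """Parser side: turn each event into edges and write them to the graph."""
    while True:
        message = events.get()
        if message is None:  # sentinel used here to stop the worker
            break
        event = json.loads(message)
        for src in event["inputs"]:
            for dst in event["outputs"]:
                lineage_graph.setdefault(src, set()).add(dst)


worker = threading.Thread(target=consume)
worker.start()
on_query_finished(["ods.orders", "ods.customers"], ["dw.order_summary"])
on_query_finished(["dw.order_summary"], ["ads.daily_report"])
events.put(None)
worker.join()
```

Decoupling the hook from the parser through a queue is what makes the sending asynchronous: a slow or failed parser delays lineage updates but does not slow down the queries being observed.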
FIG. 6 depicts the content presented by the lineage visualization. The data lineage visualization displays patterns and flow-direction distributions at different positions on the graph, thereby serving several purposes:
1) Data tracing: when data is abnormal, the lineage helps track down the cause of the abnormality, and also helps trace the source of the data and the processing it has undergone;
2) Data value evaluation: the value of data needs to be evaluated, and the data lineage provides a basis for doing so in terms of data audience, data update volume, data update frequency, and the like;
3) Data quality assessment: the list of data cleaning standards can be conveniently seen on the lineage graph, reflecting the requirements on data quality.
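Data tracing and impact analysis both reduce to a reachability query on the lineage graph: walking edges backwards finds every upstream source of a dataset, and walking them forwards finds everything affected by it. A minimal breadth-first sketch follows; the function `reachable` and the sample edges are hypothetical, and a list of edges stands in for the graph database.

```python
from collections import deque


def reachable(edges: list, start: str, direction: str = "upstream") -> set:
    """Return all entities reachable from start, following edges in one direction.

    edges: (source, target) pairs; "upstream" walks target -> source,
    "downstream" walks source -> target.
    """
    adjacency = {}
    for src, dst in edges:
        here, there = (dst, src) if direction == "upstream" else (src, dst)
        adjacency.setdefault(here, []).append(there)
    seen = set()
    todo = deque([start])
    while todo:
        node = todo.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                todo.append(neighbour)
    return seen


edges = [
    ("ods.orders", "dw.order_summary"),
    ("ods.customers", "dw.order_summary"),
    ("dw.order_summary", "ads.daily_report"),
]
sources_of_report = reachable(edges, "ads.daily_report", "upstream")
affected_by_orders = reachable(edges, "ods.orders", "downstream")
```

If `ads.daily_report` shows an abnormal value, the upstream set lists the tables to inspect; if `ods.orders` is about to change, the downstream set lists what the change may affect.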
The application supports the scientific and rational consolidation and integration of scattered, non-standardized, low-availability data, forming unified, high-quality, and highly credible data assets through the construction of a data warehouse. Through data lineage collection and analysis, data with complex lineage relationships can be obtained efficiently and presented to users intuitively on a lineage graph, considerably improving the efficiency and experience of user analysis and decision-making.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks. These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical aspects of the present application and not for limiting the same, and although the present application has been described in detail with reference to the above embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the application without departing from the spirit and scope of the application, which is intended to be covered by the claims.

Claims (10)

1. A distributed data lineage construction and display method, characterized by comprising the following steps:
step SS1: a data lineage construction step, comprising: generating lineage relationship data through distributed metadata acquisition, data-processing lineage acquisition, data access middleware object parsing, and data storage access monitoring;
step SS2: parsing the lineage relationship data and integrating the lineage relationships by a data parsing and processing module to generate the data lineage relationships;
step SS3: storing the data lineage relationships in a graph database for user query and visual display.
2. The method for constructing and displaying distributed data blood edges according to claim 1, wherein the distributed metadata collection in step SS1 comprises: each node acquires metadata in a data life cycle as an acquisition object to create a data source, wherein the acquisition object comprises a database, an offline computing service, an online computing service, a data center component and a computing task; and configuring an acquisition task for the data source, then executing metadata acquisition, marking the acquired metadata, writing the metadata into a database, and performing cross-node metadata aggregation and deduplication fusion.
3. The method for constructing and displaying distributed data blood edges according to claim 2, wherein the distributed metadata collection in step SS1 specifically further comprises: aiming at the database as an acquisition object, acquiring the table name, remarks, a field list, a main key, an external key, a table size, a line number, the number of files, the number of partitions and the upstream and downstream dependency relationship of the table/field by an acquisition method of database access.
4. The distributed data lineage construction and display method according to claim 2, wherein the distributed metadata collection in step SS1 further comprises: for an offline computing service as the collection object, collecting Hive/RDS table metadata by a collection method that calls the computing service interface, wherein the table metadata comprises trend data of file status, number of files, file size, and data update time; for an online computing service as the collection object, collecting the basic metadata information of a computing topic by accessing the worksheet data that the service persists to disk, and collecting the metadata of the Flume/HBase/Kafka components.
5. The distributed data lineage construction and display method according to claim 2, wherein the distributed metadata collection in step SS1 further comprises: for a data center component as the collection object, collecting the lineage data of the BI report system, the index base, and the OneServer service by synchronizing the component data to a database and extracting the metadata offline.
6. The distributed data lineage construction and display method according to claim 2, wherein the distributed metadata collection in step SS1 further comprises: for a computing task as the collection object, collecting the name, responsible person, readline alarm time, script, and task configuration information of the offline/real-time computing task by a collection method that parses the task input/output dependency configuration or parses the table/field lineage relationships in the computing script.
7. The distributed data lineage construction and display method according to claim 2, wherein writing the metadata into a database and performing cross-node metadata aggregation and deduplication fusion specifically comprises: adopting a centralized metadata aggregation mode, in which a central node serves as the master node and the other data nodes serve as slave nodes; after metadata collection is completed, the slave nodes aggregate their metadata to the master node through periodic metadata synchronization tasks; after receiving the metadata, the master node first stores it in a database as raw data, and then deduplicates and fuses the metadata by means of the metadata tags, thereby obtaining a global metadata set.
8. The distributed data lineage construction and display method according to claim 2, wherein the data-processing lineage acquisition in step SS1 comprises: first, data extraction: comprehensively identifying the scattered data generated by each business system and each data source, specifying and defining the required data sources, selecting the data sources on which operations can be performed, and determining the definition of incremental extraction; then, data transformation: converting the various data from the business model into the analysis model through transformation measures, and providing the basic tasks of selection, splitting/merging, conversion, and aggregation; then, data loading: loading the transformed data into a database by direct loading or through a database connection; and finally, extracting the processing information of the source data during the data transformation to obtain the field-level lineage relationship of the source data, and storing the transformed data and the field-level lineage relationship of the source data in a database.
9. The distributed data lineage construction and display method according to claim 1, wherein the data lineage relationship is structured as owner-database-table-field.
10. The distributed data lineage construction and display method according to claim 1, wherein the parsing of the lineage relationship data in step SS2 comprises: obtaining rich information about the input tables and output tables through a HiveHook plug-in, asynchronously sending the information to Kafka, and, after parsing, writing the data into a graph database to provide metadata system display and a REST API service, and persisting the data to a Hive relational table for user query and visual display.
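For illustration only, the tag-based deduplication and fusion that claim 7 assigns to the master node might be sketched as follows. The claims do not specify the tag format, the record layout, or the conflict rule, so the `MetadataRecord` class, the `fuse_metadata` function, the "owner.database.table" tag string, and the "first value wins" merge policy are all assumptions of this sketch rather than the method of the application.

```python
from dataclasses import dataclass


@dataclass
class MetadataRecord:
    """One metadata record collected on a node (hypothetical layout)."""
    tag: str          # assumed tag format: "owner.database.table"
    node: str         # name of the slave node that collected the record
    attributes: dict  # collected attributes, e.g. row count, primary key


def fuse_metadata(records):
    """Deduplicate records by tag and merge their attributes.

    Records sharing a tag are treated as describing the same object.
    Assumed conflict rule: the first value seen for an attribute is kept,
    and later records only fill in attributes that are still missing.
    """
    fused = {}
    for rec in records:
        entry = fused.setdefault(rec.tag, {"nodes": [], "attributes": {}})
        if rec.node not in entry["nodes"]:
            entry["nodes"].append(rec.node)
        for key, value in rec.attributes.items():
            entry["attributes"].setdefault(key, value)
    return fused


# Records as they might arrive at the master node from two slave nodes.
records = [
    MetadataRecord("alice.sales.orders", "node-1", {"row_count": 100}),
    MetadataRecord("alice.sales.orders", "node-2",
                   {"row_count": 100, "primary_key": "id"}),
    MetadataRecord("alice.sales.users", "node-2", {"row_count": 5}),
]
global_metadata = fuse_metadata(records)
```

In this sketch the raw records are left untouched, mirroring the claim's requirement that the master node store the original data before fusing; a real system would also need an explicit policy for conflicting attribute values reported by different nodes.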
CN202310238130.8A 2023-03-13 2023-03-13 Distributed data blood margin construction and display method Pending CN116662441A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310238130.8A CN116662441A (en) 2023-03-13 2023-03-13 Distributed data blood margin construction and display method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310238130.8A CN116662441A (en) 2023-03-13 2023-03-13 Distributed data blood margin construction and display method

Publications (1)

Publication Number Publication Date
CN116662441A true CN116662441A (en) 2023-08-29

Family

ID=87717808

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310238130.8A Pending CN116662441A (en) 2023-03-13 2023-03-13 Distributed data blood margin construction and display method

Country Status (1)

Country Link
CN (1) CN116662441A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117273131A (en) * 2023-11-22 2023-12-22 四川三合力通科技发展集团有限公司 Cross-node data relationship discovery system and method
CN117312331A (en) * 2023-12-01 2023-12-29 浪潮云信息技术股份公司 Metadata blood-edge analysis method, device, equipment and storage medium
CN117555950A (en) * 2024-01-12 2024-02-13 山东再起数据科技有限公司 Data blood relationship construction method based on data center

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117273131A (en) * 2023-11-22 2023-12-22 四川三合力通科技发展集团有限公司 Cross-node data relationship discovery system and method
CN117273131B (en) * 2023-11-22 2024-02-13 四川三合力通科技发展集团有限公司 Cross-node data relationship discovery system and method
CN117312331A (en) * 2023-12-01 2023-12-29 浪潮云信息技术股份公司 Metadata blood-edge analysis method, device, equipment and storage medium
CN117312331B (en) * 2023-12-01 2024-03-29 浪潮云信息技术股份公司 Metadata blood-edge analysis method, device, equipment and storage medium
CN117555950A (en) * 2024-01-12 2024-02-13 山东再起数据科技有限公司 Data blood relationship construction method based on data center
CN117555950B (en) * 2024-01-12 2024-04-02 山东再起数据科技有限公司 Data blood relationship construction method based on data center

Similar Documents

Publication Publication Date Title
US20230122210A1 (en) Resource dependency system and graphical user interface
US9792327B2 (en) Self-described query execution in a massively parallel SQL execution engine
US11663033B2 (en) Design-time information based on run-time artifacts in a distributed computing cluster
US11042523B2 (en) Data curation system with version control for workflow states and provenance
CN116662441A (en) Distributed data blood margin construction and display method
Fu et al. Real-time data infrastructure at uber
CN111639082B (en) Object storage management method and system of billion-level node scale knowledge graph based on Ceph
Anderson Embrace the challenges: Software engineering in a big data world
US11615076B2 (en) Monolith database to distributed database transformation
CN111125068A (en) Metadata management method and system
Ahmed et al. A literature review on NoSQL database for big data processing
US11429572B2 (en) Rules-based dataset cleaning
CN114036130A (en) Metadata analysis processing method and device
CN112148718A (en) Big data support management system for city-level data middling station
CN112148578A (en) IT fault defect prediction method based on machine learning
CN115858513A (en) Data governance method, data governance device, computer equipment and storage medium
Mostajabi et al. A systematic review of data models for the big data problem
Almassabi et al. Top NewSQL databases and features classification
CN115640300A (en) Big data management method, system, electronic equipment and storage medium
US12039416B2 (en) Facilitating machine learning using remote data
Faridoon et al. Big Data Storage Tools Using NoSQL Databases and Their Applications in Various Domains: A Systematic Review.
CN111126961A (en) Complex product full life cycle digital mainline service system
CN113779313B (en) Knowledge management method and system based on graph database
Prakash et al. A Comprehensive Study on Structural and Non-Structural Databases and Its Impact on Hybrid Databases
Jung Design and Development of Big Data Platform based on IoT-based Children's Play Pattern Analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination