CN113868253B - Data relationship capturing and big data relationship tree construction method

Info

Publication number: CN113868253B
Application number: CN202111142241.6A
Authority: CN
Inventors: 方佩; 李国民; 全威; 蔡希; 杨凯; 曾学俊
Original assignee: China Comservice Enrising Information Technology Co Ltd
Current assignee: China Comservice Enrising Information Technology Co Ltd
Priority date: 2021-09-28
Filing date: 2021-09-28
Publication date: 2024-04-23
Anticipated expiration: 2041-09-28
Also published as: CN113868253A

Abstract

The invention discloses a method for capturing data relationships and constructing a big data relationship tree, relating to the field of big data processing. The key technical points are as follows: relational and non-relational data are integrated to obtain the metadata of the data lineage (literally, the data "blood edge"), the metadata is defined, and entities are created from the defined metadata; an SQL statement is executed to trigger a change component that changes a metadata entity, a capture hook automatically captures the change information to generate metadata change details, and the change details are parsed to generate the data lineage graph of a single system; the data lineage graph is stored in a graph database and encrypted with a private key, the encrypted data is synchronized to a data lineage exchange space through a message queue, and a cross-system big data lineage tree is then constructed. The invention covers the path from reporting the synchronized lineage of a single application system to the final cross-system lineage synchronization and distribution, forming a complete closed loop for the lineage tree: data collection, cross-application lineage computation, lineage tree construction, and distribution.

Description

Data relationship capturing and big data relationship tree construction method

Technical Field

The invention relates to the field of big data processing, in particular to a method for capturing data relationship and constructing a big data relationship tree.

Background

Data lineage (literally, the data "blood edge") reveals the life cycle of data: it is intended to show the complete path of the data from creation to retirement. Data lineage records how data is generated, processed and circulated, and how it finally ceases to exist. It includes every transformation the data undergoes along the way: how it was transformed, what changed, and why.

In the current big data environment, the lineage relationships among much of the data are unclear. Even where lineage exists, the lineage relationships among the data are not captured automatically, and data lineage relationships between systems are not established.

Therefore, how to design a cross-system data lineage tree is an urgent problem to be solved.

Disclosure of Invention

The technical problems to be solved by the invention are that lineage relationships between data are not captured automatically and that data lineage relationships between systems are not established. The aim of the invention is to provide a method for capturing data relationships and constructing a big data relationship tree.

The technical aim of the invention is realized by the following technical scheme:

A data relation capturing and big data relation tree construction method comprises the following steps:

integrating relational data and non-relational data to obtain the metadata of the data lineage, defining the metadata, and creating entities based on the defined metadata;

executing an SQL statement to trigger a change component that changes a metadata entity, automatically capturing the change information with a capture hook to generate metadata change details, and parsing the metadata change details to generate the data lineage graph of a single system;

and storing the data lineage graph in a graph database, encrypting it with a private key, synchronizing the encrypted data to a data lineage exchange space through a message queue, and then constructing a cross-system big data lineage tree.

In this method, executing an SQL statement triggers a change component that changes metadata entities; a capture hook automatically captures the change information to generate metadata change details; the data lineage graph of a single system is generated from the change details and stored in a graph database; the graph is encrypted with a private key; the encrypted data is synchronized through a message queue to the data lineage exchange space, where it is exchanged and decrypted; and finally the cross-system big data lineage tree is constructed.

Further, the metadata definition includes assigning aliases, classifications and labels to the metadata, wherein the type of the metadata is generated from the alias; the labels and classifications associate metadata with other metadata or with data assets; the metadata is managed according to its classification; the business scope of the metadata is expressed by the classification hierarchy; and data lineage dependencies are propagated through the labels and classifications.

Further, metadata is modeled by type and represented as entities, the types being uniquely identified by a "name", each type having a meta-type, the entities being specific values or specific columns of types, the entities being identified by unique identifiers.

Further, the metadata entity modification includes performing a create/modify/delete operation on the metadata to modify the metadata entity.

Further, create/update/delete operations on the metadata are captured automatically by capture hooks of different types to generate metadata change details consisting of an output column and a set of input columns or input tables; the output column is associated with the set of input columns or input tables to generate a data lineage dependency graph; and the information content of the metadata change details is pushed to a message queue so that the metadata is updated. The information content comprises entity creation information, entity update information, entity deletion information, field creation information, field update information and field deletion information.

Further, the dependency types of the data lineage graph include simple dependencies, expressions and scripts. In a simple dependency, the output column has the same value as the input column. In an expression dependency, the output column is computed at runtime by an expression over the input columns. In a script dependency, the output column is computed by a script provided by the user.

Further, the data lineage graph is persisted through a graph engine, and an index is generated and stored in a search engine; the search engine performs deep mining on the data lineage relationships to discover potential links between data.

Further, the specific steps of constructing the cross-system enterprise-level data lineage tree are as follows:

each application system applies to the lineage exchange space for a public/private key pair; the private key is held by the application system, and the public key is retained by the lineage exchange space and used for data decryption;

each application system encrypts its data lineage graph with the private key and synchronizes the encrypted data to the lineage exchange space through the message queue in real time;

the lineage exchange space decrypts the lineage graph data of each single system with the public key of the corresponding system, then performs real-time computation over the current, latest lineage graph data of every system, connects and completes the data lineage relationships between the systems, and thereby draws the big data lineage tree.
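By way of non-limiting illustration, the key handling and synchronization steps above might be sketched as follows. The sketch rests on several assumptions that are not part of the disclosure: textbook RSA with deliberately tiny, insecure parameters stands in for the key pair, an in-process `queue.Queue` stands in for the message queue, and the names `encrypt_with_private_key`, `decrypt_with_public_key` and `system_a` are hypothetical.

```python
import json
import queue
from typing import List

# Toy textbook-RSA parameters (p=61, q=53). Far too small to be secure;
# they only make the private-key/public-key round trip visible.
N = 3233          # modulus shared by both keys
PUBLIC_E = 17     # public exponent, retained by the exchange space
PRIVATE_D = 2753  # private exponent, held by the application system


def encrypt_with_private_key(data: bytes, d: int = PRIVATE_D, n: int = N) -> List[int]:
    """Transform each byte with the private exponent."""
    return [pow(b, d, n) for b in data]


def decrypt_with_public_key(blocks: List[int], e: int = PUBLIC_E, n: int = N) -> bytes:
    """Recover each byte with the matching public exponent."""
    return bytes(pow(c, e, n) for c in blocks)


# Hypothetical single-system lineage graph: output column -> input columns.
lineage = {"T2.id": ["T1.id"], "T2.name": ["T1.name"]}

# An in-process queue stands in for the message queue.
message_queue = queue.Queue()

# Application-system side: serialize, encrypt and publish the graph.
payload = json.dumps(lineage, sort_keys=True).encode("utf-8")
message_queue.put({"system": "system_a", "blocks": encrypt_with_private_key(payload)})

# Exchange-space side: consume and decrypt with the sender's public key.
message = message_queue.get()
received = json.loads(decrypt_with_public_key(message["blocks"]).decode("utf-8"))
```

Because the public exponent reverses the private-key transformation, any holder of the public key can read the payload; in the scheme as described, that holder is the lineage exchange space. A practical system would rely on a vetted cryptographic library rather than this arithmetic.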

Further, after updating the cross-application data lineage tree, the lineage exchange space encrypts it with its own private key and synchronizes it to every application system in the ecosystem. Each system that holds the public key of the lineage exchange space decrypts it to obtain the big data lineage tree of the whole ecosystem, so that every application system in the ecosystem holds the complete cross-system big data lineage tree.

Compared with the prior art, the invention has the following beneficial effects:

1. The invention captures data lineage automatically: data lineage relationships are captured automatically during execution, and deep-mining analysis identifies missing values, outliers and other data anomalies, enabling automatic data quality analysis.

2. The invention constructs cross-system lineage: the lineage of each single application is reported and synchronized, and the cross-system lineage is finally synchronized and distributed, forming a complete closed loop for the enterprise lineage tree that covers data collection, cross-application lineage computation, lineage tree construction and distribution.

Drawings

The accompanying drawings, which are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings:

FIG. 1 is a flow chart of data lineage construction according to an embodiment of the present invention;

FIG. 2 is a flow chart of the construction of an in-system lineage graph according to an embodiment of the present invention;

FIG. 3 is a flow chart of cross-system lineage tree construction according to an embodiment of the present invention;

FIG. 4 is a flow chart of automatic capture by a capture hook according to an embodiment of the present invention;

FIG. 5 is a flow chart of deep mining of data lineage according to an embodiment of the present invention.

Detailed Description

For the purpose of making apparent the objects, technical solutions and advantages of the present invention, the present invention will be further described in detail with reference to the following examples and the accompanying drawings, wherein the exemplary embodiments of the present invention and the descriptions thereof are for illustrating the present invention only and are not to be construed as limiting the present invention.

It will be understood that when an element is referred to as being "mounted" or "disposed" on another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly or indirectly connected to the other element.

It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are merely for convenience in describing and simplifying the description based on the orientation or positional relationship shown in the drawings, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus are not to be construed as limiting the invention.

Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.

Examples

In the current big data environment, the lineage of much data is unclear. Even where lineage exists, the main problems to be solved are to capture the lineage relationships of the data automatically, to sort out the lineage within a system, and then, in combination with multiple external systems, to construct a data lineage tree spanning multiple systems.

As shown in FIG. 1, the present embodiment provides a method for capturing data relationships and constructing a big data relationship tree, which includes the following steps:

S1, integrating relational data and non-relational data to obtain the metadata of the data lineage, defining the metadata, and creating entities based on the defined metadata;

S2, executing an SQL statement to trigger the change component that changes a metadata entity, automatically capturing the change information with the capture hook to generate metadata change details, and parsing the metadata change details to generate the data lineage graph of a single system;

S3, storing the data lineage graph in a graph database, encrypting it with a private key, synchronizing the encrypted data to the data lineage exchange space through a message queue, and then constructing a cross-system big data lineage tree.

Specifically, the method comprises three parts: step S1 is metadata management; step S2 is intelligent capture and update of metadata, which generates the data lineage graph; and step S3 is synchronization of each single system's lineage graph to the data lineage exchange space, from which the cross-system big data lineage tree is constructed. Metadata management includes metadata integration and the assignment of classifications, labels, aliases and the like to the metadata. The integrated data includes data from MySQL, Oracle, Hive, HBase and other sources; existing data is brought into the system through data integration.

Preferably, the metadata definition includes assigning aliases, classifications and labels to the metadata, wherein the type of the metadata is generated from the alias; the labels and classifications associate metadata with other metadata or with data assets; the metadata is managed according to its classification; the business scope of the metadata is expressed by the classification hierarchy; and data lineage dependencies are propagated through the labels and classifications.

In particular, the user is allowed to define business labels and business classifications for the metadata. Through the metadata, labels and classifications are associated with assets such as databases, tables and columns, and the alias identifies the type of the metadata.

Preferably, the metadata is modeled according to types and represented as entities, the types being uniquely identified by a "name", one meta-type for each type, the entities being specific values or specific columns of types, the entities being identified by unique identifiers.

In particular, the type system defines a model for the managed metadata objects. All metadata is modeled using types and represented as entities. Type: each type is uniquely identified by a "name" and has a meta-type, the meta-types comprising primitive meta-types, enumerated meta-types, collection (aggregate) meta-types, and composite meta-types.

In addition, the entity and classification types may be extended from other types. Entity: an entity is a particular value or a particular column of a type, such as a table is an entity. The entity is identified by a unique identifier (GUID). This unique identifier is generated by the server when defining the object and remains unchanged throughout the life cycle of the entity. At any time, this particular entity may be accessed using its GUID. Metadata definition is mainly to abstract metadata, so that various metadata sources of different types are convenient to manage uniformly. The definition of the identifier guarantees the uniqueness of the metadata.
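As a non-limiting sketch of the type system just described, the following example models types identified by name, each with a meta-type, and entities identified by a server-generated GUID. The class names `MetaType`, `TypeDef`, `Entity` and `MetadataRegistry` are hypothetical, and the use of `uuid4` as the GUID generator is an assumption.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict


class MetaType(Enum):
    PRIMITIVE = "primitive"    # basic scalar kinds
    ENUMERATED = "enumerated"  # fixed sets of named values
    COLLECTION = "collection"  # aggregates such as arrays or maps
    COMPOSITE = "composite"    # structured kinds such as entities


@dataclass(frozen=True)
class TypeDef:
    name: str           # the unique identifier of the type
    meta_type: MetaType


@dataclass
class Entity:
    type_name: str
    attributes: Dict[str, Any]
    # Generated once at creation and never reassigned afterwards.
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))


class MetadataRegistry:
    """Keeps types unique by name and entities addressable by GUID."""

    def __init__(self) -> None:
        self._types: Dict[str, TypeDef] = {}
        self._entities: Dict[str, Entity] = {}

    def define_type(self, name: str, meta_type: MetaType) -> TypeDef:
        if name in self._types:
            raise ValueError(f"type {name!r} is already defined")
        self._types[name] = TypeDef(name, meta_type)
        return self._types[name]

    def create_entity(self, type_name: str, **attributes: Any) -> Entity:
        if type_name not in self._types:
            raise KeyError(f"unknown type {type_name!r}")
        entity = Entity(type_name, dict(attributes))
        self._entities[entity.guid] = entity
        return entity

    def get_entity(self, guid: str) -> Entity:
        return self._entities[guid]


registry = MetadataRegistry()
registry.define_type("table", MetaType.COMPOSITE)
t1 = registry.create_entity("table", name="T1", columns=["id", "name"])
original_guid = t1.guid
t1.attributes["comment"] = "source table"  # attributes change, the GUID does not
```

Looking the entity up by `original_guid` at any later point returns the same object, which mirrors the statement that an entity remains accessible through its GUID throughout its life cycle.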

Preferably, the metadata entity modification includes modifying the metadata entity by a create/modify/delete operation on the metadata.

Preferably, create/update/delete operations on the metadata are captured automatically by capture hooks of different types to generate metadata change details consisting of an output column and a set of input columns or input tables; the output column is associated with the set of input columns or input tables to generate a data lineage dependency graph; and the information content of the metadata change details is pushed to a message queue so that the metadata is updated. The information content comprises entity creation information, entity update information, entity deletion information, field creation information, field update information and field deletion information.

Specifically, the capture hook can capture the following data operations: creating a database, creating a table or view, creating a table from a query (create table as select), loading data, importing or exporting data, DML statements (insert), altering a database, altering a table, altering a view, and so on.
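The six kinds of change information and their delivery through a message queue might look like the following minimal sketch. The message layout, the `ChangeKind` and `ChangeDetail` names, and the in-memory `store` are all assumptions made for illustration; an in-process `queue.Queue` stands in for the message queue.

```python
import json
import queue
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict


class ChangeKind(Enum):
    ENTITY_CREATE = "entity_create"
    ENTITY_UPDATE = "entity_update"
    ENTITY_DELETE = "entity_delete"
    FIELD_CREATE = "field_create"
    FIELD_UPDATE = "field_update"
    FIELD_DELETE = "field_delete"


@dataclass
class ChangeDetail:
    kind: ChangeKind
    target: str             # "T1" for an entity, "T1.id" for a field
    detail: Dict[str, Any]


def publish(q: queue.Queue, change: ChangeDetail) -> None:
    """Hook side: serialize the change detail and push it to the queue."""
    q.put(json.dumps({"kind": change.kind.value,
                      "target": change.target,
                      "detail": change.detail}))


def apply_change(store: Dict[str, Dict[str, Any]], message: str) -> None:
    """Consumer side: update the metadata store from one queued message."""
    change = json.loads(message)
    kind = ChangeKind(change["kind"])
    target, detail = change["target"], change["detail"]
    if kind is ChangeKind.ENTITY_CREATE:
        store[target] = {"comment": detail.get("comment", ""), "fields": {}}
    elif kind is ChangeKind.ENTITY_UPDATE:
        store[target]["comment"] = detail["comment"]
    elif kind is ChangeKind.ENTITY_DELETE:
        store.pop(target, None)
    else:
        entity, field_name = target.split(".", 1)
        if kind is ChangeKind.FIELD_DELETE:
            store[entity]["fields"].pop(field_name, None)
        else:  # FIELD_CREATE and FIELD_UPDATE both set the field type
            store[entity]["fields"][field_name] = detail["type"]


q = queue.Queue()
store: Dict[str, Dict[str, Any]] = {}
publish(q, ChangeDetail(ChangeKind.ENTITY_CREATE, "T1", {"comment": "source"}))
publish(q, ChangeDetail(ChangeKind.FIELD_CREATE, "T1.id", {"type": "int"}))
publish(q, ChangeDetail(ChangeKind.FIELD_CREATE, "T1.name", {"type": "string"}))
publish(q, ChangeDetail(ChangeKind.FIELD_UPDATE, "T1.id", {"type": "bigint"}))
publish(q, ChangeDetail(ChangeKind.FIELD_DELETE, "T1.name", {}))
while not q.empty():
    apply_change(store, q.get())
```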

Preferably, the dependency types of the data lineage graph include simple dependencies, expressions and scripts. In a simple dependency, the output column has the same value as the input column. In an expression dependency, the output column is computed at runtime by an expression over the input columns. In a script dependency, the output column is computed by a script provided by the user.
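The three dependency types can be illustrated with a short sketch. It assumes that a lineage edge carries an optional callable for the expression and script cases; the `ColumnLineage` class, its fields and the sample columns are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, Dict, Optional, Tuple


class DependencyType(Enum):
    SIMPLE = "simple"          # output value equals the input value
    EXPRESSION = "expression"  # output computed by an expression at runtime
    SCRIPT = "script"          # output computed by a user-provided script


@dataclass
class ColumnLineage:
    output: str
    inputs: Tuple[str, ...]
    dependency: DependencyType
    description: str = ""                          # human-readable expression text
    transform: Optional[Callable[..., Any]] = None

    def evaluate(self, row: Dict[str, Any]) -> Any:
        """Compute the output column for one row of input values."""
        values = [row[name] for name in self.inputs]
        if self.dependency is DependencyType.SIMPLE:
            return values[0]
        if self.transform is None:
            raise ValueError("expression and script edges need a transform")
        return self.transform(*values)


def mask_name(name: str) -> str:
    """Stand-in for a user-provided script."""
    return name[0] + "*" * (len(name) - 1)


edges = [
    ColumnLineage("T2.id", ("T1.id",), DependencyType.SIMPLE),
    ColumnLineage("T2.label", ("T1.id", "T1.name"), DependencyType.EXPRESSION,
                  description="concat(id, '-', name)",
                  transform=lambda i, n: f"{i}-{n}"),
    ColumnLineage("T2.masked", ("T1.name",), DependencyType.SCRIPT,
                  transform=mask_name),
]

row = {"T1.id": 7, "T1.name": "alice"}
results = {edge.output: edge.evaluate(row) for edge in edges}
```

Whatever the dependency type, each edge still records the same association between one output column and its set of input columns, which is what the lineage graph is built from.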

Preferably, the data lineage graph is persisted through a graph engine, and an index is generated and stored in a search engine; the search engine performs deep mining on the data lineage relationships to discover potential links between data.

Specifically, FIG. 5 is a flow chart of deep mining of data lineage. Deep-mining analysis identifies missing values, outliers and other data anomalies, thereby enabling automatic data quality analysis. It also reveals how data evolves over its life cycle and where it comes from, and anticipates which assets will be affected by future changes. In addition, deep-mining analysis ensures that every table or column derived from a sensitive column automatically inherits the same classification and security controls.
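One way to picture the inheritance of classifications along lineage is a breadth-first traversal of the lineage graph, as in the following sketch. The adjacency-map representation, the function names and the `PII` tag are assumptions chosen for illustration.

```python
from collections import deque
from typing import Dict, Iterable, Set


def downstream(edges: Dict[str, Iterable[str]], start: str) -> Set[str]:
    """Return every column derived, directly or indirectly, from `start`."""
    seen: Set[str] = set()
    pending = deque([start])
    while pending:
        node = pending.popleft()
        for child in edges.get(node, ()):
            if child not in seen:
                seen.add(child)
                pending.append(child)
    return seen


def propagate(edges: Dict[str, Iterable[str]],
              classifications: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Copy each column's classifications onto all of its descendants."""
    result = {column: set(tags) for column, tags in classifications.items()}
    for column, tags in classifications.items():
        for derived in downstream(edges, column):
            result.setdefault(derived, set()).update(tags)
    return result


# Hypothetical lineage: input column -> columns derived from it.
edges = {
    "T1.name": ["T2.name"],
    "T2.name": ["T3.full_name"],
    "T1.id": ["T2.id"],
}
tags = propagate(edges, {"T1.name": {"PII"}})
```

The same `downstream` traversal also answers the impact question: the set it returns for a column is the set of assets that a change to that column would affect.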

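Preferably, the specific steps of building the cross-system enterprise-level data lineage tree are as follows: each application system applies to the lineage exchange space for a public/private key pair, the private key being held by the application system and the public key being retained by the lineage exchange space for data decryption; each application system encrypts its data lineage graph with the private key and synchronizes the encrypted data to the lineage exchange space through the message queue in real time; and the lineage exchange space decrypts the lineage graph data of each single system with the public key of the corresponding system, performs real-time computation over the current, latest lineage graph data of every system, connects and completes the data lineage relationships between the systems, and thereby draws the big data lineage tree.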

Preferably, after updating the cross-application data lineage tree, the lineage exchange space encrypts it with its own private key and synchronizes it through the message queue to every application system in the ecosystem. Each system that holds the public key of the lineage exchange space decrypts it to obtain the big data lineage tree of the whole ecosystem, so that every application system in the ecosystem holds the complete cross-system big data lineage tree.

To make the objects, technical solutions and advantages of the invention clearer, the invention is described in further detail below with reference to the drawings. The method comprises three parts: the first is metadata management; the second is intelligent capture and update of metadata; the third is synchronization of each single system's lineage graph to the data lineage exchange space, from which the cross-system big data lineage tree is constructed. The overall flow is shown in FIGS. 2, 3 and 4.

Step 1: metadata integration. Existing metadata (such as Hive metadata) can first be assembled manually or imported directly. As shown in FIG. 4, assume two metadata fields, id and name, from which a model T1 is created.

As shown in FIG. 2, metadata management proceeds as follows: metadata objects are imported into the system; the imported metadata objects are given classifications, labels, aliases and so on; entities are created for the defined metadata; and the entities are subsequently changed.

Step 2: SQL is executed to trigger metadata changes, and the data operation and lineage construction are carried out. Suppose that a data processing script such as "CREATE TABLE T2 AS SELECT id, name FROM T1" is executed. The automatic capture hook then starts capturing lineage. The metadata change details captured by the hook include creating or changing a database, creating or changing a table or view, and loading data. The change details are parsed promptly to generate the data lineage graph, which is stored in the database. A search engine is then built on the database and a mining component is added to it; the mining component deep-mines the data lineage graph to uncover further relationships among data that appear unrelated. At the same time, the metadata change information is sent through the message queue so that the metadata is updated; the information content comprises entity creation information, entity update information, entity deletion information, field creation information, field update information and field deletion information.
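As a rough illustration of what a capture hook might derive from such a statement, the following sketch extracts the output table, the input table and the column-level lineage from a simple create-table-as-select statement. It is an assumption-laden toy: a single regular expression replaces a real SQL parser, only plain column names and a single source table are handled, and `capture_ctas` is a hypothetical name.

```python
import re
from typing import Dict, List, Optional

# Matches only the simplest form: CREATE TABLE <t> AS SELECT <cols> FROM <s>
CTAS_PATTERN = re.compile(
    r"^\s*create\s+table\s+(\w+)\s+as\s+select\s+(.+?)\s+from\s+(\w+)\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)


def capture_ctas(sql: str) -> Optional[Dict[str, object]]:
    """Return metadata change details for a simple CTAS statement, else None."""
    match = CTAS_PATTERN.match(sql)
    if match is None:
        return None
    target, column_list, source = match.groups()
    columns: List[str] = [c.strip() for c in column_list.split(",")]
    return {
        "operation": "create_table_as_select",
        "output_table": target,
        "input_tables": [source],
        # Each output column depends on the same-named input column.
        "column_lineage": {f"{target}.{c}": [f"{source}.{c}"] for c in columns},
    }


details = capture_ctas("CREATE TABLE T2 AS SELECT id, name FROM T1")
```

Here each output column of T2 is associated with one input column of T1, which is the output-column-to-input-columns association from which the lineage graph is assembled. Statements with expressions, aliases or joins would need a proper SQL parser.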

Step 3: the data lineage constructed in steps 1 and 2 is stored. As shown in FIG. 3, the lineage graph is encrypted with the private key, and the encrypted data is synchronized to the lineage exchange space in real time through the message queue.

The lineage exchange space decrypts the lineage graph data of each single system with the public key of the corresponding system, integrates the current, latest lineage graph data of every system, performs real-time computation, connects and updates the lineage relationships between the systems, and thereby draws the enterprise-level big data lineage tree.
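The integration performed by the exchange space can be sketched as a union of the latest per-system edge lists, followed by a traversal that renders the tree below a chosen root. The data layout (edges as source/destination pairs with system-qualified names) and the function names are assumptions; the disclosure does not specify the real-time computation in this detail.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Edge = Tuple[str, str]  # (upstream asset, downstream asset)


def merge_lineage(latest_by_system: Dict[str, List[Edge]]) -> Dict[str, Set[str]]:
    """Union the latest lineage edges reported by every system."""
    merged: Dict[str, Set[str]] = {}
    for edges in latest_by_system.values():
        for source, destination in edges:
            merged.setdefault(source, set()).add(destination)
    return merged


def lineage_tree(merged: Dict[str, Set[str]], root: str) -> Dict[str, dict]:
    """Render everything downstream of `root` as a nested dictionary."""
    def build(node: str, path: FrozenSet[str]) -> Dict[str, dict]:
        return {
            child: build(child, path | {child})
            for child in sorted(merged.get(node, ()))
            if child not in path  # guard against cyclic lineage
        }
    return build(root, frozenset({root}))


# Hypothetical reports: system_b reads a table that system_a produces.
latest = {
    "system_a": [("a.raw_orders", "a.orders")],
    "system_b": [("a.orders", "b.order_report")],
}
merged = merge_lineage(latest)
tree = lineage_tree(merged, "a.raw_orders")
```

The cross-system link appears only after the merge: neither system alone knows that `b.order_report` ultimately derives from `a.raw_orders`.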

After the lineage exchange space has built the cross-application data lineage tree, it encrypts the tree with its own private key and synchronizes it through the message queue to every application system in the ecosystem; a system that holds the public key of the lineage exchange space can decrypt it and obtain the big data lineage tree of the whole ecosystem.

In summary, each application system generates its data lineage graph by capturing the execution process, then encrypts the graph and synchronizes it to the lineage exchange space; the lineage exchange space constructs the latest cross-system lineage tree through real-time computation and analysis, then encrypts and distributes it. The whole process, from reporting to final distribution, constitutes a complete implementation scheme for data lineage capture that extends from a single system to cross-system application.

The foregoing embodiments further illustrate the objects, technical solutions and beneficial effects of the invention. They are provided only as specific embodiments and are not intended to limit the scope of protection of the invention; any modifications, equivalent substitutions, improvements and the like made within the spirit and principles of the invention are intended to be included within its scope of protection.

Claims

1. A method for capturing data relationships and constructing a big data relationship tree, characterized by comprising the following steps:

integrating relational data and non-relational data to obtain the metadata of the data lineage, defining the metadata, and creating entities based on the defined metadata; wherein the metadata definition comprises assigning aliases, classifications and labels to the metadata, the type of the metadata is generated from the alias, the labels and classifications associate metadata with other metadata or with data assets, the metadata is managed according to its classification, the business scope of the metadata is expressed by the classification hierarchy, and data lineage dependencies are propagated through the labels and classifications; the metadata is modeled by type and represented as entities, each type is uniquely identified by a name and has a meta-type, each entity is a specific value or a specific column of a type, and each entity is identified by a unique identifier;

executing an SQL statement to trigger a change component that changes a metadata entity, automatically capturing the change information with a capture hook to generate metadata change details, and parsing the metadata change details to generate the data lineage graph of a single system; wherein the metadata entity change includes create/change/delete operations on the metadata; the create/update/delete operations on the metadata are captured automatically by capture hooks of different types to generate metadata change details consisting of an output column and a set of input columns or input tables, the output column is associated with the set of input columns or input tables to generate a data lineage dependency graph, and the information content of the metadata change details is pushed to a message queue so that the metadata is updated; the information content comprises entity creation information, entity update information, entity deletion information, field creation information, field update information and field deletion information;

storing the data lineage graph in a graph database, encrypting it with a private key, synchronizing the encrypted data to a data lineage exchange space through a message queue, and then constructing a cross-system big data lineage tree; wherein the specific steps of constructing the cross-system big data lineage tree are as follows: each application system applies to the lineage exchange space for a public/private key pair, the private key being held by the application system and the public key being retained by the lineage exchange space for data decryption; each application system encrypts its data lineage graph with the private key and synchronizes the encrypted data to the lineage exchange space through the message queue in real time; the lineage exchange space decrypts the lineage graph data of each single system with the public key of the corresponding system, then performs real-time computation over the current, latest lineage graph data of every system, connects and completes the data lineage relationships between the systems, and thereby draws the big data lineage tree.

2. The method for capturing data relationships and constructing a big data relationship tree according to claim 1, wherein the dependency types of the data lineage graph include simple dependencies, expressions and scripts; in a simple dependency, the output column has the same value as the input column; in an expression dependency, the output column is computed at runtime by an expression over the input columns; in a script dependency, the output column is computed by a script provided by the user.

3. The method for capturing data relationships and constructing a big data relationship tree according to claim 2, wherein the data lineage dependency graph is persisted through a graph engine and an index is generated, the index is stored in a search engine, and the search engine performs deep mining on the data lineage relationships to discover potential links between data.

4. The method for capturing data relationships and constructing a big data relationship tree according to claim 1, wherein, after the lineage exchange space updates the cross-application data lineage tree, the lineage exchange space encrypts the tree with its private key and synchronizes it through the message queue to each application system in the ecosystem; each system that holds the public key of the lineage exchange space decrypts it to obtain the big data lineage tree of the whole ecosystem, so that all application systems in the ecosystem hold the complete cross-system big data lineage tree.