CN112800149A - Data governance method and system based on data lineage analysis

Info

Publication number: CN112800149A
Application number: CN202110187130.0A
Authority: CN
Inventors: 王泽宇; 宋海涛; 尹曦萌; 于春蕾; 张正奇
Original assignee: Inspur Cloud Information Technology Co Ltd
Current assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2021-02-18
Filing date: 2021-02-18
Publication date: 2021-05-14
Anticipated expiration: 2041-02-18
Also published as: CN112800149B

Abstract

The invention discloses a data governance method and system based on data lineage analysis, belonging to the technical field of data processing. The technical problem to be solved is how to overcome the difficulty of data source tracing, verification and correlation analysis during data governance. The technical scheme is as follows: a data family relationship network map is constructed by analyzing data lineage, and the data of all nodes in the map is cross-verified and extended, which helps data governance personnel trace, verify, supplement and standardize the data and improves governance efficiency. The method comprises the following steps: scheduling and storing the big data; performing lineage analysis on the data to form a data family map; and constructing a data map through an algorithm model. The system comprises a big data scheduling and storage module, a data lineage analysis module and an algorithm model module.

Description

Data governance method and system based on data lineage analysis

Technical Field

The invention relates to the technical field of data processing, and in particular to a data governance method and system based on data lineage analysis.

Background

In the big data era, data grows explosively, and massive data of many types is generated rapidly. This huge and complex body of data is repeatedly reused, converted and circulated, generating new data that converges into an ocean of data.

In human society, blood relationship refers to the interpersonal relationships that arise from marriage or birth, such as the parent-child relationship, the sibling relationship, and the other kinship relations derived from them. Similarly, relationships form naturally among data as the data is generated, processed, circulated and retired. By analogy with kinship in human society, these relationships are called the blood relationship of data, or data lineage.

Data lineage has the following characteristics:

① attribution: the data is owned by a specific organization or person, and the organization or person owning the data has the right to use it;

② multi-source: the same data can have several sources (that is, several parents); it may be produced by processing several pieces of data, or through several processing modes or processing steps;

③ traceability: the lineage of the data covers its full life cycle, so the whole process from generation to retirement can be traced back;

④ hierarchy: the lineage of the data is hierarchical; descriptive information such as the classification, induction and summarization of data forms new data, and descriptive information at different levels of detail forms the hierarchy of the data.

Given disordered data, how to use the four characteristics above to sort out the lineage relationships of the data, and thereby help data governance personnel better complete governance work such as source tracing, verification, supplementation and standardization, is a difficult problem.

To illustrate the definition of data lineage with an everyday example: on a shopping website, after a customer purchases an item, the purchase data is stored in a back-end database table A. To count which items were sold in a given month, the raw data in the database must be processed and summarized into an intermediate table B that stores the result of this stage. If the logic is complicated, processing continues through further intermediate tables until the data is finally processed into the final table used for front-end presentation, say table C. Table A is then the original source of the data in table C, that is, its ancestor. The link from the data in table A through table B to table C is the data lineage of table C.
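The table A, B, C example can be sketched in a few lines of Python. This is only an illustration: the `PARENTS` mapping and the `ancestors` helper are hypothetical names introduced here, not part of the patent.

```python
from collections import deque

# Hypothetical lineage edges: each table maps to the tables it was derived from.
PARENTS = {
    "B": ["A"],  # intermediate table B summarizes the raw purchases in A
    "C": ["B"],  # presentation table C is built from B
}

def ancestors(table, parents):
    """Return every upstream table of `table`, nearest first (breadth-first)."""
    seen, order, queue = set(), [], deque(parents.get(table, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        queue.extend(parents.get(current, []))
    return order

lineage_of_c = ancestors("C", PARENTS)
print(lineage_of_c)  # ['B', 'A']
```

Walking the parent links from table C reaches B first and then A, which is the lineage of table C described in the example.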

During data processing, every link from the data source to the final output can introduce a data quality problem. For example, if the quality of the source data is poor and this is not detected and handled in subsequent processing, the problem is eventually carried into the target table, whose quality is then poor as well. It is also possible that improper processing in one link degrades the data quality of the subsequent links.

Therefore, how to overcome the difficulty of data source tracing, verification and correlation analysis during data governance is a problem that urgently needs to be solved.

Disclosure of Invention

The technical task of the invention is to provide a data governance method and system based on data lineage analysis, so as to solve the problems that data is difficult to trace, verify and analyze for correlations during data governance.

The technical task of the invention is achieved as follows. In the data governance method based on data lineage analysis, a data family relationship network map is constructed by analyzing data lineage, and the data of each node in the map is cross-verified and extended, which helps data governance personnel trace, verify, supplement and standardize the data and improves governance efficiency. The method comprises the following steps:

scheduling and storing the big data;

performing lineage analysis on the data to form a data family map;

and constructing a data map through an algorithm model.

Preferably, the big data is scheduled and stored as follows:

the relevant data resources are scheduled into an HBase database through the NiFi data scheduler;

during scheduling, the field names are standardized and the data of key fields is cleaned, to facilitate lineage analysis.
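The standardization and cleaning performed during scheduling might look like the following minimal sketch. It is plain Python and does not use NiFi or HBase; the alias table, the set of key fields and both helper functions are assumptions made for illustration only.

```python
import re

# Hypothetical mapping from source-specific column names to standard names.
FIELD_ALIASES = {"cust_id": "customer_id", "customerid": "customer_id", "id_no": "id_number"}
# Assumed key fields that later lineage matching relies on.
KEY_FIELDS = {"customer_id", "id_number"}

def standardize_name(name):
    """Lower-case, replace runs of non-alphanumerics with '_', then apply aliases."""
    normalized = re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")
    return FIELD_ALIASES.get(normalized, normalized)

def clean_record(record):
    """Standardize field names; trim key-field values and turn blanks into None."""
    cleaned = {}
    for raw_name, value in record.items():
        name = standardize_name(raw_name)
        if name in KEY_FIELDS and isinstance(value, str):
            value = value.strip() or None
        cleaned[name] = value
    return cleaned

row = clean_record({" Cust-ID ": " 1001 ", "ID_NO": "  ", "Remark": " vip "})
print(row)  # {'customer_id': '1001', 'id_number': None, 'remark': ' vip '}
```

Only key fields are cleaned here, mirroring the text; non-key values such as `remark` pass through unchanged.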

Preferably, lineage analysis is performed on the data to form the data family map as follows:

the most basic data resource is found through the data characteristics and used as the information main node, and the data outflow nodes of the information main node are found and used as child nodes; when parent and child nodes are searched, the important field information is marked in the information main node so that the data of the nodes can be cross-verified;

the data inflow nodes and data outflow nodes of each child node are found and associated with one another to form a family data grid;

the basic node, inflow nodes and outflow nodes are drawn as circles, data inflow and data outflow are drawn as line segments with arrows, and data lineage analysis starts from the basic node as the main node;

during data lineage analysis, the table name and the key fields of the table are highlighted inside each circle, the association fields between tables are labeled on the connecting lines, and all data inflow nodes and data outflow nodes are connected in sequence to form the data family map.
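One way to picture these steps is a breadth-first walk that starts at the main node and follows data flows in both directions, keeping the association field as the edge label. The sketch below is an assumption-laden illustration: the `FLOWS` list, the table names and `build_family_map` are hypothetical, and the patent does not prescribe this traversal.

```python
from collections import deque

# Hypothetical table-to-table flows: (source table, target table, association field).
FLOWS = [
    ("orders", "monthly_sales", "item_id"),
    ("items", "monthly_sales", "item_id"),
    ("monthly_sales", "sales_report", "month"),
    ("users", "user_profile", "user_id"),  # not connected to the main node below
]

def build_family_map(main_node, flows):
    """Collect every node and labeled edge connected to main_node, in either direction."""
    nodes, edges, queue = {main_node}, [], deque([main_node])
    while queue:
        current = queue.popleft()
        for source, target, field in flows:
            if current not in (source, target):
                continue
            if (source, target, field) not in edges:
                edges.append((source, target, field))
            for neighbor in (source, target):
                if neighbor not in nodes:
                    nodes.add(neighbor)
                    queue.append(neighbor)
    return nodes, edges

nodes, edges = build_family_map("orders", FLOWS)
for source, target, field in edges:
    # Circles become parentheses, arrows carry the association field.
    print(f"({source}) --[{field}]--> ({target})")
```

Flows that are not connected to the main node, such as `users` to `user_profile`, never enter the map.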

More preferably, the data family map comprises the following elements:

① main node: there is only one main node; it is located in the middle of the whole map and is the core node of the visual graph. The lineage displayed by the map is the lineage of the main node, and lineage unrelated to this node is not displayed, to keep the graph simple and clear;

② data inflow node: a data inflow node is a parent node of the main node and represents a source of the data;

③ data outflow node: a data outflow node is a child node of the main node and represents a destination of the data. Data outflow nodes also include a special kind of node, the terminal node, which indicates that the data does not flow any further downstream;

④ data flow line: a data flow line is a path along which data flows, drawn from left to right; the flow lines converge from the data inflow nodes to the main node and spread from the main node to the data outflow nodes.

More preferably, the lineage analysis methods are as follows:

① static analysis: based on compiler principles, the source code is scanned and parsed, and the paths related to the program logic are statically analyzed and listed, so that the data flow is reflected objectively;

② contact-infection analysis: the program statements related to data transmission and mapping are screened out to obtain the key information for in-depth analysis;

③ logical time-sequence analysis: to avoid interference from redundant information, the indirect steps and intermediate program variables that are not directly related to the transmission and mapping of data fields of databases, files and communication interfaces are converted, according to the program processing flow, into direct transmission and mapping among those data fields.
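The logical time-sequence method can be illustrated by resolving intermediate variables in execution order until only field-to-field mappings remain. The sketch below assumes the assignments have already been extracted from the program; the `ASSIGNMENTS` list, the dotted-name convention for data fields and both helpers are hypothetical.

```python
# Hypothetical assignments recovered from a program, in execution order: (target, sources).
# Names containing a dot stand for data fields; bare names are intermediate variables.
ASSIGNMENTS = [
    ("tmp_price", ["orders.price"]),
    ("tmp_total", ["tmp_price", "orders.quantity"]),
    ("report.total", ["tmp_total"]),
]

def is_field(name):
    """Assumed convention: 'table.column' denotes a data field."""
    return "." in name

def direct_field_mappings(assignments):
    """Resolve intermediate variables so that only field-to-field mappings remain."""
    origins = {}   # variable or field -> set of data fields it derives from
    mappings = {}  # target field -> sorted list of source fields
    for target, sources in assignments:
        resolved = set()
        for source in sources:
            if is_field(source):
                resolved.add(source)
            else:
                resolved |= origins.get(source, set())
        origins[target] = resolved
        if is_field(target):
            mappings[target] = sorted(resolved)
    return mappings

print(direct_field_mappings(ASSIGNMENTS))
# {'report.total': ['orders.price', 'orders.quantity']}
```

The two temporary variables disappear from the result, leaving the direct mapping from `orders.price` and `orders.quantity` to `report.total`.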

Preferably, the data map is constructed through the algorithm model as follows:

each data table is abstracted into an object, the fields in the data table are abstracted into object attributes, and the relationships between tables are abstracted into object relationships; a unified ontology data model is established with objects, attribute sets and relationships as its elements, and a mapping from the physical tables to the ontology data model is established;

the relationships among the data tables of the data family are analyzed through the algorithm model to form a data map, and valuable data information is extracted.
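A minimal sketch of such an ontology data model, assuming Python dataclasses, is shown below. The class names, the `t_` table-prefix rule and the sample tables are illustrative assumptions, not details given in the patent.

```python
from dataclasses import dataclass, field

@dataclass
class DataObject:
    """Ontology object abstracted from one physical table."""
    name: str
    attributes: list      # object attributes abstracted from the table fields
    physical_table: str   # mapping back to the physical table

@dataclass
class ObjectRelation:
    """Object relationship abstracted from a table-to-table relationship."""
    source: str
    target: str
    join_attribute: str

@dataclass
class OntologyModel:
    objects: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)

    def add_table(self, table_name, columns):
        # Hypothetical naming rule: drop a leading "t_" from the table name.
        object_name = table_name[2:] if table_name.startswith("t_") else table_name
        self.objects[object_name] = DataObject(object_name, list(columns), table_name)
        return object_name

    def relate(self, source, target, join_attribute):
        # A relationship is only valid if both objects carry the joining attribute.
        for name in (source, target):
            if join_attribute not in self.objects[name].attributes:
                raise ValueError(f"{name} has no attribute {join_attribute}")
        self.relations.append(ObjectRelation(source, target, join_attribute))

model = OntologyModel()
orders = model.add_table("t_orders", ["order_id", "item_id", "price"])
items = model.add_table("t_items", ["item_id", "item_name"])
model.relate(orders, items, "item_id")
print(model.objects[orders].physical_table, len(model.relations))  # t_orders 1
```

Keeping `physical_table` on each object preserves the mapping from the physical table to the model that the text calls for.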

A data governance system based on data lineage analysis, the system comprising:

a big data scheduling and storage module, used for scheduling and storing the data to be analyzed;

a data lineage analysis module, used for analyzing the data relationships to generate the data family map;

and an algorithm model module, used for automatically analyzing the data association relationships through the key field indexes of all nodes to form the data map, managing data quality, analyzing data relationships and extracting data value.

Preferably, the big data scheduling storage module comprises,

the warehousing submodule is used for scheduling and warehousing the data;

the standardization submodule is used for carrying out standardization processing on the data fields in the scheduling process;

and the cleaning submodule is used for cleaning the key fields.

Preferably, the data lineage analysis module comprises,

the query submodule I is used for querying the main data node;

the query submodule II is used for querying the data inflow node;

the query submodule III is used for querying the data outflow node;

a construction submodule for constructing a data family map;

and the identification submodule is used for identifying the important data field of the node.

A computer-readable storage medium having stored thereon computer-executable instructions which, when executed by a processor, implement the data governance method based on data lineage analysis described above.

The data governance method and system based on data lineage analysis of the invention have the following advantages:

the invention addresses the problems that arise in data governance against the current big data background, such as data being difficult to trace, verify, standardize and analyze; by performing lineage analysis on large volumes of associated data to form a family network graph and comparing the key data items of associated nodes, it enables data source tracing and quality verification and supports data analysis;

the invention improves data governance efficiency mainly through data lineage analysis: a data lineage map is established for the given data, and the data in the map is analyzed and cross-verified according to the attribution, multi-source, traceability and hierarchy characteristics of data lineage, so that the data can be traced effectively, data quality can be verified and the associations among the data can be analyzed, which ultimately improves governance efficiency and facilitates later data analysis and use;

the invention analyzes the whole process of data generation, circulation and retirement, discovers the data lineage, delineates the whole data family relationship network graph, and cross-verifies and extends the data of each node in the graph, which helps data governance personnel or data governance algorithms trace, verify, supplement and standardize the data and improves governance efficiency.
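Comparing the key data items of associated nodes could be sketched as follows. The sample rows, the field names and the `cross_verify` helper are hypothetical; the patent does not specify how conflicts or gaps are reported.

```python
# Hypothetical rows from two associated nodes, keyed by a shared association field.
PARENT_ROWS = {
    "1001": {"name": "Li Lei", "phone": "13800000000"},
    "1002": {"name": "Han Mei", "phone": None},
}
CHILD_ROWS = {
    "1001": {"name": "Li Lei", "phone": "13900000000"},
    "1002": {"name": "Han Mei", "phone": "13700000000"},
}

def cross_verify(parent_rows, child_rows, key_fields):
    """Compare key fields of matching records; report conflicts and fillable gaps."""
    conflicts, supplements = [], []
    for record_key in sorted(parent_rows.keys() & child_rows.keys()):
        for field_name in key_fields:
            parent_value = parent_rows[record_key].get(field_name)
            child_value = child_rows[record_key].get(field_name)
            if parent_value is None and child_value is not None:
                # The child node can supplement a value missing in the parent.
                supplements.append((record_key, field_name, child_value))
            elif child_value is None and parent_value is not None:
                # The parent node can supplement a value missing in the child.
                supplements.append((record_key, field_name, parent_value))
            elif parent_value != child_value:
                conflicts.append((record_key, field_name, parent_value, child_value))
    return conflicts, supplements

conflicts, supplements = cross_verify(PARENT_ROWS, CHILD_ROWS, ["name", "phone"])
print(conflicts)    # [('1001', 'phone', '13800000000', '13900000000')]
print(supplements)  # [('1002', 'phone', '13700000000')]
```

Conflicts point to records that need verification along the lineage, while supplements show where one node can fill a gap in another.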

Drawings

The invention is further described below with reference to the accompanying drawings.

FIG. 1 is a flow chart of the data governance method based on data lineage analysis;

FIG. 2 is a block diagram of the data governance system based on data lineage analysis.

Detailed Description

The data governance method and system based on data lineage analysis according to the invention are described in detail below with reference to the drawings and specific embodiments.

Example 1:

As shown in FIG. 1, in the data governance method based on data lineage analysis of the invention, a data family relationship network map is constructed by analyzing data lineage, and the data of each node in the map is cross-verified and extended, which helps data governance personnel trace, verify, supplement and standardize the data and improves governance efficiency. The method comprises the following steps:

S1, scheduling and storing the big data;

S2, performing lineage analysis on the data to form a data family map;

S3, constructing a data map through the algorithm model.

In this embodiment, the scheduling and storing of the big data in step S1 is specifically as follows:

S101, the relevant data resources are scheduled into an HBase database through the NiFi data scheduler;

S102, during scheduling, the field names are standardized and the data of key fields is cleaned, to facilitate lineage analysis.

In this embodiment, the lineage analysis of step S2, which forms the data family map, is specifically as follows:

S201, the most basic data resource is found through the data characteristics and used as the information main node, and the data outflow nodes of the information main node are found and used as child nodes; when parent and child nodes are searched, the important field information is marked in the information main node so that the data of the nodes can be cross-verified;

S202, the data inflow nodes and data outflow nodes of each child node are found and associated with one another to form a family data grid;

S203, the basic node, inflow nodes and outflow nodes are drawn as circles, data inflow and data outflow are drawn as line segments with arrows, and data lineage analysis starts from the basic node as the main node;

S204, during data lineage analysis, the table name and the key fields of the table are highlighted inside each circle, the association fields between tables are labeled on the connecting lines, and the data inflow nodes and data outflow nodes are connected in sequence to form the data family map.

The data family map in this embodiment includes the following elements:

The lineage analysis methods in this embodiment are as follows:

In this embodiment, the data map is constructed through the algorithm model in step S3 as follows:

S301, each data table is abstracted into an object, the fields in the data table are abstracted into object attributes, and the relationships between tables are abstracted into object relationships; a unified ontology data model is established with objects, attribute sets and relationships as its elements, and a mapping from the physical tables to the ontology data model is established;

S302, the relationships among the data tables of the data family are analyzed through the algorithm model to form a data map, and valuable data information is extracted. The following is an example of an algorithm that obtains the lineage relationships between tables by parsing the SQL syntax tree with Druid:

The lineage relationships between tables are resolved with tools such as Druid, as in the example, or Spark's logical plan; the table-to-table associations are extracted by the algorithm, and a data model is established to form the data family map, which helps data analysts trace data sources and control data quality, and assists in data relationship analysis.
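The sketch below is not the Druid-based algorithm. It is a simplified Python illustration that uses regular expressions instead of a SQL syntax tree, and it only handles a plain INSERT ... SELECT statement. The function name and the sample SQL are hypothetical; a real implementation would walk the parsed syntax tree, which also copes with subqueries, common table expressions and comments.

```python
import re

def table_lineage(sql):
    """Return (target table, sorted source tables) for a simple INSERT ... SELECT."""
    flat = " ".join(sql.split())  # collapse newlines and repeated spaces
    target = re.search(
        r"\binsert\s+(?:into|overwrite\s+table)\s+([\w.]+)", flat, re.IGNORECASE
    )
    if target is None:
        raise ValueError("no INSERT target found")
    # Every identifier that directly follows FROM or JOIN is treated as a source table.
    sources = re.findall(r"\b(?:from|join)\s+([\w.]+)", flat, re.IGNORECASE)
    return target.group(1), sorted(set(sources))

SQL = """
INSERT INTO monthly_sales
SELECT o.item_id, i.item_name, SUM(o.quantity)
FROM orders o
JOIN items i ON o.item_id = i.item_id
GROUP BY o.item_id, i.item_name
"""

print(table_lineage(SQL))  # ('monthly_sales', ['items', 'orders'])
```

Each result gives one target table and its source tables, which is the table-to-table association from which the family map is assembled.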

Example 2:

As shown in FIG. 2, the data governance system based on data lineage analysis of the invention comprises a big data scheduling and storage module, used for scheduling and storing the data to be analyzed;

The big data scheduling storage module in this embodiment includes,

the warehousing submodule is used for scheduling and warehousing the data;

and the cleaning submodule is used for cleaning the key fields.

The data lineage analysis module in this embodiment includes,

the query submodule I is used for querying the main data node;

the query submodule II is used for querying the data inflow node;

the query submodule III is used for querying the data outflow node;

a construction submodule for constructing a data family map;

Example 3:

This embodiment of the invention also provides a computer-readable storage medium storing a plurality of instructions; the instructions are loaded by a processor so that the processor executes the data governance method based on data lineage analysis of any embodiment of the invention. Specifically, a system or apparatus may be provided with a storage medium that stores software program code implementing the functions of any of the above embodiments, and a computer (or a CPU or MPU) of the system or apparatus is caused to read out and execute the program code stored in the storage medium.

In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.

Examples of the storage medium for supplying the program code include a flexible disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.

Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.

Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A data governance method based on data lineage analysis, characterized in that a data family relationship network map is constructed by analyzing data lineage, and the data of each node in the map is cross-verified and extended, which helps data governance personnel trace, verify, supplement and standardize the data and improves governance efficiency; the method comprises the following steps:

scheduling and storing the big data;

performing lineage analysis on the data to form a data family map;

and constructing a data map through an algorithm model.

2. The data governance method based on data lineage analysis according to claim 1, wherein the big data is scheduled and stored as follows:

3. The data governance method based on data lineage analysis according to claim 1, wherein lineage analysis is performed on the data to form the data family map as follows:

the most basic data resource is found through the data characteristics and used as the information main node, and the data outflow nodes of the information main node are found and used as child nodes; when parent and child nodes are searched, the important field information is marked in the information main node;

4. The data governance method based on data lineage analysis according to claim 3, wherein the data family map comprises the following elements:

① main node: there is only one main node; it is located in the middle of the whole map and is the core node of the visual graph; the lineage displayed by the map is the lineage of the main node;

③ data outflow node: a data outflow node is a child node of the main node and represents a destination of the data; data outflow nodes also include a terminal node, which indicates that the data does not flow any further downstream;

5. The data governance method based on data lineage analysis according to claim 3 or 4, wherein the lineage analysis methods are as follows:

③ logical time-sequence analysis: according to the program processing flow, the indirect steps and intermediate program variables that are not directly related to the transmission and mapping of data fields of databases, files and communication interfaces are converted into direct transmission and mapping among those data fields.

6. The data governance method based on data lineage analysis according to claim 1, wherein the data map is constructed through the algorithm model as follows:

7. A data governance system based on data lineage analysis, characterized by comprising,

8. The data governance system based on data lineage analysis according to claim 7, wherein the big data scheduling and storage module comprises,

the warehousing submodule is used for scheduling and warehousing the data;

and the cleaning submodule is used for cleaning the key fields.

9. The data governance system based on data lineage analysis according to claim 7 or 8, wherein the data lineage analysis module comprises,

the query submodule I is used for querying the main data node;

the query submodule II is used for querying the data inflow node;

the query submodule III is used for querying the data outflow node;

a construction submodule for constructing a data family map;

10. A computer-readable storage medium having stored thereon computer-executable instructions that, when executed by a processor, perform the data governance method based on data lineage analysis according to any one of claims 1 to 6.