CN116010428A - Data lineage analysis method and device - Google Patents

Data lineage analysis method and device

Info

Publication number
CN116010428A
CN116010428A (application CN202310163717.7A)
Authority
CN
China
Prior art keywords
data
lineage
Hive
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310163717.7A
Other languages
Chinese (zh)
Inventor
王厚平
王乐珩
张金银
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Bizhi Technology Co ltd
Original Assignee
Hangzhou Bizhi Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Bizhi Technology Co ltd
Priority to CN202310163717.7A
Publication of CN116010428A
Legal status: Pending

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a data lineage analysis method and device. The method obtains lineage data through Hook-mechanism parsing of the big data components, sends the lineage data to a storage system through message middleware, and constructs a data model in the storage system for storing the lineage data. The method comprises the following steps: step 10, collecting lineage data of the data integration based on embedded code plug-in parsing, and sending it to the message middleware; step 20, collecting lineage data of the Hive data warehouse based on Hive Hook parsing, and sending it to the message middleware; step 30, collecting lineage data of the API data service based on embedded code plug-in parsing, and sending it to the message middleware; step 40, the message middleware receiving the designated Topic messages and caching them; and step 50, subscribing to the messages in the message middleware according to the type model, storing them, and providing analytical query. The invention can collect and analyze the data lineage of a big data platform across the full link, accurately and in real time.

Description

Data lineage analysis method and device
Technical Field
The invention relates to the fields of computers, network communication technology and big data processing, and in particular to a data lineage analysis method and a data lineage analysis device.
Background
With the maturation of big data technology represented by Hadoop, big data platforms have gradually been adopted by large numbers of enterprises, individuals and research institutions. They provide the capability to process massive data on inexpensive commodity machines, with large numbers of DAG-scheduled distributed tasks jointly processing the data. However, the sheer number of tasks and the complexity of the SQL make the data lineage hard to observe: it is difficult to know which tables are affected when the data of a given table is modified, or from which tables a given table is derived. Existing technical solutions basically either maintain the processing links manually in Excel during big data development, or obtain the related table information by applying regular expressions and string splitting to the SQL; the lineage obtained this way is not very accurate and cannot be obtained in real time.
Both approaches have drawbacks. The former relies on large amounts of manpower, is very costly, and becomes ever harder to maintain over time. As for the latter, because SQL can be constructed in very complicated ways, correctness is hard to guarantee for complex SQL, so the resulting lineage is also very inaccurate; moreover, because regular expressions and string splitting cannot perceive the execution process, it is difficult to obtain the lineage information in a timely manner.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a data lineage analysis method and device.
To achieve the above object, the invention provides a data lineage analysis method which obtains lineage data through Hook-mechanism parsing of the big data components and sends it to a storage system through message middleware, a data model being constructed in the storage system for storing the lineage data. The method comprises the following steps:
step 10, collecting lineage data of the data integration based on embedded code plug-in parsing, and sending it to the message middleware;
step 20, collecting lineage data of the Hive data warehouse based on Hive Hook parsing, and sending it to the message middleware;
step 30, collecting lineage data of the API data service based on embedded code plug-in parsing, and sending it to the message middleware;
step 40, the message middleware receiving the designated Topic message (named hive_hook_topic) and caching it until a consumer consumes it;
step 50, subscribing to the messages in the message middleware according to the type model and storing them to provide analytical query.
Further, in step 10, the Hook in the data integration system parses the lineage of the data integration system and pushes the lineage data to storage; the data integration process of step 10 is as follows:
step 101, extracting data according to the data reading module;
step 102, packaging the job information into task node information;
step 103, after the data construction is completed, sending the data to the storage system through the message middleware;
step 104, writing data according to the output writing module.
Further, the Hive data warehouse Hook process of step 20 specifically includes the following steps:
step 201, setting Hive variables for passing the platform variables related to the Hive task;
step 202, parsing the HQL into an AST abstract syntax tree;
step 203, reading the Hive variables and passing the platform variables to the Hook plug-in when the task starts;
step 204, reading the LineageInfo information, including the input table set and the output table set;
step 205, from the input table set, obtaining the names of the input tables by traversing each ReadEntity;
step 206, from the output table set, obtaining the names of the output tables by traversing each WriteEntity;
step 207, constructing the unique attribute name of each table from the above data, using the formula:
qualifiedName = database name.table name@data source ID.
Further, the API data service is a means of accessing processed data using SQL, and step 30 comprises the following sub-steps:
step 301, performing preliminary parsing of the SQL through Antlr4;
step 302, obtaining an AST syntax tree from the parse;
step 303, traversing the multi-way AST syntax tree in Java with a depth-first algorithm to obtain the expected input table and output table information, and storing it in the LineageInfo;
step 304, constructing a notification message from the input table and output table information and storing it in the storage system.
Further, step 302 includes the following sub-steps:
step 3021, obtaining the root node first;
step 3022, obtaining the list of all child nodes of the node, and traversing in turn from the leftmost child node;
step 3023, judging whether the node belongs to the selectElements type; if so, traversing all child nodes of the node until the ID attribute is found. For example: ID=T4, whose value is an output alias table;
step 3024, judging whether the node is of the fromSource type; if so, traversing all child nodes of the node, finding the root nodes of the left and right sibling nodes of the AS type, and locating the final child node of each: the left sibling's final child node is the input table, and the right sibling's final child node is the output alias table.
Further, in step 50, the data lineage store may save the data, according to its type, in a distributed storage system that employs HBase and Elasticsearch.
Further, step 50 includes the following sub-steps:
step 401, receiving the lineage data parsed from data collection, the data warehouse and the data service;
step 402, after parsing according to the different data types, packaging the data into different data objects and storing them to HBase and Elasticsearch respectively, to facilitate query and retrieval.
In another aspect, the invention provides a data lineage analysis device for implementing a method according to the invention.
Further, the device comprises a data integration system, a Hive data warehouse system, an API data service system, message middleware and a data lineage storage system. The data integration system, the Hive data warehouse system and the API data service system implement the process of obtaining data lineage, including parsing the SQL of the data service into data lineage, while the message middleware and the data lineage storage system implement the process of storing the data lineage to the storage service through the message middleware.
Further, the data integration system comprises a data reading module, a data output module, a rate control module and a resource control module. Lineage analysis mainly concerns the configuration of the data reading module and the data output module.
The data lineage analysis method and device can be conveniently integrated into a component through the engine's Hook mechanism to obtain data lineage; because the collection runs deep inside the component, the lineage is captured accurately and in real time. Decoupling through the message middleware avoids any performance impact on the overall process, and distributed storage can carry vast amounts of data. SQL is parsed into data lineage through the Hook process, decoupled through the message middleware, and stored to the storage system by type. The method can collect and analyze the data lineage of a big data platform across the full link, accurately and in real time.
Drawings
FIG. 1 illustrates the overall architecture of a data lineage analysis method and device according to an embodiment of the present invention;
FIG. 2 shows the data lineage collection process for the data integration of FIG. 1;
FIG. 3 shows the data lineage collection process for the Hive data warehouse of FIG. 1;
FIG. 4 shows the data lineage collection process for the data service of FIG. 1;
FIG. 5 shows a data structure diagram of how the full-link lineage across Datax data integration, the Hive data warehouse and the data service is constructed;
FIG. 6 shows an example of the data structure of a complex lineage graph formed after a large number of tasks have executed many times.
Detailed Description
The following describes embodiments of the present invention clearly and completely with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort are intended to fall within the scope of the invention.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "coupled" are to be construed broadly: for example, fixedly connected, detachably connected, or integrally connected; mechanically or electrically connected; directly connected, indirectly connected through an intermediate medium, or in communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
Specific embodiments of the present invention are described in detail below with reference to fig. 1-6. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.
As shown in FIG. 1, in the data lineage analysis method and device provided by an embodiment of the invention, lineage data is obtained through Hook-mechanism parsing in the big data components and sent to a storage system through message middleware, and a data model is constructed in the storage system to store the lineage data. Users can obtain the lineage by accessing an interface of the storage system. A hook is a mechanism, originally provided in systems such as Windows to replace the "interrupt" of DOS, by which a program registers for a particular system event; once that event occurs, the registered program receives a notification from the system and can respond to the event immediately.
As shown in FIG. 1, the basic flow of the data lineage analysis method according to the embodiment of the invention is as follows:
Step 10, collecting lineage data of the data integration based on Hook parsing, and sending it to the message middleware. Datax is an open-source, widely used data integration component; the invention extracts the key lineage data based on an embedded code plug-in and sends it to the designated Topic of the message middleware (named datax_hook_topic). The purpose is to capture the lineage structure of a data integration task. In a specific embodiment, for an integration task that synchronizes MySQL to Hive, integrating from the T1 and T2 tables of MySQL into the T3 table of Hive, the lineage of the integration task P1 (T1 -> T3) can finally be obtained. As shown in FIG. 2, the data integration collection process of step 10 is as follows:
In step 101, Datax starts the data reading module according to the JSON configuration and uses it to extract data. An example of the configuration JSON is as follows:
{
  "core": {
    "transport": {
      "channel": {
        "speed": {
          "byte": "1048576"
        }
      }
    }
  },
  "job": {
    "content": [
      {
        "reader": {
          "parameter": {
            "database": "simba_test",
            "partition": "name=${bdp.system.bizdate}",
            "simba": true,
            "nullFormat": "\\N",
            "haveKerberos": false,
            "column": [
              { "index": 0, "type": "STRING" },
              { "index": 1, "type": "STRING" }
            ],
            "encoding": "UTF-8",
            "tableName": "T1"
          },
          "name": "hivereader",
          "id": "100266"
        },
        "writer": {
          "parameter": {
            "database": "simba_test",
            "postSql": [],
            "haveKerberos": false,
            "column": [
              "`oneid`",
              "`is_buy_7d`"
            ],
            "connection": [
              { "table": [ "T3" ] }
            ],
            "writeMode": "replace",
            "batchSize": "1024",
            "preSql": []
          },
          "name": "mysqlwriter",
          "id": "100267"
        }
      }
    ],
    "setting": {
      "errorLimit": {
        "record": 0
      },
      "speed": {
        "channel": 1
      }
    }
  }
}
During execution, the Datax JSON configuration is parsed: the MySQL input table T1 is obtained from the job.content.reader.parameter.tableName structure; the Hive output table T3 is obtained from the job.content.writer.parameter.connection.table structure; the MySQL database simba_test is obtained from the job.content.reader.parameter.database structure; the Hive database simba_test is obtained from the job.content.writer.parameter.database structure; the input data source ID is obtained from job.content.reader.id; and the output data source ID is obtained from job.content.writer.id. This completes the parsing, as shown in step 102 of FIG. 2. From these data, the unique attribute name of each table can be constructed as follows:
qualifiedName = database name.table name@data source ID
Thus, T1 (qualifiedName: simba_test.T1@100266, Attributes) -> J1 (qualifiedName: J1, Attributes) -> T3 (qualifiedName: simba_test.T3@100267, Attributes) is obtained, where Attributes stands for the other collected attributes, abbreviated here to highlight the essentials.
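To make the parsing step concrete, the following is a minimal sketch in Java of extracting the lineage triple from a Datax job JSON, assuming the Jackson library is available; the class name DataxLineageParser and the returned string format are illustrative assumptions, not part of Datax itself:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class DataxLineageParser {
    // Walks the same JSON paths listed above and builds the two qualifiedNames.
    public static String parse(String jobJson) throws Exception {
        JsonNode content = new ObjectMapper().readTree(jobJson)
                .path("job").path("content").path(0);

        JsonNode reader = content.path("reader");
        String inQualifiedName = reader.path("parameter").path("database").asText()
                + "." + reader.path("parameter").path("tableName").asText()
                + "@" + reader.path("id").asText();          // e.g. simba_test.T1@100266

        JsonNode writer = content.path("writer");
        String outQualifiedName = writer.path("parameter").path("database").asText()
                + "." + writer.path("parameter").path("connection").path(0)
                        .path("table").path(0).asText()
                + "@" + writer.path("id").asText();          // e.g. simba_test.T3@100267

        return inQualifiedName + " -> " + outQualifiedName;
    }
}

Because Jackson's path() returns a missing-node placeholder rather than null, the sketch degrades gracefully if a field is absent from the configuration.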
After the data construction is completed, the message is sent to the Topic (datax_hook_topic) through the Kafka message middleware, as shown in step 103 of FIG. 2. Message middleware is used to transfer the data instead of writing to the database directly because the process of steps 101-104 executes linearly; a direct call would block the process, whereas the message can be consumed asynchronously through the middleware, improving the throughput of the service. Finally, as shown in step 104 of FIG. 2, the input data is written to the specified output destination.
Step 20, collecting lineage data of the Hive data warehouse based on the Hook, and sending it to the message middleware. Hive is a Hadoop-based data warehouse tool for data extraction, transformation and loading; it is a mechanism that can store, query and analyze large-scale data held in Hadoop. Hive collects the lineage data of the Hive data warehouse based on Hook parsing and sends it to the designated Topic of the message middleware (named hive_hook_topic).
as shown in FIG. 3, the Hive data warehouse Hook process in step 20, specifically, the execution process of the Hive data warehouse includes the following steps:
step 201, setting Hive variables for delivering additional configuration related to Hive tasks, wherein the configuration is mainly used for describing special attributes given by a platform, and comprises the following configurations:
simba. Meta. Jobid: task ID identifying platform
simba. Meta. Taskid: job ID identifying platform
simba. Meta. Tenantid: tenant ID identifying a platform
simba. Meta. Datasourceid: data Source ID identifying platform
sim ba. Meta. ProjectId: item ID identifying platform, item ID and database in one-to-one relationship
sibba. Meta. Task dagtype: identifying DAG types for a platform
In step 202, which is a native Hive engine process, the HQL is parsed into an AST abstract syntax tree and the engine executes the SQL.
Step 203, reading the Hive variables and passing the custom variables set at the start of the task to the Hook plug-in. The variables are set first and read later because variables at Hive task submission can only be passed through the set syntax in SQL, so the corresponding variables must be obtained through the HiveConf class during execution. Most importantly, the simba.meta.datasourceId variable is obtained, which can be assumed to be 100267; the value of simba.meta.projectId is obtained and the corresponding Hive database name is derived from it, which can be assumed to be simba_test; and the value of simba.meta.taskId is obtained, which represents the unique ID of the task and can be assumed to be J2.
Step 204, reading the LineageInfo information, which records which fields are processed into which fields by which procedure; for example, the purchase quantity (num) is converted into the purchase total (sum) by a sum function;
In step 205, the set of input tables is obtained, and the names of the input tables are obtained by traversing each ReadEntity; the value can be assumed to be T3;
In step 206, the set of output tables is obtained, and the names of the output tables are obtained by traversing each WriteEntity; the value can be assumed to be T4;
Step 207, constructing the unique attribute name of each table from the above data, using the formula:
qualifiedName = database name.table name@data source ID
The notification message is constructed from the data obtained in steps 203-206 as the data structure T3 (qualifiedName: simba_test.T3@100267, Attributes) -> J2 (qualifiedName: J2, Attributes) -> T4 (qualifiedName: simba_test.T4@100267, Attributes), where Attributes stands for the other attributes of the tables and fields collected in the process (for example names and types), again abbreviated to highlight the essentials. The message is sent through Kafka to the designated Topic (hive_hook_topic). Note that the qualifiedName of the output table of the Datax task equals the qualifiedName of the input table of the Hive task; this is where the lineage relationship from data collection to data warehouse is established.
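A minimal sketch of such a hook in Java follows, assuming Hive's ExecuteWithHookContext interface and the Kafka producer client; the broker address and the flat message format are placeholders, and a production hook, registered through the hive.exec.post.hooks property, would reuse a single long-lived producer rather than creating one per query:

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.ql.hooks.*;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.util.Properties;

public class LineageHook implements ExecuteWithHookContext {
    @Override
    public void run(HookContext ctx) throws Exception {
        HiveConf conf = ctx.getConf();
        String dataSourceId = conf.get("simba.meta.datasourceId");  // assumed 100267
        String taskId = conf.get("simba.meta.taskId");              // assumed J2

        StringBuilder msg = new StringBuilder();
        for (ReadEntity in : ctx.getInputs()) {                     // input table set
            if (in.getType() == Entity.Type.TABLE) {
                msg.append(in.getTable().getDbName()).append('.')
                   .append(in.getTable().getTableName())
                   .append('@').append(dataSourceId)
                   .append(" -> ").append(taskId);
            }
        }
        for (WriteEntity out : ctx.getOutputs()) {                  // output table set
            if (out.getType() == Entity.Type.TABLE) {
                msg.append(" -> ").append(out.getTable().getDbName()).append('.')
                   .append(out.getTable().getTableName())
                   .append('@').append(dataSourceId);
            }
        }

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");           // placeholder address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("hive_hook_topic", msg.toString()));
        }
    }
}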
Step 30, collecting lineage data of the API data service based on Hook parsing and sending it to the designated Topic of the message middleware (named api_hook_topic).
In step 30, the API data service is a service used after the data has been processed in the data platform, and generally uses SQL as the execution configuration of the data service. As shown in FIG. 4, step 30 includes the following sub-steps:
step 301, performing preliminary parsing of the SQL through Antlr4;
step 302, obtaining an AST syntax tree from the parse:
the entire AST syntax tree is traversed by a breadth-first traversal algorithm, and step 302 comprises the following sub-steps:
step 3021, obtaining the root node first;
step 3022, obtaining the list of all child nodes of the node, and traversing in turn from the leftmost child;
step 3023, judging whether the node belongs to the selectElements type; if so, traversing all child nodes of the node until the ID attribute is found, ID=T5, whose value is the output alias table;
step 3024, judging whether the node is of the fromSource type; if so, traversing all child nodes of the node, finding the root nodes of the left and right siblings of the AS type, and locating the final child node of each: the left sibling's final child node is the input table (T4), and the right sibling's final child node is the output alias table (T5).
Step 303, traversing the multi-way AST syntax tree in Java with a depth-first algorithm to obtain the expected input table and output table information, and storing it in the LineageInfo. Meanwhile, the project to which the API belongs and the data source ID of the operation can be obtained through the API_ID: the API_ID is J3, the database corresponding to the project ID is simba_test, and the data source ID is 100267.
Step 304, substituting the data parsed in steps 301-303 into the earlier qualifiedName composition formula yields T4 (qualifiedName: simba_test.T4@100267, Attributes) -> J3 (qualifiedName: J3, Attributes) -> T5 (qualifiedName: simba_test.T5@100267, Attributes), and this data structure is sent to the designated Topic of the message middleware (api_hook_topic). Note that the qualifiedName of the output table of the Hive task equals the qualifiedName of the input table of the API; this is where the lineage relationship from the data warehouse to the API is established.
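As a sketch of the depth-first walk of step 303, the following Java fragment traverses an Antlr4 parse tree from the root, visiting children from the leftmost one; the class name is a hypothetical helper, and the rule types actually matched (selectElements, fromSource) depend on the SQL grammar in use:

import org.antlr.v4.runtime.tree.ParseTree;
import org.antlr.v4.runtime.tree.TerminalNode;
import java.util.ArrayList;
import java.util.List;

public class AstWalker {
    // Depth-first, left-to-right traversal collecting leaf tokens,
    // e.g. table names and aliases such as T4 and T5.
    public static void dfs(ParseTree node, List<String> leaves) {
        if (node instanceof TerminalNode) {
            leaves.add(node.getText());
            return;
        }
        for (int i = 0; i < node.getChildCount(); i++) {
            dfs(node.getChild(i), leaves);
        }
    }

    public static List<String> leavesOf(ParseTree root) {
        List<String> leaves = new ArrayList<>();
        dfs(root, leaves);
        return leaves;
    }
}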
In step 40, the message middleware is Kafka. It is mainly used to receive the messages so that the data can be consumed asynchronously without affecting the performance of the originating systems; this component is an optimization of the lineage collection.
In step 50, the relationship data above is saved by consuming the different Topic types (datax_hook_topic, hive_hook_topic, api_hook_topic) separately, yielding the lineage link diagram shown in FIG. 5.
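A minimal sketch of such a consumer follows, assuming the Kafka Java client; the group ID, broker address and the three store methods are hypothetical stand-ins for the HBase and Elasticsearch writers of step 402:

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;

public class LineageConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder address
        props.put("group.id", "lineage-store");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Arrays.asList(
                    "datax_hook_topic", "hive_hook_topic", "api_hook_topic"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> r : records) {
                    // Route by topic so each lineage type becomes its own data object.
                    switch (r.topic()) {
                        case "datax_hook_topic": storeIntegrationLineage(r.value()); break;
                        case "hive_hook_topic":  storeWarehouseLineage(r.value());   break;
                        case "api_hook_topic":   storeServiceLineage(r.value());     break;
                    }
                }
            }
        }
    }

    // Stubs standing in for the HBase + Elasticsearch writers of step 402.
    static void storeIntegrationLineage(String msg) { /* write to HBase and ES */ }
    static void storeWarehouseLineage(String msg)   { /* write to HBase and ES */ }
    static void storeServiceLineage(String msg)     { /* write to HBase and ES */ }
}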
FIG. 5 shows the unidirectional lineage link formed with one Datax task, one Hive task and one data service task each.
P1 represents the lineage of the Datax task, i.e. the T1 -> J1 -> T3 process, expressing that the T1 table flows to the T3 table through the J1 task. Here T1 -> J1 -> T3 is a Datax task whose function is to extract the data of the MySQL table named T1 into the Hive table named T3. T1 is the source table name, T3 is the target table name, and J1 is the Datax task name.
J1 is a Datax task; its JSON expression may be, for example, as follows:
{
  "job": {
    "content": [{
      "reader": {
        "parameter": {
          "database": "default",
          "column": [
            { "index": 0, "type": "STRING" },
            { "index": 1, "type": "STRING" }
          ],
          "tableName": "T1"
        },
        "name": "mysqlreader",
        "id": "100001"
      },
      "writer": {
        "parameter": {
          "database": "default",
          "column": [
            "`id`",
            "`name`"
          ],
          "connection": [{
            "table": [ "T2" ]
          }],
          "writeMode": "replace"
        },
        "name": "hivewriter",
        "id": "100002"
      }
    }]
  }
}
P2 represents the Hive task lineage, i.e. the T3 -> J2 -> T4 process, expressing that the T3 table flows to the T4 table through the J2 task. T3 -> J2 -> T4 is a Hive task whose function is to process the data of the Hive table named T3 and write it into the Hive table named T4. T3 is the source table name, T4 is the target table name, and J2 is the Hive task name.
J2 is a Hive task, which can be expressed as an SQL statement, for example:
INSERT INTO TABLE T4 SELECT * FROM T3
The meaning of this SQL statement is to query all columns of data of T3 and append-insert the result into the T4 table.
P3 represents the lineage of the data service task, i.e. the T4 -> J3 -> T5 process, expressing that the T4 table flows to the T5 table through the J3 task. T4 -> J3 -> T5 is a data service task whose function is to expose the data of the Hive table named T4 through the data service as an externally accessible interface and provide external users with access to the data. Here T4 is the source table name, T5 is the table alias, and J3 is the data service task.
J3 is a data service task whose operation can be expressed as an SQL statement, for example:
SELECT T5.* FROM T4 T5
The meaning of this SQL statement is to query all columns of data of T4 as a temporary table aliased T5, through which external queries are served.
FIG. 6 is a schematic diagram of the multi-path lineage links formed, on the basis of the unidirectional lineage link of FIG. 5, after a large number of different Datax tasks, Hive tasks and data service tasks have executed. Relative to FIG. 5, T2, T6, T7, T8, T9, T10, T11, T12 and T13 are table names, and J4, J5, J6, J7 and J8 are Hive task names.
In addition, an embodiment of the invention also provides a data lineage analysis device comprising a data integration system, a Hive data warehouse system, an API data service system, message middleware and a data lineage storage system. The data integration system, the Hive data warehouse system and the API data service system implement the process of obtaining data lineage, including parsing the SQL of the data service into data lineage, while the message middleware and the data lineage storage system implement the process of storing the data lineage to the storage service through the message middleware.
The data integration system comprises a data reading module, a data output module, a rate control module and a resource control module. Lineage analysis mainly concerns the configuration of the data reading module and the data output module.
The Datax data collection tasks, Hive data processing tasks and API data consumption tasks cover the full set of capabilities of a big data platform. As large numbers of tasks execute, the lineage analysis device yields a complex lineage map as shown in FIG. 6, where T1-T13 denote different business tables and J1-J8 denote different tasks. Algorithmic analysis can be performed on this map, for example determining which output tables the T1 table ultimately affects, or querying the full-link lineage starting from any table; a minimal sketch of such a downstream-impact query follows.
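As an illustration of such an analysis, the following Java sketch performs a downstream-impact query over the lineage graph, assuming the graph is held as an adjacency map of node names; the edge data shown reproduces the T1 -> J1 -> T3 -> J2 -> T4 -> J3 -> T5 chain of FIG. 5:

import java.util.*;

public class LineageImpact {
    // Breadth-first search: every node reachable from start is affected by it.
    public static Set<String> downstream(Map<String, List<String>> edges, String start) {
        Set<String> affected = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(List.of(start));
        while (!queue.isEmpty()) {
            String node = queue.poll();
            for (String next : edges.getOrDefault(node, List.of())) {
                if (affected.add(next)) queue.add(next);   // visit each node once
            }
        }
        return affected;
    }

    public static void main(String[] args) {
        Map<String, List<String>> edges = Map.of(
                "T1", List.of("J1"), "J1", List.of("T3"),
                "T3", List.of("J2"), "J2", List.of("T4"),
                "T4", List.of("J3"), "J3", List.of("T5"));
        System.out.println(downstream(edges, "T1"));       // [J1, T3, J2, T4, J3, T5]
    }
}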
SQL is parsed into data lineage through the Hook process, decoupled through the message middleware, and stored to the storage system by type.
In the description herein, reference to the terms "embodiment," "example," etc. means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of these terms do not necessarily refer to the same embodiment or example. Furthermore, the different embodiments or examples described in this specification, and the features therein, may be combined by those skilled in the art without contradiction.
While embodiments of the present invention have been shown and described, it will be understood that the embodiments are illustrative and not to be construed as limiting the invention, and that various changes, modifications, substitutions and alterations may be made by those skilled in the art without departing from the scope of the invention.

Claims (10)

1. A data lineage analysis method, characterized in that the method obtains lineage data through Hook-mechanism parsing of the big data components, sends the lineage data to a storage system through message middleware, and constructs a data model in the storage system for storing the lineage data, the method comprising the following steps:
step 10, collecting lineage data of the data integration based on embedded code plug-in parsing, and sending it to the message middleware;
step 20, collecting lineage data of the Hive data warehouse based on Hive Hook parsing, and sending it to the message middleware;
step 30, collecting lineage data of the API data service based on embedded code plug-in parsing, and sending it to the message middleware;
step 40, the message middleware receiving the designated Topic messages and caching them;
step 50, subscribing to the messages in the message middleware according to the type model, storing them, and providing analytical query.
2. The data lineage analysis method according to claim 1, wherein in step 10 the embedded code plug-in in the data integration parses the lineage of the data integration and pushes the lineage data to storage, the data integration process of step 10 being as follows:
step 101, extracting data according to the data reading module;
step 102, packaging the job information into task node information;
step 103, after the data construction is completed, sending the data to the storage system through the message middleware;
step 104, writing to the target source according to the output writing configuration.
3. The method of claim 2, wherein the Hive data warehouse Hook process of step 20 comprises the following steps:
step 201, setting Hive variables for passing the platform variables related to the Hive task;
step 202, parsing the HQL into an AST abstract syntax tree;
step 203, reading the Hive variables and passing the platform variables to the Hook plug-in when the task starts;
step 204, reading the LineageInfo information, including the input table set and the output table set;
step 205, from the input table set, obtaining the names of the input tables by traversing each ReadEntity;
step 206, from the output table set, obtaining the names of the output tables by traversing each WriteEntity;
step 207, constructing the unique attribute name of each table from the above data, using the formula:
qualifiedName = database name.table name@data source ID.
4. The data lineage analysis method according to claim 3, wherein the API data service is a means of accessing processed data using SQL, and step 30 includes the following sub-steps:
step 301, performing preliminary parsing of the SQL through Antlr4;
step 302, obtaining an AST syntax tree from the parse;
step 303, traversing the multi-way AST syntax tree with a depth-first algorithm to obtain the expected input table and output table information, and storing it in the LineageInfo;
step 304, constructing a notification message from the input table and output alias table information and storing it in the storage system.
5. The method of claim 4, wherein step 302 includes the following sub-steps:
step 3021, obtaining the root node first;
step 3022, obtaining the list of all child nodes of the node, and traversing in turn from the leftmost child node;
step 3023, judging whether the node belongs to the selectElements type; if so, traversing all child nodes of the node until the ID attribute is found, whose value is the output alias table;
step 3024, judging whether the node is of the fromSource type; if so, traversing all child nodes of the node, finding the root nodes of the left and right sibling nodes of the AS type, and locating the final child node of each: the left sibling's final child node being the input table, and the right sibling's final child node being the output alias table.
6. The method of claim 5, wherein in step 50 the data is stored, according to its type, in a distributed storage system using HBase and Elasticsearch.
7. The method of claim 6, wherein step 50 comprises the following sub-steps:
step 401, receiving the lineage data parsed from data collection, the data warehouse and the data service;
step 402, after parsing according to the different data types, packaging the data into different data objects and storing them in HBase and Elasticsearch respectively.
8. A data lineage analysis device for implementing the data lineage analysis method according to any one of claims 1 to 7.
9. The data lineage analysis device of claim 8, wherein the device comprises a data integration system, a Hive data warehouse system, an API data service system, message middleware, and a data lineage storage system; the data integration system is embedded with code plug-ins, the Hive data warehouse system with a Hook, and the API data service system with a plug-in, to implement the process of obtaining full-link data lineage, while the message middleware and the data lineage storage system obtain the data by consuming the messages.
10. The data lineage analysis device according to claim 9, wherein the data integration system comprises a data reading module, a data output module, a rate control module and a resource control module; lineage analysis mainly concerns the configuration of the data reading module and the data output module.
CN202310163717.7A 2023-02-24 2023-02-24 Data lineage analysis method and device Pending CN116010428A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310163717.7A CN116010428A (en) 2023-02-24 2023-02-24 Data lineage analysis method and device

Publications (1)

Publication Number Publication Date
CN116010428A (en) 2023-04-25

Family

ID=86033747

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310163717.7A Pending CN116010428A (en) 2023-02-24 2023-02-24 Data blood margin analysis method and device

Country Status (1)

Country Link
CN (1) CN116010428A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109582660A (en) * 2018-12-06 2019-04-05 深圳前海微众银行股份有限公司 Data consanguinity analysis method, apparatus, equipment, system and readable storage medium storing program for executing
CN110555032A (en) * 2019-09-09 2019-12-10 北京搜狐新媒体信息技术有限公司 Data blood relationship analysis method and system based on metadata
CN113326261A (en) * 2021-04-29 2021-08-31 上海淇馥信息技术有限公司 Data blood relationship extraction method and device and electronic equipment
CN114116856A (en) * 2022-01-25 2022-03-01 中电云数智科技有限公司 Field level blood relationship analysis method based on data management full link
CN115034512A (en) * 2022-07-04 2022-09-09 山东体育学院 Process optimization method, system, equipment and computer readable storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20230425)