CN116010428A - Data lineage analysis method and device - Google Patents
- Publication number
- CN116010428A (application number CN202310163717.7A)
- Authority
- CN
- China
- Prior art keywords
- data
- lineage
- hive
- node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a data lineage analysis method and device. The method obtains lineage data by analyzing big data components through a Hook mechanism, sends the lineage data to a storage system through message middleware, and constructs a data model in the storage system for storing the lineage data. The method comprises the following steps: step 10, collecting lineage data of the data integration through an embedded code plug-in and sending it to the message middleware; step 20, collecting lineage data of the Hive data warehouse through Hive Hook analysis and sending it to the message middleware; step 30, collecting lineage data of the API data service through an embedded code plug-in and sending it to the message middleware; step 40, the message middleware receiving the messages of the designated Topics and caching them; and step 50, subscribing to the messages in the message middleware by type, storing them, and providing analysis queries. The invention can collect and analyze the data lineage of a big data platform across the full link, accurately and in real time.
Description
Technical Field
The invention relates to the fields of computers, network communication technologies and big data processing, and in particular to a data lineage analysis method and device.
Background
With the maturity of big data technology represented by Hadoop, big data platforms are increasingly used by enterprises, individuals and research institutions. Such platforms provide the capability to process massive data on inexpensive commodity machines, with large numbers of DAG-structured distributed tasks cooperating on the data. However, the sheer number of tasks and the complexity of the SQL involved make the data lineage hard to observe: it is difficult to know which tables are affected when the data of a given table is modified, or from which tables that table was derived. Existing technical solutions basically fall into two categories: manually maintaining the processing links in an Excel sheet during big data development; or extracting table information from SQL with regular expressions and string splitting, which yields lineage that is neither accurate nor available in real time.
Both approaches have drawbacks. The former relies on large amounts of manual effort, is very costly, and becomes increasingly difficult to maintain over time. In the latter, because SQL can be constructed to be arbitrarily complex, correctness is hard to guarantee for complex statements, so the resulting lineage is inaccurate; moreover, regular expressions and string splitting cannot perceive the execution process, so lineage information cannot be obtained promptly.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide a data lineage analysis method and device.
To achieve the above object, the invention provides a data lineage analysis method, which obtains lineage data through Hook-mechanism analysis of big data components and sends the lineage data to a storage system through message middleware, wherein a data model is constructed in the storage system for storing the lineage data; the method comprises the following steps:
Further, in step 10, the Hook in the data integration system analyzes the lineage of the data integration system and pushes the lineage data to storage; the data integration process in step 10 is as follows:
Further, the Hive data warehouse Hook process in step 20 specifically includes the following steps:
Step 205: obtain the set of input tables; the names of the input tables are obtained by traversing ReadEntity;
Step 206: obtain the set of output tables; the names of the output tables are obtained by traversing WriteEntity;
QualifiedName = &lt;library name&gt;.&lt;table name&gt;@&lt;data source ID&gt;.
Further, the API data service is a means of accessing processed data using SQL, and step 30 comprises the following sub-steps:
Step 304: construct a notification message from the input table information and the output table information and store it in the storage system.
Further, step 302 includes the sub-steps of:
step 3021, obtaining a root node first;
step 3022, obtaining a list of all child nodes of the node, and traversing from the leftmost child node in turn;
Step 3023: judge whether the node belongs to the select-elements type; if so, traverse all child nodes of the node until the ID attribute is found. For example: ID=T4, which is an output alias table;
Step 3024: judge whether the node is of the from-source type; if so, traverse all child nodes of the node, find the root nodes of the left and right sibling subtrees of the AS-type node, and then find each subtree's final child node: the left sibling's final child node is the input table, and the right sibling's final child node is the output alias table.
Further, in step 50, the data lineage store saves the data, according to its type, in a distributed storage system employing HBase and Elasticsearch.
Further, step 50 includes the sub-steps of:
Step 401: receive the lineage data parsed from data collection, the data warehouse and the data service;
Step 402: after parsing according to the different data types, package the data into different data objects and store them in HBase and Elasticsearch respectively, to facilitate query and retrieval.
In another aspect, the invention provides a data lineage analysis device for implementing a method according to the invention.
Further, the device comprises a data integration system, a Hive data warehouse system, an API data service system, message middleware and a data lineage storage system; the data integration system, the Hive data warehouse system and the API data service system realize the process of obtaining data lineage and the process of parsing SQL into data lineage for the data service, while the message middleware and the data lineage storage system realize the process of storing the data lineage in the storage service through the message middleware.
Further, the data integration system comprises a data reading module, a data output module, a rate control module and a resource control module. Lineage analysis mainly examines the configuration of the data reading module and the data output module.
With the data lineage analysis method and device, lineage collection can be conveniently integrated into each component through the engine's Hook mechanism. Because the collection happens deep inside the component, the acquisition of data lineage is accurate and real-time. Decoupling through the message middleware avoids any performance impact on the overall process, and distributed storage can carry massive amounts of data. SQL is parsed into data lineage through the Hook process, decoupled through the message middleware, and stored in the storage system by type. The method can thus collect and analyze the data lineage of a big data platform across the full link, accurately and in real time.
Drawings
FIG. 1 illustrates the overall architecture of a data lineage analysis method and device according to an embodiment of the present invention;
FIG. 2 shows the data lineage acquisition process for the data integration of FIG. 1;
FIG. 3 shows the data lineage acquisition process for the Hive data warehouse of FIG. 1;
FIG. 4 shows the data lineage acquisition process for the data service of FIG. 1;
FIG. 5 shows the data structure of how the full-link lineage across DataX data integration, the Hive data warehouse and the data service is constructed;
FIG. 6 shows an example of the complex lineage data structure formed when a large number of tasks are executed many times.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.
Specific embodiments of the present invention are described in detail below with reference to fig. 1-6. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.
As shown in FIG. 1, in the data lineage analysis method and device provided by the embodiment of the invention, lineage data is obtained through Hook-mechanism analysis of big data components and sent to a storage system through message middleware, and a data model is constructed in the storage system for storing the lineage data. Users can obtain the lineage by accessing an interface of the storage system. A Hook is a system mechanism, originally provided in Windows to replace the "interrupt" of DOS: after a hook is registered for a particular system event, once the event occurs, the program that hooked it receives a notification from the system and can respond to the event immediately.
As shown in FIG. 1, the basic flow of the data lineage analysis method according to the embodiment of the invention is as follows:
Step 10: collect the lineage data of the data integration based on Hook analysis and send it to the message middleware. DataX is a widely used open-source data integration component; the invention extracts the key lineage data through an embedded code plug-in and sends it to the designated Topic of the message middleware (named datax_hook_topic). The purpose is to collect the lineage data structure of a data integration task. In a specific embodiment, consider an integration task that synchronizes MySQL to Hive: the task integrates data from the T1 and T2 tables of MySQL into the T3 table of Hive, so the lineage of integration task P1 (T1 -> T3) can finally be obtained. As shown in FIG. 2, the data integration acquisition process in step 10 is as follows:
In step 101, DataX starts the data reading module according to the JSON configuration, which is used to extract the data. An example of the configuration JSON is as follows:
{
"core": {
"transport": {
"channel": {
"speed": {
"byte": "1048576"
}
}
}
},
"job": {
"content": [
{
"reader": {
"parameter": {
"database": "simba_test",
"partition": "name=${bdp.system.bizdate}",
"simba": true,
"nullFormat": "\\N",
"haveKerberos": false,
"column": [
{
"index": 0,
"type": "STRING"
},
{
"index": 1,
"type": "STRING"
}
],
"encoding": "UTF-8",
"tableName": "T1"
},
"name": "hivereader",
"id": "100266"
},
"writer": {
"parameter": {
"database": "simba_test",
"postSql": [],
"haveKerberos": false,
"column": [
"`oneid`",
"`is_buy_7d`"
],
"connection": [
{
"table": [
"T3"
]
}
],
"writeMode": "replace",
"batchSize": "1024",
"preSql": []
},
"name": "mysqlwriter",
"id": "100267"
}
}
],
"setting": {
"errorLimit": {
"record": 0
},
"speed": {
"channel": 1
}
}
}
}
During execution, the JSON configuration of DataX is parsed: the MySQL input table T1 is obtained from the job.content.reader.parameter.tableName field; the Hive output table T3 is obtained from the job.content.writer.parameter.connection.table field; the MySQL library simba_test is obtained from the job.content.reader.parameter.database field; the Hive library simba_test is obtained from the job.content.writer.parameter.database field; the input data source ID is obtained from job.content.reader.id; and the output data source ID is obtained from job.content.writer.id. This completes the parsing, as shown in step 102 of FIG. 2. From these data, a unique attribute name for each table can be constructed as follows:
qualifiedName = &lt;library name&gt;.&lt;table name&gt;@&lt;data source ID&gt;
Thus T1 (QualifiedName: simba_test.T1@100266, Attributes) -> J1 (QualifiedName: J1, Attributes) -> T3 (QualifiedName: simba_test.T3@100267, Attributes) can be obtained, where Attributes are the other collected attributes, abbreviated here to highlight the main point.
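The parsing just described can be sketched in Python. The function name and return shape below are illustrative assumptions, not part of the patent; only the JSON field paths and the qualifiedName construction rule come from the text.

```python
def extract_lineage(config):
    """Build input/output qualifiedNames from a DataX job JSON.

    Applies the rule qualifiedName = <library name>.<table name>@<data source ID>
    to the reader and writer sections of the configuration.
    """
    content = config["job"]["content"][0]
    reader = content["reader"]
    writer = content["writer"]
    src = "{}.{}@{}".format(
        reader["parameter"]["database"],       # job.content.reader.parameter.database
        reader["parameter"]["tableName"],      # job.content.reader.parameter.tableName
        reader["id"],                          # input data source ID
    )
    dst = "{}.{}@{}".format(
        writer["parameter"]["database"],
        writer["parameter"]["connection"][0]["table"][0],
        writer["id"],                          # output data source ID
    )
    return {"input": src, "output": dst}
```

Applied to the configuration above, this yields `simba_test.T1@100266` as the input and `simba_test.T3@100267` as the output.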
After the data construction is completed, the message is sent to the Topic (datax_hook_topic) through the Kafka message middleware, as shown in step 103 of FIG. 2. Message middleware is used to transfer the data instead of calling the database directly because steps 101-104 execute linearly and a direct call would block the process; with message middleware the messages can be consumed asynchronously, improving the throughput of the service. Finally, as shown in step 104 of FIG. 2, the input data is written to the specified output destination.
Step 20: collect the lineage data of the Hive data warehouse based on the Hook and send it to the message middleware. Hive is a Hadoop-based data warehouse tool for data extraction, transformation and loading; it provides a mechanism to store, query and analyze large-scale data stored in Hadoop. Hive's lineage data is collected through Hook analysis and sent to the designated Topic of the message middleware (named hive_hook_topic).
as shown in FIG. 3, the Hive data warehouse Hook process in step 20, specifically, the execution process of the Hive data warehouse includes the following steps:
simba. Meta. Jobid: task ID identifying platform
simba. Meta. Taskid: job ID identifying platform
simba. Meta. Tenantid: tenant ID identifying a platform
simba. Meta. Datasourceid: data Source ID identifying platform
sim ba. Meta. ProjectId: item ID identifying platform, item ID and database in one-to-one relationship
sibba. Meta. Task dagtype: identifying DAG types for a platform
In step 202, the process is a Hive engine native process, the HQL is parsed into an AST abstract syntax tree, and the engine executes SQL.
In step 203, the Hive variables are read: the custom variables set at the start of the task are passed to the Hook plug-in. They are set first and read later because the variables of a submitted Hive task can only be passed through the set syntax in SQL, so the corresponding variables must be obtained through the HiveConf class during execution. The most essential variable is simba.meta.datasourceId, which can be assumed to be 100267; the value of simba.meta.projectId is obtained and the corresponding Hive library name derived from it, which can be assumed to be simba_test; and the value of simba.meta.taskId, which represents the unique ID of the task, can be assumed to be J2.
In step 205, the set of input tables is obtained; the names of the input tables are obtained by traversing ReadEntity, and the value can be assumed to be T3.
In step 206, the set of output tables is obtained; the names of the output tables are obtained by traversing WriteEntity, and the value can be assumed to be T4.
qualifiedName = &lt;library name&gt;.&lt;table name&gt;@&lt;data source ID&gt;
The notification message is constructed from the data obtained in steps 203-206, giving the data structure T3 (QualifiedName: simba_test.T3@100267, Attributes) -> J2 (QualifiedName: J2, Attributes) -> T4 (QualifiedName: simba_test.T4@100267, Attributes), where Attributes are the other attributes of the tables and fields collected in the process (for example, names and types), again abbreviated to highlight the main point. The message is sent to the designated Topic (hive_hook_topic) through Kafka. Note that the qualifiedName of the output table of the DataX task equals the qualifiedName of the input table of the Hive task; this is how the lineage relationship from data collection to the data warehouse is established.
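The assembly of that notification message can be sketched as follows. The helper name and message layout are illustrative assumptions; only the qualifiedName rule and the example values come from the text.

```python
def build_hive_lineage(task_id, db, ds_id, inputs, outputs):
    """Assemble the lineage notification message sent to hive_hook_topic.

    Each table name is expanded to <library name>.<table name>@<data source ID>.
    """
    def qn(table):
        return f"{db}.{table}@{ds_id}"

    return {
        "task": task_id,                        # unique task ID, e.g. J2
        "inputs": [qn(t) for t in inputs],      # from traversing ReadEntity
        "outputs": [qn(t) for t in outputs],    # from traversing WriteEntity
    }
```

With the values assumed above (task J2, library simba_test, data source 100267, input T3, output T4), the message carries input `simba_test.T3@100267` and output `simba_test.T4@100267`.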
In step 30, the API data service is a service used after the data has been processed in the data platform, and generally uses SQL as the execution configuration of the data service. As shown in FIG. 4, step 30 includes the following sub-steps:
The entire AST syntax tree is traversed with a breadth-first traversal algorithm; step 302 comprises the following sub-steps:
step 3021, obtaining a root node first;
Step 3022: obtain the list of all child nodes of the node and traverse them in order, starting from the leftmost child;
Step 3023: judge whether the node belongs to the select-elements type; if so, traverse all child nodes of the node until an ID attribute is found, ID=T5, whose value is the output alias table;
Step 3024: judge whether the node is of the from-source type; if so, traverse all child nodes of the node, find the root nodes of the left and right sibling subtrees of the AS-type node, and then find each subtree's final child node: the left sibling's final child node is the input table (T4) and the right sibling's final child node is the output alias table (T5).
Step 304: the data parsed in steps 301-303 is substituted into the qualifiedName composition formula above, giving T4 (qualifiedName: simba_test.T4@100267, Attributes) -> J3 (qualifiedName: J3, Attributes) -> T5 (qualifiedName: simba_test.T5@100267, Attributes), and this data structure is sent to the designated Topic of the message middleware (api_hook_topic). Note that the qualifiedName of the output table of the Hive task equals the qualifiedName of the input table of the API task; this is how the lineage relationship from the data warehouse to the API is established.
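Steps 3021-3024 can be illustrated with a toy AST. The `Node` class and node-kind strings below are hypothetical stand-ins for the real parse tree produced by ANTLR4; only the traversal logic mirrors the text.

```python
from collections import deque

class Node:
    """Minimal stand-in for an AST node (kind, optional value, children)."""
    def __init__(self, kind, value=None, children=None):
        self.kind = kind
        self.value = value
        self.children = children or []

def find_tables(root):
    """Breadth-first walk of the tree: at a from-source node, the AS
    clause's left subtree names the input table and its right subtree
    names the output alias table (steps 3021-3024)."""
    result = {}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node.kind == "from_source":
            as_node = next(c for c in node.children if c.kind == "AS")
            result["input"] = as_node.children[0].value   # left sibling subtree
            result["alias"] = as_node.children[1].value   # right sibling subtree
        queue.extend(node.children)
    return result
```

For a statement like `SELECT T5.* FROM T4 T5`, such a tree would yield T4 as the input table and T5 as the output alias table.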
In step 40, the message middleware uses Kafka. It is mainly used to receive the messages so that the data can be consumed asynchronously without affecting the performance of the original systems; this component is what optimizes the lineage acquisition.
In step 50, the relationship data above are saved by consuming the different Topics (datax_hook_topic, hive_hook_topic, api_hook_topic), yielding the lineage link diagram shown in FIG. 5.
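The consumption in step 50 can be sketched as a dispatcher keyed by topic. The handler shape and in-memory store stubs are assumptions for illustration; in the described system the two stores would be HBase and Elasticsearch clients.

```python
hbase_store = {}    # stand-in for an HBase client (row-oriented lookups)
es_index = {}       # stand-in for an Elasticsearch client (search/retrieval)

def consume(topic, message):
    """Route a lineage message by its topic and persist it to both stores."""
    known = {"datax_hook_topic", "hive_hook_topic", "api_hook_topic"}
    if topic not in known:
        raise ValueError(f"unexpected topic: {topic}")
    key = message["task"]
    hbase_store[key] = message           # full object for point queries
    es_index[key] = message["outputs"]   # indexed fields for retrieval
```

A real consumer would subscribe to the three topics and call `consume` for each received record.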
FIG. 5 shows the one-way lineage link formed when there is one DataX task, one Hive task and one data service task.
P1 represents the lineage of the DataX task, namely the process T1 -> J1 -> T3, expressing that the T1 table flows to the T3 table through the J1 task. Here T1 -> J1 -> T3 is a DataX task whose function is to extract the data of the MySQL table named T1 into the Hive table named T3. T1 is the source table name, T3 is the target table name, and J1 is the DataX task name.
J1, as a DataX task example, may be expressed in JSON as follows:
{
"job": {
"content": [{
"reader": {
"parameter": {
"database": "default",
"column": [{
"index": 0,
"type": "STRING"
},
{
"index": 1,
"type": "STRING"
}
],
"tableName": "T1"
},
"name": "mysqlreader",
"id": "100001"
},
"writer": {
"parameter": {
"database": "default",
"column": [
"`id`",
"`name`"
],
"connection": [{
"table": [
"T2"
]
}],
"writeMode": "replace",
},
"name": "hivewriter",
"id": "100002"
}
}]
}
}
P2 represents the lineage of the Hive task, namely the process T3 -> J2 -> T4, expressing that the T3 table flows to the T4 table through the J2 task. T3 -> J2 -> T4 is a Hive task whose function is to process the data of the Hive table named T3 and write it into the Hive table named T4. T3 is the source table name, T4 is the target table name, and J2 is the Hive task name.
J2, as a Hive task, can be expressed as an SQL statement, for example:
INSERT INTO TABLE T4 SELECT * FROM T3
The meaning of this SQL statement is to query all column data of T3 and append-insert it into the T4 table.
P3 represents the lineage of the data service task, namely the process T4 -> J3 -> T5, expressing that the T4 table flows to the T5 table through the J3 task. T4 -> J3 -> T5 is a data service task whose function is to expose the data of the Hive table named T4, through the data service, as an externally accessible interface, providing external access to the data. Here T4 is the source table name, T5 is the table alias, and J3 is the data service task.
J3, the operation of the data service task, can be expressed as an SQL statement, for example:
SELECT T5.* FROM T4 T5
The meaning of this SQL statement is to query all column data of T4 and form a temporary table with the alias T5, through which external queries are served.
FIG. 6 is a schematic diagram of the multi-way lineage links formed, on the basis of the one-way lineage link of FIG. 5, after a large number of different DataX tasks, Hive tasks and data service tasks have been executed. Relative to FIG. 5, T2, T6, T7, T8, T9, T10, T11, T12 and T13 are table names, and J4, J5, J6, J7 and J8 are Hive task names.
In addition, the embodiment of the invention also provides a data lineage analysis device, comprising a data integration system, a Hive data warehouse system, an API data service system, message middleware and a data lineage storage system; the data integration system, the Hive data warehouse system and the API data service system realize the process of obtaining data lineage and the process of parsing SQL into data lineage for the data service, while the message middleware and the data lineage storage system realize the process of storing the data lineage in the storage service through the message middleware.
The data integration system comprises a data reading module, a data output module, a rate control module and a resource control module. Lineage analysis mainly examines the configuration of the data reading module and the data output module.
The DataX data collection tasks, Hive data processing tasks and API data consumption tasks cover all the capabilities of a big data platform. As large numbers of tasks execute, the lineage analysis device produces a complex lineage map such as the one shown in FIG. 6, where T1-T13 denote different business tables and J1-J8 denote different tasks. Algorithmic analysis can be performed on this map — for example, determining which output tables the T1 table ultimately affects, or querying the full-link lineage of the data starting from any table.
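The impact query mentioned above ("which output tables does T1 ultimately affect") is a reachability search over the lineage graph. A minimal sketch, with the edge list taken from the Figure 5 example (the function name is an illustrative assumption):

```python
from collections import deque

def downstream(edges, start):
    """Return every table reachable from `start` in the lineage graph."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# The one-way link of Figure 5: T1 -> T3 -> T4 -> T5
edges = [("T1", "T3"), ("T3", "T4"), ("T4", "T5")]
```

On this edge list, `downstream(edges, "T1")` returns T3, T4 and T5 — every table ultimately affected by a change to T1. The multi-way graph of Figure 6 is handled by the same traversal over a larger edge list.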
The Hook process parses SQL into data lineage, which is decoupled through the message middleware and stored in the storage system by type.
In the description herein, reference to the term "embodiment," "example," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the different embodiments or examples described in this specification and the features therein may be combined or combined by those skilled in the art without creating contradictions.
While embodiments of the present invention have been shown and described, it will be understood that the embodiments are illustrative and not to be construed as limiting the invention, and that various changes, modifications, substitutions and alterations may be made by those skilled in the art without departing from the scope of the invention.
Claims (10)
1. A data lineage analysis method, characterized in that the method obtains lineage data through Hook-mechanism analysis of big data components, sends the lineage data to a storage system through message middleware, and constructs a data model in the storage system for storing the lineage data, the method comprising the following steps:
step 10, collecting the lineage data of the data integration through analysis by an embedded code plug-in, and sending the lineage data to the message middleware;
step 20, collecting the lineage data of the Hive data warehouse through Hive Hook analysis, and sending the lineage data to the message middleware;
step 30, collecting the lineage data of the API data service through analysis by an embedded code plug-in, and sending the lineage data to the message middleware;
step 40, the message middleware receiving the messages of the designated Topics and caching the messages; and
step 50, subscribing to the messages in the message middleware according to the type model, storing them, and providing analysis queries.
2. The data lineage analysis method according to claim 1, characterized in that in step 10, the embedded code plug-in of the data integration analyzes the lineage of the data integration and pushes the lineage data to storage, the data integration process of step 10 being as follows:
step 101, extracting the data according to the data reading module;
step 102, packaging the job information into task node information;
step 103, after the data construction is completed, sending the data to the storage system through the message middleware;
step 104, writing to the target data source according to the output write configuration.
3. The method of claim 2, wherein the Hive data warehouse hook process of step 20 comprises the steps of:
step 201, setting Hive variables for transferring platform variables related to Hive tasks;
step 202, the HQL is parsed into an AST abstract syntax tree;
step 203, reading the Hive variables and passing the platform variables to the Hook plug-in when the task starts;
step 205, obtaining the set of input tables, the names of the input tables being obtained by traversing ReadEntity;
step 206, obtaining the set of output tables, the names of the output tables being obtained by traversing WriteEntity;
step 207, constructing from the above data a unique attribute name for each table, the construction formula being:
QualifiedName = &lt;library name&gt;.&lt;table name&gt;@&lt;data source ID&gt;.
4. The data lineage analysis method according to claim 3, characterized in that the API data service is a means of accessing processed data using SQL, step 30 comprising the following sub-steps:
step 301, performing a preliminary parse of the SQL with ANTLR4;
step 302, obtaining the AST syntax tree from the parse;
step 303, traversing the multi-way AST syntax tree with a depth-first algorithm to obtain the expected input table and output table information, and storing it in lineageInfo;
step 304, constructing a notification message from the input table and output alias table information and storing it in the storage system.
5. The method of claim 4, wherein step 302 comprises the following sub-steps:
step 3021, first obtaining the root node;
step 3022, obtaining the list of all child nodes of the node, and traversing in turn from the leftmost child node;
step 3023, judging whether the node is of the selected-elements type; if so, traversing all child nodes of the node until an ID attribute is found, the value of the ID attribute being the output alias table;
and step 3024, judging whether the node is of the from-source type; if so, traversing all child nodes of the node, finding the root nodes of the left and right sibling nodes of the AS-type node, and finding the final child node of each, wherein the final child node of the left sibling is the input table and the final child node of the right sibling is the output alias table.
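Steps 3021-3024 amount to a depth-first walk over the parse tree, dispatching on node type. The sketch below uses a toy AST; the node-kind names (`select_elements`, `from_source`, `AS`, `ID`) are assumptions standing in for the ANTLR grammar's rule names, and for simplicity the input table and alias are modeled as children of the `AS` node rather than its siblings:

```python
class Node:
    """Toy AST node; kind names are illustrative, not the real grammar."""
    def __init__(self, kind, value=None, children=()):
        self.kind, self.value, self.children = kind, value, list(children)

def leaf(node):
    # The "final child node" of step 3024: descend leftmost children to a leaf.
    while node.children:
        node = node.children[0]
    return node

def first_id(node):
    # Step 3023: search beneath the node until an ID attribute is found.
    if node.kind == "ID":
        return node.value
    for child in node.children:
        found = first_id(child)
        if found:
            return found
    return None

def extract(root):
    info = {"inputs": [], "aliases": []}
    stack = [root]                                    # step 3021: root first
    while stack:
        node = stack.pop()
        if node.kind == "select_elements":            # step 3023
            alias = first_id(node)
            if alias:
                info["aliases"].append(alias)
        elif node.kind == "from_source":              # step 3024
            for child in node.children:
                if child.kind == "AS":
                    left, right = child.children
                    info["inputs"].append(leaf(left).value)
                    info["aliases"].append(leaf(right).value)
        stack.extend(reversed(node.children))         # step 3022: left to right
    return info

# Tree for something like: SELECT amount AS total FROM t_orders AS o
tree = Node("root", children=[
    Node("select_elements", children=[Node("ID", "total")]),
    Node("from_source", children=[
        Node("AS", children=[Node("ID", "t_orders"), Node("ID", "o")]),
    ]),
])
lineage = extract(tree)
```

Pushing children onto the stack in reverse keeps the visit order leftmost-first, as step 3022 requires.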
6. The method of claim 5, wherein in step 50, the data is stored in a distributed storage system using HBase and Elasticsearch, according to the different data types.
7. The method of claim 6, wherein step 50 comprises the following sub-steps:
step 401, receiving the lineage data parsed from the data acquisition, the data warehouse, and the data service;
step 402, after parsing according to the different data types, packaging the data into different data objects and storing them in HBase and Elasticsearch respectively.
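Steps 401-402 can be sketched as a type-dispatched store. The routing rule (which lineage type goes to which backend) and the message fields are illustrative assumptions; dictionaries stand in for the HBase and Elasticsearch clients:

```python
# In-memory stand-ins for the two backends of claim 6.
hbase, es = {}, {}

def store_lineage(message):
    """Steps 401-402: package parsed lineage into a data object and
    route it to a backend by its source type (routing rule assumed)."""
    obj = {"type": message["type"], "edges": message["edges"]}
    if message["type"] in ("integration", "hive"):
        hbase[message["id"]] = obj   # key-value lookups by row key
    else:
        es[message["id"]] = obj      # search-style lineage queries

store_lineage({"id": "e1", "type": "hive", "edges": [("src.t", "dw.t")]})
store_lineage({"id": "e2", "type": "api",  "edges": [("dw.t", "api.v")]})
```

Splitting the store this way plays to each system's strength: HBase serves point lookups of a table's direct lineage, while Elasticsearch serves free-text and aggregate queries over the graph.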
8. A data lineage analysis device for implementing the data lineage analysis method according to any one of claims 1 to 7.
9. The data lineage analysis device of claim 8, wherein the device comprises a data integration system, a Hive data warehouse system, an API data service system, message middleware, and a data lineage storage system; the data integration system is embedded with code, the Hive data warehouse system with a Hook, and the API data service system with a plug-in, so as to realize the acquisition of full-link data lineage; the message middleware and the data lineage storage system acquire the data by consuming the messages.
10. The data lineage analysis device according to claim 9, wherein the data integration system comprises a data reading module, a data output module, a rate control module, and a resource control module; the lineage analysis mainly parses the configuration of the data reading module and the data output module.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310163717.7A CN116010428A (en) | 2023-02-24 | 2023-02-24 | Data blood margin analysis method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116010428A true CN116010428A (en) | 2023-04-25 |
Family
ID=86033747
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310163717.7A Pending CN116010428A (en) | 2023-02-24 | 2023-02-24 | Data blood margin analysis method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116010428A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109582660A (en) * | 2018-12-06 | 2019-04-05 | 深圳前海微众银行股份有限公司 | Data lineage analysis method, apparatus, device, system and readable storage medium |
CN110555032A (en) * | 2019-09-09 | 2019-12-10 | 北京搜狐新媒体信息技术有限公司 | Data blood relationship analysis method and system based on metadata |
CN113326261A (en) * | 2021-04-29 | 2021-08-31 | 上海淇馥信息技术有限公司 | Data blood relationship extraction method and device and electronic equipment |
CN114116856A (en) * | 2022-01-25 | 2022-03-01 | 中电云数智科技有限公司 | Field level blood relationship analysis method based on data management full link |
CN115034512A (en) * | 2022-07-04 | 2022-09-09 | 山东体育学院 | Process optimization method, system, equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20230425 |