CN115098600A - Directed acyclic graph construction method and device for data warehouse and computer equipment - Google Patents
- Publication number
- CN115098600A CN202210708843.1A
- Authority
- CN
- China
- Prior art keywords
- directed acyclic
- acyclic graph
- information
- task
- subtask
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Program initiating; Program switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application relates to the technical field of big-data data warehouses, and provides a directed acyclic graph construction method and device for a data warehouse and computer equipment. The method comprises the following steps: acquiring the lineage relationship of each model table in a data warehouse; acquiring metadata of each scheduling system; acquiring, according to the metadata, subtask information corresponding to each model table in each scheduling system, wherein the subtask information comprises task names and task execution time information; and generating the directed acyclic graph by taking the model tables as nodes, the lineage relationships of the model tables as edges, and the information of the model tables and the subtask information as node attributes. The method, the device and the computer equipment yield a DAG that covers the complete data warehouse task and its task dependencies, which facilitates optimization of data warehouse tasks.
Description
Technical Field
The application relates to the technical field of big-data data warehouses, and in particular to a directed acyclic graph construction method and device for a data warehouse and computer equipment.
Background
The execution of complex tasks in a data warehouse typically proceeds in the order defined by a directed acyclic graph (DAG). Generally, before executing a data warehouse task, the scheduling system of the data warehouse constructs the DAG on which the task depends to facilitate execution. However, a complete data warehouse task runs through a data access stage, a data analysis stage, and a data display stage, so the task is usually distributed across a plurality of scheduling systems. This makes it difficult to directly obtain the DAG on which the complete task depends, which in turn makes it difficult to analyze the critical nodes of the task and hinders optimization of data warehouse tasks.
Disclosure of Invention
The present application provides a directed acyclic graph construction method and device for a data warehouse and computer equipment, aiming to solve the technical problem that a DAG covering the complete data warehouse task and its task dependencies is difficult to obtain, which hinders optimization of data warehouse tasks.
In order to achieve the above object, the present application provides a directed acyclic graph construction method for a data warehouse, including:
acquiring the lineage relationship of each model table in a data warehouse;
acquiring metadata of each scheduling system;
acquiring, according to the metadata, subtask information corresponding to each model table in each scheduling system, wherein the subtask information comprises task names and task execution time information;
and generating the directed acyclic graph by taking the model tables as nodes, the lineage relationships of the model tables as edges, and the information of the model tables and the subtask information as node attributes.
In some embodiments, after the generating the directed acyclic graph with the model tables as nodes, the lineage relationships of the model tables as edges, and the information of the model tables and the subtask information as node attributes, the directed acyclic graph construction method further includes:
calculating the critical nodes of the directed acyclic graph through a weighted critical path algorithm, taking the task execution time information as the weight.
In some embodiments, the generating the directed acyclic graph with the model tables as nodes, the lineage relationships of the model tables as edges, and the information of the model tables and the subtask information as node attributes includes:
setting each model table as a node, setting the lineage relationship of each model table as an edge, and finding, among the nodes, the source node and the destination node corresponding to each edge;
marking the edge as an outgoing edge of the corresponding source node;
marking the edge as an incoming edge of the corresponding destination node;
and adding the information of each model table and the subtask information to the corresponding node attributes to generate the directed acyclic graph.
In some embodiments, the task execution time information includes at least one of a task start time, a task end time, and a task scheduling plan time.
In some embodiments, the lineage relationships of the model tables are obtained through Apache Atlas.
In some embodiments, the obtaining the critical nodes of the directed acyclic graph includes:
calculating the critical nodes through a weighted critical node algorithm of the directed acyclic graph.
The present application further provides a directed acyclic graph constructing apparatus for a data warehouse, including:
a lineage relationship acquisition module, configured to acquire the lineage relationship of each model table in the data warehouse;
a metadata acquisition module, configured to acquire metadata of each scheduling system;
a subtask information acquisition module, configured to acquire, according to the metadata, subtask information corresponding to each model table in each scheduling system, where the subtask information includes a subtask name and subtask execution time information;
and a directed acyclic graph generation module, configured to generate the directed acyclic graph by taking the model tables as nodes, the lineage relationships of the model tables as edges, and the information of the model tables and the subtask information as node attributes.
In some embodiments, the directed acyclic graph constructing apparatus for a data warehouse further comprises:
a critical node acquisition module, configured to calculate the critical nodes of the directed acyclic graph through a weighted critical path algorithm, taking the task execution time information as the weight.
In some embodiments, the task execution time information includes at least one of a task start time, a task end time, and a task scheduling plan time.
The present application further provides a computer device, which includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the directed acyclic graph constructing method for a data warehouse provided in any one of the embodiments when executing the computer program.
The present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the directed acyclic graph building method for a data warehouse provided in any one of the above embodiments.
The directed acyclic graph construction method and device for a data warehouse and the computer equipment acquire the lineage relationship of each model table in the data warehouse; acquire metadata of each scheduling system; acquire, according to the metadata, subtask information corresponding to each model table in each scheduling system, wherein the subtask information comprises a subtask name and subtask execution time information; and generate the directed acyclic graph by taking the model tables as nodes, the lineage relationships of the model tables as edges, and the information of the model tables and the subtask information as node attributes. The acquired lineage relationships between the model tables and the metadata of the scheduling systems establish the relationship between the model tables and the subtasks, so that a complete task dependency lineage is established and a complete task-dependency directed acyclic graph is formed, which facilitates optimization of data warehouse tasks.
Drawings
FIG. 1 is a flowchart illustrating a directed acyclic graph creation method for a data warehouse according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of step S40 in a directed acyclic graph creating method for a data warehouse according to an embodiment of the present application;
FIG. 3 is a directed acyclic graph according to an embodiment of the present application;
FIG. 4 is a directed acyclic graph according to another embodiment of the present application;
FIG. 5 is a schematic flowchart of a directed acyclic graph creating method for a data warehouse according to another embodiment of the present application;
FIG. 6 is a flow chart illustrating an example of the weighted critical path algorithm according to the present application;
FIG. 7 is a block diagram illustrating an exemplary structure of an apparatus for creating a directed acyclic graph for a data warehouse according to an embodiment of the present disclosure;
FIG. 8 is a block diagram illustrating a structure of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
A data warehouse (DW) is a data storage collection used for screening and integrating various types of business data and for providing data support for decisions at all levels of an enterprise. Its inputs are various data sources (such as various databases), and its outputs ultimately serve the enterprise's data analysis, data mining, data reporting and the like.
The data hierarchy of a typical data warehouse is generally divided into three layers: a data mart layer, a data middle layer, and a data base layer. The data base layer mainly performs the following work: data acquisition, namely collecting data from different data sources onto one platform; data cleaning, namely cleaning data that does not meet quality requirements so that dirty data does not take part in subsequent computation; data classification, namely establishing a data catalog, generally classified at the base layer by source system and business domain; data structuring, namely structuring semi-structured and unstructured data; and data normalization, namely performing normalization operations such as standardizing dimension identifiers and unifying units of measurement. The most important goal of the data middle layer is to connect data about the same entity that comes from different sources, because under current business forms the data of one entity may be scattered across different systems and sources, and those sources may identify the same entity differently. In addition, the data middle layer can abstract relationships from behaviors; the basic relationships abstracted from behaviors can become an important data dependency for future upper-layer applications. In the middle layer, an appropriate amount of data redundancy is often introduced to ensure the completeness of a subject or to make the data easier to use. The data mart layer is usually built in response to demand scenarios, with each mart constructed vertically. At the data mart layer, the value of the data can be mined in depth.
Because of this hierarchical structure, a complete data warehouse task is usually distributed across multiple scheduling systems, which makes it difficult to directly obtain the DAG on which the complete task depends, makes analysis of the task's critical nodes difficult, and hinders optimization of the data warehouse task.
In order to solve the above problem, referring to fig. 1, an embodiment of the present application provides a directed acyclic graph building method for a data warehouse, including steps S10-S70, and details of each step of the directed acyclic graph building method are described as follows.
In one embodiment, the directed acyclic graph building method includes:
S10, acquiring the lineage relationship of each model table in the data warehouse;
S20, acquiring metadata of each scheduling system;
S30, acquiring, according to the metadata, subtask information corresponding to each model table in each scheduling system, wherein the subtask information comprises a subtask name and subtask execution time information;
and S40, generating the directed acyclic graph by taking the model tables as nodes, the lineage relationships of the model tables as edges, and the information of the model tables and the subtask information as node attributes.
As described in step S10, the lineage relationship of the model tables is the map of relationships among the model tables. For example, taking model table C as the target model table, if the data of model table C is derived from model table B and the data of model table B is derived from model table A, the lineage relationship among model tables A, B and C can be represented as: A → B → C. The lineage relationship of each model table can be obtained with existing lineage management tools such as Apache Atlas, salmon, Huashi cloud and the like. In this embodiment, Apache Atlas is used to obtain the lineage relationship of each model table. Apache Atlas is an open-source metadata management tool that is widely used in the big data field. It supports extracting and managing metadata from HBase, Hive, Sqoop, Storm and Kafka, allows custom metadata models to be defined through a REST API to generate metadata, and is simple and quick to deploy and operate.
As described in steps S20-S30 above, metadata is obtained from the metadata database of each scheduling system, and the subtask information corresponding to each model table in each scheduling system is derived from it in reverse. The subtask information includes the subtask name and the subtask execution time information. In one embodiment, the task execution time information includes a task start time, a task end time, and a task scheduling plan time.
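Steps S20-S30 can be illustrated with a minimal sketch. It assumes that the metadata of each scheduling system has already been exported as rows recording which model table each subtask writes. The application does not specify a metadata schema, so the scheduler names and every field name here (`output_table`, `task_name`, `start`, `end`, `planned`) are hypothetical.

```python
from collections import defaultdict

# Hypothetical metadata rows, one per subtask, gathered from two scheduling systems.
scheduler_metadata = {
    "ingest_scheduler": [
        {"task_name": "load_a", "output_table": "A",
         "start": "2022-06-01T01:00:00", "end": "2022-06-01T01:20:00",
         "planned": "2022-06-01T01:00:00"},
    ],
    "analysis_scheduler": [
        {"task_name": "build_b", "output_table": "B",
         "start": "2022-06-01T01:30:00", "end": "2022-06-01T02:10:00",
         "planned": "2022-06-01T01:30:00"},
    ],
}


def subtasks_by_table(metadata):
    """Group subtask information (name plus execution times) by model table."""
    result = defaultdict(list)
    for scheduler, rows in metadata.items():
        for row in rows:
            result[row["output_table"]].append({
                "scheduler": scheduler,
                "task_name": row["task_name"],
                "start": row["start"],
                "end": row["end"],
                "planned": row["planned"],
            })
    return dict(result)


subtask_info = subtasks_by_table(scheduler_metadata)
```

Keying the result by model table is what lets subtasks from different scheduling systems be attached to the same graph later on.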
In the data warehouse field, metadata can be divided by usage into technical metadata and business metadata. Metadata serves two purposes: first, it provides information for users; second, it supports the system's management and maintenance of data. Specifically, in a data warehouse system, the metadata mechanism mainly supports the following system management functions: describing which data is in the data warehouse; defining the data that enters the data warehouse and the data generated from it; recording the schedule of data extraction work carried out as business events occur; recording and checking the requirements for, and the enforcement of, system data consistency; and measuring data quality.
As described in the above step S40, referring to fig. 2, in an embodiment, the step S40 specifically includes the following steps:
S401, setting each model table as a node, setting the lineage relationship of each model table as an edge, and finding, among the nodes, the source node and the destination node corresponding to each edge;
S402, marking the edge as an outgoing edge of the corresponding source node;
S403, marking the edge as an incoming edge of the corresponding destination node;
S404, adding the information of each model table and the subtask information to the corresponding node attributes to generate the directed acyclic graph.
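Steps S401-S404 can be sketched as follows. This is one possible in-memory layout and not the structure used by the application; the function name `build_dag` and the keys `out_edges`, `in_edges`, `attrs`, `table_info` and `subtasks` are illustrative assumptions.

```python
def build_dag(tables, lineage, table_info, subtask_info):
    """Build a graph whose nodes are model tables and whose edges are lineage pairs."""
    # S401: one node per model table; S404: attach table and subtask attributes.
    nodes = {
        t: {
            "out_edges": [],
            "in_edges": [],
            "attrs": {
                "table_info": table_info.get(t, {}),
                "subtasks": subtask_info.get(t, []),
            },
        }
        for t in tables
    }
    for src, dst in lineage:
        if src not in nodes or dst not in nodes:
            raise KeyError(f"lineage edge {src}->{dst} references an unknown table")
        edge = (src, dst)
        nodes[src]["out_edges"].append(edge)  # S402: outgoing edge of the source node
        nodes[dst]["in_edges"].append(edge)   # S403: incoming edge of the destination node
    return nodes


dag = build_dag(
    tables=["A", "B", "C"],
    lineage=[("A", "B"), ("B", "C")],
    table_info={"A": {"layer": "base"}},          # hypothetical table information
    subtask_info={"B": [{"task_name": "build_b"}]},
)
```

This sketch does not itself verify that the lineage is acyclic; that check would be a separate pass over the edges.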
As described above in steps S401-S404, a directed acyclic graph is a finite directed graph with no directed cycles. Specifically, it consists of a finite number of nodes and directed edges, each directed edge pointing from one node to another, such that starting from any node it is impossible to return to that node by following the directed edges. Illustratively, referring to FIG. 3, assume that the data warehouse includes model tables A, B, C, D, E and F, and that the lineage relationships of the model tables are A → B → D → F and A → C → E → F. From these lineage relationships, the directed acyclic graph for the data warehouse shown in FIG. 3 can be derived.
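The "no directed cycles" property can be checked on the example lineage A → B → D → F and A → C → E → F. The sketch below uses Kahn's algorithm, a standard technique that the application does not name; the function name `topological_order` is an assumption for illustration.

```python
from collections import deque


def topological_order(nodes, edges):
    """Return a topological order of the nodes, or raise ValueError if a cycle exists."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    # Start from the nodes that nothing points to.
    queue = deque(n for n in nodes if indegree[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in successors[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    # Any node left unvisited lies on a directed cycle.
    if len(order) != len(nodes):
        raise ValueError("lineage contains a cycle; not a DAG")
    return order


nodes = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "D"), ("D", "F"), ("A", "C"), ("C", "E"), ("E", "F")]
order = topological_order(nodes, edges)
```

For this example the function returns an order in which every table appears after all the tables it is derived from, which confirms that the lineage forms a DAG.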
The node attributes of any node in the directed acyclic graph comprise the table information and the subtask information of the model table corresponding to that node. The subtask information comprises a subtask name and subtask execution time information, and the subtask execution time information comprises at least one of a task start time, a task end time and a task scheduling plan time. Referring to FIG. 4, assume that in the directed acyclic graph shown in FIG. 4, node A (model table A) includes subtask a, node B (model table B) includes subtask b, node C (model table C) includes subtask c, and node D (model table D) includes subtask d. Because lineage relationships exist among the model tables, and the tasks corresponding to the model tables are executed in the order given by those lineage relationships, once a task in the data warehouse is associated with the nodes of the directed acyclic graph shown in FIG. 4, the task dependency lineage of subtasks a, b, c and d can be obtained from the graph, which yields the complete task-dependency directed acyclic graph.
The directed acyclic graph construction method for the data warehouse acquires the lineage relationship of each model table in the data warehouse; acquires metadata of each scheduling system; acquires, according to the metadata, subtask information corresponding to each model table in each scheduling system, wherein the subtask information comprises a subtask name and subtask execution time information; and generates the directed acyclic graph by taking the model tables as nodes, the lineage relationships of the model tables as edges, and the information of the model tables and the subtask information as node attributes. Acquiring the lineage relationships between the model tables and the metadata of the scheduling systems clarifies the relationship between the model tables and the subtasks, and a directed acyclic graph is established on that basis, so that a complete task dependency lineage is established, which facilitates optimization of data warehouse tasks.
In an embodiment, referring to fig. 5, a method for constructing a directed acyclic graph for a data warehouse further includes the following steps:
S50, calculating the critical nodes of the directed acyclic graph through a weighted critical path algorithm, taking the task execution time information as the weight.
As described in step S50, in a directed acyclic graph with exactly one node of in-degree 0 (called the source) and exactly one node of out-degree 0 (called the sink), the longest path from the source to the sink is called the critical path, and the nodes on the critical path are the critical nodes. Note that the total time along the critical path is the shortest time in which the whole task can be completed. Therefore, to shorten the execution time of the task (i.e., to optimize the task), the critical nodes on the critical path must be optimized so that the execution time of their subtasks is shortened.
In this embodiment, the task execution time information may include a task start time, a task end time, and a task scheduling plan time. The directed acyclic graph on which the task depends represents one complete task of the data warehouse, a directed edge represents a lineage relationship between model tables, the weight of a node may represent the time required to complete that node's task, and the sum of the weights along a critical path from the source to the sink represents the planned time for completing the entire task.
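One way to turn the task execution time information into a node weight is to take the elapsed time between the task start time and the task end time. The sketch below assumes ISO 8601 timestamps and hypothetical field names `start` and `end`; summing the durations of several subtasks on one node is also an assumption, since the application does not say how multiple subtasks per node are combined.

```python
from datetime import datetime


def node_weight_minutes(subtasks):
    """Sum of (end - start) over a node's subtasks, in minutes."""
    total = 0.0
    for s in subtasks:
        start = datetime.fromisoformat(s["start"])
        end = datetime.fromisoformat(s["end"])
        total += (end - start).total_seconds() / 60
    return total


# A single subtask that ran from 01:30 to 02:10.
weight = node_weight_minutes([
    {"start": "2022-06-01T01:30:00", "end": "2022-06-01T02:10:00"},
])
```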
Referring to fig. 6, the weighted critical path algorithm adopted in the present embodiment specifically includes the following steps:
S501, performing a topological sort of the nodes of the directed acyclic graph;
S502, obtaining the earliest start time of each node according to the topological order;
S503, computing the reverse topological order of the nodes of the directed acyclic graph;
S504, obtaining the latest occurrence time of each node according to the reverse topological order;
S505, obtaining the earliest occurrence time of the subtask in each node, wherein the earliest occurrence time of a subtask is the earliest start time of the node corresponding to the subtask;
S506, calculating the latest occurrence time of each subtask, wherein the latest occurrence time of a subtask is the difference between the latest occurrence time of the next node after the node corresponding to the subtask and the completion time of the subtask;
and S507, setting as critical nodes the nodes whose subtasks have an earliest occurrence time equal to their latest occurrence time.
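Steps S501-S507 can be sketched for a node-weighted graph as follows. This is an interpretation of the steps above and not the application's own implementation: the function name `critical_nodes` is hypothetical, the weights are illustrative integers (chosen so that the equality test in S507 is exact), and the latest time of a node with several successors is taken as the minimum over those successors.

```python
from collections import deque


def critical_nodes(weights, edges):
    """Return (critical nodes in topological order, total duration) for a node-weighted DAG."""
    succ = {n: [] for n in weights}
    pred = {n: [] for n in weights}
    for src, dst in edges:
        succ[src].append(dst)
        pred[dst].append(src)

    # S501: topological sort (Kahn's algorithm).
    indegree = {n: len(pred[n]) for n in weights}
    queue = deque(n for n in weights if indegree[n] == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for m in succ[n]:
            indegree[m] -= 1
            if indegree[m] == 0:
                queue.append(m)
    if len(order) != len(weights):
        raise ValueError("graph contains a cycle")

    # S502/S505: a subtask can start once every upstream subtask has finished.
    earliest = {}
    for n in order:
        earliest[n] = max((earliest[p] + weights[p] for p in pred[n]), default=0)
    total = max(earliest[n] + weights[n] for n in order)

    # S503/S504/S506: walk the reverse order; the latest start is the latest
    # start of the next node(s) minus this subtask's own completion time.
    latest = {}
    for n in reversed(order):
        finish_by = min((latest[s] for s in succ[n]), default=total)
        latest[n] = finish_by - weights[n]

    # S507: nodes with no slack are critical.
    return [n for n in order if earliest[n] == latest[n]], total


weights = {"A": 10, "B": 20, "C": 5, "D": 15, "E": 5, "F": 10}  # illustrative minutes
edges = [("A", "B"), ("B", "D"), ("D", "F"), ("A", "C"), ("C", "E"), ("E", "F")]
critical, total = critical_nodes(weights, edges)
```

With these illustrative weights the path A → B → D → F takes 55 minutes and A → C → E → F takes 30, so the first path is critical and nodes C and E have slack.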
In an embodiment, after the critical nodes of the directed acyclic graph are calculated, the method further includes:
optimizing the critical nodes to shorten the execution time of the task.
Because the critical path is the longest path from the source to the sink of the directed acyclic graph, the total time along the critical path is also the shortest time required to complete the task. Therefore, to shorten the execution time of the task, the critical nodes on the critical path need to be optimized (that is, the subtasks in the critical nodes are optimized so that their completion time is shortened). This shortens the execution time of a complete task in the data warehouse, improves the execution efficiency of data warehouse tasks, and improves the performance of the data warehouse.
Illustratively, the critical nodes may be optimized in the following respects: model optimization, such as deleting unused fields, selecting data types with a small storage footprint, and storing data hierarchically; code optimization, such as, in multi-table joins, joining first the tables that reduce the data volume, and using approximate computation in scenarios where accuracy requirements are less strict; environment optimization, such as adjusting schedules so that tasks run during off-peak periods and avoid cluster resource shortages, and keeping the cluster stable while tasks run so as to avoid task retries caused by downtime or resource recovery; and external optimization, such as reducing the processing complexity of tasks by adjusting upstream and downstream tasks.
Referring to fig. 7, an embodiment of the present application further provides a directed acyclic graph constructing apparatus for a data warehouse, including:
a lineage relationship obtaining module 701, configured to obtain the lineage relationship of each model table in the data warehouse;
a metadata obtaining module 702, configured to obtain metadata of each scheduling system;
a subtask information obtaining module 703, configured to obtain, according to the metadata, subtask information corresponding to each model table in each scheduling system, where the subtask information includes a subtask name and subtask execution time information;
and a directed acyclic graph generating module 704, configured to generate the directed acyclic graph by taking the model tables as nodes, the lineage relationships of the model tables as edges, and the information of the model tables and the subtask information as node attributes.
In this embodiment, the lineage relationship of the model tables is the map of relationships among the model tables. For example, taking model table C as the target model table, if the data of model table C is derived from model table B and the data of model table B is derived from model table A, the lineage relationship among model tables A, B and C can be represented as: A → B → C. The lineage relationship of each model table can be obtained with existing open-source lineage management tools such as Apache Atlas. In this embodiment, Atlas is used to obtain the lineage relationship of each model table. It should be noted that obtaining the lineage relationship of each model table through Atlas is prior art and is therefore not described again here.
Metadata is obtained from the metadata database of each scheduling system, and the subtask information corresponding to each model table in each scheduling system is derived from it in reverse, wherein the subtask information comprises task names and task execution time information. In one embodiment, the task execution time information includes a task start time, a task end time, and a task scheduling plan time.
In the data warehouse field, metadata can be divided by usage into technical metadata and business metadata. Metadata serves two purposes: first, it provides information for users; second, it supports the system's management and maintenance of data. Specifically, in a data warehouse system, the metadata mechanism mainly supports the following system management functions: describing which data is in the data warehouse; defining the data that enters the data warehouse and the data generated from it; recording the schedule of data extraction work carried out as business events occur; recording and checking the requirements for, and the enforcement of, system data consistency; and measuring data quality.
In this embodiment, the step of generating the directed acyclic graph by the directed acyclic graph generating module 704 specifically includes:
(1) setting each model table as a node, setting the lineage relationship of each model table as an edge, and finding, among the nodes, the source node and the destination node corresponding to each edge;
(2) marking the edge as an outgoing edge of the corresponding source node;
(3) marking the edge as an incoming edge of the corresponding destination node;
(4) adding the information of each model table and the subtask information as the corresponding node attributes to generate the directed acyclic graph.
In an embodiment, the apparatus for constructing a directed acyclic graph for a data warehouse further includes:
and a key node obtaining module 705, configured to calculate and obtain a key node of the directed acyclic graph through a weighted key path algorithm by using the task execution time information as a weight.
In a directed acyclic graph, assuming that there is only one node with an in-degree of 0 (called a source) and one node with an out-degree of 0 (called a sink), the longest path from the source to the sink is called a critical path.
In this embodiment, the task execution time information may include a task start time, a task end time, and a task planning time. The directed acyclic graph of task dependencies represents one complete task of the data warehouse, and each directed edge represents a blood relationship between model tables. The weight on a node may represent the time required to complete that node's subtask, and the sum of the weights along a critical path from the source to the sink represents the planned time for completing the entire task.
The weighted critical path algorithm adopted in this embodiment specifically includes: topologically sorting the nodes of the directed acyclic graph; obtaining the earliest start time of each node according to the topological order; computing the reverse topological order of the nodes; obtaining the latest occurrence time of each node according to the reverse topological order; obtaining the earliest occurrence time of the subtask in each node, which is the earliest start time of the node corresponding to the subtask; calculating the latest occurrence time of each subtask, which is the latest occurrence time of the next node after the node corresponding to the subtask minus the completion time of the subtask; and setting as key nodes those nodes whose subtask has equal earliest and latest occurrence times.
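The procedure above can be sketched for a node-weighted graph as follows. This is a simplified reading, not the exact patented procedure: it assumes each node carries a single numeric weight (the subtask's duration, here in minutes), that `out_edges` lists every node as a key, and that the function names and sample graph are hypothetical.

```python
from collections import deque


def topological_order(out_edges):
    """Kahn's algorithm; out_edges maps every node to its list of successors."""
    indegree = {n: 0 for n in out_edges}
    for succs in out_edges.values():
        for s in succs:
            indegree[s] += 1
    queue = deque(n for n, d in indegree.items() if d == 0)
    order = []
    while queue:
        n = queue.popleft()
        order.append(n)
        for s in out_edges[n]:
            indegree[s] -= 1
            if indegree[s] == 0:
                queue.append(s)
    if len(order) != len(out_edges):
        raise ValueError("graph has a cycle")
    return order


def key_nodes(out_edges, weight):
    """Return (key_nodes, total_time) for a DAG whose nodes carry durations."""
    order = topological_order(out_edges)

    # Forward pass: earliest start = latest finish among predecessors.
    earliest = {n: 0 for n in order}
    for n in order:
        for s in out_edges[n]:
            earliest[s] = max(earliest[s], earliest[n] + weight[n])
    total = max(earliest[n] + weight[n] for n in order)

    # Backward pass: latest start = earliest latest-start of successors,
    # minus this node's own duration (sinks must finish by `total`).
    latest = {}
    for n in reversed(order):
        succ_latest = [latest[s] for s in out_edges[n]]
        latest[n] = (min(succ_latest) if succ_latest else total) - weight[n]

    # Nodes with no slack (earliest == latest) are the key nodes.
    return [n for n in order if earliest[n] == latest[n]], total


out_edges = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
weight = {"A": 10, "B": 30, "C": 5, "D": 20}  # durations in minutes
print(key_nodes(out_edges, weight))  # (['A', 'B', 'D'], 60)
```

Integer weights are used so that the equality test between earliest and latest times is exact; with floating-point durations a small tolerance would be needed.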
It can be understood that each component of the directed acyclic graph building apparatus for a data warehouse provided in the present application may implement the function of any one of the directed acyclic graph building methods for a data warehouse provided in any one of the embodiments described above, and a specific structure is not described again.
Referring to FIG. 8, an embodiment of the present application further provides a computer device, whose internal structure may be as shown in FIG. 8. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device comprises a storage medium and an internal memory. The storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the storage medium. The database of the computer device stores data related to the directed acyclic graph building method for a data warehouse. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements the directed acyclic graph building method for a data warehouse.
The processor executes the steps of the directed acyclic graph building method for a data warehouse, and the steps include: acquiring the blood relationship of each model table in a data warehouse; acquiring metadata of each scheduling system; acquiring corresponding subtask information of each model table in each scheduling system according to the metadata, wherein the subtask information comprises task names and task execution time information; and generating the directed acyclic graph by taking the model tables as nodes, the blood relationship of the model tables as edges, and the information of the model tables and the subtask information as node attributes.
In this embodiment, the blood relationship (data lineage) of the model tables is a relationship map among the model tables. For example, taking model table C as the target model table, if the data of model table C is derived from model table B and the data of model table B is derived from model table A, the blood relationship among model tables A, B, and C can be represented as: A → B → C. The blood relationship of each model table can be obtained with an existing open-source lineage management tool such as Apache Atlas. In this example, Atlas is used to obtain the blood relationship of each model table. Obtaining the blood relationship of each model table through Atlas is prior art and is therefore not described here.
Metadata is acquired from the metadata repository of each scheduling system, and the subtask information corresponding to each model table in each scheduling system is then looked up in reverse from that metadata. The subtask information comprises task names and task execution time information. In one embodiment, the task execution time information includes a task start time, a task end time, and a task scheduling plan time.
In the data warehouse field, metadata can be divided into technical metadata and business metadata according to usage. First, metadata can provide user-based information; second, metadata can support the management and maintenance of data by the system. Specifically, in a data warehouse system, the metadata mechanism mainly supports the following system management functions: describing which data is in the data warehouse; defining the data to be loaded into the data warehouse and the data to be generated from it; recording the schedule of data extraction jobs triggered by business events; recording and checking the requirements for system data consistency and how they are met; and measuring data quality.
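For illustration, the A → B → C example can be written as a list of (source, destination) pairs, from which every upstream table of a target can be collected. How such pairs are exported from a lineage tool is not shown here; the function name `upstream_tables` and the pair format are assumptions made for this sketch.

```python
def upstream_tables(lineage, target):
    """Return every table that `target` depends on, directly or indirectly."""
    # Index the direct parents of each table.
    parents = {}
    for src, dst in lineage:
        parents.setdefault(dst, set()).add(src)

    # Walk backwards from the target, collecting each ancestor once.
    seen = set()
    stack = [target]
    while stack:
        table = stack.pop()
        for parent in parents.get(table, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


lineage = [("A", "B"), ("B", "C")]  # A -> B -> C
print(sorted(upstream_tables(lineage, "C")))  # ['A', 'B']
```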
In an embodiment, the steps of the directed acyclic graph building method for a data warehouse executed by the processor further include: calculating the key nodes of the directed acyclic graph through a weighted critical path algorithm, using the task execution time information as the weight.
In the directed acyclic graph, assuming that there is only one node with an in-degree of 0 (called the source) and one node with an out-degree of 0 (called the sink), the longest path from the source to the sink is called the critical path, and the nodes it traverses are the key nodes. It should be noted that the total time along the critical path is the shortest time in which the task can be completed. To shorten the execution time of the task (that is, to optimize the task), the key nodes on the critical path must be optimized so that their subtasks run faster.
In this embodiment, the task execution time information may include a task start time, a task end time, and a task planning time. The directed acyclic graph of task dependencies represents one complete task of the data warehouse, and each directed edge represents a blood relationship between model tables. The weight on a node may represent the time required to complete that node's subtask, and the sum of the weights along a critical path from the source to the sink represents the planned time for completing the entire task.
The steps of the weighted critical path algorithm adopted in this embodiment specifically include: topologically sorting the nodes of the directed acyclic graph; obtaining the earliest start time of each node according to the topological order; computing the reverse topological order of the nodes; obtaining the latest occurrence time of each node according to the reverse topological order; obtaining the earliest occurrence time of the subtask in each node, which is the earliest start time of the node corresponding to the subtask; calculating the latest occurrence time of each subtask, which is the latest occurrence time of the next node after the node corresponding to the subtask minus the completion time of the subtask; and setting as key nodes those nodes whose subtask has equal earliest and latest occurrence times.
In an embodiment, after the processor performs the steps of the directed acyclic graph building method for a data warehouse, the processor further performs the following steps: optimizing the key node to shorten the execution time of the task.
The critical path is the longest path from the source to the sink of the directed acyclic graph, and the total time along it is the shortest time in which the task can be completed. Therefore, to shorten the execution time of the task, the key nodes on the critical path must be optimized, that is, the subtasks in those nodes are optimized so that they complete sooner. This shortens the execution time of one complete task in the data warehouse, improves the execution efficiency of tasks, and improves the performance of the data warehouse.
Illustratively, the key nodes may be optimized in the following respects: model optimization, such as deleting useless fields, choosing data types that need less storage, and storing data in layers; code optimization, such as joining first the tables that reduce the data volume when associating multiple tables, and using approximate calculation in scenarios where accuracy requirements are not strict; environment optimization, such as adjusting schedules so that tasks run during off-peak periods and avoid cluster resource shortages, and stabilizing the cluster so that tasks are not retried because of downtime or resource reclamation; and external optimization, such as adjusting upstream and downstream tasks to reduce the processing complexity of a task.
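The reasoning above can be checked on a small, made-up example: shortening a subtask that is off the critical path leaves the total time unchanged, while shortening a key node reduces it. The sketch below assumes positive integer durations on the nodes of an acyclic graph; `critical_path` and the sample graph are hypothetical.

```python
def critical_path(out_edges, weight):
    """Return (total_time, path) for the longest node-weighted path in a DAG."""
    # node -> (longest time from this node to a sink, next node on that path)
    best = {}

    def longest_from(node):
        if node not in best:
            tail_time, tail_next = 0, None
            for succ in out_edges.get(node, ()):
                t = longest_from(succ)[0]
                if t > tail_time:
                    tail_time, tail_next = t, succ
            best[node] = (weight[node] + tail_time, tail_next)
        return best[node]

    # The critical path starts at the node with the longest remaining time.
    start = max(weight, key=lambda n: longest_from(n)[0])
    path, node = [], start
    while node is not None:
        path.append(node)
        node = best[node][1]
    return best[start][0], path


out_edges = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
weight = {"A": 10, "B": 30, "C": 5, "D": 20}  # durations in minutes

print(critical_path(out_edges, weight))                 # (60, ['A', 'B', 'D'])
print(critical_path(out_edges, {**weight, "C": 1})[0])  # 60: C is not a key node
print(critical_path(out_edges, {**weight, "B": 15})[0]) # 45: B is a key node
```

After an optimization the critical path may move to a different branch, so the key nodes should be recomputed before choosing the next subtask to optimize.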
The embodiments of the present application further provide a computer-readable storage medium, which may be non-volatile or volatile, and has a computer program stored thereon, where the computer program, when executed by a processor, implements the directed acyclic graph building method for a data warehouse provided in any one of the above embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and which, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium provided herein and used in the embodiments may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
In summary, the directed acyclic graph constructing method, the directed acyclic graph constructing device, and the computer device for the data warehouse provided by the present application obtain the blood relationship of each model table in the data warehouse; acquiring metadata of each scheduling system; acquiring subtask information corresponding to each model table in each scheduling system according to the metadata, wherein the subtask information comprises a subtask name and subtask execution time information; and generating the directed acyclic graph by taking the model tables as nodes, the blood relationship of the model tables as edges, and the information of the model tables and the subtask information as node attributes. The relationship between the model tables and the subtasks is clarified by obtaining the blood relationship between the model tables and the metadata of the scheduling system, and a directed acyclic graph is established based on the relationship, so that the complete task dependence blood relationship is established, and the optimization of the task is facilitated.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, apparatus, article or method that comprises the element.
The above description is only a preferred embodiment of the present application, and not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application, or which are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.
Claims (10)
1. A directed acyclic graph building method for a data warehouse, comprising:
acquiring the blood relationship of each model table in a data warehouse;
acquiring metadata of each scheduling system;
acquiring subtask information corresponding to each model table in each scheduling system according to the metadata, wherein the subtask information comprises task names and task execution time information;
and generating the directed acyclic graph by taking the model tables as nodes, the blood relationship of the model tables as edges, and the information of the model tables and the subtask information as node attributes.
2. The method of claim 1, wherein after generating the directed acyclic graph with the model tables as nodes, the blood-related relationships of the model tables as edges, the information of the model tables, and the subtask information as node attributes, the method further comprises:
and calculating the key nodes of the directed acyclic graph through a weighted critical path algorithm, using the task execution time information as the weight.
3. The method according to claim 1, wherein the generating the directed acyclic graph with the model tables as nodes, the blood-related relationships of the model tables as edges, the information of the model tables and the subtask information as node attributes comprises:
setting each model table as a node, setting the blood relationship of each model table as an edge, and finding a source node and a destination node corresponding to each edge from the nodes;
marking the edge as an outgoing edge of the corresponding source node;
marking the edge as an incoming edge of a corresponding destination node;
and adding the information of each model table and the subtask information into corresponding node attributes to generate the directed acyclic graph.
4. The method of directed acyclic graph building for a data warehouse of claim 1, wherein the task execution time information includes at least one of a task start time, a task end time, and a task scheduling plan time.
5. The method of constructing a directed acyclic graph for a data warehouse as claimed in claim 1, wherein the blood relationship of each model table is obtained by Apache Atlas.
6. The method of constructing a directed acyclic graph for a data warehouse according to claim 2, further comprising, after the computing obtains key nodes of the directed acyclic graph:
optimizing the key node to shorten the execution time of the task.
7. A directed acyclic graph building apparatus for a data warehouse, comprising:
a blood relationship acquisition module, configured to acquire the blood relationship of each model table in a data warehouse;
the metadata acquisition module is used for acquiring metadata of each scheduling system;
a subtask information obtaining module, configured to obtain, according to the metadata, subtask information corresponding to each model table in each scheduling system, where the subtask information includes a subtask name and subtask execution time information;
and the directed acyclic graph generating module is used for generating the directed acyclic graph by taking the model tables as nodes, the blood relationship of the model tables as edges, and the information of the model tables and the subtask information as node attributes.
8. The directed acyclic graph building apparatus for a data warehouse of claim 7, further comprising:
and the key node acquisition module is used for calculating and acquiring the key nodes of the directed acyclic graph through a weighted critical path algorithm, using the task execution time information as the weight.
9. A computer device comprising a memory and a processor, the memory having stored therein a computer program, wherein the processor, when executing the computer program, performs the steps of the method for directed acyclic graph construction for a data warehouse according to any one of claims 1 to 6.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method for directed acyclic graph construction for a data warehouse according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210708843.1A CN115098600A (en) | 2022-06-21 | 2022-06-21 | Directed acyclic graph construction method and device for data warehouse and computer equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115098600A true CN115098600A (en) | 2022-09-23 |
Family
ID=83292938
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210708843.1A Pending CN115098600A (en) | 2022-06-21 | 2022-06-21 | Directed acyclic graph construction method and device for data warehouse and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115098600A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110297820A (en) * | 2019-06-28 | 2019-10-01 | 京东数字科技控股有限公司 | A kind of data processing method, device, equipment and storage medium |
CN111241078A (en) * | 2020-01-07 | 2020-06-05 | 网易(杭州)网络有限公司 | Data analysis system, data analysis method and device |
CN111309712A (en) * | 2020-03-16 | 2020-06-19 | 北京三快在线科技有限公司 | Optimized task scheduling method, device, equipment and medium based on data warehouse |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115757655A (en) * | 2022-11-14 | 2023-03-07 | 中国兵器工业计算机应用技术研究所 | Data blood relationship analysis system and method based on metadata management |
CN116340436A (en) * | 2023-03-14 | 2023-06-27 | 山东浪潮数字商业科技有限公司 | Data packet processing scheduling method and device, medium and equipment |
CN116340436B (en) * | 2023-03-14 | 2024-05-24 | 山东浪潮数字商业科技有限公司 | Data packet processing scheduling method and device, medium and equipment |
CN117874009A (en) * | 2024-03-13 | 2024-04-12 | 云筑信息科技(成都)有限公司 | System for creating and managing several warehouse models |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||