CN115134224A

CN115134224A - DAG graph monitoring method and system

Info

Publication number: CN115134224A
Application number: CN202211052740.0A
Authority: CN
Inventors: 赵振智; 陈吉平
Original assignee: Hangzhou Daishu Technology Co ltd
Current assignee: Hangzhou Daishu Technology Co ltd
Priority date: 2022-08-31
Filing date: 2022-08-31
Publication date: 2022-09-30

Abstract

The application relates to a DAG graph monitoring method and system. By defining leaf nodes, the method monitors not only each leaf node but also, synchronously, all upstream nodes associated with that leaf node in the DAG graph. This achieves monitoring of the DAG graph as a whole rather than of single nodes, and improves the accuracy of the alarms raised when a node to be monitored is abnormal. Because each configured node to be monitored is monitored from the perspective of the DAG graph as a whole, the application solves the problems of invalid alarms and of the heavy workload of configuring alarms.

Description

DAG graph monitoring method and system

Technical Field

The present application relates to the field of big data processing technologies, and in particular, to a method and a system for monitoring a DAG graph.

Background

DAG stands for directed acyclic graph. The DAG graph is an important graph-theoretic data structure. If no vertex of a directed graph can be reached again from itself by following a sequence of edges, the graph is called a directed acyclic graph.

In big data processing, DAG computation usually refers to decomposing a computation task into several subtasks and expressing the logical relationships or execution order among the subtasks as the structure of a DAG graph.

DAG graphs are very common in distributed computing and are applied in many subfields. Dryad, FlumeJava and Tez are typical examples that explicitly build a DAG computing model. In addition, in systems such as the stream computing system Storm or the machine learning framework Spark, computing tasks mostly take the form of a DAG graph.

Monitoring of the DAG graph is therefore very important in big data processing. Existing DAG graph monitoring methods generally monitor each node in the DAG graph individually. This approach has two major drawbacks:

1) A large number of invalid alarms are generated, which not only increases the monitoring cost but also reduces the accuracy of the monitoring result. This is because the running time of each node in the DAG graph is uncertain, and different nodes have different time requirements for producing data. For example, suppose the dependency among three nodes a, b and c is a (runs at 13:00) -> b (runs at 15:00) -> c (runs at 17:00). If a does not produce its data until 16:00, node b cannot run until 16:00, and monitoring at the level of a single task raises an alarm at this point. However, if b can still produce its data by 17:00, the DAG graph is delayed only locally and not as a whole, so no manual intervention is required and the alarm is an invalid alarm.

2) When the DAG graph has many nodes, monitoring efficiency is low. Once the number of nodes exceeds a certain level, staff must configure an alarm program for each node individually, which is error-prone and wastes a large amount of time.

Disclosure of Invention

Therefore, it is necessary to provide a DAG graph monitoring method for solving the problems of high monitoring cost, low monitoring result accuracy and low monitoring efficiency of the conventional DAG graph monitoring method.

The application provides a DAG graph monitoring method, which comprises the following steps:

the server acquires a monitoring configuration file and a monitoring rule file sent by a client;

the server generates at least one monitoring rule instance according to the monitoring configuration file and the monitoring rule file and stores the monitoring rule instance in a database;

the server scans the at least one monitoring rule instance generated in the database on the previous day, monitors the running state of the DAG graph according to the at least one monitoring rule instance, and raises an alarm in real time when the running state of the DAG graph is abnormal;

wherein the server generating at least one monitoring rule instance according to the monitoring configuration file and the monitoring rule file and storing the monitoring rule instance in a database comprises:

the server acquires the leaf nodes according to the monitoring configuration file and acquires all upstream nodes which are associated with the leaf nodes in the DAG graph;

the server calculates the monitoring indexes of the leaf nodes and the monitoring indexes of each upstream node which is associated with the leaf nodes in the DAG graph according to the monitoring configuration file;

wherein the server scanning the at least one monitoring rule instance generated in the database on the previous day, monitoring the running state of the DAG graph according to the at least one monitoring rule instance, and raising an alarm in real time when the running state of the DAG graph is abnormal comprises:

the server takes the node corresponding to each monitoring rule instance as a node to be monitored; when the node to be monitored is a leaf node, the server monitors the leaf node according to the monitoring index of the leaf node; when the node to be monitored is an upstream node associated with the leaf node in the DAG graph, the server monitors that upstream node according to its monitoring index; and when any node to be monitored is abnormal, the server raises an alarm for the abnormality in real time.

The present application further provides a monitoring system for a DAG graph, including:

at least one client;

a server, communicatively connected to each client and configured to execute the DAG graph monitoring method described above.

Drawings

Fig. 1 is a schematic flow diagram of a monitoring method for a DAG graph according to an embodiment of the present disclosure.

Fig. 2 is a schematic flow diagram of a monitoring method for a DAG graph according to an embodiment of the present disclosure.

Fig. 3 is a schematic structural diagram of a monitoring system of a DAG graph according to an embodiment of the present application.

Fig. 4 is a DAG graph in an embodiment of a DAG graph monitoring method provided by the present application.

Fig. 5 is a schematic diagram of an internal display situation of a first stack in the DAG graph monitoring method provided in this application.

Fig. 6 is a schematic diagram of an internal display situation of a second stack in the DAG graph monitoring method provided by the present application.

Detailed Description

To make the purpose, technical solutions and advantages of the present application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the present application and are not intended to limit it.

The application provides a DAG graph monitoring method. It should be noted that the monitoring method for the DAG graph provided by the present application is applicable to any type of DAG graph.

In addition, the monitoring method of the DAG graph provided by the application is not limited to the execution subject. Optionally, an execution subject of the DAG graph monitoring method provided by the present application may be a DAG graph monitoring system. Specifically, the execution subject of the monitoring method for the DAG graph provided by the present application may be a server in the monitoring system for the DAG graph.

As shown in fig. 1 and fig. 2, in an embodiment of the present application, a monitoring method of a DAG graph includes:

S100, the server acquires a monitoring configuration file and a monitoring rule file sent by the client.

Specifically, the monitoring configuration file includes monitoring indexes of each node to be monitored. The monitoring rule file comprises a specific mode of alarming when the node is abnormal.

S200, the server generates at least one monitoring rule instance according to the monitoring configuration file and the monitoring rule file and stores the at least one monitoring rule instance in a database.

Specifically, each monitoring rule instance corresponds to a unique node in the DAG graph.

S300, the server scans the at least one monitoring rule instance generated in the database on the previous day, monitors the running state of the DAG graph according to the at least one monitoring rule instance, and raises an alarm in real time when the running state of the DAG graph is abnormal.

Specifically, the at least one monitoring rule instance generated by executing S100 to S200 is used for the monitoring work of the following day, so that the monitoring work of the current day applies the at least one monitoring rule instance generated by the previous day.

For example, all monitoring rule instances generated on 24 August 2022 are applied on 25 August 2022.

Optionally, S100 to S200 and S300 run asynchronously, that is, they may run at the same time without interfering with each other.

For example, on 24 August 2022 the server executes S300 all day to monitor the running state of the DAG graph and, starting at 11:00 on 24 August 2022, also executes S100 to S200 to generate the at least one monitoring rule instance used for monitoring on 25 August 2022.
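The day-offset scheme described above can be illustrated with a small sketch. It assumes, hypothetically, that each stored monitoring rule instance carries a `generated_on` date; the record layout and the function name `instances_for_monitoring` are illustrative and are not specified by the application.

```python
from datetime import date, timedelta
from typing import Dict, List


def instances_for_monitoring(instances: List[Dict], today: date) -> List[Dict]:
    """Select the rule instances generated on the day before `today` (S300)."""
    previous_day = today - timedelta(days=1)
    return [inst for inst in instances if inst["generated_on"] == previous_day]


# Hypothetical database rows: one monitoring rule instance per monitored node.
stored = [
    {"node_id": 7, "generated_on": date(2022, 8, 24)},
    {"node_id": 3, "generated_on": date(2022, 8, 24)},
    {"node_id": 7, "generated_on": date(2022, 8, 25)},  # generated today, used tomorrow
]

active = instances_for_monitoring(stored, today=date(2022, 8, 25))
print([inst["node_id"] for inst in active])  # [7, 3]
```

Because the selection depends only on the generation date, generating the next day's instances (S100 to S200) can run alongside the current day's monitoring (S300) without the two interfering.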

The S200 includes:

S210, the server acquires the leaf nodes according to the monitoring configuration file and acquires all upstream nodes which are associated with the leaf nodes in the DAG graph.

Specifically, the leaf nodes are preset and may be any nodes of the DAG graph. The monitoring configuration file includes the leaf nodes that have been set. There may be one or more leaf nodes. As shown in fig. 4, if node 7 is the only leaf node to be monitored, the objects to be monitored are node 7 and all nodes upstream of node 7, namely node 3, node 5, node 2, node 4 and node 1.

Similarly, if the leaf node selected by the user is node 9, as shown in fig. 4, the nodes that need to be monitored are node 9, node 7, node 3, node 5, node 2, node 4 and node 1.

The leaf nodes thus determine the region in which the entire DAG graph is monitored, the leaf nodes being the most downstream nodes in the region in which the entire DAG graph is monitored.
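The selection of the monitored region in S210 can be sketched as follows. This is a minimal illustration rather than the application's implementation. The `UPSTREAM` map is a hypothetical encoding of part of fig. 4: the description only confirms that nodes 3 and 5 are directly upstream of node 7, node 2 of node 3, and node 1 of node 2, while the edges marked "assumed" and the function name `collect_monitored_nodes` are assumptions made for the example.

```python
from typing import Dict, List, Set

# Hypothetical encoding of part of fig. 4: node -> list of direct upstream nodes.
UPSTREAM: Dict[int, List[int]] = {
    1: [],
    2: [1],
    3: [2],
    4: [1],   # assumed edge
    5: [4],   # assumed edge
    7: [3, 5],
    9: [7],   # assumed edge
}


def collect_monitored_nodes(leaf: int, upstream: Dict[int, List[int]]) -> Set[int]:
    """Return the leaf node plus every direct or indirect upstream node."""
    monitored: Set[int] = set()
    pending = [leaf]              # explicit stack, so no recursion is needed
    while pending:
        node = pending.pop()
        if node in monitored:
            continue              # a node reachable by several paths is visited once
        monitored.add(node)
        pending.extend(upstream.get(node, []))
    return monitored


print(sorted(collect_monitored_nodes(7, UPSTREAM)))  # [1, 2, 3, 4, 5, 7]
print(sorted(collect_monitored_nodes(9, UPSTREAM)))  # [1, 2, 3, 4, 5, 7, 9]
```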

S220, the server calculates the monitoring indexes of the leaf nodes and the monitoring indexes of each upstream node which is associated with the leaf nodes in the DAG graph according to the monitoring configuration file.

Specifically, if node 7 is the leaf node, the monitoring index of each of nodes 7, 3, 5, 2, 4 and 1 needs to be calculated.

The S300 includes:

S310, the server takes the node corresponding to each monitoring rule instance as a node to be monitored; when the node to be monitored is a leaf node, the server monitors the leaf node according to the monitoring index of the leaf node; when the node to be monitored is an upstream node associated with the leaf node in the DAG graph, the server monitors that upstream node according to its monitoring index; and when any node to be monitored is abnormal, the server raises an alarm in real time.

In this embodiment, by defining leaf nodes, the method monitors not only each leaf node but also, at the same time, all upstream nodes associated with that leaf node in the DAG graph. This achieves monitoring of the DAG graph as a whole rather than of single nodes, and improves the accuracy of the alarms raised when a node to be monitored is abnormal. Because each configured node to be monitored is monitored from the perspective of the DAG graph as a whole, the application solves the existing problems of invalid alarms and of the heavy workload of configuring alarms.

In an embodiment of the present application, the monitoring configuration file includes one or more of the commitment time of each node, the expected operation duration of each node, and the time margin of each node; the time margin is the maximum delay time for the node to receive data.

Specifically, the client may pre-configure the leaf nodes that need to be monitored. The user selects the leaf nodes needing to be monitored at the client, sets the commitment time, the expected operation time length and the time margin of the leaf nodes, and sets the commitment time, the expected operation time length and the time margin of each upstream node which is associated with the leaf nodes in the DAG graph.

Of course, before configuring the leaf nodes to be monitored, the commitment time of each node, the expected operation duration of each node, and the time margin may be preconfigured for all nodes in the DAG graph.

Alternatively, the time margin of each node may be set to be the same. The commitment time and the expected operation time of each node can be set differently.

The commitment time is the time that the user expects to complete the task of the node. The time margin is the maximum delay time for the node to receive data. Both values are preset values. The expected operation time length is also a preset value and is a preset operation time length of each node.

In an embodiment of the present application, the monitoring rule file includes one or more of an alarm name, an alarm type, a node ID, an alarm triggering method, an alarm message sending method, and alarm message recipient information.

Specifically, the user configures all monitoring rules in advance. After the monitoring rules are configured, the client packages them into a JSON string, and this JSON string is the monitoring rule file. The client transmits the JSON string to the server, and the server parses the JSON data after receiving the request to monitor the DAG graph. After the data passes the necessary checks, the operation record is persisted, and the client jumps to the node display page after receiving the response.

The json string may be in the form of:

{"name":"baseline","alarmBusinessType":0,"taskIds":[393],"myTriggers":[0],"senderTypes":["default_MAIL_2"],"receivers":"0","isTaskHolder":1}

As can be seen, the JSON string consists of multiple key-value pairs. The meaning of each key is as follows. name: the name of the alarm. alarmBusinessType: the type of the alarm. taskIds: the task IDs (which can be regarded as the node IDs of the DAG graph). myTriggers: the trigger modes, such as task execution failure or baseline breakage. senderTypes: the sending channels, such as SMS or email. receivers: the recipients.
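Server-side parsing of such a monitoring rule file might look like the sketch below. The `MonitoringRule` class, the `parse_rule` function and the single validation rule are assumptions made for illustration; the application does not specify which checks the server performs.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class MonitoringRule:
    name: str                  # alarm name
    alarm_business_type: int   # alarm type
    task_ids: List[int]        # node IDs of the DAG graph
    triggers: List[int]        # trigger modes
    sender_types: List[str]    # sending channels
    receivers: str             # recipients
    is_task_holder: int        # copied as-is; its meaning is not explained in the text


def parse_rule(raw: str) -> MonitoringRule:
    """Parse the JSON string sent by the client into a rule object."""
    data = json.loads(raw)
    if not data.get("taskIds"):
        # Hypothetical check: a rule must refer to at least one node.
        raise ValueError("monitoring rule must contain at least one task id")
    return MonitoringRule(
        name=data["name"],
        alarm_business_type=data["alarmBusinessType"],
        task_ids=data["taskIds"],
        triggers=data["myTriggers"],
        sender_types=data["senderTypes"],
        receivers=data["receivers"],
        is_task_holder=data["isTaskHolder"],
    )


RAW = ('{"name":"baseline","alarmBusinessType":0,"taskIds":[393],"myTriggers":[0],'
       '"senderTypes":["default_MAIL_2"],"receivers":"0","isTaskHolder":1}')
rule = parse_rule(RAW)
print(rule.name, rule.task_ids, rule.sender_types)  # baseline [393] ['default_MAIL_2']
```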

In an embodiment of the present application, the S220 includes:

S221, acquiring all direct upstream nodes of the leaf node.

For example, taking node 7 as the leaf node, the direct upstream nodes of the leaf node are node 3 and node 5. Note that this step concerns "direct upstream nodes" rather than "upstream nodes" in general: upstream nodes include both direct upstream nodes and indirect upstream nodes.

The direct upstream nodes of node 7 are node 3 and node 5. The indirect upstream nodes of node 7 are node 2, node 4 and node 1.

S222, acquiring the planned starting time of the leaf node.

Specifically, the planned start time is a preset time value representing when a node is scheduled to start.

S223, calculating the predicted end time of each direct upstream node of the leaf node by using a stack method.

Specifically, taking node 7 as the leaf node, the predicted end time of node 3 and the predicted end time of node 5 are calculated using the stack method.

S224, calculating the predicted start time of the leaf node according to the formula 1;

$$T_{i,\text{predict\_start}} = \max\left(T_{i,\text{plan}},\ T_{i_1,\text{end}},\ T_{i_2,\text{end}},\ \ldots,\ T_{i_m,\text{end}}\right) \qquad \text{(formula 1)}$$

where $T_{i,\text{predict\_start}}$ is the predicted start time of the leaf node, $T_{i,\text{plan}}$ is the planned start time of the leaf node, $T_{i_m,\text{end}}$ is the predicted end time of a direct upstream node of the leaf node, $i$ is the sequence number of the leaf node, and $i_m$ is the sequence number of a direct upstream node of the leaf node.

Specifically, the expected start time of the leaf node 7 = Max (planned start of node 7, expected end time of node 3, expected end time of node 5).

S225, calculating the predicted end time of the leaf node according to the formula 2;

$$T_{i,\text{predict\_end}} = T_{i,\text{predict\_start}} + T_{i,\text{time}} \qquad \text{(formula 2)}$$

where $T_{i,\text{predict\_end}}$ is the predicted end time of the leaf node, $T_{i,\text{predict\_start}}$ is the predicted start time of the leaf node, $T_{i,\text{time}}$ is the expected operation duration of the leaf node, and $i$ is the sequence number of the leaf node.

Specifically, the expected end time of the leaf node 7 = the expected start time of the leaf node 7 + the expected run time of the leaf node 7. The expected operation time is a preset time period, and how to set the expected operation time is explained in the foregoing.
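Formula 1 and formula 2 can be tried out with a short sketch. Times are expressed as minutes since midnight purely for convenience; the function names and all numeric values are hypothetical and are not taken from the application.

```python
from typing import Iterable


def predicted_start(planned_start: int, upstream_ends: Iterable[int]) -> int:
    """Formula 1: the later of the planned start and the latest upstream end."""
    return max([planned_start, *upstream_ends])


def predicted_end(start: int, duration: int) -> int:
    """Formula 2: predicted start plus expected operation duration."""
    return start + duration


# Hypothetical values for leaf node 7, in minutes since midnight.
planned_7 = 17 * 60        # planned start 17:00
end_3 = 16 * 60 + 30       # node 3 is predicted to end at 16:30
end_5 = 17 * 60 + 20       # node 5 is predicted to end at 17:20

start_7 = predicted_start(planned_7, [end_3, end_5])
end_7 = predicted_end(start_7, duration=40)
print(start_7, end_7)  # 1040 1080, i.e. 17:20 and 18:00
```

In this example node 5 finishes after the planned start of node 7, so the predicted start of node 7 is pushed back to the predicted end of node 5.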

In an embodiment of the present application, S223 includes:

S223a, traversing all upstream nodes of each direct upstream node of the leaf node in the DAG graph, obtaining the most upstream node among them, and calculating the expected start time of each direct upstream node by using a stack method, working from the most upstream node downwards.

Specifically, to calculate the expected start time of node 3, the expected end time of node 2 must be known, because the expected start time of node 3 = Max (planned start time of node 3, expected end time of node 2). The expected end time of node 2 = the expected start time of node 2 + the expected operation duration of node 2. To know the expected start time of node 2, the expected end time of node 1 must in turn be known, since the expected start time of node 2 = Max (planned start time of node 2, expected end time of node 1); the calculation thus traces back to the most upstream node, node 1. The expected end time of node 1 = the expected start time of node 1 + the expected operation duration of node 1. Node 1 has no direct upstream node, so the expected start time of node 1 = the planned start time of node 1.

The calculation method of the estimated start time of the node 5 is the same as the calculation method of the estimated start time of the node 3, and the calculation method is finally traced back to the most upstream node, i.e., the node 1.

It will be appreciated that to know the expected start times of nodes 3 and 5, the expected start time of node 1 must first be known.

The whole calculation process is simplified to calculate the expected start time of each node from top to bottom.

A recursive method is computationally inefficient and is suitable only when the number of nodes is small. If the DAG graph has a massive number of nodes, so that the nodes to be monitored are also massive in number and the leaf nodes lie far downstream, calculation by recursion is difficult, costly and inefficient. The present application therefore uses a stack method for the calculation.

S223b, calculating the predicted end time of each direct upstream node of the leaf nodes according to equation 3;

$$T_{i_m,\text{predict\_end}} = T_{i_m,\text{predict\_start}} + T_{i_m,\text{time}} \qquad \text{(formula 3)}$$

where $T_{i_m,\text{predict\_end}}$ is the predicted end time of the direct upstream node $i_m$ of the leaf node $i$, $T_{i_m,\text{predict\_start}}$ is the predicted start time of the direct upstream node $i_m$ of the leaf node $i$, $T_{i_m,\text{time}}$ is the expected operation duration of the direct upstream node $i_m$ of the leaf node $i$, $i_m$ is the sequence number of a direct upstream node of the leaf node $i$, and $i$ is the sequence number of the leaf node.

Specifically, the principle of equation 3 is the same as that of equation 2.

In an embodiment of the present application, S220 further includes:

S226, calculating the early warning end time and the early warning start time of the leaf node according to formula 4;

$$T_{i,\text{warning\_end}} = T_{i,\text{C}} + T_{i,\text{allowance}}$$

$$T_{i,\text{warning\_start}} = T_{i,\text{warning\_end}} - T_{i,\text{time}} \qquad \text{(formula 4)}$$

where $T_{i,\text{warning\_start}}$ is the early warning start time of the leaf node, $T_{i,\text{warning\_end}}$ is the early warning end time of the leaf node, $T_{i,\text{time}}$ is the expected operation duration of the leaf node, $T_{i,\text{C}}$ is the commitment time of the leaf node, $T_{i,\text{allowance}}$ is the time margin of the leaf node, and $i$ is the sequence number of the leaf node.

S227, calculating the broken-line end time and the broken-line start time of the leaf node according to formula 5;

$$T_{i,\text{broken\_end}} = T_{i,\text{C}}$$

$$T_{i,\text{broken\_start}} = T_{i,\text{broken\_end}} - T_{i,\text{time}} \qquad \text{(formula 5)}$$

where $T_{i,\text{broken\_start}}$ is the broken-line start time of the leaf node, $T_{i,\text{broken\_end}}$ is the broken-line end time of the leaf node, $T_{i,\text{time}}$ is the expected operation duration of the leaf node, $T_{i,\text{C}}$ is the commitment time of the leaf node, and $i$ is the sequence number of the leaf node.

S228, traversing all upstream nodes associated with the leaf node in the DAG graph and, starting from the leaf node and working from bottom to top, calculating the early warning end time, early warning start time, broken-line end time and broken-line start time of each such upstream node by using a stack method.

Specifically, early warning and broken line are the two states defined in the present application. In this embodiment, the broken-line start time of each node is later than or equal to its early warning start time.

In this embodiment, formula 4 is used to calculate the early warning start time and the early warning end time of the leaf node, and formula 5 is used to calculate the broken-line start time and the broken-line end time of the leaf node.

In an embodiment of the present application, S223a includes:

S223a1, creating stack one and stack two.

S223a2, placing the leaf node on stack one.

S223a3, extracting all direct upstream nodes of the leaf node and placing them on stack one.

S223a4, extracting each further upstream node and placing it on stack one.

S223a5, iteratively executing S223a4 until the most upstream node has been placed on stack one.

S223a6, calculating the expected start time of each node in stack one according to the first-in-last-out (last-in-first-out) principle; after the expected start time of a node has been calculated, moving the node out of stack one and placing it on stack two, until stack one is empty.

Specifically, as shown in fig. 5, fig. 5 also takes node 7 as an example of a leaf node, so that only node 1, node 2, node 3, node 4, node 5, and node 7 are included in the first stack.

The stack method processes data on a first-in-last-out, last-in-first-out basis, so node 1, which is the last node placed on stack one, is calculated first; that is, the expected start time of node 1 is calculated first.

Similarly, node 7 is the first node placed on stack one and is calculated last; that is, the expected start time of node 7 is calculated last.

Using stacks to process the node tasks improves efficiency: even when the number of nodes reaches hundreds or thousands, the stack does not overflow and the calculation remains efficient.
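Steps S223a1 to S223a6 can be sketched as the following two-stack pass. This is an illustrative reading of the steps, not the application's code. The upstream edges for nodes 4 and 5, all times (minutes since midnight) and the name `forward_pass` are assumptions. The sketch also lets a node be pushed again when it is reached at a deeper level and skips it once computed, which is an addition made here so that a node shared by several paths is still processed after all of its upstream nodes.

```python
from typing import Dict, List, Tuple


def forward_pass(
    leaf: int,
    upstream: Dict[int, List[int]],
    planned_start: Dict[int, int],
    duration: Dict[int, int],
) -> Tuple[Dict[int, int], Dict[int, int], List[int]]:
    # S223a2 to S223a5: fill stack one level by level, leaf node first.
    stack_one: List[int] = [leaf]
    level = [leaf]
    while level:
        next_level: List[int] = []
        for node in level:
            for up in upstream.get(node, []):
                if up not in next_level:
                    next_level.append(up)
        stack_one.extend(next_level)
        level = next_level

    # S223a6: pop stack one (last in, first out) and move nodes to stack two.
    start: Dict[int, int] = {}
    end: Dict[int, int] = {}
    stack_two: List[int] = []
    while stack_one:
        node = stack_one.pop()
        if node in start:
            continue  # already computed when popped from a deeper level
        upstream_ends = [end[up] for up in upstream.get(node, [])]
        start[node] = max([planned_start[node], *upstream_ends])  # formula 1
        end[node] = start[node] + duration[node]                  # formulas 2 and 3
        stack_two.append(node)
    return start, end, stack_two


# Hypothetical data for the region monitored when node 7 is the leaf node.
UPSTREAM = {1: [], 2: [1], 3: [2], 4: [1], 5: [4], 7: [3, 5]}
PLANNED = {1: 600, 2: 660, 3: 720, 4: 660, 5: 720, 7: 740}
DURATION = {1: 90, 2: 30, 3: 30, 4: 45, 5: 30, 7: 60}

start, end, stack_two = forward_pass(7, UPSTREAM, PLANNED, DURATION)
print(start[7], end[7])  # 765 825
print(stack_two)         # [1, 4, 2, 5, 3, 7]
```

Node 1 is computed first and node 7 last, and node 7 ends up on top of stack two, ready for the bottom-to-top pass.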

In an embodiment of the present application, S228 includes:

S228a, calculating the early warning end time, early warning start time, broken-line end time and broken-line start time of each node in stack two by formula 6, following the first-in-last-out (last-in-first-out) principle, and removing each node from stack two once its four values have been calculated.

$$T_{k,\text{warning\_end}} = \min\left(T_{k_1,\text{warning\_start}},\ T_{k_2,\text{warning\_start}},\ \ldots,\ T_{k_n,\text{warning\_start}}\right)$$

$$T_{k,\text{warning\_start}} = T_{k,\text{warning\_end}} - T_{k,\text{time}}$$

$$T_{k,\text{broken\_end}} = \min\left(T_{k_1,\text{broken\_start}},\ T_{k_2,\text{broken\_start}},\ \ldots,\ T_{k_n,\text{broken\_start}}\right)$$

$$T_{k,\text{broken\_start}} = T_{k,\text{broken\_end}} - T_{k,\text{time}} \qquad \text{(formula 6)}$$

where $k$ is the sequence number of a node, $k_n$ is the sequence number of a direct downstream node of node $k$, $T_{k,\text{warning\_end}}$ is the early warning end time of node $k$, $T_{k,\text{warning\_start}}$ is the early warning start time of node $k$, $T_{k_n,\text{warning\_start}}$ is the early warning start time of the direct downstream node $k_n$ of node $k$, $T_{k,\text{broken\_end}}$ is the broken-line end time of node $k$, $T_{k,\text{broken\_start}}$ is the broken-line start time of node $k$, $T_{k_n,\text{broken\_start}}$ is the broken-line start time of the direct downstream node $k_n$ of node $k$, and $T_{k,\text{time}}$ is the expected operation duration of node $k$.

When the early warning end time, the early warning start time, the line breaking end time and the line breaking start time of the leaf node are calculated, the calculation results of the formula 4 and the formula 5 are directly used.

Specifically, fig. 6 likewise takes node 7 as the leaf node, so stack two contains only node 1, node 2, node 3, node 4, node 5 and node 7. As shown in fig. 6, calculation on stack two also follows the first-in-last-out, last-in-first-out principle: node 7 is the last node transferred from stack one to stack two but is calculated first, that is, the early warning end time, early warning start time, broken-line end time and broken-line start time of node 7 are calculated first.

Similarly, node 1 is the first node placed on stack two and is calculated last; its early warning end time, early warning start time, broken-line end time and broken-line start time are the last to be calculated.
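The bottom-to-top pass over stack two can be sketched as follows. The sketch takes the four values of the leaf node as given inputs, standing in for the results of formula 4 and formula 5, and applies formula 6 to every other node. The downstream edges involving nodes 4 and 5, the order of `STACK_TWO`, all times (minutes since midnight) and the names `Windows` and `backward_pass` are assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple


class Windows(NamedTuple):
    warning_start: int
    warning_end: int
    broken_start: int
    broken_end: int


def backward_pass(
    stack_two: List[int],
    downstream: Dict[int, List[int]],
    duration: Dict[int, int],
    leaf: int,
    leaf_windows: Windows,
) -> Dict[int, Windows]:
    windows: Dict[int, Windows] = {leaf: leaf_windows}
    stack = list(stack_two)           # work on a copy; the top is the leaf node
    while stack:
        node = stack.pop()            # last in, first out
        if node == leaf:
            continue                  # leaf values come from formulas 4 and 5
        # Formula 6: each end time is the earliest start among direct downstream nodes.
        warning_end = min(windows[d].warning_start for d in downstream[node])
        broken_end = min(windows[d].broken_start for d in downstream[node])
        windows[node] = Windows(
            warning_start=warning_end - duration[node],
            warning_end=warning_end,
            broken_start=broken_end - duration[node],
            broken_end=broken_end,
        )
    return windows


# Hypothetical data for the region monitored when node 7 is the leaf node.
DOWNSTREAM = {1: [2, 4], 2: [3], 3: [7], 4: [5], 5: [7], 7: []}
DURATION = {1: 90, 2: 30, 3: 30, 4: 45, 5: 30, 7: 60}
STACK_TWO = [1, 4, 2, 5, 3, 7]
LEAF = Windows(warning_start=960, warning_end=1020, broken_start=990, broken_end=1050)

result = backward_pass(STACK_TWO, DOWNSTREAM, DURATION, leaf=7, leaf_windows=LEAF)
print(result[3])  # Windows(warning_start=930, warning_end=960, broken_start=960, broken_end=990)
print(result[1])  # Windows(warning_start=795, warning_end=885, broken_start=825, broken_end=915)
```

Node 1 has two direct downstream nodes, so its end times are bounded by whichever of node 2 and node 4 must start earlier.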

In an embodiment of the present application, S310 includes:

S311, obtaining the current time, and obtaining the early warning start time and the broken-line start time of the node to be monitored.

S312, judging whether the current time is greater than or equal to the early warning start time of the node to be monitored.

S313, if the current time is greater than or equal to the early warning start time of the node to be monitored, further judging whether the current time is greater than or equal to the broken-line start time of the node to be monitored.

S314, if the current time is greater than or equal to the broken-line start time of the node to be monitored, marking the monitoring rule instance corresponding to the node to be monitored as being in the broken-line state, and outputting a broken-line message.

S315, if the current time is less than the broken-line start time of the node to be monitored, marking the monitoring rule instance corresponding to the node to be monitored as being in the early warning state, and outputting an early warning message.

Specifically, the monitoring logic of this embodiment is as follows. If the current time is greater than or equal to the early warning start time of the node to be monitored but less than its broken-line start time, a light alarm is raised: an early warning is needed, but the line has not yet been broken.

If the current time is greater than or equal to both the early warning start time and the broken-line start time of the node to be monitored, a heavy alarm is raised: the node is past the early warning and the line has been broken.

A broken line is a more serious alarm condition than an early warning.

Optionally, after S312, the S310 further includes:

If the current time is less than the early warning start time of the node to be monitored, the node state is considered safe, and the process returns to S312 to continue monitoring.
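The decision logic of S311 to S315, together with the optional safe branch, can be summarised in a short sketch. The `NodeState` enumeration, the `classify` function and the example times are hypothetical; how messages are actually sent is outside the scope of the sketch.

```python
from datetime import datetime
from enum import Enum


class NodeState(Enum):
    SAFE = "safe"
    WARNING = "early warning"
    BROKEN = "broken line"


def classify(now: datetime, warning_start: datetime, broken_start: datetime) -> NodeState:
    """Map the current time to a node state following S312 to S315."""
    if now < warning_start:      # S312 not satisfied: the node is considered safe
        return NodeState.SAFE
    if now >= broken_start:      # S313 satisfied: S314, broken-line state
        return NodeState.BROKEN
    return NodeState.WARNING     # S315, early warning state


# Hypothetical thresholds for one node to be monitored.
warning_start = datetime(2022, 8, 25, 16, 0)
broken_start = datetime(2022, 8, 25, 16, 30)

print(classify(datetime(2022, 8, 25, 15, 59), warning_start, broken_start).value)  # safe
print(classify(datetime(2022, 8, 25, 16, 0), warning_start, broken_start).value)   # early warning
print(classify(datetime(2022, 8, 25, 16, 30), warning_start, broken_start).value)  # broken line
```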

The application also provides a monitoring system of the DAG graph.

As shown in fig. 3, in an embodiment of the present application, a monitoring system of a DAG graph includes at least one client 100 and a server 200.

The server 200 is communicatively connected to each client 100 and is configured to perform the foregoing DAG graph monitoring method.

The technical features of the embodiments described above may be combined arbitrarily, and the order of execution of the method steps is not limited. For brevity, not all possible combinations of the technical features in the embodiments are described; however, as long as a combination of technical features involves no contradiction, it should be considered within the scope of this description.

The above embodiments express only several implementations of the present application, and although their description is specific and detailed, it should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, and these all fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the appended claims.

Claims

1. A method for monitoring a DAG graph, the method comprising:

the server scans the at least one monitoring rule instance generated in the database on the previous day, monitors the running state of the DAG graph according to the at least one monitoring rule instance, and raises an alarm in real time when the running state of the DAG graph is abnormal, which comprises:

2. The DAG graph monitoring method according to claim 1, wherein the monitoring configuration file includes one or more of a commitment time of each node, an expected operation duration of each node, and a time margin of each node; the time margin is the maximum delay time for the node to receive data.

3. The monitoring method for a DAG graph according to claim 1, wherein the monitoring rule file includes one or more of an alarm name, an alarm type, a node ID, an alarm triggering manner, an alarm message sending manner, and alarm message receiver information.

4. The DAG graph monitoring method according to claim 2, wherein the server calculating the monitoring indexes of the leaf nodes and the monitoring indexes of each upstream node associated with the leaf nodes in the DAG graph according to the monitoring configuration file comprises:

acquiring all direct upstream nodes of the leaf nodes;

acquiring the planned starting time of the leaf node;

calculating the predicted end time of each direct upstream node of the leaf nodes by adopting a stack method;

calculating the predicted start time of the leaf node according to formula 1;

wherein Ti_predict_start is the predicted start time of the leaf node, Ti_plan is the planned start time of the leaf node, Ti_im_end is the predicted end time of a direct upstream node of the leaf node, i is the serial number of the leaf node, and im is the serial number of the direct upstream node of the leaf node;

calculating the predicted end time of the leaf node according to formula 2;

Ti_predict_end = Ti_predict_start + Ti_time    (Formula 2);

5. The method for monitoring a DAG graph according to claim 4, wherein calculating the predicted end time of each direct upstream node of the leaf node using a stack method comprises:

traversing all upstream nodes of each direct upstream node of the leaf node in the DAG graph, obtaining the most upstream node among them, and calculating the predicted start time of each direct upstream node using a stack method, proceeding from the most upstream node in top-down order;

calculating the predicted end time of each direct upstream node of the leaf node according to formula 3;

Tim_predict_end = Tim_predict_start + Tim_time    (Formula 3);

6. The method for monitoring a DAG graph according to claim 5, wherein the server calculating the monitoring metrics of the leaf node and the monitoring metrics of each upstream node associated with the leaf node in the DAG graph according to the monitoring configuration file further comprises:

calculating early warning ending time and early warning starting time of the leaf nodes according to a formula 4;

Ti_warning_end = Ti_C + Ti_allowance

Ti_warning_start = Ti_warning_end - Ti_time    (Formula 4);

wherein Ti_warning_start is the early warning start time of the leaf node, Ti_warning_end is the early warning end time of the leaf node, Ti_time is the predicted running duration of the leaf node, Ti_C is the commitment time of the leaf node, Ti_allowance is the time allowance of the leaf node, and i is the serial number of the leaf node;

calculating the line breaking end time and the line breaking start time of the leaf node according to a formula 5;

Ti_broken_end = Ti_C

Ti_broken_start = Ti_broken_end - Ti_time    (Formula 5);

wherein Ti_broken_start is the line breaking start time of the leaf node, Ti_broken_end is the line breaking end time of the leaf node, Ti_time is the predicted running duration of the leaf node, Ti_C is the commitment time of the leaf node, and i is the serial number of the leaf node;

traversing all upstream nodes associated with the leaf node in the DAG graph, and calculating the early warning end time, early warning start time, line breaking end time and line breaking start time of each such upstream node using a stack method, proceeding from the leaf node in bottom-up order.
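
As an illustration of formulas 4 and 5 for a single leaf node, the following Python sketch computes the four thresholds from a commitment time, a time allowance and a predicted running duration. The function name `leaf_thresholds`, the dictionary keys and the sample values are hypothetical and are not part of the claims.

```python
from datetime import datetime, timedelta


def leaf_thresholds(commit_time, allowance, run_time):
    """Compute the early warning and line breaking windows of a leaf node.

    commit_time -> Ti_C, allowance -> Ti_allowance, run_time -> Ti_time.
    The names used here are illustrative only.
    """
    # Formula 4: Ti_warning_end = Ti_C + Ti_allowance
    warning_end = commit_time + allowance
    # Formula 4: Ti_warning_start = Ti_warning_end - Ti_time
    warning_start = warning_end - run_time
    # Formula 5: Ti_broken_end = Ti_C
    broken_end = commit_time
    # Formula 5: Ti_broken_start = Ti_broken_end - Ti_time
    broken_start = broken_end - run_time
    return {
        "warning_start": warning_start,
        "warning_end": warning_end,
        "broken_start": broken_start,
        "broken_end": broken_end,
    }


# Hypothetical leaf node: committed for 10:00, 30 min allowance, 1 h run.
example = leaf_thresholds(
    datetime(2022, 8, 31, 10, 0),
    timedelta(minutes=30),
    timedelta(hours=1),
)
```

Using `datetime` and `timedelta` keeps the arithmetic in the same units as the configured times; plain minute counts would work equally well.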

7. The method for monitoring a DAG graph according to claim 6, wherein traversing all upstream nodes of each direct upstream node of the leaf node in the DAG graph, obtaining the most upstream node among them, and calculating the predicted start time of each direct upstream node using a stack method, proceeding from the most upstream node in top-down order, comprises:

creating a first stack and a second stack;

placing leaf nodes on a first stack;

extracting all direct upstream nodes of the leaf nodes, and placing all the direct upstream nodes of the leaf nodes on a first stack;

extracting each further upstream node, and placing each further upstream node on a first stack;

repeating the steps of extracting each further upstream node and placing it on the first stack until the most upstream node has been placed on the first stack;

and calculating the predicted start time of each node in the first stack on a first-in, last-out (last-in, first-out) basis, and, after the predicted start time of a node has been calculated, moving that node from the first stack to the second stack, until the first stack is empty.
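
The two-stack procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: formula 1 is not reproduced in this text, so the sketch assumes the predicted start time is the later of the planned start time and the latest predicted end time among the direct upstream nodes. The function name `forward_pass`, the dictionary layout and the use of integer minutes are likewise hypothetical.

```python
def forward_pass(upstream, plan_start, run_time, leaf):
    """Predict start/end times from the most upstream nodes down to the leaf.

    upstream:   node -> list of direct upstream nodes
    plan_start: node -> planned start time (minutes)
    run_time:   node -> predicted running duration (minutes)
    """
    stack_one, stack_two = [], []

    # Push the leaf, then each further level of upstream nodes.
    stack_one.append(leaf)
    frontier = [leaf]
    while frontier:
        next_frontier = []
        for node in frontier:
            for up in upstream.get(node, []):
                stack_one.append(up)  # a node may be pushed more than once
                next_frontier.append(up)
        frontier = next_frontier

    predict_start, predict_end = {}, {}
    # Pop last-in, first-out: the most upstream nodes come off first.
    while stack_one:
        node = stack_one.pop()
        if node in predict_start:
            continue  # duplicate entry, already calculated
        ups = upstream.get(node, [])
        latest_up_end = max(
            (predict_end[u] for u in ups), default=plan_start[node]
        )
        # ASSUMED form of formula 1 (not given in the text).
        predict_start[node] = max(plan_start[node], latest_up_end)
        # Formulas 2 and 3: predicted end = predicted start + duration.
        predict_end[node] = predict_start[node] + run_time[node]
        stack_two.append(node)  # move the finished node to the second stack

    return predict_start, predict_end, stack_two


# Hypothetical diamond-shaped DAG: A -> B -> D and A -> C -> D.
upstream = {"D": ["B", "C"], "B": ["A"], "C": ["A"], "A": []}
plan_start = {"A": 0, "B": 0, "C": 0, "D": 40}
run_time = {"A": 10, "B": 20, "C": 50, "D": 5}
starts, ends, second_stack = forward_pass(upstream, plan_start, run_time, "D")
```

Because every upstream node is pushed one level deeper than the node that depends on it, the first time a node is popped all of its upstream nodes have already been calculated. The second stack ends with the leaf on top, which is the order needed for a later bottom-up pass.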

8. The method for monitoring a DAG graph according to claim 7, wherein traversing all upstream nodes associated with the leaf node in the DAG graph, and calculating the early warning end time, early warning start time, line breaking end time and line breaking start time of each such upstream node using a stack method, proceeding from the leaf node in bottom-up order, comprises:

calculating, on a first-in, last-out (last-in, first-out) basis, the early warning end time, early warning start time, line breaking end time and line breaking start time of each node in the second stack using formula 6, and removing each node from the second stack after these four times have been calculated for it;

Tk_warning_start = Tk_warning_end - Tk_time

Tk_broken_start = Tk_broken_end - Tk_time    (Formula 6);

wherein k is the serial number of a node, kn is the serial number of a direct downstream node of node k, Tk_warning_end is the early warning end time of node k, Tk_warning_start is the early warning start time of node k, Tkn_warning_start is the early warning start time of the direct downstream node kn of node k, Tk_broken_end is the line breaking end time of node k, Tk_broken_start is the line breaking start time of node k, Tkn_broken_start is the line breaking start time of the direct downstream node kn of node k, and Tk_time is the predicted running duration of node k;
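
The bottom-up pass can be illustrated with the Python sketch below. Two points are assumptions rather than statements of the claims: the text gives only the start-time lines of formula 6, so the sketch assumes that a node's end time is the earliest start time among its direct downstream nodes; and Python's standard `graphlib.TopologicalSorter` (Python 3.9 or later) stands in for popping the second stack, since both visit a node only after its downstream nodes. The function name `propagate_upstream` and the integer-minute sample data are hypothetical.

```python
from graphlib import TopologicalSorter


def propagate_upstream(downstream, run_time, leaf,
                       leaf_warning_start, leaf_broken_start):
    """Derive warning/broken start times for every node upstream of the leaf.

    downstream: node -> list of direct downstream nodes (leaf maps to [])
    run_time:   node -> predicted running duration, Tk_time (minutes)
    """
    warning_start = {leaf: leaf_warning_start}
    broken_start = {leaf: leaf_broken_start}

    # static_order() yields a node only after every node it maps to,
    # so the leaf comes first and the most upstream nodes come last.
    for k in TopologicalSorter(downstream).static_order():
        if k == leaf:
            continue
        # ASSUMPTION: Tk_*_end is the earliest Tkn_*_start downstream.
        warning_end = min(warning_start[kn] for kn in downstream[k])
        broken_end = min(broken_start[kn] for kn in downstream[k])
        # Formula 6: start = end - Tk_time.
        warning_start[k] = warning_end - run_time[k]
        broken_start[k] = broken_end - run_time[k]

    return warning_start, broken_start


# Hypothetical DAG: A -> C, B -> C, C -> D, with D as the leaf node.
downstream = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
run_time = {"A": 10, "B": 20, "C": 30}
w_start, b_start = propagate_upstream(
    downstream, run_time, "D",
    leaf_warning_start=570, leaf_broken_start=540,
)
```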

9. The method for monitoring a DAG graph according to claim 8, wherein the server taking the node corresponding to each monitoring rule instance as a node to be monitored; monitoring the leaf node according to the monitoring metrics of the leaf node when the node to be monitored is a leaf node; monitoring the upstream node according to the monitoring metrics of that upstream node when the node to be monitored is an upstream node associated with the leaf node in the DAG graph; and raising an alarm in real time when any node to be monitored is abnormal, comprises:

acquiring current time, and acquiring early warning start time and line breaking start time of a node to be monitored;

judging whether the current time is greater than or equal to the early warning starting time of the node to be monitored;

if the current time is greater than or equal to the early warning starting time of the node to be monitored, further judging whether the current time is greater than or equal to the line breaking starting time of the node to be monitored;

if the current time is greater than or equal to the line breaking starting time of the node to be monitored, marking the monitoring rule instance corresponding to the node to be monitored as a line breaking state, and outputting a line breaking message;

and if the current time is less than the line breaking starting time of the node to be monitored, marking the monitoring rule instance corresponding to the node to be monitored as an early warning state, and outputting an early warning message.
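
The decision sequence of the steps above can be sketched in Python as follows. The function name `classify` and the state labels are hypothetical; in particular, the "normal" label for a node whose early warning start time has not yet been reached is an assumption, since the claim does not name that case. Times are plain numbers here for brevity.

```python
def classify(now, warning_start, broken_start):
    """Return the state of a monitoring rule instance at time `now`."""
    # Before the early warning start time nothing is reported
    # (the "normal" label is an assumption, not from the claim).
    if now < warning_start:
        return "normal"
    # At or past the line breaking start time: line breaking state.
    if now >= broken_start:
        return "broken"
    # Otherwise: early warning state.
    return "warning"


# Hypothetical node with warning start at 100 and line breaking start at 200.
states = [classify(t, 100, 200) for t in (50, 100, 150, 200)]
```

The early warning state is reachable only when the early warning start time is earlier than the line breaking start time, which is how the sample values are chosen.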

10. A monitoring system for a DAG graph, comprising:

at least one client;

a server communicatively coupled to each client, the server configured to perform the method of monitoring a DAG graph as recited in any of claims 1-9.