CN115134224A - DAG graph monitoring method and system - Google Patents


Info

Publication number
CN115134224A
Authority
CN
China
Prior art keywords
node
time
leaf
monitoring
upstream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211052740.0A
Other languages
Chinese (zh)
Inventor
赵振智
陈吉平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Daishu Technology Co ltd
Original Assignee
Hangzhou Daishu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Daishu Technology Co ltd
Priority to CN202211052740.0A
Publication of CN115134224A

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06: Management of faults, events, alarms or notifications
    • H04L41/0681: Configuration of triggering conditions
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00: Arrangements for monitoring or testing data switching networks
    • H04L43/04: Processing captured monitoring data, e.g. for logfile generation
    • H04L43/045: Processing captured monitoring data, e.g. for logfile generation for graphical visualisation of monitoring data
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/06: Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/2866: Architectures; Arrangements
    • H04L67/30: Profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Mining & Analysis (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present application relates to a DAG graph monitoring method and system. By defining leaf nodes, the method monitors not only each leaf node but also, synchronously, all upstream nodes associated with that leaf node in the DAG graph. This achieves monitoring of the DAG graph as a whole rather than of single nodes, and improves the accuracy of the alarms raised when a monitored node is abnormal. Because each configured node is monitored from the perspective of the whole DAG graph, the application solves the problems of invalid alarms and of the heavy workload of configuring alarms.

Description

DAG graph monitoring method and system
Technical Field
The present application relates to the field of big data processing technologies, and in particular, to a method and a system for monitoring a DAG graph.
Background
DAG stands for directed acyclic graph, an important data structure in graph theory. A directed graph is called a directed acyclic graph if no vertex can be reached again from itself by following a sequence of edges.
In big data processing, DAG computation usually means decomposing a computing task into several subtasks and representing the logical relationships or ordering among the subtasks as a DAG graph.
DAG graphs are very common in distributed computing and are applied in many subfields. Dryad, FlumeJava, and Tez are typical examples that explicitly build a DAG computing model; in a stream-computing system such as Storm or a machine learning framework such as Spark, computing tasks also mostly take the form of a DAG graph.
Monitoring DAG graphs is therefore very important in big data processing. Existing monitoring methods generally monitor each node of the DAG graph individually. This approach has two major drawbacks:
1) Many invalid alarms are generated, which increases the monitoring cost and reduces the accuracy of the monitoring result. The running time of each node in the DAG graph is uncertain, and different nodes have different time requirements for data output. For example, suppose three nodes a, b, and c depend on one another as a (runs at 13:00) -> b (runs at 15:00) -> c (runs at 17:00). If a does not output its data until 16:00, b can only start running at 16:00, and per-node monitoring raises an alarm at that point. If b can still output its data by 17:00, however, the DAG graph is delayed only locally and not as a whole, so no manual intervention is needed and the alarm is invalid.
2) Monitoring efficiency is low when the DAG graph has many nodes. Once the number of nodes exceeds a certain amount, staff must configure an alarm program for each node individually, which is error-prone and wastes a large amount of time.
Disclosure of Invention
Therefore, it is necessary to provide a DAG graph monitoring method that addresses the high monitoring cost, low result accuracy, and low monitoring efficiency of conventional DAG graph monitoring methods.
The application provides a DAG graph monitoring method, which comprises the following steps:
the server acquires a monitoring configuration file and a monitoring rule file sent by a client;
the server generates at least one monitoring rule instance according to the monitoring configuration file and the monitoring rule file and stores the monitoring rule instance in a database;
the server scans the at least one monitoring rule instance that was generated and stored in the database on the previous day, monitors the running state of the DAG graph according to the at least one monitoring rule instance, and raises an alarm in real time when the running state of the DAG graph is abnormal;
wherein the server generating at least one monitoring rule instance according to the monitoring configuration file and the monitoring rule file and storing it in a database comprises:
the server acquires the leaf nodes according to the monitoring configuration file and acquires all upstream nodes associated with the leaf nodes in the DAG graph;
the server calculates, according to the monitoring configuration file, the monitoring indexes of the leaf nodes and of each upstream node associated with the leaf nodes in the DAG graph;
wherein the server scanning the at least one monitoring rule instance generated on the previous day, monitoring the running state of the DAG graph, and raising an alarm in real time when the running state is abnormal comprises:
the server takes the node corresponding to each monitoring rule instance as a node to be monitored; when the node to be monitored is a leaf node, the server monitors it according to the monitoring indexes of the leaf node; when the node to be monitored is an upstream node associated with a leaf node in the DAG graph, the server monitors it according to the monitoring indexes of that upstream node; and when any node to be monitored is abnormal, the server raises an alarm about the abnormality in real time.
The present application further provides a monitoring system for a DAG graph, including:
at least one client;
a server communicatively connected to each client and configured to execute the DAG graph monitoring method described above.
The present application relates to a DAG graph monitoring method and system. By defining leaf nodes, the method monitors not only each leaf node but also, synchronously, all upstream nodes associated with that leaf node in the DAG graph. This achieves monitoring of the DAG graph as a whole rather than of single nodes, and improves the accuracy of the alarms raised when a monitored node is abnormal. Because each configured node is monitored from the perspective of the whole DAG graph, the application solves the problems of invalid alarms and of the heavy workload of configuring alarms.
Drawings
Fig. 1 is a schematic flow diagram of a monitoring method for a DAG graph according to an embodiment of the present disclosure.
Fig. 2 is a schematic flow diagram of a monitoring method for a DAG graph according to an embodiment of the present disclosure.
Fig. 3 is a schematic structural diagram of a monitoring system of a DAG graph according to an embodiment of the present application.
Fig. 4 is a DAG graph in an embodiment of a DAG graph monitoring method provided by the present application.
Fig. 5 is a schematic diagram of the contents of stack one in the DAG graph monitoring method provided by the present application.
Fig. 6 is a schematic diagram of the contents of stack two in the DAG graph monitoring method provided by the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the present application and are not intended to limit it.
The application provides a DAG graph monitoring method. It should be noted that the monitoring method for the DAG graph provided by the present application is applicable to any type of DAG graph.
In addition, the present application does not limit the execution subject of the DAG graph monitoring method. Optionally, the execution subject may be a DAG graph monitoring system. Specifically, the execution subject may be a server in the DAG graph monitoring system.
As shown in fig. 1 and fig. 2, in an embodiment of the present application, a monitoring method of a DAG graph includes:
S100, the server acquires a monitoring configuration file and a monitoring rule file sent by the client.
Specifically, the monitoring configuration file includes monitoring indexes of each node to be monitored. The monitoring rule file comprises a specific mode of alarming when the node is abnormal.
S200, the server generates at least one monitoring rule instance according to the monitoring configuration file and the monitoring rule file and stores the at least one monitoring rule instance in a database.
Specifically, each monitoring rule instance corresponds to a unique node in the DAG graph.
S300, the server scans the at least one monitoring rule instance generated and stored in the database on the previous day, monitors the running state of the DAG graph according to the at least one monitoring rule instance, and raises an alarm in real time when the running state of the DAG graph is abnormal.
Specifically, the at least one monitoring rule instance generated by executing S100 to S200 is used for the monitoring work of the following day, so the monitoring work of the current day applies the monitoring rule instances generated on the previous day.
For example, all monitoring rule instances generated on August 24, 2022 are applied on August 25, 2022.
Optionally, S100 to S200 run asynchronously with S300; that is, they may run at the same time without interfering with each other.
For example, on August 24, 2022 the server executes S300 to monitor the running state of the DAG graph throughout the day, and during the same day executes S100 to S200 to generate the at least one monitoring rule instance used for the monitoring work on August 25, 2022.
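The day-ahead relationship between S100 to S200 and S300 can be sketched as a simple date filter. This is only an illustrative sketch: the `RuleInstance` class, its field names, and the in-memory list standing in for the database are assumptions and do not come from the application.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RuleInstance:
    node_id: int          # node of the DAG graph that this instance monitors
    generated_on: date    # day on which S100 to S200 produced the instance


def instances_for(today: date, stored: list[RuleInstance]) -> list[RuleInstance]:
    """Return the instances generated on the day before `today` (the S300 scan)."""
    previous_day = today - timedelta(days=1)
    return [inst for inst in stored if inst.generated_on == previous_day]


# Hypothetical stored instances standing in for the database contents.
stored = [
    RuleInstance(node_id=7, generated_on=date(2022, 8, 24)),
    RuleInstance(node_id=3, generated_on=date(2022, 8, 24)),
    RuleInstance(node_id=7, generated_on=date(2022, 8, 25)),
]
active = instances_for(date(2022, 8, 25), stored)
print([inst.node_id for inst in active])
```

Because the scan only reads instances from the previous day, generating the next day's instances at the same time does not affect the monitoring that is already running.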
The S200 includes:
S210, the server acquires the leaf nodes according to the monitoring configuration file and acquires all upstream nodes associated with the leaf nodes in the DAG graph.
Specifically, the leaf nodes are preset, and any node of the DAG graph may be set as a leaf node. The monitoring configuration file includes the leaf nodes that have been set. There may be one or more leaf nodes. As shown in fig. 4, if node 7 is the only leaf node to be monitored, the objects to be monitored are node 7 and all nodes upstream of node 7, namely node 3, node 5, node 2, node 4, and node 1.
If the leaf node selected by the user is node 9, as shown in fig. 4, the nodes to be monitored are, similarly: node 9, node 7, node 3, node 5, node 2, node 4, and node 1.
The leaf nodes thus determine the region of the DAG graph that is monitored; each leaf node is the most downstream node of its monitored region.
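A minimal sketch of how the monitored region could be collected from a leaf node is given below. The edge list is a hypothetical graph chosen to be consistent with the nodes named in the text; the actual edges of fig. 4 are not reproduced in this document, so `UPSTREAM` and the function name `monitored_region` are assumptions.

```python
# Hypothetical graph: maps each node to its direct upstream nodes.
# The real fig. 4 may contain other nodes and edges.
UPSTREAM = {
    1: [],
    2: [1],
    3: [2],
    4: [1],
    5: [4],
    7: [3, 5],
    9: [7],
}


def monitored_region(leaf: int, upstream: dict[int, list[int]]) -> set[int]:
    """Return the leaf plus every direct or indirect upstream node."""
    region = set()
    pending = [leaf]              # explicit stack instead of recursion
    while pending:
        node = pending.pop()
        if node in region:
            continue              # already reached through another path
        region.add(node)
        pending.extend(upstream[node])
    return region


print(sorted(monitored_region(7, UPSTREAM)))
print(sorted(monitored_region(9, UPSTREAM)))
```

With this assumed graph, choosing node 7 yields nodes 1, 2, 3, 4, 5, and 7, and choosing node 9 adds node 9 to that set, matching the two cases described above.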
S220, the server calculates the monitoring indexes of the leaf nodes and the monitoring indexes of each upstream node which is associated with the leaf nodes in the DAG graph according to the monitoring configuration file.
Specifically, if node 7 is the leaf node, the monitoring indexes of node 7, node 3, node 5, node 2, node 4, and node 1 each need to be calculated.
The S300 includes:
S310, the server takes the node corresponding to each monitoring rule instance as a node to be monitored; when the node to be monitored is a leaf node, the server monitors it according to the monitoring indexes of the leaf node; when the node to be monitored is an upstream node associated with a leaf node in the DAG graph, the server monitors it according to the monitoring indexes of that upstream node; and when any node to be monitored is abnormal, the server raises an alarm in real time.
The present application relates to a DAG graph monitoring method and system. By defining leaf nodes, the method monitors not only each leaf node but also, at the same time, all upstream nodes associated with that leaf node in the DAG graph. This achieves monitoring of the DAG graph as a whole rather than of single nodes, and improves the accuracy of the alarms raised when a monitored node is abnormal. Because each configured node is monitored from the perspective of the whole DAG graph, the application solves the existing problems of invalid alarms and of the heavy workload of configuring alarms.
In an embodiment of the present application, the monitoring configuration file includes one or more of the commitment time of each node, the expected running duration of each node, and the time margin of each node; the time margin is the maximum delay time for the node to receive data.
Specifically, the leaf nodes to be monitored may be preconfigured at the client. The user selects the leaf nodes to be monitored at the client and sets the commitment time, expected running duration, and time margin of the leaf nodes and of each upstream node associated with the leaf nodes in the DAG graph.
Of course, the commitment time, expected running duration, and time margin may also be preconfigured for all nodes in the DAG graph before the leaf nodes to be monitored are configured.
Optionally, the time margin may be set to the same value for every node, while the commitment time and expected running duration may differ from node to node.
The commitment time is the time by which the user expects the task of the node to be completed. The time margin is the maximum delay time for the node to receive data. The expected running duration is the preset running duration of each node. All three are preset values.
In an embodiment of the present application, the monitoring rule file includes one or more of an alarm name, an alarm type, a node ID, an alarm triggering method, an alarm message sending method, and alarm message recipient information.
Specifically, the user configures all monitoring rules in advance. After the monitoring rules are configured, the client packages them into a JSON string, which is the monitoring rule file. The client sends the JSON string to the server, and the server parses the JSON data after receiving the request to monitor the DAG graph. After the data passes the necessary checks, the server persists the operation record, and the client jumps to the node display page once it receives the response.
The JSON string may take the following form:
{"name":"baseline","alarmBusinessType":0,"taskIds":[393],"myTriggers":[0],"senderTypes":["default_MAIL_2"],"receivers":"0","isTaskHolder":1}
As can be seen, the JSON string consists of multiple key-value pairs. The meaning of each key is as follows. name: the name of the alarm. alarmBusinessType: the type of the alarm. taskIds: the task IDs (which can be regarded as the node IDs of the DAG graph). myTriggers: the triggering methods, such as task execution failure or a broken baseline. senderTypes: the sending channels, such as SMS or email. receivers: the recipients.
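The parsing and checking step on the server side might look like the following sketch, which uses Python's standard `json` module on the example string above. The `MonitoringRule` class, its field names, and the single check shown are assumptions for illustration; the application does not specify which checks the server performs.

```python
import json
from dataclasses import dataclass

# The example rule string from the text, split across lines for readability.
RAW = ('{"name":"baseline","alarmBusinessType":0,"taskIds":[393],'
       '"myTriggers":[0],"senderTypes":["default_MAIL_2"],'
       '"receivers":"0","isTaskHolder":1}')


@dataclass
class MonitoringRule:
    name: str
    alarm_business_type: int
    task_ids: list[int]
    triggers: list[int]
    sender_types: list[str]
    receivers: str


def parse_rule(raw: str) -> MonitoringRule:
    """Parse the JSON rule file and apply one illustrative check."""
    data = json.loads(raw)
    if not data["taskIds"]:
        # Hypothetical check: a rule must refer to at least one task (node).
        raise ValueError("a monitoring rule needs at least one task id")
    return MonitoringRule(
        name=data["name"],
        alarm_business_type=data["alarmBusinessType"],
        task_ids=data["taskIds"],
        triggers=data["myTriggers"],
        sender_types=data["senderTypes"],
        receivers=data["receivers"],
    )


rule = parse_rule(RAW)
print(rule.name, rule.task_ids)
```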
In an embodiment of the present application, the S220 includes:
S221, acquiring all direct upstream nodes of the leaf node.
For example, if node 7 is the leaf node, the direct upstream nodes of the leaf node are node 3 and node 5. Note that this step concerns "direct upstream nodes" rather than "upstream nodes"; upstream nodes include both direct and indirect upstream nodes.
The direct upstream nodes of node 7 are node 3 and node 5. The indirect upstream nodes of node 7 are node 2, node 4, and node 1.
S222, acquiring the planned start time of the leaf node.
Specifically, the planned start time is a preset time value representing when the node is scheduled to start.
S223, calculating the predicted end time of each direct upstream node of the leaf node by a stack method.
Specifically, if node 7 is the leaf node, the predicted end time of node 3 and the predicted end time of node 5 are calculated by the stack method.
S224, calculating the predicted start time of the leaf node according to Formula 1:
Ti_predict_start = Max(Ti_plan, Ti_i1_end, Ti_i2_end, ..., Ti_im_end)    (Formula 1)
wherein Ti_predict_start is the predicted start time of the leaf node, Ti_plan is the planned start time of the leaf node, Ti_i1_end to Ti_im_end are the predicted end times of the direct upstream nodes of the leaf node, i is the sequence number of the leaf node, and i1 to im are the sequence numbers of the direct upstream nodes of the leaf node.
Specifically, the predicted start time of leaf node 7 = Max(planned start time of node 7, predicted end time of node 3, predicted end time of node 5).
S225, calculating the predicted end time of the leaf node according to Formula 2:
Ti_predict_end = Ti_predict_start + Ti_time    (Formula 2)
wherein Ti_predict_end is the predicted end time of the leaf node, Ti_predict_start is the predicted start time of the leaf node, Ti_time is the expected running duration of the leaf node, and i is the sequence number of the leaf node.
Specifically, the predicted end time of leaf node 7 = the predicted start time of leaf node 7 + the expected running duration of leaf node 7. The expected running duration is a preset value; how it is set has been explained above.
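Formulas 1 and 2 can be illustrated with a short sketch. The numeric values are invented for illustration and expressed as minutes after midnight; the function names are likewise assumptions.

```python
def predicted_start(plan_start: int, upstream_ends: list[int]) -> int:
    """Formula 1: the latest of the planned start and all upstream predicted ends."""
    return max([plan_start, *upstream_ends])


def predicted_end(start: int, duration: int) -> int:
    """Formula 2: predicted end = predicted start + expected running duration."""
    return start + duration


# Illustrative values in minutes after midnight (assumptions, not from the source).
plan_7 = 17 * 60                               # node 7 planned for 17:00
ends_upstream = [16 * 60 + 30, 17 * 60 + 20]   # predicted ends of nodes 3 and 5
start_7 = predicted_start(plan_7, ends_upstream)
end_7 = predicted_end(start_7, 45)             # node 7 expected to run 45 minutes
print(start_7, end_7)
```

In this example node 5 is predicted to end at 17:20, after node 7's planned start, so node 7 is predicted to start at 17:20 (1040) and end at 18:05 (1085).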
In an embodiment of the present application, S223 includes:
S223a, traversing all upstream nodes of each direct upstream node of the leaf node in the DAG graph, obtaining the most upstream node among them, and calculating the predicted start time of each direct upstream node by a stack method, working from the most upstream node downward.
Specifically, to calculate the predicted start time of node 3, the predicted end time of node 2 must be known, because the predicted start time of node 3 = Max(planned start time of node 3, predicted end time of node 2). The predicted end time of node 2 = the predicted start time of node 2 + the expected running duration of node 2. To know the predicted start time of node 2, the predicted end time of node 1 must in turn be known, because the predicted start time of node 2 = Max(planned start time of node 2, predicted end time of node 1). The calculation thus traces back to the most upstream node, node 1. The predicted end time of node 1 = the predicted start time of node 1 + the expected running duration of node 1. Node 1 has no direct upstream node, so the predicted start time of node 1 = the planned start time of node 1.
The predicted start time of node 5 is calculated in the same way as that of node 3, and the calculation likewise traces back to the most upstream node, node 1.
It follows that the predicted start time of node 1 must be known before the predicted start times of node 3 and node 5 can be calculated.
The whole calculation therefore reduces to calculating the predicted start time of each node in order from the most upstream node downward.
A recursive method is computationally inefficient and is suitable only when the number of nodes is small. If the DAG graph has a massive number of nodes, so that the nodes to be monitored are also massive in number and the leaf nodes lie far downstream, a recursive calculation is difficult, costly, and inefficient. The present application therefore uses a stack method for the calculation.
S223b, calculating the predicted end time of each direct upstream node of the leaf node according to Formula 3:
Tim_predict_end = Tim_predict_start + Tim_time    (Formula 3)
wherein Tim_predict_end is the predicted end time of direct upstream node im of leaf node i (the value written Ti_im_end in Formula 1), Tim_predict_start is the predicted start time of direct upstream node im of leaf node i, Tim_time is the expected running duration of direct upstream node im of leaf node i, im is the sequence number of the direct upstream node of leaf node i, and i is the sequence number of the leaf node.
Specifically, Formula 3 follows the same principle as Formula 2.
In an embodiment of the present application, S220 further includes:
S226, calculating the early-warning end time and early-warning start time of the leaf node according to Formula 4:
Ti_warning_end = Ti_C + Ti_allowance
Ti_warning_start = Ti_warning_end - Ti_time    (Formula 4)
wherein Ti_warning_start is the early-warning start time of the leaf node, Ti_warning_end is the early-warning end time of the leaf node, Ti_time is the expected running duration of the leaf node, Ti_C is the commitment time of the leaf node, Ti_allowance is the time margin of the leaf node, and i is the sequence number of the leaf node.
S227, calculating the broken-line end time and broken-line start time of the leaf node according to Formula 5:
Ti_broken_end = Ti_C
Ti_broken_start = Ti_broken_end - Ti_time    (Formula 5)
wherein Ti_broken_start is the broken-line start time of the leaf node, Ti_broken_end is the broken-line end time of the leaf node, Ti_time is the expected running duration of the leaf node, Ti_C is the commitment time of the leaf node, and i is the sequence number of the leaf node.
S228, traversing all upstream nodes associated with the leaf node in the DAG graph and, starting from the leaf node and working upward, calculating by a stack method the early-warning end time, early-warning start time, broken-line end time, and broken-line start time of each upstream node associated with the leaf node in the DAG graph.
Specifically, early warning and broken line are the two alarm states used in the present application. In this embodiment, the broken-line start time of each node is later than or equal to its early-warning start time.
In this embodiment, Formula 4 is used to calculate the early-warning start time and early-warning end time of the leaf node, and Formula 5 is used to calculate the broken-line start time and broken-line end time of the leaf node.
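The following sketch evaluates Formulas 4 and 5 for a leaf node exactly as they are printed in this text. The function name, the dictionary keys, and the numeric values (minutes after midnight) are assumptions. Note that, with a positive time margin, the printed Formula 4 places the early-warning start later than the broken-line start, which differs from the ordering stated in this embodiment, so the sign of the margin term may need to be checked against the original filing.

```python
def leaf_windows(commit: int, allowance: int, duration: int) -> dict[str, int]:
    """Evaluate Formulas 4 and 5 for a leaf node, as printed in the text."""
    warning_end = commit + allowance      # Formula 4: Ti_C + Ti_allowance
    broken_end = commit                   # Formula 5: Ti_C
    return {
        "warning_start": warning_end - duration,
        "warning_end": warning_end,
        "broken_start": broken_end - duration,
        "broken_end": broken_end,
    }


# Illustrative values (assumptions): commitment 18:00, margin 30 min, run 45 min.
windows = leaf_windows(commit=18 * 60, allowance=30, duration=45)
print(windows)
```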
In an embodiment of the present application, S223a includes:
S223a1, creating stack one and stack two.
S223a2, placing the leaf node on stack one.
S223a3, extracting all direct upstream nodes of the leaf node and placing them on stack one.
S223a4, extracting the nodes one level further upstream and placing each of them on stack one.
S223a5, iteratively executing S223a4 until the most upstream node is placed on stack one.
S223a6, calculating the predicted start time of each node in stack one according to the last-in, first-out principle; after the predicted start time of a node is calculated, the node is moved out of stack one and placed on stack two, until stack one is empty.
Specifically, fig. 5 again takes node 7 as the leaf node, so stack one contains only node 1, node 2, node 3, node 4, node 5, and node 7.
A stack works on the last-in, first-out principle, so node 1, which is placed on stack one last, is calculated first; that is, the predicted start time of node 1 is calculated first.
Similarly, node 7, which is placed on stack one first, is calculated last; the predicted start time of node 7 is the last to be calculated.
Processing the nodes with a stack improves efficiency: even when the number of nodes reaches hundreds or thousands, the calculation does not overflow the stack and remains efficient.
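One way the two-stack procedure of S223a1 to S223a6 could be realized is sketched below. The graph, planned start times, and durations are hypothetical (times in minutes after midnight). The sketch also makes one assumption beyond the text: when a node can be reached from the leaf along several paths, it is ordered by its longest upstream distance from the leaf, so that every node sits above all of its downstream nodes on stack one.

```python
# Hypothetical graph and timings (assumptions for illustration).
# UPSTREAM maps a node to its direct upstream nodes.
UPSTREAM = {1: [], 2: [1], 3: [2], 4: [1], 5: [4], 7: [3, 5]}
PLAN = {1: 600, 2: 660, 3: 780, 4: 660, 5: 840, 7: 1020}
DURATION = {1: 90, 2: 60, 3: 120, 4: 200, 5: 100, 7: 45}


def fill_stack_one(leaf, upstream):
    """Order the monitored region with the leaf at the bottom of the stack."""
    depth = {leaf: 0}             # longest upstream distance from the leaf
    layer = [leaf]
    while layer:
        next_layer = []
        for node in layer:
            for up in upstream[node]:
                if depth.get(up, -1) < depth[node] + 1:
                    depth[up] = depth[node] + 1
                    next_layer.append(up)
        layer = next_layer
    return sorted(depth, key=depth.get)   # most upstream node ends up on top


def forward_pass(leaf, upstream, plan, duration):
    """Pop stack one, apply Formulas 1 to 3, and push each node onto stack two."""
    stack_one = fill_stack_one(leaf, upstream)
    stack_two, start, end = [], {}, {}
    while stack_one:
        node = stack_one.pop()                        # last in, first out
        start[node] = max([plan[node], *(end[u] for u in upstream[node])])
        end[node] = start[node] + duration[node]
        stack_two.append(node)
    return start, end, stack_two


start, end, stack_two = forward_pass(7, UPSTREAM, PLAN, DURATION)
print(start[7], end[7], stack_two)
```

Because the loop uses an explicit list as the stack, its memory use grows with the number of nodes rather than with the call depth, which is the property the text contrasts with recursion.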
In an embodiment of the present application, S228 includes:
S228a, calculating the early-warning end time, early-warning start time, broken-line end time, and broken-line start time of each node in stack two by Formula 6 according to the last-in, first-out principle; after these four times are calculated for a node, the node is removed from stack two.
Tk_warning_end = Min(Tk1_warning_start, Tk2_warning_start, ..., Tkn_warning_start)
Tk_warning_start = Tk_warning_end - Tk_time
Tk_broken_end = Min(Tk1_broken_start, Tk2_broken_start, ..., Tkn_broken_start)
Tk_broken_start = Tk_broken_end - Tk_time    (Formula 6)
wherein k is the sequence number of a node, k1 to kn are the sequence numbers of the direct downstream nodes of node k, Tk_warning_end is the early-warning end time of node k, Tk_warning_start is the early-warning start time of node k, Tkn_warning_start is the early-warning start time of direct downstream node kn of node k, Tk_broken_end is the broken-line end time of node k, Tk_broken_start is the broken-line start time of node k, Tkn_broken_start is the broken-line start time of direct downstream node kn of node k, and Tk_time is the expected running duration of node k.
For the leaf node itself, the early-warning end time, early-warning start time, broken-line end time, and broken-line start time are taken directly from the results of Formula 4 and Formula 5.
Specifically, fig. 6 again takes node 7 as the leaf node, so stack two contains only node 1, node 2, node 3, node 4, node 5, and node 7. As shown in fig. 6, the calculation on stack two also follows the last-in, first-out principle: when the nodes are transferred from stack one to stack two, node 7 is placed on stack two last but is calculated first; that is, the early-warning end time, early-warning start time, broken-line end time, and broken-line start time of node 7 are calculated first.
Similarly, node 1 is placed on stack two first and is calculated last; its early-warning end time, early-warning start time, broken-line end time, and broken-line start time are the last to be calculated.
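The backward pass over stack two can be sketched as follows. The graph, durations, stack contents, and the leaf node's start times are hypothetical inputs (minutes after midnight); in particular, the leaf values are simply assumed rather than derived from Formulas 4 and 5, and the function name is an assumption.

```python
# Hypothetical inputs (assumptions for illustration).
UPSTREAM = {1: [], 2: [1], 3: [2], 4: [1], 5: [4], 7: [3, 5]}
DURATION = {1: 90, 2: 60, 3: 120, 4: 200, 5: 100, 7: 45}
STACK_TWO = [1, 4, 2, 5, 3, 7]    # bottom to top; the leaf was pushed last


def backward_pass(leaf, leaf_warning_start, leaf_broken_start,
                  stack_two, upstream, duration):
    """Apply Formula 6 to every non-leaf node, popping stack two (LIFO)."""
    # Derive each node's direct downstream nodes inside the monitored region.
    downstream = {node: [] for node in stack_two}
    for node in stack_two:
        for up in upstream[node]:
            downstream[up].append(node)

    warning_start = {leaf: leaf_warning_start}
    broken_start = {leaf: leaf_broken_start}
    stack = list(stack_two)
    while stack:
        node = stack.pop()
        if node == leaf:
            continue              # leaf values come from Formulas 4 and 5
        warning_end = min(warning_start[d] for d in downstream[node])
        broken_end = min(broken_start[d] for d in downstream[node])
        warning_start[node] = warning_end - duration[node]
        broken_start[node] = broken_end - duration[node]
    return warning_start, broken_start


warning_start, broken_start = backward_pass(
    7, 1005, 1035, STACK_TWO, UPSTREAM, DURATION)
print(warning_start[1], broken_start[1])
```

Taking the minimum over the direct downstream nodes means that a node feeding several branches inherits the tightest deadline among them; here node 1 is constrained by the branch through node 4 rather than the branch through node 2.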
In an embodiment of the present application, S310 includes:
s311, obtaining the current time, and obtaining the early warning start time and the broken line start time of the node to be monitored.
And S312, judging whether the current time is greater than or equal to the early warning starting time of the node to be monitored.
S313, if the current time is greater than or equal to the early warning start time of the node to be monitored, further determining whether the current time is greater than or equal to the break start time of the node to be monitored.
And S314, if the current time is greater than or equal to the break start time of the node to be monitored, marking the monitoring rule instance corresponding to the node to be monitored as a break state, and outputting a break message.
And S315, if the current time is less than the break start time of the node to be monitored, marking the monitoring rule instance corresponding to the node to be monitored as an early warning state, and outputting an early warning message.
Specifically, the monitoring logic of this embodiment is as follows. If the current time is greater than or equal to the early warning start time of the node to be monitored but less than its line breaking start time, a light alarm is required: an early warning is issued, but the line is not yet broken.
If the current time is greater than or equal to both the early warning start time and the line breaking start time of the node to be monitored, a heavy alarm is required: the line is broken in addition to the early warning.
A broken line is a more serious alarm condition than an early warning.
Optionally, after S312, S310 further includes:
if the current time is less than the early warning start time of the node to be monitored, the node state is considered safe, and the method returns to S312 to continue monitoring.
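The decision logic of S311 to S315, together with the optional safe branch, can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the `NodeState` enum and the `classify_node` function are hypothetical names, and all times are assumed to be `datetime` values.

```python
from datetime import datetime
from enum import Enum


class NodeState(Enum):
    SAFE = "safe"
    EARLY_WARNING = "early_warning"
    LINE_BROKEN = "line_broken"


def classify_node(now: datetime, warning_start: datetime, broken_start: datetime) -> NodeState:
    """Classify one node to be monitored (hypothetical helper)."""
    # S312: before the early warning start time the node is considered safe.
    if now < warning_start:
        return NodeState.SAFE
    # S313/S314: at or past the line breaking start time, the heavier alarm applies.
    if now >= broken_start:
        return NodeState.LINE_BROKEN
    # S315: past the early warning start time but before the line breaking start time.
    return NodeState.EARLY_WARNING
```

A caller would mark the monitoring rule instance with the returned state and send the corresponding message; when the state is `SAFE`, it would simply check again later.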
The application also provides a monitoring system of the DAG graph.
As shown in fig. 3, in an embodiment of the present application, a monitoring system of a DAG graph includes at least one client 100 and a server 200.
The server 200 is communicatively connected to each client 100, and the server 200 is configured to perform the foregoing DAG graph monitoring method.
The technical features of the embodiments described above may be combined arbitrarily, and the order of execution of the method steps is not limited. For brevity, not all possible combinations of the technical features in the embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered to fall within the scope of this description.
The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, they should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A method for monitoring a DAG graph, the method comprising:
the server acquires a monitoring configuration file and a monitoring rule file sent by a client;
the server generates at least one monitoring rule instance according to the monitoring configuration file and the monitoring rule file and stores the monitoring rule instance in a database;
the server scans at least one monitoring rule instance generated in the database on the previous day, monitors the running state of the DAG graph according to the at least one monitoring rule instance, and alarms in real time when the running state of the DAG graph is abnormal;
the server generates at least one monitoring rule instance according to the monitoring rule file and stores the monitoring rule instance in a database, and the method comprises the following steps:
the server acquires the leaf nodes according to the monitoring configuration file and acquires all upstream nodes which are associated with the leaf nodes in the DAG graph;
the server calculates the monitoring indexes of the leaf nodes and the monitoring indexes of each upstream node which is associated with the leaf nodes in the DAG graph according to the monitoring configuration file;
the server scans at least one monitoring rule instance generated in the database on the previous day, monitors the running state of the DAG graph according to the at least one monitoring rule instance, and alarms in real time when the running state of the DAG graph is abnormal, and the method comprises the following steps:
the server takes the node corresponding to each monitoring rule instance as a node to be monitored, when the node to be monitored is a leaf node, the leaf node is monitored according to the monitoring index of the leaf node, when the node to be monitored is an upstream node which is associated with the leaf node in the DAG graph, the server monitors the upstream node which is associated with the leaf node in the DAG graph according to the monitoring index of the upstream node which is associated with the leaf node in the DAG graph, and when any one node to be monitored is abnormal, real-time alarm is given to the abnormal situation.
2. The method of monitoring a DAG graph of claim 1, wherein the monitoring profile includes one or more of a commitment time for each node, an expected run length for each node, and a time margin for each node; the time margin is the maximum delay time of the node for receiving data.
3. The monitoring method for a DAG graph according to claim 1, wherein the monitoring rule file includes one or more of an alarm name, an alarm type, a node ID, an alarm triggering manner, an alarm message sending manner, and alarm message receiver information.
4. The DAG graph monitoring method according to claim 2, wherein the server calculates the monitoring metrics of the leaf nodes and the monitoring metrics of each upstream node associated with the leaf nodes in the DAG graph according to the monitoring configuration file, and the method comprises the following steps:
acquiring all direct upstream nodes of the leaf nodes;
acquiring the planned starting time of the leaf node;
calculating the predicted end time of each direct upstream node of the leaf nodes by adopting a stack method;
calculating the predicted start time of the leaf node according to formula 1;
Ti_predict_start = Max(Ti_plan, Ti_i1_end, Ti_i2_end, ..., Ti_im_end)    formula 1;
wherein Ti_predict_start is the predicted start time of the leaf node, Ti_plan is the planned start time of the leaf node, Ti_im_end is the predicted end time of a direct upstream node of the leaf node, i is the serial number of the leaf node, and im is the serial number of the direct upstream node of the leaf node;
calculating the predicted end time of the leaf node according to formula 2;
Ti_predict_end = Ti_predict_start + Ti_time    formula 2;
wherein Ti_predict_end is the predicted end time of the leaf node, Ti_predict_start is the predicted start time of the leaf node, Ti_time is the predicted running duration of the leaf node, and i is the serial number of the leaf node.
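As a worked illustration of formulas 1 and 2, the following sketch computes the predicted start and end times of a leaf node. The function name `predict_leaf_times` and the choice to express times as minutes since midnight are assumptions made for this example only.

```python
def predict_leaf_times(plan_start, upstream_ends, run_time):
    """Return (Ti_predict_start, Ti_predict_end) for one leaf node.

    plan_start    -- Ti_plan, planned start time (minutes since midnight, assumed unit)
    upstream_ends -- predicted end times of the direct upstream nodes (Ti_i1_end ... Ti_im_end)
    run_time      -- Ti_time, predicted running duration in minutes
    """
    # Formula 1: the leaf cannot start before its plan or before every upstream node ends.
    predict_start = max([plan_start, *upstream_ends])
    # Formula 2: the predicted end is the predicted start plus the running duration.
    predict_end = predict_start + run_time
    return predict_start, predict_end


# Hypothetical numbers: planned at 01:00, upstream nodes ending at 00:50 and 01:30.
start, end = predict_leaf_times(60, [50, 90], 30)
```

With these hypothetical inputs the later upstream node dominates, so the leaf is predicted to start at minute 90 and end at minute 120.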
5. The method of monitoring a DAG graph as recited in claim 4, wherein the computing the predicted end time of each immediately upstream node of the leaf nodes using a stack approach comprises:
traversing all upstream nodes of each direct upstream node of the leaf nodes in the DAG graph, acquiring the most upstream node of all upstream nodes of each direct upstream node of the leaf nodes, and calculating the predicted starting time of the direct upstream nodes by adopting a stacking method from the most upstream node according to the sequence from top to bottom;
calculating the predicted end time of each direct upstream node of the leaf nodes according to formula 3;
Tim_predict_end = Tim_predict_start + Tim_time    formula 3;
wherein Tim_predict_end is the predicted end time of the direct upstream node im of leaf node i, Tim_predict_start is the predicted start time of the direct upstream node im of leaf node i, Tim_time is the predicted running duration of the direct upstream node im of leaf node i, im is the serial number of the direct upstream node of leaf node i, and i is the serial number of the leaf node.
6. The DAG graph monitoring method according to claim 5, wherein the server calculates the monitoring metrics of the leaf nodes and the monitoring metrics of each upstream node associated with the leaf nodes in the DAG graph according to the monitoring configuration file, further comprising:
calculating early warning ending time and early warning starting time of the leaf nodes according to a formula 4;
Ti_warning_end = Ti_C + Ti_allowance
Ti_warning_start = Ti_warning_end - Ti_time    formula 4;
wherein Ti_warning_start is the early warning start time of the leaf node, Ti_warning_end is the early warning end time of the leaf node, Ti_time is the predicted running duration of the leaf node, Ti_C is the commitment time of the leaf node, Ti_allowance is the time margin of the leaf node, and i is the serial number of the leaf node;
calculating the line breaking end time and the line breaking start time of the leaf node according to a formula 5;
Ti_broken_end = Ti_C
Ti_broken_start = Ti_broken_end - Ti_time    formula 5;
wherein Ti_broken_start is the line breaking start time of the leaf node, Ti_broken_end is the line breaking end time of the leaf node, Ti_time is the predicted running duration of the leaf node, Ti_C is the commitment time of the leaf node, and i is the serial number of the leaf node;
traversing all upstream nodes which are associated with the leaf nodes in the DAG graph, and calculating early warning end time, early warning start time, line breaking end time and line breaking start time of each upstream node which is associated with the leaf nodes in the DAG graph by adopting a stacking method from the leaf nodes in the order from bottom to top.
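Formulas 4 and 5 of claim 6 can be transcribed directly into code. The sketch below follows the formulas exactly as written in the claim; the function name `leaf_thresholds`, the dictionary keys, and the use of minutes as the time unit are assumptions for illustration, and the sign convention of the time margin is taken from the claim text without further interpretation.

```python
def leaf_thresholds(commit_time, allowance, run_time):
    """Compute the four monitoring indexes of a leaf node (hypothetical helper).

    commit_time -- Ti_C, commitment time of the leaf node
    allowance   -- Ti_allowance, time margin of the leaf node
    run_time    -- Ti_time, predicted running duration of the leaf node
    """
    # Formula 4: early warning end and start times.
    warning_end = commit_time + allowance
    warning_start = warning_end - run_time
    # Formula 5: line breaking end and start times.
    broken_end = commit_time
    broken_start = broken_end - run_time
    return {
        "warning_end": warning_end,
        "warning_start": warning_start,
        "broken_end": broken_end,
        "broken_start": broken_start,
    }
```

These leaf values are the starting point for the bottom-up calculation over the upstream nodes described in the final step of claim 6.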
7. The method of monitoring a DAG graph as claimed in claim 6, wherein traversing all upstream nodes of each of the immediately upstream nodes of the leaf nodes in the DAG graph, obtaining a most upstream node of all upstream nodes of each of the immediately upstream nodes of the leaf nodes, and calculating an expected start time of the immediately upstream node in a stack method in order from top to bottom from the most upstream node, comprises:
creating a first stack and a second stack;
placing leaf nodes on a first stack;
extracting all direct upstream nodes of the leaf nodes, and placing all the direct upstream nodes of the leaf nodes on a first stack;
extracting each further upstream node, and placing each further upstream node on a first stack;
repeatedly executing the extracting each further upstream node, placing each further upstream node on stack one until the most upstream node is placed on stack one;
and calculating the expected starting time of each node in the first stack according to the principle of first-in, last-out and last-in, first-out, and after calculating the expected starting time of one node, moving the node out of the first stack and placing the node into the second stack until the first stack is emptied.
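The two-stack procedure of claim 7 can be sketched as below. This is one possible reading, not the applicant's code: the graph shape, the node names, and the function `forward_pass` are hypothetical, times are plain minute counts, and the handling of a node reached along several paths (it is pushed again at each level and computed only the first time it is popped) is an assumption added so that every upstream node is computed before its downstream node.

```python
def forward_pass(leaf, upstream, plan_start, run_time):
    """Compute predicted start/end times top-down using stack one and stack two.

    upstream   -- dict: node -> list of its direct upstream nodes
    plan_start -- dict: node -> planned start time
    run_time   -- dict: node -> predicted running duration
    """
    # Fill stack one: the leaf first, then each further level of upstream nodes.
    stack_one = [leaf]
    frontier = [leaf]
    while frontier:
        level = []
        for node in frontier:
            level.extend(upstream.get(node, []))
        frontier = list(dict.fromkeys(level))  # de-duplicate within one level, keep order
        stack_one.extend(frontier)

    # Empty stack one (last in, first out), moving each computed node to stack two.
    stack_two, predict_start, predict_end = [], {}, {}
    while stack_one:
        node = stack_one.pop()
        if node in predict_start:  # already computed via a longer path
            continue
        ends = [predict_end[u] for u in upstream.get(node, [])]
        predict_start[node] = max([plan_start[node], *ends])
        predict_end[node] = predict_start[node] + run_time[node]
        stack_two.append(node)
    return predict_start, predict_end, stack_two


# Hypothetical diamond-shaped DAG: A feeds B and C, which both feed leaf D.
upstream = {"D": ["B", "C"], "B": ["A"], "C": ["A"]}
plan_start = {"A": 0, "B": 5, "C": 30, "D": 0}
run_time = {"A": 10, "B": 20, "C": 5, "D": 15}
predict_start, predict_end, stack_two = forward_pass("D", upstream, plan_start, run_time)
```

In this example the most upstream node `A` is computed first and the leaf `D` last, so `D` ends up on top of stack two, ready for the bottom-up calculation.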
8. The method for monitoring the DAG graph according to claim 7, wherein traversing all upstream nodes associated with leaf nodes in the DAG graph, and calculating the early warning end time, the early warning start time, the line breaking end time and the line breaking start time of each upstream node associated with leaf nodes in the DAG graph by using a stack method in the order from bottom to top from the leaf nodes comprises:
according to the principle of first-in, last-out and last-in, first-out, calculating the early warning end time, the early warning start time, the line breaking end time and the line breaking start time of each node in the second stack by adopting formula 6, and removing the node from the second stack after calculating the early warning end time, the early warning start time, the line breaking end time and the line breaking start time of the node;
Tk_warning_end = Min(Tk1_warning_start, Tk2_warning_start, ..., Tkn_warning_start)
Tk_warning_start = Tk_warning_end - Tk_time
Tk_broken_end = Min(Tk1_broken_start, Tk2_broken_start, ..., Tkn_broken_start)
Tk_broken_start = Tk_broken_end - Tk_time    formula 6;
wherein k is the serial number of a node, kn is the serial number of a direct downstream node of node k, Tk_warning_end is the early warning end time of node k, Tk_warning_start is the early warning start time of node k, Tkn_warning_start is the early warning start time of the direct downstream node kn of node k, Tk_broken_end is the line breaking end time of node k, Tk_broken_start is the line breaking start time of node k, Tkn_broken_start is the line breaking start time of the direct downstream node kn of node k, and Tk_time is the predicted running duration of node k;
when the early warning end time, the early warning start time, the line breaking end time and the line breaking start time of the leaf node are calculated, the calculation results of the formula 4 and the formula 5 are directly used.
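The bottom-up calculation of claim 8 can be sketched as follows. The function `backward_pass`, the node names, and all numeric values are hypothetical; the sketch also assumes that `downstream` lists only direct downstream nodes that belong to the same leaf's sub-graph, and that the leaf's own values were already obtained from formulas 4 and 5.

```python
def backward_pass(stack_two, downstream, run_time, leaf_times):
    """Apply formula 6 to every node on stack two, leaf first.

    stack_two  -- nodes in the order they were placed on stack two (leaf last)
    downstream -- dict: node -> list of its direct downstream nodes
    run_time   -- dict: node -> predicted running duration (Tk_time)
    leaf_times -- dict: leaf -> its four times from formulas 4 and 5
    """
    times = {}
    stack = list(stack_two)  # copy so the caller's list is not consumed
    while stack:
        k = stack.pop()  # last in, first out: the leaf comes off first
        if k in leaf_times:
            times[k] = dict(leaf_times[k])  # reuse the formula 4 and 5 results
            continue
        # Formula 6: a node must finish before the earliest start of its downstream nodes.
        warning_end = min(times[d]["warning_start"] for d in downstream[k])
        broken_end = min(times[d]["broken_start"] for d in downstream[k])
        times[k] = {
            "warning_end": warning_end,
            "warning_start": warning_end - run_time[k],
            "broken_end": broken_end,
            "broken_start": broken_end - run_time[k],
        }
    return times


# Hypothetical diamond-shaped DAG: A feeds B and C, which both feed leaf D.
downstream = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
run_time = {"A": 10, "B": 20, "C": 5, "D": 15}
leaf_times = {"D": {"warning_end": 415, "warning_start": 400,
                    "broken_end": 435, "broken_start": 420}}
times = backward_pass(["A", "C", "B", "D"], downstream, run_time, leaf_times)
```

Because `B` runs longer than `C`, it imposes the tighter deadline on `A`: the early warning end time of `A` equals the early warning start time of `B`.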
9. The method as claimed in claim 8, wherein the server takes the node corresponding to each monitoring rule instance as a node to be monitored, and when the node to be monitored is a leaf node, monitors the leaf node according to the monitoring index of the leaf node, and when the node to be monitored is an upstream node associated with the leaf node in the DAG graph, the server monitors the upstream node associated with the leaf node in the DAG graph according to the monitoring index of the upstream node associated with the leaf node in the DAG graph, and when any one node to be monitored has an abnormal condition, alarms in real time, including:
acquiring current time, and acquiring early warning start time and line breaking start time of a node to be monitored;
judging whether the current time is greater than or equal to the early warning starting time of the node to be monitored;
if the current time is greater than or equal to the early warning starting time of the node to be monitored, further judging whether the current time is greater than or equal to the line breakage starting time of the node to be monitored;
if the current time is greater than or equal to the line breaking starting time of the node to be monitored, marking the monitoring rule instance corresponding to the node to be monitored as a line breaking state, and outputting a line breaking message;
and if the current time is less than the line breaking starting time of the node to be monitored, marking the monitoring rule instance corresponding to the node to be monitored as an early warning state, and outputting an early warning message.
10. A monitoring system for a DAG graph, comprising:
at least one client;
a server communicatively coupled to each client, the server configured to perform the method of monitoring a DAG graph as recited in any of claims 1-9.
CN202211052740.0A 2022-08-31 2022-08-31 DAG graph monitoring method and system Pending CN115134224A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211052740.0A CN115134224A (en) 2022-08-31 2022-08-31 DAG graph monitoring method and system

Publications (1)

Publication Number Publication Date
CN115134224A true CN115134224A (en) 2022-09-30

Family

ID=83387150

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211052740.0A Pending CN115134224A (en) 2022-08-31 2022-08-31 DAG graph monitoring method and system

Country Status (1)

Country Link
CN (1) CN115134224A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220930
