CN113535528A - Log management system, method and medium for distributed graph iterative computation operation

Info

Publication number: CN113535528A
Application number: CN202110728761.9A
Authority: CN
Inventors: 王志刚; 涂懿磊; 殷波; 王宁; 聂婕; 宋德海; �田�浩
Original assignee: Ocean University of China
Current assignee: Ocean University of China
Priority date: 2021-06-29
Filing date: 2021-06-29
Publication date: 2021-10-22
Anticipated expiration: 2041-06-29
Also published as: CN113535528B

Abstract

The invention discloses a log management system, method, and medium for distributed graph iterative computation jobs. When a fault occurs after a distributed graph iterative computation job has started, the fault is traced using a log incremental change analysis tracing method based on a unified time measurement standard: the incremental change of each node's log is continuously monitored, and the order in which the nodes' logs stop updating is judged with the master node's time as the reference, yielding candidate fault source nodes. After fault tracing, log analysis during program debugging is optimized: key log information is collected for debugging by migrating the retrieval command to the nodes and executing it in a distributed manner. During distributed graph iterative computation, iteration step information is viewed in real time through an incremental retrieval method. With the invention, once the user has determined the node where the fault source is located, the running details of the program can be quickly tracked and analyzed to complete program debugging.

Description

Log management system, method and medium for distributed graph iterative computation operation

Technical Field

The invention belongs to the technical field of data processing, relates to a log management method, and particularly relates to a log management system, a log management method and a log management medium for distributed graph iterative computation operation.

Background

A distributed graph iterative computing system adopts a Master-Slave architecture, as shown in fig. 1. A job is divided into multiple tasks that are completed jointly by several machines in a cluster; one machine is selected as the master node (Master), and the rest are working nodes (Slaves). Each working node periodically reports the processing progress of the data it is responsible for to the master node, and the master node aggregates these reports and shows the user the progress of the whole job. This periodic reporting mechanism, commonly called the "heartbeat" mechanism, can be used for management and monitoring in a master-slave architecture.

Because distributed large-graph computing platforms are highly encapsulated, users cannot analyze the running process of the job they submit with the tools available in a single-machine programming environment, such as single-step debugging and variable watching. In addition, the instability of cross-machine network connections on a distributed platform and the nondeterminism of multithreaded concurrent computation make graph iterative computation jobs harder to debug.

At present, the main means of debugging a distributed program is to print log information while the program runs. However, the logs of a distributed system are scattered across the working nodes, and each iteration step of a graph computation may require inspecting the corresponding information to judge whether the program is running correctly, so the logs of many nodes must be checked frequently during debugging. The complexity of cross-node log retrieval, together with the redundant information that log files carry across different iteration steps, lowers debugging efficiency. Furthermore, because a graph algorithm usually visits vertices along outgoing edges, the subgraphs held by different working nodes are strongly coupled. When one working node becomes abnormal, error reports appear in the logs of several physical machines, and these reports interfere with identifying the "first abnormal error report"; that is, the abnormal machine cannot be located quickly and the fault cannot be traced.

Fault tracing in distributed systems is mainly divided into rule-based methods and model-based methods. Both need to analyze the system's log information over a long period and extract knowledge from it in order to establish the rules or the model. At present, the log management facilities of mainstream distributed computing platforms are rarely designed for graph computing systems, let alone for individual graph computing jobs. Existing methods target general distributed systems, and applying their log storage approach to a distributed graph computing system has the following drawbacks. When a graph computing system fails, only the log records of the current job, which are distributed across the cluster, need to be inspected to find the error, and the logs can be deleted once the error has been eliminated; they do not need to be kept. Storing all logs takes storage space, and collecting them incurs communication overhead. In addition, a graph algorithm job differs from other distributed jobs in that it performs many iterations, each of which produces some information. During debugging, the log information output in real time must be inspected while the iterations proceed, so the logs cannot simply be stored and inspected once; they must be inspected many times, and if all of the job's log information is transmitted on every collection, logs are transmitted redundantly.

In summary, existing distributed graph computing systems provide only a simple log recording function and do not support fault tracing or log management during program debugging. For fault tracing in distributed systems, the prior art needs a large accumulation of past fault data, which is modeled to identify anomalies, so these methods depend heavily on prior knowledge. To address the low debugging efficiency of programs in distributed graph iterative computing systems, the invention therefore provides a log management method for distributed graph iterative computation jobs. Because each job is treated independently, the method needs neither a large amount of historical data nor modeling.

Disclosure of Invention

To address the defects of the prior art, the invention provides a log management system, method, and medium for distributed graph iterative computation jobs, approaching the problem from two directions: fault tracing and log retrieval. First, fault tracing is performed without modeling or predefined rules; the position of the source of the anomaly is judged solely from the update state of the logs. After fault tracing, log analysis during program debugging is optimized: the logs are searched on each working node, only the required content is sent to the Master, and the useless content is not sent, which saves a large amount of transmission time. Once the user has determined the node where the fault source is located, the running details of the program can be quickly tracked and analyzed to complete program debugging.

In order to solve the technical problems, the invention adopts the technical scheme that:

firstly, the invention provides a log management method facing to the distributed graph iterative computation operation, after the distributed graph iterative computation operation starts, a Master coordinates each node to load graph data to the local and starts computation, and the log management is realized by the following method:

(1) firstly, tracing the source of the fault by using a log incremental change analysis tracing method based on a unified time measurement standard: continuously monitoring the log increment change condition of each node through a heartbeat mechanism between a master node and a slave node, and judging the updating stop sequence of each node log by taking the time of the master node as a reference so as to give a candidate fault source node;

(2) after the failure tracing, optimizing log analysis in debugging the program, and collecting key log information for debugging through migrating and executing a retrieval command in a distributed manner;

and when the distributed graph is subjected to iterative computation, the iterative step information is checked in real time through an increment retrieval method.

Further, the specific operation steps of the log incremental change analysis traceability method based on the unified time measurement standard in the step (1) are as follows:

Step1, using the periodic reporting mechanism of the graph computing system, with a period of n milliseconds, on each working node to collect and report the log record update state of each node;

Step2, at every heartbeat, each node compares its current local log size with the log size at the end of the previous heartbeat to obtain the incremental change value Δ_log, and reports Δ_log to the Master;

Step3, judging whether an abnormality has occurred: when the graph computation job becomes abnormal, first checking the Δ_log values recorded by the Master. If working node i reports Δ_log = 0, working node i is considered to have a fault; if no Δ_log = 0 is found, reducing the heartbeat interval n to increase the fault tracing sensitivity and running the job again, until the point at which the log of the fault source stops updating is captured.

Further, in step (2), migrating and executing the retrieval command in a distributed manner means that, when a program of the distributed graph iterative computing system is debugged, each node first searches locally and then transmits the retrieved key log information to the Master, specifically as follows:

the Master sends a search command to the Slave working nodes; after receiving the command, each Slave searches its logs locally according to the command, so that the nodes run the retrieval command in a distributed manner;

each node returns only the key part of the log information, and the result is finally presented on the Master.

Further, in step (2), viewing the iteration step information by the incremental retrieval method includes: for one job, when the information of iteration step n is viewed for the first time, the required information of iteration step n is output directly, and an integer variable outIteration is then set to n to record that the information up to iteration step n has been output; if the information of iteration step m is requested next, first check whether m is larger than n. When m is larger than n, only the log information of iteration steps n+1 to m is output, and outIteration is then set to m; when m is smaller than n, the user is told that this log information has already been output and is referred to the log information of iteration step n.

The invention also provides a log management system for distributed graph iterative computation jobs, which is used for managing logs and comprises the following modules:

the log incremental change analysis traceability module is used for tracing the fault, continuously monitoring the log incremental change condition of each node through a heartbeat mechanism between a master node and a slave node, and judging the order of stopping updating of logs of each node by taking the time of the master node as a reference so as to give a candidate fault source node;

the distributed migration retrieval module is used for optimizing log analysis in debugging a program after fault tracing, and collecting key log information for debugging through migration and distributed execution of retrieval commands;

and the increment retrieval module is used for viewing the iteration step information in real time through an increment retrieval command during the iterative computation of the distributed graph.

The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the distributed graph iterative computation job oriented log management method as described above.

Compared with the prior art, the invention has the advantages that:

(1) When multiple nodes report errors under strong coupling, it is difficult to trace the node where the real fault lies. To address this, the invention provides a log incremental change analysis tracing method based on a unified time measurement standard. It exploits the fact that a node's local log stops updating after a fault occurs: the log incremental change of each node is continuously monitored through the heartbeat mechanism between the master node and the slave nodes, the order in which the nodes' logs stop updating is judged with the master node's time as the reference, and candidate fault source nodes are then given.

No modeling or predefined rules are required; the root of the anomaly is determined solely from the update state of the logs.

(2) To address the low efficiency of frequent cross-node log retrieval, the invention provides a data redundancy deletion method based on distributed incremental retrieval. Log collection is replaced by migrating the retrieval command and executing it in a distributed manner: the logs are searched on each working node, the required content is sent to the Master, and the useless content is not sent. This saves a large amount of transmission time and reduces the network overhead of redundant information, while distributed execution of the retrieval command improves retrieval efficiency. Once the user has determined the node where the fault source is located, the running details of the program can be quickly tracked and analyzed to complete program debugging.

(3) During iterative computation of the distributed graph, iterative step information is checked in real time through an incremental retrieval method, and repeated scanning of log information among different iterative steps is avoided.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive labor.

FIG. 1 is a prior art distributed graph iterative computing system architecture diagram;

FIG. 2 is a schematic diagram of log record update and reporting in accordance with the present invention;

FIG. 3 is a comparison of two log transmission strategies according to the present invention and the prior art;

FIG. 4 is a flowchart of a log management method for distributed graph-oriented iterative computation according to the present invention.

Detailed Description

The invention is further described with reference to the following figures and specific embodiments.

The embodiment provides a log management method for distributed graph iterative computation operation, after the distributed graph iterative computation operation starts, a Master coordinates each node to load graph data to the local and starts computation, and the log management is realized by the following method:

(1) firstly, tracing the source of the fault by using a log incremental change analysis tracing method based on a unified time measurement standard: by utilizing the characteristic that local logs of nodes are not updated any more after a fault occurs, the log increment change condition of each node is continuously monitored through a heartbeat mechanism between a master node and a slave node, the time of the master node is taken as a reference, the order of stopping updating of the logs of each node is judged, and then a candidate fault source node is given.

(2) After the failure is traced, the log analysis is optimized during the debugging of the program, and the log collection is replaced by migrating and executing the retrieval command in a distributed manner, so that the network transmission overhead of redundant information is reduced, the key log information is collected for debugging, and the retrieval efficiency is improved.

When the distributed graph is subjected to iterative computation, the iteration step information is checked in real time through an increment retrieval method, and repeated scanning of log information among different iteration steps is avoided.

The following are introduced from two aspects of fault tracing and log retrieval respectively:

Log incremental change analysis tracing method based on unified time measurement standard

In a distributed graph computing system, the working nodes communicate over the network during computation, and the output value of a sending working node is often the input value that a receiving working node needs for its next iteration. When working node i fails, the other working nodes cannot receive the input values they need and cannot continue computing, so the failure propagates through the whole system in subsequent iterations and the computing task fails. When the logs of a large number of working nodes record faults, the fault cannot be traced quickly. The invention uses the log incremental change analysis tracing method based on the system's unified time measurement standard, and considers that the working node that first reports an abnormal Δ_log is typically the originating node that caused the whole fault. In this way the fault source can be located quickly.

The key point is to judge where the fault source of the current job lies from the update state of the log records. While a task of the distributed graph computing system is running, each working node keeps appending to its log as the task progresses, and the log stops updating when the task ends, so the log content grows with the progress of the computing task and is directly proportional to it. The invention therefore detects anomalies from the update state of the log records. As shown in fig. 2, the specific operation steps are as follows:

Step1, the periodic reporting mechanism of the graph computing system, with a period of n milliseconds (the "heartbeat"), is used on each working node to collect and report the log record update state of each node. The purpose is to collect and report the update state of all nodes at the same time, so that every node is treated alike and no node under-reports or misses a report.

Step2, at every heartbeat, each node compares its current local log size with the log size at the end of the previous heartbeat to obtain the incremental change value Δ_log, and reports Δ_log to the Master.

Step3, judge whether an abnormality has occurred. When the graph computation job becomes abnormal, first check the Δ_log values recorded by the Master. If working node i reports Δ_log = 0, which differs greatly from the values reported by the other nodes, working node i is considered to have a fault; if no Δ_log = 0 is found, reduce the heartbeat interval n to increase the fault tracing sensitivity and run the job again, until the point at which the log of the fault source stops updating is captured.

The size of the "heartbeat" interval governs the balance between tracing sensitivity and detection time complexity, and needs to be selected according to specific requirements.

Step3 involves the cost of fault tracing. When n is set too small, the log update state is recorded and reported to the master node very frequently; although this improves the accuracy of fault tracing, it occupies limited computing resources and increases communication cost. If n is set too large, accuracy is reduced. A compromise is to use a larger n during routine computation to reduce resource consumption and, if no Δ_log = 0 is captured, to reduce n, increasing the sensitivity, and run the job again. At that point the main purpose is fault tracing rather than the computation itself, so computing resources are sacrificed to obtain tracing accuracy.
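The trade-off around the heartbeat interval can be illustrated with a small sketch. It is only a toy model under assumptions that are not in the original text: each node is assumed to log continuously until a known stop time, heartbeat k is assumed to cover the window ((k-1)·n, k·n], and the names `first_zero_heartbeat`, `candidate_sources`, `shrink_until_unique`, and `STOP_TIMES` are hypothetical. The model also treats a tie (several nodes first reporting Δ_log = 0 at the same heartbeat) as the signal to shrink n, which is one possible reading of "increasing sensitivity", not necessarily the patent's exact criterion.

```python
def first_zero_heartbeat(stop_ms: int, n_ms: int) -> int:
    """Index of the first heartbeat whose window contains no new log lines.

    Window k is ((k-1)*n_ms, k*n_ms]; it is empty once (k-1)*n_ms >= stop_ms,
    so k = ceil(stop_ms / n_ms) + 1 (integer ceiling, no floats).
    """
    return -(-stop_ms // n_ms) + 1


def candidate_sources(stop_times: dict, n_ms: int) -> list:
    """Nodes that report delta_log == 0 at the earliest heartbeat."""
    first_zero = {node: first_zero_heartbeat(t, n_ms)
                  for node, t in stop_times.items()}
    earliest = min(first_zero.values())
    return sorted(node for node, k in first_zero.items() if k == earliest)


def shrink_until_unique(stop_times: dict, n_ms: int, min_n_ms: int = 1):
    """Halve the heartbeat interval (one re-run per halving) until one candidate remains."""
    while n_ms > min_n_ms and len(candidate_sources(stop_times, n_ms)) > 1:
        n_ms //= 2
    return n_ms, candidate_sources(stop_times, n_ms)


# Hypothetical stop times in ms: slave1 fails first, the others stall shortly after.
STOP_TIMES = {"slave1": 1250, "slave2": 1310, "slave3": 1390}

print(candidate_sources(STOP_TIMES, 500))    # coarse interval: all three nodes tie
print(shrink_until_unique(STOP_TIMES, 500))  # finer interval isolates slave1
```

In this toy model a 500 ms heartbeat cannot separate nodes that stop within the same window, while a smaller interval can, at the price of more frequent reports to the Master.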

Data redundancy deletion method based on distributed incremental retrieval

When a program of a distributed graph iterative computing system is debugged, the log information is scattered across the computing nodes, so the user must work across nodes and inspect logs on different working nodes, repeating the same operations. Existing distributed log management systems are not suited to distributed graph computing jobs, and distributed graph computing systems themselves provide only a simple log recording function. In addition, during a single job the output information of the iteration steps must be inspected many times: after the information of iteration step n has been viewed, viewing iteration step n+i outputs the information of iteration step n again, so the retrieved iteration step information is redundant.

The invention solves these two problems by migrating and executing the retrieval command in a distributed manner and by the incremental retrieval command, as described below:

1. Migrating and executing the retrieval command in a distributed manner

The left side of fig. 3 shows the existing log transmission scheme, which cannot analyze logs efficiently; the right side shows the efficient method adopted by the invention. When a program of the distributed graph iterative computing system is debugged, to avoid going back and forth among the computing nodes, each node first searches locally and then transmits the retrieved key log information to the Master. The Master sends a search command to the Slave working nodes; after receiving the command, each Slave searches its logs locally according to the command, so the nodes run the retrieval command in a distributed manner. Compared with having the Master first collect the logs of the whole cluster and then search the log records of each node one by one, this greatly improves retrieval efficiency. Each node returns only the key part of the log information, and the result is finally presented on the Master. Because the log records also contain useless information, transmitting only the key logs greatly reduces the communication cost.
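A minimal sketch of the migrated search follows. It simulates the cluster inside a single process, so it only illustrates the data flow: the per-node logs in `NODE_LOGS`, the functions `local_search` and `migrated_search`, and the use of a thread pool in place of real network dispatch are all assumptions made for illustration and do not come from the patent.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical local log files, one list of lines per Slave.
NODE_LOGS = {
    "slave1": ["INFO superstep 3 start",
               "ERROR vertex 42 message buffer overflow",
               "INFO superstep 3 end"],
    "slave2": ["INFO superstep 3 start",
               "ERROR timeout waiting for slave1",
               "INFO retrying"],
    "slave3": ["INFO superstep 3 start",
               "INFO superstep 3 end"],
}


def local_search(node: str, pattern: str) -> list:
    """Runs on the Slave: scan only the local log and keep matching lines."""
    regex = re.compile(pattern)
    return [line for line in NODE_LOGS[node] if regex.search(line)]


def migrated_search(pattern: str) -> dict:
    """Runs on the Master: send the command to every Slave, gather only the hits."""
    nodes = list(NODE_LOGS)
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        results = pool.map(lambda node: local_search(node, pattern), nodes)
        return dict(zip(nodes, results))


def size_in_bytes(lines: list) -> int:
    return sum(len(line.encode("utf-8")) for line in lines)


hits = migrated_search(r"ERROR")
shipped = sum(size_in_bytes(lines) for lines in hits.values())
full = sum(size_in_bytes(lines) for lines in NODE_LOGS.values())
print(hits)
print(f"bytes shipped: {shipped} (matches only) vs {full} (whole logs)")
```

Only the matching lines travel back to the Master, so the volume transferred is bounded by the size of the hits rather than by the size of the logs.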

2. Incremental search command

Avoiding redundant display of iteration step information means that information which has already been output is not displayed again. For one job, when the information of iteration step n is viewed for the first time, there is no redundant output, so the required information of iteration step n is output directly, and an integer variable outIteration is then set to n to record that the information up to iteration step n has been output; if the information of iteration step m is requested next, first check whether m is larger than n. When m is larger than n, only the log information of iteration steps n+1 to m is output, and outIteration is then set to m; when m is smaller than n, the user is told that this log information has already been output and is referred to the log information of iteration step n.
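The incremental retrieval rule can be sketched as follows. The class `IncrementalViewer` and its in-memory `logs_by_step` mapping are hypothetical; `out_iteration` plays the role of the variable outIteration. Two details are assumptions because the text leaves them open: the first request for step n outputs steps 1 to n, and a request with m equal to n is treated the same as m smaller than n.

```python
class IncrementalViewer:
    """Outputs each iteration step's log lines at most once per job."""

    def __init__(self, logs_by_step: dict):
        self.logs_by_step = logs_by_step  # iteration step -> list of log lines
        self.out_iteration = 0            # highest step already output (0 = none)

    def view(self, m: int) -> list:
        if m <= self.out_iteration:
            # Already shown: prompt instead of repeating the output.
            return [f"log information up to iteration step "
                    f"{self.out_iteration} has already been output"]
        # Output only the steps that have not been shown yet.
        lines = [line
                 for step in range(self.out_iteration + 1, m + 1)
                 for line in self.logs_by_step.get(step, [])]
        self.out_iteration = m
        return lines


viewer = IncrementalViewer({1: ["step 1: 10 active vertices"],
                            2: ["step 2: 7 active vertices"],
                            3: ["step 3: 2 active vertices"]})
print(viewer.view(2))  # steps 1..2
print(viewer.view(3))  # only step 3
print(viewer.view(1))  # prompt: already output
```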

With reference to the flowchart shown in fig. 4, after the distributed graph computation job starts, the Master coordinates each node to load graph data locally and start computing. After each heartbeat interval, each Slave node compares its current local log size with the log size at the previous heartbeat to obtain the change value Δ_log, and reports Δ_log to the Master. The Master records these Δ_log values. If the job becomes abnormal, the Δ_log values recorded by the Master are examined: if the Δ_log last reported by a working node is 0, that node can be judged to be the root cause of the fault. This part is the fault tracing technique.
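The reporting and judging flow described above might look like the following single-process sketch. The classes `Worker` and `Master`, the tick counter used as the Master's unified time reference, and the `GROWTH` schedule of log growth per heartbeat are hypothetical stand-ins; a real system would carry Δ_log inside its existing heartbeat messages.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    log_lines: int = 0       # current size of the local log
    last_reported: int = 0   # log size at the previous heartbeat

    def heartbeat(self) -> int:
        """Return delta_log: growth of the local log since the last heartbeat."""
        delta = self.log_lines - self.last_reported
        self.last_reported = self.log_lines
        return delta


@dataclass
class Master:
    tick: int = 0                                      # Master-side clock, in heartbeats
    stopped_since: dict = field(default_factory=dict)  # node -> tick of first zero delta

    def record(self, reports: dict) -> None:
        self.tick += 1
        for name, delta in reports.items():
            if delta == 0:
                self.stopped_since.setdefault(name, self.tick)
            else:
                self.stopped_since.pop(name, None)     # log is growing again

    def candidate_sources(self) -> list:
        """Nodes whose logs stopped updating earliest by the Master's clock."""
        if not self.stopped_since:
            return []
        earliest = min(self.stopped_since.values())
        return sorted(n for n, t in self.stopped_since.items() if t == earliest)


# Hypothetical log growth per heartbeat: slave1 fails first, the others stall later.
GROWTH = [
    {"slave1": 5, "slave2": 5, "slave3": 5},
    {"slave1": 0, "slave2": 3, "slave3": 4},
    {"slave1": 0, "slave2": 0, "slave3": 0},
]

workers = {name: Worker(name) for name in GROWTH[0]}
master = Master()
for growth in GROWTH:
    for name, lines in growth.items():
        workers[name].log_lines += lines
    master.record({name: w.heartbeat() for name, w in workers.items()})

print(master.candidate_sources())
```

Ordering by the Master's own tick, rather than by timestamps inside each node's log, avoids having to compare clocks across machines.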

If the iteration step information is needed to be checked during calculation, the incremental retrieval technology is utilized, and the redundant output of the log information is reduced.

When the system is abnormal and the log is checked for program debugging, the migration command technology is utilized to collect the key log information for debugging.

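As another embodiment of the present invention, there is also provided a log management system for distributed graph iterative computation jobs, configured to manage logs and including the log incremental change analysis traceability module, the distributed migration retrieval module, and the incremental retrieval module described above.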

The functions and implementation manners of the modules are the same as those of the log management method facing the distributed graph iterative computation operation, and are not described herein again.

As another embodiment of the present invention, a computer-readable storage medium is further provided, on which a computer program is stored, and when the computer program is executed by a processor, the log management method for distributed graph-oriented iterative computation jobs is implemented, which is not described herein again.

In summary, the invention provides a solution from two angles, fault tracing and log retrieval. First, fault tracing needs no modeling or predefined rules; the root of the anomaly is determined solely from the update state of the logs. After fault tracing, log analysis during program debugging is optimized: the logs are searched on each working node, only the required content is sent to the Master, and the useless content is not sent, which saves a large amount of transmission time. Once the user has determined the node where the fault source is located, the running details of the program can be quickly tracked and analyzed to complete program debugging.

The same or similar parts among the various embodiments of the present description may be referred to each other, and each embodiment is described with emphasis on differences from the other embodiments. Moreover, the structure of the system embodiment is only schematic, wherein the program modules described by the separable components may or may not be physically separated, and in actual application, some or all of the modules may be selected as needed to achieve the purpose of the solution of the embodiment.

The steps of the present invention may be implemented using general purpose computer means, or alternatively, they may be implemented using program code executable by computing means, whereby the steps may be stored in memory means for execution by the computing means, or separately fabricated into individual integrated circuit modules, or multiple modules or steps thereof fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.

It is understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art should understand that they can make various changes, modifications, additions and substitutions within the spirit and scope of the present invention.

Claims

1. The log management method facing the distributed graph iterative computation operation is characterized in that after the distributed graph iterative computation operation starts, a Master coordinates each node to load graph data to the local and starts computation, and the log management is realized through the following method:

2. The log management method for the distributed graph iterative computation job according to claim 1, wherein the specific operation steps of the log incremental change analysis tracing method based on the unified time metric in step (1) are as follows:

The working node i is considered to have a fault; if not found

3. The distributed graph iterative computation job-oriented log management method according to claim 1, wherein in step (2), the migration and distributed execution retrieval command refers to a method that when the distributed graph iterative computation system performs program debugging, each node first retrieves locally, and then transmits the retrieved key log information to the Master, specifically as follows:

4. The distributed graph iterative computation job-oriented log management method according to claim 1, wherein in step (2), the viewing iteration step information by the incremental retrieval method includes: for one job, when the information of iteration step n is viewed for the first time, the required information of iteration step n is output directly, and an integer variable outIteration is then set to n to record that the information up to iteration step n has been output; if the information of iteration step m is requested next, first checking whether m is larger than n; when m is larger than n, only the log information of iteration steps n+1 to m is output, and outIteration is then set to m; when m is smaller than n, prompting that the log information has already been output and referring the user to the log information of iteration step n.

5. A log management system facing distributed graph iterative computation operation is used for managing logs and is characterized by comprising the following steps:

6. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a method of log management for a distributed graph-oriented iterative computing job according to any one of claims 1 to 4.