CN113535528A - Log management system, method and medium for distributed graph iterative computation operation - Google Patents

Publication number
CN113535528A
Authority
CN
China
Prior art keywords
log
node
distributed
information
fault
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110728761.9A
Other languages
Chinese (zh)
Other versions
CN113535528B (en)
Inventor
王志刚
涂懿磊
殷波
王宁
聂婕
宋德海
�田�浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ocean University of China
Original Assignee
Ocean University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ocean University of China filed Critical Ocean University of China
Priority to CN202110728761.9A priority Critical patent/CN113535528B/en
Publication of CN113535528A publication Critical patent/CN113535528A/en
Application granted granted Critical
Publication of CN113535528B publication Critical patent/CN113535528B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/34: Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466: Performance evaluation by tracing or monitoring
    • G06F11/3476: Data logging
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/30: Monitoring
    • G06F11/3003: Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/302: Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a log management system, a log management method and a log management medium for distributed graph iterative computation jobs. After a distributed graph iterative computation job has started and a fault occurs, the fault is traced using a log incremental change analysis tracing method based on a unified time measurement standard: the incremental change of each node's log is continuously monitored, and the order in which the node logs stop updating is judged with the time of the master control node as the reference, so as to give a candidate fault source node. After fault tracing, log analysis during program debugging is optimized by migrating the retrieval command and executing it in a distributed manner, so that key log information is collected for debugging. During the iterative computation of the distributed graph, iteration step information is viewed in real time through an incremental retrieval method. With the method and the device, after a user determines the node where the fault source is located, the running details of the program can be quickly tracked and analyzed to complete program debugging.

Description

Log management system, method and medium for distributed graph iterative computation operation
Technical Field
The invention belongs to the technical field of data processing, relates to a log management method, and particularly relates to a log management system, a log management method and a log management medium for distributed graph iterative computation operation.
Background
The distributed graph iterative computing system adopts a Master-Slave (Master-Slave) architecture, as shown in fig. 1, a job is divided into a plurality of tasks and is completed by a plurality of machines in a cluster together, wherein one machine is selected as a Master node Master, and the rest are working nodes Slave. Each working node reports the processing progress of the data in charge of the working node to the main control node periodically, and the main control node displays the processing progress of the whole operation to a user after gathering. This periodic reporting mechanism, commonly referred to as a "heartbeat" mechanism, can be used to perform management and monitoring functions in a master-slave architecture.
Due to the good encapsulation of the distributed large-graph computing platform, a user cannot analyze the running process of the operation program submitted by the user by using tools such as single-step debugging, variable value monitoring and the like under a single-machine programming environment. In addition, the instability of the cross-machine network communication connection of the distributed platform and the uncertainty of the multithreading concurrent computation result increase the debugging difficulty of the graph iterative computation operation.
At present, the main debugging means of the distributed program is to print log information when the program is running, however, the distributed system logs are distributed on each working node, and each iteration step in the graph calculation process may need to check corresponding information and analyze the running correctness of the program, so that the logs of a plurality of nodes need to be checked frequently in the debugging process. The complexity of cross-node log retrieval and redundant information in related log files among different iteration steps reduce the Debug efficiency. Secondly, because a graph algorithm usually accesses a vertex along an outgoing edge, strong coupling exists among subgraphs distributed in different working nodes, when one working node is abnormal, abnormal error reporting occurs in logs on a plurality of physical machines, and at the moment, the error reporting of a plurality of nodes generates certain interference on the judgment of 'first abnormal error reporting', namely, an abnormal machine cannot be quickly positioned and a fault cannot be traced.
Fault tracing for distributed systems is mainly divided into rule-based methods and model-based methods, both of which need to analyze the log information of the system over a long period to extract the knowledge required to establish a rule or a model. At present, the log management of mainstream distributed computing platforms is rarely suited to graph computing systems and is not designed specifically for graph computing jobs. The existing methods can only be applied to general distributed systems, and storing logs with them on a distributed graph computing system has the following defects. When the graph computing system reports an error, only the log records of the current single job distributed in the cluster need to be checked to find the error, and the logs can be deleted once the error is eliminated, so they do not need to be stored. Storing all logs requires a certain amount of storage space, and the log collection process also incurs communication overhead. In addition, a graph algorithm job differs from other distributed jobs in that it performs many rounds of iterative computation and each iteration produces some information. During debugging, the log information output in real time needs to be checked while the iterations are running, so the logs cannot simply be stored once and checked afterwards; they need to be checked many times, and if all log information of the job is transmitted at each collection, the log transmission is redundant.
In summary, the existing distributed graph computing systems only provide a simple log recording function and do not support fault tracing or log management during program debugging. For fault tracing in distributed systems, the prior art needs a large amount of accumulated past fault data and models these data to identify anomalies, so these methods depend heavily on prior knowledge. Therefore, aiming at the low efficiency of program debugging in distributed graph iterative computing systems, the invention provides a log management method for distributed graph iterative computation jobs that works on a single job independently, does not rely on a large amount of historical data and does not need modeling.
Disclosure of Invention
Aiming at the defects in the prior art, the invention provides a log management system, a log management method and a log management medium for distributed graph iterative computation operation, wherein a solution is provided from the two aspects of fault tracing and log retrieval, firstly, the fault tracing is carried out, modeling or rule agreement is not needed, and only the position of the source of an abnormal condition is judged according to the update state of the log; after the fault source tracing, the log analysis is optimized in the debugging process of the program, the log is retrieved on each working node, the required content is sent to the Master after the retrieval, and part of useless content is not sent, so that a large amount of sending time is saved, and after the user determines the node where the fault source is located, the operation details of the program can be quickly tracked and analyzed, and the program debugging is completed.
In order to solve the technical problems, the invention adopts the technical scheme that:
firstly, the invention provides a log management method facing to the distributed graph iterative computation operation, after the distributed graph iterative computation operation starts, a Master coordinates each node to load graph data to the local and starts computation, and the log management is realized by the following method:
(1) firstly, tracing the source of the fault by using a log incremental change analysis tracing method based on a unified time measurement standard: continuously monitoring the log increment change condition of each node through a heartbeat mechanism between a master node and a slave node, and judging the updating stop sequence of each node log by taking the time of the master node as a reference so as to give a candidate fault source node;
(2) after the failure tracing, optimizing log analysis in debugging the program, and collecting key log information for debugging through migrating and executing a retrieval command in a distributed manner;
and when the distributed graph is subjected to iterative computation, the iterative step information is checked in real time through an increment retrieval method.
Further, the specific operation steps of the log incremental change analysis traceability method based on the unified time measurement standard in the step (1) are as follows:
step1, firstly, utilizing a regular reporting mechanism of n milliseconds in a graph computing system on each working node to collect and report the log record updating state of each node;
step2, at every heartbeat, each node compares its current local log quantity with the log quantity at the end of the previous heartbeat, obtains an incremental change value $\Delta_{log}$, and reports $\Delta_{log}$ to the Master;
step3, judging whether an abnormality occurs: when an abnormality occurs in the graph computation job, first checking the $\Delta_{log}$ values recorded by the Master; if working node i reports $\Delta_{log}^{i} = 0$, working node i is considered to have a fault; if no $\Delta_{log}^{i} = 0$ is found, reducing the heartbeat interval n to increase the fault tracing sensitivity, and running the job again until the moment at which the log of the fault source stops updating is captured.
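The worker-side part of Step 1 and Step 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `LogDeltaTracker`, the `demo` helper and the choice of file size in bytes as the "log quantity" are all assumptions made for the example.

```python
import os
import tempfile
from typing import List


class LogDeltaTracker:
    """Tracks how much a local log file grew since the previous heartbeat."""

    def __init__(self, log_path: str) -> None:
        self.log_path = log_path
        self.previous_size = self._current_size()

    def _current_size(self) -> int:
        # Assumption: the "log quantity" is approximated by the file size in bytes.
        if os.path.exists(self.log_path):
            return os.path.getsize(self.log_path)
        return 0

    def heartbeat(self) -> int:
        """Return the log increment for this heartbeat and remember the new size."""
        current = self._current_size()
        delta_log = current - self.previous_size
        self.previous_size = current
        return delta_log


def demo() -> List[int]:
    """Simulate three heartbeats on a temporary log file (hypothetical helper)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "worker.log")
        open(path, "w").close()
        tracker = LogDeltaTracker(path)
        deltas = []
        with open(path, "a") as log:
            log.write("superstep 1 done\n")
            log.flush()
            deltas.append(tracker.heartbeat())
            log.write("superstep 2 done\n")
            log.flush()
            deltas.append(tracker.heartbeat())
        # Nothing was written since the last heartbeat, so the increment is zero.
        deltas.append(tracker.heartbeat())
        return deltas
```

In a real system the value returned by `heartbeat()` would be attached to the heartbeat message sent to the Master; a zero increment is the signal that the node's log has stopped updating.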
Further, in step (2), the migrating and distributed execution of the retrieval command refers to a method that when the distributed graph iterative computing system performs program debugging, each node first performs local retrieval, and then transmits the retrieved key log information to the Master, which specifically includes the following steps:
sending a search command to the Slave working nodes from the Master, retrieving logs locally according to the command after each Slave receives the command, and operating the retrieval command in a distributed manner by each node;
returning part of the key log information, and finally presenting the result on the Master.
Further, in step (2), viewing the iteration step information by the incremental retrieval method includes: for one job, when the information of iteration step n is viewed for the first time, the required information of iteration step n is output directly, and an integer variable outIteration is then set to n to record that the information of iteration step n has been output; if the information of iteration step m needs to be output next time, it is first checked whether m is larger than n; when m is larger than n, only the log information of iteration steps n+1 to m is output and outIteration is then set to m; when m is smaller than n, the user is prompted that the log information has already been output and is referred to the log information already output for iteration step n.
The invention also provides a log management system facing the distributed graph iterative computation operation, which is used for managing logs and comprises the following steps:
the log incremental change analysis traceability module is used for tracing the fault, continuously monitoring the log incremental change condition of each node through a heartbeat mechanism between a master node and a slave node, and judging the order of stopping updating of logs of each node by taking the time of the master node as a reference so as to give a candidate fault source node;
the distributed migration retrieval module is used for optimizing log analysis in debugging a program after fault tracing, and collecting key log information for debugging through migration and distributed execution of retrieval commands;
and the increment retrieval module is used for viewing the iteration step information in real time through an increment retrieval command during the iterative computation of the distributed graph.
The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the distributed graph iterative computation job oriented log management method as described above.
Compared with the prior art, the invention has the advantages that:
(1) aiming at the problem of difficult source tracing of a node where a real fault is located when multiple nodes report errors under a strong coupling correlation background, the invention provides a log increment change analysis source tracing method based on a unified time measurement standard, the characteristic that local logs of the nodes are not updated after a fault occurs is utilized, the log increment change condition of each node is continuously monitored through a heartbeat mechanism between a master node and a slave node, the time of a master control node is taken as a reference, the order of stopping updating of logs of each node is judged, and then a candidate fault source node is given.
Modeling or rule agreement is not required, and it is only necessary to determine where the anomaly is rooted based on the update status of the log.
(2) Aiming at the problem of low efficiency of cross-node frequent log retrieval, the invention provides a data redundancy deletion method based on distributed incremental retrieval, which replaces log collection by migrating and executing a retrieval command in a distributed manner, retrieves the logs on each working node, sends the required content to a Master after retrieval, and partially does not send the useless content, thereby saving a large amount of sending time, reducing the network transmission overhead of redundant information, improving the retrieval efficiency by executing the retrieval command in a distributed manner, and enabling a user to quickly trace and analyze the operation details of a program after determining the node where a fault source is located, and completing program debugging.
(3) During iterative computation of the distributed graph, iterative step information is checked in real time through an incremental retrieval method, and repeated scanning of log information among different iterative steps is avoided.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive labor.
FIG. 1 is a prior art distributed graph iterative computing system architecture diagram;
FIG. 2 is a schematic diagram of log record update and reporting in accordance with the present invention;
FIG. 3 is a comparison of two log transmission strategies according to the present invention and the prior art;
FIG. 4 is a flowchart of a log management method for distributed graph-oriented iterative computation according to the present invention.
Detailed Description
The invention is further described with reference to the following figures and specific embodiments.
The embodiment provides a log management method for distributed graph iterative computation operation, after the distributed graph iterative computation operation starts, a Master coordinates each node to load graph data to the local and starts computation, and the log management is realized by the following method:
(1) firstly, tracing the source of the fault by using a log incremental change analysis tracing method based on a unified time measurement standard: by utilizing the characteristic that local logs of nodes are not updated any more after a fault occurs, the log increment change condition of each node is continuously monitored through a heartbeat mechanism between a master node and a slave node, the time of the master node is taken as a reference, the order of stopping updating of the logs of each node is judged, and then a candidate fault source node is given.
(2) After the failure is traced, the log analysis is optimized during the debugging of the program, and the log collection is replaced by migrating and executing the retrieval command in a distributed manner, so that the network transmission overhead of redundant information is reduced, the key log information is collected for debugging, and the retrieval efficiency is improved.
When the distributed graph is subjected to iterative computation, the iteration step information is checked in real time through an increment retrieval method, and repeated scanning of log information among different iteration steps is avoided.
The following are introduced from two aspects of fault tracing and log retrieval respectively:
log incremental change analysis tracing method based on unified time measurement standard
In a distributed graph computing system, the working nodes need to communicate over the network during computation, and the output value of a sending working node is often the input value that a receiving working node needs for its next iteration. When working node i fails, the other working nodes cannot receive the input values they require and cannot carry out their own computation, so the whole system fails in a chain reaction in the subsequent iterations and the computing task fails. When the logs of a large number of working nodes record faults, the fault cannot be traced quickly. The invention uses the log incremental change analysis tracing method based on the unified time measurement standard of the system, and considers that the working node that first reports an abnormal $\Delta_{log}$ is typically the originating working node that caused the entire fault. Thus, the fault source can be located quickly.
The key point is that according to the update state of the log record, the judgment is made on where the failure source of the current operation is. When the tasks of the distributed graph computing system are carried out, each working node continuously adds the logs according to the progress of the tasks, and the logs are stopped updating when the tasks are finished, so that the content of the logs is increased along with the progress of the computing tasks, and the log content and the computing tasks are in a direct proportion relation. Therefore, the present invention determines the abnormality according to the update status of the log record, and as shown in fig. 2, the specific operation steps are as follows:
step1, firstly, a periodic reporting mechanism of n milliseconds, namely 'heartbeat', in a graph computing system on each working node is utilized to collect and report the log record updating state of each node. The purpose is to collect and report the log record updating state of each node at the same time, achieve indiscriminate treatment, and avoid the situations of less report and report missing of the node.
Step2, at every heartbeat, each node compares its current local log quantity with the log quantity at the end of the previous heartbeat, obtains an incremental change value $\Delta_{log}$, and reports $\Delta_{log}$ to the Master.
Step3, judging whether an abnormality occurs: when an abnormality occurs in the graph computation job, first check the $\Delta_{log}$ values recorded by the Master. If the $\Delta_{log}^{i}$ reported by working node i differs greatly from the $\Delta_{log}^{j}$ reported by the other working nodes, namely $\Delta_{log}^{i} = 0$, working node i is considered to have a fault; if no $\Delta_{log}^{i} = 0$ is found, reduce the heartbeat interval n to increase the fault tracing sensitivity, and run the job again until the moment at which the log of the fault source stops updating is captured.
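The Master-side judgment in Step 3 can be illustrated with a small sketch. It assumes that each heartbeat carries a node identifier and a log increment, and that the Master stamps every report with its own clock so that all nodes are compared on one time base. The names `DeltaLogMonitor`, `record`, `stop_order` and `candidate_fault_sources` are hypothetical and chosen only for this example.

```python
from typing import Dict, List, Tuple


class DeltaLogMonitor:
    """Master-side record of when each node's log stopped growing."""

    def __init__(self) -> None:
        # node id -> Master time at which the node first reported a zero increment
        self.stopped_at: Dict[str, float] = {}

    def record(self, node: str, delta_log: int, master_time: float) -> None:
        """Store one heartbeat report, stamped with the Master's own clock."""
        if delta_log == 0:
            # Keep the earliest time of the current run of zero increments.
            self.stopped_at.setdefault(node, master_time)
        else:
            # The log is growing again, so the node is not considered stopped.
            self.stopped_at.pop(node, None)

    def stop_order(self) -> List[Tuple[str, float]]:
        """Nodes ordered by the Master time at which their logs stopped updating."""
        return sorted(self.stopped_at.items(), key=lambda item: item[1])

    def candidate_fault_sources(self) -> List[str]:
        """Nodes whose logs stopped first; several entries mean the interval is too coarse."""
        order = self.stop_order()
        if not order:
            return []
        earliest = order[0][1]
        return [node for node, stopped in order if stopped == earliest]
```

If several nodes are returned, their logs stopped within the same heartbeat and cannot be ordered, which corresponds to the case in which the heartbeat interval n has to be reduced and the job run again.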
The size of the "heartbeat" interval relates to the balance between the traceability sensitivity and the detection time complexity, and needs to be selected according to specific requirements.
Step3 involves the cost of fault tracing. When n is set too small, the log update status is recorded and reported to the master control node frequently; although this improves the accuracy of fault tracing, it occupies limited computing resources and increases the communication cost. If n is set too large, the accuracy is reduced. A compromise is therefore provided: a larger value of n is used during routine computation to reduce resource consumption; if no $\Delta_{log}^{i} = 0$ is captured, the value of n is reduced to increase the sensitivity and the job is run again. At that point the main purpose is fault tracing rather than job computation, so computing resources are sacrificed to obtain fault tracing accuracy.
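The compromise described above, a coarse interval first and a finer one only when the fault source is not isolated, can be sketched as a retry loop. The sketch makes several assumptions that the text does not fix: `run_job` is a hypothetical callback that runs the job with heartbeat interval n and returns the nodes whose logs stopped first, the interval is halved on each retry, and "not captured" is read as "not exactly one candidate".

```python
from typing import Callable, List, Optional, Tuple


def trace_with_shrinking_interval(
    run_job: Callable[[int], List[str]],
    start_interval_ms: int,
    min_interval_ms: int,
) -> Tuple[Optional[str], int]:
    """Rerun the job with smaller heartbeat intervals until one source is isolated.

    Returns the isolated node (or None) and the interval used in the last run.
    """
    interval = start_interval_ms
    while True:
        candidates = run_job(interval)
        if len(candidates) == 1:
            # A single node stopped logging first: candidate fault source found.
            return candidates[0], interval
        if interval <= min_interval_ms:
            # The finest allowed interval still cannot separate the nodes.
            return None, interval
        # Assumption: halve the interval to raise the tracing sensitivity.
        interval = max(min_interval_ms, interval // 2)
```

The lower bound `min_interval_ms` is there because every halving increases the reporting traffic; it caps how much computing and communication cost is spent on tracing.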
Data redundancy deletion method based on distributed incremental retrieval
When a program of a distributed graph iterative computing system is debugged, log information is distributed on each computing node, cross-node operation is needed, logs are checked on different working nodes, the problem of operation repetition exists, the existing distributed log management system is not suitable for distributed graph computing operation, and the distributed graph computing system only provides a simple log recording function. In addition, in the distributed graph iterative computation system, when one-time operation computation is performed, the iterative step output information needs to be checked for many times, after the iterative step n information is checked, when the iterative step n + i is checked again, the iterative step n information is still output, and the problem of retrieval and iterative step information redundancy exists.
The invention solves the two problems by migrating and executing the retrieval command and the increment retrieval command in a distributed way, and the two problems are as follows:
1. migrating and distributively executing search commands
Fig. 3 shows, on the left, the existing log transmission scheme, which cannot perform log analysis efficiently, and, on the right, the efficient method adopted by the present invention. When the distributed graph iterative computing system performs program debugging, in order to avoid going back and forth among the computing nodes, each node first performs retrieval locally and then transmits the retrieved key log information to the Master. The Master sends a search command to the Slave working nodes; after receiving the command, each Slave retrieves its logs locally according to the command, so the nodes run the retrieval command in a distributed manner. Compared with having the Master first collect the logs of the whole cluster and then retrieve the log records of each node one by one, this greatly improves retrieval efficiency. Only the key part of the log information is returned, and the result is finally presented on the Master. Because the log records also contain some useless information, transmitting only the key logs greatly reduces the communication overhead.
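The idea of shipping the search command to the logs, rather than the logs to the Master, can be illustrated with a single-process simulation. It is only a sketch: the workers are simulated by threads, the search command is assumed to be a regular expression, and the function names `search_local_log` and `distributed_search` are hypothetical.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def search_local_log(lines: List[str], pattern: str) -> List[str]:
    """Runs on a worker: keep only the log lines that match the search command."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def distributed_search(
    worker_logs: Dict[str, List[str]], pattern: str
) -> Dict[str, List[str]]:
    """Runs on the Master: send the pattern to every worker and gather the matches."""
    with ThreadPoolExecutor() as pool:
        # Each submitted task stands in for one worker searching its own log.
        futures = {
            worker: pool.submit(search_local_log, lines, pattern)
            for worker, lines in worker_logs.items()
        }
        # Only the matching lines travel back to the Master.
        return {worker: future.result() for worker, future in futures.items()}
```

In a real cluster the thread pool would be replaced by the system's own RPC or messaging layer; the point of the sketch is that the data returned to the Master is the filtered result, not the full log.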
2. Incremental search command
Avoiding redundant display of iteration step information means not displaying information that has already been output. For one job, when the information of iteration step n is viewed for the first time, there is no redundant output; the required information of iteration step n is output directly, and an integer variable outIteration is then set to n to record that the information of iteration step n has been output. If the information of iteration step m needs to be output next time, it is first checked whether m is larger than n: when m is larger than n, only the log information of iteration steps n+1 to m is output and outIteration is then set to m; when m is smaller than n, the user is prompted that the log information has already been output and is referred to the log information already output for iteration step n.
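The bookkeeping around outIteration can be sketched as a small class. The sketch assumes that iteration steps are numbered from 1, that the logs are already grouped by iteration step, and that a request for a step equal to outIteration is treated like an already-output step; the class name `IncrementalLogViewer` and the wording of the prompt are illustrative only.

```python
from typing import Dict, List


class IncrementalLogViewer:
    """Outputs each iteration step's log lines at most once per job."""

    def __init__(self, logs_by_iteration: Dict[int, List[str]]) -> None:
        self.logs_by_iteration = logs_by_iteration
        # Highest iteration step whose information was already output (0 = none).
        self.out_iteration = 0

    def view(self, m: int) -> List[str]:
        """Return only the lines for steps out_iteration+1 .. m."""
        if m <= self.out_iteration:
            # Assumption: equal steps are also treated as already output.
            return [f"iteration steps up to {self.out_iteration} were already output"]
        lines: List[str] = []
        for step in range(self.out_iteration + 1, m + 1):
            lines.extend(self.logs_by_iteration.get(step, []))
        self.out_iteration = m
        return lines
```

For example, after `view(2)` a call to `view(4)` returns only the lines of steps 3 and 4, so nothing from the first two steps is scanned or printed again.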
With reference to the flowchart shown in fig. 4, after the distributed graph computation job starts, the Master coordinates each node to load graph data locally and start computation. After each heartbeat interval, each Slave node compares its current local log quantity with the log quantity at the previous heartbeat to obtain a change value $\Delta_{log}$ and reports $\Delta_{log}$ to the Master. The Master records these $\Delta_{log}$ values. If the job becomes abnormal, the $\Delta_{log}$ values recorded by the Master are examined: if the last $\Delta_{log}$ reported by a working node is 0, that node can be determined to be the root cause of the fault. This part is the fault tracing technique.
If the iteration step information is needed to be checked during calculation, the incremental retrieval technology is utilized, and the redundant output of the log information is reduced.
When the system is abnormal and the log is checked for program debugging, the migration command technology is utilized to collect the key log information for debugging.
As another embodiment of the present invention, there is also provided a log management system for a distributed graph-oriented iterative computation job, configured to manage logs, including:
the log incremental change analysis traceability module is used for tracing the fault, continuously monitoring the log incremental change condition of each node through a heartbeat mechanism between a master node and a slave node, and judging the order of stopping updating of logs of each node by taking the time of the master node as a reference so as to give a candidate fault source node;
the distributed migration retrieval module is used for optimizing log analysis in debugging a program after fault tracing, and collecting key log information for debugging through migration and distributed execution of retrieval commands;
and the increment retrieval module is used for viewing the iteration step information in real time through an increment retrieval command during the iterative computation of the distributed graph.
The functions and implementation manners of the modules are the same as those of the log management method facing the distributed graph iterative computation operation, and are not described herein again.
As another embodiment of the present invention, a computer-readable storage medium is further provided, on which a computer program is stored, and when the computer program is executed by a processor, the log management method for distributed graph-oriented iterative computation jobs is implemented, which is not described herein again.
In summary, the invention provides a solution from two angles of fault tracing and log retrieval, and firstly, the fault tracing does not need modeling or rule agreement, and only needs to determine where the abnormal root comes from according to the update state of the log; after the fault source tracing, the log analysis is optimized in the debugging process of the program, the log is retrieved on each working node, the required content is sent to the Master after the retrieval, and part of useless content is not sent, so that a large amount of sending time is saved, and after the user determines the node where the fault source is located, the operation details of the program can be quickly tracked and analyzed, and the program debugging is completed.
The same or similar parts among the various embodiments of the present description may be referred to each other, and each embodiment is described with emphasis on differences from the other embodiments. Moreover, the structure of the system embodiment is only schematic, wherein the program modules described by the separable components may or may not be physically separated, and in actual application, some or all of the modules may be selected as needed to achieve the purpose of the solution of the embodiment.
The steps of the present invention may be implemented using general purpose computer means, or alternatively, they may be implemented using program code executable by computing means, whereby the steps may be stored in memory means for execution by the computing means, or separately fabricated into individual integrated circuit modules, or multiple modules or steps thereof fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
It is understood that the above description does not limit the present invention, and that the present invention is not limited to the above examples; those skilled in the art may make various changes, modifications, additions and substitutions within the spirit and scope of the present invention.
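The "search locally, send only what is needed" retrieval described in the summary can be sketched as follows. This is a minimal single-process illustration, not the patented implementation: the function names and the `worker_logs` dictionary are hypothetical, and the remote invocation on each Slave is replaced by an ordinary function call so that the example stays self-contained.

```python
import re
from typing import Dict, List


def search_local_log(lines: List[str], pattern: str) -> List[str]:
    """Runs on a worker: keep only the log lines that match the search pattern."""
    regex = re.compile(pattern)
    return [line for line in lines if regex.search(line)]


def distributed_search(worker_logs: Dict[str, List[str]], pattern: str) -> Dict[str, List[str]]:
    """Runs on the Master: issue the same command to every worker and collect hits.

    In a real deployment each call would be a remote invocation on a Slave
    (assumed here); only the matching lines would travel back to the Master.
    """
    return {
        worker: search_local_log(lines, pattern)
        for worker, lines in worker_logs.items()
    }
```

Because the filtering happens before any data is returned, the volume sent to the Master is bounded by the number of matching lines rather than by the full log size.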

Claims (6)

1. A log management method for distributed graph iterative computation jobs, characterized in that, after a distributed graph iterative computation job starts, a Master coordinates each node to load graph data locally and start computing, and log management is carried out as follows:
(1) first, the fault is traced using a log incremental-change analysis tracing method based on a unified time measurement standard: the log incremental change of each node is continuously monitored through the heartbeat mechanism between the master node and the slave nodes, and the order in which the node logs stop updating is judged with the time of the master node as the reference, so as to give candidate fault source nodes;
(2) after fault tracing, log analysis during program debugging is optimized, and the key log information for debugging is collected by migrating the retrieval command and executing it in a distributed manner;
and, during the distributed graph iterative computation, the iteration step information is viewed in real time through an incremental retrieval method.
2. The log management method for distributed graph iterative computation jobs according to claim 1, characterized in that the log incremental-change analysis tracing method based on the unified time measurement standard in step (1) comprises the following specific steps:
step 1: on each working node, the periodic reporting mechanism of the graph computing system, with a period of n milliseconds, is used to collect and report the log update state of that node;
step 2: at each heartbeat, each node compares its current local log volume with the log volume at the end of the previous heartbeat to obtain the incremental change value $\Delta_{log}$, and reports $\Delta_{log}$ to the Master;
step 3: whether an abnormality has occurred is judged; when the graph computation job is abnormal, the $\Delta_{log}$ values recorded by the Master are checked first; if working node i reports $\Delta_{log} = 0$, working node i is considered to have failed; if no $\Delta_{log} = 0$ is found, the heartbeat interval n is reduced to increase the fault-tracing sensitivity, and the job is run again until the log of the fault source is captured as no longer updating.
3. The log management method for distributed graph iterative computation jobs according to claim 1, characterized in that, in step (2), migrating the retrieval command and executing it in a distributed manner means that, when the distributed graph iterative computation system is debugging a program, each node first searches locally and then sends the retrieved key log information to the Master, specifically as follows:
the Master sends the search command to the Slave working nodes; after receiving the command, each Slave searches its local log according to the command, so that the retrieval command runs on the nodes in a distributed manner;
each node returns only the key log information that was retrieved, and the results are finally presented on the Master.
4. The log management method for distributed graph iterative computation jobs according to claim 1, characterized in that, in step (2), viewing the iteration step information through the incremental retrieval method comprises: for one job, when the information of iteration step n is viewed for the first time, the required information of iteration step n is output directly, and an integer variable outIteration is then set to n to record that the information of iteration step n has been output; if the information of iteration step m is requested next, whether m is greater than n is checked first; when m is greater than n, only the log information of iteration steps n+1 to m is output, and outIteration is then set to m; when m is smaller than n, a prompt indicates that the log information has already been output and refers the user to the log information of iteration step n.
5. A log management system for distributed graph iterative computation jobs, used for managing logs, characterized by comprising:
a log incremental-change analysis tracing module, used for tracing faults: it continuously monitors the log incremental change of each node through the heartbeat mechanism between the master node and the slave nodes, and judges the order in which the node logs stop updating with the time of the master node as the reference, so as to give candidate fault source nodes;
a distributed migration retrieval module, used for optimizing log analysis during program debugging after fault tracing, and for collecting the key log information for debugging by migrating the retrieval command and executing it in a distributed manner;
and an incremental retrieval module, used for viewing the iteration step information in real time through an incremental retrieval command during the distributed graph iterative computation.
6. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the log management method for distributed graph iterative computation jobs according to any one of claims 1 to 4.
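The incremental retrieval of claim 4 can be sketched in Python as follows. This is a hedged illustration only: the class name, the `log_by_iteration` dictionary and the wording of the prompt are hypothetical. Claim 4 covers only the cases where m is greater or smaller than the last output step; treating a repeated request for the same step as "already output" is an assumption of this sketch.

```python
from typing import Dict, List, Optional


class IncrementalViewer:
    """Tracks which iteration steps of one job have already been output."""

    def __init__(self, log_by_iteration: Dict[int, List[str]]) -> None:
        # Hypothetical layout: iteration step -> log lines of that step.
        self.log_by_iteration = log_by_iteration
        # Mirrors the integer variable outIteration named in claim 4.
        self.out_iteration: Optional[int] = None

    def view(self, m: int) -> List[str]:
        if self.out_iteration is None:
            # First request: output the requested step directly.
            self.out_iteration = m
            return list(self.log_by_iteration.get(m, []))
        if m > self.out_iteration:
            # Later request: output only the steps that are new since last time.
            lines: List[str] = []
            for step in range(self.out_iteration + 1, m + 1):
                lines.extend(self.log_by_iteration.get(step, []))
            self.out_iteration = m
            return lines
        # m <= out_iteration: nothing new (the equal case is an assumption).
        return [
            f"iteration {m} was already output; "
            f"last output step is {self.out_iteration}"
        ]
```

For example, after viewing step 2 and then requesting step 4, only the lines of steps 3 and 4 are returned, so content that was already shown is not retrieved or displayed again.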
CN202110728761.9A 2021-06-29 2021-06-29 Log management system, method and medium for distributed graph iterative computation job Active CN113535528B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110728761.9A CN113535528B (en) 2021-06-29 2021-06-29 Log management system, method and medium for distributed graph iterative computation job

Publications (2)

Publication Number Publication Date
CN113535528A (en) 2021-10-22
CN113535528B (en) 2023-08-08

Family

ID=78126198

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110728761.9A Active CN113535528B (en) 2021-06-29 2021-06-29 Log management system, method and medium for distributed graph iterative computation job

Country Status (1)

Country Link
CN (1) CN113535528B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7721152B1 (en) * 2004-12-21 2010-05-18 Symantec Operating Corporation Integration of cluster information with root cause analysis tool
US20110083123A1 (en) * 2009-10-05 2011-04-07 Microsoft Corporation Automatically localizing root error through log analysis
CN105975604A (en) * 2016-05-12 2016-09-28 清华大学 Distribution iterative data processing program abnormity detection and diagnosis method
CN106227727A (en) * 2016-06-30 2016-12-14 乐视控股(北京)有限公司 Daily record update method, device and the system of a kind of distributed system
CN110134714A (en) * 2019-05-22 2019-08-16 东北大学 A kind of distributed computing framework caching index suitable for big data iterative calculation
CN110489302A (en) * 2019-08-22 2019-11-22 贵州电网有限责任公司 Fault judgment method based on plurality of devices log multiple analysis

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
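贾统; 李影; 吴中海: "Survey of fault diagnosis for distributed software systems based on log data" (基于日志数据的分布式软件系统故障诊断综述), 软件学报 (Journal of Software), vol. 31, no. 7, pages 1997-2018 *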

Also Published As

Publication number Publication date
CN113535528B (en) 2023-08-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant