CN112948154A

CN112948154A - System abnormity diagnosis method, device and storage medium

Info

Publication number: CN112948154A
Application number: CN201911267056.2A
Authority: CN
Inventors: 叶尧罡
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Priority date: 2019-12-11
Filing date: 2019-12-11
Publication date: 2021-06-11

Abstract

The application discloses a system abnormality diagnosis method comprising: establishing a multi-layer factor graph based on abnormal log data; and determining candidate abnormal reasons of the abnormal logs based on the multi-layer factor graph. The application also discloses a system abnormality diagnosis device and a storage medium. With the embodiments of the application, the different abnormal reasons corresponding to the same abnormal log in different states can be diagnosed more accurately.

Description

System abnormity diagnosis method, device and storage medium

Technical Field

The present invention relates to the field of business support technologies, and in particular, to a method and an apparatus for diagnosing system anomaly, and a storage medium.

Background

With the continuous development of Information Technology (IT) systems, computer clusters are growing larger and Hadoop systems are becoming more complex. Abnormality diagnosis of a conventional Hadoop system depends mainly on manual troubleshooting, so abnormalities are not located in time and efficiency is low. Therefore, how to reduce manual intervention and more accurately diagnose the different abnormal reasons corresponding to the same abnormal log in different states is a technical problem to be solved.

Disclosure of Invention

The embodiment of the invention provides a system abnormity diagnosis method, a system abnormity diagnosis device and a storage medium, which can more accurately diagnose different abnormity reasons corresponding to the same abnormity log in different states of a Hadoop system.

The application provides a system abnormality diagnosis method, which comprises the following steps:

establishing a multi-layer factor graph based on the abnormal log data;

and determining candidate abnormal reasons of the abnormal logs based on the multilayer factor graph.

In the foregoing solution, the establishing a multi-layer factor graph based on the abnormal log data includes:

acquiring at least one variable node based on data of a function call stack in the abnormal log data;

and acquiring at least one reason factor node corresponding to the at least one variable node.

In the foregoing solution, the establishing a multi-layer factor graph based on the abnormal log data further includes:

acquiring at least one log factor node based on data of a non-function call stack in the abnormal log data;

and performing directed connection on the at least one log factor node and the at least one variable node to obtain a first-layer factor graph in a multi-layer factor graph.

and performing undirected connection on the at least one variable node and the at least one reason factor node to obtain a second-layer factor graph in the multi-layer factor graph.

dividing the exception log into at least one exception log group based on time information of data of the exception log;

and establishing a multi-layer factor graph corresponding to the abnormal log group.

In the above solution, before the multi-layer factor graph is built based on the abnormal log data, the method further includes at least one of the following:

removing variables in the data of the non-function call stack in the abnormal log data;

segmenting data of a non-function call stack in the abnormal log data with the variable removed;

converting data of a non-function call stack in the abnormal log data after word segmentation into sentences;

wherein the sentence is a log factor node.

In the above scheme, the data based on the function call stack in the abnormal log data includes:

extracting a method included in the data of the function call stack in the abnormal log data;

determining the method to be the variable node.

In the foregoing solution, the obtaining of the candidate abnormality cause of the abnormality log based on the multi-layer factor graph includes:

the variable nodes respectively transmit the prior probabilities to log factor nodes and reason factor nodes in the multilayer factor graph;

the log factor node and the reason factor node respectively calculate an external probability according to the received prior probability and transmit the external probability to the variable node;

the variable node determines prior probability corresponding to the log factor node and prior probability corresponding to the reason factor node based on the external probability, and sends the determined prior probabilities to the log factor node and the reason factor node respectively;

repeating the above steps until the number of cycles reaches a first threshold or the prior probability or the posterior probability converges;

and determining the candidate abnormal reason based on the prior probability or the posterior probability obtained when the number of cycles reaches the first threshold or when the prior probability or the posterior probability converges.

In the above scheme, the method further comprises: comparing the candidate abnormal reasons with actual abnormal reasons to obtain the confidence of the abnormal reasons corresponding to the abnormal log;

and based on the confidence coefficient, dividing the abnormal log group again.

An embodiment of the present application further provides a system abnormality diagnosis apparatus, where the apparatus includes:

an establishing unit for establishing a multi-layer factor graph based on the abnormal log data;

and the determining unit is used for determining candidate abnormal reasons of the abnormal log based on the multilayer factor graph.

In the above scheme, the apparatus further comprises: the obtaining unit is used for obtaining at least one variable node based on the data of the function call stack in the abnormal log data; and acquiring at least one reason factor node corresponding to the at least one variable node.

In the foregoing solution, the obtaining unit is further configured to obtain at least one log factor node based on data of a non-function call stack in the abnormal log data;

and the first connecting unit is used for performing directed connection on the at least one log factor node and the at least one variable node to obtain a first-layer factor graph in a multi-layer factor graph.

In the above scheme, the apparatus further comprises: a second connecting unit, used for performing undirected connection on the at least one variable node and the at least one reason factor node to obtain a second-layer factor graph in the multi-layer factor graph.

In the above scheme, the apparatus further comprises: a dividing unit for dividing the abnormality log into at least one abnormality log group based on time information of data of the abnormality log;

the establishing unit is further configured to establish a multi-layer factor graph corresponding to each abnormal log group.

In the above scheme, the apparatus further includes at least one of the following units:

the removing unit is used for removing variables in the data of the non-function call stack in the abnormal log data;

a word segmentation unit, configured to segment words of data of a non-function call stack in the abnormal log data from which the variable is removed;

the conversion unit is used for converting the data of the non-function call stack in the abnormal log data after word segmentation into sentences;

wherein the sentence is a log factor node.

In the above scheme, the apparatus further comprises: the extracting unit is used for extracting a method included in the data of the function call stack in the abnormal log data;

the determining unit is further configured to determine that the method is the variable node.

In the above scheme, the apparatus further comprises: the calculation unit is used for respectively transmitting the prior probability in the variable node to the log factor node and the reason factor node in the multilayer factor graph;

respectively calculating external probabilities according to the prior probabilities received by the log factor nodes and the reason factor nodes, and transmitting the external probabilities to the variable nodes;

based on the external probability received by the variable node, determining prior probability corresponding to the log factor node and prior probability corresponding to the reason factor node, and respectively sending the determined prior probabilities to the log factor node and the reason factor node;

circulating in the way until the circulating times reach a first threshold value, or determining that the difference value between the prior probability or the posterior probability and the latest prior probability or posterior probability is smaller than a second threshold value;

the determining unit is further configured to determine the candidate abnormal cause based on a prior probability or a posterior probability when the number of cycles reaches a first threshold, or when the difference between the prior probability or the posterior probability and the latest prior probability or the posterior probability is smaller than a second threshold.

In the foregoing solution, the dividing unit is further configured to: comparing the candidate abnormal reasons with actual abnormal reasons to obtain the confidence of the abnormal reasons corresponding to the abnormal log;

and based on the confidence coefficient, dividing the abnormal log group again.

The embodiment of the invention also provides a storage medium which stores an executable program, and when the executable program is executed by a processor, the system abnormity diagnosis method is realized.

The system abnormality diagnosis method provided by the embodiment of the application establishes a multi-layer factor graph based on abnormal log data, and determines candidate abnormal reasons of the abnormal logs based on the multi-layer factor graph. Different abnormal reasons corresponding to the same abnormal log in different states can thereby be diagnosed more accurately. The embodiment of the application uses a dynamic sum-product algorithm to calculate the probabilities of the messages passed in a multi-layer factor graph constructed from log information, and can capture the association relations of the log information in different running states of the system, thereby realizing dynamic reason inference with an accuracy that the prior art cannot achieve. The structure of the factor graph model greatly reduces the dependence on prior knowledge, so manual intervention in actual analysis can be greatly reduced. Meanwhile, because the sum-product algorithm has a fixed calculation model, the probabilities can be calculated automatically by a computer, which reduces the amount of manual calculation and therefore lowers the time complexity. By introducing a collaborative learning framework, the system abnormality diagnosis process becomes a closed, effective reasoning and diagnosis process in which the establishment and selection of factor graphs for different states can be adaptively adjusted, achieving comprehensive and accurate dynamic abnormality diagnosis of the Hadoop system.

Drawings

Fig. 1 is a first schematic flow chart of an alternative method for diagnosing system anomalies according to an embodiment of the present disclosure;

fig. 2 is an alternative flow chart illustrating the establishment of a multi-layer factor graph based on abnormal data according to an embodiment of the present application;

fig. 3 is a schematic view of an optional process for acquiring at least one log factor node based on data of a non-function call stack in the abnormal log data according to the embodiment of the present application;

fig. 4 is a factor graph established based on two exception logs according to an embodiment of the present application;

fig. 5 is a schematic diagram of a sum-product algorithm of a factor graph provided in the embodiment of the present application;

fig. 6 is a schematic view illustrating an alternative flow chart of a system abnormality diagnosis method according to an embodiment of the present application;

fig. 7 is a schematic diagram of an alternative structure of a system abnormality diagnosis apparatus according to an embodiment of the present application.

Detailed Description

The present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

With the continuous development of the IT system, the size of the computing cluster is increasingly large nowadays, and the complexity of the system is increasingly increased. The cost of manpower and material resources required for system operation and maintenance is increased, even exceeding the construction cost of the system. Once some nodes in a computing cluster fail, stable operation of a complex system is affected, and serious consequences of the system are likely to be caused. Therefore, the automatic and intelligent operation and maintenance system can not only improve the reliability and the use efficiency of the system, but also greatly save the operation and maintenance cost. The stable operation of the system depends on the normal operation of each component in the system, and the very important step in the intelligent operation and maintenance is the timely diagnosis work when the system fault occurs. Faults are always unavoidable, and the common practice is to find and eliminate the faults in time, so that an efficient and reliable system abnormity diagnosis method is needed. Anomaly diagnosis is a technique for discovering anomalies in a system or system components. As IT systems are becoming larger, the logic becomes more complex, and the difficulty of diagnosing system anomalies increases. On the one hand, many large-scale systems do not necessarily have very detailed monitoring capability; on the other hand, due to some fault-tolerant mechanisms in the system, the anomaly is not intuitively represented in the system sometimes.

Due to many limitations and considerations, abnormality diagnosis of large-scale systems still depends mainly on manual troubleshooting, so abnormalities are not located in time and efficiency is low. In most large-scale IT systems, it is difficult to reproduce and debug errors because of privacy protection requirements and system environment settings. Meanwhile, an application built on a large-scale system usually needs to process large user-provided data sets in parallel, and its running time is long; the uncertainty is even greater if the application is deployed in a large distributed system environment. In this situation, the system log, which dynamically records program operation, is an important means of abnormality diagnosis: it helps maintenance personnel analyze and reproduce errors, correct system abnormalities, and improve the reliability of system operation.

When the system runs and has problems, the reason and the position of the problems can be deduced through the system log. Optimally, the system operation and maintenance personnel will obtain the most accurate diagnostic information with the least input. However, currently, the efficiency of using the system log to diagnose the abnormality is not high, which has two main reasons:

1) the analysis of the abnormity is generally strongly dependent on the operating environment of the system and the system input, and the information is often difficult to reproduce, and the available information quantity is very limited;

2) the volume of system logs is huge, and log information is unstructured free text, so valuable information is difficult to extract automatically. Current abnormality analysis relies mainly on manual work by operation and maintenance personnel, and in many cases the code-related context must also be analyzed.

In the current intelligent operation and maintenance technology, the mainstream scheme for anomaly diagnosis still depends on manual enumeration of operation and maintenance personnel, that is, various test cases which may cause anomalies are enumerated so as to expose the causes and phenomena of faults as much as possible, and such a method usually lacks pertinence, has low diagnosis efficiency and incomplete results. The automatic technology of the abnormity diagnosis not only can assist developers to locate possible sources, possible places and possible abnormity types of problems, but also can carry out preliminary screening and prejudgment on the abnormity diagnosis process, and can effectively improve the abnormity diagnosis efficiency. Therefore, the automation technology of this process has been a research focus of abnormality diagnosis. However, the existing technologies are still not mature enough, have low accuracy, and often have the problems of false alarm and false negative. Therefore, how to effectively utilize some methods and techniques to improve the diagnosis accuracy also becomes one of the main research targets of the automatic technology for abnormality diagnosis.

Currently, fault diagnosis technology is continually incorporating new computational and mathematical methods, including machine learning, artificial intelligence, Bayesian inference, and graph theory.

The existing abnormality diagnosis methods can be roughly classified into six categories, which are respectively: a rule-based system anomaly diagnostic method, a model-based system anomaly diagnostic method, a statistics-based system anomaly diagnostic method, a machine learning-based system anomaly diagnostic method, a count and threshold-based system anomaly diagnostic method, and a visualization-based system anomaly diagnostic method.

The system abnormality diagnosis method based on the rule mainly carries out system abnormality diagnosis by expressing expert knowledge in the field as a series of fixed rules. Rules are manually extensible and interpretable, but the key limitations of this technique are that unknown system anomalies cannot be diagnosed, and that most knowledge bases themselves are difficult to maintain;

the system abnormity diagnosis method based on the model is to define the system as a certain mathematical expression and verify whether the defined model is satisfied by testing the observed behaviors, thereby realizing the system abnormity diagnosis. The model-based system anomaly diagnosis method is suitable for diagnosing problems at an application level, and the model construction needs to have very comprehensive and deep understanding on the system, so the actual operation difficulty of the model-based system anomaly diagnosis method is very large.

The statistics-based system anomaly diagnosis method diagnoses system anomalies by applying theories such as correlation analysis, comparison, and probability to empirical data related to the system. It does not require deep knowledge of the system's internals or of a model, but it generally has difficulty diagnosing unsteady-state anomalies, even though such anomalies are common in large-scale systems.

The system abnormity diagnosis method based on machine learning is to adopt a clustering method to identify common modes in the system, or use training data to determine whether the system state is normal, and find out the potential causes of abnormity. The system abnormity diagnosis method based on machine learning can automatically learn system behaviors, but when the feature dimension is large, the accuracy is rapidly reduced.

The counting and threshold value based system anomaly diagnosis method can diagnose transient and intermittent anomalies, but the technology depends on parameter correction to a large extent and requires parameter configuration through strict mathematical formulas and analytical models.

The system abnormality diagnosis method based on visualization is to identify abnormal points in the system by visualizing trends and patterns of data. It can have multiple hypotheses about the root cause of the problem, but this technique cannot automatically identify the problem.

In view of the problems of the conventional system abnormality diagnosis methods described above, the present application provides an abnormality diagnosis method that can solve technical problems and overcome defects that the prior art cannot.

Fig. 1 shows an alternative flow diagram of the abnormality diagnosis method provided in an embodiment of the present application, which is described step by step below.

Step S101, establishing a multi-layer factor graph based on the abnormal log data.

In some embodiments, the format of a log in the system is: (timestamp, level, class, message) and (call stack).

In some embodiments, the multi-layer factor graph is used to diagnose system dynamic anomalies.

In some embodiments, the server building a multi-layer factor graph based on the abnormal log data includes steps S201 to S206, and fig. 2 shows an alternative flow diagram for building the multi-layer factor graph based on the abnormal log data, which will be described according to various steps.

Step S201, acquiring at least one log factor node based on the data of the non-function call stack in the abnormal log data.

In some embodiments, the server obtains at least one log factor node based on the data of the non-function call stack in the abnormal log data includes steps S301 to S303, and fig. 3 shows an optional flowchart of obtaining at least one log factor node based on the data of the non-function call stack in the abnormal log data, which will be described according to each step.

Step S301, screening abnormal logs according to the log level to form a log set.

In some embodiments, the server screens the abnormal logs according to the level field in the log data, and forms a log set from the abnormal logs of each selected level.

In some embodiments, the log levels are, in order from high to low: Fatal, Error, Warn, Info, and Debug. The Fatal level indicates that a very serious system abnormality has occurred that cannot be repaired; if the system ignores it and continues to operate, very serious consequences follow. The Error level indicates that a system abnormality has occurred that can be repaired; if the system ignores it, whether the system can operate normally cannot be determined. The Warn level indicates that a potential system abnormality exists that can be repaired; if the system ignores it, the system can still operate normally. The Info level emphasizes the running process and is used to output important information while the program runs. The Debug level is used to print running information.

In some embodiments, the server screens the abnormal logs of the Fatal level and the Error level according to the level field in the log data, and forms a log set for each.

In some embodiments, the log set may be represented as $L = \{log_1, \ldots, log_m\}$, where $log_j$ represents the $j$-th original log.
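The screening in step S301 can be sketched as follows. This is only a minimal illustration: it assumes that each log line starts with a timestamp followed by an upper-case level name, and the regular expression, the function name `screen_logs`, and the sample lines are hypothetical rather than taken from the application.

```python
import re

# Assumed line layout: "<date> <time>,<ms> <LEVEL> <class>: <message>".
LINE_RE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s*"
    r"(?P<level>FATAL|ERROR|WARN|INFO|DEBUG)\s+"
    r"(?P<rest>.*)$"
)

def screen_logs(lines, levels=("FATAL", "ERROR")):
    """Return the log set L: parsed records whose level is in `levels`."""
    log_set = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("level") in levels:
            log_set.append(match.groupdict())
    return log_set

# Hypothetical sample lines.
sample = [
    "2018-10-18 01:12:09,189 WARN org.example.A: disk usage high",
    "2018-10-18 01:12:10,002 ERROR org.example.B: write failed",
    "2018-10-18 01:12:11,450 FATAL org.example.C: node lost",
]
L = screen_logs(sample)
print([record["level"] for record in L])
```

Passing a different `levels` tuple selects other levels, for example to build a separate log set per level.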

Step S302, cleaning the screened abnormal logs.

In some embodiments, the server cleans the variable values out of the log based on the specific format of the log and on manual experience.

In some embodiments, the server uses regular expression matching to clean the variable values in the log.

In some embodiments, the variable values that are cleaned include at least one of: an Internet Protocol (IP) address, a path, a Uniform Resource Locator (URL), a decimal number, a hexadecimal number, a block_id of the Hadoop Distributed File System (HDFS), and an application_id, jobid, task_id, and container_id of a MapReduce task.
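A regular-expression cleaning pass of this kind might look like the sketch below. The patterns, their order, and the function name `clean_log` are assumptions made for illustration; in particular, the identifier prefixes (`blk_`, `application_`, and so on) and the timestamp pattern are guesses at typical Hadoop log content, not rules stated in the application.

```python
import re

# Hypothetical patterns, applied in order; every match is replaced by "#".
PATTERNS = [
    r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}",        # timestamp
    r"\b(?:blk|application|job|task|container)_\w+",     # Hadoop-style ids
    r"[a-zA-Z][a-zA-Z0-9+.-]*://\S+",                    # URL
    r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?",               # IPv4 address, optional port
    r"(?<![\w.#])/[\w./-]+",                             # absolute path
    r"\b0[xX][0-9a-fA-F]+\b",                            # hexadecimal number
    r"\b\d+\b",                                          # decimal number
]
COMPILED = [re.compile(pattern) for pattern in PATTERNS]

def clean_log(line):
    """Replace every variable value matched by a pattern with '#'."""
    for pattern in COMPILED:
        line = pattern.sub("#", line)
    return line

raw = ("2018-10-18 01:12:09,189 WARN org.apache.hadoop.hdfs.server.datanode."
       "DataNode:DatanodeRegistration(192.168.50.61:50010,infoPort=50075,"
       "infoSecurePort=0,ipcPort=50020)")
print(clean_log(raw))
```

The more specific patterns come first so that, for example, a URL is replaced as a whole before the generic decimal-number pattern can split it into pieces.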

Step S303, performing word segmentation on the cleaned abnormal log.

In some embodiments, the server performs word segmentation on the cleaned log, and the separators used for word segmentation include at least one of: '#', '?', '|', and '!'.

In some embodiments, the server uses delimiters to tokenize the cleaned logs.

In some embodiments, each log after the word segmentation is converted into a word list.

For example, the original log $log_j$ has the form:

2018-10-18 01:12:09,189 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:DatanodeRegistration(192.168.50.61:50010,infoPort=50075,infoSecurePort=0,ipcPort=50020)

The cleaned log is:

#WARN org.apache.hadoop.hdfs.server.datanode.DataNode:DatanodeRegistration(#,infoPort=#,infoSecurePort=#,ipcPort=#)

The log after word segmentation is:

['WARN','org','apache','hadoop','hdfs','server','datanode','DataNode','DatanodeRegistration','infoPort','infoSecurePort','ipcPort']

The log converted into a sentence is:

WARN org apache hadoop hdfs server datanode DataNode DatanodeRegistration datanodeUuid eea infoPort infoSecurePort ipcPort
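The word segmentation and sentence conversion of step S303 can be sketched as follows. The separator set used here (any character that is not a letter or digit) is an assumption chosen so that the sketch reproduces the word list shown above; the function names `tokenize` and `to_sentence` are hypothetical.

```python
import re

def tokenize(cleaned_log):
    """Split a cleaned log on runs of non-alphanumeric characters."""
    return [token for token in re.split(r"[^A-Za-z0-9]+", cleaned_log) if token]

def to_sentence(tokens):
    """Join the word list into one space-separated sentence (a log factor node)."""
    return " ".join(tokens)

cleaned = ("#WARN org.apache.hadoop.hdfs.server.datanode.DataNode:"
           "DatanodeRegistration(#,infoPort=#,infoSecurePort=#,ipcPort=#)")
tokens = tokenize(cleaned)
sentence = to_sentence(tokens)
print(tokens)
print(sentence)
```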

In some embodiments, the factor graph is generally a bipartite graph used to describe a multivariate function. In general, a factor graph has two basic types of nodes: variable nodes and the corresponding factor nodes (or function nodes); the variable represented by a variable node is an argument of the factor nodes connected to it. Nodes of the same type are not directly connected by edges. After preprocessing, each log $log_j$ is converted into a sentence consisting of a list of words, and each sentence corresponds to a log factor node in the multi-layer factor graph.

Step S202, based on the data of the function call stack in the abnormal log data, at least one variable node is obtained.

In some embodiments, the function call stack represents a call stack of log information, including calling procedures of a plurality of methods; the format of the function call stack is typically a combination of a plurality of "at xxx.

In some embodiments, all methods "xxx.yyy.aaa.function" of the function call stack are extracted using regular expressions. A call stack $cs_i$ can be represented as the sequence of the methods it contains, where each method corresponds to a variable node in the multi-layer factor graph. All call stacks constitute a call stack set $CS = \{cs_1, cs_2, \ldots, cs_m\}$. Each $cs_i$ can be regarded as a "sentence" in which each method is one of the "words"; CS can then be seen as the entire "sentence" set, corresponding to the set of variable nodes X in the multi-layer factor graph.
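Extracting the methods of a call stack with a regular expression can be sketched as follows. The sketch assumes Java-style stack frames of the form `at package.Class.method(File.java:line)`; the pattern, the function name `extract_methods`, and the sample stack trace are hypothetical.

```python
import re

# Assumed frame layout: "at package.Class.method(Source.java:123)".
FRAME_RE = re.compile(r"^\s*at\s+([\w$.<>]+)\(", re.MULTILINE)

def extract_methods(call_stack_text):
    """Return the fully qualified method names in a call stack, in order."""
    return FRAME_RE.findall(call_stack_text)

# Hypothetical stack trace.
stack = """java.io.IOException: example
    at org.example.storage.BlockReader.read(BlockReader.java:88)
    at org.example.storage.DataXceiver.run(DataXceiver.java:251)
    at java.lang.Thread.run(Thread.java:748)"""

cs_1 = extract_methods(stack)          # one call stack as a "sentence" of methods
CS = [cs_1]                            # the call stack set
X = sorted({method for cs in CS for method in cs})  # variable node set
print(cs_1)
```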

Step S203, acquiring at least one reason factor node corresponding to the at least one variable node.

In some embodiments, the server obtaining at least one reason factor node corresponding to the at least one variable node comprises: the server maps the abnormal reasons corresponding to the variable nodes into the multi-layer factor graph according to a basic abnormal reason knowledge base, forming a new layer of nodes, the reason factor nodes, denoted $R_i$.
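The mapping from variable nodes to reason factor nodes can be illustrated with a simple lookup. The application does not describe the structure of the knowledge base, so the dictionary below, its entries, and the function name `reason_factor_nodes` are purely hypothetical.

```python
# Hypothetical knowledge base: method name -> known abnormal reasons.
KNOWLEDGE_BASE = {
    "org.example.storage.BlockReader.read": ["disk failure", "corrupt block"],
    "org.example.net.Sender.send": ["network timeout"],
}

def reason_factor_nodes(variable_nodes, knowledge_base):
    """Return (R, edges): reason nodes and variable-reason edges."""
    reasons, edges = [], []
    for method in variable_nodes:
        for reason in knowledge_base.get(method, []):
            if reason not in reasons:
                reasons.append(reason)
            edges.append((method, reason))
    return reasons, edges

variable_nodes = [
    "org.example.storage.BlockReader.read",
    "org.example.net.Sender.send",
    "java.lang.Thread.run",  # no known reason in the knowledge base
]
R, edges = reason_factor_nodes(variable_nodes, KNOWLEDGE_BASE)
print(R)
print(edges)
```

Methods that have no entry in the knowledge base simply contribute no reason factor node.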

Step S204, performing directed connection on the at least one log factor node and the at least one variable node to obtain a first-layer factor graph in a multi-layer factor graph.

In some embodiments, the server performs a directed connection on the at least one log factor node and the at least one variable node, resulting in a first-level factor graph in a multi-level factor graph.

Step S205, performing undirected connection on the at least one variable node and the at least one reason factor node to obtain a second-layer factor graph in the multi-layer factor graph.

In some embodiments, the server performs undirected connection on the at least one variable node and the at least one reason factor node to obtain a second-layer factor graph in the multi-layer factor graph.

In some embodiments, an ordinary factor graph is obtained by factorizing a global function of multiple variables into a product of several local functions; the bipartite graph derived from this product is called the factor graph. For the function $g(X_1, \ldots, X_n)$, the following equation holds:

$$g(X_1, \ldots, X_n) = \prod_{j=1}^{m} f_j(S_j)$$

where $S_j \subseteq \{X_1, \ldots, X_n\}$ is the set of variables on which the local function $f_j$ depends.

In conjunction with the above expression, the factor graph can be represented as a triple $G = (X, F, E)$:

$X = \{X_1, \ldots, X_n\}$ denotes the set of variable nodes; $F = \{f_1, \ldots, f_m\}$ denotes the set of factor nodes;

$E$ is the set of edges: if a variable node $X_k$ is contained in the set $S_j$ of a factor node $f_j$, an undirected edge is added between $X_k$ and $f_j$.

Fig. 4 shows a multi-layer factor graph built based on two abnormal logs. In step S201, the log factor node set of the non-call-stack part of the abnormal log data, $L = \{log_1, \ldots, log_m\}$, is acquired; in step S202, the variable node set of the call stack part of the abnormal log data is acquired. The log factor node set is mapped onto the multi-layer factor graph; specifically, each $log_j$ in the log factor node set corresponds to a log factor node in the factor graph. The variable node set and the reason factor node set are likewise mapped onto the multi-layer factor graph.

For two abnormal logs $log_j$ and $log_k$, suppose the following relationships exist:

$log_j = f_A(X_2, X_3, X_5)$

$log_k = f_B(X_1, X_2, X_3, X_4)$

where each $X_i$ denotes a method call in the stack information to which the log relates.
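The two relationships above determine the first-layer factor graph as a triple of variable nodes, factor nodes, and edges. The sketch below builds that triple from the argument lists of f_A and f_B; the function name `build_factor_graph` and the string labels are hypothetical.

```python
# Argument sets of the two log factor nodes from the relationships above.
scopes = {
    "f_A": ["X2", "X3", "X5"],        # log_j
    "f_B": ["X1", "X2", "X3", "X4"],  # log_k
}

def build_factor_graph(scopes):
    """Return (X, F, E): variable nodes, factor nodes, variable-factor edges."""
    X = sorted({x for scope in scopes.values() for x in scope})
    F = sorted(scopes)
    E = sorted((x, f) for f, scope in scopes.items() for x in scope)
    return X, F, E

X, F, E = build_factor_graph(scopes)
# Variable nodes shared by both logs link the two factors in the graph.
shared = sorted(set(scopes["f_A"]) & set(scopes["f_B"]))
print(X, F, len(E), shared)
```

The shared variable nodes are what allow evidence from one abnormal log to influence the inference about the other.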

Step S206, dividing the abnormal log into at least one abnormal log group based on the time information of the data of the abnormal log, and establishing a multi-layer factor graph corresponding to each abnormal log group.

In some embodiments, the Hadoop system is a dynamically time-varying system. The load and resource utilization of the Hadoop system are changed continuously along with the time. At this time, the factor graph needs to be dynamically updated according to the state of the Hadoop system. For example, the Hadoop system is divided into a busy state, an idle state and a normal state according to the load of the Hadoop system, and the loads of the Hadoop system corresponding to the three states are respectively: the load of the system is 90% or more of the maximum load, the load of the system is 50% or less of the maximum load, and the load of the system is greater than 50% of the maximum load and less than 90% of the maximum load. Therefore, in different time periods, according to the load condition of the system, the relation between the abnormal logs of each component in the Hadoop system is represented by a multi-layer factor graph.
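The three load states described above can be expressed as a small classification function. The thresholds come from the text; the function name `system_state` and the way the load is measured are assumptions.

```python
def system_state(load, max_load):
    """Classify the system state from its load relative to the maximum load."""
    ratio = load / max_load
    if ratio >= 0.9:
        return "busy"    # load is 90% or more of the maximum load
    if ratio <= 0.5:
        return "idle"    # load is 50% or less of the maximum load
    return "normal"      # load is between 50% and 90% of the maximum load

print(system_state(95, 100), system_state(30, 100), system_state(70, 100))
```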

In some embodiments, the server dividing the abnormal log into at least one abnormal log group based on time information of data of the abnormal log comprises: the server divides the data of the abnormal log into different groups according to the time information in the timestamp part of the abnormal log, so that the abnormal logs are grouped according to the running state of the system, and establishes a corresponding multi-layer factor graph for all the abnormal logs of each group.
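Grouping abnormal logs by the system state in effect at their timestamps can be sketched as follows. How the state at a given time is obtained is not specified in the application, so the `state_at` function below (a fixed daily schedule) is a hypothetical stand-in, as are the function name `group_logs` and the sample records.

```python
from collections import defaultdict
from datetime import datetime

def group_logs(logs, state_at):
    """Group log entries by the system state at each entry's timestamp."""
    groups = defaultdict(list)
    for timestamp, message in logs:
        groups[state_at(timestamp)].append(message)
    return dict(groups)

# Hypothetical state schedule: busy from 09:00 to 17:59, idle otherwise.
def state_at(timestamp):
    return "busy" if 9 <= timestamp.hour < 18 else "idle"

logs = [
    (datetime(2018, 10, 18, 1, 12, 9), "log_1"),
    (datetime(2018, 10, 18, 10, 30, 0), "log_2"),
    (datetime(2018, 10, 18, 11, 0, 0), "log_3"),
]
groups = group_logs(logs, state_at)
print(groups)
```

Each resulting group would then be given its own multi-layer factor graph.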

In some embodiments, the multi-layer factor graph is a multi-layer factor graph created by dividing data of the abnormal log according to time information, and therefore, the multi-layer factor graph can also be called a time-varying multi-layer factor graph.

In some embodiments, the method further comprises: all the multi-layer factor graphs are combined to form the integral multi-layer factor graph of the system.

Step S102, determining candidate abnormal reasons of the abnormal log based on the multi-layer factor graph.

In some embodiments, the server determining candidate abnormal reasons of the abnormal log based on the multi-layer factor graph comprises: the server transmits the prior probabilities of the variable nodes to the log factor nodes and the reason factor nodes in the multi-layer factor graph, respectively; each log factor node and each reason factor node calculates an external probability from the received prior probabilities and transmits it to the variable nodes; each variable node determines, based on the received external probabilities, the prior probability for the log factor node and the prior probability for the reason factor node, and sends them to the log factor node and the reason factor node, respectively;

this cycle repeats until the number of cycles reaches a first threshold, or until the difference between the current prior or posterior probability and the previous prior or posterior probability is determined to be smaller than a second threshold; the candidate abnormal reason is then determined based on the prior probability or the posterior probability obtained at that point.

In some embodiments, the first threshold is a preset maximum number of cycles; the second threshold is a preset maximum difference.
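The two stopping conditions can be sketched as a generic iteration loop. The update rule is passed in as a function, because the concrete message computation depends on the factor graph; the names `iterate_until_stable`, `max_cycles` and `max_difference` are hypothetical stand-ins for the first and second thresholds.

```python
from typing import Callable, Dict, Tuple

Beliefs = Dict[str, float]


def iterate_until_stable(update: Callable[[Beliefs], Beliefs],
                         initial: Beliefs,
                         max_cycles: int,
                         max_difference: float) -> Tuple[Beliefs, int]:
    """Repeat `update` until either threshold is met; return beliefs and cycle count."""
    current = dict(initial)
    for cycle in range(1, max_cycles + 1):
        updated = update(current)
        # Largest change of any probability since the previous cycle.
        change = max(abs(updated[k] - current[k]) for k in updated)
        current = updated
        if change < max_difference:      # second threshold: converged
            return current, cycle
    return current, max_cycles           # first threshold: cycle limit reached
```

Whichever condition fires first ends the loop, and the probabilities from the last completed cycle are what the ranking step uses.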

In some embodiments, determining the candidate abnormal reason according to the prior probability or the posterior probability when the number of cycles reaches the first threshold or the probability converges includes: calculating the probability of the reason corresponding to each reason factor node according to the prior probability or the posterior probability, and sorting the reasons in descending order of probability, wherein the reason with the highest probability is the candidate abnormal reason.

In some embodiments, the server uses a dynamic sum-product algorithm on the multi-layer factor graph to calculate how information about the abnormal log propagates along the function call relationships, and thereby analyzes the abnormal reason of the abnormal log. Specifically, four message-passing relationships are calculated on the multi-layer factor graph established in step S101: from a variable to a local function (log factor node), from a local function (log factor node) to a variable, from a variable to a local function (reason factor node), and from a local function (reason factor node) to a variable. That is, the data portion of the abnormal log that may be abnormal is inferred from the function call stack information of the abnormal log, the likely function call stack information is inferred from the data of the abnormal log, the likely abnormal reason is inferred from the function call stack information, and the likely function call stack information is inferred from the abnormal reason. Because the multi-layer factor graph is time-varying, a dynamic sum-product algorithm is used: at different times the calculation is performed on different multi-layer factor graphs, which yields different probability results, so that the most likely abnormal reason of a given abnormal log in the current state can be determined for each state of the system.

In some embodiments, determining candidate abnormal reasons of the abnormal log based on the multi-layer factor graph includes: calculating, based on the multi-layer factor graph, all messages passed by the at least one log factor node to the at least one variable node, wherein the marginal function (edge function) of each variable node is the product of all messages passed to that variable node by the at least one log factor node;

calculating, based on the multi-layer factor graph, all messages passed by the at least one variable node to the at least one reason factor node, wherein the marginal function of each reason factor node is the product of all messages passed to that reason factor node by the at least one variable node;

and calculating the probability of each reason factor node based on the marginal functions of the variable nodes and of the reason factor nodes, and sorting the reason factor nodes in descending order of probability, wherein the abnormal reason corresponding to the reason factor node with the highest probability is the candidate abnormal reason.
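The product-then-rank step can be illustrated as follows. The sketch assumes each reason factor node has already received one scalar message per connected variable node, and it normalises the products across reasons so that they can be read as probabilities; that normalisation and the name `rank_reasons` are assumptions, not details given in the application.

```python
from math import prod
from typing import Dict, List, Sequence, Tuple


def rank_reasons(incoming: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Multiply the messages arriving at each reason node, normalise, sort descending."""
    scores = {reason: prod(messages) for reason, messages in incoming.items()}
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("all reason scores are zero; nothing to rank")
    ranked = [(reason, score / total) for reason, score in scores.items()]
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

The first element of the returned list corresponds to the candidate abnormal reason.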

Fig. 5 is a schematic diagram of the sum-product algorithm of the factor graph.

The message calculation rule according to the sum-product algorithm is as follows:

wherein the following quantities appear:

the external probability calculated by the reason factor node from the received prior probabilities;

the prior probability sent by the variable node to the reason factor node;

r(x, y₁, ..., y_n), the information contained in the reason factor node;

the external probability calculated by the log factor node from the received prior probabilities;

l(k, z₁, ..., z_n), the information contained in the log factor node;

and the prior probability sent by the variable node to the log factor node.

According to the sum-product theorem, if the factor graph of the function f contains no cycles, then the marginal function (edge function) of a variable node is the product of all messages passed to that variable node.
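For orientation, the standard textbook form of the sum-product message rules is sketched below, using the factor names r(x, y₁, ..., y_n) and l(k, z₁, ..., z_n) from the description. The message symbol μ, the neighbour set N(·) and the marginal symbol g are notation assumed here for illustration; they are not taken from the application and may differ from its own formulas.

```latex
\begin{align}
% Factor-to-variable messages (the "external probabilities")
\mu_{r \to x}(x) &= \sum_{y_1, \dots, y_n} r(x, y_1, \dots, y_n) \prod_{i=1}^{n} \mu_{y_i \to r}(y_i) \\
\mu_{l \to k}(k) &= \sum_{z_1, \dots, z_n} l(k, z_1, \dots, z_n) \prod_{i=1}^{n} \mu_{z_i \to l}(z_i) \\
% Variable-to-factor message (the "prior probability" sent onward)
\mu_{x \to r}(x) &= \prod_{h \in N(x) \setminus \{r\}} \mu_{h \to x}(x) \\
% Marginal (edge) function of a variable node on a cycle-free graph
g(x) &\propto \prod_{h \in N(x)} \mu_{h \to x}(x)
\end{align}
```

On a cycle-free graph these rules give exact marginals; on a graph with cycles the same updates are iterated and the result is an approximation.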

For example, in Fig. 4, the two pieces of log data correspond to the variable nodes with sequence numbers 2, 3 and 5 and to the variable nodes with sequence numbers 1, 2, 3 and 4, respectively. The variable nodes with sequence numbers 1, 3 and 5 are associated with fault cause factor node 1; the variable nodes with sequence numbers 2, 3 and 4 are associated with fault cause factor node 2. The abnormality cause of an abnormal log can be analyzed using the dynamic sum-product algorithm. When log_j occurs alone, the variable nodes with sequence numbers 2, 3 and 5 occur simultaneously, and these variable nodes all point to cause factor node 2, so cause 2 is likely to be the cause of the abnormality behind log_j. When log_k occurs alone, the variable nodes with sequence numbers 1, 2, 3 and 4 all occur simultaneously, and these variable nodes point both to cause factor node 1 and to cause factor node 2. In this case the probabilities of the two cause factor nodes are calculated using the sum-product algorithm. The sum-product algorithm contains two key message-passing steps: forward message passing, from a variable node to a factor node, and backward message passing, from a factor node to a variable node. The first step of the algorithm is to pass the prior probabilities in the variable nodes to the factor nodes. Each factor node calculates a posterior probability from the received prior probabilities and passes it back to the variable nodes. This cycle repeats until the algorithm converges or the number of cycles reaches the preset upper limit. The priority order of cause 1 and cause 2, that is, which cause is more likely to be the cause of the abnormality at that time, is determined from the final probability values.
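The forward and backward steps can be seen on a deliberately tiny, cycle-free example: two binary variables X and Y joined by one pairwise factor. This is not the graph of Fig. 4; the numbers and all function names are hypothetical, and the brute-force function is included only to check the message-passing result.

```python
from typing import List

Vector = List[float]
Table = List[List[float]]  # table[x][y] = value of the pairwise factor f(x, y)


def normalise(values: Vector) -> Vector:
    total = sum(values)
    return [v / total for v in values]


def factor_to_variable(table: Table, msg_from_x: Vector) -> Vector:
    """Backward step: for every y, sum over x of f(x, y) * (message from X)."""
    return [sum(table[x][y] * msg_from_x[x] for x in range(len(table)))
            for y in range(len(table[0]))]


def marginal_of_y(prior_x: Vector, prior_y: Vector, table: Table) -> Vector:
    # Forward step: X has no other neighbours, so it forwards its prior unchanged.
    msg_x_to_f = prior_x
    msg_f_to_y = factor_to_variable(table, msg_x_to_f)
    # The marginal of Y is the product of all messages arriving at Y.
    return normalise([prior_y[y] * msg_f_to_y[y] for y in range(len(prior_y))])


def brute_force_marginal_of_y(prior_x: Vector, prior_y: Vector, table: Table) -> Vector:
    """Reference result: build the full joint table and sum out X."""
    return normalise([
        sum(prior_x[x] * table[x][y] * prior_y[y] for x in range(len(prior_x)))
        for y in range(len(prior_y))
    ])
```

Because this graph has no cycles, one forward and one backward pass already give the exact marginal; a graph with cycles would need the repeated cycles described above.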

In some embodiments, after the step S102, the method further comprises:

step S103, comparing the candidate abnormal reasons with actual abnormal reasons to obtain the confidence of the abnormal reasons corresponding to the abnormal log; and based on the confidence coefficient, dividing the abnormal log group again.

In some embodiments, the server compares the candidate abnormal reason calculated in step S102 with an actual system abnormal reason to obtain a confidence of the abnormal reason corresponding to the abnormal log; and based on the confidence coefficient, dividing the abnormal log group again.

In some embodiments, the method further comprises: comparing the candidate abnormal reasons with the actual abnormal reasons, and calculating the confidence coefficient of the candidate abnormal reasons corresponding to the same abnormal log in different system states to obtain a confidence coefficient matrix of the whole system; and based on the confidence coefficient matrix, determining the credibility of each multi-layer factor graph in different system states, further dynamically adjusting the time division of the abnormal log groups in the diagnosis process, and establishing different multi-layer factor graphs.
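One possible reading of the confidence matrix is a hit rate per abnormal log and system state: the fraction of diagnoses in which the candidate reason matched the actual reason. The application does not define the confidence formula, so this definition, the record shape and the name `confidence_matrix` are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (log identifier, system state, candidate reason, actual reason); hypothetical shape
Record = Tuple[str, str, str, str]


def confidence_matrix(records: Iterable[Record]) -> Dict[str, Dict[str, float]]:
    """Return matrix[log][state] = share of diagnoses where candidate == actual."""
    hits: Dict[Tuple[str, str], int] = defaultdict(int)
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for log_id, state, candidate, actual in records:
        totals[(log_id, state)] += 1
        if candidate == actual:
            hits[(log_id, state)] += 1
    matrix: Dict[str, Dict[str, float]] = {}
    for (log_id, state), count in totals.items():
        matrix.setdefault(log_id, {})[state] = hits[(log_id, state)] / count
    return matrix
```

A low value in one column would suggest that the factor graph for that system state is less reliable, which is the signal the description uses to re-divide the abnormal log groups.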

Therefore, by introducing a collaborative learning framework, the system abnormity diagnosis process is equivalent to a closed effective reasoning diagnosis process, the establishment and selection of the factor graph under different states can be adaptively adjusted, and the comprehensive and accurate dynamic abnormity diagnosis of the Hadoop system is achieved.

Therefore, because the factor graph model has a solid theoretical basis and is both multi-layered and time-varying, the system abnormity diagnosis method provided by the embodiments of the application can more accurately quantify the different abnormal reasons corresponding to the same abnormal log in different states. By using a dynamic sum-product algorithm to calculate the probabilities of information passed in the multi-layer factor graph built from the log information, the application captures the associations among log information in different running states of the system, and thereby achieves dynamic reason inference with a higher accuracy that the prior art cannot achieve. The structure of the factor graph model greatly reduces the dependence on prior knowledge, so manual intervention in actual analysis can be greatly reduced; moreover, because the sum-product algorithm fixes the calculation model, the probability calculation can be carried out automatically by a computer, which reduces the amount of manual calculation and therefore lowers the time complexity.

Fig. 6 is a schematic flow chart illustrating an alternative method for diagnosing system anomaly according to an embodiment of the present application, which will be described according to various steps.

In some embodiments, the format of the log in the system is: ((timestamp, level, class, message), (callstack)).

In some embodiments, only some logs contain the callstack portion; an exception log includes a callstack portion.

Step S401, abnormal logs are screened according to the level corresponding to the level, and a log set is formed.
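The log format and the screening by level can be sketched as below. The set of levels treated as abnormal is an assumption (the description only says that logs are screened by level), and `LogRecord`, `parse_record` and `screen_abnormal` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Tuple

# Assumed severity labels; the description does not list the levels it screens for.
ABNORMAL_LEVELS = frozenset({"WARN", "ERROR", "FATAL"})


@dataclass(frozen=True)
class LogRecord:
    timestamp: str
    level: str
    class_name: str                   # "class" is a Python keyword, hence the rename
    message: str
    callstack: Tuple[str, ...] = ()   # empty when the log has no callstack portion


def parse_record(raw: Tuple[Sequence[str], Sequence[str]]) -> LogRecord:
    """Unpack the ((timestamp, level, class, message), (callstack)) structure."""
    (timestamp, level, class_name, message), callstack = raw
    return LogRecord(timestamp, level, class_name, message, tuple(callstack))


def screen_abnormal(records: Iterable[LogRecord],
                    levels: frozenset = ABNORMAL_LEVELS) -> List[LogRecord]:
    """Keep only the records whose level marks them as abnormal."""
    return [record for record in records if record.level.upper() in levels]
```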

Step S402, cleaning the screened abnormal logs.

In some embodiments, the variable values removed by cleaning include at least one of: an IP address, a path, a URL, a decimal number, a hexadecimal number, a block_id of HDFS, and an Application_id, a jobid, a task_id or a container_id of a MapReduce task.

Step S403, performing word segmentation on the cleaned abnormal log.

In some embodiments, the server uses delimiters to tokenize the cleaned logs.

For example, the original form of log_j is:

the cleaned log is:

the log after word segmentation is:

the log converted into a sentence is:
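The cleaning, word segmentation and sentence conversion steps can be sketched with regular expressions. The patterns and the delimiter set below are illustrative assumptions, and the sample messages in the usage note are invented; they are not the log_j example of the application.

```python
import re
from typing import List

# Patterns for variable values, applied in order (identifiers before bare numbers).
_VARIABLE_PATTERNS = [
    re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*://\S+"),                    # URL
    re.compile(r"\b(?:blk|application|job|task|container)_[-\w]+"),  # Hadoop identifiers
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?\b"),             # IPv4, optional port
    re.compile(r"(?<!\w)/[\w./-]+"),                                  # absolute path
    re.compile(r"\b0[xX][0-9a-fA-F]+\b"),                             # hexadecimal number
    re.compile(r"\b\d+\b"),                                           # decimal number
]
_DELIMITERS = re.compile(r"[\s,;:=()\[\]{}]+")


def clean(message: str) -> str:
    """Remove variable values so that only the constant template text remains."""
    for pattern in _VARIABLE_PATTERNS:
        message = pattern.sub(" ", message)
    return message


def tokenize(message: str) -> List[str]:
    """Split on delimiters and drop empty tokens."""
    return [token for token in _DELIMITERS.split(message) if token]


def to_sentence(message: str) -> str:
    """Clean, segment and re-join a log message into a single sentence."""
    return " ".join(tokenize(clean(message)))
```

For instance, `to_sentence("Failed to read blk_1073741825 from 10.0.0.5:50010 at /data/hdfs/dn1 after 3 retries")` yields `"Failed to read from at after retries"`, which would serve as the log factor node for every message sharing that template.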

Step S404, based on the data of the function call stack in the abnormal log data, at least one variable node is obtained.

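wherein each variable node represents a method call in the function call stack information of the abnormal log.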

Step S405, acquiring at least one cause factor node corresponding to the at least one variable node.

Step S406, performing directional connection on the at least one log factor node and the at least one variable node to obtain a first-layer factor graph in a multi-layer factor graph.

Step S407, performing undirected connection on the at least one variable node and the at least one cause factor node to obtain a second-layer factor graph in the multi-layer factor graph.

Fig. 4 shows a multi-layer factor graph built from two abnormal logs. From the data of the abnormal logs obtained in steps S401 to S403, the log factor node set of the non-callstack part is L = {log₁, ..., log_m}; in step S404, the variable node set of the callstack part of the abnormal log data is acquired. The log factor node set is mapped onto the multi-layer factor graph: specifically, each log_j in the log factor node set corresponds to one log factor node in the factor graph. The variable node set and the reason factor node set are likewise mapped onto the multi-layer factor graph.

For two abnormal logs log_j and log_k, the following relationships exist:

log_j = f_A(X₂, X₃, X₅)

log_k = f_B(X₁, X₂, X₃, X₄)
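The two layers can be held in a simple data structure, sketched here with the node sets of the Fig. 4 example: the log-to-variable links are the two relationships above, and the cause-to-variable links are the associations stated for Fig. 4 in the description. The class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, Set


@dataclass
class MultiLayerFactorGraph:
    # Layer 1: log factor node -> variable nodes (method calls) it depends on.
    log_to_vars: Dict[str, FrozenSet[str]] = field(default_factory=dict)
    # Layer 2: cause factor node -> variable nodes it is associated with.
    cause_to_vars: Dict[str, FrozenSet[str]] = field(default_factory=dict)

    def add_log(self, log: str, variables: Iterable[str]) -> None:
        self.log_to_vars[log] = frozenset(variables)

    def add_cause(self, cause: str, variables: Iterable[str]) -> None:
        self.cause_to_vars[cause] = frozenset(variables)

    def variables(self) -> Set[str]:
        """All variable nodes that appear in either layer."""
        nodes: Set[str] = set()
        for group in list(self.log_to_vars.values()) + list(self.cause_to_vars.values()):
            nodes |= group
        return nodes

    def shared_variables(self, log: str) -> Dict[str, FrozenSet[str]]:
        """For each cause, the variable nodes it shares with the given log."""
        observed = self.log_to_vars[log]
        return {cause: observed & linked
                for cause, linked in self.cause_to_vars.items()
                if observed & linked}


def build_fig4_example() -> MultiLayerFactorGraph:
    graph = MultiLayerFactorGraph()
    graph.add_log("log_j", ["X2", "X3", "X5"])
    graph.add_log("log_k", ["X1", "X2", "X3", "X4"])
    graph.add_cause("cause_1", ["X1", "X3", "X5"])
    graph.add_cause("cause_2", ["X2", "X3", "X4"])
    return graph
```

Calling `build_fig4_example().shared_variables("log_k")` shows that log_k reaches both cause factor nodes, which is the situation in which the probabilities of the two causes have to be compared.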

Step S408, dividing the abnormal log into at least one abnormal log group based on the time information of the data of the abnormal log, and establishing a multi-layer factor graph corresponding to each abnormal log group.

Step S409, determining candidate abnormal reasons of the abnormal log based on the multi-layer factor graph.

Fig. 5 is a schematic diagram of the sum-product algorithm of the factor graph.

The message calculation rules of the sum-product algorithm and the marginal function (edge function) of the variable node are the same as those described for step S102.

For example, in Fig. 4, the two pieces of log data correspond to the variable nodes with sequence numbers 2, 3 and 5 and to the variable nodes with sequence numbers 1, 2, 3 and 4, respectively. The variable nodes with sequence numbers 1, 3 and 5 are associated with fault cause factor node 1; the variable nodes with sequence numbers 2, 3 and 4 are associated with fault cause factor node 2. The abnormality cause of an abnormal log can be analyzed using the dynamic sum-product algorithm. When log_j occurs alone, the variable nodes with sequence numbers 2, 3 and 5 occur simultaneously, and these variable nodes all point to cause factor node 2, so cause 2 is likely to be the cause of the abnormality behind log_j. When log_k occurs alone, the variable nodes with sequence numbers 1, 2, 3 and 4 all occur simultaneously, and these variable nodes point both to cause factor node 1 and to cause factor node 2. In this case the probabilities of the two cause factor nodes are calculated using the sum-product algorithm. The sum-product algorithm contains two key message-passing steps: forward message passing, from a variable node to a factor node, and backward message passing, from a factor node to a variable node. The first step of the algorithm is to pass the prior probabilities in the variable nodes to the factor nodes. Each factor node calculates a posterior probability from the received prior probabilities and passes it back to the variable nodes. This cycle repeats until the algorithm converges or the number of cycles reaches the preset upper limit. The priority order of cause 1 and cause 2, that is, which cause is more likely to be the cause of the abnormality at that time, is determined from the final probability values.

Step S410, comparing the candidate abnormal reasons with actual abnormal reasons to obtain the confidence of the abnormal reasons corresponding to the abnormal log; and based on the confidence coefficient, dividing the abnormal log group again.

Therefore, because the factor graph model has a solid theoretical basis and is both multi-layered and time-varying, the system abnormity diagnosis method provided by the embodiments of the application can more accurately quantify the different abnormal reasons corresponding to the same abnormal log in different states. By using a dynamic sum-product algorithm to calculate the probabilities of information passed in the multi-layer factor graph built from the log information, the application captures the associations among log information in different running states of the system, and thereby achieves dynamic reason inference with a higher accuracy that the prior art cannot achieve. The structure of the factor graph model greatly reduces the dependence on prior knowledge, so manual intervention in actual analysis can be greatly reduced; moreover, because the sum-product algorithm fixes the calculation model, the probability calculation can be carried out automatically by a computer, which reduces the amount of manual calculation and therefore lowers the time complexity. By introducing a collaborative learning framework, the system abnormity diagnosis process becomes a closed-loop reasoning and diagnosis process in which the establishment and selection of factor graphs in different states are adjusted adaptively, achieving comprehensive and accurate dynamic abnormity diagnosis of the Hadoop system.

Fig. 7 is a schematic diagram illustrating an alternative structure of a system diagnostic apparatus provided in an embodiment of the present application, which will be described according to various parts.

An establishing unit 701 configured to establish a multi-layer factor graph based on the abnormal log data;

a determining unit 702, configured to determine candidate abnormality reasons of the abnormality log based on the multi-layer factor graph.

An obtaining unit 703, configured to obtain at least one variable node based on data of a function call stack in the abnormal log data; and acquiring at least one reason factor node corresponding to the at least one variable node.

The obtaining unit 703 is further configured to obtain at least one log factor node based on data of a non-function call stack in the abnormal log data;

the first connection unit 704 is configured to perform directional connection on the at least one log factor node and the at least one variable node to obtain a first-layer factor graph in a multi-layer factor graph.

A second connecting unit 705, configured to perform a non-directional connection on the at least one variable node and the at least one cause factor node, so as to obtain a second-layer factor graph in the multi-layer factor graph.

A dividing unit 706 configured to divide the exception log into at least one exception log group based on time information of data of the exception log;

the establishing unit 701 is further configured to establish a multi-layer factor graph corresponding to each exception log group.

A removing unit 707, configured to remove a variable in data of a non-function call stack in the abnormal log data;

a word segmentation unit 708, configured to segment words of data of a non-function call stack in the abnormal log data of the removed variable;

a conversion unit 709, configured to convert data of a non-function call stack in the abnormal log data after the word segmentation into a sentence;

wherein the sentence is a log factor node.

An extracting unit 710, configured to extract a method included in the data of the function call stack in the abnormal log data;

the determining unit 702 is further configured to determine that the method is the variable node.

A calculating unit 711, configured to transmit the prior probabilities in the variable nodes to the log factor node and the reason factor node in the multilayer factor graph, respectively;

and circulating the steps until the circulating times reach a first threshold value, or determining that the difference value between the prior probability or the posterior probability and the latest prior probability or posterior probability is smaller than a second threshold value.

The determining unit 702 is further configured to determine the candidate abnormal cause based on the prior probability or the posterior probability when the number of cycles reaches a first threshold, or when the difference between the prior probability or the posterior probability and the last prior probability or the posterior probability is smaller than a second threshold.

The dividing unit 706 is further configured to: comparing the candidate abnormal reasons with actual abnormal reasons to obtain the confidence of the abnormal reasons corresponding to the abnormal log; and based on the confidence coefficient, dividing the abnormal log group again.

Those of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a storage medium and, when executed, performs the steps of the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as a removable memory device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, and an optical disk.

Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially or partially implemented in the form of a software product, which is stored in a storage medium and includes several commands for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.

The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims

1. A method for diagnosing system abnormalities, the method comprising:

establishing a multi-layer factor graph based on the abnormal log data;

2. The method of claim 1, wherein the building a multi-layer factor graph based on the anomaly log data comprises:

3. The method of claim 2, wherein building a multi-layer factor graph based on the anomaly log data further comprises:

4. The method of claim 2, wherein building a multi-layer factor graph based on the anomaly log data further comprises:

5. The method of any of claims 1 to 4, wherein the building a multi-layer factor graph based on the anomaly log data further comprises:

6. The method of claim 1, wherein prior to establishing the multi-tiered factor graph based on the anomaly log data, the method further comprises at least one of:

wherein the sentence is a log factor node.

7. The method of claim 2, wherein the data based on a function call stack in the exception log data comprises:

determining the method to be the variable node.

8. The method of claim 1, wherein obtaining candidate anomaly causes for an anomaly log based on the multi-layer factor graph comprises:

and determining the candidate abnormal reason based on the prior probability or the posterior probability when the number of circulation reaches a first threshold value or when the difference value between the prior probability or the posterior probability value and the last prior probability or the posterior probability is smaller than a second threshold value.

9. The method of claim 1, further comprising:

comparing the candidate abnormal reasons with actual abnormal reasons to obtain the confidence of the abnormal reasons corresponding to the abnormal log;

and based on the confidence coefficient, dividing the abnormal log group again.

10. A system abnormality diagnosis apparatus, characterized in that the apparatus comprises:

11. The apparatus of claim 10, further comprising:

the obtaining unit is used for obtaining at least one variable node based on the data of the function call stack in the abnormal log data; and acquiring at least one reason factor node corresponding to the at least one variable node.

12. The apparatus of claim 11,

the obtaining unit is further configured to obtain at least one log factor node based on data of a non-function call stack in the abnormal log data;

13. The apparatus of claim 11, further comprising:

and the second connecting unit is used for carrying out multidirectional connection on the at least one variable node and the at least one reason factor node to obtain a second-layer factor graph in the multi-layer factor graph.

14. The apparatus according to any one of claims 10 to 13,

the device further comprises: a dividing unit for dividing the abnormality log into at least one abnormality log group based on time information of data of the abnormality log;

15. The apparatus according to claim 10, characterized in that the apparatus further comprises at least one of the following units:

wherein the sentence is a log factor node.

16. The apparatus of claim 11, further comprising:

the extracting unit is used for extracting a method included in the data of the function call stack in the abnormal log data;

17. The apparatus of claim 10, further comprising:

the calculation unit is used for respectively transmitting the prior probability in the variable node to the log factor node and the reason factor node in the multilayer factor graph;

18. The apparatus of claim 10, wherein the dividing unit is further configured to:

and based on the confidence coefficient, dividing the abnormal log group again.

19. A storage medium storing an executable program which, when executed by a processor, implements the system abnormality diagnosis method according to any one of claims 1 to 9.

20. A system abnormality diagnosis apparatus comprising a memory, a processor, and an executable program stored on the memory and executable by the processor, wherein the processor executes the executable program to perform the steps of the system abnormality diagnosis method according to any one of claims 1 to 9.