CN116804951A - Fault analysis method and related equipment

Info

Publication number: CN116804951A
Application number: CN202310610513.3A
Authority: CN
Inventors: 吴施楷; 梁永贵; 曹瑞
Original assignee: XFusion Digital Technologies Co Ltd
Current assignee: XFusion Digital Technologies Co Ltd
Priority date: 2023-05-23
Filing date: 2023-05-23
Publication date: 2023-09-26

Abstract

The embodiment of the application discloses a fault analysis method and related equipment. The method comprises the following steps: obtaining M fault events from a log; determining M-1 fault event pairs based on the time sequence of the M fault events, where each fault event pair consists of two fault events adjacent in time sequence; acquiring a context semantic body of each fault event pair, where the context semantic body of each fault event pair comprises the context semantic bodies of the two fault events in the pair, and the context semantic body of each fault event sequentially comprises the feature vector matrices corresponding to the N events preceding the fault event, the feature vector matrix corresponding to the fault event, and the feature vector matrices corresponding to the N events following the fault event; respectively inputting the context semantic bodies of the M-1 fault event pairs into a classification model to determine M-1 relationship probability values, where each relationship probability value indicates the relevance between the two fault events in the corresponding fault event pair; and determining the fault propagation link based on the M-1 relationship probability values.

Description

Fault analysis method and related equipment

Technical Field

The embodiment of the application relates to the field of servers, in particular to a fault analysis method and related equipment.

Background

With the rapid development of hardware and software systems in the server field, the volume of log data generated by intelligent operation and maintenance keeps increasing, log formats vary widely, and the number of log words keeps growing, so log data is becoming increasingly difficult to analyze.

A server inevitably encounters faults during operation. When a fault is handled, its symptoms are discovered first, but finding the root cause of the fault is the key to solving the problem. The fault events on the propagation link of the fault root cause are recorded in the log data, and accurately identifying the relevance between these fault events is a key step in locating the fault root cause. However, the prior art does not solve this problem well.

Disclosure of Invention

The embodiment of the application provides a fault analysis method and related equipment, which can determine a fault propagation link and accurately position a fault root cause.

The first aspect of the present application provides a fault analysis method:

M fault events are obtained from the log, where M is a positive integer greater than 1. M-1 fault event pairs are determined based on the time sequence of the M fault events, where each fault event pair consists of two fault events adjacent in time sequence. A context semantic body of each fault event pair is acquired; the context semantic body of each fault event pair comprises the context semantic bodies of the two fault events in the pair, and the context semantic body of each fault event sequentially comprises the feature vector matrices corresponding to the N events preceding the fault event, the feature vector matrix corresponding to the fault event, and the feature vector matrices corresponding to the N events following the fault event. The context semantic bodies of the M-1 fault event pairs are respectively input into a classification model to determine M-1 relationship probability values, where each relationship probability value indicates the relevance between the two fault events in the corresponding fault event pair. The fault propagation link is determined based on the M-1 relationship probability values.

In the application, the context semantic body of a fault event pair better reflects the temporal and spatial relationship between the fault events and discriminates relationships more clearly, so the relationship probability values acquired from the context semantic bodies are more accurate. The fault root cause can therefore be located accurately and the fault propagation link determined.

In one possible implementation, the M fault events are obtained from the log based on a first event detection model and a second event detection model, where the first event detection model is a detection model determined by training on fault events, and the second event detection model is a detection model determined by training on normal events.

In the application, in order to avoid the situation that the accuracy of the event detection model is insufficient due to the fact that the sample size of the fault event is too small compared with that of the normal event, the fault event and the normal event are respectively used for training the two event detection models, and a plurality of fault events in the log template are determined based on the two event detection models, so that the accuracy of identification can be improved.

In one possible implementation, a plurality of events are obtained from a log; determining a feature vector matrix corresponding to a plurality of events; inputting the feature vector matrix of each event into a first event detection model to obtain a first output feature vector matrix corresponding to the feature vector matrix of each event; inputting the feature vector matrix of each event into a second event detection model to obtain a second output feature vector matrix; determining a first error value based on the feature vector matrix of each event and the corresponding first output feature vector matrix; determining a second error value based on the feature vector matrix of each event and the corresponding second output feature vector matrix; based on the first error value and the second error value, it is determined whether each event is a fault event.

In one possible implementation, a first membership value is calculated based on the first error value, and a second membership value is calculated based on the second error value. The first membership value is calculated as p1 = 1 - (x1 - a1)/(b1 - a1), where x1 is the first error value of each event, a1 is the maximum value of a plurality of third errors, and b1 is the minimum value of the plurality of third errors; each third error is an error determined based on a first feature vector matrix and a third output feature vector matrix; the first feature vector matrix is a feature vector matrix corresponding to a fault event used for training the first event detection model, and the third output feature vector matrix is the output matrix obtained by inputting the first feature vector matrix into a third model, the third model being used for obtaining the first event detection model. The second membership value is calculated as p2 = 1 - (x2 - a2)/(b2 - a2), where x2 is the second error value of each event, a2 is the maximum value of a plurality of fourth errors, and b2 is the minimum value of the plurality of fourth errors; each fourth error is an error determined based on a second feature vector matrix and a fourth output feature vector matrix; the second feature vector matrix is a feature vector matrix corresponding to a normal event used for training the second event detection model, and the fourth output feature vector matrix is the output matrix obtained by inputting the second feature vector matrix into a fourth model, the fourth model being used for obtaining the second event detection model. Whether each event is a fault event is determined based on the first membership value and the second membership value.

In one possible implementation, the event is determined to be a fault event if the first membership value is greater than or equal to the second membership value.

In one possible implementation, the number of fault events used to train the first event detection model is twice the number of normal events used to train the second event detection model.

In one possible implementation, the log is preprocessed to obtain a log template of the log, where the log template comprises one or more events and each event comprises a plurality of log template words; each log template word is converted into a primary feature vector; the primary feature vector is converted into a word2vec embedded feature vector; and the word2vec embedded feature vectors corresponding to the log template words of each event are spliced to obtain the feature vector matrix corresponding to that event.

In one possible implementation, the first event detection model and the second event detection model are LSTM models.

In one possible implementation, the classification model is a Bi-LSTM-softmax classification model.

A third aspect of the application provides a computing device comprising a processor coupled to a memory for storing instructions which, when executed by the processor, cause the computing device to perform a method as in the first aspect described above.

A fourth aspect of the application provides a computer readable storage medium having stored thereon a computer program or instructions which, when executed, cause a computer to perform the method of the first aspect described above.

A fifth aspect of the application provides a computer program product comprising computer program code for causing a computer to carry out the method as in the first aspect described above when the computer program code is run on a computer.

Drawings

FIG. 1 is a schematic diagram of an application scenario of the present application;

FIG. 2 is a schematic flow chart of a fault analysis method according to the present application;

FIG. 3 is a schematic diagram of training a first event detection model in the present application;

FIG. 4 is a schematic diagram of training a second event detection model in accordance with the present application;

FIG. 5a is a schematic diagram of constructing the context semantic body of a fault event in the present application;

FIG. 5b is a schematic diagram of a context semantic body for constructing a fault event pair in the present application;

FIG. 6 is a schematic diagram of obtaining a relationship probability value according to the present application;

FIG. 7 is a schematic diagram of a computing device in accordance with the present application;

FIG. 8 is a schematic diagram of a computing device in accordance with the present application.

Detailed Description

Embodiments of the present application will now be described with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the present application. As one of ordinary skill in the art can know, with the development of technology and the appearance of new scenes, the technical scheme provided by the embodiment of the application is also applicable to similar technical problems.

The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

In order to facilitate an understanding of the present application, concepts related to the present application will be described below:

Log data: one or more log files automatically created and maintained by a server, containing a list of the activities it performs. Log data records not only state information during normal operation of the server, but also state changes and abnormal response information when the server fails.

Fault identification: a technique for learning and tracking the state of a machine during operation, determining whether the machine as a whole or in part is normal or abnormal, finding faults and their causes at an early stage, and forecasting the development trend of the faults.

Event relationship: event relationships refer to interdependencies and associations between events, and have a logical form of objectivity and regularity. The event relation detection takes an event as a main element, and the deep logic relation between the events is mined by analyzing the structural information and semantic features of the event text, so that the derivation and development of the event and the reasoning and prediction of target information are assisted.

Fault: a fault is a syntax error or a logic error in a computer program. A system fault means that the system, for some reason, causes a transaction to terminate abnormally during execution. Faults can be classified by the failed component into hardware faults and software faults. A hardware fault is a fault of the hardware system. A software fault is, for example, a program executing illegal instructions such as privileged instructions.

Root cause positioning: determining, from a complex network of fault event relationships, the root cause that made the fault occur and propagate.

Bi-LSTM: LSTM stands for Long Short-Term Memory and is a type of recurrent neural network (RNN). Owing to its design, LSTM is well suited to modeling sequential data such as text. Bi-LSTM is an abbreviation of Bi-directional Long Short-Term Memory and is formed by combining a forward LSTM with a backward LSTM. Both are often used to model context information in natural language processing tasks.

PC algorithm: the PC algorithm is a causal discovery algorithm, that is, an algorithm that discovers causal relationships from data by statistical methods. In a causal relationship, the occurrence of one event has a direct or indirect effect on another event.

The embodiment of the application provides a fault analysis method and related equipment, which are used for accurately positioning a fault root cause and determining a fault propagation link.

The embodiment of the application can be applied to an application scene shown in fig. 1, wherein the computing equipment acquires the real-time log data of the server, and the real-time log data of the server is generated in real time by the server. The computing equipment can analyze and process the real-time log data of the server, determine a fault propagation link and further accurately determine the root cause of the fault.

Wherein the computing device includes, but is not limited to, a server or a computer.

It should be noted that, in the above scenario, when the computing device is a server, the computing device that performs log analysis and the server that stores the log may be the same server.

Referring to fig. 2, a fault analysis method according to an embodiment of the application is described below.

201. M fault events are obtained from the log, wherein M is a positive integer greater than 1;

The computing device obtains server real-time log data, including, for example, BMC logs and OS logs. The computing device performs data preprocessing on the server real-time log data, including converting irregular log blocks into a structured data structure, setting the same log level for identical logs, filtering special characters, regularization, variable filtering and other operations, and finally extracts the log data into a log template. Log template words are the words contained in the log template, such as bug, unable, null, pointer and kernel. Each log template word is converted into a primary feature vector by 1-of-N encoding. The resulting primary feature vector is then converted into a word2vec embedded feature vector, which reduces the dimension of the vector.
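The preprocessing and encoding described above can be illustrated with a minimal sketch. It is not the implementation of the application: the regular expressions, the function names `to_template`, `one_hot`, `embed` and `event_matrix`, and the tiny hand-written embedding table (standing in for a trained word2vec model) are all assumptions made for illustration.

```python
import re

def to_template(line):
    """Reduce a raw log line to lowercase template words (assumed filtering rules)."""
    line = line.lower()
    line = re.sub(r"0x[0-9a-f]+|\d+", " ", line)  # variable filtering: hex values, numbers
    line = re.sub(r"[^a-z\s]", " ", line)         # special character filtering
    return line.split()

def one_hot(word, vocab):
    """1-of-N encoding: a single 1 at the word's index in the vocabulary."""
    return [1 if w == word else 0 for w in vocab]

def embed(primary, table):
    """Multiply the one-hot vector by the embedding table, which selects one row."""
    dim = len(table[0])
    return [sum(primary[i] * table[i][d] for i in range(len(primary)))
            for d in range(dim)]

def event_matrix(words, vocab, table):
    """Splice the embedded vectors of an event's words into a feature vector matrix."""
    return [embed(one_hot(w, vocab), table) for w in words]

words = to_template("BUG: unable to handle kernel NULL pointer at 0x00000010")
vocab = sorted(set(words))
# Hypothetical 2-dimensional embedding table; a real system would use trained word2vec vectors.
table = [[float(i), float(i) * 0.5] for i in range(len(vocab))]
matrix = event_matrix(words, vocab, table)
```

Here the one-hot vector has as many dimensions as the vocabulary, while each embedded vector has only two, which is the dimension reduction the paragraph refers to.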

After the operation is finished, the computing equipment identifies the fault event and the normal event in the log template, so that M fault events are obtained. In one possible implementation, the computing device may identify fault events and normal events in the server real-time log data using the first event detection model and the second event detection model.

The training process of the first event detection model and the second event detection model is described below:

The computing device obtains server history log data, which is log data generated by the server in the past, including, for example, BMC logs and OS logs. The computing device performs data preprocessing on the server history log data, including converting irregular log blocks into a structured data structure, setting the same log level for identical logs, filtering special characters, regularization, variable filtering and other operations, and finally extracts the log data into a log template. Log template words are the words contained in the log template, such as bug, unable, null, pointer and kernel. The computing device converts the log template words into primary feature vectors by 1-of-N encoding. The resulting primary feature vectors are then converted into word2vec embedded feature vectors, which reduces the dimension of the vectors and benefits subsequent model training.

A fault event is generally embodied in the log template as a sentence composed of log template words; for example, "bug unable to handle kernel null pointer" is a fault event. Similarly, a normal event is also generally embodied in the log template as a sentence composed of log template words; for example, "mtrr variable ranges enable ×base mask" is a normal event.

The fault events in the log template obtained by preprocessing the server history log data are used as training data and input into the LSTM model 1 for training, so as to obtain the first event detection model. Specifically, each time, the word2vec embedded feature vectors corresponding to the log template words of one fault event are spliced into a feature vector matrix, and the feature vector matrix is input into the LSTM model 1. The LSTM model 1 is trained on its input and its output and learns the multidimensional spatial probability distribution of the word2vec embedded feature vectors; the Encoded Representation of the middle layer represents the embedded separability features of the fault event. For example, referring to fig. 3, the LSTM model 1 includes an encoder, an Encoded Representation and a decoder. The x-axis is the direction of the word2vec embedded feature vector; the word2vec embedded feature vectors corresponding to the log template words of the fault event "bug unable to handle kernel null pointer" are spliced into a feature vector matrix along the y-axis, which is perpendicular to the x-axis, and the feature vector matrix is input into the LSTM model 1 to train the model. The above process is performed a plurality of times so that a plurality of fault events are input into the LSTM model 1. For each fault event input, the LSTM model 1 outputs a matrix that is a reconstruction of the input feature vector matrix and is ideally equal to it. When the error between the input and the output of the LSTM model 1 is smaller than a preset threshold, the first event detection model has been trained.

In addition, a plurality of fault events are input as training data into the LSTM model 1, and a plurality of outputs are obtained correspondingly. Calculating the error between each input and its output, for example the difference between the input and the output, yields a plurality of error values. The maximum error value a1 and the minimum error value b1 are determined from the plurality of error values by, for example, a box-plot method or a three-times root mean square error method, and membership formula 1 is obtained, as shown in the following formula (1):

p1 = 1 - (x1 - a1)/(b1 - a1)    (1)

In the above formula (1), x1 represents the error between the input and the output of the first event detection model for an event in the server real-time log data.
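As a worked illustration of the membership expression p1 = 1 - (x1 - a1)/(b1 - a1), the following sketch derives the bounds from a list of training errors and evaluates the formula. The error values are made up, and taking the plain maximum and minimum is a simplifying assumption; the application mentions a box-plot method or a three-times root mean square error method for choosing the bounds, which is not reproduced here.

```python
def membership(x, a, b):
    """Membership value p = 1 - (x - a) / (b - a)."""
    if a == b:
        raise ValueError("a and b must differ, otherwise the formula divides by zero")
    return 1.0 - (x - a) / (b - a)

# Hypothetical reconstruction errors of the training fault events.
training_errors = [0.25, 0.5, 0.75, 1.25]
a1 = max(training_errors)  # a1: maximum error, as stated in the text
b1 = min(training_errors)  # b1: minimum error, as stated in the text

# Membership of a real-time event whose reconstruction error is x1 = 0.75.
p1 = membership(0.75, a1, b1)
```

With a1 as the maximum and b1 as the minimum, as the text states, the expression is algebraically equal to (x1 - b1)/(a1 - b1), so it evaluates to 1 at the maximum error and to 0 at the minimum error.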

Similarly, the normal events in the log template obtained by preprocessing the server history log data are used as training data and input into the LSTM model 2 for training, so as to obtain the second event detection model. Specifically, each time, the word2vec embedded feature vectors corresponding to the log template words of one normal event are spliced into a feature vector matrix, and the feature vector matrix is input into the LSTM model 2. The LSTM model 2 is trained on its input and its output and learns the multidimensional spatial probability distribution of the word2vec embedded feature vectors; the Encoded Representation of the middle layer represents the embedded separability features of the normal event. For example, referring to fig. 4, the LSTM model 2 includes an encoder, an Encoded Representation and a decoder. The x-axis is the direction of the word2vec embedded feature vector; the word2vec embedded feature vectors corresponding to the log template words of the normal event "mtrr variable ranges enable x base mask" are spliced into a feature vector matrix along the y-axis, which is perpendicular to the x-axis, and the feature vector matrix is input into the LSTM model 2 to train the model. The above process is performed a plurality of times so that a plurality of normal events are input into the LSTM model 2. For each normal event input, the LSTM model 2 outputs a matrix that is a reconstruction of the input feature vector matrix and is ideally equal to it. When the error between the input and the output of the LSTM model 2 is smaller than the preset threshold, the second event detection model has been trained.
Optionally, the number of normal events used to train the second event detection model may be twice the number of fault events used to train the first event detection model, which avoids overfitting and yields a more accurate model.

In addition, a plurality of normal events are input as training data into the LSTM model 2, and a plurality of outputs are obtained correspondingly. Calculating the error between each input and its output, for example the difference between the input and the output, yields a plurality of error values. The maximum error value a2 and the minimum error value b2 are determined from the plurality of error values by, for example, a box-plot method or a three-times root mean square error method, and membership formula 2 is obtained, as shown in the following formula (2):

p2 = 1 - (x2 - a2)/(b2 - a2)    (2)

In the above formula (2), x2 represents the error between the input and the output of the second event detection model for an event in the server real-time log data.

After training to obtain the first event detection model and the second event detection model, the computing device uses the first event detection model and the second event detection model to identify a fault event and a normal event in the server real-time log data, which is described below:

The log template obtained by preprocessing the server real-time log data comprises a plurality of events. In a manner similar to that shown in fig. 3 and fig. 4, the computing device splices the word2vec embedded feature vectors corresponding to the log template words of each event into a feature vector matrix and inputs the feature vector matrix into the first event detection model and the second event detection model respectively. It should be understood that, after the feature vector matrix is input into the first event detection model and the second event detection model, x1 and x2 of formula (1) and formula (2) are obtained, and p1 and p2 are obtained by substituting x1 and x2 into formula (1) and formula (2) respectively. If p1 is greater than or equal to p2, the event is a fault event; otherwise, the event is a normal event. In this way, the computing device identifies the fault events and normal events among the plurality of events and thereby obtains the M fault events.
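The decision rule above (an event is a fault event when p1 is greater than or equal to p2) can be sketched as follows. The error values x1 and x2 would in practice come from the two trained detection models; here they are hypothetical numbers, as are the bounds and the names `select_fault_events`, `bounds1` and `bounds2`.

```python
def membership(x, a, b):
    """Membership value p = 1 - (x - a) / (b - a)."""
    return 1.0 - (x - a) / (b - a)

def select_fault_events(events, bounds1, bounds2):
    """Keep, in input order, the events whose first membership value is >= the second.

    events:  list of (name, x1, x2), where x1 and x2 are the event's errors under
             the first and second event detection models
    bounds1: (a1, b1) for the first model; bounds2: (a2, b2) for the second model
    """
    faults = []
    for name, x1, x2 in events:
        p1 = membership(x1, *bounds1)
        p2 = membership(x2, *bounds2)
        if p1 >= p2:
            faults.append(name)
    return faults

# Hypothetical bounds and per-event errors, listed in time order.
bounds1 = (1.0, 0.0)  # (a1, b1)
bounds2 = (0.5, 0.0)  # (a2, b2)
events = [("e1", 0.75, 0.25), ("e2", 0.25, 0.25), ("e3", 0.5, 0.25)]
fault_events = select_fault_events(events, bounds1, bounds2)
```

Event e3 has p1 equal to p2 and is therefore kept, which reflects the "greater than or equal to" condition in the text.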

202. Determining M-1 fault event pairs based on the timing of the M fault events; wherein each fault event pair consists of two fault events adjacent in time sequence;

After the computing device acquires the M fault events, it pairs every two fault events that are adjacent in time among the M fault events, thereby determining M-1 fault event pairs. For example, if fault event A, fault event B, fault event C, fault event D and fault event E are arranged in time order in the log template, then fault event A and fault event B form one fault event pair, fault event B and fault event C form one fault event pair, fault event C and fault event D form one fault event pair, and fault event D and fault event E form one fault event pair.
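The pairing step can be sketched in a few lines. The representation of a fault event as a `(timestamp, name)` tuple and the function name `adjacent_pairs` are assumptions for illustration.

```python
def adjacent_pairs(fault_events):
    """Sort fault events by timestamp and pair each event with its successor."""
    ordered = sorted(fault_events, key=lambda e: e[0])
    names = [name for _, name in ordered]
    return list(zip(names, names[1:]))

# Five fault events given out of order; timestamps are hypothetical.
events = [(3, "C"), (1, "A"), (5, "E"), (2, "B"), (4, "D")]
pairs = adjacent_pairs(events)
```

For M = 5 fault events this yields M-1 = 4 pairs: (A, B), (B, C), (C, D) and (D, E).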

203. Acquiring a context semantic body of each fault event pair; the context semantic body of each fault event pair comprises context semantic bodies of two fault events in the fault event pair, wherein the context semantic body of each fault event sequentially comprises a feature vector matrix corresponding to the N events of the preamble of the fault event, a feature vector matrix corresponding to the fault event and a feature vector matrix corresponding to the N events of the postamble of the fault event;

For each fault event in a fault event pair, the computing device acquires the context semantic body corresponding to the fault event, where the context semantic body of each fault event sequentially consists of the feature vector matrices corresponding to the N events preceding the fault event, the feature vector matrix corresponding to the fault event, and the feature vector matrices corresponding to the N events following the fault event. Specifically, referring to fig. 5a, the feature vector matrices corresponding to the N preceding events, the feature vector matrix corresponding to the fault event, and the feature vector matrices corresponding to the N following events are spliced in time order along the z-axis, which is perpendicular to the x-axis and the y-axis, to obtain the context semantic body of the fault event. It should be understood that each feature vector matrix is similar to that shown in fig. 3 and fig. 4 and is formed by splicing the word2vec embedded feature vectors corresponding to the log template words of the event, which is not illustrated again here. The computing device then splices the context semantic bodies of the two fault events in the fault event pair along the z-axis in time order, thereby obtaining the context semantic body of the fault event pair. In this manner, the computing device obtains the context semantic body of each fault event pair. Illustratively, as shown in fig. 5b, the context semantic body of the first fault event and the context semantic body of the second fault event in the fault event pair are spliced along the z-axis in time order to form the context semantic body of the fault event pair.
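The construction of a context semantic body can be sketched with nested lists, where the outermost list plays the role of the z-axis. The sketch assumes that all feature vector matrices have the same shape and that positions beyond the start or end of the log are filled with zero matrices; the application does not state how such boundaries are handled, so the zero padding and the names `context_body` and `pair_body` are assumptions.

```python
def zero_matrix(rows, cols):
    """Placeholder matrix for a missing neighbour (assumed boundary handling)."""
    return [[0.0] * cols for _ in range(rows)]

def context_body(matrices, idx, n):
    """Stack the matrices of the N preceding events, the fault event, and the N following events."""
    rows, cols = len(matrices[idx]), len(matrices[idx][0])
    body = []
    for j in range(idx - n, idx + n + 1):
        if 0 <= j < len(matrices):
            body.append(matrices[j])
        else:
            body.append(zero_matrix(rows, cols))
    return body

def pair_body(matrices, idx_first, idx_second, n):
    """Concatenate the two context bodies along the z-axis in time order."""
    return context_body(matrices, idx_first, n) + context_body(matrices, idx_second, n)

# Five events in time order, each with a hypothetical 1x2 feature vector matrix.
matrices = [[[float(k), float(k)]] for k in range(5)]
# Suppose the events at index 0 and index 3 form a fault event pair, with N = 1.
body = pair_body(matrices, 0, 3, 1)
```

Each fault event contributes 2N+1 matrices, so the context semantic body of the pair holds 2(2N+1) matrices, six in this example.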

204. Respectively inputting the context semantic bodies of M-1 fault event pairs into a classification model to determine M-1 relation probability values;

Referring to fig. 6, the computing device inputs the context semantic bodies of the M-1 fault event pairs obtained in step 203 into a Bi-LSTM-softmax binary classification model, which outputs a relationship probability value for the context semantic body of each fault event pair; the relationship probability value indicates the relevance between the two fault events in the fault event pair.
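The softmax stage of such a binary classifier can be illustrated in isolation. The Bi-LSTM encoder is not implemented here; the two input scores are hypothetical stand-ins for its output, and the ordering of the classes (index 0 for "unrelated", index 1 for "related") is an assumption.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relationship_probability(logits):
    """Return the probability assigned to the assumed 'related' class (index 1)."""
    return softmax(logits)[1]

# Hypothetical scores for one fault event pair: [unrelated, related].
v = relationship_probability([0.0, 2.0])
```

Equal scores give a relationship probability of exactly 0.5, and a higher score for the "related" class pushes the value toward 1.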

205. The fault propagation link is determined from the M-1 relationship probability values.

In the above manner, the computing device obtains the relationship probability value corresponding to each fault event pair. Let the relationship probability value of fault event A and fault event B be V1, that of fault event B and fault event C be V2, that of fault event C and fault event D be V3, and that of fault event D and fault event E be V4. On this basis, the relationship probability value V5 of fault event A and fault event C can be calculated from V1 and V2 by the PC algorithm, the relationship probability value V6 of fault event A and fault event D can be calculated from V1, V2 and V3 by the PC algorithm, and the relationship probability value V7 of fault event A and fault event E can be calculated from V1, V2, V3 and V4 by the PC algorithm. If V1 is greater than a preset threshold, fault event A is the root cause of fault event B, so the fault propagation link includes fault event A and fault event B. If V5 is also greater than the preset threshold, fault event A is also the root cause of fault event C, so the fault propagation link includes fault event A, fault event B and fault event C. If V6 is also greater than the preset threshold, fault event A is also the root cause of fault event D, so the fault propagation link includes fault event A, fault event B, fault event C and fault event D. If V7 is also greater than the preset threshold, fault event A is also the root cause of fault event E; fault event A is then the root cause of fault event B, fault event C, fault event D and fault event E, and the fault propagation link includes fault event A, fault event B, fault event C, fault event D and fault event E.
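The thresholding logic above can be sketched as follows. The application derives V5, V6 and V7 with the PC algorithm but does not spell out the computation, so this sketch substitutes a simple stand-in: the product of the adjacent relationship probability values along the chain. That product rule, the function name `propagation_link` and the numeric values are assumptions and do not implement the PC algorithm.

```python
def propagation_link(events, adjacent_probs, threshold):
    """Extend the link from the first event while the chained value exceeds the threshold.

    events:         fault events in time order, e.g. ["A", "B", "C", "D", "E"]
    adjacent_probs: [V1, V2, ...] for the adjacent fault event pairs
    """
    link = [events[0]]
    score = 1.0
    for event, v in zip(events[1:], adjacent_probs):
        score *= v  # stand-in for the PC-algorithm-derived value (assumption)
        if score <= threshold:
            break
        link.append(event)
    return link

# Hypothetical relationship probability values V1..V4 and threshold.
link = propagation_link(["A", "B", "C", "D", "E"], [0.9, 0.9, 0.5, 0.9], 0.5)
```

With these numbers the chained values for B and C stay above 0.5 and the value for D falls below it, so the link stops at A, B, C and fault event A is taken as the root cause.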

In the application, the sample size of fault events is much smaller than that of normal events, which could leave a single event detection model insufficiently accurate. To avoid this, two event detection models are trained on fault events and normal events respectively, and the fault events in the log template are determined based on the two event detection models, which improves identification accuracy. In addition, the context semantic body of a fault event pair better reflects the temporal and spatial relationship between the fault events and discriminates relationships more clearly, so the relationship probability values acquired from the context semantic bodies are more accurate. The fault root cause can therefore be located accurately and the fault propagation link determined.

The method of the present application is described above and the apparatus of the present application is described below:

referring to fig. 7, a computing device 700 in the present application includes an acquisition unit 701 and a determination unit 702. The computing device 700 is used to perform the operations performed by the computing device in the embodiment shown in fig. 2 described above.

The acquisition unit 701 is configured to acquire M fault events from the log, where M is a positive integer greater than 1;

the determination unit 702 is configured to determine M-1 fault event pairs based on the timing of the M fault events, wherein each fault event pair consists of two fault events adjacent in time sequence;

the acquisition unit 701 is further configured to acquire a context semantic body of each fault event pair; the context semantic body of each fault event pair comprises the context semantic bodies of the two fault events in the fault event pair, wherein the context semantic body of each fault event sequentially comprises a feature vector matrix corresponding to the N events before the fault event, a feature vector matrix corresponding to the fault event, and a feature vector matrix corresponding to the N events after the fault event;

the determination unit 702 is further configured to input the context semantic bodies of the M-1 fault event pairs into the classification model respectively, to determine M-1 relationship probability values;

wherein each relationship probability value indicates the relevance between the two fault events in the corresponding fault event pair;

the determination unit 702 is further configured to determine a fault propagation link based on the M-1 relationship probability values.
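The pairing and context steps performed by these units can be illustrated with a short sketch. It assumes that each event is already represented by a feature vector matrix (a list of rows), that fault events are identified by their position in the event sequence, that "sequentially comprises" means row-wise concatenation, and that the window is truncated at the start and end of the log; the function names are hypothetical.

```python
def adjacent_pairs(fault_indices):
    """Return the M-1 pairs of fault events that are adjacent in time order."""
    ordered = sorted(fault_indices)
    return list(zip(ordered, ordered[1:]))


def context_body(event_matrices, index, n):
    """Concatenate the matrices of the N events before, the event, and the N after."""
    start = max(0, index - n)  # assumption: truncate at the log boundaries
    window = event_matrices[start:index + n + 1]
    return [row for matrix in window for row in matrix]


def pair_context_body(event_matrices, pair, n):
    """Context semantic body of a pair: the bodies of both fault events, in order."""
    first, second = pair
    return context_body(event_matrices, first, n) + context_body(event_matrices, second, n)


# Six events, each with a toy one-row feature matrix; events 1 and 4 are faults.
event_matrices = [[[float(i), float(i)]] for i in range(6)]
pairs = adjacent_pairs([4, 1])
print(pairs)
print(pair_context_body(event_matrices, pairs[0], n=1))
```

Each pair's concatenated body would then be the input passed to the classification model.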

In one possible implementation of the present application,

the acquisition unit 701 is specifically configured to acquire M fault events from the log based on a first event detection model and a second event detection model; wherein the first event detection model is a detection model obtained by training on fault events, and the second event detection model is a detection model obtained by training on normal events.

In one possible implementation of the present application,

the acquisition unit 701 is specifically configured to acquire a plurality of events from the log;

the acquisition unit 701 is specifically configured to determine a feature vector matrix corresponding to each event;

the acquisition unit 701 is specifically configured to input the feature vector matrix of each event into the first event detection model, to obtain a first output feature vector matrix corresponding to the feature vector matrix of each event;

the acquisition unit 701 is specifically configured to input the feature vector matrix of each event into the second event detection model, to obtain a second output feature vector matrix corresponding to the feature vector matrix of each event;

the acquisition unit 701 is specifically configured to determine a first error value based on the feature vector matrix of each event and the corresponding first output feature vector matrix;

the acquisition unit 701 is specifically configured to determine a second error value based on the feature vector matrix of each event and the corresponding second output feature vector matrix;

the acquisition unit 701 is specifically configured to determine whether each event is a fault event based on the first error value and the second error value.
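The two error values can be sketched as below. The text does not state which error measure is used, so the mean squared difference here is an assumption, and the two "models" are stand-in callables rather than trained LSTM models; all names are hypothetical.

```python
def reconstruction_error(matrix, output):
    """Mean squared difference between a feature matrix and a model's output matrix."""
    total, count = 0.0, 0
    for row_in, row_out in zip(matrix, output):
        for x, y in zip(row_in, row_out):
            total += (x - y) ** 2
            count += 1
    return total / count


def event_errors(matrix, fault_model, normal_model):
    """Return (first error value, second error value) for one event."""
    first = reconstruction_error(matrix, fault_model(matrix))
    second = reconstruction_error(matrix, normal_model(matrix))
    return first, second


# Stand-in models: one reproduces its input, the other outputs zeros.
def echo_model(matrix):
    return [list(row) for row in matrix]


def zero_model(matrix):
    return [[0.0 for _ in row] for row in matrix]


matrix = [[1.0, 2.0], [3.0, 4.0]]
print(event_errors(matrix, echo_model, zero_model))
```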

In one possible implementation of the present application,

the determination unit 702 is specifically configured to calculate a first membership value based on the first error value;

the determination unit 702 is specifically configured to calculate a second membership value based on the second error value;

the calculation formula of the first membership value is: P1 = 1 - (x1 - a1)/(b1 - a1);

wherein x1 is the first error value of each event, a1 is the maximum value of a plurality of third errors, b1 is the minimum value of the plurality of third errors, the third errors are determined based on a first feature vector matrix and a third output feature vector matrix, the first feature vector matrix is a feature vector matrix corresponding to a fault event used for training the first event detection model, the third output feature vector matrix is the output obtained by inputting the first feature vector matrix into a first LSTM model, and the first event detection model is obtained by training the first LSTM model;

the calculation formula of the second membership value is: P2 = 1 - (x2 - a2)/(b2 - a2);

wherein x2 is the second error value of each event, a2 is the maximum value of a plurality of fourth errors, b2 is the minimum value of the plurality of fourth errors, the fourth errors are determined based on a second feature vector matrix and a fourth output feature vector matrix, the second feature vector matrix is a feature vector matrix corresponding to a normal event used for training the second event detection model, the fourth output feature vector matrix is the output obtained by inputting the second feature vector matrix into a second LSTM model, and the second event detection model is obtained by training the second LSTM model;

the determination unit 702 is specifically configured to determine whether each event is a fault event based on the first membership value and the second membership value.

In one possible implementation of the present application,

the determination unit 702 is specifically configured to determine that the event is a fault event if the first membership value is greater than or equal to the second membership value.
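The membership formulas and the comparison rule can be written out as a small sketch. The bounds a and b are passed in exactly as the text defines them, taken from the training errors of the corresponding model; the function names and all numeric values are hypothetical.

```python
def membership(x, a, b):
    """Membership value P = 1 - (x - a) / (b - a), following the formula in the text."""
    if a == b:
        raise ValueError("a and b must differ")
    return 1 - (x - a) / (b - a)


def is_fault_event(first_error, second_error, first_bounds, second_bounds):
    """Apply the rule: fault event if the first membership value >= the second."""
    p1 = membership(first_error, *first_bounds)
    p2 = membership(second_error, *second_bounds)
    return p1 >= p2


# Hypothetical bounds (a, b) for each model and hypothetical error values.
first_bounds = (1.0, 0.0)   # (a1, b1)
second_bounds = (1.0, 0.0)  # (a2, b2)
print(membership(0.25, *first_bounds))
print(is_fault_event(0.75, 0.25, first_bounds, second_bounds))
```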

In one possible implementation of the present application,

the determination unit 702 is specifically configured to preprocess the log to obtain a log template of the log, where the log template includes one or more events, and each event includes a plurality of log template words;

the determination unit 702 is specifically configured to convert each log template word into a primary feature vector;

the determination unit 702 is specifically configured to convert the primary feature vector into a word2vec embedded feature vector;

the determination unit 702 is specifically configured to splice the word2vec embedded feature vectors corresponding to the plurality of log template words, to obtain the feature vector matrix corresponding to each event.
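The conversion from log template words to a feature vector matrix can be illustrated as follows. This sketch assumes the primary feature vector is a one-hot vector over a vocabulary, and it uses a small hand-written embedding table in place of a trained word2vec model; the vocabulary, table values and function names are all hypothetical.

```python
def one_hot(word, vocabulary):
    """Primary feature vector: assumed here to be one-hot over the vocabulary."""
    return [1.0 if word == entry else 0.0 for entry in vocabulary]


def embed(primary, embedding_table):
    """Project a one-hot vector through an embedding table (one row per word)."""
    dim = len(embedding_table[0])
    return [sum(p * row[d] for p, row in zip(primary, embedding_table)) for d in range(dim)]


def event_matrix(words, vocabulary, embedding_table):
    """Splice the embedded vectors of an event's template words into one matrix."""
    return [embed(one_hot(word, vocabulary), embedding_table) for word in words]


vocabulary = ["fan", "speed", "low"]
# Stand-in for trained word2vec weights: one 2-dimensional row per word.
embedding_table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
print(event_matrix(["fan", "low"], vocabulary, embedding_table))
```

Each row of the resulting matrix corresponds to one log template word of the event, in word order.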

Fig. 8 is a schematic diagram of a computing device 800 provided by the present application. The computing device includes a processor 801 and memory 802, with one or more program instructions or data stored in the memory 802.

In one possible implementation, the processor 801 may be a central processing unit (CPU), a graphics processing unit (GPU), a field programmable gate array (FPGA), a complex programmable logic device (CPLD), or an application specific integrated circuit (ASIC).

The memory 802 may be a volatile memory or a nonvolatile memory, and the memory 802 is electrically connected to the processor 801. The processor 801 may read program instructions in the memory 802 to cause the computing device 800 to perform the fault analysis method described in the above embodiments.

The computing device 800 may also include hardware such as a power supply 803 and an interface 804, and may also include software such as an operating system.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.

In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the units is merely a logical function division, and other divisions are possible in an actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices, or units, and may be in electrical, mechanical, or other form.

The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method described in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

Claims

1. A method of fault analysis, comprising:

obtaining M fault events from the log; M is a positive integer greater than 1;

determining M-1 fault event pairs based on the timing of the M fault events; wherein each fault event pair consists of two fault events adjacent in time sequence;

acquiring a context semantic body of each fault event pair; the context semantic body of each fault event pair comprises context semantic bodies of two fault events in the fault event pair, wherein the context semantic body of each fault event sequentially comprises a feature vector matrix corresponding to N events in front of the fault event, a feature vector matrix corresponding to the fault event and a feature vector matrix corresponding to N events in back of the fault event; wherein N is a positive integer greater than or equal to 1;

respectively inputting the context semantic bodies of the M-1 fault event pairs into a classification model to determine M-1 relation probability values;

and determining a fault propagation link based on the M-1 relationship probability values.

2. The method of claim 1, wherein the obtaining M fault events from the log comprises:

acquiring M fault events from the log based on a first event detection model and a second event detection model;

wherein the first event detection model is a detection model obtained by training on fault events; the second event detection model is a detection model obtained by training on normal events.

3. The method of claim 2, wherein obtaining M fault events from the log based on the first event detection model and the second event detection model comprises:

acquiring a plurality of events from the log;

determining a feature vector matrix corresponding to each event;

inputting the feature vector matrix of each event into the first event detection model to obtain a first output feature vector matrix corresponding to the feature vector matrix of each event;

inputting the feature vector matrix of each event into the second event detection model to obtain a second output feature vector matrix corresponding to the feature vector matrix of each event;

determining a first error value based on the feature vector matrix of each event and the corresponding first output feature vector matrix;

determining a second error value based on the feature vector matrix of each event and the corresponding second output feature vector matrix;

and determining whether each event is a fault event based on the first error value and the second error value.

4. The method of claim 3, wherein the determining whether each event is a fault event based on the first error value and the second error value comprises:

calculating a first membership value based on the first error value;

calculating a second membership value based on the second error value;

wherein the calculation formula of the first membership value is: P1 = 1 - (x1 - a1)/(b1 - a1);

wherein x1 is the first error value of each event, a1 is the maximum value of a plurality of third errors, and b1 is the minimum value of the plurality of third errors; the third errors are determined based on a first feature vector matrix and a third output feature vector matrix; the first feature vector matrix is a feature vector matrix corresponding to a fault event used for training the first event detection model, and the third output feature vector matrix is an output matrix obtained by inputting the first feature vector matrix into a first training model; wherein the first training model is used for obtaining the first event detection model;

wherein the calculation formula of the second membership value is: P2 = 1 - (x2 - a2)/(b2 - a2);

wherein x2 is the second error value of each event, a2 is the maximum value of a plurality of fourth errors, and b2 is the minimum value of the plurality of fourth errors; the fourth errors are determined based on a second feature vector matrix and a fourth output feature vector matrix; the second feature vector matrix is a feature vector matrix corresponding to a normal event used for training the second event detection model, and the fourth output feature vector matrix is an output matrix obtained by inputting the second feature vector matrix into a second training model; wherein the second training model is used for obtaining the second event detection model;

and determining whether each event is a fault event based on the first membership value and the second membership value.

5. The method of claim 4, wherein the determining whether an event is a fault event based on the first membership value and the second membership value comprises:

and under the condition that the first membership value is greater than or equal to the second membership value, determining that the event is a fault event.

6. The method according to any one of claims 2-5, wherein:

the number of fault events used to train the first event detection model is twice the number of normal events used to train the second event detection model.

7. The method according to any one of claims 3-6, wherein determining the feature vector matrix for each event comprises:

preprocessing the log to obtain a log template of the log, wherein the log template comprises one or more events, and each event comprises a plurality of log template words;

converting each log template word into a primary feature vector;

converting the primary feature vector into a word2vec embedded feature vector;

and splicing the word2vec embedded feature vectors corresponding to the plurality of log template words to obtain the feature vector matrix corresponding to each event.

8. The method of any of claims 2-7, wherein the first event detection model and the second event detection model are LSTM models.

9. The method of any one of claims 1-8, wherein the classification model is a BiLSTM-softmax classification model.

10. A computing device comprising a processor and a memory; the processor and the memory are coupled; the memory is used for storing computer instructions; the processor is configured to execute the computer instructions to cause the computing device to perform the method of any one of claims 1 to 9.