CN113342597B - System fault prediction method based on Gaussian mixture hidden Markov model - Google Patents
- Publication number
- CN113342597B CN113342597B CN202110597641.XA CN202110597641A CN113342597B CN 113342597 B CN113342597 B CN 113342597B CN 202110597641 A CN202110597641 A CN 202110597641A CN 113342597 B CN113342597 B CN 113342597B
- Authority
- CN
- China
- Prior art keywords
- data set
- fault
- type
- gaussian mixture
- log file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/302—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3065—Monitoring arrangements determined by the means or processing involved in reporting the monitored data
- G06F11/3072—Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Quality & Reliability (AREA)
- Mathematical Optimization (AREA)
- Databases & Information Systems (AREA)
- Pure & Applied Mathematics (AREA)
- Computational Mathematics (AREA)
- Computing Systems (AREA)
- Mathematical Analysis (AREA)
- Evolutionary Biology (AREA)
- Operations Research (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Computational Biology (AREA)
- Algebra (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Complex Calculations (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a system fault prediction method based on a Gaussian mixture hidden Markov model, which comprises the following steps: preprocessing and labeling an original log file; extracting log file features and constructing feature vectors; constructing a data set for each fault to be predicted using a sliding window; training one Gaussian mixture hidden Markov fault prediction model for each fault to be predicted; and using the trained models to predict, from the real-time log, whether a fault will occur and, if so, its type. The technical scheme of the invention resolves the interleaving and redundancy of the original log files, so that the extracted features are fewer and more discriminative; modeling the system state and the logs preceding a system failure with a Gaussian mixture hidden Markov model enables fast and accurate prediction of system faults and improves system availability.
Description
Technical Field
The invention belongs to the field of intelligent operation and maintenance, and particularly relates to a system fault prediction method based on a Gaussian mixture hidden Markov model, aimed at the problem of system fault prediction.
Background
The complexity of software systems has increased over the past decade as demand has grown. Software complexity, the limits of human reasoning, and other resource constraints make it very difficult to develop fault-free software, so highly complex software systems need their reliability guaranteed. Software fault prediction uses basic prediction indicators and historical failure data to predict the future fault tendency of the software, so that potential faults can be eliminated based on the predicted results. Preventing faults in a software system before they occur helps improve its availability and efficiency. However, when predicting system faults from logs, which are semi-structured text data, there remain two significant areas for improvement:
the effect and the efficiency of the fault prediction have further improved space
Fault prediction algorithms based on traditional machine learning models such as support vector machines and clustering achieve precision and recall of only about 80%, which can be further improved. Although deep learning based fault prediction algorithms such as CNN and LSTM reach about 90% accuracy, their model training and prediction times are markedly higher than those of traditional machine learning models, so prediction efficiency can also be further improved.
There is a need for more efficient data preprocessing methods
Analysis shows that log sequences have three characteristics:
Long-term ordering: during state transitions the system generates sequential logs, in time order, from a long-running series of actions, so the order of the log sequence must not be destroyed when analyzing and mining frequent log sequences.
Short-term interleaving: because system clusters are large, multiple different tasks may execute on the same node or on different nodes, each producing its own logs as it runs. When these logs are arranged in time order into a single log sequence, the logs of other tasks can be inserted into the log sequence of a given task, breaking the natural order of that task's logs.
Short-term redundancy: a component of the system that is heavily accessed over a short period (especially when a failure occurs) produces a large number of logs of the same type. For example, when a connection request fails, the system immediately issues the request again until the connection succeeds or some condition is reached. In log-based fault prediction, such redundant logs not only increase computation cost but can also drown out other important logs, hindering the analysis of frequent log sequences. However, the appearance of a large volume of one log type in a short time may itself be a signature of a particular fault, so a certain proportion of the redundant logs must be retained.
Because part of the redundant logs are retained, a given time period contains many logs but few log types. Traditional log preprocessing treats each log as an independent sample and extracts a feature vector from it. On the one hand this yields far too many samples to analyze; on the other hand it leaves little useful information within each time period. A better log preprocessing method is therefore needed so that the resulting data set is more representative.
Disclosure of Invention
Aiming at the above background and problems, the invention provides a system fault prediction method based on a Gaussian mixture hidden Markov model: from the historical system logs, one GMM-HMM model is constructed for each fault type to be predicted; at prediction time the real-time log sequence of the system is fed to each GMM-HMM model, the probability of the log sequence under each model is computed, and from these probabilities it is judged whether a fault will occur and, if so, which fault type.
The technical scheme of the invention is a system fault prediction method based on a Gaussian mixture hidden Markov model, comprising the following specific steps:
Step 1: preprocess the original log file data set to obtain a preprocessed log file data set; extract several keywords from each preprocessed log file by a keyword extraction method and build a word frequency matrix from them; cluster the preprocessed log files on the word frequency matrix using agglomerative hierarchical clustering; and manually label the type of each cluster of preprocessed log files.
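The step 1 pipeline (parameter cleaning, word-frequency vectors, agglomerative hierarchical clustering) can be sketched as follows. The tokenization rules, cosine distance, and merge threshold are illustrative assumptions, not the patent's exact choices; the patent only specifies that a word frequency matrix is clustered agglomeratively.

```python
# Minimal sketch of step 1: clean parameters out of each log line, build
# word-frequency vectors, and merge clusters bottom-up (single linkage)
# until the closest pair is farther apart than a threshold.
import math
import re
from collections import Counter

def preprocess(line):
    """Clean meaningless parameters: numbers and IPs become a wildcard token."""
    line = re.sub(r"\b\d+(\.\d+)*\b", "<*>", line)
    return re.sub(r"[^\w<>* ]", " ", line).lower()

def word_freq_vector(line):
    return Counter(preprocess(line).split())

def cosine_dist(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def agglomerative_cluster(logs, threshold=0.5):
    """Single-linkage agglomerative clustering over word-frequency vectors."""
    vecs = [word_freq_vector(l) for l in logs]
    clusters = [[i] for i in range(len(logs))]
    while len(clusters) > 1:
        # find the closest pair of clusters (single linkage)
        best = min(
            ((ci, cj) for ci in range(len(clusters))
                      for cj in range(ci + 1, len(clusters))),
            key=lambda p: min(cosine_dist(vecs[i], vecs[j])
                              for i in clusters[p[0]] for j in clusters[p[1]]),
        )
        d = min(cosine_dist(vecs[i], vecs[j])
                for i in clusters[best[0]] for j in clusters[best[1]])
        if d > threshold:
            break
        clusters[best[0]].extend(clusters.pop(best[1]))
    return clusters

logs = [
    "Connection from 10.0.0.1 failed",
    "Connection from 10.0.0.2 failed",
    "Disk /dev/sda1 is full",
]
print(agglomerative_cluster(logs))   # the two connection logs share a cluster
```

Each resulting cluster would then receive a manually assigned type label, as the step describes.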
Step 2: extract a feature for each clustered log file type, construct the feature vector of each type, and arrange the feature vectors of all the types in the order of the original log files to obtain a feature vector data set.
Step 3: locate all positions on the feature vector data set at which the specified fault occurs, and determine the start and stop positions of a sliding window on the data set. Starting from the start position, intercept the feature vector sequence inside the window and put it into the specified fault's data set; then move the window backwards by one sliding step and intercept again, repeating until the window reaches or passes the stop position. The collected sequences constitute the specified fault's data set.
Step 4: set the hyper-parameters of the Gaussian mixture hidden Markov model of each specified fault to be predicted; take the data set of each specified fault as the input of the model's training algorithm (in standard GMM-HMM practice, the Baum-Welch expectation-maximization algorithm); and optimize the parameters to be estimated of each fault's model to obtain the optimized parameters, thereby constructing the optimized Gaussian mixture hidden Markov prediction model of each specified fault.
Step 5: intercept a segment of the real-time log sequence with the sliding window of step 3 as the log sequence to be predicted; convert it into a sequence of clustered log file types by the method of step 1, convert that type sequence into a feature vector sequence by the method of step 2, and predict on the feature vector sequence with the optimized Gaussian mixture hidden Markov model of each specified fault to obtain the prediction result.
Preferably, the preprocessing in step 1 is: cleaning meaningless parameters from the original log file data set and filtering redundant logs to obtain the preprocessed log file data set

l_{j,i}, i ∈ [1, N_j], j ∈ [1, K]

where l_{j,i} denotes the i-th log file in the preprocessed log file data set at the j-th acquisition time, N_j denotes the total number of log files in the preprocessed data set at the j-th acquisition time, and K denotes the number of acquisition times; the corresponding type sequence is

e_{j,i}, i ∈ [1, N_j], j ∈ [1, K]

where e_{j,i} denotes the type of the i-th log file in the preprocessed log file data set at the j-th acquisition time.
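The redundancy filtering named in the preprocessing step can be sketched as follows: runs of consecutive logs of the same type are collapsed, but a fixed proportion is kept so that "a burst of one log type" remains visible as a possible fault signature, as the background section argues. The keep ratio is an assumed parameter, not a value from the patent.

```python
# Sketch of the step 1 redundancy filter: from each run of identical log
# types, keep ceil(keep_ratio * run_length) entries (at least one).
import math

def filter_redundant(log_types, keep_ratio=0.2, min_keep=1):
    filtered, i = [], 0
    while i < len(log_types):
        j = i
        while j < len(log_types) and log_types[j] == log_types[i]:
            j += 1                       # end of the current run
        run = j - i
        keep = max(min_keep, math.ceil(keep_ratio * run))
        filtered.extend(log_types[i:i + keep])
        i = j
    return filtered

print(filter_redundant(["A"] * 10 + ["B"] + ["A"] * 3))
# a 10-long burst of A keeps 2 entries; the 3-long run keeps 1
```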
Preferably, in step 2 the feature of each clustered log file type is computed, in TF-IDF fashion, as

v_j^m = tf_m^j · log(K / F_m)

where tf_m^j denotes the frequency of log type m within type_j (the number of logs of type m at the j-th acquisition time divided by N_j), and F_m denotes, over {type_1, type_2, ..., type_K}, the number of acquisition times i ∈ [1, K] for which m ∈ type_i holds; N_j denotes the total number of log files in the preprocessed data set at the j-th acquisition time; m ∈ [1, M], j ∈ [1, K], where M denotes the total number of clustered log file types from step 1 and K denotes the number of acquisition times. The component v_j^m is the m-th component of the feature vector of type_j, and the feature vector data set is

V = {v_1, v_2, ..., v_K}

where v_j denotes the feature vector extracted from type_j, j ∈ [1, K].
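The feature computation above can be sketched as follows. The exact weighting formula was an image in the source; this sketch assumes the TF-IDF combination suggested by the term-frequency and document-frequency definitions.

```python
# Sketch of step 2: one feature vector per acquisition time, with
# v_j[m] = tf(m, j) * log(K / F_m), where F_m counts the acquisition
# times whose type set contains m.
import math
from collections import Counter

def feature_vectors(type_sequences, num_types):
    """type_sequences: K lists of log-type ids, one list per acquisition time."""
    K = len(type_sequences)
    df = Counter(m for seq in type_sequences for m in set(seq))  # F_m
    V = []
    for seq in type_sequences:
        counts = Counter(seq)
        N_j = len(seq)
        V.append([
            (counts[m] / N_j) * math.log(K / df[m]) if df[m] else 0.0
            for m in range(num_types)
        ])
    return V

V = feature_vectors([[0, 0, 1], [1, 1, 2], [2, 2, 2]], num_types=3)
```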
Preferably, locating all occurrence positions of the specified fault on the feature vector data set in step 3 is done as follows:

search the original log file for the keywords of the specified fault f, locate each position d at which fault f occurs, and record its index, with f ∈ [1, F] and d ∈ [1, D_f], where F denotes the number of specified fault types to be predicted and D_f denotes the number of positions at which the specified fault type f occurs;

through the recorded indexes, locate the index j_f of the acquisition time of the specified fault f in the preprocessed log file data set, j_f ∈ [1, D_f].

The r-th feature vector sequence intercepted by the sliding window before the d-th located position of the fault on the feature vector data set is

w_{d,r} = (v_{r,1}, v_{r,2}, ..., v_{r,L_w})

where v_{r,z} denotes the z-th feature vector in that sequence, z ∈ [1, L_w], r ∈ [1, S_d]; S_d denotes the number of interceptions made before the d-th position on the feature vector data set, and L_w denotes the number of feature vectors contained in the sliding window. Each sliding step covers the feature vectors (v_1, ..., v_{L_s}), where L_s denotes the number of feature vectors contained in one sliding step. Finally,

W_d^f = {w_{d,1}, w_{d,2}, ..., w_{d,S_d}}

denotes all feature vector sequence segments intercepted before the d-th located position of the specified fault f, with w_{d,r} the r-th intercepted segment, f ∈ [1, F], d ∈ [1, D_f], r ∈ [1, S_d].
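The sliding-window extraction of step 3 can be sketched as follows. The window length, step, and look-back distance are illustrative assumptions; the patent leaves L_w and L_s as hyper-parameters.

```python
# Sketch of step 3: before each located fault position d, slide a window of
# L_w feature vectors forward in steps of L_s and collect every segment that
# ends before d into the fault's data set.
def build_fault_dataset(features, fault_positions, L_w=3, L_s=1, lookback=6):
    """features: feature vectors in log order; fault_positions: fault indices."""
    dataset = []
    for d in fault_positions:
        start = max(0, d - lookback)     # window start position
        stop = d - L_w                   # last admissible window start
        r = start
        while r <= stop:
            dataset.append(features[r:r + L_w])
            r += L_s
    return dataset

feats = list(range(10))                  # stand-in 1-D "feature vectors"
segments = build_fault_dataset(feats, fault_positions=[8])
```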
Preferably, in step 4 the specified fault data sets are defined as:

Data = {Data_1, Data_2, ..., Data_F}

where Data_f denotes the data set of the specified fault f, f ∈ [1, F], and F denotes the number of specified fault types to be predicted.

For each fault f, the number of hidden states of the hidden Markov model is Q_f and the number of Gaussian component models is G_f. The Gaussian component weights are c_{p,g}^f, where c_{p,g}^f denotes the weight of Gaussian component g in the Gaussian mixture corresponding to hidden state p of the specified fault f; the component mean vectors are μ_{p,g}^f and the component covariance matrices are Σ_{p,g}^f, defined analogously. The state transition probability matrix is A^f = [a_{p,q}^f], where a_{p,q}^f denotes the probability that hidden state p of the specified fault f transitions to hidden state q. The initial state probability vector is π^f = (π_1^f, ..., π_{Q_f}^f), where π_p^f denotes the probability that hidden state p occurs at the initial moment of the specified fault f.

Here g denotes a Gaussian component model, g ∈ [1, G_f], where G_f denotes the number of Gaussian components per hidden state of the specified fault f; f denotes the specified fault type, f ∈ [1, F]; p denotes the hidden state at the current time, p ∈ [1, Q_f]; and q denotes the hidden state at the next time, q ∈ [1, Q_f], where Q_f denotes the number of hidden states of the specified fault type f.
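The parameter set λ_f = (π^f, A^f, c^f, μ^f, Σ^f) of one fault's model can be laid out as below, using the embodiment's values Q_f = 3 hidden states, G_f = 6 Gaussian components, and M = 80 feature dimensions. The random initialization scheme is an assumption; training (step 4) would refine these values.

```python
# Sketch of the step 4 parameter layout for one fault's GMM-HMM.
import numpy as np

def init_gmm_hmm(Q=3, G=6, dim=80, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.random((Q, Q)); A /= A.sum(axis=1, keepdims=True)   # a^f_{p,q}
    pi = rng.random(Q); pi /= pi.sum()                          # pi^f_p
    c = rng.random((Q, G)); c /= c.sum(axis=1, keepdims=True)   # c^f_{p,g}
    mu = rng.standard_normal((Q, G, dim))                       # mu^f_{p,g}
    sigma = np.stack([[np.eye(dim)] * G] * Q)                   # Sigma^f_{p,g}
    return A, pi, c, mu, sigma

A, pi, c, mu, sigma = init_gmm_hmm()
```

Each row of A, each row of c, and the vector pi are stochastic (sum to one), matching the transition, mixture-weight, and initial-state constraints above.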
Preferably, in step 5 the feature vector sequence is predicted with the optimized Gaussian mixture hidden Markov model of each specified fault from step 4 as follows: take the feature vector sequence as the input of the backward algorithm of each Gaussian mixture hidden Markov model, obtaining the probability of the feature vector sequence under each model:

PR = {PR_1, PR_2, ..., PR_F}

where PR_f denotes the probability, computed by the backward algorithm, of the feature vector sequence occurring under the Gaussian mixture hidden Markov model of fault type f, f ∈ [1, F], and F denotes the number of specified fault types to be predicted.

The prediction result in step 5 is then obtained as follows: define a threshold T; if the set PR_result = {PR_t | PR_t > T, PR_t ∈ PR, t ∈ [1, F]} is not empty, take the fault type corresponding to max{PR_result} as the prediction result; otherwise the prediction result is no fault.
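The scoring and decision of step 5 can be sketched as follows. For brevity this uses one-dimensional observations and univariate Gaussian components (the patent's features are M-dimensional); the toy parameters and threshold are assumptions. The backward recursion and the threshold-then-argmax decision follow the description above.

```python
# Sketch of step 5: backward algorithm for P(O | lambda) under a GMM-emission
# HMM, then pick the highest above-threshold fault model.
import numpy as np
from math import exp, pi as PI, sqrt

def gauss_pdf(x, mu, var):
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * PI * var)

def emission(x, weights, mus, vars_):
    """GMM emission density b_p(x) = sum_g c_{p,g} N(x; mu_{p,g}, var_{p,g})."""
    return sum(w * gauss_pdf(x, m, v) for w, m, v in zip(weights, mus, vars_))

def backward_prob(obs, pi0, A, c, mu, var):
    Q = len(pi0)
    B = np.array([[emission(x, c[p], mu[p], var[p]) for p in range(Q)]
                  for x in obs])
    beta = np.ones(Q)                      # beta_T(p) = 1
    for t in range(len(obs) - 2, -1, -1):  # beta_t = A (B_{t+1} * beta_{t+1})
        beta = A @ (B[t + 1] * beta)
    return float(pi0 @ (B[0] * beta))      # P(O | lambda)

def predict(obs, models, T):
    """models: {fault_type: (pi0, A, c, mu, var)}; returns fault type or None."""
    PR = {f: backward_prob(obs, *m) for f, m in models.items()}
    above = {f: p for f, p in PR.items() if p > T}
    return max(above, key=above.get) if above else None

pi0 = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
c, mu, var = [[1.0], [1.0]], [[0.0], [3.0]], [[1.0], [1.0]]
obs = [0.1, 2.9, 3.1]
models = {"fault_A": (pi0, A, c, mu, var)}
print(predict(obs, models, T=1e-6))
```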
The invention has the advantages that its segmented-log feature extraction method resolves the interleaving and redundancy of the original log files, making the constructed feature vectors more discriminative; and that modeling the system state and the logs preceding a system failure with a Gaussian mixture hidden Markov model gives it an advantage in both prediction effect and efficiency over other system fault prediction models.
Drawings
FIG. 1: flow chart of the fault prediction method of an embodiment of the invention;
FIG. 2: the GMM-HMM based fault prediction model of an embodiment of the invention;
FIG. 3: data preprocessing activity diagram of an embodiment of the invention;
FIG. 4: data change diagram of an embodiment of the invention;
FIG. 5: recognition rate heat map of an embodiment of the invention;
FIG. 6: log interleaving comparison diagram of an embodiment of the invention;
FIG. 7: comparison of the prediction effect of different methods;
FIG. 8: comparison of the prediction efficiency of different methods.
Detailed Description
In order to help those of ordinary skill in the art understand and implement the present invention, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It is to be understood that the embodiments described here are merely illustrative and explanatory and do not limit the invention.
A specific embodiment of the present invention is described below with reference to FIGS. 1 to 8. The technical solution of this embodiment is a system fault prediction method based on a Gaussian mixture hidden Markov model, comprising the following specific steps:
Step 1: preprocess the original log file data set to obtain a preprocessed log file data set; extract several keywords from each preprocessed log file by a keyword extraction method and build a word frequency matrix from them; cluster the preprocessed log files on the word frequency matrix using agglomerative hierarchical clustering; and manually label the type of each cluster of preprocessed log files.
The preprocessing in step 1 is: cleaning meaningless parameters from the original log file data set and filtering redundant logs to obtain the preprocessed log file data set

l_{j,i}, i ∈ [1, N_j], j ∈ [1, K]

where l_{j,i} denotes the i-th log file in the preprocessed log file data set at the j-th acquisition time and N_j denotes the total number of log files in the preprocessed data set at the j-th acquisition time; the corresponding type sequence is

e_{j,i}, i ∈ [1, N_j], j ∈ [1, K]

where e_{j,i} denotes the type of the i-th log file in the preprocessed log file data set at the j-th acquisition time, and K = 1024 denotes the number of acquisition times.
Step 2: extract a feature for each clustered log file type, construct the feature vector of each type, and arrange the feature vectors of all the types in the order of the original log files to obtain a feature vector data set.
Here, in TF-IDF fashion,

v_j^m = tf_m^j · log(K / F_m)

where tf_m^j denotes the frequency of log type m within type_j, and F_m denotes, over {type_1, type_2, ..., type_K}, the number of acquisition times i ∈ [1, K] for which m ∈ type_i holds; N_j denotes the total number of log files in the preprocessed data set at the j-th acquisition time; m ∈ [1, M], j ∈ [1, K], where M = 80 is the total number of clustered log file types from step 1 and K = 1024 the number of acquisition times. The component v_j^m is the m-th component of the feature vector of type_j, and the feature vector data set is

V = {v_1, v_2, ..., v_K}

where v_j denotes the feature vector extracted from type_j, j ∈ [1, K].
Step 3: locate all positions on the feature vector data set at which the specified fault occurs, and determine the start and stop positions of a sliding window on the data set. Starting from the start position, intercept the feature vector sequence inside the window and put it into the specified fault's data set; then move the window backwards by one sliding step and intercept again, repeating until the window reaches or passes the stop position. The collected sequences constitute the specified fault's data set.
Searching the original log file for the keywords of the specified fault f locates each position d at which fault f occurs, and its index is recorded, with f ∈ [1, F] and d ∈ [1, D_f], where F denotes the number of specified fault types to be predicted and D_f denotes the number of positions at which the specified fault type f occurs. Through the recorded indexes, the index j_f of the acquisition time of the specified fault f in the preprocessed log file data set is located, j_f ∈ [1, D_f].

The r-th feature vector sequence intercepted by the sliding window before the d-th located position of the fault on the feature vector data set is

w_{d,r} = (v_{r,1}, v_{r,2}, ..., v_{r,L_w})

where v_{r,z} denotes the z-th feature vector in that sequence, z ∈ [1, L_w], r ∈ [1, S_d]; S_d denotes the number of interceptions made before the d-th position on the feature vector data set and L_w denotes the number of feature vectors contained in the sliding window. Each sliding step covers the feature vectors (v_1, ..., v_{L_s}), where L_s denotes the number of feature vectors contained in one sliding step. Finally,

W_d^f = {w_{d,1}, w_{d,2}, ..., w_{d,S_d}}

denotes all feature vector sequence segments intercepted before the d-th located position of the specified fault f, with w_{d,r} the r-th intercepted segment, f ∈ [1, F], d ∈ [1, D_f], r ∈ [1, S_d], where F = 4 denotes the number of specified fault types to be predicted.
Step 4: set the hyper-parameters of the Gaussian mixture hidden Markov model of each specified fault to be predicted; take the data set of each specified fault as the input of the model's training algorithm (in standard GMM-HMM practice, the Baum-Welch expectation-maximization algorithm); and optimize the parameters to be estimated of each fault's model to obtain the optimized parameters, thereby constructing the optimized Gaussian mixture hidden Markov prediction model of each specified fault.
Data = {Data_1, Data_2, ..., Data_F}

where Data_f denotes the data set of the specified fault f, f ∈ [1, F], and F denotes the number of specified fault types to be predicted.

For each fault f, the number of hidden states of the hidden Markov model is Q_f = 3 and the number of Gaussian component models is G_f = 6. The Gaussian component weights are c_{p,g}^f, where c_{p,g}^f denotes the weight of Gaussian component g in the Gaussian mixture corresponding to hidden state p of the specified fault f; the component mean vectors are μ_{p,g}^f and the component covariance matrices are Σ_{p,g}^f, defined analogously. The state transition probability matrix is A^f = [a_{p,q}^f], where a_{p,q}^f denotes the probability that hidden state p of the specified fault f transitions to hidden state q. The initial state probability vector is π^f = (π_1^f, ..., π_{Q_f}^f), where π_p^f denotes the probability that hidden state p occurs at the initial moment of the specified fault f.

Here g denotes a Gaussian component model, g ∈ [1, G_f]; f denotes the specified fault type, f ∈ [1, F]; p denotes the hidden state at the current time, p ∈ [1, Q_f]; and q denotes the hidden state at the next time, q ∈ [1, Q_f].
Step 5: intercept a segment of the real-time log sequence with the sliding window of step 3 as the log sequence to be predicted; convert it into a sequence of clustered log file types by the method of step 1, convert that type sequence into a feature vector sequence by the method of step 2, and predict on the feature vector sequence with the optimized Gaussian mixture hidden Markov model of each specified fault to obtain the prediction result.
In step 5, the feature vector sequence is predicted with the optimized Gaussian mixture hidden Markov model of each specified fault from step 4 as follows: take the feature vector sequence as the input of the backward algorithm of each Gaussian mixture hidden Markov model, obtaining the probability of the feature vector sequence under each model:

PR = {PR_1, PR_2, ..., PR_F}

where PR_f denotes the probability, computed by the backward algorithm, of the feature vector sequence occurring under the Gaussian mixture hidden Markov model of fault type f, f ∈ [1, F], and F denotes the number of specified fault types to be predicted.

The prediction result is then obtained as follows: define a threshold T = 0.76; if the set PR_result = {PR_t | PR_t > T, PR_t ∈ PR, t ∈ [1, F]} is not empty, take the fault type corresponding to max{PR_result} as the prediction result; otherwise the prediction result is no fault.
Modeling the transitions between system states and the log generation process with a GMM-HMM
The system is divided into a system state layer, a task layer, and an observation layer. The system state layer, also called the hidden layer, contains a finite number of system states, each of which can transition with some probability to any system state. When the system state changes, a series of system tasks is executed, forming the task layer. The corresponding logs are generated while the tasks execute, forming the observation layer. In the observation layer, the system logs are processed into numerical vectors that a machine can understand. To simplify the problem, the invention ignores the task layer and considers that the transition of the system state depends only on the state at the previous time, and that logs are generated simultaneously with the states, so the generation of the log at the current time likewise depends only on the system state at the current time. This process is modeled by a hidden Markov model, with a Gaussian mixture density used as the function by which a system state generates logs; that is, a GMM is used to fit the probability distribution of the observed values of the HMM.
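The generative view described above can be illustrated by sampling from a GMM-HMM: the hidden system state evolves as a Markov chain, and each state emits a log feature drawn from that state's Gaussian mixture. All parameters here are toy values chosen for illustration.

```python
# Illustration of the hidden-layer / observation-layer model: sample a state
# path and, for each state, draw an observation from its Gaussian mixture.
import numpy as np

def sample_gmm_hmm(pi0, A, c, mu, var, length, seed=0):
    rng = np.random.default_rng(seed)
    states, obs = [], []
    s = rng.choice(len(pi0), p=pi0)            # initial hidden state
    for _ in range(length):
        states.append(s)
        g = rng.choice(len(c[s]), p=c[s])      # pick a Gaussian component
        obs.append(rng.normal(mu[s][g], np.sqrt(var[s][g])))
        s = rng.choice(len(pi0), p=A[s])       # Markov state transition
    return states, obs

pi0 = np.array([0.9, 0.1])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
c = np.array([[0.5, 0.5], [1.0, 0.0]])
mu = np.array([[0.0, 1.0], [5.0, 6.0]])
var = np.ones((2, 2))
states, obs = sample_gmm_hmm(pi0, A, c, mu, var, length=20)
```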
In light of the above discussion, FIG. 1 is a flow chart illustrating the overall process of the invention from step 1 to step 5.
Figure 2 shows the three-layer structure of the system and the connection mode between the layers.
FIG. 3 is a detailed flow diagram illustrating the operation of the data preprocessing of FIG. 1.
FIG. 4 depicts the change in data during the process of converting the original log sequence into feature vectors.
The embodiment of the invention verifies the described method from three aspects.
First, the influence of the number of system states and the number of Gaussian mixture components on the fault recognition rate is verified. After the number of hidden states and the number of Gaussian component models are set, a model is trained on the data set of each fault. The training data are then fed back into the trained models, the probabilities are calculated, and each input is labelled with the type whose model gives the highest probability value. The recognition rate equals the fraction of correctly labelled data in the original data. Multiple tests are performed with different numbers of states and Gaussian component models, and the combination with the highest recognition rate is kept as the corresponding parameter values in subsequent experiments.
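The recognition-rate computation described above can be sketched as follows; the per-fault scorers here are toy stand-ins for the trained per-fault GMM-HMM likelihood functions:

```python
def recognition_rate(datasets, scorers):
    """`datasets[f]` is the list of training sequences for fault f;
    `scorers[f](seq)` returns the probability (or log-probability) of
    seq under the model trained for fault f.  Each sequence is
    re-labelled with the highest-scoring fault type; the recognition
    rate is the fraction whose new label matches its true type."""
    correct = total = 0
    for true_fault, seqs in datasets.items():
        for seq in seqs:
            predicted = max(scorers, key=lambda f: scorers[f](seq))
            correct += (predicted == true_fault)
            total += 1
    return correct / total

# Toy stand-in scorers: each "model" simply prefers sequences whose
# mean is close to its fault's characteristic value.
scorers = {f: (lambda seq, c=f: -abs(sum(seq) / len(seq) - c)) for f in (1, 2, 3)}
datasets = {1: [[1, 1], [0.9, 1.2]], 2: [[2, 2]], 3: [[3.1, 2.9]]}
print(recognition_rate(datasets, scorers))  # → 1.0
```

In the hyper-parameter search, this rate would be computed once per candidate pair of hidden-state count and Gaussian-component count, and the best-scoring pair kept.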
Second, whether the feature construction method solves the log interleaving problem is verified. The original log sequence is artificially interleaved over short ranges: the log data set is traversed in order, and for each log a position is chosen at random among the 50 logs before or after it and the two logs are swapped, thereby artificially creating more interleaved logs. A comparison experiment is then performed between the artificially interleaved log data set and the original log data set: models are trained, faults are predicted, and the resulting precision, recall and F-values are compared.
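A minimal sketch of this artificial interleaving, assuming each log may be swapped with a randomly chosen position among the 50 logs before or after it:

```python
import random

def interleave(logs, window=50, seed=0):
    """Artificially interleave a log sequence: walking the list in
    order, swap each entry with a randomly chosen position within
    `window` entries before or after it."""
    rng = random.Random(seed)
    logs = list(logs)
    n = len(logs)
    for i in range(n):
        lo, hi = max(0, i - window), min(n - 1, i + window)
        j = rng.randint(lo, hi)
        logs[i], logs[j] = logs[j], logs[i]
    return logs

original = list(range(200))
shuffled = interleave(original)
# Same multiset of logs, only locally reordered
print(sorted(shuffled) == original)  # → True
```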
Third, the prediction effect of the GMM-HMM model is compared with other models: a prediction method based on random indexing and support vector machines (RI-SVN), a prediction method based on a combined convolutional and long short-term memory network (CNN-LSTM), and a prediction method based on log event sequence clustering (Cluster). The accuracy, recall, F-value, training time and prediction time of each method are evaluated.
The data set selected for verification is the log data generated by the supercomputer Spirit during actual operation; this data set is publicly available on the Internet. The scale of the test data set is shown in Table 1. The data set is divided into training data and test data, and training data sets for 3 faults are constructed by the log data preprocessing method and the data set construction method; the three fault events and the corresponding numbers of seed logs are shown in Table 2.
TABLE 1 Spirit data set Specification
Table 2 fault event seed logs

Description of failure event | Corresponding seed log quantity
---|---
drive error SCSI port ID | 52
writing message file | 38
unknown service | 83
After the number of hidden states and the number of Gaussian mixture components are set, the model is trained and its recognition rate is calculated. Different numbers of hidden states and Gaussian component models are set and multiple tests are performed to find better values for both. The recognition rates corresponding to different numbers of hidden states and Gaussian component models are shown in Table 3.
TABLE 3 identification rates of different hidden states and Gaussian fraction model numbers
It can be seen from Table 3 that when the number of hidden states is held constant, the recognition rate increases with the number of Gaussian component models, because a larger number of Gaussian components gives higher accuracy and better fits the underlying distribution. However, an excessive number of component models increases the amount of data on the one hand and the amounts of computation and storage on the other, imposing a heavy load on the system and reducing algorithm efficiency. FIG. 5 is a heat map of the recognition rate as a function of the number of hidden states and component models. It can be observed that the recognition rate is highest when the number of hidden states is 3 and the number of Gaussian component models is 6 or more. To reduce the amounts of computation and memory, the number of component models is set to 6.
To compare the artificially interleaved logs with the original logs, the model is trained using the best combination of hidden-state count and Gaussian-component count found in the preceding experiment, and the precision, recall and F-value of the model are calculated.
As can be seen from FIG. 6, the precision, recall and F-value of the model change little after the logs are artificially interleaved. This shows that the log data preprocessing method of the present invention can handle log interleaving over short ranges.
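The per-fault evaluation metrics used in these comparisons can be sketched as follows (the label sequences are illustrative; None stands for "no fault predicted"):

```python
def precision_recall_f1(y_true, y_pred, fault):
    """Per-fault-type precision, recall and F-value from true and
    predicted labels (None = no fault predicted)."""
    tp = sum(t == fault and p == fault for t, p in zip(y_true, y_pred))
    fp = sum(t != fault and p == fault for t, p in zip(y_true, y_pred))
    fn = sum(t == fault and p != fault for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative labels for one fault type
y_true = [1, 1, 1, 2, None, 1]
y_pred = [1, 1, 2, 2, 1, None]
print(precision_recall_f1(y_true, y_pred, fault=1))
```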
Combining the above experimental verifications and analyses, with well-chosen hyper-parameters and threshold the method of the invention achieves an accuracy above 80% and a recall above 75%. Each evaluation index for predicting fault events is good, and the time and space complexity of the algorithm are acceptable in both training and prediction. The failure prediction method of the present invention is therefore highly feasible in practice.
The fault prediction effect of the GMM-HMM method is compared and analysed against other fault prediction methods: the random indexing and support vector machine (RI-SVN) method, the combined convolutional and long short-term memory network (CNN-LSTM) method, and the log event sequence clustering (Cluster) method. FIG. 7 illustrates the differences in accuracy and recall between these methods and the method of the present invention, and FIG. 8 the differences in training time and prediction time. As can be seen from FIG. 7, the prediction effect of the GMM-HMM failure prediction model ranks immediately after the deep-learning method CNN-LSTM and is superior to statistical learning methods such as RI-SVN and Cluster. This is because the method of the present invention improves the log data preprocessing step, solves the log interleaving problem and, exploiting the distribution characteristics of logs before a failure occurs, preserves a certain amount of redundant data; this improvement of the data set improves the prediction effect. CNN-LSTM predicts better because the CNN can read data locally and feed it directly to the LSTM for analysis, which also addresses local log interleaving. However, the neural network has more complex structural parameters and requires a larger amount of data, and therefore consumes more computation time and resources. As can be seen from FIG. 8, the training time of CNN-LSTM is much longer than that of GMM-HMM, and its prediction time is also longer, while the overall efficiencies of the statistical learning methods differ little from one another. When algorithm efficiency and effect are considered together, the GMM-HMM fault prediction method of the invention therefore has clear advantages.
It should be understood that parts of the specification not set forth in detail are well within the prior art.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (6)
1. A system fault prediction method based on a Gaussian mixture hidden Markov model is characterized by comprising the following steps:
step 1: preprocessing an original log file data set to obtain a preprocessed log file data set; extracting a plurality of keywords from each preprocessed log file of the preprocessed log file data set by a keyword extraction method and constructing a word frequency matrix from them; clustering the preprocessed log files of the preprocessed log file data set from the word frequency matrix by an agglomerative hierarchical clustering method; and manually marking the type of each clustered preprocessed log file;
step 2: extracting features of the type of each clustered preprocessed log file, constructing therefrom a feature vector for the type of each clustered preprocessed log file, and arranging the feature vectors of the types of all clustered preprocessed log files in the order of the original log files to obtain a feature vector data set;
step 3: locating all occurrence positions of a specified fault on the feature vector data set; locating a start position and a stop position for a sliding window on the feature vector data set; intercepting, from the start position, the feature vector sequence inside the sliding window and putting it into the data set of the specified fault; moving the sliding window backwards by one sliding-step distance and continuing to intercept the feature vector sequence inside the window and put it into the data set of the specified fault until the sliding window reaches or passes the stop position, thereby obtaining the data set of the specified fault;
step 4: setting the hyper-parameters of the Gaussian mixture hidden Markov model of each specified fault to be predicted; taking the data set of each specified fault as the input of the training algorithm of the Gaussian mixture hidden Markov model and optimizing the parameters to be estimated of the Gaussian mixture hidden Markov model of each specified fault through the training algorithm, so as to obtain the optimized parameters and construct the optimized Gaussian mixture hidden Markov prediction model of each specified fault;
step 5: intercepting a section of the real-time log sequence through the sliding window of step 3 as the log sequence to be predicted; converting the log sequence to be predicted into a type sequence of clustered preprocessed log files by the method of step 1; converting that type sequence into a feature vector sequence by the method of step 2; and predicting the feature vector sequence with the optimized Gaussian mixture hidden Markov model of each specified fault to obtain a prediction result.
2. The method for predicting system faults based on the Gaussian mixture hidden Markov model according to claim 1, wherein the preprocessing in step 1 is: cleaning meaningless parameters from the original log file data set and filtering redundant logs to obtain the preprocessed log file data set;

in step 1, the clustered preprocessed log files are denoted l_{j,i}, where l_{j,i} represents the i-th log file in the preprocessed log file data set at the j-th acquisition time, N_j represents the total number of log files in the preprocessed log file data set at the j-th acquisition time, i ∈ [1, N_j], j ∈ [1, K], and K represents the number of acquisition instants;

in step 1, the type of each clustered preprocessed log file is denoted e_{j,i}, where e_{j,i} represents the type of the i-th log file in the preprocessed log file data set at the j-th acquisition time, N_j represents the total number of log files in the preprocessed log file data set at the j-th acquisition time, i ∈ [1, N_j], j ∈ [1, K], and K represents the number of acquisition instants.
3. The method for predicting system faults based on the Gaussian mixture hidden Markov model according to claim 1, wherein step 2 extracts a feature for the type of each clustered preprocessed log file from the following quantities:

wherein one quantity indicates the frequency with which log file type m occurs in type_j, and F_m represents the frequency, over {type_1, type_2, ..., type_K}, with which m ∈ type_i holds, i ∈ [1, K]; N_j represents the total number of log files in the preprocessed log file data set at the j-th acquisition time, m ∈ [1, M], j ∈ [1, K], M represents the total number of types of clustered log files in step 1, and K represents the number of acquisition instants;

in step 2, the feature vector of the type of each clustered preprocessed log file has one component per log file type, the component for type m indicating the feature value of log file type m in type_j, m ∈ [1, M], j ∈ [1, K], where M represents the total number of types of clustered log files in step 1 and K represents the number of acquisition instants;

in step 2, the feature vector data set is defined as

V = {v_1, v_2, ..., v_K}

wherein v_j represents the feature vector extracted from type_j, j ∈ [1, K].
4. The method for predicting system faults based on the Gaussian mixture hidden Markov model according to claim 1, wherein step 3 locates all occurrence positions of the specified faults on the feature vector data set by the following specific method:

searching for and locating, through the keywords of the specified fault f, a position d where the specified fault f appears in the original log file, and recording an index, with f ∈ [1, F] and d ∈ [1, D_f], where F denotes the number of all specified fault types to be predicted and D_f represents the number of occurrence positions of the specified fault type f to be predicted;

locating, through the recorded indexes, the index j_f of the acquisition time of the specified fault f in the preprocessed log file data set, with j_f ∈ [1, D_f], where D_f represents the number of occurrence positions of the specified fault type f to be predicted;
in step 3, the sliding window is defined by the following symbols:

wherein v_{r,z} indicates the z-th feature vector in the vector sequence intercepted for the r-th time before the d-th localized position of the specified fault on the feature vector data set,

f ∈ [1, F], d ∈ [1, D_f], r ∈ [1, S_d], z ∈ [1, L_w],

where F denotes the number of all specified fault types to be predicted, D_f represents the number of positions where the specified fault type f occurs, S_d represents the number of interceptions made before the d-th position on the feature vector data set, d ∈ [1, N], N denotes the number of specified fault positions on the feature vector sequence, and L_w represents the number of feature vectors contained in the sliding window;

in step 3, the sliding step is defined by the following symbols:

wherein v_v denotes the v-th feature vector contained in the sliding step, v ∈ [1, L_S], where L_S represents the number of feature vectors contained in the sliding step;

in step 3, the data set of a specified fault is defined by the following symbols:

wherein one symbol represents all feature vector sequence segments intercepted before the d-th localized position of the specified fault f, and the other represents the r-th feature vector segment intercepted before the d-th localized position of the specified fault f,

f ∈ [1, F], d ∈ [1, D_f], r ∈ [1, S_d],

where F represents the number of all specified fault types to be predicted, D_f represents the number of positions where the specified fault type f occurs, and S_d represents the number of interceptions made before the d-th position on the feature vector data set.
5. The Gaussian mixture hidden Markov model-based system failure prediction method of claim 4, wherein:

in step 4, the hyper-parameters of the Gaussian mixture hidden Markov model of a specified fault comprise the number of hidden states of the hidden Markov model and the number of Gaussian components;

the number of hidden states of the hidden Markov model is Q_f, where f denotes the specified fault type, f ∈ [1, F], and F represents the number of all specified fault types to be predicted;

the number of Gaussian component models is G_f, where f denotes the specified fault type, f ∈ [1, F], and F represents the number of all specified fault types to be predicted;

in step 4, the parameters to be estimated of the Gaussian mixture hidden Markov prediction model of each specified fault comprise: the weights of the Gaussian component models, the mean vectors of the Gaussian component models, the covariance matrices of the Gaussian component models, the state transition probability matrix and the initial state probability vector;

the weight of a Gaussian component model represents the weight of Gaussian component model g in the Gaussian mixture corresponding to hidden state p of the specified fault f;

the mean vector of a Gaussian component model represents the mean vector of Gaussian component model g in the Gaussian mixture corresponding to hidden state p of the specified fault f;

the covariance matrix of a Gaussian component model represents the covariance matrix of Gaussian component model g in the Gaussian mixture corresponding to hidden state p of the specified fault f;

an entry of the state transition probability matrix represents the probability of hidden state p of the specified fault f transitioning to hidden state q;

an entry π_p^f of the initial state probability vector indicates the probability that hidden state p occurs at the initial moment of the specified fault f,

wherein:

g denotes a Gaussian component model, g ∈ [1, G_f], and G_f represents the number of Gaussian component models corresponding to the hidden states of the specified fault f;

f denotes the specified fault type, f ∈ [1, F], and F represents the number of all specified fault types to be predicted;

p denotes the hidden state at the current time, p ∈ [1, Q_f], and Q_f represents the number of hidden states of the specified fault type f;

q denotes the hidden state at the next time, q ∈ [1, Q_f], and Q_f represents the number of hidden states of the specified fault type f.
6. The method for predicting system faults based on the Gaussian mixture hidden Markov model according to claim 1, wherein the specific method by which the optimized Gaussian mixture hidden Markov model of each fault in step 4 predicts the feature vector sequence is as follows:

the feature vector sequence is taken as the input of the backward algorithm of each Gaussian mixture hidden Markov model to obtain the probability of the feature vector sequence occurring under each Gaussian mixture hidden Markov model:

PR = {PR_1, PR_2, ..., PR_F}

wherein PR_f represents the probability, found by the backward algorithm, of the feature vector sequence occurring under the Gaussian mixture hidden Markov model of fault type f, f ∈ [1, F], and F represents the number of all specified fault types to be predicted;

in step 5, the specific method for obtaining the prediction result is as follows:

a threshold value T is defined; if PR_result = {PR_t} is not empty, where PR_t > T, PR_t ∈ PR, t ∈ [1, F], and F denotes the number of all specified fault types to be predicted, then the fault type corresponding to max{PR_result} is taken as the prediction result; otherwise, the prediction result is no fault.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110597641.XA CN113342597B (en) | 2021-05-31 | 2021-05-31 | System fault prediction method based on Gaussian mixture hidden Markov model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113342597A CN113342597A (en) | 2021-09-03 |
CN113342597B true CN113342597B (en) | 2022-04-29 |
Family
ID=77472623
Non-Patent Citations (5)

- Wang Weihua, "A Multi-Type Fault Event Prediction Method Based on System Log Clustering", China Master's Theses Full-text Database, 2018-05-21. *
- Song Xucheng, "Network Anomalous Behavior Detection Based on Deep Feature Learning", China Master's Theses Full-text Database, 2021-03-21. *
- Tang Fei et al., "Power System Transient Stability Assessment Based on a Two-Stage Parallel Hidden Markov Model", Proceedings of the CSEE, No. 10, 2013-04-05. *
- Jia Tong et al., "A Survey of Fault Diagnosis of Distributed Software Systems Based on Log Data", Journal of Software, No. 07, 2020-07-15. *
- Wu Qing et al., "A Simulation Analysis Method for the Importance of Exception Handling Modules in Service-Oriented Software", Computer Science, No. 10, 2012-10-15. *
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant