CN113342597B - System fault prediction method based on Gaussian mixture hidden Markov model - Google Patents
- Publication number
- CN113342597B CN113342597B CN202110597641.XA CN202110597641A CN113342597B CN 113342597 B CN113342597 B CN 113342597B CN 202110597641 A CN202110597641 A CN 202110597641A CN 113342597 B CN113342597 B CN 113342597B
- Authority
- CN
- China
- Prior art keywords
- data set
- fault
- type
- gaussian mixture
- log file
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/302—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3065—Monitoring arrangements determined by the means or processing involved in reporting the monitored data
- G06F11/3072—Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Quality & Reliability (AREA)
- Mathematical Optimization (AREA)
- Databases & Information Systems (AREA)
- Pure & Applied Mathematics (AREA)
- Computational Mathematics (AREA)
- Computing Systems (AREA)
- Mathematical Analysis (AREA)
- Evolutionary Biology (AREA)
- Operations Research (AREA)
- Probability & Statistics with Applications (AREA)
- Bioinformatics & Computational Biology (AREA)
- Algebra (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Software Systems (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Complex Calculations (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a system fault prediction method based on a Gaussian mixture hidden Markov model, which comprises the following steps: preprocessing and labeling an original log file; extracting log file features and constructing feature vectors; constructing a data set for each fault to be predicted using a sliding window; training one Gaussian mixture hidden Markov fault prediction model for each fault to be predicted; and using the trained models to predict, from the real-time log, whether a fault will occur and, if so, its type. The technical scheme of the invention resolves the interleaving and redundancy of the original log files, so that the extracted features are fewer and more discriminative; modeling the system state and the logs preceding a system failure with a Gaussian mixture hidden Markov model enables fast and accurate prediction of system faults and improves system availability.
Description
Technical Field
The invention belongs to the field of intelligent operation and maintenance, and particularly relates to a system fault prediction method based on a Gaussian mixture hidden Markov model, aimed at the problem of system fault prediction.
Background
The complexity of software systems has increased over the past decade as demand has grown. Software complexity, the limits of human reasoning, and other resource constraints make it very difficult to develop fault-free software, so highly complex software systems need their reliability guaranteed. Software fault prediction uses basic prediction indicators and historical failure data to predict the future fault tendency of the software, so that potential faults can be eliminated based on the predicted results. Preventing faults in a software system before they occur helps improve its availability and efficiency. However, when predicting system faults from logs, which are semi-structured text data, there remain two significant areas for improvement:
the effect and the efficiency of the fault prediction have further improved space
Fault prediction algorithms based on traditional machine learning models such as support vector machines and clustering achieve precision and recall of only about 80%, which can be further improved. Although deep learning based fault prediction algorithms such as CNN and LSTM reach about 90% accuracy, their model training and prediction times are markedly higher than those of traditional machine learning models, so prediction efficiency can also be further improved.
There is a need for more efficient data preprocessing methods
Analysis shows that log sequences have three characteristics:
Long-term ordering: during state transitions the system generates sequential logs, in time order, from a long-running series of actions, so the order of the log sequence must not be destroyed when analyzing and mining frequent log sequences.
Short-term interleaving: because system clusters are large, multiple different tasks may execute on the same node or on different nodes, each producing its own logs as it runs. When these logs are arranged in time order into a single log sequence, the logs of other tasks can be inserted into the log sequence of a given task, breaking the natural order of that task's logs.
Short-term redundancy: a component of the system that is heavily accessed over a short period (especially when a failure occurs) produces a large number of logs of the same type. For example, when a connection request fails, the system immediately issues the request again until the connection succeeds or some condition is reached. In log-based fault prediction, such redundant logs not only increase computation cost but can also drown out other important logs, hindering the analysis of frequent log sequences. However, the appearance of a large volume of one log type in a short time may itself be a signature of a particular fault, so a certain proportion of the redundant logs must be retained.
Because part of the redundant logs are retained, a given time period contains many logs but few log types. Traditional log preprocessing treats each log as an independent sample and extracts a feature vector from it. On the one hand this yields far too many samples to analyze; on the other hand it leaves little useful information within each time period. A better log preprocessing method is therefore needed so that the resulting data set is more representative.
Disclosure of Invention
Aiming at the above background and problems, the invention provides a system fault prediction method based on a Gaussian mixture hidden Markov model: from the historical system logs, one GMM-HMM model is constructed for each fault type to be predicted; at prediction time the real-time log sequence of the system is fed to each GMM-HMM model, the probability of the log sequence under each model is computed, and from these probabilities it is judged whether a fault will occur and, if so, which fault type.
The technical scheme of the invention is a system fault prediction method based on a Gaussian mixture hidden Markov model, comprising the following specific steps:
Step 1: preprocess the original log file data set to obtain a preprocessed log file data set; extract several keywords from each preprocessed log file by a keyword extraction method and build a word frequency matrix from them; cluster the preprocessed log files on the word frequency matrix using agglomerative hierarchical clustering; and manually label the type of each cluster of preprocessed log files.
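The step 1 pipeline (parameter cleaning, word-frequency vectors, agglomerative hierarchical clustering) can be sketched as follows. The tokenization rules, cosine distance, and merge threshold are illustrative assumptions, not the patent's exact choices; the patent only specifies that a word frequency matrix is clustered agglomeratively.

```python
# Minimal sketch of step 1: clean parameters out of each log line, build
# word-frequency vectors, and merge clusters bottom-up (single linkage)
# until the closest pair is farther apart than a threshold.
import math
import re
from collections import Counter

def preprocess(line):
    """Clean meaningless parameters: numbers and IPs become a wildcard token."""
    line = re.sub(r"\b\d+(\.\d+)*\b", "<*>", line)
    return re.sub(r"[^\w<>* ]", " ", line).lower()

def word_freq_vector(line):
    return Counter(preprocess(line).split())

def cosine_dist(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (na * nb) if na and nb else 1.0

def agglomerative_cluster(logs, threshold=0.5):
    """Single-linkage agglomerative clustering over word-frequency vectors."""
    vecs = [word_freq_vector(l) for l in logs]
    clusters = [[i] for i in range(len(logs))]
    while len(clusters) > 1:
        # find the closest pair of clusters (single linkage)
        best = min(
            ((ci, cj) for ci in range(len(clusters))
                      for cj in range(ci + 1, len(clusters))),
            key=lambda p: min(cosine_dist(vecs[i], vecs[j])
                              for i in clusters[p[0]] for j in clusters[p[1]]),
        )
        d = min(cosine_dist(vecs[i], vecs[j])
                for i in clusters[best[0]] for j in clusters[best[1]])
        if d > threshold:
            break
        clusters[best[0]].extend(clusters.pop(best[1]))
    return clusters

logs = [
    "Connection from 10.0.0.1 failed",
    "Connection from 10.0.0.2 failed",
    "Disk /dev/sda1 is full",
]
print(agglomerative_cluster(logs))   # the two connection logs share a cluster
```

Each resulting cluster would then receive a manually assigned type label, as the step describes.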
Step 2: extract a feature for each clustered log file type, construct the feature vector of each type, and arrange the feature vectors of all the types in the order of the original log files to obtain a feature vector data set.
Step 3: locate all positions on the feature vector data set at which the specified fault occurs, and determine the start and stop positions of a sliding window on the data set. Starting from the start position, intercept the feature vector sequence inside the window and put it into the specified fault's data set; then move the window backwards by one sliding step and intercept again, repeating until the window reaches or passes the stop position. The collected sequences constitute the specified fault's data set.
Step 4: set the hyper-parameters of the Gaussian mixture hidden Markov model of each specified fault to be predicted; take the data set of each specified fault as the input of the model's training algorithm (in standard GMM-HMM practice, the Baum-Welch expectation-maximization algorithm); and optimize the parameters to be estimated of each fault's model to obtain the optimized parameters, thereby constructing the optimized Gaussian mixture hidden Markov prediction model of each specified fault.
Step 5: intercept a segment of the real-time log sequence with the sliding window of step 3 as the log sequence to be predicted; convert it into a sequence of clustered log file types by the method of step 1, convert that type sequence into a feature vector sequence by the method of step 2, and predict on the feature vector sequence with the optimized Gaussian mixture hidden Markov model of each specified fault to obtain the prediction result.
Preferably, the preprocessing in step 1 is: cleaning meaningless parameters from the original log file data set and filtering redundant logs to obtain the preprocessed log file data set

l_{j,i}, i ∈ [1, N_j], j ∈ [1, K]

where l_{j,i} denotes the i-th log file in the preprocessed log file data set at the j-th acquisition time, N_j denotes the total number of log files in the preprocessed data set at the j-th acquisition time, and K denotes the number of acquisition times; the corresponding type sequence is

e_{j,i}, i ∈ [1, N_j], j ∈ [1, K]

where e_{j,i} denotes the type of the i-th log file in the preprocessed log file data set at the j-th acquisition time.
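The redundancy filtering named in the preprocessing step can be sketched as follows: runs of consecutive logs of the same type are collapsed, but a fixed proportion is kept so that "a burst of one log type" remains visible as a possible fault signature, as the background section argues. The keep ratio is an assumed parameter, not a value from the patent.

```python
# Sketch of the step 1 redundancy filter: from each run of identical log
# types, keep ceil(keep_ratio * run_length) entries (at least one).
import math

def filter_redundant(log_types, keep_ratio=0.2, min_keep=1):
    filtered, i = [], 0
    while i < len(log_types):
        j = i
        while j < len(log_types) and log_types[j] == log_types[i]:
            j += 1                       # end of the current run
        run = j - i
        keep = max(min_keep, math.ceil(keep_ratio * run))
        filtered.extend(log_types[i:i + keep])
        i = j
    return filtered

print(filter_redundant(["A"] * 10 + ["B"] + ["A"] * 3))
# a 10-long burst of A keeps 2 entries; the 3-long run keeps 1
```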
Preferably, in step 2 the feature of each clustered log file type is computed, in TF-IDF fashion, as

v_j^m = tf_m^j · log(K / F_m)

where tf_m^j denotes the frequency of log type m within type_j (the number of logs of type m at the j-th acquisition time divided by N_j), and F_m denotes, over {type_1, type_2, ..., type_K}, the number of acquisition times i ∈ [1, K] for which m ∈ type_i holds; N_j denotes the total number of log files in the preprocessed data set at the j-th acquisition time; m ∈ [1, M], j ∈ [1, K], where M denotes the total number of clustered log file types from step 1 and K denotes the number of acquisition times. The component v_j^m is the m-th component of the feature vector of type_j, and the feature vector data set is

V = {v_1, v_2, ..., v_K}

where v_j denotes the feature vector extracted from type_j, j ∈ [1, K].
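The feature computation above can be sketched as follows. The exact weighting formula was an image in the source; this sketch assumes the TF-IDF combination suggested by the term-frequency and document-frequency definitions.

```python
# Sketch of step 2: one feature vector per acquisition time, with
# v_j[m] = tf(m, j) * log(K / F_m), where F_m counts the acquisition
# times whose type set contains m.
import math
from collections import Counter

def feature_vectors(type_sequences, num_types):
    """type_sequences: K lists of log-type ids, one list per acquisition time."""
    K = len(type_sequences)
    df = Counter(m for seq in type_sequences for m in set(seq))  # F_m
    V = []
    for seq in type_sequences:
        counts = Counter(seq)
        N_j = len(seq)
        V.append([
            (counts[m] / N_j) * math.log(K / df[m]) if df[m] else 0.0
            for m in range(num_types)
        ])
    return V

V = feature_vectors([[0, 0, 1], [1, 1, 2], [2, 2, 2]], num_types=3)
```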
Preferably, locating all occurrence positions of the specified fault on the feature vector data set in step 3 is done as follows:

search the original log file for the keywords of the specified fault f, locate each position d at which fault f occurs, and record its index, with f ∈ [1, F] and d ∈ [1, D_f], where F denotes the number of specified fault types to be predicted and D_f denotes the number of positions at which the specified fault type f occurs;

through the recorded indexes, locate the index j_f of the acquisition time of the specified fault f in the preprocessed log file data set, j_f ∈ [1, D_f].

The r-th feature vector sequence intercepted by the sliding window before the d-th located position of the fault on the feature vector data set is

w_{d,r} = (v_{r,1}, v_{r,2}, ..., v_{r,L_w})

where v_{r,z} denotes the z-th feature vector in that sequence, z ∈ [1, L_w], r ∈ [1, S_d]; S_d denotes the number of interceptions made before the d-th position on the feature vector data set, and L_w denotes the number of feature vectors contained in the sliding window. Each sliding step covers the feature vectors (v_1, ..., v_{L_s}), where L_s denotes the number of feature vectors contained in one sliding step. Finally,

W_d^f = {w_{d,1}, w_{d,2}, ..., w_{d,S_d}}

denotes all feature vector sequence segments intercepted before the d-th located position of the specified fault f, with w_{d,r} the r-th intercepted segment, f ∈ [1, F], d ∈ [1, D_f], r ∈ [1, S_d].
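The sliding-window extraction of step 3 can be sketched as follows. The window length, step, and look-back distance are illustrative assumptions; the patent leaves L_w and L_s as hyper-parameters.

```python
# Sketch of step 3: before each located fault position d, slide a window of
# L_w feature vectors forward in steps of L_s and collect every segment that
# ends before d into the fault's data set.
def build_fault_dataset(features, fault_positions, L_w=3, L_s=1, lookback=6):
    """features: feature vectors in log order; fault_positions: fault indices."""
    dataset = []
    for d in fault_positions:
        start = max(0, d - lookback)     # window start position
        stop = d - L_w                   # last admissible window start
        r = start
        while r <= stop:
            dataset.append(features[r:r + L_w])
            r += L_s
    return dataset

feats = list(range(10))                  # stand-in 1-D "feature vectors"
segments = build_fault_dataset(feats, fault_positions=[8])
```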
Preferably, in step 4 the specified fault data sets are defined as:

Data = {Data_1, Data_2, ..., Data_F}

where Data_f denotes the data set of the specified fault f, f ∈ [1, F], and F denotes the number of specified fault types to be predicted.

For each fault f, the number of hidden states of the hidden Markov model is Q_f and the number of Gaussian component models is G_f. The Gaussian component weights are c_{p,g}^f, where c_{p,g}^f denotes the weight of Gaussian component g in the Gaussian mixture corresponding to hidden state p of the specified fault f; the component mean vectors are μ_{p,g}^f and the component covariance matrices are Σ_{p,g}^f, defined analogously. The state transition probability matrix is A^f = [a_{p,q}^f], where a_{p,q}^f denotes the probability that hidden state p of the specified fault f transitions to hidden state q. The initial state probability vector is π^f = (π_1^f, ..., π_{Q_f}^f), where π_p^f denotes the probability that hidden state p occurs at the initial moment of the specified fault f.

Here g denotes a Gaussian component model, g ∈ [1, G_f], where G_f denotes the number of Gaussian components per hidden state of the specified fault f; f denotes the specified fault type, f ∈ [1, F]; p denotes the hidden state at the current time, p ∈ [1, Q_f]; and q denotes the hidden state at the next time, q ∈ [1, Q_f], where Q_f denotes the number of hidden states of the specified fault type f.
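The parameter set λ_f = (π^f, A^f, c^f, μ^f, Σ^f) of one fault's model can be laid out as below, using the embodiment's values Q_f = 3 hidden states, G_f = 6 Gaussian components, and M = 80 feature dimensions. The random initialization scheme is an assumption; training (step 4) would refine these values.

```python
# Sketch of the step 4 parameter layout for one fault's GMM-HMM.
import numpy as np

def init_gmm_hmm(Q=3, G=6, dim=80, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.random((Q, Q)); A /= A.sum(axis=1, keepdims=True)   # a^f_{p,q}
    pi = rng.random(Q); pi /= pi.sum()                          # pi^f_p
    c = rng.random((Q, G)); c /= c.sum(axis=1, keepdims=True)   # c^f_{p,g}
    mu = rng.standard_normal((Q, G, dim))                       # mu^f_{p,g}
    sigma = np.stack([[np.eye(dim)] * G] * Q)                   # Sigma^f_{p,g}
    return A, pi, c, mu, sigma

A, pi, c, mu, sigma = init_gmm_hmm()
```

Each row of A, each row of c, and the vector pi are stochastic (sum to one), matching the transition, mixture-weight, and initial-state constraints above.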
Preferably, in step 5 the feature vector sequence is predicted with the optimized Gaussian mixture hidden Markov model of each specified fault from step 4 as follows: take the feature vector sequence as the input of the backward algorithm of each Gaussian mixture hidden Markov model, obtaining the probability of the feature vector sequence under each model:

PR = {PR_1, PR_2, ..., PR_F}

where PR_f denotes the probability, computed by the backward algorithm, of the feature vector sequence occurring under the Gaussian mixture hidden Markov model of fault type f, f ∈ [1, F], and F denotes the number of specified fault types to be predicted.

The prediction result in step 5 is then obtained as follows: define a threshold T; if the set PR_result = {PR_t | PR_t > T, PR_t ∈ PR, t ∈ [1, F]} is not empty, take the fault type corresponding to max{PR_result} as the prediction result; otherwise the prediction result is no fault.
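The scoring and decision of step 5 can be sketched as follows. For brevity this uses one-dimensional observations and univariate Gaussian components (the patent's features are M-dimensional); the toy parameters and threshold are assumptions. The backward recursion and the threshold-then-argmax decision follow the description above.

```python
# Sketch of step 5: backward algorithm for P(O | lambda) under a GMM-emission
# HMM, then pick the highest above-threshold fault model.
import numpy as np
from math import exp, pi as PI, sqrt

def gauss_pdf(x, mu, var):
    return exp(-(x - mu) ** 2 / (2 * var)) / sqrt(2 * PI * var)

def emission(x, weights, mus, vars_):
    """GMM emission density b_p(x) = sum_g c_{p,g} N(x; mu_{p,g}, var_{p,g})."""
    return sum(w * gauss_pdf(x, m, v) for w, m, v in zip(weights, mus, vars_))

def backward_prob(obs, pi0, A, c, mu, var):
    Q = len(pi0)
    B = np.array([[emission(x, c[p], mu[p], var[p]) for p in range(Q)]
                  for x in obs])
    beta = np.ones(Q)                      # beta_T(p) = 1
    for t in range(len(obs) - 2, -1, -1):  # beta_t = A (B_{t+1} * beta_{t+1})
        beta = A @ (B[t + 1] * beta)
    return float(pi0 @ (B[0] * beta))      # P(O | lambda)

def predict(obs, models, T):
    """models: {fault_type: (pi0, A, c, mu, var)}; returns fault type or None."""
    PR = {f: backward_prob(obs, *m) for f, m in models.items()}
    above = {f: p for f, p in PR.items() if p > T}
    return max(above, key=above.get) if above else None

pi0 = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
c, mu, var = [[1.0], [1.0]], [[0.0], [3.0]], [[1.0], [1.0]]
obs = [0.1, 2.9, 3.1]
models = {"fault_A": (pi0, A, c, mu, var)}
print(predict(obs, models, T=1e-6))
```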
The invention has the advantages that its segmented-log feature extraction method resolves the interleaving and redundancy of the original log files, making the constructed feature vectors more discriminative; and that modeling the system state and the logs preceding a system failure with a Gaussian mixture hidden Markov model gives it an advantage in both prediction effect and efficiency over other system fault prediction models.
Drawings
FIG. 1: flow chart of the fault prediction method of an embodiment of the invention;
FIG. 2: the GMM-HMM based fault prediction model of an embodiment of the invention;
FIG. 3: data preprocessing activity diagram of an embodiment of the invention;
FIG. 4: data change diagram of an embodiment of the invention;
FIG. 5: recognition rate heat map of an embodiment of the invention;
FIG. 6: log interleaving comparison diagram of an embodiment of the invention;
FIG. 7: comparison of the prediction effect of different methods;
FIG. 8: comparison of the prediction efficiency of different methods.
Detailed Description
In order to help those of ordinary skill in the art understand and implement the present invention, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It is to be understood that the embodiments described here are merely illustrative and explanatory and do not limit the invention.
A specific embodiment of the present invention is described below with reference to FIGS. 1 to 8. The technical solution of this embodiment is a system fault prediction method based on a Gaussian mixture hidden Markov model, comprising the following specific steps:
Step 1: preprocess the original log file data set to obtain a preprocessed log file data set; extract several keywords from each preprocessed log file by a keyword extraction method and build a word frequency matrix from them; cluster the preprocessed log files on the word frequency matrix using agglomerative hierarchical clustering; and manually label the type of each cluster of preprocessed log files.
The preprocessing in step 1 is: cleaning meaningless parameters from the original log file data set and filtering redundant logs to obtain the preprocessed log file data set

l_{j,i}, i ∈ [1, N_j], j ∈ [1, K]

where l_{j,i} denotes the i-th log file in the preprocessed log file data set at the j-th acquisition time and N_j denotes the total number of log files in the preprocessed data set at the j-th acquisition time; the corresponding type sequence is

e_{j,i}, i ∈ [1, N_j], j ∈ [1, K]

where e_{j,i} denotes the type of the i-th log file in the preprocessed log file data set at the j-th acquisition time, and K = 1024 denotes the number of acquisition times.
Step 2: extract a feature for each clustered log file type, construct the feature vector of each type, and arrange the feature vectors of all the types in the order of the original log files to obtain a feature vector data set.
Here, in TF-IDF fashion,

v_j^m = tf_m^j · log(K / F_m)

where tf_m^j denotes the frequency of log type m within type_j, and F_m denotes, over {type_1, type_2, ..., type_K}, the number of acquisition times i ∈ [1, K] for which m ∈ type_i holds; N_j denotes the total number of log files in the preprocessed data set at the j-th acquisition time; m ∈ [1, M], j ∈ [1, K], where M = 80 is the total number of clustered log file types from step 1 and K = 1024 the number of acquisition times. The component v_j^m is the m-th component of the feature vector of type_j, and the feature vector data set is

V = {v_1, v_2, ..., v_K}

where v_j denotes the feature vector extracted from type_j, j ∈ [1, K].
Step 3: locate all positions on the feature vector data set at which the specified fault occurs, and determine the start and stop positions of a sliding window on the data set. Starting from the start position, intercept the feature vector sequence inside the window and put it into the specified fault's data set; then move the window backwards by one sliding step and intercept again, repeating until the window reaches or passes the stop position. The collected sequences constitute the specified fault's data set.
Searching the original log file for the keywords of the specified fault f locates each position d at which fault f occurs, and its index is recorded, with f ∈ [1, F] and d ∈ [1, D_f], where F denotes the number of specified fault types to be predicted and D_f denotes the number of positions at which the specified fault type f occurs. Through the recorded indexes, the index j_f of the acquisition time of the specified fault f in the preprocessed log file data set is located, j_f ∈ [1, D_f].

The r-th feature vector sequence intercepted by the sliding window before the d-th located position of the fault on the feature vector data set is

w_{d,r} = (v_{r,1}, v_{r,2}, ..., v_{r,L_w})

where v_{r,z} denotes the z-th feature vector in that sequence, z ∈ [1, L_w], r ∈ [1, S_d]; S_d denotes the number of interceptions made before the d-th position on the feature vector data set and L_w denotes the number of feature vectors contained in the sliding window. Each sliding step covers the feature vectors (v_1, ..., v_{L_s}), where L_s denotes the number of feature vectors contained in one sliding step. Finally,

W_d^f = {w_{d,1}, w_{d,2}, ..., w_{d,S_d}}

denotes all feature vector sequence segments intercepted before the d-th located position of the specified fault f, with w_{d,r} the r-th intercepted segment, f ∈ [1, F], d ∈ [1, D_f], r ∈ [1, S_d], where F = 4 denotes the number of specified fault types to be predicted.
Step 4: set the hyper-parameters of the Gaussian mixture hidden Markov model of each specified fault to be predicted; take the data set of each specified fault as the input of the model's training algorithm (in standard GMM-HMM practice, the Baum-Welch expectation-maximization algorithm); and optimize the parameters to be estimated of each fault's model to obtain the optimized parameters, thereby constructing the optimized Gaussian mixture hidden Markov prediction model of each specified fault.
Data = {Data_1, Data_2, ..., Data_F}

where Data_f denotes the data set of the specified fault f, f ∈ [1, F], and F denotes the number of specified fault types to be predicted.

For each fault f, the number of hidden states of the hidden Markov model is Q_f = 3 and the number of Gaussian component models is G_f = 6. The Gaussian component weights are c_{p,g}^f, where c_{p,g}^f denotes the weight of Gaussian component g in the Gaussian mixture corresponding to hidden state p of the specified fault f; the component mean vectors are μ_{p,g}^f and the component covariance matrices are Σ_{p,g}^f, defined analogously. The state transition probability matrix is A^f = [a_{p,q}^f], where a_{p,q}^f denotes the probability that hidden state p of the specified fault f transitions to hidden state q. The initial state probability vector is π^f = (π_1^f, ..., π_{Q_f}^f), where π_p^f denotes the probability that hidden state p occurs at the initial moment of the specified fault f.

Here g denotes a Gaussian component model, g ∈ [1, G_f]; f denotes the specified fault type, f ∈ [1, F]; p denotes the hidden state at the current time, p ∈ [1, Q_f]; and q denotes the hidden state at the next time, q ∈ [1, Q_f].
Step 5: intercept a segment of the real-time log sequence with the sliding window of step 3 as the log sequence to be predicted; convert it into a sequence of clustered log file types by the method of step 1, convert that type sequence into a feature vector sequence by the method of step 2, and predict on the feature vector sequence with the optimized Gaussian mixture hidden Markov model of each specified fault to obtain the prediction result.
In step 5, the feature vector sequence is predicted with the optimized Gaussian mixture hidden Markov model of each specified fault from step 4 as follows: take the feature vector sequence as the input of the backward algorithm of each Gaussian mixture hidden Markov model, obtaining the probability of the feature vector sequence under each model:

PR = {PR_1, PR_2, ..., PR_F}

where PR_f denotes the probability, computed by the backward algorithm, of the feature vector sequence occurring under the Gaussian mixture hidden Markov model of fault type f, f ∈ [1, F], and F denotes the number of specified fault types to be predicted.

The prediction result is then obtained as follows: define a threshold T = 0.76; if the set PR_result = {PR_t | PR_t > T, PR_t ∈ PR, t ∈ [1, F]} is not empty, take the fault type corresponding to max{PR_result} as the prediction result; otherwise the prediction result is no fault.
Modeling the transitions between system states and the log generation process with a GMM-HMM
The system is divided into a system state layer, a task layer, and an observation layer. The system state layer, also called the hidden layer, contains a finite number of system states, each of which can transition with some probability to any system state. When the system state changes, a series of system tasks is executed, forming the task layer. The corresponding logs are generated while the tasks execute, forming the observation layer. In the observation layer, the system logs are processed into numerical vectors that a machine can understand. To simplify the problem, the invention ignores the task layer and considers that the transition of the system state depends only on the state at the previous time, and that logs are generated simultaneously with the states, so the generation of the log at the current time likewise depends only on the system state at the current time. This process is modeled by a hidden Markov model, with a Gaussian mixture density used as the function by which a system state generates logs; that is, a GMM is used to fit the probability distribution of the observed values of the HMM.
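The generative view described above can be illustrated by sampling from a GMM-HMM: the hidden system state evolves as a Markov chain, and each state emits a log feature drawn from that state's Gaussian mixture. All parameters here are toy values chosen for illustration.

```python
# Illustration of the hidden-layer / observation-layer model: sample a state
# path and, for each state, draw an observation from its Gaussian mixture.
import numpy as np

def sample_gmm_hmm(pi0, A, c, mu, var, length, seed=0):
    rng = np.random.default_rng(seed)
    states, obs = [], []
    s = rng.choice(len(pi0), p=pi0)            # initial hidden state
    for _ in range(length):
        states.append(s)
        g = rng.choice(len(c[s]), p=c[s])      # pick a Gaussian component
        obs.append(rng.normal(mu[s][g], np.sqrt(var[s][g])))
        s = rng.choice(len(pi0), p=A[s])       # Markov state transition
    return states, obs

pi0 = np.array([0.9, 0.1])
A = np.array([[0.8, 0.2], [0.3, 0.7]])
c = np.array([[0.5, 0.5], [1.0, 0.0]])
mu = np.array([[0.0, 1.0], [5.0, 6.0]])
var = np.ones((2, 2))
states, obs = sample_gmm_hmm(pi0, A, c, mu, var, length=20)
```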
In light of the above discussion, FIG. 1 is a flow chart illustrating the overall process of the invention from step 1 to step 5.
Figure 2 shows the three-layer structure of the system and the connection mode between the layers.
FIG. 3 is a detailed flow diagram illustrating the operation of the data preprocessing of FIG. 1.
FIG. 4 depicts the change in data during the process of converting the original log sequence into feature vectors.
The embodiment of the invention verifies the described method from three aspects.
First, the influence of the number of system states and the number of Gaussian mixture components on the fault recognition rate is verified. After the number of hidden states and the number of Gaussian component models are set, a model is trained on the data set of each fault. The training data are then fed back into the trained models, the probabilities are calculated, and each input is labelled with the type whose model gives the highest probability value. The recognition rate equals the fraction of correctly labelled data in the original data. Multiple tests are performed with different numbers of states and Gaussian component models, and the combination with the highest recognition rate is kept as the corresponding parameter values in subsequent experiments.
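The recognition-rate computation described above can be sketched as follows; the per-fault scorers here are toy stand-ins for the trained per-fault GMM-HMM likelihood functions:

```python
def recognition_rate(datasets, scorers):
    """`datasets[f]` is the list of training sequences for fault f;
    `scorers[f](seq)` returns the probability (or log-probability) of
    seq under the model trained for fault f.  Each sequence is
    re-labelled with the highest-scoring fault type; the recognition
    rate is the fraction whose new label matches its true type."""
    correct = total = 0
    for true_fault, seqs in datasets.items():
        for seq in seqs:
            predicted = max(scorers, key=lambda f: scorers[f](seq))
            correct += (predicted == true_fault)
            total += 1
    return correct / total

# Toy stand-in scorers: each "model" simply prefers sequences whose
# mean is close to its fault's characteristic value.
scorers = {f: (lambda seq, c=f: -abs(sum(seq) / len(seq) - c)) for f in (1, 2, 3)}
datasets = {1: [[1, 1], [0.9, 1.2]], 2: [[2, 2]], 3: [[3.1, 2.9]]}
print(recognition_rate(datasets, scorers))  # → 1.0
```

In the hyper-parameter search, this rate would be computed once per candidate pair of hidden-state count and Gaussian-component count, and the best-scoring pair kept.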
Second, whether the feature construction method solves the log interleaving problem is verified. The original log sequence is artificially interleaved over short ranges: the log data set is traversed in order, and for each log a position is chosen at random among the 50 logs before or after it and the two logs are swapped, thereby artificially creating more interleaved logs. A comparison experiment is then performed between the artificially interleaved log data set and the original log data set: models are trained, faults are predicted, and the resulting precision, recall and F-values are compared.
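A minimal sketch of this artificial interleaving, assuming each log may be swapped with a randomly chosen position among the 50 logs before or after it:

```python
import random

def interleave(logs, window=50, seed=0):
    """Artificially interleave a log sequence: walking the list in
    order, swap each entry with a randomly chosen position within
    `window` entries before or after it."""
    rng = random.Random(seed)
    logs = list(logs)
    n = len(logs)
    for i in range(n):
        lo, hi = max(0, i - window), min(n - 1, i + window)
        j = rng.randint(lo, hi)
        logs[i], logs[j] = logs[j], logs[i]
    return logs

original = list(range(200))
shuffled = interleave(original)
# Same multiset of logs, only locally reordered
print(sorted(shuffled) == original)  # → True
```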
Third, the prediction effect of the GMM-HMM model is compared with other models: a prediction method based on random indexing and support vector machines (RI-SVN), a prediction method based on a combined convolutional and long short-term memory network (CNN-LSTM), and a prediction method based on log event sequence clustering (Cluster). The accuracy, recall, F-value, training time and prediction time of each method are evaluated.
The data set selected for verification is the log data generated by the supercomputer Spirit during actual operation; this data set is publicly available on the Internet. The scale of the test data set is shown in Table 1. The data set is divided into training data and test data, and training data sets for 3 faults are constructed by the log data preprocessing method and the data set construction method; the three fault events and the corresponding numbers of seed logs are shown in Table 2.
TABLE 1 Spirit data set Specification
Table 2 fault event seed logs

Description of failure event | Corresponding seed log quantity
---|---
drive error SCSI port ID | 52
writing message file | 38
unknown service | 83
After the number of hidden states and the number of Gaussian mixture components are set, the model is trained and its recognition rate is calculated. Different numbers of hidden states and Gaussian component models are set and multiple tests are performed to find better values for both. The recognition rates corresponding to different numbers of hidden states and Gaussian component models are shown in Table 3.
TABLE 3 identification rates of different hidden states and Gaussian fraction model numbers
It can be seen from Table 3 that when the number of hidden states is held constant, the recognition rate increases with the number of Gaussian component models, because a larger number of Gaussian components gives higher accuracy and better fits the underlying distribution. However, an excessive number of component models increases the amount of data on the one hand and the amounts of computation and storage on the other, imposing a heavy load on the system and reducing algorithm efficiency. FIG. 5 is a heat map of the recognition rate as a function of the number of hidden states and component models. It can be observed that the recognition rate is highest when the number of hidden states is 3 and the number of Gaussian component models is 6 or more. To reduce the amounts of computation and memory, the number of component models is set to 6.
To compare the artificially interleaved logs with the original logs, the model is trained using the best combination of hidden-state count and Gaussian-component count found in the preceding experiment, and the precision, recall and F-value of the model are calculated.
As can be seen from FIG. 6, the precision, recall and F-value of the model change little after the logs are artificially interleaved. This shows that the log data preprocessing method of the present invention can handle log interleaving over short ranges.
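The per-fault evaluation metrics used in these comparisons can be sketched as follows (the label sequences are illustrative; None stands for "no fault predicted"):

```python
def precision_recall_f1(y_true, y_pred, fault):
    """Per-fault-type precision, recall and F-value from true and
    predicted labels (None = no fault predicted)."""
    tp = sum(t == fault and p == fault for t, p in zip(y_true, y_pred))
    fp = sum(t != fault and p == fault for t, p in zip(y_true, y_pred))
    fn = sum(t == fault and p != fault for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative labels for one fault type
y_true = [1, 1, 1, 2, None, 1]
y_pred = [1, 1, 2, 2, 1, None]
print(precision_recall_f1(y_true, y_pred, fault=1))
```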
Combining the above experimental verifications and analyses, with well-chosen hyper-parameters and threshold the method of the invention achieves an accuracy above 80% and a recall above 75%. Each evaluation index for predicting fault events is good, and the time and space complexity of the algorithm are acceptable in both training and prediction. The failure prediction method of the present invention is therefore highly feasible in practice.
The fault prediction effect of the GMM-HMM method is compared and analysed against other fault prediction methods: the random indexing and support vector machine (RI-SVN) method, the combined convolutional and long short-term memory network (CNN-LSTM) method, and the log event sequence clustering (Cluster) method. FIG. 7 illustrates the differences in accuracy and recall between these methods and the method of the present invention, and FIG. 8 the differences in training time and prediction time. As can be seen from FIG. 7, the prediction effect of the GMM-HMM failure prediction model ranks immediately after the deep-learning method CNN-LSTM and is superior to statistical learning methods such as RI-SVN and Cluster. This is because the method of the present invention improves the log data preprocessing step, solves the log interleaving problem and, exploiting the distribution characteristics of logs before a failure occurs, preserves a certain amount of redundant data; this improvement of the data set improves the prediction effect. CNN-LSTM predicts better because the CNN can read data locally and feed it directly to the LSTM for analysis, which also addresses local log interleaving. However, the neural network has more complex structural parameters and requires a larger amount of data, and therefore consumes more computation time and resources. As can be seen from FIG. 8, the training time of CNN-LSTM is much longer than that of GMM-HMM, and its prediction time is also longer, while the overall efficiencies of the statistical learning methods differ little from one another. When algorithm efficiency and effect are considered together, the GMM-HMM fault prediction method of the invention therefore has clear advantages.
It should be understood that parts of the specification not set forth in detail are well within the prior art.
It should be understood that the above description of the preferred embodiments is given for clarity and not for any purpose of limitation, and that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (6)
1. A system fault prediction method based on a Gaussian mixture hidden Markov model is characterized by comprising the following steps:
step 1: preprocessing an original log file data set to obtain a preprocessed log file data set; extracting a plurality of keywords from each preprocessed log file of the preprocessed log file data set by a keyword extraction method and constructing a word frequency matrix from them; clustering the preprocessed log files of the preprocessed log file data set from the word frequency matrix by an agglomerative hierarchical clustering method; and manually marking the type of each clustered preprocessed log file;
step 2: extracting features of the type of each clustered preprocessed log file, constructing therefrom a feature vector for the type of each clustered preprocessed log file, and arranging the feature vectors of the types of all clustered preprocessed log files in the order of the original log files to obtain a feature vector data set;
step 3: locating all occurrence positions of a specified fault on the feature vector data set; locating a start position and a stop position for a sliding window on the feature vector data set; intercepting, from the start position, the feature vector sequence inside the sliding window and putting it into the data set of the specified fault; moving the sliding window backwards by one sliding-step distance and continuing to intercept the feature vector sequence inside the window and put it into the data set of the specified fault until the sliding window reaches or passes the stop position, thereby obtaining the data set of the specified fault;
step 4: setting the hyper-parameters of the Gaussian mixture hidden Markov model of each specified fault to be predicted; taking the data set of each specified fault as the input of the training algorithm of the Gaussian mixture hidden Markov model and optimizing the parameters to be estimated of the Gaussian mixture hidden Markov model of each specified fault through the training algorithm, so as to obtain the optimized parameters and construct the optimized Gaussian mixture hidden Markov prediction model of each specified fault;
step 5: intercepting a section of the real-time log sequence through the sliding window of step 3 as the log sequence to be predicted; converting the log sequence to be predicted into a type sequence of clustered preprocessed log files by the method of step 1; converting that type sequence into a feature vector sequence by the method of step 2; and predicting the feature vector sequence with the optimized Gaussian mixture hidden Markov model of each specified fault to obtain a prediction result.
2. The method for predicting system faults based on the Gaussian mixture hidden Markov model according to claim 1, wherein the preprocessing in step 1 is: cleaning meaningless parameters from the original log file data set and filtering redundant logs to obtain the preprocessed log file data set;

in step 1, the clustered preprocessed log files are denoted l_{j,i}, where l_{j,i} represents the i-th log file in the preprocessed log file data set at the j-th acquisition time, N_j represents the total number of log files in the preprocessed log file data set at the j-th acquisition time, i ∈ [1, N_j], j ∈ [1, K], and K represents the number of acquisition instants;

in step 1, the type of each clustered preprocessed log file is denoted e_{j,i}, where e_{j,i} represents the type of the i-th log file in the preprocessed log file data set at the j-th acquisition time, N_j represents the total number of log files in the preprocessed log file data set at the j-th acquisition time, i ∈ [1, N_j], j ∈ [1, K], and K represents the number of acquisition instants.
3. The method for predicting system faults based on the Gaussian mixture hidden Markov model according to claim 1, wherein step 2 extracts a feature for the type of each clustered preprocessed log file from the following quantities:

wherein one quantity indicates the frequency with which log file type m occurs in type_j, and F_m represents the frequency, over {type_1, type_2, ..., type_K}, with which m ∈ type_i holds, i ∈ [1, K]; N_j represents the total number of log files in the preprocessed log file data set at the j-th acquisition time, m ∈ [1, M], j ∈ [1, K], M represents the total number of types of clustered log files in step 1, and K represents the number of acquisition instants;

in step 2, the feature vector of the type of each clustered preprocessed log file has one component per log file type, the component for type m indicating the feature value of log file type m in type_j, m ∈ [1, M], j ∈ [1, K], where M represents the total number of types of clustered log files in step 1 and K represents the number of acquisition instants;

in step 2, the feature vector data set is defined as

V = {v_1, v_2, ..., v_K}

wherein v_j represents the feature vector extracted from type_j, j ∈ [1, K].
4. The method for predicting system faults based on the Gaussian mixture hidden Markov model according to claim 1, wherein step 3 locates all occurrence positions of the specified faults on the feature vector data set by the following specific method:

searching for and locating, through the keywords of the specified fault f, a position d where the specified fault f appears in the original log file, and recording an index, with f ∈ [1, F] and d ∈ [1, D_f], where F denotes the number of all specified fault types to be predicted and D_f represents the number of occurrence positions of the specified fault type f to be predicted;

locating, through the recorded indexes, the index j_f of the acquisition time of the specified fault f in the preprocessed log file data set, with j_f ∈ [1, D_f], where D_f represents the number of occurrence positions of the specified fault type f to be predicted;
in step 3, the sliding window is defined by the following symbols:

wherein v_{r,z} indicates the z-th feature vector in the vector sequence intercepted for the r-th time before the d-th localized position of the specified fault on the feature vector data set,

f ∈ [1, F], d ∈ [1, D_f], r ∈ [1, S_d], z ∈ [1, L_w],

where F denotes the number of all specified fault types to be predicted, D_f represents the number of positions where the specified fault type f occurs, S_d represents the number of interceptions made before the d-th position on the feature vector data set, d ∈ [1, N], N denotes the number of specified fault positions on the feature vector sequence, and L_w represents the number of feature vectors contained in the sliding window;

in step 3, the sliding step is defined by the following symbols:

wherein v_v denotes the v-th feature vector contained in the sliding step, v ∈ [1, L_S], where L_S represents the number of feature vectors contained in the sliding step;

in step 3, the data set of a specified fault is defined by the following symbols:

wherein one symbol represents all feature vector sequence segments intercepted before the d-th localized position of the specified fault f, and the other represents the r-th feature vector segment intercepted before the d-th localized position of the specified fault f,

f ∈ [1, F], d ∈ [1, D_f], r ∈ [1, S_d],

where F represents the number of all specified fault types to be predicted, D_f represents the number of positions where the specified fault type f occurs, and S_d represents the number of interceptions made before the d-th position on the feature vector data set.
5. The Gaussian mixture hidden Markov model-based system failure prediction method of claim 4, wherein:

in step 4, the hyper-parameters of the Gaussian mixture hidden Markov model of a specified fault comprise the number of hidden states of the hidden Markov model and the number of Gaussian components;

the number of hidden states of the hidden Markov model is Q_f, where f denotes the specified fault type, f ∈ [1, F], and F represents the number of all specified fault types to be predicted;

the number of Gaussian component models is G_f, where f denotes the specified fault type, f ∈ [1, F], and F represents the number of all specified fault types to be predicted;

in step 4, the parameters to be estimated of the Gaussian mixture hidden Markov prediction model of each specified fault comprise: the weights of the Gaussian component models, the mean vectors of the Gaussian component models, the covariance matrices of the Gaussian component models, the state transition probability matrix and the initial state probability vector;

the weight of a Gaussian component model represents the weight of Gaussian component model g in the Gaussian mixture corresponding to hidden state p of the specified fault f;

the mean vector of a Gaussian component model represents the mean vector of Gaussian component model g in the Gaussian mixture corresponding to hidden state p of the specified fault f;

the covariance matrix of a Gaussian component model represents the covariance matrix of Gaussian component model g in the Gaussian mixture corresponding to hidden state p of the specified fault f;

an entry of the state transition probability matrix represents the probability of hidden state p of the specified fault f transitioning to hidden state q;

an entry π_p^f of the initial state probability vector indicates the probability that hidden state p occurs at the initial moment of the specified fault f,

wherein:

g denotes a Gaussian component model, g ∈ [1, G_f], and G_f represents the number of Gaussian component models corresponding to the hidden states of the specified fault f;

f denotes the specified fault type, f ∈ [1, F], and F represents the number of all specified fault types to be predicted;

p denotes the hidden state at the current time, p ∈ [1, Q_f], and Q_f represents the number of hidden states of the specified fault type f;

q denotes the hidden state at the next time, q ∈ [1, Q_f], and Q_f represents the number of hidden states of the specified fault type f.
6. The method for predicting system faults based on the Gaussian mixture hidden Markov model according to claim 1, wherein the specific method by which the optimized Gaussian mixture hidden Markov model of each fault in step 4 predicts the feature vector sequence is as follows:

the feature vector sequence is taken as the input of the backward algorithm of each Gaussian mixture hidden Markov model to obtain the probability of the feature vector sequence occurring under each Gaussian mixture hidden Markov model:

PR = {PR_1, PR_2, ..., PR_F}

wherein PR_f represents the probability, found by the backward algorithm, of the feature vector sequence occurring under the Gaussian mixture hidden Markov model of fault type f, f ∈ [1, F], and F represents the number of all specified fault types to be predicted;

in step 5, the specific method for obtaining the prediction result is as follows:

a threshold value T is defined; if PR_result = {PR_t} is not empty, where PR_t > T, PR_t ∈ PR, t ∈ [1, F], and F denotes the number of all specified fault types to be predicted, then the fault type corresponding to max{PR_result} is taken as the prediction result; otherwise, the prediction result is no fault.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110597641.XA CN113342597B (en) | 2021-05-31 | 2021-05-31 | System fault prediction method based on Gaussian mixture hidden Markov model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113342597A CN113342597A (en) | 2021-09-03 |
CN113342597B true CN113342597B (en) | 2022-04-29 |
Family
ID=77472623
Non-Patent Citations (5)

- Wang Weihua, "A Multi-Type Fault Event Prediction Method Based on System Log Clustering", China Master's Theses Full-text Database, 2018-05-21. *
- Song Xucheng, "Network Anomalous Behavior Detection Based on Deep Feature Learning", China Master's Theses Full-text Database, 2021-03-21. *
- Tang Fei et al., "Power System Transient Stability Assessment Based on a Two-Stage Parallel Hidden Markov Model", Proceedings of the CSEE, No. 10, 2013-04-05. *
- Jia Tong et al., "A Survey of Fault Diagnosis of Distributed Software Systems Based on Log Data", Journal of Software, No. 07, 2020-07-15. *
- Wu Qing et al., "A Simulation Analysis Method for the Importance of Exception Handling Modules in Service-Oriented Software", Computer Science, No. 10, 2012-10-15. *
Legal Events

Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant