CN113341919A

CN113341919A - Computing system fault prediction method based on time sequence data length optimization

Info

Publication number: CN113341919A
Application number: CN202110601375.3A
Authority: CN
Inventors: 何盼; 刘刚; 洪昌萍; 江玲
Original assignee: Chongqing Institute of Green and Intelligent Technology of CAS
Current assignee: Chongqing Institute of Green and Intelligent Technology of CAS
Priority date: 2021-05-31
Filing date: 2021-05-31
Publication date: 2021-09-03
Anticipated expiration: 2041-05-31
Also published as: CN113341919B

Abstract

The invention relates to a computing system fault prediction method based on time-series data length optimization, and belongs to the field of fault detection. The method comprises the following steps. S1, offline training: based on historical system operation data, the data are sliced using different sequence lengths and a separate fault prediction model is built for each length; the sequence length with the best prediction accuracy, together with its fault prediction model, is found by a search based on the binary search idea. S2, online prediction: the optimal sequence length produced by offline training is used in the real-time fault prediction process. S3, model updating: while the system runs continuously, the prediction accuracy of the model is measured in real time against actual data, and the model parameters or the sequence length are updated when the accuracy declines. The invention improves the accuracy of the fault prediction model, reduces the number of model training runs needed during the optimal-length search, and improves the adaptability of the prediction model to changes in the system and its environment.

Description

Computing system fault prediction method based on time sequence data length optimization

Technical Field

The invention belongs to the field of fault detection, and relates to a computing system fault prediction method based on time sequence data length optimization.

Background

Because computing systems are widely used across industries, unanticipated system faults can have a large impact, and maintaining system reliability is crucial to keeping such systems running continuously. However, a computing system is usually composed of many different components, such as hardware processors, software modules, databases and network systems. The failure behavior of these components is unknown, and their relationships are complex and mutually dependent, so it is difficult to perform accurate failure analysis from the internal structure of the system. The main current approach to maintaining reliability is therefore to take a system-level view: the state or quality of the system is monitored through logs, probes and the like, and the monitoring data are used to evaluate the software modules and the underlying hardware as a whole and to predict faults.

The monitoring data of a computing system are both periodic and random, and continuous monitoring data with one or more attributes are the main basis for predicting and classifying system faults. Because of the technical constraints of newer computing technologies such as cloud platforms and microservices, or the limited computing resources of real-time embedded systems such as unmanned aerial vehicle flight control systems, these systems often hide the underlying hardware architecture, and even the composition of the software modules, from the outside. In addition, because the relationships among the attribute data are complex, it is difficult to build a mathematical model of system state change from the way the data attributes vary. Numerical analysis methods based on time-series data, such as Bayesian analysis, machine learning and deep learning, are therefore widely applied to fault prediction not only in computing systems but also in fields such as aerospace and intelligent manufacturing.

In the prior art, the known or unknown patterns that monitoring data follow before a historical fault occurs are obtained by analyzing historical monitoring data and extracting data features; by comparing these with the features of the current monitoring data, it is possible to predict whether a fault is about to occur and to determine its type. For fault analysis of computing systems, prior patents detect or predict faults using statistical analysis methods such as Bayesian models and the ARIMA time-series model, machine learning methods such as support vector machines and XGBoost, and deep learning methods such as LSTM, CNN and GRU deep neural network models. Compared with other methods, deep learning can improve the accuracy and precision of system fault prediction and classification, but it usually builds the model from a time-series data set of fixed or variable length. Data acquisition by a computing system in a real environment is a long-running process, and the acquired data are continuous data that change over time. To generate a time-series data set, the prior art discusses methods for slicing continuous data or for acquiring fixed-length data at regular intervals, but it does not discuss how to select the sequence data length or the data slice length, and it has the following problems:

(1) For fault prediction over different time periods, time-series data of different lengths have a marked influence on the accuracy of the prediction model. In the model training phase of a real system, an algorithm model may need to be trained several times with sequence data of different lengths in order to compare model accuracy. Existing fault prediction algorithms generally do not consider how to select the length of the time-series data used for training or the effect of that length, so they are of limited practical use and cannot guarantee the performance of the algorithm in the training phase.

(2) While the system is running, the patterns in the data may change dynamically over time, so a fault prediction model trained on historical data may not remain valid for long and needs to be updated dynamically. Although the prior art discusses methods for dynamically training and updating models, it does not discuss changing the length of the time-series data used for training.

In view of the above disadvantages, a failure prediction method capable of improving the accuracy of the failure prediction model, reducing the number of times of training the model, and making the failure prediction model better adapt to the system change is needed.

Disclosure of Invention

In view of the above, the invention provides a computing system fault prediction method based on time-series data length optimization. Based on the binary search idea, the method uses the accuracy of fault prediction models trained with different sequence data lengths as the evaluation index for comparing those lengths, so that the time-series data length best suited to a specific fault prediction problem can be selected. In addition, by dynamically adjusting the time-series data length, the method estimates and maintains the model prediction accuracy in real time.

In order to achieve the purpose, the invention provides the following technical scheme:

A computing system fault prediction method based on time-series data length optimization comprises three processes: offline training, online prediction and model updating. The offline training process selects the optimal sequence length and trains the model; the online prediction process uses the model trained offline to perform fault prediction and system control; and the model updating process checks the model and feeds back updates while the system is running. As shown in FIG. 1, offline training must be performed before the online prediction process, and model updating can run in parallel with online prediction. The method comprises the following steps:

S1: offline training;

based on historical system operation data, the data are sliced using different sequence lengths and a separate fault prediction model is built for each length; the sequence data length with the best prediction accuracy, together with its fault prediction model, is found by a search based on the binary search idea;

S2: online prediction;

the optimal sequence data length produced by offline training is used in the real-time fault prediction process;

S3: model updating;

while the system runs continuously, the prediction accuracy of the model is measured in real time against actual data, and the fault prediction model parameters or the sequence data length are updated when the accuracy declines.

Further, in step S1, the offline training process (as shown in fig. 2) specifically includes the following steps:

S11: prediction period selection: determine the fault prediction period n according to the characteristics of the computing system and the project requirements, that is, the method predicts the probability that a certain type of fault occurs in the system after time n; query whether a tuple containing a trained model f_w associated with n and its optimal sequence data length m_w exists; if it exists, set the initial sequence length to be searched, m₀, to the last recorded search value m_w; otherwise, set the initial search length m₀ to the prediction period n;

S12: initial search set: take m₀ as the median value of the set of sequence data lengths to be searched, set the lower boundary m₁ = m₀/2 and the upper boundary m₂ = 2m₀, and establish the set of sequence data lengths to be searched M = {m₀, m₁, m₂};

S13: model training and evaluation: for each value m_j ∈ M (0 ≤ j ≤ 2), if the trained model set F contains no fault prediction model f_j and prediction model accuracy p_j corresponding to m_j, train a prediction model and evaluate its accuracy;

S14: optimal sequence data length search: based on the sequence data length set M, the prediction models built from it and their accuracies p_j, search for the data sequence length with the best prediction accuracy:

if m₂ − m₁ ≤ 2, end the search and execute the optimal result storage step;

if m₂ − m₁ > 2, regenerate the elements m_j of the search set M according to the following rules:

if p₀ ≥ p₁ ≥ p₂, continue searching in the interval [m₁, m₀], and reset the median, lower boundary and upper boundary of the set to m₀’ = (m₀ + m₁)/2, m₁’ = m₁, m₂’ = m₀;

if p₀ ≥ p₂ ≥ p₁, or p₀ ≥ p₁ and p₂ − p₀ ≤ δ, continue searching in the interval [m₀, m₂], and reset the median, lower boundary and upper boundary of the set to m₀’ = (m₀ + m₂)/2, m₁’ = m₀, m₂’ = m₂;

if p₁ ≥ p₂ ≥ p₀ or p₁ ≥ p₀ ≥ p₂, search in the direction of decreasing m₁, and reset the median, lower boundary and upper boundary of the set to m₀’ = m₁, m₁’ = m₁/2, m₂’ = m₀;

if p₂ ≥ p₁ ≥ p₀, or p₂ ≥ p₀ ≥ p₁ and p₂ − p₀ > δ, search in the direction of increasing m₂, and reset the median, lower boundary and upper boundary of the set to m₀’ = m₂, m₁’ = m₀, m₂’ = 2m₂;

S15: search set update: store the accuracies p₀, p₁ and p₂ generated during the search in the model accuracy set P, update the set to be searched to M = {m₀’, m₁’, m₂’}, and return to the model training and evaluation of step S13;

S16: optimal search result storage: compare all model accuracies in the model accuracy set P, select the k highest accuracy values p_v ∈ P, and calculate their average value p_A = (1/k) Σ p_v; determine the highest prediction accuracy in the set P, p_w = max{p_v | p_v ∈ P}, and obtain the model f_w and data length m_w corresponding to p_w; store the fault prediction period n, the prediction accuracy p_w, the prediction model f_w, the sequence data length m_w and the average prediction accuracy p_A as a tuple in the pre-trained model library.
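The search in steps S11 to S16 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `train_and_eval` is a hypothetical callback standing in for step S13, δ is treated as a small accuracy tolerance (the text does not define it), the four rules are tested in the order listed so that overlapping conditions resolve to the first match, and `max_rounds` is an added safeguard that the text does not mention.

```python
from typing import Callable, Dict, Tuple

Triple = Tuple[float, float, float]  # (m0 median, m1 lower bound, m2 upper bound)


def next_triple(m: Triple, p: Triple, delta: float) -> Triple:
    """Regenerate the search set from the three accuracies (step S14)."""
    m0, m1, m2 = m
    p0, p1, p2 = p
    if p0 >= p1 >= p2:                       # keep searching in [m1, m0]
        return ((m0 + m1) / 2, m1, m0)
    if p0 >= p2 >= p1 or (p0 >= p1 and p2 - p0 <= delta):
        return ((m0 + m2) / 2, m0, m2)       # keep searching in [m0, m2]
    if p1 >= p2 >= p0 or p1 >= p0 >= p2:     # shorter lengths look better
        return (m1, m1 / 2, m0)
    return (m2, m0, 2 * m2)                  # remaining case: p2 is strictly best


def search_optimal_length(
    train_and_eval: Callable[[float, float], float],  # hypothetical: (m, n) -> accuracy
    n: float,
    m_start: float = None,
    delta: float = 0.01,   # assumed tolerance
    k: int = 3,            # number of top accuracies averaged into p_A
    max_rounds: int = 50,  # assumed safeguard, not part of the described method
):
    m0 = m_start if m_start is not None else n        # S11
    triple: Triple = (m0, m0 / 2, 2 * m0)             # S12
    acc: Dict[float, float] = {}                      # plays the role of F and P
    for _ in range(max_rounds):
        for m in triple:                              # S13: train only unseen lengths
            if m not in acc:
                acc[m] = train_and_eval(m, n)
        _, m1, m2 = triple
        if m2 - m1 <= 2:                              # S14: stop condition
            break
        triple = next_triple(triple, tuple(acc[m] for m in triple), delta)  # S15
    best_m = max(acc, key=acc.get)                    # S16: m_w and p_w
    top_k = sorted(acc.values(), reverse=True)[:k]
    p_a = sum(top_k) / len(top_k)                     # S16: p_A
    return best_m, acc[best_m], p_a, acc


def toy_accuracy(m: float, n: float) -> float:
    """Synthetic accuracy curve peaking at m = 24 (illustration only)."""
    return 1 - 0.02 * (24 - m) if m < 24 else 1 - 0.01 * (m - 24)


if __name__ == "__main__":
    best_m, p_w, p_a, acc = search_optimal_length(toy_accuracy, n=5)
    print(best_m, p_w, p_a, sorted(acc))
```

Because every evaluated length is cached, each round trains at most one or two new models, which is where the reduction in training runs comes from. Note that the search is a heuristic: with the synthetic curve above it settles near, but not necessarily exactly on, the peak.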

Further, in step S13, if the trained model set F contains no fault prediction model f_j and prediction model accuracy p_j corresponding to m_j, the prediction model is trained and its accuracy evaluated by the following steps:

S131: data set generation: slice the continuous monitoring data to generate multiple groups of sequence data of length m_j; use the probability y of whether a specific fault occurs in the system at time n after each group of sequence data as the label of that group; randomly divide the labeled sequence data into a training data set S_j and a test data set T_j;

S132: fault prediction model training: train a time-series deep learning neural network, such as LSTM or GRU, on the training data set S_j to obtain the parameters of the fault prediction model f_j; the input variable of model f_j is sequence data of length m_j, and the output variable is the probability that a specific type of fault occurs after time n; add model f_j to the trained model set F;

S133: model accuracy evaluation: use the prediction model f_j to predict on the sequence data in the test data set T_j, and compare the predicted failure probability ŷ with the actual failure probability y to evaluate the model accuracy p_j.
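The data set generation of step S131 can be illustrated with a short Python sketch. It assumes uniformly sampled data, so that m and n are counted in samples, and it uses a sliding window with a stride of one sample; the function name `make_dataset` and the parameters `test_ratio` and `seed` are hypothetical and do not come from the text.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence, float]  # (window of m consecutive records, label y)


def make_dataset(
    series: Sequence,          # continuous monitoring data, one record per time step
    labels: Sequence[float],   # labels[t]: fault probability (or 0/1 flag) at step t
    m: int,                    # sequence data length m_j, in samples
    n: int,                    # prediction period n, in samples
    test_ratio: float = 0.2,   # assumed split ratio
    seed: int = 0,             # assumed, for a reproducible shuffle
) -> Tuple[List[Sample], List[Sample]]:
    samples: List[Sample] = []
    # The window covers steps start .. start+m-1; its label is taken n steps
    # after the last record of the window.
    for start in range(len(series) - m - n + 1):
        window = series[start:start + m]
        y = labels[start + m - 1 + n]
        samples.append((window, y))
    random.Random(seed).shuffle(samples)      # random division into S_j and T_j
    n_test = max(1, int(len(samples) * test_ratio))
    return samples[n_test:], samples[:n_test]  # (training set S_j, test set T_j)
```

For a series of 20 records with m = 4 and n = 2, this yields 15 labeled windows before the split.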

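Further, in step S133, the mean absolute error (MAE) and the root mean square error (RMSE) are used as model accuracy evaluation indexes, wherein, for N test samples with predicted failure probability ŷ_i and actual failure probability y_i, MAE = (1/N) Σ |ŷ_i − y_i| and RMSE = sqrt((1/N) Σ (ŷ_i − y_i)²), with both sums taken over i = 1, …, N.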
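As a small worked illustration, the two error measures can be computed in plain Python as shown below. The functions `mae` and `rmse` follow the standard definitions. The function `accuracy_score` is an assumption added here: the search rules treat a larger p_j as better, while MAE and RMSE are errors, so this sketch converts them to a score by subtracting their mean from 1, which is reasonable only because probabilities lie in [0, 1]. The text does not state how p_j is derived from the two measures.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error between actual and predicted probabilities."""
    return sum(abs(p - t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error between actual and predicted probabilities."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def accuracy_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Assumed conversion to a higher-is-better score: 1 - mean(MAE, RMSE)."""
    return 1 - (mae(y_true, y_pred) + rmse(y_true, y_pred)) / 2
```

A perfect predictor scores 1 under this convention, and a predictor that always outputs 0.5 on binary labels scores 0.5.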

Further, in step S2, the optimal sequence data length generated by offline training is used in a real-time fault prediction process (as shown in fig. 3), which specifically includes the following steps:

S21: model lookup: according to the fault prediction period n, query whether a trained model f_w associated with n exists; if not, wait for the offline training process to be executed; if so, execute the real-time fault prediction step;

S22: real-time fault prediction: continuously extract sequence data of length m_w from the latest data and input it into the model f_w to obtain the predicted probability ŷ of each type of system fault after time n; if the predicted probability of a specific fault is not less than the system maintenance probability threshold, execute the corresponding system maintenance strategy and return to step S21; otherwise, repeat this step.
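One iteration of step S22 can be sketched as follows. The model is represented by a hypothetical callable that maps a window of records to a list of fault probabilities; `predict_step` and its parameter names are illustrative only, and the maintenance strategy itself is left to the caller.

```python
from typing import Callable, List, Sequence, Tuple


def predict_step(
    model: Callable[[Sequence], Sequence[float]],  # hypothetical stand-in for f_w
    recent_data: Sequence,                         # latest monitoring records, oldest first
    m_w: int,                                      # optimal sequence length, in samples
    threshold: float,                              # system maintenance probability threshold
) -> Tuple[List[float], List[int]]:
    """Return the predicted probabilities and the indices of fault types to act on."""
    if len(recent_data) < m_w:
        raise ValueError("not enough data for one window of length m_w")
    window = recent_data[-m_w:]          # the most recent m_w records
    probs = list(model(window))
    # "Not less than the threshold" triggers the maintenance strategy.
    alarms = [i for i, p in enumerate(probs) if p >= threshold]
    return probs, alarms
```

A caller would run this once per prediction tick, execute the maintenance strategy when `alarms` is non-empty, and otherwise wait for the next tick.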

Further, in step S3, the model updating process (as shown in fig. 4) specifically includes the following steps:

S31: real-time data set update: extract sequence data of length m_w from the latest operation data, together with the probability y_w of whether a specific fault occurred in the system at time n after each group of sequence data, and update the training data set S_w and the test data set T_w;

S32: real-time model evaluation: after the system has run continuously for a time t, use the prediction model f_w to predict on the sequence data in the test data set T_w and evaluate the model accuracy p_w’;

with a scaling factor x < 1, if p_w’ ≥ x·p_w, return to step S31 and continue updating the data set;

if x·p_A < p_w’ < x·p_w, go to step S33;

if p_w’ < x·p_A, set the starting sequence search length to m_w, re-execute the offline training process to search for a new optimal sequence length and prediction model, and return to step S31 to continue updating the data set;

S33: model update: using the updated test data set and training data set, and without changing m_w, retrain the parameters of the model f_w with a time-series deep learning neural network such as LSTM or GRU; return to step S31 and continue updating the data set.
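The three-way decision in step S32 can be written as a small Python function. The function name and the returned action labels are hypothetical. The text leaves the boundary case p_w’ = x·p_A unspecified; this sketch assigns it to the re-search branch as an assumption.

```python
def update_action(p_new: float, p_w: float, p_a: float, x: float = 0.9) -> str:
    """Choose the model-update action from the freshly measured accuracy p_w'.

    p_w is the stored best accuracy and p_a the stored top-k average (p_a <= p_w).
    """
    if p_new >= x * p_w:
        return "keep"       # accuracy still acceptable: go back to S31
    if p_new > x * p_a:
        return "retrain"    # S33: retrain parameters, keep the length m_w
    return "research"       # rerun offline training, starting the search from m_w
```

With x = 0.9, p_w = 0.99 and p_A = 0.98, the thresholds are 0.891 and 0.882, so a measured accuracy of 0.95 keeps the model, 0.885 triggers retraining, and 0.80 triggers a new length search.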

The invention has the beneficial effects that:

1) The invention provides an optimal data sequence length selection mechanism for fault prediction methods based on time-series data, which improves the accuracy of the fault prediction model and reduces the number of model training runs needed during the optimal-length search.

2) In the running process of the system, the invention provides a dynamic optimal data sequence length transformation mechanism for the fault prediction method so as to improve the adaptability of the prediction model to the change of the system and the environment.

Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention may be realized and attained by the means of the instrumentalities and combinations particularly pointed out hereinafter.

Drawings

For the purposes of promoting a better understanding of the objects, aspects and advantages of the invention, reference will now be made to the following detailed description taken in conjunction with the accompanying drawings in which:

FIG. 1 is a block diagram of a method for computing system fault prediction based on time series data length optimization in accordance with the present invention;

FIG. 2 is a flow chart of the off-line training process steps in the method of the present invention;

FIG. 3 is a flow chart of the on-line prediction process steps in the method of the present invention;

FIG. 4 is a flowchart illustrating the steps of the model update process of the present invention.

Detailed Description

The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention in a schematic way, and the features in the following embodiments and examples may be combined with each other without conflict.

Referring to FIGS. 1 to 4, this embodiment takes fault prediction for the flight control system of a multi-rotor unmanned aerial vehicle as an example and describes the implementation steps of the fault prediction method of the invention. The flight control system is one of the key core systems of a multi-rotor unmanned aerial vehicle: it acquires information from the angular velocity sensor, attitude sensor, altitude and airspeed sensor, position sensor and the like, and implements flight management, attitude control and on-demand flight of the vehicle. Errors in the flight control system and its associated modules can have serious consequences during flight. However, the flight control system is generally implemented as an onboard embedded system, where the complexity of the associated modules and the limited resources restrict online fault tracing and elimination, so system fault prediction based on real-time operation data is crucial.

In this embodiment, fault prediction mainly uses 18 time-related sequence attributes. These include sensor information continuously acquired by the flight control system, such as gyroscope, accelerometer, barometer and GPS data; real-time software and hardware information during operation of the flight control system, such as CPU occupancy, memory occupancy and IO throughput; and software logs of the flight control process, such as flight status, flight time and flight distance. Depending on the attribute, the data acquisition frequency is 0.2 to 5 Hz. The fault/state types of the flight control system fall into 4 classes: GPS positioning fault, control instruction delay, unknown error and normal operation. To ensure that system faults can be handled in time, faults are predicted at a frequency of 1 Hz, and the prediction period is no more than 5 seconds.

According to the method content of the invention, firstly, 3 algorithm modules of model training, optimal sequence length searching and model updating are realized:

(1) Model training module: according to the specified data sequence length m_j and the fault prediction period n, generate a fault prediction model and its accuracy value.

① Data set generation: slice the continuous monitoring data to generate multiple groups of matrix data of size m_j × 18, and use the probabilities [y₁, y₂, y₃, y₄] of whether each type of fault occurs in the system 5 seconds after each group of matrix data as the data label, where y₄ is the probability that the system operates normally. Divide the labeled matrix data into a training data set S_j and a test data set T_j.

② Fault prediction model training: train a time-series deep learning neural network, such as LSTM or GRU, on the training data set S_j to obtain the parameters of the fault prediction model f_j. Add model f_j to the trained model set F. The input variable of model f_j is matrix data of size m_j × 18, and the output variable is the probability that each specific type of fault occurs after 5 seconds.

③ Model accuracy evaluation: use the prediction model f_j to predict on the sequence data in the test data set T_j, and compare the predicted failure probabilities ŷ_i with the actual failure probabilities y_i (1 ≤ i ≤ 4) to evaluate the model accuracy p_j. For the multi-class probability predictions, the average of MAE and RMSE is used as the model accuracy evaluation index, where MAE is the mean absolute error and RMSE is the root mean square error between the predicted and actual probabilities.

(2) Optimal sequence length search module: according to the initial data sequence length m_w and the fault prediction period n, search for the data sequence length with the best prediction accuracy.

① Search initialization: set the initial search count i = 0; if a sequence data length m_w is given, set the initial value of the input sequence data length to be searched to m₀ = m_w; otherwise set m₀ = n. Establish the set of sequence data lengths to be searched M_i = {m₀, m₁, m₂}, where the lower boundary m₁ = m₀/2 and the upper boundary m₂ = 2m₀.

② Model training module call: for each sequence length value m_j ∈ M_i (0 ≤ j ≤ 2), call the model training module to generate the corresponding fault prediction model f_j and obtain the prediction model accuracy p_j.

③ Subsequent search set generation: compare the accuracy indexes p_j of the prediction models built from sequence data of different lengths, and generate the data length set for the subsequent search:

If m₂ − m₁ ≤ 2, the search is finished and step ⑤ is executed.

If m₂ − m₁ > 2, regenerate the elements m_j of the search set M_i according to the following rules:

If p₀ ≥ p₁ ≥ p₂, continue searching in the interval [m₁, m₀], setting m₀’ = (m₀ + m₁)/2, m₁’ = m₁, m₂’ = m₀;

If p₀ ≥ p₂ ≥ p₁, or p₀ ≥ p₁ and p₂ − p₀ ≤ δ, continue searching in the interval [m₀, m₂], setting m₀’ = (m₀ + m₂)/2, m₁’ = m₀, m₂’ = m₂;

If p₁ ≥ p₂ ≥ p₀ or p₁ ≥ p₀ ≥ p₂, search in the direction of decreasing m₁, setting m₀’ = m₁, m₁’ = m₁/2, m₂’ = m₀;

If p₂ ≥ p₁ ≥ p₀, or p₂ ≥ p₀ ≥ p₁ and p₂ − p₀ > δ, search in the direction of increasing m₂, setting m₀’ = m₂, m₁’ = m₀, m₂’ = 2m₂;

④ Search set update: store p₀, p₁ and p₂ in the model accuracy set P, update the search count i = i + 1, update the set to be searched M_i = {m₀’, m₁’, m₂’}, and return to step ②.

⑤ Optimal search result storage: compare all model accuracies stored in the set P, select the k highest accuracy values p_v ∈ P, and calculate their average value p_A = (1/k) Σ p_v.

(3) Model updating module: obtain the prediction model f_w associated with the prediction period n and its accuracies p_w and p_A, and update the model according to the evaluation result on the latest test data set.

① Use the prediction model f_w to predict on the sequence data in the latest test data set T_w, and evaluate the average model accuracy p_w’.

② With the scaling factor x = 0.9, if x·p_A < p_w’ < x·p_w, call the optimal sequence length search module with the latest data sets S_w and T_w to re-search the sequence length and train the algorithm model f_w.

③ If p_w’ < x·p_A, set the starting sequence search length to the current sequence length m_w (e.g., 24 seconds), and call the algorithm training module with the latest data sets S_w and T_w to retrain the algorithm model f_w.

The operation of the flight control system is divided into two stages of non-flight and flight, so that the three processes of the invention are executed at different stages of the operation of the system.

(1) Offline training process: executed in the non-flight phase of the flight control system. Using historical data collected in advance, or simulation data generated for the flight of the unmanned aerial vehicle, the optimal sequence length search algorithm is called; by searching matrix data with lengths of 5 seconds, 10 seconds, 2.5 seconds, 20 seconds, 40 seconds, 30 seconds, 25 seconds, 26 seconds, 24 seconds and so on in sequence, it obtains the optimal sequence length m_w = 24 seconds, and the model training module is called during the search to obtain the optimal model f_w.

(2) Online prediction process: executed continuously in the in-flight phase of the flight control system. The stored, trained prediction model f_w is read; 18-attribute matrix data with a length of 24 seconds are extracted from the latest data and input into the model f_w to obtain the predicted probability ŷ_i of each type of system fault after 5 seconds. If the predicted probability of a specific fault is greater than or equal to the system maintenance probability threshold of 0.7, a fault warning is sent to the flight control background, which then takes over with manual flight or applies other control measures; otherwise, data continue to be read during the flight and the fault prediction for the next moment is performed.

(3) Model updating process: executed in both the in-flight and non-flight phases of the flight control system.

① The data acquisition process is executed in the in-flight phase; continuous data for the 18 attributes are recorded throughout the flight of the unmanned aerial vehicle.

② The data set updating process is executed in the non-flight phase; 18-attribute matrix data with a length of 24 seconds are continuously extracted from the latest recorded data at a data interval frequency of 1 Hz, the probability y_w of whether a specific fault occurred in the system 5 seconds after each group of sequence data is obtained, and the training data set S_w and the test data set T_w are updated.

③ The model evaluation and updating process is executed in the non-flight phase; using the latest data sets S_w and T_w, the model updating module is called to evaluate and update the algorithm model and the optimal sequence length value.

Finally, the above embodiments are only intended to illustrate the technical solutions of the present invention and not to limit the present invention, and although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions, and all of them should be covered by the claims of the present invention.

Claims

1. A method for predicting faults of a computing system based on time series data length optimization is characterized by comprising the following steps:

S1: offline training;

S2: online prediction;

S3: model updating;

2. The method for predicting faults of a computing system according to claim 1, wherein in the step S1, the off-line training specifically comprises the following steps:

S12: initial search set: take m₀ as the median value of the set of sequence data lengths to be searched, set the lower boundary m₁ = m₀/2 and the upper boundary m₂ = 2m₀, and establish the set of sequence data lengths to be searched M = {m₀, m₁, m₂};

S13: model training and evaluation: for each value m_j ∈ M, 0 ≤ j ≤ 2, if the trained model set F contains no fault prediction model f_j and prediction model accuracy p_j corresponding to m_j, train a prediction model and evaluate its accuracy;

if m₂ − m₁ > 2, regenerate the elements m_j of the set M according to the following rules:

3. The method of claim 2, wherein in step S13, if the trained model set F contains no fault prediction model f_j and prediction model accuracy p_j corresponding to m_j, the prediction model is trained and its accuracy evaluated by the following steps:

S132: fault prediction model training: train a time-series deep learning neural network on the training data set S_j to obtain the parameters of the fault prediction model f_j; the input variable of model f_j is sequence data of length m_j, and the output variable is the probability that a specific type of fault occurs after time n; add model f_j to the trained model set F;

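4. The method of claim 3, wherein in step S133, the mean absolute error (MAE) and the root mean square error (RMSE) between the predicted and actual failure probabilities are used as model accuracy evaluation indexes.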

5. The method for predicting faults of a computing system according to claim 2, wherein in step S2, the optimal sequence data length generated by offline training is used in a real-time fault prediction process, and the method specifically comprises the following steps:

If the probability that a specific fault occurs in the system is not less than the system maintenance probability threshold, the corresponding system maintenance strategy is executed and the method returns to step S21; otherwise, this step is repeated.

6. The method for predicting a failure in a computing system according to claim 2, wherein in step S3, the updating the model specifically includes the steps of:

if x·p_A < p_w’ < x·p_w, go to step S33;

if p_w’ < x·p_A, set the starting sequence search length to m_w, re-execute the offline training process to search for a new optimal sequence length and prediction model, and return to step S31 to continue updating the data set;

S33: model update: using the updated test data set and training data set, and without changing m_w, retrain the parameters of the model f_w with the time-series deep learning neural network; return to step S31 and continue updating the data set.

7. The method of predicting a failure in a computing system of claim 3 or 6, wherein the deep learning neural network comprises an LSTM or GRU network.