CN111027591B

CN111027591B - Node fault prediction method for large-scale cluster system

Info

Publication number: CN111027591B
Application number: CN201911107846.4A
Authority: CN
Inventors: 伍卫国; 毛海; 聂世强; 张驰; 董小社; 张兴军
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2019-11-13
Filing date: 2019-11-13
Publication date: 2022-07-12
Anticipated expiration: 2039-11-13
Also published as: CN111027591A

Abstract

The invention discloses a node fault prediction method for a large-scale cluster system. Resource occupation data of each node is collected to generate a data set; a first model, for data prediction, is built with a long short-term memory (LSTM) network; and a second model, for fault prediction, is built with a random forest. A first observation window is established and its size is checked; if the size does not meet the set value, the window is rebuilt. If the set value is met, the first (data prediction) model predicts the data in the lead time window, and the first observation window is combined with the predicted lead time window data to form a second observation window. The size of the second observation window is checked; if it does not meet the set value, the second observation window is rebuilt; if it does, the second (fault prediction) model predicts whether a fault occurs within the prediction window. The invention thereby maximizes the accuracy of the prediction model while ensuring that sufficient lead time remains for handling the node fault.

Description

Node fault prediction method for large-scale cluster system

Technical Field

The invention belongs to the technical field of reliability and availability of computer systems, and particularly relates to a node fault prediction method for a large-scale cluster system.

Background

Cluster systems are common platforms for high performance computing, cloud computing, and data centers. With the ever-increasing size and complexity of these platforms, system reliability becomes a major issue because the mean time between failures (MTBF) of the system decreases as the number of system components increases. Recent research indicates that the reliability of existing data centers and cloud computing systems is limited by an MTBF of 10 to 100 hours. Data centers typically have a high failure rate because they contain many servers and components. Furthermore, long-running applications and intensive workloads are common in these facilities. The performance of the system depends on the availability of its machines, and it suffers readily if failures are not handled well.

To meet the increasing demand for cloud computing, internet companies such as Google, Facebook, and Amazon typically deploy a large number of servers in their data centers. These servers are heavily loaded and handle a wide variety of requests. In such a high-availability computing environment, when one server in a cluster fails, its workload is typically shifted to another machine in the same cluster, which increases the likelihood that other servers fail.

Server failures can result in data loss and resource blocking because the machine suddenly becomes unavailable. In the worst case, these failures can crash the data center, causing unexpected downtime, and restoring the data is very costly. According to the data center outage report issued by the Ponemon Institute in 2016, restoring data costs an average of $9,000 per minute and up to $17,000 per minute. In the Microsoft cloud system, fewer than 0.1% of all server nodes experience a failure each day, but this still has a significant impact on services targeting 99.999% or higher availability. Node failure is therefore one of the major causes of service outages.

Online fault prediction is a technique that predicts faults by analyzing a machine's historical fault data and the current state of the system, so that the adverse effects of faults on the cluster are avoided or reduced; it is an important means of improving the reliability and availability of a storage system. Although predicting the next failure of a machine appears to be a viable and promising way to improve data center reliability, it presents two major challenges. The first is that the prediction needs to be highly accurate, especially with respect to reducing false positives. The second is how to select a suitable lead time. If the lead time is too long, the significant features that appear shortly before the failure cannot be fully used, so model accuracy is low; if the lead time is too short, prediction accuracy improves, but the administrator does not have enough time to operate on the node to avoid the failure.

Disclosure of Invention

The technical problem to be solved by the present invention is to address the defects of the prior art by providing a node fault prediction method for a large-scale cluster system that maximizes the accuracy of the prediction model while ensuring that sufficient lead time is available for handling node faults.

The invention adopts the following technical scheme:

A node fault prediction method for a large-scale cluster system collects resource occupation data of each node and generates a data set, builds a first data prediction model using a long short-term memory (LSTM) network, and builds a second fault prediction model using a random forest. A first observation window is built and its size is checked; if the size does not meet the set value, the first observation window is rebuilt. If the set value is met, the first data prediction model predicts the data in the lead time window, and the first observation window is combined with the data in the lead time window to form a second observation window. The size of the second observation window is checked; if it does not meet the set value, the second observation window is rebuilt; if it does, the second fault prediction model predicts the fault within the prediction window.

Specifically, each node collects actual operating parameters; n unit time windows are taken to form an observation window and generate the data set; and each data item of the node in the lead time window is predicted from the corresponding data in the observation window.

Further, the nodes collect actual operating parameters every 5 minutes.

Further, each item of prediction data Y_{r,τ} in time period τ is:

Y_{r,τ} = f(P(t))

where f denotes the model to be solved, P(t) is the vector formed by all data, t ∈ (1, τ-1), and r ∈ resources.

Specifically, the input of the long short-term memory network comprises the number of training samples, the time steps, and the feature values, where the feature values are represented by the vector P(t) formed by all data.

Furthermore, the correlation between each feature value and faults is obtained by calculating Pearson correlation coefficients, and the 9 feature values whose correlation coefficient is larger than 0.1 are selected from the actual operating parameters collected by the nodes as the final feature values.

Further, the feature value data is: mean CPU usage rate, canonical memory usage, total page cache memory usage, maximum memory usage, mean disk I/O time, mean local disk space used, maximum CPU usage, maximum disk I/O time, and memory accesses per instruction.

Specifically, the input of the random forest is a vector P_(t) consisting of the feature values in the first observation window and a vector Y_(t1) consisting of the feature values in the lead time window. The prediction yields whether a fault occurs in the prediction window, denoted y:

y = f(P_(t), Y_(t1))

where f represents the model to be solved, 1 represents a fault, and 0 represents no fault.

Compared with the prior art, the invention at least has the following beneficial effects:

The node fault prediction method for a large-scale cluster system can accurately predict how the resource occupancy of a node will change over a future period of time. The random forest then predicts the node fault from the predicted resource occupancy data together with the real resource occupancy data. Because node fault prediction only needs to predict the machine state in the next time period, it is a binary classification problem, and the random forest achieves relatively high accuracy among classification algorithms. The random forest is not prone to overfitting, can handle high-dimensional data without feature selection, and adapts well to different data sets.

Furthermore, the first-stage data prediction predicts the resource occupancy data of the nodes in the lead time window. This overcomes the drawback of traditional fault prediction methods, in which no data is available for the lead time window, so the second-stage node fault prediction can make full use of the lead time window data and prediction accuracy improves.

Furthermore, a node exposes many indicators related to resource occupancy, and different feature values influence the fault prediction algorithm differently. Calculating the Pearson correlation coefficient between each feature value and faults determines which feature values the prediction needs and keeps useless feature values from affecting fault prediction.

In summary, the present invention exploits the strengths of LSTM, both for data that is strongly correlated with its time series and for data points that are far apart in the sequence, to predict the data in the lead time window effectively. The predicted data is then combined with the real data to form the data in an observation window, and the final fault prediction is made with a random forest. Lead time is thus reserved for dealing with faults, and the data in the lead time window is still fully used, which safeguards the accuracy of the model.

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

Drawings

FIG. 1 is a time window definition diagram;

FIG. 2 is a diagram of a new time window definition;

FIG. 3 is a view of the internal structure of the LSTM;

FIG. 4 is a prediction flow chart according to the present invention.

Detailed Description

Referring to fig. 4, the node fault prediction method for a large-scale cluster system of the present invention collects the resource occupancy data of each node and processes it to generate a data set, builds a first data prediction model using a long short-term memory (LSTM) network, and builds a second fault prediction model using a random forest. The first observation window data is established, and the size of the first observation window is checked against 3 hours; if it is not equal to 3 hours, the first observation window is rebuilt. If it is, the first data prediction model predicts the data in the lead time window, and the first observation window is combined with the data in the lead time window to form a second observation window. The size of the second observation window is checked against 4 hours; if it is not equal to 4 hours, the second observation window is rebuilt; if it is, the second fault prediction model predicts the fault within the prediction window.
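The flow of fig. 4 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `predict_fault` and the two callables standing in for the LSTM and the random forest are hypothetical, and the 1-hour lead time window is inferred from the difference between the 4-hour and 3-hour window sizes stated above.

```python
from typing import Callable, List, Optional, Sequence

SAMPLE_MINUTES = 5                           # sampling period stated in the text
FIRST_WINDOW = 3 * 60 // SAMPLE_MINUTES      # 3 hours -> 36 samples
SECOND_WINDOW = 4 * 60 // SAMPLE_MINUTES     # 4 hours -> 48 samples
LEAD_WINDOW = SECOND_WINDOW - FIRST_WINDOW   # inferred: 1 hour -> 12 samples

Sample = Sequence[float]  # one 5-minute record of feature values


def predict_fault(
    history: List[Sample],
    data_model: Callable[[List[Sample], int], List[Sample]],
    fault_model: Callable[[List[Sample]], int],
) -> Optional[int]:
    """Return 1 (fault), 0 (no fault), or None if a window must be rebuilt."""
    first_window = list(history[-FIRST_WINDOW:])
    if len(first_window) != FIRST_WINDOW:
        return None  # not enough data yet: rebuild the first window

    # Stage 1: forecast the data inside the lead time window.
    lead_data = data_model(first_window, LEAD_WINDOW)
    second_window = first_window + list(lead_data)
    if len(second_window) != SECOND_WINDOW:
        return None  # forecast has the wrong length: rebuild the second window

    # Stage 2: classify whether a fault occurs in the prediction window.
    return fault_model(second_window)
```

Returning `None` here stands for the "return and rebuild" branches of the flow chart; a caller would typically wait for more samples and try again.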

S1. Predicting node resource occupancy based on a long short-term memory (LSTM) network.

In fault prediction, data within an observation window is typically used to predict whether a fault will occur within a prediction window. The closer the data is to the moment of the fault, the more obvious the signs of the fault, and therefore the more important the feature values. However, to ensure that the administrator has enough time to deal with the fault, a lead time must be reserved, namely the lead time window in fig. 1. As a result, the data in the lead time window cannot be used when the prediction is made, which reduces prediction accuracy. To make use of the data in the lead time window and preserve prediction accuracy, a node resource occupancy prediction method based on a long short-term memory (LSTM) network is proposed: the LSTM predicts the data in the lead time window, which enlarges the observation window. The resulting new time window layout is shown in fig. 2.

LSTM (long short-term memory) is a special kind of RNN (recurrent neural network) that can learn long-term dependencies; it is explicitly designed to avoid the long-term dependency problem. LSTM is well suited to data that is strongly correlated with its time series, and it retains this advantage when the relevant points are far apart in the sequence.

The data set is generated from the actual operating parameters collected by each node every 5 minutes, with n unit time windows forming one observation window.

For feature selection, the Pearson correlation coefficient between each candidate feature value and the fault is calculated, and the 9 feature values whose correlation coefficient is larger than 0.1 are selected from the actual operating parameters collected by the nodes as the final feature values. The selected feature values are:

mean CPU usage rate, canonical memory usage, total page cache memory usage, maximum memory usage, mean disk I/O time, mean local disk space used, maximum CPU usage, maximum disk I/O time, memory accesses per instruction.
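The selection rule can be illustrated with a small standard-library sketch. The helper names `pearson` and `select_features` and the toy data are hypothetical; the threshold of 0.1 and the "larger than" comparison follow the text literally (the text does not say whether the absolute value of the coefficient is used).

```python
import math
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(columns: Dict[str, List[float]], fault: List[float],
                    threshold: float = 0.1) -> List[str]:
    """Keep features whose correlation with the fault label exceeds the threshold."""
    return [name for name, values in columns.items()
            if pearson(values, fault) > threshold]


# Toy data: one informative feature and one constant (useless) feature.
toy_columns = {
    "mean_cpu_usage_rate": [0.2, 0.3, 0.9, 0.8],
    "constant_metric": [0.5, 0.5, 0.5, 0.5],
}
toy_fault = [0.0, 0.0, 1.0, 1.0]
selected = select_features(toy_columns, toy_fault)
```

In this toy case only `mean_cpu_usage_rate` passes the threshold, because a constant column has no correlation with the fault label.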

Over the time period from 1 to τ-1, each feature value is normalized by its own maximum so that it ranges from 0 to 1. The vector formed by the normalized feature value data is denoted P(t):

P(t) = U_{r,t}, t ∈ (1, τ-1), r ∈ resources
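As a sketch of the max normalization described here, assuming non-negative measurements stored as a table with one row per time period and one column per feature (the function names are hypothetical):

```python
from typing import List


def normalize_by_max(values: List[float]) -> List[float]:
    """Scale non-negative measurements to [0, 1] by dividing by their maximum."""
    peak = max(values)
    if peak == 0:
        return [0.0 for _ in values]  # an all-zero series stays all-zero
    return [v / peak for v in values]


def normalize_features(series: List[List[float]]) -> List[List[float]]:
    """Normalize each feature column of a (time, feature) table independently."""
    columns = [normalize_by_max(list(col)) for col in zip(*series)]
    return [list(row) for row in zip(*columns)]


# Two time periods, two features: each column is divided by its own maximum.
print(normalize_features([[2.0, 10.0], [4.0, 5.0]]))  # [[0.5, 1.0], [1.0, 0.5]]
```

Normalizing per feature keeps resources with very different units, such as CPU rate and disk I/O time, on a comparable scale.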

The LSTM input layer takes the number of training samples (samples), the time steps (time steps), and the feature values (features). The time step is the number of preceding time periods of input data to which each data point is related. The feature values are represented by the vector P(t).
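The three input dimensions can be made concrete with a sliding-window sketch. The function `make_lstm_samples` is hypothetical, and pairing each window with the record that immediately follows it is an assumption about how training targets are formed; the values 36 and 9 are the time step and feature count given in the parameter settings.

```python
from typing import List, Tuple

TIME_STEPS = 36   # 3 hours of 5-minute samples
N_FEATURES = 9    # the nine selected feature values


def make_lstm_samples(
    series: List[List[float]], time_steps: int = TIME_STEPS
) -> Tuple[List[List[List[float]]], List[List[float]]]:
    """Slide a window over a (time, feature) table.

    Returns X with shape (samples, time_steps, features) and y with shape
    (samples, features), where y[i] is the record right after window X[i].
    """
    xs, ys = [], []
    for start in range(len(series) - time_steps):
        xs.append(series[start:start + time_steps])
        ys.append(series[start + time_steps])
    return xs, ys


# 40 records of 9 features yield 40 - 36 = 4 training samples.
demo_series = [[float(t)] * N_FEATURES for t in range(40)]
X, y = make_lstm_samples(demo_series)
print(len(X), len(X[0]), len(X[0][0]))  # 4 36 9
```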

The invention uses the feature values selected above. The measured values are normalized by their respective maxima to the range 0 to 1. With f denoting the model to be solved and Y_{r,τ} denoting each item of predicted data in time period τ, the prediction is expressed as:

Y_{r,τ} = f(P(t))

where t ∈ (1, τ-1), r ∈ resources.

In this way, the data of each node in the lead time window is predicted from the data of that node in the observation window.
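The text does not state how the model produces several future periods at once. One common option, shown here purely as an assumption, is recursive forecasting: a one-step model is applied repeatedly and each prediction is fed back as input. The names `forecast_lead_window` and `one_step_model` are hypothetical, and the averaging model below is only a stand-in for a trained LSTM.

```python
from typing import Callable, List

Record = List[float]  # feature values for one time period


def forecast_lead_window(observed: List[Record],
                         one_step_model: Callable[[List[Record]], Record],
                         lead_steps: int) -> List[Record]:
    """Predict lead_steps future records by feeding predictions back as input."""
    window = list(observed)
    predicted: List[Record] = []
    for _ in range(lead_steps):
        nxt = one_step_model(window)
        predicted.append(nxt)
        window = window[1:] + [nxt]  # slide forward, keeping the length fixed
    return predicted


def mean_model(window: List[Record]) -> Record:
    """Stand-in for f: predicts the per-feature mean of the current window."""
    return [sum(col) / len(window) for col in zip(*window)]


print(forecast_lead_window([[0.0], [1.0]], mean_model, 3))
# [[0.5], [0.75], [0.625]]
```

A model that outputs the whole lead time window in one call would fit the description equally well; the recursive form is chosen here only because it needs a single one-step predictor.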

The internal structure of the LSTM is shown in fig. 3. A gate lets information through selectively and is implemented mainly by a sigmoid neural network layer followed by a point-wise multiplication. The LSTM has 3 gates, namely a forget gate, an input gate, and an output gate, which is why it contains 3 point-wise multiplications.

The forget gate is used to decide which information to discard from the cell state.

The input gate is used to determine which update information is stored in the cell state. This process requires the following steps:

First, the sigmoid layer decides which information needs to be updated, and the tanh layer generates a vector of candidate values in the range (-1, 1). These two parts together form the input gate, and the two vectors are then combined to create the update value.

Next, the old state, scaled by the forget gate, and the update value are added to obtain the new state. The output gate determines what is output. Based on the cell state, a sigmoid layer is first run to decide which part of the cell state to output.

Finally, the cell state is passed through tanh, which maps its values to between -1 and 1, and is multiplied by the output of the sigmoid gate, so that only the selected part is output.
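The gate operations described above correspond to the standard LSTM cell equations. The following sketch runs one step for a scalar input and scalar state so that the three point-wise multiplications are easy to see; the function `lstm_step` and the weight layout are illustrative assumptions, not taken from the source.

```python
import math
from typing import Dict, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x: float, h_prev: float, c_prev: float,
              w: Dict[str, Tuple[float, float, float]]) -> Tuple[float, float]:
    """One LSTM step for scalar input and scalar state.

    w maps each gate name to (weight_on_x, weight_on_h_prev, bias).
    Returns the new output h and the new cell state c.
    """
    def pre(gate: str) -> float:
        wx, wh, b = w[gate]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget"))        # forget gate: how much old state to keep
    i = sigmoid(pre("input"))         # input gate: how much new information to store
    g = math.tanh(pre("candidate"))   # candidate values in (-1, 1)
    c = f * c_prev + i * g            # new cell state: kept old state plus update
    o = sigmoid(pre("output"))        # output gate: which part of the state to emit
    h = o * math.tanh(c)              # cell state squashed to (-1, 1), then gated
    return h, c


# With all-zero weights every gate is 0.5 and the candidate is 0,
# so the cell state is simply halved: c = 0.5 * c_prev.
zero_weights = {k: (0.0, 0.0, 0.0) for k in ("forget", "input", "candidate", "output")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=1.0, w=zero_weights)
```

The three multiplications are `f * c_prev`, `i * g`, and `o * math.tanh(c)`, one per gate.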

The LSTM parameters are set as follows:

The time step is set to 36 (3 hours at one time period every 5 minutes), i.e. each data point is related to the data of the previous 36 time periods.

The number of feature values is set to 9.

The activation function (activation) is set to 'relu'.

Dropout is set to 0.2.

batch_size is set to 196.

The number of hidden layer nodes is set to 5.
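For reference, these settings can be collected in one place and the time step derived from the window length. The dictionary keys are hypothetical names chosen for this sketch, and the source does not say which deep learning library is used.

```python
OBSERVATION_HOURS = 3
SAMPLE_MINUTES = 5

# Values from the parameter list above; key names are illustrative only.
lstm_config = {
    "time_steps": OBSERVATION_HOURS * 60 // SAMPLE_MINUTES,  # 180 / 5 = 36
    "n_features": 9,
    "activation": "relu",
    "dropout": 0.2,
    "batch_size": 196,
    "hidden_units": 5,
}

# Shape of one training batch: (batch_size, time_steps, n_features).
batch_shape = (
    lstm_config["batch_size"],
    lstm_config["time_steps"],
    lstm_config["n_features"],
)
print(batch_shape)  # (196, 36, 9)
```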

S2. Predicting faults based on a random forest.

The fault prediction algorithm plays an important role in prediction accuracy. Node fault prediction generally uses a supervised machine learning method: because the resource occupancy and machine state of a node differ from one time period to another, an unsupervised method cannot establish the correspondence between resource occupancy and machine state well, which reduces model accuracy.

The random forest is a supervised learning algorithm: an ensemble learning algorithm that uses decision trees as base learners. It is not prone to overfitting, can handle high-dimensional data without feature selection, and adapts well to different data sets. Among classification algorithms, the random forest achieves relatively high accuracy. A random forest is therefore used for fault prediction.

The fault prediction based on the random forest specifically comprises the following steps:

After the first stage, the node resource occupancy data in the lead time window has been predicted. This predicted data is combined with the first observation window that precedes the lead time window (the observation window in fig. 1) to form the second observation window (the observation window in fig. 2), and a random forest is used to predict whether a fault will occur in a future period of time (i.e., the prediction window).

The input of the random forest is a vector P_(t) consisting of the feature values in the first observation window and a vector Y_(t1) consisting of the feature values in the lead time window. With f representing the model to be solved and y representing whether a fault occurs in the prediction window, the prediction is expressed as:

y = f(P_(t), Y_(t1))

The prediction yields whether a fault occurs in the prediction window, where 1 represents a fault and 0 represents no fault.
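One way to hand P_(t) and Y_(t1) to a classifier is to concatenate them in time order and flatten the result into a single feature vector. The flattening step and the function name `build_forest_input` are assumptions made for this sketch; the source only states that both vectors form the input.

```python
from typing import List

Record = List[float]  # feature values for one 5-minute period


def build_forest_input(observed: List[Record], predicted: List[Record]) -> List[float]:
    """Concatenate P(t) and Y(t1) in time order and flatten to one vector."""
    second_window = observed + predicted
    return [value for record in second_window for value in record]


# 36 observed periods plus 12 predicted periods, 9 features each.
p_t = [[0.1] * 9 for _ in range(36)]
y_t1 = [[0.2] * 9 for _ in range(12)]
x = build_forest_input(p_t, y_t1)
print(len(x))  # 432
```

With a 3-hour first window and a 1-hour lead time window, each sample therefore has (36 + 12) × 9 = 432 input values under this layout.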

The random forest parameters were set as follows:

n_estimators is set to 20.

max_depth is set to 50.

min_samples_leaf is set to 20.

min_samples_split is set to 30.
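These names coincide with parameters of scikit-learn's `RandomForestClassifier`, although the source does not name a library. Assuming the usual meaning of those parameters, the sketch below shows how the three tree-shape settings constrain a split; `split_allowed` is a hypothetical helper written for illustration, not a library function.

```python
forest_params = {
    "n_estimators": 20,       # number of trees in the forest
    "max_depth": 50,          # maximum depth of each tree
    "min_samples_leaf": 20,   # minimum samples in each child after a split
    "min_samples_split": 30,  # minimum samples a node needs to be split
}


def split_allowed(n_node: int, n_left: int, n_right: int, depth: int,
                  params: dict = forest_params) -> bool:
    """Check whether a tree node at the given depth may be split this way."""
    if depth >= params["max_depth"]:
        return False  # the tree is already at its depth limit
    if n_node < params["min_samples_split"]:
        return False  # too few samples to consider a split at all
    # Both children must keep at least min_samples_leaf samples.
    return min(n_left, n_right) >= params["min_samples_leaf"]


print(split_allowed(40, 20, 20, depth=0))  # True
print(split_allowed(40, 10, 30, depth=0))  # False: left child has only 10 samples
```

Under this reading, a node needs at least 40 samples to be split, since each child must keep 20, so `min_samples_leaf` is the tighter of the two sample constraints.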

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, as presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The node resource occupancy prediction method based on LSTM is shown in Table 1:

Table 1. Node resource occupancy prediction method based on LSTM

The fault prediction method based on random forests is shown in table 2.

Table 2. Fault prediction method based on random forest

At present, research on node fault prediction at home and abroad does not fully consider the lead time window, and a lead time is generally not set, so the administrator does not have sufficient time to take measures to avoid the fault. Moreover, the data just before a fault shows the clearest symptoms of that fault, yet even when a lead time is set, the data in the lead time window cannot be used, which reduces prediction accuracy.

The invention exploits the strengths of LSTM, both for data that is strongly correlated with its time series and for data points that are far apart in the sequence, to predict the data in the lead time window effectively. The predicted data is then combined with the real data to form the data in an observation window, and the final fault prediction is made with a random forest. Lead time is thus reserved, and the data in the lead time window is still fully used, which safeguards the accuracy of the model.

The above-mentioned contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention is not limited thereby, and any modification made on the basis of the technical idea of the present invention falls within the protection scope of the claims of the present invention.

Claims

1. A node fault prediction method for a large-scale cluster system, characterized in that resource occupation data of all nodes is collected and a data set is generated, a first data prediction model is built using a long short-term memory network, a second fault prediction model is built using a random forest, a first observation window is built, the size of the first observation window is checked, and if the size does not meet a set value, the first observation window is rebuilt; if the set value is met, data in the lead time window is predicted using the first data prediction model, the first observation window and the data in the lead time window are combined to form a second observation window, the size of the second observation window is checked, and if the set value is not met, the second observation window is rebuilt; if it is met, the fault in the prediction window is predicted using the second fault prediction model;

when fault prediction is carried out, whether a fault occurs in the prediction window is predicted using the data in the second observation window; a period of time is reserved before the fault occurrence time to serve as the lead time window; the first observation window precedes the lead time window and the prediction window follows the lead time window; and whether a fault will occur in the prediction window is predicted using the random forest.

2. The large-scale cluster system-oriented node fault prediction method of claim 1, wherein each node collects actual operation parameters, takes the size of n unit time windows to form an observation window and generate a data set, and predicts each data of the node in an advance time window by using each data of the node in the observation time window.

3. The large-scale cluster system-oriented node failure prediction method according to claim 2, wherein the period for collecting actual operation parameters by the nodes is every 5 minutes.

4. The large-scale cluster system-oriented node fault prediction method of claim 2, wherein each item of prediction data Y_{r,τ} in the time period τ is:

Y_{r,τ} = f(P(t))

5. The large-scale cluster system-oriented node fault prediction method of claim 1, wherein the input of the long short-term memory network comprises the number of training samples, the time steps, and the feature values, and the feature values are represented by a vector P(t) composed of all data.

6. The large-scale cluster system-oriented node fault prediction method of claim 5, wherein the correlation between each feature value and a fault is obtained by calculating Pearson correlation coefficients, and 9 feature values with correlation coefficients larger than 0.1 are selected from the actual operating parameters collected by the nodes as the final feature values.

7. The large-scale cluster system-oriented node fault prediction method of claim 1, wherein the input of the random forest is a vector P_(t) consisting of the feature values in the first observation window and a vector Y_(t1) consisting of the feature values in the lead time window, the prediction yields whether a fault occurs in the prediction window, and whether a fault occurs within the prediction window, y, is represented as:

y = f(P_(t), Y_(t1))