CN115373879A - Intelligent operation and maintenance disk fault prediction method for large-scale cloud data center - Google Patents


Info

Publication number
CN115373879A
CN115373879A (application CN202211039310.5A)
Authority
CN
China
Prior art keywords
data
disk
data set
value
disk failure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211039310.5A
Other languages
Chinese (zh)
Inventor
徐小龙
徐诗成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202211039310.5A priority Critical patent/CN115373879A/en
Publication of CN115373879A publication Critical patent/CN115373879A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 - Error detection; Error correction; Monitoring
    • G06F 11/008 - Reliability or availability analysis
    • G06F 11/30 - Monitoring
    • G06F 11/34 - Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F 11/3447 - Performance evaluation by modeling
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Hardware Design (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a disk failure prediction method for intelligent operation and maintenance of a large-scale cloud data center, comprising the following steps. First, information-entropy feature processing is applied to the imbalanced data to select the more important features. The processed imbalanced data are then divided, and the minority-class sample data, i.e. the failure samples, are extracted. Next, data enhancement is performed on the failure samples using the time-progressive sampling method TPS to generate synthetic data; by generating more failure sample data through TPS, the ratio of the number of healthy samples to the number of failure samples reaches a better balance. The synthetic data with good generation quality are then combined with the original data to produce the integrated data. Finally, the integrated data are input into a disk failure prediction model for training, a time window of 7 days is selected to predict whether a failure will occur after 7 days, and the data are labeled accordingly.

Description

Intelligent operation and maintenance disk fault prediction method for large-scale cloud data center
Technical Field
The invention belongs to the field of operation and maintenance of cloud data centers, and particularly relates to a disk fault prediction method for intelligent operation and maintenance of a large-scale cloud data center.
Background
Magnetic disks are widely used as the general-purpose primary storage devices of large storage systems in modern large-scale data centers. In such data centers, ensuring high availability and reliability is a very challenging task, because disk failures occur continuously in the field and are a main cause of service interruption in the cloud data center; if no data-redundancy scheme is deployed, a disk failure may cause temporary data loss and thus system unavailability, or even permanent data loss. Disk failure prediction is a key link in the intelligent operation and maintenance of a cloud data center: it allows operators to anticipate problems and resolve them quickly, so that servers keep running normally and the consequences of failures are effectively mitigated.
With the arrival of the big data era and the rapid development of technologies such as machine learning and deep learning, complex neural network models can be used, with the support of strong computing power, to mine and extract key information from massive data.

Meanwhile, with intensive research on SMART data, its time-series characteristics have gradually attracted researchers' attention. Therefore, more and more researchers are attempting to implement disk failure prediction using time-series processing methods.
However, in practical application, the SMART data of a disk failure acquired by the data center is degradation data from a healthy state to a failure state, and the actual failure time of the disk is unknown. That is to say, for a piece of failed-disk data, we can only say that there is failure data somewhere in its data sequence, but cannot locate it precisely. Faced with this problem, a natural solution is to treat the degradation data of the disk as one sample.

However, the degradation data of a disk is often long-term serial data of varying length, and how to classify such non-fixed-length time-series data is an important and challenging problem in data mining. Even LSTM neural networks do not work well in the face of such long time-series data. In addition, the failure phase of a disk during operation is fast and short. Therefore, the proportion of abnormal data in the life-cycle data is very small, so that the error information is buried in a large amount of healthy data. This is known as the imbalance problem, which poses a serious challenge to conventional classification methods.
The current disk failure prediction methods are mainly divided into two types: based on a traditional machine learning method and a deep neural network method.
(1) Based on traditional machine learning methods, some research works use the SMART attributes and a Bayesian network to predict disk failures: a subset of the SMART attributes that best describes the data is selected through feature selection, a binning process and feature creation, and is used together with a group of trend indicators based on the same SMART attributes; however, the time-series features of the data are not well considered, and the effect of a dynamic Bayesian network remains to be studied. Other research works treat failure prediction as a binary classification problem while taking into account the mean time between the predicted and actual failures, and evaluate model performance by the failure detection rate (FDR), defined as the proportion of failed drives correctly classified as failed, and the false alarm rate (FAR), defined as the proportion of good drives incorrectly classified as failed.
(2) Based on deep neural network methods, some research works use the long short-term memory model (LSTM) and different data balancing methods to predict disk failures 5-7 days in advance, which alleviates the problem of model aging and widens the time range of disk failure prediction. However, prediction is made in units of days and the IO criterion for raising an alarm is ignored, so some false positives remain in the prediction. Other research works focus on predicting disk failures using sequential information: they use a data set collected from a real-world data center containing 3 different disk models (denoted as W, S and M), build prediction models for these disk models separately, model the long-term dependencies in the sequential SMART data, and demonstrate its predictive ability.
Disclosure of Invention
The purpose of the invention is as follows: in order to solve the problems in the prior art, the invention provides a disk failure prediction method for intelligent operation and maintenance of a large-scale cloud data center, which fuses the advantages of the Transformer and optimizes it with the time-progressive sampling TPS method, realizing data enhancement of imbalanced data and the classification and prediction of disk failures.
The technical scheme is as follows: in order to solve the technical problems, the technical scheme adopted by the invention is as follows:
in a first aspect, a disk failure prediction method for intelligent operation and maintenance of a large-scale cloud data center is provided, and includes:
step 1, carrying out missing value filling, data normalization and information entropy processing on the imbalanced original data set to obtain the most relevant and frequently changing characteristic attributes, namely a data set G;

step 2, dividing the data set G according to the labels into a source data set S1 formed by the minority samples with failure labels and a source data set S2 formed by the majority samples with non-failure labels;
step 3, performing data enhancement on the source data set S1 by adopting a time progressive sampling TPS method to generate synthetic data to obtain a synthetic data set T;
step 4, integrating the source data set S1, the source data set S2 and the synthetic data set T to form an integrated data set Q, and dividing the integrated data set Q into a training set M and a test set N;
step 5, training the disk failure prediction model by using the training set M, and testing the trained disk failure prediction model by using the test set N until the model prediction effect meets the requirement, so as to obtain the trained disk failure prediction model;
step 6, inputting SMART data of the disk to be detected into a trained disk failure prediction model;
and 7, determining a disk failure prediction result according to the output of the disk failure prediction model.
In some embodiments, the missing value filling comprises: if two or more values are missing in succession, the mode of that SMART entry on the disk is used as the filling value; if only one value is missing, the mean of the values before and after it is used as the filling value.
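The filling rule above can be sketched in code. This is an illustrative reconstruction, not the patent's implementation; the function name and the list-of-floats-with-None representation are assumptions:

```python
# Illustrative sketch of the missing-value filling rule: a run of two or
# more gaps takes the mode of the disk's observed SMART values, while a
# lone gap takes the mean of its two neighbors.
from statistics import mode

def fill_missing(series):
    """series: list of floats with None marking missing entries."""
    observed = [v for v in series if v is not None]
    series_mode = mode(observed)  # most frequent observed value
    out = list(series)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while j < len(out) and out[j] is None:
                j += 1                      # find the end of the gap run
            if j - i >= 2:
                for k in range(i, j):
                    out[k] = series_mode    # long gap: use the mode
            else:
                prev = out[i - 1] if i > 0 else None
                nxt = out[j] if j < len(out) else None
                if prev is not None and nxt is not None:
                    out[i] = (prev + nxt) / 2  # lone gap: neighbor mean
                else:
                    out[i] = prev if prev is not None else nxt
            i = j
        else:
            i += 1
    return out
```

How boundary gaps (a missing first or last value) are treated is not specified in the text; the sketch falls back to the single available neighbor.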
In some embodiments, the data normalization comprises:
scaling all values into the interval [0,1] using the maximum and minimum values of each feature, with the following scaling formula:

x' = (x - x_min) / (x_max - x_min)

where x is the original value of the feature, x_max and x_min are respectively the maximum and minimum values of the feature in the data set, and x' is the scaled feature value.
In some embodiments, the information entropy processing comprises: calculating a value for each characteristic attribute to express its information content, with the following formula:

H(U) = -∑_{i=1}^{n} p_i log₂ p_i

where i denotes the i-th sample out of n samples in total, and p_i is the probability of each value appearing in the SMART attribute; the higher the information entropy H(U) of a feature, the more information it contains, meaning the fluctuation of the feature attribute is more pronounced, so the most relevant and frequently changing feature attributes are selected.
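The entropy-based selection can be sketched as follows. `attribute_entropy` and `top_k_features` are hypothetical helper names, and the base-2 logarithm is an assumption:

```python
# Hedged sketch of entropy-based feature ranking: compute the Shannon
# entropy of each SMART attribute's value distribution and keep the
# highest-entropy (most variable) attributes.
import numpy as np

def attribute_entropy(column):
    """Shannon entropy of one SMART attribute's observed values."""
    values, counts = np.unique(column, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def top_k_features(X, k):
    """Indices of the k highest-entropy feature columns of X."""
    scores = [attribute_entropy(X[:, j]) for j in range(X.shape[1])]
    return sorted(range(X.shape[1]), key=lambda j: -scores[j])[:k]
```

A constant attribute has entropy 0 and is discarded first, matching the intuition that unchanging SMART attributes carry no predictive signal.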
In some embodiments, in step 3, performing data enhancement on the source data set S1 by the time-progressive sampling TPS method comprises:

generating and collecting failure data from the source data set S1 with the time-progressive sampling TPS method, calculating the loss between the generated data and the original data, and judging whether the loss is smaller than a set threshold; if so, the generated data are collected, otherwise the operation is repeated, until the synthetic data set T is obtained.
In some embodiments, for a given failed disk, suppose the disk failure occurs at timestamp t and the prediction operation occurs at timestamp t-i; the time period of length i between the prediction action at t-i and the disk failure at t is denoted as the lead period i;

during model training, for each failed disk, the TPS gradually collects more failure data samples over the lead periods, i.e. the lead period i ranges from 1 to I, where I is a hyper-parameter of the TPS;

there are also two important parameters in the TPS method:

the window_length is defined as the size of the time window of the training network input data in each sequence sample, and predict_failure_days is defined as the number of days before failure.

Further, in some embodiments, if the window length is 5, one training sample will contain the SMART attribute information of the disk over the past 5 days; the value of predict_failure_days lies within 5-7 days.
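Under the definitions above, the progressive collection of failure windows can be sketched as follows. This is an illustrative reconstruction (the patent does not publish code, and the function signature is an assumption): for one failed disk, one window is taken per lead period i = 1..I, each ending i steps before the failure timestamp.

```python
# Illustrative sketch of TPS sample collection for one failed disk whose
# SMART sequence is ordered oldest-first and fails at the last index.
def tps_collect(sequence, window_len, max_lead_I):
    """Returns a list of (lead_period, window) pairs."""
    t = len(sequence)  # failure occurs at timestamp t (end of sequence)
    samples = []
    for i in range(1, max_lead_I + 1):
        end = t - i            # window ends i steps before the failure
        start = end - window_len
        if start < 0:
            break              # not enough history for this lead period
        samples.append((i, sequence[start:end]))
    return samples
```

Each failed disk thus yields up to I training samples instead of one, which is how TPS rebalances the failure class.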
In some embodiments, the disk failure prediction model comprises: an input module, an encoder block, a decoder block, and an output module;
in the input module, a convolutional Transformer model uses convolution layers of kernel size k (i.e., "Conv, k") and stride 1 to convert the input data L (plus appropriate padding) into H different query matrices Q_h, key matrices K_h and value matrices V_h, where h = 1, …, H, and the projection weights W_h^Q, W_h^K and W_h^V are all learnable parameters;
a position-wise feed-forward sublayer is stacked on the output of the encoder block and of the decoder block respectively; this sublayer consists of two fully connected networks with a ReLU activation in between, with the following formula:

max(0, XW_1 + b_1)W_2 + b_2 (3)

where X is the input, W_1 and W_2 are learnable weight matrices, b_1 and b_2 are biases, and the dimension of the output matrix finally obtained by the feed-forward sublayer is consistent with that of X;
attention calculation operations are performed by convolutional projection instead of the existing position-based linear projection, and queries, keys and value embedding are performed by convolutional projection to enhance the attention to local context information.
In a second aspect, the invention provides a disk failure prediction device for intelligent operation and maintenance of a large-scale cloud data center, which comprises a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to the first aspect.
In a third aspect, the invention provides a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of the first aspect.
The invention has the following beneficial effects:
(1) For disk failure prediction in a cloud data center, the advantages of the time-progressive sampling (TPS) method are exploited and fused with the advantages of the Transformer for optimization. The method can fully use the failure data to extract the relationships within the data, endows the generated data with the original characteristics of the failure data as well as the data distribution and hidden patterns in the latent space, achieves excellent results on public data, and has good practicability in disk failure prediction systems with high requirements on the F1 value and the Matthews correlation coefficient (MCC).
(2) In the method, for long-time sequence data, a multi-head self-attention mechanism is utilized to superpose a multi-layer encoder-decoder model to learn and obtain the dependency relationship between time sequence data, and the time correlation between different time step data is established.
(3) In the method, it is recognized that the self-attention calculation of the original Transformer is insensitive to local information, so the model is easily affected by outliers, which brings a potential optimization problem. Therefore, convolutional projection replaces the existing position-based linear projection in the attention calculation, and queries, keys and values are embedded by convolutional projection to enhance the attention to local context information, making prediction more accurate.
(4) In the method, a Time Progressive Sampling (TPS) method is utilized to perform data enhancement so as to solve the problem of data imbalance. The TPS can generate multiple failed samples for each failed disk, which not only preserves all the characteristics of a healthy disk, but also brings more failure modes.
(5) The algorithm of the method is simple in structure and low in time complexity.
Drawings
Fig. 1 is a schematic flow diagram of a disk failure prediction method for intelligent operation and maintenance of a large-scale cloud data center, which is designed in the embodiment of the invention.
FIG. 2 is a diagram of a disk failure prediction model in an embodiment of the present invention.
Fig. 3 is a design diagram of a time progressive sampling TPS method in an embodiment of the present invention.
Detailed Description
The technical scheme of the invention is further explained in detail by combining the drawings in the specification.
In the description of the present invention, "several" means one or more and "a plurality" means two or more; "above", "below", "exceeding" and the like are understood to exclude the stated number, while "not less than", "not more than", "within" and the like are understood to include it. If "first" and "second" are described only for the purpose of distinguishing technical features, they are not to be understood as indicating or implying relative importance, implicitly indicating the number of the technical features indicated, or implicitly indicating the precedence of the technical features indicated.
In the description of the present invention, reference to the description of the terms "one embodiment," "some embodiments," "an illustrative embodiment," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Example 1
A disk fault prediction method for intelligent operation and maintenance of a large-scale cloud data center comprises the following steps:
step 1, carrying out missing value filling, data normalization and information entropy processing on the imbalanced original data set to obtain the most relevant and frequently changing characteristic attributes, namely a data set G;

step 2, dividing the data set G according to the labels into a source data set S1 formed by the minority samples with failure labels and a source data set S2 formed by the majority samples with non-failure labels;
step 3, performing data enhancement on the source data set S1 by adopting a time progressive sampling TPS method to generate synthetic data to obtain a synthetic data set T;
step 4, integrating the source data set S1, the source data set S2 and the synthetic data set T to form an integrated data set Q, and dividing the integrated data set Q into a training set M and a test set N;
step 5, training the disk failure prediction model by using the training set M, and testing the trained disk failure prediction model by using the test set N until the model prediction effect meets the requirement, so as to obtain the trained disk failure prediction model;
step 6, inputting SMART data of the disk to be detected into a trained disk failure prediction model;
and 7, determining a disk failure prediction result according to the output of the disk failure prediction model.
In some embodiments, the missing value filling comprises: if two or more values are missing in succession, the mode of that SMART entry on the disk is used as the filling value; if only one value is missing, the mean of the values before and after it is used as the filling value.
In some embodiments, the data normalization comprises:
scaling all values into the interval [0,1] using the maximum and minimum values of each feature, with the following scaling formula:

x' = (x - x_min) / (x_max - x_min)

where x is the original value of the feature, x_max and x_min are respectively the maximum and minimum values of the feature in the data set, and x' is the scaled feature value.
In some embodiments, the information entropy processing comprises: calculating a value for each characteristic attribute to express its information content, with the following formula:

H(U) = -∑_{i=1}^{n} p_i log₂ p_i

where i denotes the i-th sample out of n samples in total, and p_i is the probability of each value appearing in the SMART attribute; the higher the information entropy H(U) of a feature, the more information it contains, meaning the fluctuation of the feature attribute is more pronounced, so the most relevant and frequently changing feature attributes are selected.
In some embodiments, in step 3, performing data enhancement on the source data set S1 by the time-progressive sampling TPS method comprises:

generating and collecting failure data from the source data set S1 with the time-progressive sampling TPS method, calculating the loss between the generated data and the original data, and judging whether the loss is smaller than a set threshold; if so, the generated data are collected, otherwise the operation is repeated, until the synthetic data set T is collected.
In some embodiments, for a given failed disk, suppose the disk failure occurs at timestamp t and the prediction operation occurs at timestamp t-i; the time period of length i between the prediction action at t-i and the disk failure at t is denoted as the lead period i;

during model training, for each failed disk, the TPS gradually collects more failure data samples over the lead periods, i.e. the lead period i ranges from 1 to I, where I is a hyper-parameter of the TPS;

there are also two important parameters in the TPS method:

the window_length is defined as the size of the time window of the training network input data in each sequence sample, and predict_failure_days is defined as the number of days before failure.

Further, in some embodiments, if the window length is 5, one training sample will contain the SMART attribute information of the disk over the past 5 days; the value of predict_failure_days lies within 5-7 days.
In some embodiments, the disk failure prediction model comprises: an input module, an encoder block, a decoder block, and an output module;
in the input module, a convolutional Transformer model uses convolution layers of kernel size k (i.e., "Conv, k") and stride 1 to convert the input data L (plus appropriate padding) into H different query matrices Q_h, key matrices K_h and value matrices V_h, where h = 1, …, H, and the projection weights W_h^Q, W_h^K and W_h^V are all learnable parameters;
a position-wise feed-forward sublayer is stacked on the output of the encoder block and of the decoder block respectively; this sublayer consists of two fully connected networks with a ReLU activation in between, with the following formula:

max(0, XW_1 + b_1)W_2 + b_2 (3)

where X is the input, W_1 and W_2 are learnable weight matrices, b_1 and b_2 are biases, and the dimension of the output matrix finally obtained by the feed-forward sublayer is consistent with that of X;
attention calculation operations are performed by convolutional projection instead of the existing position-based linear projection, and queries, keys and values are embedded by convolutional projection to enhance the attention to local context information.
In some specific embodiments, as shown in fig. 1, a disk failure prediction method for intelligent operation and maintenance of a large-scale cloud data center includes the following steps. First, information-entropy feature processing is performed on the imbalanced data to select the more important features. The processed imbalanced data are then divided, and the minority-class sample data, i.e. the failure samples, are extracted. Next, data enhancement is performed on the failure samples using the time-progressive sampling method TPS to generate synthetic data; by generating more failure sample data through TPS, the ratio of the number of healthy samples to the number of failure samples reaches a better balance. The synthetic data with good generation quality are then combined with the original data to produce the integrated data. Finally, the integrated data are input into a disk failure prediction model for training, a time window of 7 days is selected to predict whether a failure will occur after 7 days, and the data are labeled accordingly. The method can fully use the failure data to extract the dependency relationships among the time-series data, endows the generated data with the original characteristics of the failure data as well as the data distribution and hidden patterns in the latent space, and has good practicability in disk failure prediction systems with high requirements on the F1 value and the Matthews correlation coefficient (MCC).
The disk failure prediction method is used for performing failure prediction on a disk of a large-scale cloud data center intelligent operation and maintenance, and in the practical application process, the method specifically comprises the following steps:
step 1, filling feature missing values in an original data set, then carrying out data normalization, scaling all values between intervals of [0,1], and then carrying out information entropy feature processing, so that the most relevant and frequently changed features are selected, and finally a feature-processed data set G is obtained.
The missing value filling adopts the following method: if two or more values are missing in succession, the mode of that SMART entry on the disk is used as the filling value; if only one value is missing, the mean of the values before and after it is used as the filling value. Data normalization follows: all values are scaled into [0,1] using the maximum and minimum values of each feature, according to the following formula:
x' = (x - x_min) / (x_max - x_min)

where x is the original value of the feature, x_max and x_min are respectively the maximum and minimum values of the feature in the data set, and x' is the scaled feature value. Next, we perform information-entropy processing on the features after missing-value filling and data normalization; this method calculates a value for each feature attribute to express its information content, with the following formula:

H(U) = -∑_{i=1}^{n} p_i log₂ p_i

where i denotes the i-th sample out of n samples in total, and p_i is the probability of each value appearing in the SMART attribute. The higher the information entropy H(U) of a feature, the more information it contains, which means the fluctuation of the feature attribute is more pronounced, so that the most relevant and frequently changing feature attributes can be selected.
Step 2, divide the labeled imbalanced failure data in the data set G: the minority-class sample data, i.e. those labeled 1, are screened out as the source data set S1, and the samples of the other class, i.e. those labeled 0, form the source data set S2;
and 3, generating and collecting fault data of the source data set S1 (namely the few samples) through TPS by adopting a time progressive sampling TPS method, performing loss calculation on the generated data and the original data, judging whether the loss is smaller than a set threshold value, collecting if the loss is smaller than the set threshold value, and otherwise, repeating the operation in the step 2. The time progressive sampling TPS method is used for carrying out data enhancement on the few types of sample fault data. Before describing TPS, there is an important concept called lead time. For a given failed disk, assuming that the disk failure occurs at timestamp t, the prediction operation occurs at timestamp t-i, then a time period t-i of length i between the occurrence of the prediction action at t, the occurrence of the disk failure at t being denoted as lead period i. During model training, for each failed disk, the TPS will collect progressively more failure data samples during lead period I (i.e. lead period I ranges from 1 to I, where I is the hyper-parameter of the TPS). There are also two important parameters in the TPS method:
the window length is defined as The time window size of The training network input data in each sequence sample, e.g. h is 5, then one training sample will contain SMART attribute information for The disk in The last 5 days. h needs to have a proper value. If too small, less potential information is provided to the ConvTrans-TPS. If it is too large, it corresponds to a long time sequence. Data that is too far from the ultimate failure has little, if any, misleading impact on the prediction of the ultimate failure trend.
The predict_failure_days is defined as the number of days before failure, which serves as the alarm boundary. Its value also needs to be appropriate: a time interval that is too long or too short affects the effectiveness of disk failure handling. A predict_failure_days value within 5-7 days is reasonable; the method selects a predict_failure_days value of 7 days.
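The labeling rule implied by predict_failure_days can be sketched as follows. This is a hypothetical helper (day indexing and the treatment of healthy disks are assumptions, not the patent's code): a daily sample is marked 1 if the disk fails within the next predict_failure_days days, and 0 otherwise.

```python
# Illustrative sketch of sample labeling with predict_failure_days = 7:
# days within 7 days of the failure day (inclusive) are positive.
def label_samples(num_days, failure_day, predict_failure_days=7):
    """Label each day 0..num_days-1; failure_day is None for healthy disks."""
    labels = []
    for day in range(num_days):
        if failure_day is not None and 0 <= failure_day - day <= predict_failure_days:
            labels.append(1)
        else:
            labels.append(0)
    return labels
```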
Step 4, the generation in step 3 is repeated until generation from the source data set S1 is finished, and the synthesized data are collected as the synthetic data set T;
and 5, integrating the source data set S1, the source data set S2 and the synthetic data set T to form a final integrated data set Q, and dividing the final integrated data set Q into a training set M and a test set N. The counts of each tag in the selected data are shown in Table 1
Table 1 data set tag statistics table
Step 6, the disk failure prediction model is constructed and trained with the training set M, then used to predict on the test set N and output the possible failures in the test set N. The disk failure prediction model is modified on the basis of the Transformer model: the position-based linear-projection attention calculation of the original Transformer is abandoned, and queries, keys and values are embedded by convolutional projection to enhance attention to local context information, while the original encoder block and decoder block are retained. In the original Transformer model, long-term and short-term dependencies are captured by a multi-head self-attention mechanism, with different attention heads learning to focus on different aspects of the temporal patterns.
In the self-attention layer, a multi-head self-attention sublayer (applying the same model at each time step, which simplifies the notation) simultaneously converts the input data L into H different query matrices Q_h = L W_h^Q, key matrices K_h = L W_h^K, and value matrices V_h = L W_h^V, where h = 1, …, H, and W_h^Q, W_h^K, W_h^V are all learnable parameters. After these linear projections, scaled dot-product attention computes the vector output sequence:

A_h = softmax((Q_h K_h^T + M) / √d_k) V_h (3)

where the mask matrix M is used to avoid future information leakage by setting all its upper triangular elements to −∞, and d_k is the number of columns (the vector dimension) of the Q_h and K_h matrices. Then A_1, A_2, …, A_H are concatenated and linearly projected again. A feed-forward sublayer is stacked at the output, with two fully connected networks and an intermediate ReLU activation, as follows:
max(0, XW_1 + b_1)W_2 + b_2 (4)
where X is the input; the dimension of the output matrix finally obtained by the feed-forward sublayer is consistent with that of X.
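A minimal NumPy sketch of one attention head with the upper-triangular mask and the feed-forward sublayer described above (single head only; the concatenation and final projection of A_1, …, A_H are omitted, and all names are illustrative):

```python
import numpy as np

def softmax(s):
    """Row-wise softmax, numerically stabilized."""
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def masked_attention(X, W_q, W_k, W_v):
    """One head of masked scaled dot-product self-attention over a sequence X
    of shape [L, d]; an upper-triangular -inf mask blocks future time steps."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]                                    # vector dimension
    scores = (Q @ K.T) / np.sqrt(d_k)
    mask = np.triu(np.full(scores.shape, -np.inf), k=1)  # future positions -> -inf
    return softmax(scores + mask) @ V

def feed_forward(X, W1, b1, W2, b2):
    """Position-wise feed-forward sublayer: max(0, XW1 + b1)W2 + b2."""
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2
```

With the mask in place, time step 0 can attend only to itself, so its output row equals its own value vector, which is one way to check the masking is correct.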
However, in the original Transformer model, the similarity between a query and a key is computed by their dot product, which can make the attention focus abnormal: the original calculation cannot take the context of the current data point into account and is insensitive to the local context. The attention score then expresses only the correlation between single time points, which differs from the original purpose of time-series prediction; the self-attention module may confuse whether an observed value is an outlier, a change point, or part of a pattern, which brings potential optimization problems.
Thus, a convolutional self-attention mechanism is used to alleviate this problem. It converts the input (plus appropriate padding) into queries and keys using a convolutional layer with kernel size k and stride 1, rather than a kernel size of 1 and stride 1 (i.e. matrix multiplication). The attention calculation is performed with convolutional projection instead of the existing position-based linear projection, and query, key and value embedding is carried out by convolutional projection to enhance the attention to local context information.
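A rough NumPy illustration (shapes and names assumed, not the patent's implementation) of replacing the kernel-size-1 projection with a kernel-size-k, stride-1 convolution over time; with k = 1 it degenerates to the plain matrix multiplication the text contrasts against:

```python
import numpy as np

def conv_projection(X, kernel):
    """Project a sequence X (shape [L, d]) with a stride-1 convolution over
    time; `kernel` has shape [k, d, d_out] and zero padding keeps the output
    length L, so each projected query/key mixes its local context of width k."""
    k, d, d_out = kernel.shape
    pad = k // 2
    Xp = np.concatenate([np.zeros((pad, d)), X, np.zeros((pad, d))], axis=0)
    out = np.zeros((X.shape[0], d_out))
    for t in range(X.shape[0]):
        window = Xp[t:t + k]                    # local context around step t
        out[t] = np.einsum("kd,kde->e", window, kernel)
    return out
```

With k = 1 the call reduces to X @ kernel[0], i.e. the position-wise linear projection of the original Transformer; a larger k lets each query/key see its temporal neighborhood.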
Step 7: mark all the instances in the test set N using the output of the disk failure prediction model to obtain a marking result c, where a marking value of 0 represents a non-failed instance and a marking value of 1 represents a failed instance. In step 7, the marking result c is obtained with formula (5):
[Formula (5): equation image not reproduced in the source text.]

In formula (5), x_i^k is the ith feature of sample k, D is the set of all data available for model training, D_k is a subset of D, y_j is the eigenvalue of sample j, a is a parameter, and p is a prior value.
Step 8: output the fault according to the marking result c.
The system for the disk failure prediction method for intelligent operation and maintenance of the large-scale cloud data center comprises:
the data set characteristic preprocessing module is used for carrying out missing value filling, data normalization and information entropy processing on the original data set so as to obtain the most relevant and frequently changed characteristic attributes;
the data set dividing module is used for screening a source data set S1 formed by a few samples and a source data set S2 formed by a plurality of samples from the unbalanced data set;
the time progressive sampling TPS module is used for performing data enhancement on the source data set S1 so as to generate high-quality synthetic data serving as a synthetic data set T;
the disk failure prediction model is used for training on the training set M divided from the final integrated data set Q, predicting on the test set N, and outputting the possible failure of each instance in the test set N;
the marking module is used for marking all the examples in the test set N to obtain a marking result;
and the display module is used for predicting and displaying the fault in the network according to the marking result.
The time progressive sampling TPS module generates the synthetic data set T, which is merged with the source data set S1 and the source data set S2 to form the integrated data set Q; the integrated data set Q is used to construct the disk failure prediction model.
Example 2
In a second aspect, the embodiment provides a disk failure prediction device for intelligent operation and maintenance of a large-scale cloud data center, which includes a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to embodiment 1.
Example 3
In a third aspect, the present embodiment provides a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of embodiment 1.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention, and the scope of the present invention is not limited to the above embodiment, but equivalent modifications or changes made by those skilled in the art according to the present disclosure should be included in the scope of the present invention as set forth in the appended claims.

Claims (10)

1. A disk failure prediction method is characterized by comprising the following steps:
step 1, carrying out missing value filling, data normalization and information entropy processing on an unbalanced original data set to obtain a most relevant and frequently changed characteristic attribute, namely a data set G;
step 2, dividing the data set G into a source data set S1 formed by few samples with fault labels and a source data set S2 formed by multiple samples with non-fault labels according to the labels;
step 3, performing data enhancement on the source data set S1 by adopting a time progressive sampling TPS method to generate synthetic data to obtain a synthetic data set T;
step 4, integrating the source data set S1, the source data set S2 and the synthetic data set T to form an integrated data set Q, and dividing the integrated data set Q into a training set M and a test set N;
step 5, training the disk failure prediction model by using the training set M, and testing the trained disk failure prediction model by using the test set N until the model prediction effect meets the requirement, so as to obtain the trained disk failure prediction model;
step 6, inputting SMART data of the disk to be detected into a trained disk failure prediction model;
and 7, determining a disk failure prediction result according to the output of the disk failure prediction model.
2. The disk failure prediction method of claim 1, wherein the missing value padding comprises: if two or more values are missing in succession, using the mode (most frequent value) of the SMART entry on the disk as the padding value; if only one value is missing, using the average of the values before and after it as the padding value.
3. The disk failure prediction method of claim 1, wherein the data normalization comprises:
scaling all values between [0,1] using the maximum and minimum values in the features, the scaling formula is as follows:
x' = (x − x_min) / (x_max − x_min) (1)
where x is the original value of the feature, x max And x min Maximum and minimum values of features in the dataset, respectively; and x' is the scaled eigenvalue.
4. The disk failure prediction method of claim 1,
the information entropy processing comprises: calculating, for each characteristic attribute, the amount of information expressed by its values, according to the following formula:
H(U) = −∑_{i=1}^{n} p_i log p_i (2)
where i represents the ith sample, for a total of n samples; p represents the probability of each value appearing in each SMART attribute; the higher the information entropy H (U) of a feature, the more information it contains, meaning the more pronounced the volatility of the feature attributes, so that the most relevant and frequently changing feature attributes are selected.
5. The disk failure prediction method according to claim 1, wherein in the step 3, performing data enhancement on the source data set S1 by using a time progressive sampling TPS method includes:
and generating and collecting fault data for the source data set S1 by adopting a time progressive sampling TPS method, performing loss calculation on the generated data and the original data, judging whether the loss is smaller than a set threshold value, if so, collecting, and otherwise, repeating the step operation until a synthetic data set T is obtained.
6. The disk failure prediction method of claim 5, wherein, for a given failed disk, assuming that the disk failure occurs at timestamp t and the prediction operation occurs at timestamp t−i, the time period of length i between the prediction action at t−i and the occurrence of the disk failure at t is denoted as the lead period i;
during model training, for each failed disk, the TPS gradually collects more failure data samples within a lead period I, namely the range of the lead period I is 1 to I, wherein I is a hyper-parameter of the TPS;
there are two important parameters in the TPS process:
the window length is defined as the size of the time window of the training network input data in each sequence sample, and predict_failure_days is defined as the number of days before failure.
7. The disk failure prediction method of claim 6, wherein the window length is 5, so that a training sample contains the SMART attribute information of the disk in the last 5 days; and the value of predict_failure_days is within 5-7 days.
8. The disk failure prediction method of claim 1, wherein the disk failure prediction model comprises: an input module, an encoder block, a decoder block, and an output module;
in the input module, the convolutional Transformer model uses convolution layers with kernel size k and stride 1 to convert the input data L into H different query matrices Q_h, key matrices K_h, and value matrices V_h, where h = 1, …, H, and the corresponding projection weights W_h^Q, W_h^K, W_h^V are all learnable parameters;
stacking a feedforward sublayer at the output of the encoder block and the decoder block, respectively, the position feedforward sublayer having two fully connected networks and a middle ReLU activation, the formula is as follows:
max(0, XW_1 + b_1)W_2 + b_2 (3)
wherein X is the input, W_1 and W_2 are learnable parameters, and b_1 and b_2 are preset bias terms used for spatial dimension conversion in the connection layers; the dimension of the output matrix finally obtained by the feed-forward sublayer is consistent with that of X.
9. A disk failure prediction device is characterized by comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any one of claims 1 to 8.
10. A storage medium having a computer program stored thereon, the computer program, when being executed by a processor, performing the steps of the method of any one of claims 1 to 8.
CN202211039310.5A 2022-08-29 2022-08-29 Intelligent operation and maintenance disk fault prediction method for large-scale cloud data center Pending CN115373879A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211039310.5A CN115373879A (en) 2022-08-29 2022-08-29 Intelligent operation and maintenance disk fault prediction method for large-scale cloud data center

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211039310.5A CN115373879A (en) 2022-08-29 2022-08-29 Intelligent operation and maintenance disk fault prediction method for large-scale cloud data center

Publications (1)

Publication Number Publication Date
CN115373879A true CN115373879A (en) 2022-11-22

Family

ID=84069789

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211039310.5A Pending CN115373879A (en) 2022-08-29 2022-08-29 Intelligent operation and maintenance disk fault prediction method for large-scale cloud data center

Country Status (1)

Country Link
CN (1) CN115373879A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116777046A (en) * 2023-05-11 2023-09-19 中国科学院自动化研究所 Traffic pre-training model construction and traffic prediction method and device and electronic equipment
CN116956197A (en) * 2023-09-14 2023-10-27 山东理工昊明新能源有限公司 Deep learning-based energy facility fault prediction method and device and electronic equipment
CN116956197B (en) * 2023-09-14 2024-01-19 山东理工昊明新能源有限公司 Deep learning-based energy facility fault prediction method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN105893256B (en) software fault positioning method based on machine learning algorithm
CN115373879A (en) Intelligent operation and maintenance disk fault prediction method for large-scale cloud data center
CN111914873A (en) Two-stage cloud server unsupervised anomaly prediction method
CN109408389A (en) A kind of aacode defect detection method and device based on deep learning
CN106570513A (en) Fault diagnosis method and apparatus for big data network system
CN111427775B (en) Method level defect positioning method based on Bert model
CN109242149A (en) A kind of student performance early warning method and system excavated based on educational data
CN115577114A (en) Event detection method and device based on time sequence knowledge graph
CN113312447A (en) Semi-supervised log anomaly detection method based on probability label estimation
CN110851654A (en) Industrial equipment fault detection and classification method based on tensor data dimension reduction
CN115617554A (en) System fault prediction method, device, equipment and medium based on time perception
CN115309575A (en) Micro-service fault diagnosis method, device and equipment based on graph convolution neural network
CN114266289A (en) Complex equipment health state assessment method
Du et al. Convolutional neural network-based data anomaly detection considering class imbalance with limited data
CN115705501A (en) Hyper-parametric spatial optimization of machine learning data processing pipeline
Sudharson et al. Improved EM algorithm in software reliability growth models
Dhurandhar et al. Enhancing simple models by exploiting what they already know
CN117495421A (en) Power grid communication engineering cost prediction method based on power communication network construction
Cohen et al. To trust or not: Towards efficient uncertainty quantification for stochastic shapley explanations
CN116894113A (en) Data security classification method and data security management system based on deep learning
CN115470854A (en) Information system fault classification method and classification system
CN116883709A (en) Carbonate fracture-cavity identification method and system based on channel attention mechanism
JP2022082525A (en) Method and apparatus for providing information based on machine learning
CN111221704B (en) Method and system for determining running state of office management application system
Bonabi Mobaraki et al. A demonstration of interpretability methods for graph neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination