CN117951529B - Sample acquisition method, device and equipment for hard disk data fault prediction - Google Patents
Sample acquisition method, device and equipment for hard disk data fault prediction
- Publication number
- CN117951529B
- Authority
- CN
- China
- Prior art keywords
- sample
- virtual
- label
- mixing factor
- training
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3037—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a memory, e.g. virtual memory, cache
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
- G06F11/3476—Data logging
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The disclosure relates to the field of machine learning, and in particular to a sample acquisition method, device and equipment for hard disk data fault prediction. The method comprises the following steps: acquiring a training sample of hard disk data and a sample label corresponding to the training sample; constructing, according to the training sample, a first fault model for generating a virtual sample; constructing, according to the sample label, a second fault model for generating a virtual label corresponding to the virtual sample; determining the virtual sample and the virtual label according to the training sample, the sample label, the first fault model and the second fault model; and obtaining, according to the virtual sample and the virtual label, a target training sample for hard disk data fault prediction and a target sample label corresponding to the target training sample. By combining machine learning with mixed learning, the method and device address the data imbalance problem that is common in the field of hard disk faults, so that a subsequent model pays more attention to minority-class samples and its overall prediction and recognition capability improves.
Description
Technical Field
The disclosure relates to the field of machine learning, and in particular relates to a sample acquisition method, device and equipment for hard disk data failure prediction.
Background
Hard disk failures can seriously jeopardize data security and system operating efficiency. Among existing hard disk fault diagnosis methods, the hard disk self-checking program known as Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) is one of the most common methods for detecting the health condition of a hard disk.
With the advent of the big data era, applications such as cloud computing and big data analysis have driven rapid growth in the storage industry. Correspondingly, the volume of hard disk data that data centers must process is growing explosively. The S.M.A.R.T. dataset contains a variety of evaluation metrics, and the analysis of these metrics mostly relies on the experience of researchers or on manually set thresholds. This analysis process is cumbersome and complex, and cannot effectively handle a variety of complex fault problems. In this context, hard disk fault detection using machine learning technology has become an important research direction. However, existing machine-learning-based fault prediction methods generally assume that the different categories are represented in equal proportions. In hard disk failure detection, the failure rate is typically low and a hard disk usually fails only after long-term operation. The distribution of healthy and faulty labels in hard disk data is therefore highly imbalanced, and under such a distribution conventional methods such as convolutional neural networks and long short-term memory networks struggle to learn the features that distinguish the categories and often overfit.
Therefore, the existing hard disk fault prediction based on machine learning still has a certain application limitation, so that the fault prediction performance of a machine learning model is limited, and the fault prediction data of the hard disk data is inaccurate.
Disclosure of Invention
In view of the above, the present disclosure provides a method, an apparatus, and a device for obtaining samples for hard disk data failure prediction, so as to solve the problem that the existing hard disk failure prediction based on machine learning still has a certain application limitation, limits the failure prediction performance of a machine learning model, and causes inaccurate failure prediction data of hard disk data.
In a first aspect, the present disclosure provides a sample acquisition method for hard disk data failure prediction, the method comprising:
Acquiring a training sample of hard disk data and a sample label corresponding to the training sample;
Constructing a first fault model for generating a virtual sample according to the training sample; constructing a second fault model for generating a virtual label corresponding to the virtual sample according to the sample label;
Determining a virtual sample and a virtual label according to the training sample, the sample label, the first fault model and the second fault model;
And obtaining a target training sample for hard disk data fault prediction and a target sample label corresponding to the target training sample according to the virtual sample and the virtual label.
In the embodiment of the disclosure, a first fault model for generating a virtual sample is constructed based on the existing training sample of hard disk data, and a second fault model for generating a virtual label is constructed based on the sample label corresponding to the training sample. The virtual sample and the virtual label are then determined based on the training sample, the sample label, the first fault model and the second fault model. The virtual sample is added to the training sample to obtain a target training sample for hard disk data fault prediction, and the virtual label is added to the sample label to obtain the target sample label. Using the target training sample and the target sample label in subsequent fault prediction addresses the data imbalance problem common in the hard disk fault field, increases the attention that the subsequent model pays to minority-class samples, and improves the overall prediction and recognition capability of the model. This solves the problem that existing machine-learning-based hard disk fault prediction has application limitations that restrict the fault prediction performance of the machine learning model and make the fault prediction results for hard disk data inaccurate.
In an alternative embodiment, constructing a first fault model for generating a virtual sample from the training sample includes:
acquiring a sample mixing factor and a preset number of training samples;
And constructing a first fault model according to the sample mixing factor and a preset number of training samples.
In an optional embodiment, constructing, according to the sample label, a second fault model for generating a virtual label corresponding to the virtual sample, includes:
acquiring a preset number of sample labels;
determining a label mixing factor according to the sample mixing factor and a preset number of training samples;
and constructing a second fault model according to the label mixing factor and the preset number of sample labels.
In an alternative embodiment, determining the label mixing factor according to the sample mixing factor and a preset number of training samples includes:
acquiring the sample size of each training sample;
And determining the label mixing factor according to the sample size, the preset decision boundary and the sample mixing factor.
In the embodiment of the disclosure, the label mixing factor is obtained according to the proportion of training samples of different categories (i.e., the sample sizes), a preset decision boundary and the sample mixing factor. Assigning a value to the label mixing factor determines the label weight of the virtual sample, so that minority-class samples can be given a higher weight and the model is forced to pay more attention to them.
In an alternative embodiment, in the case that the number of training samples is two, determining the label mixing factor according to the sample size, the preset decision boundary, and the sample mixing factor includes:
obtaining a quotient of a first sample size of a first training sample and a second sample size of a second training sample to obtain a target value;
obtaining a comparison result between the target value and the preset decision boundary;
and determining the label mixing factor according to the comparison result and the sample mixing factor.
In an alternative embodiment, determining the label mixing factor according to the comparison result and the sample mixing factor includes:
acquiring a correlation factor obtained from the sample mixing factor in the case that the comparison result meets a first condition;
and taking the maximum of the sample mixing factor and the correlation factor as the label mixing factor.
In an alternative embodiment, determining the label mixing factor according to the comparison result and the sample mixing factor includes:
acquiring a correlation factor obtained from the sample mixing factor in the case that the comparison result meets a second condition;
and taking the minimum of the sample mixing factor and the correlation factor as the label mixing factor.
In an alternative embodiment, determining the label mixing factor according to the comparison result and the sample mixing factor includes:
taking the sample mixing factor as the label mixing factor in the case that the comparison result meets a third condition.
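The three branches above can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not spell out the first, second and third conditions or how the correlation factor is derived, so the thresholds used here (a target value at or below `1 / boundary`, at or above `boundary`, or in between) and the caller-supplied `correlation` argument are hypothetical, as are all function and parameter names.

```python
def label_mixing_factor(sample_factor, n_first, n_second, boundary, correlation):
    """Choose the label mixing factor from the ratio of the two sample sizes.

    The three conditions below are assumptions made for illustration only.
    `correlation` stands in for the correlation factor, whose derivation
    from the sample mixing factor is not specified here.
    """
    if n_first <= 0 or n_second <= 0:
        raise ValueError("sample sizes must be positive")
    if boundary <= 1.0:
        raise ValueError("this sketch assumes a decision boundary above 1")

    target = n_first / n_second  # quotient of the two sample sizes

    if target <= 1.0 / boundary:
        # Assumed "first condition": the first category is the minority,
        # so the weight of its label is pushed up.
        return max(sample_factor, correlation)
    if target >= boundary:
        # Assumed "second condition": the first category is the majority,
        # so the weight of its label is pushed down.
        return min(sample_factor, correlation)
    # Assumed "third condition": the categories are roughly balanced.
    return sample_factor


minority_case = label_mixing_factor(0.4, 10, 1000, boundary=3.0, correlation=0.9)
majority_case = label_mixing_factor(0.4, 1000, 10, boundary=3.0, correlation=0.1)
balanced_case = label_mixing_factor(0.4, 100, 120, boundary=3.0, correlation=0.9)
```

With these assumed thresholds, the label of the minority category never receives less weight than the sample mixing factor alone would give it, which matches the stated goal of giving minority-class samples a higher weight.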
In an alternative embodiment, obtaining a target training sample for hard disk data failure prediction and a target sample label corresponding to the target training sample according to the virtual sample and the virtual label includes:
adding the virtual sample to the training sample to obtain the target training sample;
and adding the virtual label to the sample label to obtain the target sample label.
In an alternative embodiment, determining the virtual sample and the virtual label according to the training sample, the sample label, the first fault model, and the second fault model includes:
determining the virtual sample and the virtual label according to the training sample, the sample label, the sample mixing factor, the label mixing factor, the first fault model and the second fault model.
In the embodiment of the disclosure, the proportion of samples of each category is calculated first. A new sample (i.e., a virtual sample) is then generated by linear interpolation, and a new fuzzy label (i.e., a virtual label) is generated in a nonlinear manner, with a higher weight given to minority-class samples when the new sample label is calculated, so that a subsequent fault prediction model is forced to pay more attention to minority-class samples. Because the samples and labels have been blurred in this way, the risk of reduced generalization of the fault prediction model is lowered, and the newly generated data and labels can be used to minimize the empirical risk of fault prediction model training.
In an alternative embodiment, after obtaining the target training samples for hard disk data failure prediction and the target sample labels corresponding to the target training samples according to the virtual samples and the virtual labels, the method further includes:
Inputting the target training sample into an initial fault prediction model to obtain a classification result;
and adjusting model parameters of the initial fault prediction model according to the target sample label and the classification result to obtain a target fault prediction model for hard disk data fault prediction.
In the embodiment of the disclosure, mixed learning, which adds virtual samples and virtual labels, is fused with deep learning. In mixed learning, nonlinear weighting of the different categories of data increases the attention the model pays to minority-class samples and thereby improves its overall prediction and recognition capability. On the deep learning side, the long short-term memory (LSTM) network is a variant of the recurrent neural network that can process sequence data and that performs both feature extraction and classification. Combining mixed learning with an LSTM makes it possible to better capture long-term dependencies while handling the imbalance in hard disk data. This strategy can effectively address data imbalance in hard disk fault diagnosis and further strengthen the feature learning capability of the machine learning model.
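The parameter adjustment step can be illustrated with a deliberately simplified sketch. The disclosure trains an LSTM; the code below swaps in a plain logistic regression purely to show how fractional virtual labels enter the loss gradient. The function names, learning rate and epoch count are hypothetical choices, not part of the disclosure.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_soft_label_classifier(samples, soft_labels, lr=0.1, epochs=200):
    """Fit a logistic regression (a stand-in for the LSTM) on soft labels."""
    n_features = len(samples[0])
    weights, bias = [0.0] * n_features, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for x, y in zip(samples, soft_labels):
            prob = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)
            # Cross-entropy gradient w.r.t. the logit is (prob - y); it
            # stays valid when y is a fractional virtual label in [0, 1].
            err = prob - y
            for k, v in enumerate(x):
                grad_w[k] += err * v
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict_failure_probability(x, weights, bias):
    return sigmoid(sum(w * v for w, v in zip(weights, x)) + bias)


# Toy data: one feature, two real labels (0.0, 1.0) and two virtual labels.
toy_samples = [[0.0], [1.0], [2.0], [3.0]]
toy_labels = [0.0, 0.2, 0.8, 1.0]
toy_weights, toy_bias = train_soft_label_classifier(toy_samples, toy_labels)
```

The same loss form carries over to a sequence model: only the mapping from input to predicted probability changes, while the soft target labels are used unchanged.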
In an alternative embodiment, after adjusting model parameters of the initial failure prediction model according to the target sample label and the classification result to obtain a target failure prediction model for hard disk data failure prediction, the method further includes:
Obtaining target hard disk data to be subjected to fault prediction;
and inputting the target hard disk data into a target fault prediction model to obtain a fault prediction result.
Optionally, after the target fault prediction model is acquired, it may be applied to fault prediction for the target hard disk data.
In the embodiment of the disclosure, the target fault prediction model is obtained by continuously training on the features with the mixed learning module and the long short-term memory neural network. Its output prediction result can be used to judge whether the hard disk will fail within a future period of time, so that corresponding measures can be taken to protect data security and service continuity.
In an alternative embodiment, the first fault model is a first expression, and the formula of the first expression is: $\tilde{x} = \lambda_x x_i + (1 - \lambda_x) x_j$, wherein $\tilde{x}$ is a virtual sample, $\lambda_x$ is a sample mixing factor, $x_i$ is a training sample under the first category, and $x_j$ is a training sample under the second category; the second fault model is a second expression, and the formula of the second expression is: $\tilde{y} = \lambda_y y_i + (1 - \lambda_y) y_j$, wherein $\tilde{y}$ is the virtual label corresponding to the virtual sample, $\lambda_y$ is a label mixing factor, $y_i$ is a training sample label under the first category, and $y_j$ is a training sample label under the second category.
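As a minimal sketch of the two expressions, assume that each is a convex combination weighted by its own mixing factor, consistent with the linear interpolation described earlier. The function name, the two-feature toy samples and the factor values below are hypothetical.

```python
def make_virtual_pair(x_first, x_second, y_first, y_second,
                      sample_factor, label_factor):
    """Blend one pair of samples and their labels with separate factors."""
    if len(x_first) != len(x_second):
        raise ValueError("both samples need the same number of features")
    # Virtual sample: feature-wise convex combination of the two samples.
    virtual_x = [sample_factor * a + (1.0 - sample_factor) * b
                 for a, b in zip(x_first, x_second)]
    # Virtual label: same form, but weighted by the label mixing factor.
    virtual_y = label_factor * y_first + (1.0 - label_factor) * y_second
    return virtual_x, virtual_y


# Hypothetical two-feature readings for a first-category (label 1.0) and a
# second-category (label 0.0) sample.
virtual_x, virtual_y = make_virtual_pair(
    [40.0, 5.0], [60.0, 15.0], 1.0, 0.0,
    sample_factor=0.25, label_factor=0.75,
)
```

Here the virtual sample sits closer to the second sample (`[55.0, 12.5]`), while the virtual label leans toward the first label (`0.75`), showing how decoupling the two factors lets the label favor one category even when the features do not.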
In a second aspect, the present disclosure provides a sample acquisition apparatus for hard disk data failure prediction, the apparatus comprising:
the first acquisition module is used for acquiring training samples of hard disk data and sample labels corresponding to the training samples;
The building module is used for building a first fault model for generating a virtual sample according to the training sample; constructing a second fault model for generating a virtual label corresponding to the virtual sample according to the sample label;
the determining module is used for determining a virtual sample and a virtual label according to the training sample, the sample label, the first fault model and the second fault model;
the first obtaining module is used for obtaining a target training sample for hard disk data fault prediction and a target sample label corresponding to the target training sample according to the virtual sample and the virtual label.
In a third aspect, the present disclosure provides a computer device comprising a memory and a processor that are communicatively connected to each other, wherein computer instructions are stored in the memory, and the processor executes the computer instructions so as to perform the sample acquisition method for hard disk data failure prediction of the first aspect or any one of its corresponding embodiments.
In a fourth aspect, the present disclosure provides a computer-readable storage medium having stored thereon computer instructions for causing a computer to execute the sample acquisition method for hard disk data failure prediction of the first aspect or any one of its corresponding embodiments.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the related art, the drawings that are required to be used in the description of the embodiments or the related art will be briefly described below, and it is apparent that the drawings in the following description are some embodiments of the present disclosure, and other drawings may be obtained according to the drawings without inventive effort to those of ordinary skill in the art.
FIG. 1 is a flow diagram of a sample acquisition method for hard disk data failure prediction according to some embodiments of the present disclosure;
FIG. 2 is a schematic diagram for hard disk data failure prediction according to some embodiments of the present disclosure;
FIG. 3 is a block diagram of a sample acquisition device for hard disk data failure prediction according to some embodiments of the present disclosure;
Fig. 4 is a schematic diagram of a hardware structure of a computer device according to an embodiment of the present disclosure.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some embodiments of the present disclosure, but not all embodiments. Based on the embodiments in this disclosure, all other embodiments that a person skilled in the art would obtain without making any inventive effort are within the scope of protection of this disclosure.
Hard disk failures can seriously jeopardize data security and system operating efficiency. Among existing hard disk fault diagnosis methods, the hard disk self-checking program known as Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.) is one of the most common methods for detecting the health condition of a hard disk. The S.M.A.R.T. dataset contains critical information about hard disk operation, such as temperature, sector information and I/O errors. Researchers can judge and analyze the current health condition of a hard disk and its future health trend from the indexes in the S.M.A.R.T. dataset. By monitoring changes in the S.M.A.R.T. data, hard disk fault diagnosis software can automatically identify possible problems and extract relevant information, so that higher-level maintenance measures such as problem repair or hard disk replacement can be taken. In general, S.M.A.R.T.-based hard disk fault diagnosis is effective, low-cost and reliable over the long term.
With the advent of the big data era, applications such as cloud computing and big data analysis have driven rapid growth in the storage industry. Correspondingly, the volume of hard disk data that data centers must process is growing explosively. The S.M.A.R.T. dataset contains a variety of evaluation metrics, and the analysis of these metrics mostly relies on the experience of researchers or on manually set thresholds. This analysis process is cumbersome and complex, and cannot effectively handle a variety of complex fault problems. In this context, hard disk fault detection using machine learning technology has become an important research direction. The main idea of machine learning is to extract valuable information and knowledge from large-scale data and use it to solve practical problems. For hard disk fault detection, processing and analyzing large-scale S.M.A.R.T. data with machine learning makes it possible to identify complex fault types and provide high-quality prediction suggestions, which improves the reliability and stability of the storage medium and also supplies valuable data analysis and knowledge discovery for research and development in related fields.
Existing machine-learning-based fault prediction methods generally assume that the different categories are represented in equal proportions. In hard disk failure detection, however, the failure rate is typically low and a hard disk usually fails only after long-term operation. The distribution of "healthy" and "failed" labels in hard disk data is therefore extremely imbalanced. Under such a distribution, existing methods such as convolutional neural networks and long short-term memory networks struggle to learn the features that distinguish the categories and usually overfit. Existing machine-learning-based hard disk fault prediction therefore still has application limitations, and the fault prediction performance of the machine learning model is restricted.
To solve the above problems, an embodiment of the present disclosure provides a sample acquisition method for hard disk data failure prediction. It should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in a different order.
In this embodiment, a sample acquisition method for hard disk data failure prediction is provided. FIG. 1 is a flowchart of the method according to an embodiment of the disclosure. The method may be applied to a server side and, as shown in FIG. 1, includes the following steps:
Step S101, obtaining a training sample of hard disk data and a sample label corresponding to the training sample;
step S102, constructing a first fault model for generating a virtual sample according to a training sample; constructing a second fault model for generating a virtual label corresponding to the virtual sample according to the sample label;
Step S103, determining a virtual sample and a virtual label according to the training sample, the sample label, the first fault model and the second fault model;
step S104, obtaining a target training sample for hard disk data fault prediction and a target sample label corresponding to the target training sample according to the virtual sample and the virtual label.
Optionally, in an embodiment of the present disclosure, a sample acquisition method for hard disk data failure prediction is proposed with reference to the existing linear interpolation method. The linear interpolation method constructs new data samples by interpolating between samples and replaces the original data set with imaginary samples and their corresponding imaginary labels. Because the imaginary sample lies within the neighborhood of the real label, this method is also called neighborhood risk minimization, whose mathematical expression is given in equation (1):
$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j \tag{1}$$
Wherein $\tilde{x}$ is the imaginary sample generated in the neighborhood risk minimization method, $\tilde{y}$ is the imaginary label generated in the neighborhood risk minimization method, $x_i$ and $x_j$ are the input training samples under two categories (e.g., category 0 and category 1), $y_i$ and $y_j$ are the input sample labels under the two categories, and $\lambda$ is the parameter that regulates the mixing between $x_i$ and $x_j$; the numerical distribution of this mixing factor satisfies the beta distribution. $E(\cdot)$ is the mathematical expectation and $f(\cdot)$ is the mapping function.
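The interpolation with a beta-distributed mixing factor can be sketched as follows. The text only states that the mixing factor follows a beta distribution; using a symmetric `Beta(alpha, alpha)` with a single hypothetical `alpha` parameter is an assumption borrowed from common practice, and the function name is illustrative.

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, alpha=1.0, rng=None):
    """Draw one mixing factor and interpolate a sample pair and its labels."""
    rng = rng or random.Random()
    # Assumed symmetric beta distribution; the result lies in [0, 1].
    lam = rng.betavariate(alpha, alpha)
    # The same factor is applied to the features and to the labels.
    x_new = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_new = lam * y_i + (1.0 - lam) * y_j
    return x_new, y_new, lam


demo_rng = random.Random(0)  # fixed seed so the draw is reproducible
demo_x, demo_y, demo_lam = mixup_pair([0.0, 0.0], 0.0, [1.0, 1.0], 1.0,
                                      alpha=1.0, rng=demo_rng)
```

Because one factor drives both features and labels here, the imaginary label always tracks the imaginary sample; the label mixing factor introduced by the disclosure relaxes exactly this coupling.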
Based on the concept, the embodiment of the disclosure needs to acquire a training sample of existing hard disk data and a sample label corresponding to the training sample, then construct a first fault model for generating a virtual sample based on the training sample, and construct a second fault model for generating a virtual label corresponding to the virtual sample according to the sample label.
Combining the training sample with the first fault model, combining the sample label with the second fault model to respectively obtain a virtual sample and a virtual label of the virtual sample, adding the virtual sample into the existing training sample to obtain a target training sample for hard disk data fault prediction, and adding the virtual label into the existing sample label to obtain a target sample label corresponding to the target training sample.
Thus, the target training samples and the target sample labels are used as sample data for performing fault prediction on the hard disk data.
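The overall flow of generating virtual data and appending it to the existing data can be sketched as below. This is a simplified illustration: it assumes binary labels with 0 for the first category and 1 for the second, fixed mixing factors passed in by the caller, and random pairing of samples, none of which is prescribed by the text. All names are hypothetical.

```python
import random


def augment_with_virtual(samples, labels, sample_factor, label_factor,
                         n_virtual, seed=0):
    """Append n_virtual blended samples and labels to the training data."""
    rng = random.Random(seed)
    first = [x for x, y in zip(samples, labels) if y == 0]
    second = [x for x, y in zip(samples, labels) if y == 1]
    if not first or not second:
        raise ValueError("both categories must be present")

    # Start from copies of the original training samples and labels.
    target_samples = [list(x) for x in samples]
    target_labels = [float(y) for y in labels]

    for _ in range(n_virtual):
        x_a, x_b = rng.choice(first), rng.choice(second)  # random pairing
        target_samples.append(
            [sample_factor * a + (1.0 - sample_factor) * b
             for a, b in zip(x_a, x_b)]
        )
        # Labels of the paired categories are 0 and 1 by assumption.
        target_labels.append(label_factor * 0.0 + (1.0 - label_factor) * 1.0)
    return target_samples, target_labels


toy_x = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0], [10.0, 20.0]]
toy_y = [0, 0, 0, 1]
target_x, target_y = augment_with_virtual(toy_x, toy_y, 0.5, 0.25, n_virtual=3)
```

The original samples are kept unchanged at the front of the returned lists, so the target training set is a superset of the original one.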
In the embodiment of the disclosure, a first fault model for generating a virtual sample is constructed based on the existing training samples of hard disk data, and a second fault model for generating a virtual label is constructed based on the sample labels corresponding to the training samples. The virtual sample and the virtual label are then determined based on the training samples, the sample labels, the first fault model and the second fault model. The virtual sample is added to the training samples to obtain the target training samples for hard disk data fault prediction, the virtual label is added to the sample labels to obtain the target sample labels, and both are used in subsequent fault prediction. In this way, the data imbalance problem common in the hard disk fault field can be handled, the attention that the subsequent model pays to minority-class samples is increased, and the overall predictive recognition capability of the model is improved. This addresses the problem that existing machine-learning-based hard disk fault prediction has limited applicability, which constrains the fault prediction performance of the machine learning model and makes the fault prediction for hard disk data inaccurate.
In some alternative embodiments, constructing a first fault model that generates virtual samples from training samples includes:
acquiring a sample mixing factor and a preset number of training samples;
And constructing a first fault model according to the sample mixing factor and a preset number of training samples.
Optionally, in an embodiment of the present disclosure, the input training samples and their sample labels are given as $(x_i, y_i)$ and $(x_j, y_j)$, where $x_i$ and $x_j$ denote training samples under two categories (e.g., category 0 and category 1), and $y_i$ and $y_j$ denote the two corresponding sample labels, such as a category 0 label and a category 1 label; the preset number is therefore 2. It should be noted that in the embodiment of the present disclosure the preset number is preferably 2, but may also be 3, 4, etc. Since for values such as 3 or 4 the training samples and sample labels can still be split into pairs, the embodiment of the present disclosure is illustrated with a preset number of 2.
A sample mixing factor $\lambda_x$ is obtained whose value follows a beta distribution governed by the regulation parameter $\alpha$, for which a specific value may be selected. Then, according to the sample mixing factor and the preset number of training samples, a first fault model is constructed, where the first fault model may be a first expression, as in formula (2):
$$\tilde{x} = \lambda_x x_i + (1-\lambda_x)\,x_j \tag{2}$$
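As a minimal sketch of the linear interpolation in formula (2), the following Python functions draw a mixing factor and mix two feature vectors. The function names, the list-of-floats representation of a training sample, and the use of a symmetric Beta(alpha, alpha) draw from the standard library are illustrative assumptions and are not details given in the disclosure.

```python
import random


def draw_sample_mixing_factor(alpha, rng=None):
    """Draw a sample mixing factor in [0, 1] from a beta distribution.

    Assumption: a symmetric Beta(alpha, alpha) is used; the text only states
    that the factor follows a beta distribution.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if rng is None:
        rng = random.Random()
    return rng.betavariate(alpha, alpha)


def make_virtual_sample(x_i, x_j, lam_x):
    """Return lam_x * x_i + (1 - lam_x) * x_j, element by element."""
    if len(x_i) != len(x_j):
        raise ValueError("both training samples need the same number of features")
    if not 0.0 <= lam_x <= 1.0:
        raise ValueError("the sample mixing factor must lie in [0, 1]")
    return [lam_x * a + (1.0 - lam_x) * b for a, b in zip(x_i, x_j)]
```

A call such as `make_virtual_sample(x_i, x_j, draw_sample_mixing_factor(alpha))` would then yield one virtual sample lying on the line segment between the two inputs.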
in some optional embodiments, constructing, according to the sample label, a second fault model for generating a virtual label corresponding to the virtual sample includes:
acquiring a preset number of sample tags;
determining a label mixing factor according to the sample mixing factor and a preset number of training samples;
and constructing a second fault model according to the label mixing factor and the preset number of sample labels.
Optionally, since the sample labels and training samples are presented in pairs, e.g. $(x_i, y_i)$ and $(x_j, y_j)$, the preset number (i.e., 2) of sample labels is obtained: $y_i$ and $y_j$.
The sample mixing factor $\lambda_x$ in the above embodiment can be preset. In the balanced mixed training method, however, an additional decision boundary is introduced into the calculation of the label mixing factor $\lambda_y$: the sample size of each class of training samples is examined, and the label mixing factor $\lambda_y$ is weighted accordingly to obtain the final label mixing factor.
A second fault model is constructed according to the label mixing factor and the preset number of sample labels, where the second fault model may be a second expression, as in formula (3):
$$\tilde{y} = \lambda_y y_i + (1-\lambda_y)\,y_j \tag{3}$$
In some alternative embodiments, determining the tag mixing factor from the sample mixing factor, a predetermined number of training samples, comprises:
acquiring the sample size of each training sample;
And determining the label mixing factor according to the sample size, the preset decision boundary and the sample mixing factor.
Optionally, in ordinary mixed training the sample mixing factor is identical to the label mixing factor, i.e. $\lambda_x = \lambda_y$. In the proposed balanced mixed training method, however, the sample mixing factor is in general not equal to the label mixing factor, i.e. $\lambda_x \neq \lambda_y$. In the disclosed embodiment, the sample size of each training sample is therefore acquired, a preset decision boundary k is acquired (k is the key parameter that determines the label of the newly generated virtual sample), and the label mixing factor is then determined according to the sample size, the preset decision boundary and the sample mixing factor.
In the embodiment of the disclosure, the label mixing factor is obtained according to the proportion (i.e. the sample amount) of training samples of different categories, a preset decision boundary and a sample mixing factor, and the label weight of the virtual sample is determined by carrying out numerical assignment on the label mixing factor, so that a higher weight can be given to a few samples, and the model is forced to give higher attention to the few samples.
In some alternative embodiments, where the number of training samples is two, determining the tag mixing factor based on the sample size, the preset decision boundary, and the sample mixing factor includes:
Obtaining a quotient of a first sample size of a first training sample and a second sample size of a second training sample to obtain a target value;
obtaining a comparison result between the target value and a preset decision boundary;
and determining the label mixing factor according to the comparison result and the sample mixing factor.
Optionally, the essence of the disclosed embodiments is to balance the hard disk data, so after a virtual sample $\tilde{x}$ is obtained, a label mixing factor with a higher proportion is assigned to the minority-class samples. To this end, the first sample size $n_i$ of the first training sample and the second sample size $n_j$ of the second training sample are obtained, and the quotient of $n_i$ and $n_j$ is taken as the target value $n_i/n_j$.
A comparison result between the target value and the preset decision boundary is then obtained, and the label mixing factor is determined according to the comparison result and the sample mixing factor.
In some alternative embodiments, determining the tag mixing factor based on the comparison and the sample mixing factor comprises:
Acquiring a correlation factor obtained by the sample mixing factor under the condition that the comparison result meets the first condition;
and taking the maximum value between the sample mixing factor and the correlation factor as a label mixing factor.
Optionally, when the label mixing factor $\lambda_y$ is weighted according to the sample size, the preset decision boundary and the sample mixing factor, it is defined as in formula (4):
$$\lambda_y = \begin{cases} \max(\lambda_x,\, 1-\lambda_x), & n_i/n_j \le k \\ \min(\lambda_x,\, 1-\lambda_x), & n_i/n_j \ge 1/k \\ \lambda_x, & \text{otherwise} \end{cases} \tag{4}$$
Here "$n_i/n_j \le k$", i.e. the target value being less than or equal to the preset decision boundary, is the comparison result, and this comparison result satisfies the first condition in formula (4). The correlation factor $1-\lambda_x$ is obtained from the sample mixing factor $\lambda_x$, and according to $\lambda_y = \max(\lambda_x, 1-\lambda_x)$ the maximum of the sample mixing factor and the correlation factor is selected as the label mixing factor $\lambda_y$.
For example, $x_i$ belongs to class 0 with 20 samples in total, and $x_j$ belongs to class 1 with 1000 samples. Assuming a preset decision boundary k = 0.5 and a sample mixing factor whose value and correlation factor are 0.2 and 0.8, 20/1000 = 0.02 ≤ k, so $\lambda_y = 0.8$ and the mixed training assigns the label as 80% class 0 and 20% class 1.
In some alternative embodiments, determining the tag mixing factor based on the comparison and the sample mixing factor comprises:
Acquiring a correlation factor obtained by the sample mixing factor under the condition that the comparison result meets a second condition;
The minimum value between the sample mixing factor and the correlation factor is taken as the label mixing factor.
Optionally, when the label mixing factor $\lambda_y$ is weighted according to the sample size, the preset decision boundary and the sample mixing factor, it is defined as in formula (4) above:
Here "$n_i/n_j \ge 1/k$", i.e. the target value being greater than or equal to the reciprocal of the preset decision boundary, is the comparison result, and this comparison result satisfies the second condition in formula (4). The correlation factor $1-\lambda_x$ is obtained from the sample mixing factor $\lambda_x$, and according to $\lambda_y = \min(\lambda_x, 1-\lambda_x)$ the minimum of the sample mixing factor and the correlation factor is selected as the label mixing factor $\lambda_y$.
For example, $x_i$ belongs to class 0 with 1000 samples in total, and $x_j$ belongs to class 1 with only 20 samples. Assuming a preset decision boundary k = 0.5 and a sample mixing factor whose value and correlation factor are 0.2 and 0.8, 1000/20 = 50 ≥ 1/k, so $\lambda_y = 0.2$ and the mixed training assigns the label as 20% class 0 and 80% class 1.
In some alternative embodiments, determining the tag mixing factor based on the comparison and the sample mixing factor comprises:
And taking the sample mixing factor as the label mixing factor when the comparison result meets the third condition.
Optionally, when the label mixing factor $\lambda_y$ is weighted according to the sample size, the preset decision boundary and the sample mixing factor, it is defined as in formula (4) above:
When the comparison between the target value $n_i/n_j$ and the preset decision boundary k satisfies neither the first condition nor the second condition, the comparison result is considered to satisfy the third condition, i.e. the "otherwise" case, and $\lambda_y = \lambda_x$ is obtained.
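The three conditions can be summarized in a short Python sketch. It assumes that the correlation factor is one minus the sample mixing factor, which is inferred from the 80%/20% examples in this text rather than stated explicitly; the function and parameter names are hypothetical.

```python
def label_mixing_factor(n_i, n_j, lam_x, k):
    """Pick the label mixing factor from the class sizes and a decision boundary.

    n_i, n_j: sample sizes of the first and second training sample classes
    lam_x:    sample mixing factor in [0, 1]
    k:        preset decision boundary (k = 0.5 in the examples of the text)
    """
    if n_i <= 0 or n_j <= 0:
        raise ValueError("sample sizes must be positive")
    if k <= 0:
        raise ValueError("the decision boundary must be positive")

    target = n_i / n_j         # quotient of the two sample sizes
    related = 1.0 - lam_x      # ASSUMED form of the correlation factor

    if target <= k:            # first condition: class i is the minority
        return max(lam_x, related)
    if target >= 1.0 / k:      # second condition: class j is the minority
        return min(lam_x, related)
    return lam_x               # third condition: classes are roughly balanced
```

With the figures used in the examples, `label_mixing_factor(20, 1000, 0.2, 0.5)` selects the larger of the two candidate values, so the minority class 0 receives the larger share of the virtual label.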
In some alternative embodiments, determining the virtual samples and virtual tags from the training samples, the sample tags, the first fault model, and the second fault model includes:
and determining the virtual sample and the virtual label according to the training sample, the sample label, the sample mixing factor, the label mixing factor, the first fault model and the second fault model.
Optionally, as can be seen from the above embodiments, after the sample mixing factor and the label mixing factor are determined, they are substituted into the first fault model and the second fault model together with the preset number of training samples and sample labels, so that the virtual sample $\tilde{x}$ and the virtual label $\tilde{y}$ can be determined based on formula (2) and formula (3).
In the embodiment of the disclosure, the proportion of different types of samples is calculated first, a new sample (i.e., a virtual sample) is generated in a linear interpolation mode, a new fuzzy label (i.e., a virtual label) is generated in a nonlinear mode, and a higher weight is given to minority samples when the new sample label is calculated, so that a subsequent fault prediction model is forced to give higher attention to minority samples. Because the samples and labels at this time have been subjected to blurring processing, the risk of reduced generalization of the failure prediction model is subsequently reduced, and newly generated data and labels can be utilized to minimize the empirical risk of failure prediction model training.
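The following sketch illustrates how a virtual pair might be generated and appended to the existing data once both mixing factors are known. It assumes one-hot label vectors and plain Python lists, and it takes the label mixing factor as an already determined input; all names and the example values are hypothetical.

```python
def make_virtual_pair(x_i, y_i, x_j, y_j, lam_x, lam_y):
    """Interpolate features with lam_x and one-hot labels with lam_y."""
    virtual_x = [lam_x * a + (1.0 - lam_x) * b for a, b in zip(x_i, x_j)]
    virtual_y = [lam_y * a + (1.0 - lam_y) * b for a, b in zip(y_i, y_j)]
    return virtual_x, virtual_y


def extend_training_set(samples, labels, virtual_x, virtual_y):
    """Return new lists with the virtual pair appended (inputs are not modified)."""
    return samples + [virtual_x], labels + [virtual_y]


# Hypothetical two-feature samples with one-hot labels for category 0 and 1.
train_x = [[1.0, 0.0], [0.0, 2.0]]
train_y = [[1.0, 0.0], [0.0, 1.0]]

# lam_y differs from lam_x here, as when category 0 is the minority class.
vx, vy = make_virtual_pair(train_x[0], train_y[0], train_x[1], train_y[1],
                           lam_x=0.25, lam_y=0.75)
target_x, target_y = extend_training_set(train_x, train_y, vx, vy)
```

Because the features are mixed with one factor and the labels with another, the virtual sample sits closer to category 1 in feature space while its soft label still leans toward category 0.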
In some optional embodiments, after obtaining the target training samples for hard disk data failure prediction and the target sample labels corresponding to the target training samples according to the virtual samples and the virtual labels, the method further includes:
Inputting the target training sample into an initial fault prediction model to obtain a classification result;
and adjusting model parameters of the initial fault prediction model according to the target sample label and the classification result to obtain a target fault prediction model for hard disk data fault prediction.
Optionally, the virtual sample $\tilde{x}$ is added to the existing training samples to obtain the target training samples, and the virtual label $\tilde{y}$ is added to the existing sample labels to obtain the target sample labels. The target training samples are then input into the initial fault prediction model to obtain a classification result. Because each target training sample carries a target sample label, the obtained classification result is compared with the target sample label, and the model parameters of the initial fault prediction model are adjusted through loss calculation until the classification result is consistent with the target sample label, thereby obtaining the trained target fault prediction model.
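The text does not name the loss used to compare the classification result with the target sample label. As one plausible choice for soft (mixed) labels, the sketch below computes a cross-entropy between a predicted class distribution and a soft label; the function name and the clipping constant `eps` are assumptions made for illustration.

```python
import math


def soft_label_cross_entropy(predicted_probs, soft_label, eps=1e-12):
    """Cross-entropy between predicted class probabilities and a soft label.

    predicted_probs: model output, one probability per class
    soft_label:      virtual label, e.g. [0.8, 0.2] for 80% class 0 / 20% class 1
    eps:             lower bound that keeps log() away from zero
    """
    if len(predicted_probs) != len(soft_label):
        raise ValueError("prediction and label need the same number of classes")
    return -sum(t * math.log(max(p, eps))
                for p, t in zip(predicted_probs, soft_label))
```

The loss is smallest when the predicted distribution matches the soft label, so a prediction of `[0.8, 0.2]` scores lower than `[0.2, 0.8]` against the label `[0.8, 0.2]`.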
It will be appreciated that the target fault prediction model here is the final model obtained by adding virtual samples with new virtual labels to the existing training samples and then training continuously. The model may be a Long Short-Term Memory (LSTM) network, which is a recurrent neural network that performs feature extraction and classification and controls the updating of its memory cell through an input gate, a forget gate and an output gate. LSTM can effectively learn long-term and short-term dependencies within a time series and is therefore well suited to hard disk failure diagnosis based on time-series signals. The strategy provided by the disclosure can effectively address the data imbalance problem in hard disk fault diagnosis, and further enhances the feature learning capacity and the prediction accuracy of the machine learning model.
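To make the role of the gates concrete, here is a minimal single-unit LSTM step in plain Python. It follows the standard LSTM update equations; the parameter layout (a dictionary of weight tuples) and all names are assumptions for illustration and do not reflect the network structure used in the disclosure.

```python
import math


def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """Advance a single-unit LSTM cell by one time step.

    x:      list of input features at this time step
    h_prev: previous hidden state (float)
    c_prev: previous memory-cell state (float)
    params: dict mapping "input", "forget", "output" and "candidate" to a
            tuple (w, u, b): input weights (list), recurrent weight, bias
    """
    def pre_activation(name):
        w, u, b = params[name]
        return sum(wk * xk for wk, xk in zip(w, x)) + u * h_prev + b

    i_gate = _sigmoid(pre_activation("input"))     # how much new content to write
    f_gate = _sigmoid(pre_activation("forget"))    # how much old memory to keep
    o_gate = _sigmoid(pre_activation("output"))    # how much memory to expose
    candidate = math.tanh(pre_activation("candidate"))

    c_new = f_gate * c_prev + i_gate * candidate   # memory-cell update
    h_new = o_gate * math.tanh(c_new)              # new hidden state
    return h_new, c_new
```

Applying `lstm_step` repeatedly over the time steps of a feature sequence carries the memory-cell state forward, which is what lets the network retain information across long sequences.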
Further, the S.M.A.R.T. data generated during hard disk operation is first collected and labeled to form an input data set: a hard disk that fails within seven days is labeled "about to fail", while a hard disk that does not fail within seven days is labeled "healthy running". Secondly, the data is cleaned, and missing parts are either filled in or the partially missing records are deleted or overwritten. Subsequently, as shown in FIG. 2, the hybrid learning module includes training sample $x_i$, training sample $x_j$, sample label $y_i$ and sample label $y_j$. Training sample $x_i$ and training sample $x_j$ are mixed to obtain a new mixed sample, sample label $y_i$ and sample label $y_j$ are mixed to obtain a new mixed label, and the new mixed sample and the new mixed label are input into the LSTM for feature training. Through iterative training of the LSTM model, the learning ability of the model and the accuracy of its recognition results are continuously enhanced. Finally, a prediction result is output to judge whether the hard disk will fail within a future period of time, so that corresponding measures can be taken to protect data safety and service continuity.
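The seven-day labeling rule can be sketched as follows. The function name, the use of calendar dates, and treating the seventh day as still "within seven days" are assumptions; the text does not specify how the boundary is handled.

```python
from datetime import date


def label_record(record_date, failure_date, horizon_days=7):
    """Label one S.M.A.R.T. record by how soon its disk fails.

    record_date:  day the record was collected
    failure_date: day the disk failed, or None if no failure was observed
    horizon_days: look-ahead window (seven days in the text)
    """
    if failure_date is None:
        return "healthy running"
    days_to_failure = (failure_date - record_date).days
    # Assumption: the boundary day itself counts as "within seven days".
    if 0 <= days_to_failure <= horizon_days:
        return "about to fail"
    return "healthy running"
```

For instance, a record collected on 1 January for a disk that fails on 5 January would be labeled "about to fail", whereas the same record for a disk that fails on 20 January would be labeled "healthy running".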
The specific implementation process is as follows:
(1) Collect key information such as S.M.A.R.T. data and hard disk performance data, ensuring that the data set is comprehensive and reliable;
(2) Aggregate the various data sources into one comprehensive data set and clean it thoroughly, handling data quality problems such as missing values, abnormal values and duplicate values;
(3) Construct an LSTM deep model (i.e., the initial fault prediction model), determine the network structure, the number of layers and the number of neurons, and select a suitable activation function, optimization algorithm and loss function to enhance the expressive power of the model;
(4) Input the aggregated, cleaned and selected data into the constructed LSTM model, iterate over the data set, and train the model by back propagation;
(5) Once a sufficiently low error is reached, model training is complete and the target fault prediction model is obtained.
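As an illustration of the cleaning in step (2), the sketch below removes exact duplicate rows and fills missing values with the column mean. Mean imputation is only one possible filling strategy and is an assumption here, as are the function name and the representation of a record as a list in which `None` marks a missing value.

```python
def clean_rows(rows):
    """Drop exact duplicate rows, then fill missing values with the column mean.

    rows: list of equal-length lists of floats, with None for missing values
    """
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:          # keep only the first copy of each row
            seen.add(key)
            unique.append(list(row))
    if not unique:
        return unique

    for col in range(len(unique[0])):
        observed = [r[col] for r in unique if r[col] is not None]
        if not observed:
            continue                 # nothing to impute from; leave the gaps
        mean = sum(observed) / len(observed)
        for r in unique:
            if r[col] is None:
                r[col] = mean
    return unique
```

Handling of abnormal values is left out of this sketch, since the text does not say how they are detected.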
When selecting features, key S.M.A.R.T. features such as read errors, address errors and power-on time may be selected. The hard disk performance data includes hard-disk-level performance indicators and server-level performance indicators. The hard-disk-level performance indicators include I/O queue size, throughput, latency, average waiting time of I/O operations, etc. The server-level performance indicators include CPU activity, page-in and page-out activity, and the like.
The hard disk performance data is utilized to extract the premonitory signals of hard disk faults, and support can be provided for early hard disk fault prediction. Specifically, the performance of the hard disk under different loads can be measured through the hard disk performance data, including the hard disk read-write speed, response time, data access frequency and the like, which is helpful for discovering possible hard disk faults in advance, and further enhances the prediction accuracy.
In the embodiment of the disclosure, mixed learning, which adds virtual samples and virtual labels, is combined with deep learning. In the mixed learning, different classes of data are weighted nonlinearly, which increases the attention the model pays to minority-class samples and thus improves its overall predictive recognition capability. In the deep learning, the LSTM is a variant of the recurrent neural network that performs feature extraction and classification and is suited to processing sequence data. Combining mixed learning with the LSTM allows long-term dependencies to be captured better while handling the hard disk data imbalance problem. The strategy can effectively address the data imbalance problem in hard disk fault diagnosis and further enhance the feature learning capacity of the machine learning model.
In some alternative embodiments, after adjusting model parameters of the initial failure prediction model according to the target sample label and the classification result to obtain a target failure prediction model for hard disk data failure prediction, the method further comprises:
Obtaining target hard disk data to be subjected to fault prediction;
and inputting the target hard disk data into a target fault prediction model to obtain a fault prediction result.
Alternatively, after the target failure prediction model is acquired, it may be applied to a failure prediction scenario of target hard disk data to be failure predicted.
Specifically, (1) collecting target hard disk data to be subjected to fault prediction;
(2) And inputting the target hard disk data into a target fault prediction model, and outputting a fault prediction result of the target hard disk data.
In the embodiment of the disclosure, the target fault prediction model is obtained by continuously training on features using the hybrid learning module and the long short-term memory neural network. Its output prediction result can be used to judge whether the hard disk will fail within a future period of time, so that corresponding measures can be taken to protect data safety and service continuity.
The embodiment also provides a sample acquiring device for hard disk data failure prediction, which is used for implementing the above embodiment and the preferred implementation, and is not described in detail. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
The present embodiment provides a sample acquiring apparatus for hard disk data failure prediction, as shown in fig. 3, including:
The first acquisition module 301 is configured to obtain a training sample of hard disk data and a sample label corresponding to the training sample;
a building module 302, configured to build a first fault model that generates a virtual sample according to the training sample; constructing a second fault model for generating a virtual label corresponding to the virtual sample according to the sample label;
A determining module 303, configured to determine a virtual sample and a virtual tag according to the training sample, the sample tag, the first fault model, and the second fault model;
The first obtaining module 304 is configured to obtain, according to the virtual sample and the virtual tag, a target training sample for hard disk data failure prediction and a target sample tag corresponding to the target training sample.
In some alternative embodiments, the build module 302 includes:
the first acquisition submodule is used for acquiring a sample mixing factor and a preset number of training samples;
The first construction submodule is used for constructing a first fault model according to the sample mixing factor and a preset number of training samples.
In some alternative embodiments, the build module 302 includes:
The second acquisition sub-module is used for acquiring a preset number of sample tags;
The first determining submodule is used for determining the label mixing factor according to the sample mixing factor and a preset number of training samples;
and the second construction submodule is used for constructing a second fault model according to the label mixing factor and the preset number of sample labels.
In some alternative embodiments, the second building sub-module comprises:
The acquisition unit is used for acquiring a preset number of sample labels;
The determining unit is used for determining the label mixing factor according to the sample mixing factor and a preset number of training samples;
the construction unit is used for constructing a second fault model according to the label mixing factor and the preset number of sample labels.
In some alternative embodiments, the determining unit comprises:
An acquisition subunit, configured to acquire a sample size of each training sample;
And the determining subunit is used for determining the label mixing factor according to the sample size, the preset decision boundary and the sample mixing factor.
In some alternative embodiments, in case the number of training samples is two, the determining subunit is specifically configured to:
Obtaining a quotient of a first sample size of a first training sample and a second sample size of a second training sample to obtain a target value;
obtaining a comparison result between the target value and a preset decision boundary;
and determining the label mixing factor according to the comparison result and the sample mixing factor.
In some alternative embodiments, the determining subunit is specifically configured to:
Acquiring a correlation factor obtained by the sample mixing factor under the condition that the comparison result meets the first condition;
and taking the maximum value between the sample mixing factor and the correlation factor as a label mixing factor.
In some alternative embodiments, the determining subunit is specifically configured to:
Acquiring a correlation factor obtained by the sample mixing factor under the condition that the comparison result meets a second condition;
The minimum value between the sample mixing factor and the correlation factor is taken as the label mixing factor.
In some alternative embodiments, the determining subunit is specifically configured to:
And taking the sample mixing factor as the label mixing factor when the comparison result meets the third condition.
In some alternative embodiments, the first obtaining module 304 includes:
The first adding sub-module is used for adding the virtual sample into the training sample to obtain a target training sample;
And the second adding sub-module is used for adding the virtual tag into the sample tag to obtain the target sample tag.
In some alternative embodiments, the determining module 303 includes:
and the second determining submodule is used for determining the virtual sample and the virtual label according to the training sample, the sample label, the sample mixing factor, the label mixing factor, the first fault model and the second fault model.
In some alternative embodiments, the apparatus further comprises:
The second obtaining module is used for inputting the target training sample into the initial fault prediction model to obtain a classification result after obtaining the target training sample for hard disk data fault prediction and the target sample label corresponding to the target training sample according to the virtual sample and the virtual label;
And the third obtaining module is used for adjusting model parameters of the initial fault prediction model according to the target sample label and the classification result to obtain the target fault prediction model for hard disk data fault prediction.
In some alternative embodiments, the apparatus further comprises:
the second acquisition module is used for acquiring target hard disk data to be subjected to fault prediction after the model parameters of the initial fault prediction model are adjusted according to the target sample labels and the classification results to obtain a target fault prediction model for hard disk data fault prediction;
and the fourth obtaining module is used for inputting the target hard disk data into the target fault prediction model to obtain a fault prediction result.
In some alternative embodiments, the first fault model is a first expression, the formula of the first expression being $\tilde{x} = \lambda_x x_i + (1-\lambda_x)\,x_j$, where $\tilde{x}$ is the virtual sample, $\lambda_x$ is the sample mixing factor, $x_i$ is the training sample under the first category, and $x_j$ is the training sample under the second category; the second fault model is a second expression, the formula of the second expression being $\tilde{y} = \lambda_y y_i + (1-\lambda_y)\,y_j$, where $\tilde{y}$ is the virtual label corresponding to the virtual sample, $\lambda_y$ is the label mixing factor, $y_i$ is the training sample label under the first category, and $y_j$ is the training sample label under the second category.
The sample acquiring device for hard disk data failure prediction in this embodiment is presented in the form of a functional unit, where a unit refers to an ASIC circuit, a processor and a memory executing one or more software or fixed programs, and/or other devices that can provide the above functions.
Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.
The embodiment of the disclosure also provides a computer device, which is provided with the sample acquisition device for hard disk data failure prediction shown in the figure 3.
Referring to fig. 4, which is a schematic structural diagram of a computer device according to an alternative embodiment of the disclosure, the computer device includes: one or more processors 10, a memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the computer device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple computer devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is illustrated in fig. 4.
The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip, among others. The hardware chip may be an application specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, a general-purpose array logic, or any combination thereof.
Wherein the memory 20 stores instructions executable by the at least one processor 10 to cause the at least one processor 10 to perform a method for implementing the embodiments described above.
The memory 20 may include a storage program area and a storage data area, where the storage program area may store an operating system and at least one application program required for functions, and the storage data area may store data created according to the use of the computer device, and the like. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, memory 20 may optionally include memory located remotely from processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; the memory 20 may also comprise a combination of the above types of memories.
The computer device also includes a communication interface 30 for the computer device to communicate with other devices or communication networks.
The embodiments of the present disclosure also provide a computer-readable storage medium. The methods described above according to the embodiments of the present disclosure may be implemented in hardware or firmware, or as computer code that can be recorded on a storage medium, or as computer code that is originally stored on a remote storage medium or a non-transitory machine-readable storage medium, downloaded over a network, and stored on a local storage medium, so that the methods described herein can be processed by such software stored on a storage medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware. The storage medium can be a magnetic disk, an optical disk, a read-only memory, a random access memory, a flash memory, a hard disk, a solid state disk or the like; further, the storage medium may also comprise a combination of memories of the kinds described above. It will be appreciated that a computer, processor, microprocessor controller or programmable hardware includes a storage element that can store or receive software or computer code which, when accessed and executed by the computer, processor or hardware, implements the methods illustrated by the above embodiments.
Although embodiments of the present disclosure have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the disclosure, and such modifications and variations are within the scope defined by the appended claims.
Claims (13)
1. A sample acquisition method for hard disk data fault prediction, the method comprising:
acquiring a training sample of hard disk data and a sample label corresponding to the training sample;
constructing, according to the training sample, a first fault model for generating a virtual sample; constructing, according to the sample label, a second fault model for generating a virtual label corresponding to the virtual sample;
wherein constructing the first fault model for generating the virtual sample according to the training sample comprises: acquiring a sample mixing factor and a preset number of training samples;
wherein constructing the second fault model for generating the virtual label corresponding to the virtual sample according to the sample label comprises:
acquiring the preset number of sample labels;
determining a label mixing factor according to the sample mixing factor and the preset number of training samples;
constructing the second fault model according to the label mixing factor and the preset number of sample labels;
wherein determining the label mixing factor according to the sample mixing factor and the preset number of training samples comprises:
acquiring the sample size of each training sample;
determining the label mixing factor according to the sample size, a preset decision boundary, and the sample mixing factor;
wherein, in the case that the number of training samples is two, determining the label mixing factor according to the sample size, the preset decision boundary, and the sample mixing factor comprises:
obtaining a quotient of a first sample size of a first training sample and a second sample size of a second training sample to obtain a target value;
obtaining a comparison result between the target value and the preset decision boundary;
determining the label mixing factor according to the comparison result and the sample mixing factor;
determining the virtual sample and the virtual label according to the training sample, the sample label, the first fault model, and the second fault model;
and obtaining, according to the virtual sample and the virtual label, a target training sample for hard disk data fault prediction and a target sample label corresponding to the target training sample.
2. The method of claim 1, wherein constructing a first fault model that generates virtual samples from the training samples comprises:
and constructing the first fault model according to the sample mixing factor and a preset number of training samples.
3. The method of claim 1, wherein said determining said label mixing factor based on said comparison result and said sample mixing factor comprises:
acquiring an association factor derived from the sample mixing factor when the comparison result meets a first condition;
and taking the larger of the sample mixing factor and the association factor as the label mixing factor.
4. The method of claim 1, wherein said determining said label mixing factor based on said comparison result and said sample mixing factor comprises:
acquiring an association factor derived from the sample mixing factor when the comparison result meets a second condition;
and taking the smaller of the sample mixing factor and the association factor as the label mixing factor.
5. The method of claim 1, wherein said determining said label mixing factor based on said comparison result and said sample mixing factor comprises:
taking the sample mixing factor as the label mixing factor when the comparison result meets a third condition.
6. The method according to claim 1, wherein the obtaining, according to the virtual sample and the virtual label, a target training sample for hard disk data fault prediction and a target sample label corresponding to the target training sample comprises:
adding the virtual sample to the training samples to obtain the target training sample;
and adding the virtual label to the sample labels to obtain the target sample label.
7. The method of claim 1, wherein the determining the virtual sample and the virtual label from the training sample, the sample label, the first fault model, and the second fault model comprises:
determining the virtual sample and the virtual label according to the training sample, the sample label, the sample mixing factor, the label mixing factor, the first fault model, and the second fault model.
8. The method according to claim 1, wherein after the obtaining, according to the virtual sample and the virtual label, a target training sample for hard disk data fault prediction and a target sample label corresponding to the target training sample, the method further comprises:
inputting the target training sample into an initial fault prediction model to obtain a classification result;
and adjusting model parameters of the initial fault prediction model according to the target sample label and the classification result to obtain a target fault prediction model for hard disk data fault prediction.
9. The method of claim 8, wherein after said adjusting model parameters of said initial fault prediction model based on said target sample label and said classification result to obtain a target fault prediction model for hard disk data fault prediction, said method further comprises:
obtaining target hard disk data to be subjected to fault prediction;
and inputting the target hard disk data into the target fault prediction model to obtain a fault prediction result.
10. The method of claim 1, wherein the first fault model is a first expression, the first expression being: $\tilde{x} = \lambda x_i + (1 - \lambda) x_j$, where $\tilde{x}$ is the virtual sample, $\lambda$ is the sample mixing factor, $x_i$ is a training sample under the first category, and $x_j$ is a training sample under the second category; and the second fault model is a second expression, the second expression being: $\tilde{y} = \lambda_y y_i + (1 - \lambda_y) y_j$, where $\tilde{y}$ is the virtual label corresponding to the virtual sample, $\lambda_y$ is the label mixing factor, $y_i$ is the training sample label under the first category, and $y_j$ is the training sample label under the second category.
11. A sample acquisition device for hard disk data fault prediction, the device comprising:
a first acquisition module, configured to acquire training samples of hard disk data and sample labels corresponding to the training samples;
a construction module, configured to construct, according to the training samples, a first fault model for generating a virtual sample, and to construct, according to the sample labels, a second fault model for generating a virtual label corresponding to the virtual sample;
wherein the construction module comprises:
a first acquisition submodule, configured to acquire a sample mixing factor and a preset number of training samples;
a second acquisition submodule, configured to acquire the preset number of sample labels;
a first determining submodule, configured to determine a label mixing factor according to the sample mixing factor and the preset number of training samples;
a second construction submodule, configured to construct the second fault model according to the label mixing factor and the preset number of sample labels;
wherein the first determining submodule is further configured to obtain the sample size of each training sample, and to determine the label mixing factor according to the sample size, a preset decision boundary, and the sample mixing factor;
wherein, in the case that the number of training samples is two, the first determining submodule determining the label mixing factor according to the sample size, the preset decision boundary, and the sample mixing factor comprises:
obtaining a quotient of a first sample size of a first training sample and a second sample size of a second training sample to obtain a target value;
obtaining a comparison result between the target value and the preset decision boundary;
determining the label mixing factor according to the comparison result and the sample mixing factor;
a determining module, configured to determine the virtual sample and the virtual label according to the training sample, the sample label, the first fault model, and the second fault model;
a first obtaining module, configured to obtain, according to the virtual sample and the virtual label, a target training sample for hard disk data fault prediction and a target sample label corresponding to the target training sample.
12. A computer device, comprising:
a memory and a processor communicatively coupled to each other, the memory storing computer instructions that, when executed by the processor, cause the processor to perform the sample acquisition method for hard disk data fault prediction of any one of claims 1 to 10.
13. A computer-readable storage medium having stored thereon computer instructions for causing a computer to execute the sample acquisition method for hard disk data fault prediction according to any one of claims 1 to 10.
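The flow recited in claims 1 through 7 can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the claimed implementation. The claims do not define the association factor or the three conditions, so the sketch assumes that the association factor is one minus the sample mixing factor, that the first condition means the first class is much smaller than the second (quotient at or below `1 / kappa`), that the second condition means it is much larger (quotient at or above `kappa`), and that the third condition covers every other case. It also assumes that both fault models are convex combinations of two samples (or two labels), in the style of mixup. All function names and the default boundary `kappa = 3.0` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def label_mixing_factor(lam: float, n_first: int, n_second: int,
                        kappa: float = 3.0) -> float:
    """Derive the label mixing factor from the class-size quotient.

    ASSUMPTIONS (not fixed by the claims): association factor = 1 - lam;
    first condition: quotient <= 1 / kappa; second condition:
    quotient >= kappa; third condition: anything in between.
    """
    if n_second <= 0 or kappa <= 1.0:
        raise ValueError("need n_second > 0 and kappa > 1")
    target_value = n_first / n_second   # quotient of the two sample sizes
    association = 1.0 - lam             # hypothetical association factor
    if target_value <= 1.0 / kappa:     # assumed first condition
        return max(lam, association)    # favour the (smaller) first class
    if target_value >= kappa:           # assumed second condition
        return min(lam, association)    # favour the (smaller) second class
    return lam                          # assumed third condition


def mix(first: Sequence[float], second: Sequence[float],
        factor: float) -> Vector:
    """Convex combination factor * first + (1 - factor) * second."""
    return [factor * a + (1.0 - factor) * b for a, b in zip(first, second)]


def make_virtual_pair(x_first: Sequence[float], y_first: Sequence[float],
                      x_second: Sequence[float], y_second: Sequence[float],
                      lam: float, n_first: int, n_second: int,
                      kappa: float = 3.0) -> Tuple[Vector, Vector]:
    """Build one virtual sample and its virtual label."""
    lam_y = label_mixing_factor(lam, n_first, n_second, kappa)
    virtual_sample = mix(x_first, x_second, lam)    # first fault model
    virtual_label = mix(y_first, y_second, lam_y)   # second fault model
    return virtual_sample, virtual_label


def augment(samples: Sequence[Vector], labels: Sequence[Vector],
            virtual_sample: Vector,
            virtual_label: Vector) -> Tuple[List[Vector], List[Vector]]:
    """Append the virtual pair to the training set (as in claim 6)."""
    return list(samples) + [virtual_sample], list(labels) + [virtual_label]


if __name__ == "__main__":
    # Toy data: 900 healthy disks (class 1) versus 100 failed disks (class 2).
    healthy_x, healthy_y = [0.0, 10.0], [1.0, 0.0]
    failed_x, failed_y = [4.0, 2.0], [0.0, 1.0]
    x_virtual, y_virtual = make_virtual_pair(
        healthy_x, healthy_y, failed_x, failed_y,
        lam=0.75, n_first=900, n_second=100)
    print(x_virtual, y_virtual)
```

In this toy run the virtual sample lies closer to the healthy disk, while the virtual label leans toward the failed class, which is the rebalancing effect the separate label mixing factor is meant to provide under the assumptions above.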
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410347260.XA CN117951529B (en) | 2024-03-26 | 2024-03-26 | Sample acquisition method, device and equipment for hard disk data fault prediction |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410347260.XA CN117951529B (en) | 2024-03-26 | 2024-03-26 | Sample acquisition method, device and equipment for hard disk data fault prediction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117951529A CN117951529A (en) | 2024-04-30 |
CN117951529B CN117951529B (en) | 2024-06-21 |
Family ID=90805535
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410347260.XA Active CN117951529B (en) | 2024-03-26 | 2024-03-26 | Sample acquisition method, device and equipment for hard disk data fault prediction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117951529B (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113449618A (en) * | 2021-06-17 | 2021-09-28 | 南京航空航天大学 | Method for carrying out deep learning rolling bearing fault diagnosis based on feature fusion and mixed enhancement |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112434733B (en) * | 2020-11-17 | 2024-04-02 | 西安交通大学 | Small-sample hard disk fault data generation method, storage medium and computing device |
CN112765662B (en) * | 2021-01-22 | 2022-06-03 | 电子科技大学 | Method for supporting privacy protection of training integrator under deep learning |
CN115272797A (en) * | 2022-07-29 | 2022-11-01 | 腾讯科技(深圳)有限公司 | Training method, using method, device, equipment and storage medium of classifier |
CN116956048B (en) * | 2023-09-19 | 2023-12-15 | 北京航空航天大学 | Industrial equipment fault diagnosis method and device based on cross-domain generalized label |
CN117571312A (en) * | 2023-11-16 | 2024-02-20 | 中国航空综合技术研究所 | Rotary machine fault diagnosis method for noise label industrial scene |
Also Published As
Publication number | Publication date |
---|---|
CN117951529A (en) | 2024-04-30 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN108923952B (en) | Fault diagnosis method, equipment and storage medium based on service monitoring index | |
CN108683562B (en) | Anomaly detection positioning method and device, computer equipment and storage medium | |
CN108683530B (en) | Data analysis method and device for multi-dimensional data and storage medium | |
CN107025153B (en) | Disk failure prediction method and device | |
CN103988175A (en) | Methods and systems for identifying action for responding to anomaly in cloud computing system | |
CN116450399B (en) | Fault diagnosis and root cause positioning method for micro service system | |
CN112214369A (en) | Hard disk fault prediction model establishing method based on model fusion and application thereof | |
CN113010389A (en) | Training method, fault prediction method, related device and equipment | |
CN111881023B (en) | Software aging prediction method and device based on multi-model comparison | |
CN113837596B (en) | Fault determination method and device, electronic equipment and storage medium | |
CN110990575B (en) | Test case failure cause analysis method and device and electronic equipment | |
CN113127342B (en) | Defect prediction method and device based on power grid information system feature selection | |
CN117951529B (en) | Sample acquisition method, device and equipment for hard disk data fault prediction | |
CN112015995A (en) | Data analysis method, device, equipment and storage medium | |
WO2023239461A1 (en) | Capacity aware cloud environment node recovery system | |
CN116319255A (en) | Root cause positioning method, device, equipment and storage medium based on KPI | |
WO2022000285A1 (en) | Health index of a service | |
JP2022174425A (en) | Data division device, data division method and program | |
JP6588494B2 (en) | Extraction apparatus, analysis system, extraction method, and extraction program | |
CN111985651A (en) | Operation and maintenance method and device for business system | |
CN117421145B (en) | Heterogeneous hard disk system fault early warning method and device | |
CN113554126B (en) | Sample evaluation method, device, equipment and computer readable storage medium | |
CN109474445B (en) | Distributed system root fault positioning method and device | |
CN117371506A (en) | Model training method, model testing device, electronic equipment and storage medium | |
CN118534291A (en) | Method, system, equipment and storage medium for testing power supply chip |
Legal Events
Code | Title |
---|---|
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
GR01 | Patent grant |