CN117591351A

CN117591351A - Disk fault detection model training method and disk fault detection method

Info

Publication number: CN117591351A
Application number: CN202311541143.9A
Authority: CN
Inventors: 苏冉; 李青
Original assignee: Jinan Inspur Data Technology Co Ltd
Current assignee: Jinan Inspur Data Technology Co Ltd
Priority date: 2023-11-17
Filing date: 2023-11-17
Publication date: 2024-02-23

Abstract

The invention relates to the technical field of computers and discloses a training method for a disk fault detection model and a disk fault detection method. The training method comprises the following steps: acquiring sample detection information of a sample disk in a super-fusion all-in-one machine; performing feature extraction on the sample detection information to obtain sample feature data; obtaining a plurality of sample training sets and sample random features based on the sample feature data; obtaining a sample test set based on the plurality of sample training sets and the sample detection information; and training a preset fault detection model based on each sample training set, the sample random features and the sample test set to obtain a fault detection sub-model corresponding to each sample training set, thereby obtaining a target fault detection model that comprises the fault detection sub-models corresponding to the sample training sets. The scheme provided by the embodiments of the invention can improve the product experience and the data guarantee accuracy.

Description

Disk fault detection model training method and disk fault detection method

Technical Field

The invention relates to the technical field of computers, in particular to a training method of a disk fault detection model and a disk fault detection method.

Background

In the related art, a super-fusion all-in-one machine product usually detects disk damage according to general bad-disk detection standards. However, such standards can falsely report a bad disk while a service task is being executed, so the disk alarm information generated is inaccurate. This leads to excessive or untimely operation and maintenance, which in turn reduces the product experience and the data guarantee accuracy.

Disclosure of Invention

In view of the above, the invention provides a training method of a disk fault detection model and a disk fault detection method, which can improve product experience and data guarantee accuracy.

In a first aspect, the present invention provides a method for training a disk failure detection model, where the method includes:

acquiring sample detection information of a sample disk in the super fusion all-in-one machine;

extracting the characteristics of the sample detection information to obtain sample characteristic data;

based on the sample feature data, obtaining a plurality of sample training sets and sample random features;

Obtaining a sample test set based on the plurality of sample training sets and the sample detection information;

and training a preset fault detection model based on each sample training set, the sample random characteristics and the sample testing set to obtain fault detection sub-models corresponding to each sample training set so as to obtain a target fault detection model, wherein the target fault detection model comprises fault detection sub-models corresponding to each sample training set and a data fusion module, and the data fusion module is used for carrying out fusion on the output of each fault detection sub-model and outputting the fused output so as to obtain the output result of the target fault detection model.

According to the training method of the disk fault detection model provided by the embodiment of the invention, sample detection information of a sample disk of the super-fusion all-in-one machine is acquired, and feature extraction is performed on it to obtain sample feature data, from which a plurality of sample training sets and sample random features can be obtained. A sample test set is obtained based on the plurality of sample training sets and the sample detection information. A preset fault detection model is then trained based on each sample training set, the sample random features and the sample test set to obtain a fault detection sub-model corresponding to each sample training set, and thereby a target fault detection model. The target fault detection model comprises the fault detection sub-models corresponding to the sample training sets and a data fusion module, which fuses the outputs of the fault detection sub-models and outputs the fused result as the output result of the target fault detection model. Disk faults can therefore be detected through the target fault detection model, which improves the accuracy of disk fault detection, avoids false bad-disk alarms while service tasks are being executed, and improves the product experience and the data guarantee accuracy of the super-fusion all-in-one machine.

In some optional embodiments, the feature extracting the sample detection information to obtain sample feature data includes:

inputting the sample detection information into a preset variable container model to obtain a dictionary list;

and extracting dictionary features of the dictionary list to obtain sample feature data.

According to the training method of the disk fault detection model provided by the embodiment of the invention, the sample detection information is input into the preset variable container model to obtain a dictionary list, and dictionary feature extraction is performed on the dictionary list to obtain the sample feature data, so that redundant information in the sample detection information is removed and the most important and essential sample feature data is retained.

In some optional embodiments, the obtaining a plurality of sample training sets and sample random features based on the sample feature data includes:

converting the data type of the sample characteristic data to obtain converted sample characteristic data;

sampling the converted sample characteristic data with replacement a plurality of times to obtain a plurality of sample training sets;

and randomly generating the sample random features based on the converted sample feature data.

According to the training method for the disk fault detection model, data type conversion is performed on the sample characteristic data to obtain converted sample characteristic data, and the converted sample characteristic data is sampled with replacement a plurality of times to obtain a plurality of sample training sets, which improves the training effect on the target fault detection model. The sample random features are randomly generated based on the converted sample characteristic data, which improves the reliability of inference on the converted sample characteristic data.

In a second aspect, the present invention provides a method for detecting a disk failure, which is applied to a super-fusion integrated machine, and the method includes:

acquiring target detection information of a target disk of the super fusion all-in-one machine;

extracting the characteristics of the target detection information to obtain target characteristic data;

and inputting the target characteristic data into a target fault detection model to obtain a detection result, wherein the target fault detection model is trained according to the training method of the target fault detection model of the first aspect or any corresponding implementation mode of the first aspect.

According to the disk fault detection method provided by the embodiment of the invention, target detection information of a target disk of the super-fusion all-in-one machine is acquired, and feature extraction is performed on it to obtain target characteristic data. The target characteristic data is input into the target fault detection model to obtain a detection result, from which it can be determined whether the corresponding target disk is damaged. The target fault detection model therefore improves the accuracy of disk fault detection, avoids false bad-disk alarms while a service task is being executed, and improves the product experience and the data guarantee accuracy of the super-fusion all-in-one machine.

In some alternative embodiments, the detection result includes a target accuracy and a detection result, the method further comprising:

if the detection result represents that the target disk is damaged and the target accuracy is greater than or equal to a preset accuracy, determining the target disk as a damaged disk;

or,

and if the detection result represents that the target disk is damaged and the target accuracy is smaller than the preset accuracy, determining the target disk as an undamaged disk.
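The two branches above can be sketched as a small decision function. This is only an illustration: the names DetectionResult, classify_disk and preset_accuracy are hypothetical, and treating a "not damaged" prediction as an undamaged disk is an assumption, since the text only describes the two "damaged" branches.

```python
from dataclasses import dataclass


@dataclass
class DetectionResult:
    damaged: bool    # whether the model output indicates a damaged disk
    accuracy: float  # the "target accuracy" reported with the output


def classify_disk(result: DetectionResult, preset_accuracy: float) -> str:
    """Return "damaged" only when the prediction is damaged AND accurate enough."""
    if result.damaged and result.accuracy >= preset_accuracy:
        return "damaged"
    # A "damaged" prediction below the preset accuracy counts as undamaged.
    # Assumption: a "not damaged" prediction also counts as undamaged.
    return "undamaged"


print(classify_disk(DetectionResult(True, 0.93), preset_accuracy=0.90))
print(classify_disk(DetectionResult(True, 0.75), preset_accuracy=0.90))
```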

In some optional embodiments, the obtaining the target detection information of the target disk of the super fusion all-in-one machine includes:

acquiring target error reporting times of the candidate disk of the super fusion integrated machine within preset time;

acquiring the safety sensitivity corresponding to the candidate magnetic disk;

determining preset error reporting times based on the safety sensitivity;

comparing the target error reporting times with the preset error reporting times to obtain a comparison result;

and if the comparison result represents that the target error reporting times are greater than or equal to the preset error reporting times, determining the candidate disk as the target disk so as to obtain target detection information of the target disk.
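The screening steps above can be sketched as follows. The mapping from safety sensitivity to a preset error count is not given in the text, so the PRESET_ERROR_COUNTS table, its values, and the field names are all hypothetical.

```python
# Hypothetical mapping: a more sensitive disk tolerates fewer error reports.
PRESET_ERROR_COUNTS = {"high": 3, "medium": 10, "low": 30}


def preset_error_count(safety_sensitivity: str) -> int:
    """Determine the preset error count from the safety sensitivity."""
    return PRESET_ERROR_COUNTS[safety_sensitivity]


def select_target_disks(candidates):
    """Keep candidate disks whose error count within the preset time
    is greater than or equal to their preset error count."""
    targets = []
    for disk in candidates:
        threshold = preset_error_count(disk["safety_sensitivity"])
        if disk["error_count"] >= threshold:
            targets.append(disk["name"])
    return targets


candidates = [
    {"name": "sda", "error_count": 4, "safety_sensitivity": "high"},
    {"name": "sdb", "error_count": 4, "safety_sensitivity": "low"},
    {"name": "sdc", "error_count": 10, "safety_sensitivity": "medium"},
]
print(select_target_disks(candidates))
```

Only the disks selected this way would have their detection information collected and passed to the model.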

In some alternative embodiments, the method further comprises:

determining the disk capacity ratio of the target disk;

determining the storage type of the target disk based on the disk capacity ratio;

determining a detection mode corresponding to the target disk based on the storage type of the target disk, wherein the detection mode comprises in-band detection;

if the detection mode is in-band detection, acquiring input and output state information of a preset storage capacity level in the target disk;

if the input and output state information represents that the damaged disk block exists in the target disk, determining a normal disk block in the target disk;

mapping the data information in the damaged disk block to the normal disk block, and generating bad block alarm information;

determining the number of the bad block alarm information;

if the number of the bad block alarm information is greater than or equal to the preset number, generating bad disk alarm information;

and sending the bad disk alarm information to a target object so that the target object carries out fault processing on the target disk.
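The in-band handling above can be sketched roughly as follows, assuming the input and output state information has already been reduced to a per-block healthy/damaged flag. The function name, the alarm strings and the choice of the first available normal block are hypothetical, and sending the alarm to the target object is left out.

```python
from typing import Dict, List, Optional, Tuple


def remap_and_alarm(
    block_ok: Dict[int, bool], preset_count: int
) -> Tuple[Dict[int, int], List[str], Optional[str]]:
    """Map damaged blocks to normal blocks and raise alarms."""
    damaged = [b for b, ok in block_ok.items() if not ok]
    normal = [b for b, ok in block_ok.items() if ok]

    remap: Dict[int, int] = {}
    bad_block_alarms: List[str] = []
    for bad in damaged:
        if not normal:
            break  # no normal block left to receive the data
        good = normal.pop(0)
        remap[bad] = good
        bad_block_alarms.append(f"bad block {bad} remapped to block {good}")

    # A bad disk alarm is generated once enough bad block alarms exist.
    bad_disk_alarm = None
    if len(bad_block_alarms) >= preset_count:
        bad_disk_alarm = f"bad disk: {len(bad_block_alarms)} bad block alarms"
    return remap, bad_block_alarms, bad_disk_alarm


status = {0: True, 1: False, 2: True, 3: False, 4: True}
print(remap_and_alarm(status, preset_count=2))
```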

In a third aspect, the present invention provides a training apparatus for a disk failure detection model, including:

The first acquisition module is used for acquiring sample detection information of a sample disk in the super fusion all-in-one machine;

the first feature extraction module is used for carrying out feature extraction on the sample detection information to obtain sample feature data;

the first determining module is used for obtaining a plurality of sample training sets and sample random features based on the sample feature data;

the second determining module is used for obtaining a sample test set based on the plurality of sample training sets and the sample detection information;

the target fault detection model generation module is used for training a preset fault detection model based on each sample training set, the sample random characteristics and the sample testing set to obtain a fault detection sub-model corresponding to each sample training set so as to obtain a target fault detection model, wherein the target fault detection model comprises fault detection sub-models corresponding to each sample training set and a data fusion module, and the data fusion module is used for carrying out fusion on the output of each fault detection sub-model and then outputting the fused output so as to obtain the output result of the target fault detection model.

In some optional embodiments, the first feature extraction module specifically includes:

The dictionary list generation sub-module is used for inputting the sample detection information into a preset variable container model to obtain a dictionary list;

and the characteristic extraction sub-module is used for extracting dictionary characteristics of the dictionary list to obtain sample characteristic data.

In some alternative embodiments, the first determining module specifically includes:

the conversion sub-module is used for carrying out data type conversion on the sample characteristic data to obtain converted sample characteristic data;

the sampling sub-module is used for sampling the converted sample characteristic data with replacement a plurality of times to obtain a plurality of sample training sets;

and the sample random feature generation sub-module is used for randomly generating the sample random features based on the converted sample feature data.

In a fourth aspect, the present invention provides a disk failure detection apparatus, including:

the second acquisition module is used for acquiring target detection information of a target disk of the super fusion all-in-one machine;

the second feature extraction module is used for carrying out feature extraction on the target detection information to obtain target feature data;

the prediction module is configured to input the target feature data to a target fault detection model, so as to obtain a detection result, where the target fault detection model is trained according to the training method of the target fault detection model according to the first aspect or any implementation mode corresponding to the first aspect.

In some optional embodiments, the detection result includes a target accuracy and a detection result, and the disk failure detection apparatus further includes:

the third determining module is used for determining the target disk as a damaged disk if the detection result represents that the target disk is damaged and the target accuracy is greater than or equal to a preset accuracy; or if the detection result represents that the target disk is damaged and the target accuracy is smaller than the preset accuracy, determining the target disk as an undamaged disk.

In some optional embodiments, the second obtaining module specifically includes:

the first acquisition sub-module is used for acquiring target error reporting times of the candidate disk of the super-fusion all-in-one machine within preset time;

the second acquisition sub-module is used for acquiring the safety sensitivity corresponding to the candidate magnetic disk;

the first determining submodule is used for determining preset error reporting times based on the safety sensitivity;

the comparison sub-module is used for comparing the target error reporting times with the preset error reporting times to obtain a comparison result;

and the second determining submodule is used for determining the candidate disk as the target disk if the comparison result represents that the target error reporting times are larger than or equal to the preset error reporting times so as to obtain target detection information of the target disk.

In some alternative embodiments, the disk failure detection apparatus further includes:

a fourth determining module, configured to determine a disk capacity ratio of the target disk;

a fifth determining module, configured to determine a storage type of the target disk based on the disk capacity ratio;

a sixth determining module, configured to determine a detection mode corresponding to the target disk based on a storage type of the target disk, where the detection mode includes in-band detection;

the third acquisition module is used for acquiring input and output state information of a preset storage capacity level in the target disk if the detection mode is in-band detection;

a seventh determining module, configured to determine a normal disk block in the target disk if the input/output status information indicates that a damaged disk block exists in the target disk;

the bad block alarm information generating module is used for mapping the data information in the damaged disk block to the normal disk block and generating bad block alarm information;

the quantity determining module is used for determining the quantity of the bad block alarm information;

the bad disc alarm information generation module is used for generating bad disc alarm information if the number of the bad block alarm information is greater than or equal to the preset number;

And the sending module is used for sending the bad disk alarm information to a target object so that the target object can perform fault processing on the target disk.

In a fifth aspect, the present invention provides a computer device comprising a memory and a processor that are communicatively connected, where the memory stores computer instructions and the processor executes the computer instructions to perform the training method of the disk fault detection model according to the first aspect or any implementation manner corresponding to the first aspect, or the disk fault detection method according to the second aspect or any implementation manner corresponding to the second aspect.

In a sixth aspect, the present invention provides a computer readable storage medium having stored thereon computer instructions for causing a computer to execute the training method of the disk failure detection model of the first aspect or any one of the embodiments corresponding thereto, or the disk failure detection method of the second aspect or any one of the embodiments corresponding thereto.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of an alternative super-fusion all-in-one machine according to an embodiment of the present disclosure;

FIG. 2 is a flow chart of a method of training a disk failure detection model according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a target fault detection model according to an embodiment of the present invention;

FIG. 4 is a flow chart of another method of training a disk failure detection model according to an embodiment of the present invention;

FIG. 5 is a flow chart of a method of disk failure detection according to an embodiment of the present invention;

FIG. 6 is a flow chart of another method of disk failure detection according to an embodiment of the present invention;

FIG. 7 is a flow chart of yet another method of disk failure detection according to an embodiment of the present invention;

FIG. 8 is a schematic diagram of a training device for a disk failure detection model according to an embodiment of the present application;

FIG. 9 is a schematic diagram of a disk failure detection apparatus according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of an alternative computer device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to fig. 1, fig. 1 is a schematic diagram of an optional super-fusion all-in-one machine provided in an embodiment of the present application, where the super-fusion all-in-one machine includes a plurality of target disks and a target failure detection model.

The super-fusion all-in-one machine can acquire target detection information of a plurality of target disks, then performs feature extraction on the target detection information to obtain target feature data, finally inputs the target feature data into a target fault detection model to obtain a detection result, and can determine whether the corresponding target disk is damaged according to the detection result, so that the accuracy of disk fault detection can be improved through the target fault detection model, the situation of false alarm of the bad disk is avoided when a service task is executed, and further product experience and data guarantee accuracy of the super-fusion all-in-one machine are improved.

It should be noted that the super-fusion all-in-one machine (that is, a hyper-converged appliance) provides high-speed storage using the local disks of its servers, and fully exploits the performance of those local disks through a distributed storage mechanism, achieving performance that can even exceed that of independent centralized storage.

Notably, building and using super-fusion all-in-one solutions has become mainstream in the data centers of large enterprises and industries. In a data center built on super-fusion all-in-one machines, the servers are connected over a network, and storage, advanced networking and the like are implemented in software, so that key resources such as computing, storage and networking are provided stably and conveniently, and operation and maintenance are simple.

The storage module for storing data in the super fusion architecture is mainly realized by adopting a server local disk comprising a cache disk and a large-capacity data disk, so that the detection of the disk state directly relates to the safety of enterprise data and is an important function.

At present, however, super-fusion all-in-one machine products detect disk damage mainly by reference to general bad-disk detection standards. For example, using the SMART information detection mechanism, some parameters suited to the product are selected and given detection thresholds; when a disk reaches a threshold during operation, a corresponding alarm is triggered to remind the user that the disk is at risk of failure and needs operation and maintenance. This mechanism has obvious shortcomings. The super-fusion all-in-one machine is compatible with hard disk models from many major manufacturers, and its hard disk compatibility list is extensive. Hard disks of different models and batches follow different detection standards, so a single unified standard is not applicable to all hard disks. In practice, this shows up as false alarms while the customer's services are running normally, alarms that are triggered too frequently, and alarms that are late or inaccurate. Inaccurate disk alarm information leads to excessive or untimely operation and maintenance and greatly reduces the product experience and the data guarantee accuracy. Moreover, a more complete disk detection scheme is currently lacking, which is a common pain point for products of this type across the industry.

Based on the above, the embodiment of the invention provides a training method of a disk fault detection model and a disk fault detection method, which can improve product experience and data guarantee accuracy.

In accordance with an embodiment of the present invention, there is provided a training method embodiment of a disk failure detection model, it being noted that the steps illustrated in the flowchart of the figures may be performed in a computer system, such as a set of computer-executable instructions, and that, although a logical sequence is illustrated in the flowchart, in some cases, the steps illustrated or described may be performed in a different order than that illustrated herein.

The embodiment provides a training method of a disk fault detection model, which can be used for the super-fusion all-in-one machine. FIG. 2 is a flowchart of a method of training a disk failure detection model according to an embodiment of the present invention, as shown in FIG. 2, the flowchart including the steps of:

Step S201, sample detection information of a sample disk in the super fusion all-in-one machine is obtained.

The sample detection information is SMART (Self-Monitoring, Analysis and Reporting Technology) information, which includes parameters such as disk temperature, seek error rate and failure prediction value.

For example, the sample detection information may be SMART information from disks in customer deployments of the super-fusion all-in-one machine that have been in use for less than one year, together with SMART information from disks that have been in use for more than 3 years.

It should be noted that the super-fusion all-in-one machine is a product combining software and hardware, and its disks remain in service for more than 3 years, so sample detection information can be obtained from sample disks with a high degree of wear.

Step S202, extracting features of the sample detection information to obtain sample feature data.

Feature extraction on the sample detection information means filtering and selecting from the sample detection information to obtain the sample feature data. The sample feature data includes 11 indexes, listed in Table 1, among which the point-to-point error detection count represents the number of errors in the mapping between logical block addresses and real physical addresses in an SSD. The sample feature data may also consist of other indexes, determined according to actual requirements; this is not specifically limited here.

TABLE 1

Step S203, based on the sample feature data, a plurality of sample training sets and sample random features are obtained.

Specifically, the sample feature data can be sampled multiple times, with a first preset number of sample feature data items drawn each time; the data obtained from each sampling is used as one sample training set, so that a plurality of sample training sets is obtained. A second preset number of sample random features is randomly selected from the sample feature data. The first preset number and the second preset number may be determined according to the actual situation and are not specifically limited here.
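A minimal sketch of this step, using only the Python standard library, is given below. It assumes sampling with replacement, as in the optional embodiment described earlier; the function names, the seed parameter and the example feature names are hypothetical.

```python
import random


def bootstrap_training_sets(rows, n_sets, set_size, seed=0):
    """Draw n_sets training sets, each with set_size rows sampled
    with replacement (set_size is the "first preset number")."""
    rng = random.Random(seed)
    return [[rng.choice(rows) for _ in range(set_size)] for _ in range(n_sets)]


def pick_random_features(feature_names, n_features, seed=0):
    """Randomly select n_features distinct features
    (n_features is the "second preset number")."""
    rng = random.Random(seed)
    return rng.sample(feature_names, n_features)


rows = list(range(10))  # stand-ins for rows of sample feature data
training_sets = bootstrap_training_sets(rows, n_sets=3, set_size=8)
random_features = pick_random_features(
    ["temperature", "seek_error_rate", "reallocated_sectors", "power_on_hours"],
    n_features=2,
)
print(training_sets)
print(random_features)
```

Because rows are drawn with replacement, a training set may contain the same row more than once and may leave some rows out entirely.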

Step S204, a sample test set is obtained based on the plurality of sample training sets and the sample detection information.

The sample detection information comprises the sample training set and the sample test set; therefore, the sample test set is obtained by removing the sample training set from the sample detection information.

The number of the sample test sets may be 1 or more. Specifically, if the number of sample test sets is 1, the sample test set may be obtained by subtracting the first sample training set from the sample detection information; if the number of the sample test sets is multiple, the number of the sample test sets is the same as the number of the sample training sets, and the sample test sets correspond to the sample training sets, i.e. the first sample test set can be obtained by subtracting the first sample training set from the sample detection information, the second sample test set can be obtained by subtracting the second sample training set from the sample detection information, and the like.

Step S205, training a preset fault detection model based on each sample training set, sample random characteristics and sample test sets to obtain a fault detection sub-model corresponding to each sample training set so as to obtain a target fault detection model.

As shown in FIG. 3, the target fault detection model includes the fault detection sub-model corresponding to each sample training set and a data fusion module, where the data fusion module fuses the outputs of the fault detection sub-models and outputs the fused result as the output result of the target fault detection model. To fuse the outputs, the average of the sub-model outputs may be calculated to obtain the final detection result; alternatively, a weighted average of the sub-model outputs may be calculated; alternatively, a standard deviation calculation may be performed on the sub-model outputs to obtain the final detection result. Other fusion methods are possible and are not listed here.
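The first two fusion options can be sketched as below, assuming each sub-model outputs a numeric score such as a fault probability. The function names and the example weights are hypothetical; the text does not say how weights would be chosen.

```python
def fuse_mean(outputs):
    """Plain average of the sub-model outputs."""
    return sum(outputs) / len(outputs)


def fuse_weighted_mean(outputs, weights):
    """Weighted average of the sub-model outputs."""
    if len(outputs) != len(weights):
        raise ValueError("outputs and weights must have the same length")
    return sum(o * w for o, w in zip(outputs, weights)) / sum(weights)


sub_model_outputs = [0.2, 0.4, 0.9]
print(fuse_mean(sub_model_outputs))
print(fuse_weighted_mean(sub_model_outputs, weights=[1, 1, 2]))
```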

For example, the sample training set and sample test set required for modeling can be obtained with the train_test_split function of sklearn, imported with from sklearn.model_selection import train_test_split and called as x_train, x_test, y_train, y_test = train_test_split(data_features, data_targets, test_size=xx). A sample training set and the sample random features are input into the preset fault detection model, the sample training set and the sample test set are supplied, and the fault detection sub-model corresponding to that sample training set is obtained by training. The next sample training set and the sample random features are then input into the preset fault detection model in the same way, together with the next sample training set and sample test set, to train the fault detection sub-model corresponding to the next sample training set, and so on until all fault detection sub-models have been trained.
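A runnable version of the split call might look like the sketch below. The feature rows, the labels, and the values test_size=0.25 and random_state=0 are illustrative assumptions; the text leaves test_size unspecified.

```python
from sklearn.model_selection import train_test_split

# Hypothetical feature rows: [disk temperature, seek error count].
data_features = [
    [35, 0], [41, 2], [52, 9], [38, 1],
    [60, 14], [44, 3], [57, 11], [36, 0],
]
# Hypothetical labels: 1 = damaged disk, 0 = healthy disk.
data_targets = [0, 0, 1, 0, 1, 0, 1, 0]

x_train, x_test, y_train, y_test = train_test_split(
    data_features,
    data_targets,
    test_size=0.25,   # assumed value
    random_state=0,   # fixed seed so the split is repeatable
)
print(len(x_train), len(x_test))
```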

By way of example, all fault detection sub-models are aggregated through a Bagging algorithm to obtain a target fault detection model, and because the target fault detection model is composed of a plurality of fault detection sub-models, the prediction accuracy of the target fault detection model is higher than that of a single fault detection sub-model, and the data processing capacity and the processing efficiency can be improved based on the target fault detection model.
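One way to realize such an aggregate with scikit-learn is BaggingClassifier, sketched below. The text does not name the sub-model type, so the decision tree base estimator, the number of sub-models and the toy data are assumptions made only for illustration.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Hypothetical feature rows and labels (1 = damaged disk).
X = [[35, 0], [41, 2], [52, 9], [38, 1], [60, 14], [44, 3], [57, 11], [36, 0]]
y = [0, 0, 1, 0, 1, 0, 1, 0]

model = BaggingClassifier(
    DecisionTreeClassifier(),  # assumed sub-model type
    n_estimators=5,            # number of fault detection sub-models
    bootstrap=True,            # each sub-model sees a resampled training set
    random_state=0,
)
model.fit(X, y)

# predict_proba averages the sub-model outputs, which plays the role
# of the data fusion module in this sketch.
proba = model.predict_proba([[58, 12]])
print(len(model.estimators_), proba)
```

BaggingClassifier also accepts max_features, which restricts each sub-model to a random subset of features and loosely corresponds to the sample random features.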

According to the training method of the disk fault detection model provided by the embodiment of the invention, sample detection information of a sample disk of the super-fusion all-in-one machine is acquired, and feature extraction is performed on it to obtain sample feature data, from which a plurality of sample training sets and sample random features can be obtained. A sample test set is obtained based on the plurality of sample training sets and the sample detection information. A fault detection sub-model corresponding to each sample training set is then obtained by training based on each sample training set, the sample random features and the sample test set, giving a target fault detection model that comprises the fault detection sub-models corresponding to the sample training sets. Disk faults can therefore be detected through the target fault detection model, which improves the accuracy of disk fault detection, avoids false bad-disk alarms while service tasks are being executed, and improves the product experience and the data guarantee accuracy of the super-fusion all-in-one machine.

The embodiment provides a training method of a disk fault detection model, which can be used for the super-fusion all-in-one machine. FIG. 4 is a flowchart of a method of training a disk failure detection model according to an embodiment of the present invention, as shown in FIG. 4, the flowchart including the steps of:

Step S401, sample detection information of a sample disk in the super fusion all-in-one machine is obtained.

For details, refer to step S201 in the embodiment shown in FIG. 2; they are not repeated here.

Step S402, extracting features of the sample detection information to obtain sample feature data.

Specifically, step S402 includes:

Step S4021, inputting the sample detection information into a preset variable container model to obtain a dictionary list.

The preset variable container model is a dictionary, and the dictionary list is a list formed by the dictionary. Thus, the sample detection information is input to the dictionary, and a dictionary list can be obtained.

Step S4022, extracting dictionary features from the dictionary list to obtain sample feature data.

The sample feature data is a sparse matrix.

Specifically, the dictionary list is input into DictVectorizer() for dictionary feature extraction, yielding the sample feature data.

In the embodiment of the application, the sample detection information is input into the preset variable container model to obtain the dictionary list, and the dictionary feature extraction is performed on the dictionary list to obtain the sample feature data, so that redundant information in the sample detection information can be removed, and the most important and essential sample feature data can be reserved.

Step S403, obtaining a plurality of sample training sets and sample random features based on the sample feature data.

Specifically, step S403 includes:

Step S4031, performing data type conversion on the sample feature data to obtain converted sample feature data.

The data type of the sample feature data is a character string type, and the data type of the converted sample feature data is a numerical type. That is, the data type conversion converts the sample feature data from the character string type to the numerical type to obtain the converted sample feature data.
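A minimal sketch of the string-to-numeric conversion follows. It assumes each record is a dictionary whose values are numeric strings; the function name to_numeric and the field names are hypothetical, and the source does not say how non-numeric strings are handled.

```python
def to_numeric(record):
    """Convert every string value in a feature record to a float."""
    return {name: float(value) for name, value in record.items()}


# Hypothetical record with string-typed feature values.
raw_record = {"temperature": "38", "seek_error_rate": "0.01"}
converted_record = to_numeric(raw_record)
print(converted_record)
```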

Step S4032, sampling the converted sample feature data with replacement multiple times to obtain a plurality of sample training sets.

Specifically, the converted sample feature data is sampled with replacement. Each round of N draws yields one sample training set K, and repeating the sampling with replacement multiple times yields a plurality of sample training sets, where N is a positive integer and K is the number of sample training subsets in the sample training set.
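Sampling with replacement (bootstrap sampling) can be sketched as below. The function name, parameter names and toy rows are hypothetical, and the fixed seed is only there to make the sketch repeatable.

```python
import random


def bootstrap_training_sets(samples, draws_per_set, num_sets, seed=0):
    """Build num_sets training sets, each from draws_per_set draws with replacement."""
    rng = random.Random(seed)
    return [
        [rng.choice(samples) for _ in range(draws_per_set)]
        for _ in range(num_sets)
    ]


# Hypothetical converted sample feature data (one row per disk).
converted = [[38.0, 0.0], [55.0, 30.0], [41.0, 2.0], [60.0, 45.0]]
training_sets = bootstrap_training_sets(converted, draws_per_set=4, num_sets=3)
for training_set in training_sets:
    print(training_set)
```

Because each draw is made with replacement, a row may appear more than once in a training set and some rows may not appear at all.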

Step S4033, randomly generating sample random features based on the converted sample feature data.

Specifically, m sample random features may be randomly generated from the converted sample feature data by a random sample generator, where m < M and M is the number of the converted sample feature data.
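The random feature selection can be sketched as follows, writing m for the number of randomly chosen features and M for the total number available. The source does not name the generator; Python's random.sample is assumed here, and the feature names are hypothetical.

```python
import random


def pick_random_features(feature_names, m, seed=0):
    """Pick m distinct features at random, requiring 0 < m < M."""
    if not 0 < m < len(feature_names):
        raise ValueError("m must satisfy 0 < m < M")
    return random.Random(seed).sample(feature_names, m)


all_features = ["temperature", "seek_error_rate", "reallocated_sectors", "power_on_hours"]
random_features = pick_random_features(all_features, m=2)
print(random_features)
```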

In the embodiment of the application, the sample feature data is converted by data type conversion, and the converted sample feature data is sampled multiple times to obtain a plurality of sample training sets, which improves the training effect on the target fault detection model. The sample random features are randomly generated based on the converted sample feature data, which improves the reliability of inference on the converted sample feature data.

Step S404, obtaining a sample test set based on the plurality of sample training sets and the sample detection information.

For details, refer to step S204 in the embodiment shown in FIG. 2; they are not repeated here.

Step S405, training a preset fault detection model based on each sample training set, sample random characteristics and sample test sets to obtain a fault detection sub-model corresponding to each sample training set so as to obtain a target fault detection model.

The target fault detection model comprises the fault detection sub-model corresponding to each sample training set. For details, refer to step S205 in the embodiment shown in FIG. 2; they are not repeated here.

The embodiment provides a disk fault detection method which can be used for the super-fusion all-in-one machine. FIG. 5 is a flow chart of a disk failure detection method according to an embodiment of the present invention, as shown in FIG. 5, the flow comprising the steps of:

Step S501, obtaining target detection information of a target disk of the super fusion all-in-one machine.

The target detection information is SMART (Self-Monitoring, Analysis and Reporting Technology) information, which includes parameters such as disk temperature, seek error rate and failure prediction value.

Step S502, extracting features of the target detection information to obtain target feature data.

Specifically, inputting target detection information into a preset variable container model to obtain a target dictionary list, and extracting dictionary features of the target dictionary list to obtain target feature data.

Step S503, inputting the target feature data into the target fault detection model to obtain a detection result.

The target feature data is input to the target fault detection model through a random forest classifier, RandomForestClassifier (i.e., from sklearn.ensemble import RandomForestClassifier).

In addition, in order to keep the function parameters of the random forest classifier consistent with the parameters of the target fault detection model, training may be performed by passing the target feature data into the training function (.fit), and the target feature data is then input into the target fault detection model through the random forest classifier (RandomForestClassifier).
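A minimal sketch of fitting and querying sklearn's RandomForestClassifier follows. The toy training data, n_estimators=10 and random_state=0 are assumptions for illustration only and are not taken from the source.

```python
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data: [temperature, reallocated_sectors]; label 1 = faulty.
x_train = [[38, 0], [40, 1], [36, 0], [41, 2],
           [55, 30], [60, 45], [58, 50], [62, 41]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]

# Hyperparameters here are assumed values chosen to keep the sketch small.
rf = RandomForestClassifier(n_estimators=10, random_state=0)
rf.fit(x_train, y_train)

# Hypothetical target feature data for two disks to be checked.
data_prediction_features = [[39, 1], [61, 48]]
result = rf.predict(data_prediction_features)
print(result)
```

A random forest is itself a bagged ensemble of decision trees, each trained on a bootstrap sample with a random subset of features considered at each split.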

The target fault detection model is trained according to the training method of the target fault detection model in the embodiment shown in FIG. 2 or FIG. 4; for details, refer to those embodiments, which are not repeated here.

The embodiment provides a disk fault detection method which can be used for the super-fusion all-in-one machine. FIG. 6 is a flow chart of a disk failure detection method according to an embodiment of the present invention, as shown in FIG. 6, the flow comprising the steps of:

Step S601, obtaining target detection information of a target disk of the super fusion all-in-one machine.

Specifically, step S601 includes:

Step S6011, obtaining the target error reporting times of a candidate disk of the super fusion all-in-one machine within a preset time.

The preset time may be a unit time, which may be determined according to actual conditions. The target error reporting times is the number of IO errors (i.e., input/output errors).

Step S6012, obtaining the security sensitivity corresponding to the candidate disk.

The security sensitivity corresponding to the candidate magnetic disk is the sensitivity required by the user on the security of the data service, and the security sensitivity can be divided into a first sensitivity, a second sensitivity and a third sensitivity, wherein the first sensitivity is larger than the second sensitivity, and the second sensitivity is larger than the third sensitivity.

Step S6013, determining the preset error reporting times based on the security sensitivity.

The preset error reporting times can be set in a configuration file, and the preset error reporting times required by the configuration file can be matched automatically based on the security sensitivity; for example, the preset error reporting times may be 50, 100 or 200.

Step S6014, comparing the target error reporting times with the preset error reporting times to obtain a comparison result.

Step S6015, if the comparison result indicates that the target error reporting times are greater than or equal to the preset error reporting times, determining the candidate disk as the target disk so as to obtain the target detection information of the target disk.

In the embodiment of the application, the target error reporting times of a candidate disk of the super-fusion all-in-one machine within the preset time are detected and compared with the preset error reporting times to obtain a comparison result, and whether the candidate disk is damaged is determined based on the comparison result. If the comparison result indicates that the target error reporting times are greater than or equal to the preset error reporting times, the candidate disk is considered damaged and is determined as the target disk, so that bad disk detection can subsequently be performed on it based on the target fault detection model. That is, the embodiment of the application can feed back to the detection component through an interface at the product design view layer, report specific information when detection indicates that data has been lost, and at the same time maintain cluster stability and protect data when the disk is ejected.
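Steps S6013 to S6015 can be sketched as a threshold lookup followed by a comparison. The mapping of sensitivity levels to the example values 50, 100 and 200 is an assumption (a higher sensitivity is assumed to mean a lower threshold); the dictionary and function names are hypothetical.

```python
# Assumed mapping from security sensitivity to preset error reporting times.
PRESET_ERROR_COUNTS = {"first": 50, "second": 100, "third": 200}


def is_target_disk(io_error_count, sensitivity, thresholds=PRESET_ERROR_COUNTS):
    """Return True when the IO error count reaches the preset count for the sensitivity."""
    return io_error_count >= thresholds[sensitivity]


print(is_target_disk(60, "first"))    # 60 >= 50
print(is_target_disk(60, "second"))   # 60 < 100
```

In practice the thresholds would come from the configuration file mentioned above rather than from a constant in code.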

Step S602, extracting features of the target detection information to obtain target feature data.

For details, refer to step S502 in the embodiment shown in FIG. 5; they are not repeated here.

Step S603, inputting the target feature data into a target fault detection model to obtain a detection result.

For details, refer to step S503 in the embodiment shown in FIG. 5; they are not repeated here.

In some optional implementations, a final score and a prediction for the target disk can be obtained through detection by the target fault detection model, so the detection result comprises a target accuracy and a detection result. If the detection result indicates that the target disk is damaged and the target accuracy is greater than or equal to a preset accuracy, the target disk is determined to be a damaged disk; or, if the detection result indicates that the target disk is damaged and the target accuracy is smaller than the preset accuracy, the target disk is determined to be an undamaged disk.

The target accuracy is accuracy = rf.score(x_test, y_test), and the detection result is result = rf.predict(data_prediction_features).

Illustratively, if the target accuracy (i.e., accuracy) is above 90%, the corresponding detection result may be treated as valid data for determining whether the target disk is damaged.
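The decision rule above can be sketched as a small function. The default preset accuracy of 0.9 follows the 90% example; the function name is hypothetical, and treating a disk predicted as not damaged as "undamaged" is an assumption, since the source only covers the damaged prediction.

```python
def classify_disk(predicted_damaged, target_accuracy, preset_accuracy=0.9):
    """Combine the model prediction with the model accuracy.

    A disk is reported as damaged only when the prediction says so AND the
    accuracy reaches the preset accuracy; every other case is reported as
    undamaged (the non-damaged-prediction case is an assumption).
    """
    if predicted_damaged and target_accuracy >= preset_accuracy:
        return "damaged"
    return "undamaged"


print(classify_disk(True, 0.95))
print(classify_disk(True, 0.80))
```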

In some alternative implementations, a disk capacity ratio of the target disk is determined, wherein the disk capacity ratio is a ratio of capacities of the cache disk and the data disk. And then determining the storage type of the target disk based on the disk capacity ratio, wherein the storage type comprises a mixed flash type and a full flash type, the mixed flash type is a mixed mode of SSD and HDD, the full flash type is a mode of full SSD, and a customer can select disks with different configurations to use according to actual service requirements and cost budget.

Then, a detection mode corresponding to the target disk is determined based on the storage type of the target disk, where the detection mode includes in-band detection and out-of-band detection. If the detection mode is in-band detection, input/output state information at a preset storage capacity level in the target disk is acquired (for example, IO error information for each 4M block). If the input/output state information indicates that damaged disk blocks exist in the target disk, normal disk blocks in the target disk are determined, the data in the damaged disk blocks is mapped to the normal disk blocks, and bad block alarm information is generated. The number of pieces of bad block alarm information is then determined; if it is greater than or equal to a preset number, bad disk alarm information is generated and sent to a target object so that the target object can handle the fault of the target disk. The preset number can be determined according to the specific disk specification.

Further, when the number of pieces of bad block alarm information is greater than or equal to the preset number, the detection information of the disk corresponding to the bad blocks can be acquired and subjected to feature extraction to obtain feature data, and the feature data is input into the target fault detection model to obtain a detection result. Whether the disk is damaged is then judged again according to the detection result, which avoids misjudging the degree of damage of the disk and improves the accuracy of detecting its health.
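The in-band bad block flow can be sketched as follows. This is an illustration under stated assumptions: a block is treated as damaged when it has any IO error, each damaged block is remapped to the next available normal block, and the function and variable names are hypothetical.

```python
def scan_blocks(block_io_errors, preset_number):
    """Inspect per-block IO error counts and decide whether to raise a bad disk alarm.

    block_io_errors maps a block id to its IO error count (one entry per 4M block).
    Returns (remap, bad_block_alarms, bad_disk_alarm).
    """
    # Assumption: any IO error marks the block as damaged.
    bad_blocks = [b for b, errors in block_io_errors.items() if errors > 0]
    normal_blocks = [b for b, errors in block_io_errors.items() if errors == 0]

    # Map data from each damaged block to a distinct normal block.
    remap = dict(zip(bad_blocks, normal_blocks))

    bad_block_alarms = [f"bad block {b}" for b in bad_blocks]
    bad_disk_alarm = len(bad_block_alarms) >= preset_number
    return remap, bad_block_alarms, bad_disk_alarm


remap, alarms, bad_disk = scan_blocks({0: 0, 1: 3, 2: 0, 3: 1}, preset_number=2)
print(remap, alarms, bad_disk)
```

When the bad disk alarm fires, the disk could then be passed to the target fault detection model for the second check described above.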

Illustratively, as shown in FIG. 7, the disks in the super fusion all-in-one machine are first managed based on the disk management mechanism of the super fusion all-in-one machine. A RAID (Redundant Array of Independent Disks) card is then configured in JBOD (Just a Bunch Of Disks) mode so that the managed disks are passed through. Specifically, the disks used for distributed storage are configured in JBOD mode through a RAID card inserted on the server main board and are then passed through directly to the corresponding control virtual machine. This pass-through scheme provides resource isolation, simple deployment and high hardware performance, and allows the high-performance cache disks and high-capacity data disks used for distributed storage to be identified accurately and their full performance and capacity to be used.

In addition, since the distributed storage in the super fusion all-in-one machine provides a replica data setting entry, a specific policy can be set according to the importance of the service data. For example, based on current product performance, the recommended capacity ratio of cache disk to data disk is 1:5 to 1:10. The supported types (i.e., storage types) of the cache disk are NVMe (Non-Volatile Memory Express) SSD (Solid State Drive) and SATA (Serial Advanced Technology Attachment) SSD, and the supported types (i.e., storage types) of the data disk are SATA HDD and SAS HDD. A detection mode corresponding to the disk is then determined to perform bad disk detection on the disk, where the detection mode includes out-of-band detection and in-band detection.

For out-of-band detection, the disk is detected directly through a hardware interface to obtain SMART detection information, part of whose parameters relate directly to the health of the disk. In the related art, a relevant threshold is set, and whether to raise the corresponding alarm is determined by comparing the SMART detection information with the threshold. However, because of differences between hardware, combined with different virtualization software and applications, the disk health detected in this manner is not accurate.

According to the method and the device, the health degree of the current disk is analyzed through the target fault detection model, the association among various indexes can be more accurately mined, the data characteristics are enriched, the judgment and the prediction of the bad disk are more reasonably and accurately given, and the accuracy of the bad disk detection result is improved.

In addition, the disk type of the disk can be determined, if the disk is SSD, the health degree of the disk can be detected based on the wear degree, the writing amount, the power-on time and the like of the SSD, and if the health degree of the disk reaches a preset risk value, corresponding alarm information is generated. If the disk is an HDD, the health degree of the disk can be detected according to the SMART information, and if the health degree of the disk reaches a preset risk value, corresponding alarm information is generated.

For in-band detection, a series of detection rules are formulated based on the real use condition of the disk service, and the real IO (input output) performance of the disk is detected through system detection and network protocol, which is also the core of the bad disk detection function.

For bad block detection, detecting and counting IO error information of each 4M block level through a mechanism of a bottom storage engine part, if the input and output state information represents that a damaged disk block exists in a target disk, mapping data information in the damaged disk block into a normal disk block, and generating bad block alarm information to remind a user. Because a small number of bad blocks hardly affect normal business IO, the bad block alarm information at the moment plays a role in warning, but when the number of the bad blocks reaches a certain degree, the disk corresponding to the bad block can be determined as a bad disk.

For bad disk detection, the target error reporting times of a disk of the super-fusion all-in-one machine in unit time are acquired, the security sensitivity corresponding to the disk is acquired, and the preset error reporting times are determined based on the security sensitivity. The target error reporting times are compared with the preset error reporting times to obtain a comparison result, and if the comparison result indicates that the target error reporting times are greater than or equal to the preset error reporting times, the disk is determined to be a bad disk. Bad disks are also divided into SSD and HDD, and the corresponding bad disk alarm information can be generated according to the disk type. In addition, the bad disk alarm information can be pushed to operation and maintenance personnel, who can replace the bad disk and then click "fault repair" in the system; the file recording the IO error counts is then cleared so that the alarm is not triggered again.

For slow disk detection, a serious slow disk alarm is triggered when the performance of an SDS (Software Defined Storage) cache disk or data disk drops significantly. The implementation mechanism is that the detection mechanism of the system periodically reads the /proc/diskstats file, traverses each partition, and determines whether to trigger an alarm from the change in the number of milliseconds spent on input/output operations. If that number reaches a threshold, a serious alarm is triggered to warn of the risk of IO stalls, and the system repairs the slow disk and automatically ejects and isolates it, that is, the slow disk is replaced and a repair alarm is generated. The updated disk is then rescanned and judged again; if the number of milliseconds spent on input/output operations still reaches the threshold, the system automatically ejects and isolates the slow disk again. The corresponding alarm information can be pushed to operation and maintenance personnel, who can repair the disk according to the reported problem. If the repaired disk is no longer within the alarm threshold, the alarm information is cleared automatically and the disk is in a healthy state. Conversely, if no alarm is triggered, the disk is healthy.
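The periodic /proc/diskstats check can be sketched as below. The source does not say which column is used; this sketch assumes the thirteenth field of each line (milliseconds spent doing I/O) and compares two snapshots. The snapshot text, threshold and function names are hypothetical, and the snapshots are passed in as strings so the sketch does not depend on a Linux host.

```python
def io_ticks_by_device(diskstats_text):
    """Parse /proc/diskstats-style text into {device: ms spent doing I/O}."""
    ticks = {}
    for line in diskstats_text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or malformed lines
        # fields[2] is the device name; fields[12] is the 13th field (io ms).
        ticks[fields[2]] = int(fields[12])
    return ticks


def slow_disks(previous, current, threshold_ms):
    """Return devices whose I/O time grew by at least threshold_ms between snapshots."""
    return sorted(
        dev for dev, now in current.items()
        if now - previous.get(dev, now) >= threshold_ms
    )


# Two hypothetical snapshots taken one detection period apart.
before = """\
8 0 sda 100 0 800 50 200 0 1600 70 0 120 120
8 16 sdb 90 0 720 40 180 0 1440 60 0 300 300
"""
after = """\
8 0 sda 150 0 1200 400 260 0 2080 620 0 1020 1020
8 16 sdb 95 0 760 45 190 0 1520 65 0 350 350
"""
flagged = slow_disks(io_ticks_by_device(before), io_ticks_by_device(after), threshold_ms=500)
print(flagged)
```

On a real system the snapshot text would be read from /proc/diskstats at each detection interval.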

It will be appreciated that after the operation and maintenance personnel have performed the fault repair, the system continues to provide service normally and no alarms are triggered.

The alarm information in the embodiment of the application may include health alarm information, wear alarm information, temperature alarm information, bad block alarm information, bad disc alarm information and slow disc alarm information. The threshold value of triggering alarm corresponding to different alarm information is different and can be set according to actual conditions.

In summary, the embodiment of the application can accurately alarm the disk faults based on the real service condition, effectively solve the problems of inaccurate, untimely or too frequent alarm information and the like commonly existing in the related technology, and perform more effective hard disk operation and maintenance. Moreover, from the comprehensive consideration of out-of-band detection and in-band detection, a reasonable bad disc detection scheme is provided more intelligently according to relevant parameters, product characteristics and user types. Through more reasonable algorithm optimization, corresponding alarm information of a user is timely and reasonably given, meanwhile, system response is made to severe bad disk conditions, isolation or elimination is carried out, and accuracy, integrity and safety of existing data are protected.

Referring to fig. 8, an embodiment of the present application further provides a training apparatus for a disk failure detection model, including:

A first obtaining module 801, configured to obtain sample detection information of a sample disk in the super fusion all-in-one machine;

a first feature extraction module 802, configured to perform feature extraction on the sample detection information to obtain sample feature data;

a first determining module 803, configured to obtain a plurality of sample training sets and sample random features based on the sample feature data;

a second determining module 804, configured to obtain a sample test set based on the plurality of sample training sets and the sample detection information;

the target fault detection model generating module 805 is configured to train a preset fault detection model based on each sample training set, a sample random feature and a sample test set, so as to obtain a fault detection sub-model corresponding to each sample training set, so as to obtain a target fault detection model, where the target fault detection model includes a fault detection sub-model corresponding to each sample training set and a data fusion module, and the data fusion module is configured to fuse and output outputs of each fault detection sub-model, so as to obtain an output result of the target fault detection model.

In some alternative embodiments, the first feature extraction module 802 specifically includes:

a feature extraction sub-module, configured to extract dictionary features from the dictionary list to obtain sample feature data.

In some alternative embodiments, the first determining module 803 specifically includes:

the sampling sub-module is used for sampling the converted sample characteristic data repeatedly with the replacement to obtain a plurality of sample training sets;

and the sample random feature generation sub-module is used for randomly generating sample random features based on the converted sample feature data.

It is understood that the disk failure detection apparatus has advantages similar to those of the disk failure detection method, and will not be described herein.

Referring to fig. 9, an embodiment of the present application further provides a disk failure detection apparatus, including:

the second obtaining module 901 is configured to obtain target detection information of a target disk of the super fusion integrated machine;

a second feature extraction module 902, configured to perform feature extraction on the target detection information to obtain target feature data;

the prediction module 903 is configured to input the target feature data to a target fault detection model to obtain a detection result, where the target fault detection model is trained according to the training method of the target fault detection model of the first aspect or any embodiment corresponding to the first aspect.

In some alternative embodiments, the detection result includes a target accuracy and a detection result, and the disk failure detection apparatus further includes:

the third determining module is used for determining the target disk as a damaged disk if the detection result represents that the target disk is damaged and the target accuracy is greater than or equal to the preset accuracy; or,

if the detection result represents that the target disk is damaged and the target accuracy is smaller than the preset accuracy, determining the target disk as an undamaged disk.

In some optional embodiments, the second obtaining module 901 specifically includes:

the first determining sub-module is used for determining preset error reporting times based on the safety sensitivity;

the comparison sub-module is used for comparing the target error reporting times with preset error reporting times to obtain a comparison result;

and the second determining submodule is used for determining the candidate disk as the target disk if the comparison result represents that the target error reporting times are greater than or equal to the preset error reporting times so as to obtain target detection information of the target disk.

the sixth determining module is used for determining a detection mode corresponding to the target disk based on the storage type of the target disk, wherein the detection mode comprises in-band detection;

the quantity determining module is used for determining the quantity of bad block alarm information;

the bad disc alarm information generation module is used for generating bad disc alarm information if the number of bad block alarm information is greater than or equal to the preset number;

and the sending module is used for sending the bad disk alarm information to the target object so as to enable the target object to perform fault processing on the target disk.

The beneficial effects of the training device of the disk failure detection model are similar to those of the training method of the disk failure detection model, and are not repeated here.

The training apparatus of the disk fault detection model and the disk fault detection apparatus in this embodiment are presented in the form of functional modules, where the modules refer to application specific integrated circuits (Application Specific Integrated Circuit, ASIC for short), processors and memories that execute one or more software or firmware programs, and/or other devices that provide the functionality described above.

Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.

The embodiment of the invention also provides computer equipment, which is provided with the training device of the disk fault detection model shown in the figure 8 or the disk fault detection device shown in the figure 9.

Referring to fig. 10, fig. 10 is a schematic structural diagram of a computer device according to an alternative embodiment of the present invention. As shown in fig. 10, the computer device includes one or more processors 10, a memory 20, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are communicatively coupled to each other using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the computer device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output device, such as a display device coupled to the interface. In some alternative embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories. Also, multiple computer devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is illustrated in fig. 10.

The processor 10 may be a central processor, a network processor, or a combination thereof. The processor 10 may further include a hardware chip, among others. The hardware chip may be an application specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field programmable gate array, a general-purpose array logic, or any combination thereof.

Wherein the memory 20 stores instructions executable by the at least one processor 10 to cause the at least one processor 10 to perform a method for implementing the embodiments described above.

The memory 20 may include a storage program area and a storage data area, where the storage program area may store an operating system and at least one application program required for functions, and the storage data area may store data created according to the use of the computer device, and the like. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, memory 20 may optionally include memory located remotely from processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Memory 20 may include volatile memory, such as random access memory; the memory may also include non-volatile memory, such as flash memory, hard disk, or solid state disk; the memory 20 may also comprise a combination of the above types of memories.

The computer device also includes a communication interface 30 for the computer device to communicate with other devices or communication networks.

The embodiments of the present invention also provide a computer readable storage medium. The method according to the embodiments of the present invention described above may be implemented in hardware or firmware, or as computer code that can be recorded on a storage medium, or as computer code originally stored in a remote storage medium or a non-transitory machine readable storage medium, downloaded through a network and stored in a local storage medium, so that the method described herein can be processed by such software stored on a storage medium using a general purpose computer, a special purpose processor, or programmable or special purpose hardware. The storage medium can be a magnetic disk, an optical disk, a read-only memory, a random access memory, a flash memory, a hard disk, a solid state disk or the like; further, the storage medium may also comprise a combination of memories of the kinds described above. It will be appreciated that a computer, processor, microprocessor controller or programmable hardware includes a storage element that can store or receive software or computer code that, when accessed and executed by the computer, processor or hardware, implements the methods illustrated by the above embodiments.

Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.

Claims

1. A method for training a disk failure detection model, the method comprising:

2. The method according to claim 1, wherein performing feature extraction on the sample detection information to obtain sample feature data comprises:

3. The method according to claim 1, wherein obtaining a plurality of sample training sets and sample random features based on the sample feature data comprises:

4. A disk fault detection method, applied to a super fusion all-in-one machine, the method comprising:

inputting the target feature data into a target fault detection model to obtain a detection result, wherein the target fault detection model is obtained by training according to the method for training a disk failure detection model according to any one of claims 1 to 3.

5. The method according to claim 4, wherein the detection result comprises a target accuracy and a detection result, the method further comprising:

or,

6. The method according to claim 4 or 5, wherein obtaining the target detection information of the target disk of the super fusion all-in-one machine comprises:

acquiring the safety sensitivity corresponding to the candidate magnetic disk;

determining a preset number of error reports based on the safety sensitivity;

7. The method according to claim 4, wherein the method further comprises:

determining the disk capacity ratio of the target disk;

determining the number of pieces of bad block alarm information;

8. A training device for a disk failure detection model, comprising:

9. A disk failure detection apparatus, comprising:

a prediction module, configured to input the target feature data into a target fault detection model to obtain a detection result, wherein the target fault detection model is obtained by training according to the method for training a disk failure detection model according to any one of claims 1 to 3.

10. A computer device, comprising:

a memory and a processor, the memory and the processor being communicatively connected to each other, the memory having stored therein computer instructions, the processor executing the computer instructions to perform the method of training the disk failure detection model of any one of claims 1 to 3 or the method of detecting a disk failure of any one of claims 4 to 7.

11. A computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of training the disk failure detection model according to any one of claims 1 to 3 or the method of detecting a disk failure according to any one of claims 4 to 7.