CN113434088A - Disk identification method and device - Google Patents

Info

Publication number
CN113434088A
Authority
CN
China
Prior art keywords
disk
target
data
training
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110718605.4A
Other languages
Chinese (zh)
Inventor
莫亚运
郭玉章
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Construction Bank Corp
Original Assignee
China Construction Bank Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Construction Bank Corp
Priority to CN202110718605.4A
Publication of CN113434088A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0614 Improving the reliability of storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/2205 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
    • G06F11/2221 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test input/output devices or peripheral units
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Abstract

The invention provides a disk identification method and device. The method acquires disk data of all currently running disks in a data center; processes the disk data based on a pre-constructed disk prediction model to obtain target disks; for each target disk, processes disk parameters based on a pre-constructed disk identification model to obtain a disk score of the target disk; and, when a disk score is determined to be greater than a preset limit value, determines that the target disk corresponding to that disk score is a hidden bad disk. In this scheme, the disk prediction model first processes the disk data to determine the target disks, and the disk identification model then performs a second identification pass on each target disk, producing the disk score that decides whether the target disk is a hidden bad disk. Problems such as computation skew caused by hidden bad disks can thus be avoided, and the robustness and availability of the disks can be improved.

Description

Disk identification method and device
Technical Field
The invention relates to the technical field of disk identification, in particular to a disk identification method and device.
Background
With the advent of the big data era, data reliability has become one of the issues of greatest concern to enterprises and data centers. The disk is an important storage device in a server. When a disk has a hidden fault that affects computing nodes, computation skew occurs, which in turn causes cluster skew, so that the disk performance of the data center degrades or the disks even become unavailable.
Currently, hard disk self-monitoring (SMART) log information about disk errors in a server is usually collected, and operation and maintenance personnel analyze the collected SMART log information based on experience to determine disks that may fail. With this approach, read/write errors of a disk are difficult to discover, that is, hidden faults of the disk are difficult to find. When such a hidden fault occurs, it can cause cluster skew, so that the disk performance of the data center degrades or the disks even become unavailable.
Disclosure of Invention
In view of this, embodiments of the present invention provide a disk identification method and apparatus to solve the problems in the prior art.
In order to achieve the above purpose, the embodiments of the present invention provide the following technical solutions:
one aspect of the embodiments of the present invention shows a disk identification method, where the method includes:
acquiring disk data of all currently running disks in a data center;
processing the disk data based on a disk prediction model which is constructed in advance to obtain a target disk, wherein the target disk is a disk which possibly has hidden dangers, the disk prediction model is obtained by training disk input/output (I/O) data of a database in a first preset historical time period, and the number of the target disks is at least one;
for each target disk, processing disk parameters based on a disk identification model which is constructed in advance to obtain a disk score of each target disk, wherein the disk parameters are obtained by processing the hard disk self-inspection SMART log information of the target disk, and the disk identification model is obtained by training the SMART log information of the disk in a second preset historical time period;
and when the disk score is determined to be greater than a preset limit value, determining that the target disk corresponding to the disk score is a hidden bad disk.
Optionally, the process of obtaining the disk prediction model by training the disk I/O data of the database in the preset historical time period includes:
acquiring disk I/O data in a database in a first preset historical time period, and dividing the disk I/O data into a training set and a test set;
classifying the training sets, and extracting characteristic data in each type of training set;
training based on the characteristic data in each type of training set to obtain an initial disk prediction model after training;
and performing verification test on the initial disk prediction model by using the test set until an obtained test result is the same as an expected result, and determining that the initial disk prediction model obtained by current training is a disk prediction model, wherein the test result is obtained by predicting the test set in the initial disk prediction model.
Optionally, the process of training by using SMART log information of the disk in the second preset historical time period to obtain the disk identification model includes:
acquiring SMART log information of the disk in a second preset historical time period, and taking the SMART log information as a training set;
and training an expert model based on the training set to obtain a trained disk identification model.
Optionally, after the disk data is processed based on a disk prediction model that is constructed in advance to obtain a target disk, the method further includes:
processing SMART log information of each target disk to obtain disk parameters;
and calculating the disk parameters corresponding to each target disk to obtain the disk score corresponding to each target disk.
Optionally, when it is determined that the disk score is greater than the preset limit, determining that the target disk corresponding to the disk score is a hidden bad disk includes:
judging whether the disk score of each target disk is larger than a preset limit value or not;
and if the disk score is greater than the preset limit value, taking the target disk whose disk score is greater than the preset limit value as a hidden bad disk.
Another aspect of the embodiments of the present invention shows a disk identification apparatus, where the apparatus includes:
the data access unit is used for acquiring the disk data of all the currently running disks in the data center;
the disk prediction model is used for processing the disk data based on a disk prediction model which is constructed in advance to obtain a target disk, the target disk is a disk which possibly has hidden danger, the disk prediction model is constructed by utilizing a first construction unit, and the number of the target disks is at least one;
the disk identification model is used for processing disk parameters on the basis of a pre-constructed disk identification model aiming at each target disk to obtain a disk score of each target disk, wherein the disk parameters are obtained by processing SMART log information of the target disk, and the disk identification model is constructed by utilizing the second construction unit;
and the determining unit is used for determining that the target disk corresponding to the disk score is a hidden bad disk when the disk score is determined to be greater than a preset limit value.
Optionally, the first building unit includes:
the first data access module is used for acquiring disk I/O data in a database in a first preset historical time period and dividing the disk I/O data into a training set and a test set;
the processing module is used for classifying the training sets and extracting the characteristic data in each type of training set;
the classification model training module is used for training based on the characteristic data in each class of training set to obtain an initial disk prediction model after training;
and the optimization module is used for performing verification test on the initial disk prediction model by using the test set until an obtained test result is the same as an expected result, and determining the initial disk prediction model obtained by current training as the disk prediction model, wherein the test result is obtained by predicting the test set in the initial disk prediction model.
Optionally, the second building unit includes:
the second data access module is used for acquiring SMART log information of the magnetic disk in a second preset historical time period and taking the SMART log information as a training set;
and the expert model training module is used for training an expert model based on the training set to obtain a trained disk identification model.
Optionally, the determining unit is specifically configured to:
judging whether the disk score of each target disk is greater than a preset limit value; and if the disk score is greater than the preset limit value, taking the target disk whose disk score is greater than the preset limit value as a hidden bad disk.
Optionally, the method further includes:
the computing unit is used for processing the SMART log information of each target disk to obtain disk parameters, and calculating the disk parameters corresponding to each target disk to obtain the disk score corresponding to each target disk.
Based on the disk identification method and device provided by the embodiments of the invention, the method comprises the following steps: acquiring disk data of all currently running disks in a data center; processing the disk data based on a pre-constructed disk prediction model to obtain a target disk, wherein the target disk is a disk that may have a hidden fault, the disk prediction model is obtained by training on disk I/O data of a database in a first preset historical time period, and the number of target disks is at least one; for each target disk, processing disk parameters based on a pre-constructed disk identification model to obtain a disk score of each target disk, wherein the disk parameters are obtained by processing SMART log information of the target disk, and the disk identification model is obtained by training on the SMART log information of the disk in a second preset historical time period; and when the disk score is determined to be greater than the preset limit value, determining that the target disk corresponding to the disk score is a hidden bad disk. In the embodiments of the invention, the acquired disk data is processed by the disk prediction model to determine the disks that may have a hidden fault, namely the target disks; the disk identification model then performs a second identification pass on each target disk and determines its disk score, and when the disk score is greater than the preset limit value, the corresponding target disk is determined to be a hidden bad disk. Problems such as computation skew caused by hidden bad disks can thus be avoided, and the robustness and availability of the disks can be improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
FIG. 1 is a schematic flowchart of a disk identification method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a disk prediction model building process according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an analysis result after clustering of disk I/O data according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a disk identification apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of another disk identification apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In this application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
In the embodiment of the invention, the acquired disk data is processed by the disk prediction model to determine the disks that may have a hidden fault, namely the target disks. The disk identification model then performs a second identification pass on each target disk and determines its disk score; when the disk score is greater than the preset limit value, the target disk corresponding to the disk score is determined to be a hidden bad disk. Problems such as computation skew caused by hidden bad disks can thus be avoided, and the robustness and availability of the disks can be improved.
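The two-stage flow described above can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the function name identify_hidden_bad_disks and the callables predict_suspect and score_disk are hypothetical stand-ins for the disk prediction model and the disk identification model, and the dictionary layout of the inputs is an assumption.

```python
from typing import Callable, Dict, List, Sequence


def identify_hidden_bad_disks(
    io_data: Dict[str, Sequence[float]],
    smart_params: Dict[str, Sequence[float]],
    predict_suspect: Callable[[Sequence[float]], bool],
    score_disk: Callable[[Sequence[float]], float],
    limit: float,
) -> List[str]:
    """Return the disks whose score exceeds the preset limit value."""
    # Stage 1: the (hypothetical) prediction model screens all running disks
    # and keeps those that may have a hidden fault, i.e. the target disks.
    targets = [disk for disk, feats in io_data.items() if predict_suspect(feats)]
    # Stage 2: each target disk is scored from its SMART-derived parameters,
    # and only scores strictly greater than the limit mark a hidden bad disk.
    return [disk for disk in targets if score_disk(smart_params[disk]) > limit]
```

Only disks that pass the first stage are scored, so the second model runs on a small candidate set rather than on every disk in the data center.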
Referring to fig. 1, a schematic flow chart of a disk identification method according to an embodiment of the present invention is shown, where the method includes:
S101: Acquire the disk data of all currently running disks in the data center.
In S101, the disk data of all the disks currently running in the data center is stored in the database.
Specifically, in S101, the disk data of all servers currently running in the data center is traversed in the database to acquire the disk data corresponding to each disk.
The disk data includes input/output (I/O) load data for a plurality of disks.
S102: Process the disk data based on a pre-constructed disk prediction model to obtain the target disk.
In S102, the target disk is a disk that may have a hidden fault, the disk prediction model is obtained by training on disk I/O data of the database in a first preset historical time period, and the number of target disks is at least one.
Specifically, in S102, the pre-constructed disk prediction model is called to process the I/O load data of each disk, so as to determine which of the currently running disks may have a hidden fault; these disks are output as the target disks.
In the embodiment of the present invention, a process of obtaining a disk prediction model by training disk I/O data in a database in a first preset historical time period, as shown in fig. 2, includes the following steps:
S201: Acquire disk I/O data in a database in a first preset historical time period, and divide the disk I/O data into a training set and a test set.
In the specific implementation of step S201, the data of all servers stored in the database is traversed, and the disk I/O data in the first preset historical time period is acquired. The disk I/O data is then divided into a training set and a test set, and the training set is fed to the classification model, completing preparation of the training set.
The disk I/O data relates to N servers, including normal servers and failed servers; each failed server corresponds to one hard disk failure, and N is a positive integer greater than or equal to 1.
In the embodiment of the invention, to keep the evaluation of the prediction performance sound and valid, the servers corresponding to the training set and the test set do not overlap.
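A split in which no server appears in both sets can be sketched as follows. This is a minimal illustration under assumptions: the sample layout (server_id, features, label), the function name split_by_server, and the 30% test fraction are hypothetical and are not specified by the embodiment.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[str, Sequence[float], int]  # (server_id, features, label)


def split_by_server(samples: List[Sample], test_fraction: float = 0.3,
                    seed: int = 0) -> Tuple[List[Sample], List[Sample]]:
    """Split samples so that every server lands in exactly one of the sets."""
    servers = sorted({server for server, _, _ in samples})
    random.Random(seed).shuffle(servers)  # reproducible shuffle of server IDs
    n_test = max(1, round(len(servers) * test_fraction))
    test_servers = set(servers[:n_test])
    # All samples of a server follow that server into one set.
    train = [s for s in samples if s[0] not in test_servers]
    test = [s for s in samples if s[0] in test_servers]
    return train, test
```

Splitting by server rather than by sample prevents samples from the same machine from leaking between training and evaluation.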
It should be noted that the database may be a relational database MySQL, or may be another type of database.
In the embodiment of the invention, the disk is installed in the server and is used by the server. When the number of servers is large, the database can store the server information in separate tables classified by model and department, for example: a server basic configuration database, a server log database, and a SvrMon database.
The server basic configuration database, i.e. the CMDB, is used to store the system configuration information of the server, such as the server fixed asset number, IP address, model, and service.
For example, the server configuration information data table with the table name s_server is shown in table (1).
Table (1):
Field name | Type | Length (bytes) | Nullable | Description
svr_asset_id | varchar | 64 | No | Server fixed asset number
dev_class_name | varchar | 64 | No | SCM device name
ip | varchar | 64 | Yes | Server IP address
dept_id | int | 4 | Yes | Department ID
use_time | datetime | 8 | Yes | Racking time
The field svr_asset_id is the primary key of the server table and uniquely identifies a server; it holds the server fixed asset number, is of type varchar, stores up to 64 bytes, and cannot be empty. The field dev_class_name represents the model of the server; it holds the SCM device name, is of type varchar, and stores up to 64 bytes. The field ip holds the server IP address; it is of type varchar, stores up to 64 bytes, and may be empty. The field dept_id holds the department ID; it is of type int, is 4 bytes long, and may be empty. The field use_time holds the racking time; it is of type datetime, is 8 bytes long, and may be empty.
Further, the hardware configuration corresponding to a specific server model is basically the same, so the model reflects the same server hardware characteristics. Specifically, the models can be classified into three major categories, namely a management type, a calculation type and a storage type.
The server log database, namely the pmbasic database, is used for storing the input/output (I/O) load information of the disks in each server.
It can be understood that, because the volume of disk I/O load information in the servers is very large, it is stored across multiple tables. Specifically, the Software Configuration Management (SCM) type ID of the server (such as s_dev_type.type_id) and the department ID of the server (such as s_server.dept_id) are used as the basis for splitting the tables. That is, each table stores the basic performance data of the servers of one SCM type in one department.
For example, the table name is d_custom_attached_X_Y, where X is the SCM type ID of a server and Y is the department ID of the server. As shown in table (2), each record corresponds to the value of one feature of one server at one time.
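The table naming scheme can be expressed as a small helper. This is a sketch only: the function name io_table_name is hypothetical, and restricting both IDs to non-negative integers is an assumption made so that the generated name is safe to embed in a query.

```python
def io_table_name(scm_type_id: int, dept_id: int) -> str:
    """Build the per-type, per-department table name d_custom_attached_X_Y."""
    for value in (scm_type_id, dept_id):
        # Assumption: IDs are non-negative integers; reject anything else.
        if not isinstance(value, int) or isinstance(value, bool) or value < 0:
            raise ValueError(f"invalid ID: {value!r}")
    return f"d_custom_attached_{scm_type_id}_{dept_id}"
```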
Table (2):
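[Table (2) appears as image BDA0003136065970000081 in the original publication; its fields are described in the following paragraph.]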
The field ip holds the server IP; it is of type char, stores up to 16 bytes, and cannot be empty, and it is used to check whether the IP of the server corresponding to X is consistent with the server IP in the other data table, cmdb.s_server. The field date_time holds the sampling date and time at which the feature value of the server corresponding to X was collected; it is of type datetime, is 4 bytes long, and cannot be null. The field atterid holds the feature ID; it is of type int, is 4 bytes long, and cannot be empty. The field attr_X holds the sampled value of feature X; it is of type int, is 4 bytes long, and cannot be empty.
The SvrMon database is used for storing data such as the fault tickets of the disks in the servers. For example, the fault data table with the table name A is used for recording the fault ticket data of server B, as shown in table (3).
Table (3):
Field name | Type | Length (bytes) | Nullable | Description
alarm_sn | int | 4 | Yes | Fault ticket number
create_time | Datetime | 8 | No | Fault ticket creation time
svr_asset_id | varchar | 64 | No | Server fixed asset number
dev_class_name | varchar | 64 | No | SCM device name
ip | varchar | 64 | Yes | Server IP address
use_time | datetime | 8 | Yes | Racking time
The field alarm_sn holds the fault ticket number corresponding to server B; it is of type int, is 4 bytes long, and cannot be empty. The field create_time holds the creation time of the fault ticket; it is of type Datetime, is 8 bytes long, and cannot be empty.
Note that the fields svr_asset_id, dev_class_name, ip, and use_time have the same meaning as in table (1) above, to which reference may be made.
S202: Classify the training set and extract the feature data in each class of the training set.
In S202, the training set includes disk I/O data of multiple dimensions.
In the specific implementation of step S202, the K-Means algorithm is used to cluster combinations of the statistical features of the multi-dimensional disk I/O data, and the feature data in each resulting class is extracted.
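The clustering step can be illustrated with a minimal pure-Python K-Means sketch. It is not the implementation of the embodiment: the function name kmeans is hypothetical, the feature vectors are assumed to be numeric tuples of I/O statistics, and initializing the centroids from the first k points is an assumption made for reproducibility.

```python
import math
from typing import List, Optional, Sequence, Tuple


def kmeans(points: List[Sequence[float]], k: int,
           max_iters: int = 100) -> Tuple[List[List[float]], List[int]]:
    """Cluster points into k groups and return (centroids, labels)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: deterministic start from the first k points.
    centroids = [list(p) for p in points[:k]]
    labels: Optional[List[int]] = None
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are final
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels or []
```

With k = 3, the resulting clusters could be read as low, medium, and high load groups once their centroids are ordered by magnitude.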
It should be noted that the training set may be divided into seven classes corresponding to seven kinds of feature data; the training set can also be classified into one type, corresponding to various feature data.
For example, the training set may be classified into one class corresponding to 3 kinds of feature data, namely high load (high), medium load (medium), and low load (low). As shown in FIG. 3, the combination of the statistical features of the multi-dimensional disk I/O data includes the average iteration time per second of the disk I/O (Avg_time), the highest iteration time of the disk I/O (Top_Avg_time), and the input time of the highest data of the disk I/O.
The disk I/O statistics of the low cluster are lower, and the corresponding hard disk failure rate is also lower; the disk I/O statistics of the high cluster are higher, and the corresponding hard disk failure rate is also higher, more than 4 times that of the low cluster. That is, the hard disk failure rate tends to increase as the disk I/O statistics increase.
The failure rate refers to the average number of failures of the servers, i.e., the number of failures/the total number of servers.
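The failure rate defined above can be computed per cluster as in the following sketch. The function name failure_rate_by_cluster and the two input dictionaries (cluster label per server, failure count per server) are hypothetical.

```python
from collections import defaultdict
from typing import Dict


def failure_rate_by_cluster(cluster_of_server: Dict[str, str],
                            failure_counts: Dict[str, int]) -> Dict[str, float]:
    """Failure rate per cluster = number of failures / number of servers."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for server, cluster in cluster_of_server.items():
        totals[cluster] += 1
        # Servers without a recorded fault ticket count as zero failures.
        failures[cluster] += failure_counts.get(server, 0)
    return {cluster: failures[cluster] / totals[cluster] for cluster in totals}
```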
S203: Train on the feature data in each class of the training set to obtain a trained initial disk prediction model.
In the embodiment of the invention, the two-class C-SVM classifier of LIBSVM, a support vector machine library, is used to distinguish the fault class from the normal class.
The MATLAB version of LIBSVM is used; it mainly provides two functions, svmtrain for modeling and svmpredict for prediction.
Specifically, in S203, data sets of different dimensions are prepared according to different time windows. The data sets are converted into the data format required by LIBSVM, and the optimal penalty parameter Cost and kernel parameter gamma are selected through cross validation. The function svmtrain is then called with the selected Cost and gamma to train a support vector machine classifier on the training set, yielding an SVM model, namely the initial disk prediction model.
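The conversion into the LIBSVM text format can be sketched as below. Each line of that format has the form "label index:value ...", with 1-based feature indices in ascending order and zero-valued features optionally omitted. The function name to_libsvm_line is hypothetical, and the use of +1 for the fault class and -1 for the normal class is an assumption.

```python
from typing import Sequence


def to_libsvm_line(label: int, features: Sequence[float]) -> str:
    """Format one sample as a line of LIBSVM's sparse text format."""
    parts = [str(label)]
    for index, value in enumerate(features, start=1):  # indices start at 1
        if value != 0:  # zero-valued features may be left out
            # The :g format keeps about six significant digits.
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)
```

Selecting Cost and gamma by cross validation would then be carried out with LIBSVM itself on the file built from these lines.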
It should be noted that LIBSVM is a software toolkit that is simple to operate and performs SVM pattern recognition and regression quickly and efficiently.
The SVM model has a series of parameters, including those supplied during training, such as the SVM type, the kernel function degree, and the kernel parameter gamma, as well as several parameters related to the decision function, namely the support vector coefficients sv_coef, the constant term rho of the decision function, and so on. If an RBF kernel function is used during training, substituting the support vector coefficients sv_coef, the kernel parameter gamma, and the constant term rho into formula (1) gives the decision function.
Formula (1):
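$$f(x) = \sum_{i=1}^{n} \mathrm{sv\_coef}_i \, \exp\left(-\gamma \, \lVert \mathrm{sv}_i - x \rVert^{2}\right) - \rho$$
where x is the sample whose label is to be predicted, n is the number of support vectors, sv_i is the i-th support vector, sv_coef_i is the coefficient of the i-th support vector, gamma is the kernel parameter, and rho is the constant term of the decision function.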
The constant term rho of the decision function is empirically set in advance, and the embodiment of the present invention is not limited thereto.
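The RBF decision function can be evaluated directly from the model parameters, as in the following sketch. The function name rbf_decision_value and the argument layout are hypothetical; the computation is the standard RBF-kernel decision value, that is, the sum over support vectors of sv_coef times exp(-gamma times the squared distance to x), minus rho.

```python
import math
from typing import Sequence


def rbf_decision_value(x: Sequence[float],
                       support_vectors: Sequence[Sequence[float]],
                       sv_coef: Sequence[float],
                       gamma: float, rho: float) -> float:
    """Evaluate sum_i sv_coef[i] * exp(-gamma * ||sv_i - x||^2) - rho."""
    total = 0.0
    for sv, coef in zip(support_vectors, sv_coef):
        squared_distance = sum((a - b) ** 2 for a, b in zip(sv, x))
        total += coef * math.exp(-gamma * squared_distance)
    return total - rho
```

The sign of the returned value selects one of the two classes; which sign corresponds to the fault class depends on the label order stored in the trained model.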
S204: Perform a verification test on the initial disk prediction model using the test set until the obtained test result is the same as the expected result, and determine the initial disk prediction model obtained by the current training as the disk prediction model.
In S204, the test result is obtained by running the initial disk prediction model on the test set.
Specifically, in S204, the function svmpredict is called with the obtained SVM model to test the test set, so as to judge whether its samples belong to the fault class. The samples of each server in the test set are checked in time order; once a sample is judged to belong to the fault class, the server is predicted to have a hard disk failure. Otherwise, the server is predicted not to have a hard disk failure in a future period of time. The test result corresponding to each server in the test set is recorded and counted, and the recorded results are verified, that is, it is checked whether the test result is the same as the expected result, so as to monitor the prediction performance of the SVM model. When the test result is determined to differ from the expected result, a new test set is called to retrain the SVM model until the obtained test result is the same as the expected result. At that point the initial disk prediction model is qualified, and the initial disk prediction model obtained by the current training is determined to be the disk prediction model.
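The per-server check in time order can be sketched as follows. The function name first_fault_time, the (timestamp, features) sample layout, and the is_fault callable, which stands in for a call to the trained classifier, are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def first_fault_time(
    samples: List[Tuple[int, Sequence[float]]],
    is_fault: Callable[[Sequence[float]], bool],
) -> Optional[int]:
    """Scan one server's samples in time order and return the timestamp of
    the first sample classified as fault, or None if there is none."""
    for timestamp, features in sorted(samples, key=lambda s: s[0]):
        if is_fault(features):
            return timestamp  # the server is predicted to have a disk failure
    return None  # no failure is predicted for this server
```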
Optionally, when judging whether a test sample belongs to the fault class, the classification accuracy reported by the LIBSVM function svmpredict does not by itself reflect the quality of the prediction, so other prediction metrics are needed to determine the prediction accuracy.
For example, the prediction accuracy of the SVM model is determined using metrics such as precision, recall, and the comprehensive performance index F-measure. Specifically, after the SVM model has been tested on the test set, the precision (which reflects the false alarm rate) and the recall (which reflects the missed detection rate) are obtained and substituted into formula (2) to obtain the F-measure.
Formula (2):
$$F\text{-measure} = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
wherein the function F-measure is used to evaluate the overall performance.
Based on formula (2), when both the precision and the recall are high, the F-measure is high; when either the precision or the recall is low, the F-measure is closer to the lower of the two, which is also the characteristic required by an actual system. Therefore, the F-measure can be used to evaluate the accuracy and quality of the fault prediction of the SVM model scientifically and effectively.
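As a worked illustration, and assuming that formula (2) is the standard harmonic-mean F-measure, the three indexes can be computed from the counts of true positives, false positives and false negatives. The function name and the handling of empty denominators are assumptions made for this sketch.

```python
from typing import Tuple


def precision_recall_f(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F-measure) from confusion counts.

    tp: failing disks predicted as failing
    fp: healthy disks predicted as failing (false alarms)
    fn: failing disks predicted as healthy (missed reports)
    """
    # Empty denominators are mapped to 0.0; this convention is an assumption.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


example_precision, example_recall, example_f = precision_recall_f(tp=8, fp=2, fn=8)
```

With these counts the precision is 0.8 and the recall is 0.5; the resulting F-measure lies between the two and below their arithmetic mean of 0.65, that is, closer to the lower value.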
Alternatively, in addition to the above manner, the prediction precision may be determined by using a detection rate and a false alarm rate. The detection rate (Detection Rate) is synonymous with the recall mentioned above. The false alarm rate (False Alarm Rate, FAR) refers to the proportion of good disks that are falsely reported as bad disks.
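The two rates can be sketched as below, using the definitions given above: the detection rate is the share of bad disks that are detected, and the false alarm rate is the share of good disks reported as bad. The function name and the boolean encoding (True means bad disk) are assumptions.

```python
from typing import Sequence, Tuple


def detection_and_false_alarm_rate(
    actual: Sequence[bool], predicted: Sequence[bool]
) -> Tuple[float, float]:
    """Return (detection rate, false alarm rate); True means 'bad disk'."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # bad disks detected
    fn = sum(1 for a, p in pairs if a and not p)      # bad disks missed
    fp = sum(1 for a, p in pairs if not a and p)      # good disks reported as bad
    tn = sum(1 for a, p in pairs if not a and not p)  # good disks reported as good
    detection_rate = tp / (tp + fn) if (tp + fn) else 0.0
    false_alarm_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return detection_rate, false_alarm_rate


toy_actual = [True, True, False, False, False, False]
toy_predicted = [True, False, True, False, False, False]
toy_rates = detection_and_false_alarm_rate(toy_actual, toy_predicted)
```

In the toy data one of two bad disks is detected and one of four good disks is falsely reported.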
In order to better understand the disk prediction model shown in the above embodiments of the present invention, the following description is made.
The data of all servers stored in the database are traversed, and the disk I/O data of one week, from 6/1/2019 to 6/7/2019, are acquired and divided into a training set and a test set; the training set is fed to the classification model, which completes the preparation of the training set.
Wherein the disk I/O data relates to 12376 servers, including 77 failed servers, corresponding to 77 hard disk failures.
The combination of 12-dimensional statistical features of the disk I/O data in the training set is clustered by using the K-Means algorithm; the features are avg902all, sum902, avg902, std902, avg903all, sum903, avg903, std903, avg999all, sum999, avg999 and std999. The data are thereby classified, and the characteristic data in each class of classified data, that is, the data required for modeling the disk prediction model, are extracted.
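The clustering step can be illustrated with a plain Lloyd-style K-Means iteration. This is a generic sketch, not the embodiment's implementation: the function names are hypothetical, the initial centroids are supplied by the caller so that the result is deterministic, and the toy points have 2 dimensions rather than the 12 statistical features named above.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points: Sequence[Sequence[float]], centroids: Sequence[Sequence[float]]) -> List[int]:
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]


def kmeans(
    points: Sequence[Sequence[float]],
    initial_centroids: Sequence[Sequence[float]],
    iterations: int = 20,
) -> Tuple[List[int], List[List[float]]]:
    """Run a fixed number of assignment/update rounds and return (labels, centroids)."""
    centroids = [list(c) for c in initial_centroids]
    for _ in range(iterations):
        labels = assign(points, centroids)
        for i in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return assign(points, centroids), centroids


toy_points = [[0, 0], [0, 1], [10, 10], [10, 11]]
toy_labels, toy_centroids = kmeans(toy_points, initial_centroids=[[0, 0], [10, 10]])
```

Each row passed to `kmeans` would, in the setting described above, be one 12-dimensional feature vector; the returned labels give the class of each row.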
Data sets of different dimensions are prepared according to different time windows. The data sets are converted into the data format required by the support vector machine library LIBSVM, and the optimal penalty parameter Cost and kernel parameter gamma are selected through cross validation. The selected Cost and gamma are then used to call the training function svmtrain on the training set, so as to train a support vector machine classifier and obtain an SVM model. Finally, the SVM model is called through the function svmpredict to test and predict the test set, so as to determine the final SVM model, that is, the disk prediction model.
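The format conversion mentioned above can be sketched as follows. LIBSVM reads one sample per line in the sparse form `label index:value ...` with 1-based feature indices; the helper names and the choice to omit zero-valued features are assumptions of this sketch.

```python
from typing import Iterable, List, Sequence, Tuple


def to_libsvm_line(label: int, features: Sequence[float]) -> str:
    """Format one sample as '<label> <index>:<value> ...' with 1-based indices."""
    parts = [str(label)]
    for index, value in enumerate(features, start=1):
        if value != 0:  # the sparse format allows zero-valued features to be omitted
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


def to_libsvm_lines(samples: Iterable[Tuple[int, Sequence[float]]]) -> List[str]:
    """Convert (label, feature vector) pairs into LIBSVM text lines."""
    return [to_libsvm_line(label, features) for label, features in samples]


toy_lines = to_libsvm_lines([(1, [0.5, 0.0, 3.0]), (0, [0.0, 2.25, 0.0])])
```

The resulting lines can be written to a text file and passed to the LIBSVM training and prediction tools; the cross-validation search over Cost and gamma is not shown here.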
S103: for each target disk, processing the disk parameters based on a disk identification model constructed in advance to obtain the disk score of each target disk.
In S103, the disk parameters are obtained by processing the Self-Monitoring, Analysis and Reporting Technology (SMART) log information of the target disk, and the disk identification model is obtained by training with the SMART log information of the disk in a second preset historical time period.
In the embodiment of the invention, in addition to predicting whether a disk will fail, hidden bad disks also need to be accurately located. In consideration of system performance, failure prediction is performed first, and only a disk that is predicted to fail is further checked for being a hidden bad disk; a secondary prediction therefore needs to be performed on the disks that may have a failure, that is, step S103 is performed.
Specific contents of S103: for each target disk, multiplying each data item in the SMART log information by a preset ratio of failed servers to normal servers to obtain the disk parameters; and inputting the disk parameters into the disk identification model, so that the disk identification model processes the disk parameters corresponding to each target disk and outputs the disk score corresponding to each target disk.
It should be noted that the disk parameters include the number of bytes written by the disk, the size of the read and write commands, Error Checking and Correction (ECC) delayed correction errors, non-media error count, and ECC delayed correction read and write errors.
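The scaling of SMART data items into disk parameters can be sketched as below. The function name, the dictionary layout and the item names are hypothetical; the only rule taken from the description is that each data item is multiplied by the preset ratio of failed servers to normal servers.

```python
from typing import Dict


def to_disk_parameters(
    smart_items: Dict[str, float], failed_servers: int, normal_servers: int
) -> Dict[str, float]:
    """Multiply every SMART data item by the ratio failed_servers / normal_servers."""
    if normal_servers <= 0:
        raise ValueError("normal_servers must be positive")
    ratio = failed_servers / normal_servers
    return {name: value * ratio for name, value in smart_items.items()}


# Hypothetical item names and counts, used only to show the arithmetic.
toy_parameters = to_disk_parameters(
    {"non_medium_error_count": 10.0, "bytes_written": 4.0},
    failed_servers=1,
    normal_servers=4,
)
```

The scaled dictionary is what would then be passed to the disk identification model.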
In the embodiment of the invention, the process of training by using the SMART log information of the disk in the second preset historical time period to obtain the disk identification model comprises the following steps:
S11: acquiring SMART log information of the disk in a second preset historical time period, and taking the SMART log information as a training set.
It should be noted that SMART is used as a monitoring and self-checking mechanism for the internal state of the disk, and can well detect and describe each state feature of the disk, and convert the current disk state feature into a specific set of values, that is, the state feature of the disk is presented in the form of a vector.
In the embodiment of the present invention, the SMART log information includes a plurality of data items, that is, the data items refer to values into which the disk state feature of the impending failure or the normal disk state feature is converted.
Each data item can be represented in two numerical forms, namely a normal value (normalized value) and a raw value. The raw value is the original value recorded for the item, and the normal value is a value in the range 0 to 255 obtained by normalizing the raw value.
Each data item was analyzed, and the normal value is adopted for all data items of the SMART log information except data item #5 and data item #197. The raw values of data item #5 and data item #197 reflect changes in the disk state more sensitively, so the raw value is adopted for these two items.
The raw values or normal values of the data items of a disk that is about to fail show an obvious trend of change as its state deteriorates, whereas the values of the data items of a normal disk do not change noticeably. Therefore, when selecting training data, not only the values of the data items but also the change amplitude of those values over the last day (the ratio of the change in a value within one day to the value one day earlier) is used.
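The change amplitude defined above can be written as a small worked example. The function name and the treatment of a zero value one day earlier are assumptions, since the description does not say how that case is handled.

```python
def change_amplitude(value_now: float, value_one_day_ago: float) -> float:
    """Ratio of the one-day change in a data item to its value one day earlier."""
    if value_one_day_ago == 0:
        # Assumption: no change from zero gives 0.0, any change from zero gives infinity.
        return 0.0 if value_now == 0 else float("inf")
    return (value_now - value_one_day_ago) / value_one_day_ago


toy_amplitude = change_amplitude(value_now=120.0, value_one_day_ago=100.0)
```

A value that rises from 100 to 120 within one day therefore has a change amplitude of 0.2.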
S11 includes: taking the state data of the disk in the second preset historical time period as the state features of a disk about to fail, taking an equal amount of healthy disk data as the state features of a normal disk, and converting both kinds of state features into data items to be used as the training set.
It should be noted that the second preset historical time period refers to a certain period of time before the disk failure and is preset by a technician; for example, it may be set to the 72 hours before the disk failure.
For example, the SMART log information contains 19 data items, including 10 normal values (#1, #3, #5, #7, #9, #187, #189, #194, #195, #197), 2 raw values (#5, #197) and 7 change amplitudes (#1, #5, #187, #194, #195, #197). The state data of the 72 hours before the disk failure are taken as the state features of a disk about to fail, and an equal amount of healthy disk data is taken as the state features of a normal disk. In the training set, a data item whose label is "1" is data acquired within the 72 hours before a disk failure, and a data item whose label is "0" is data from a disk in normal operation; the number of data items labeled "1" is the same as the number labeled "0". That is, data points labeled "1" indicate the state features of a disk about to fail, and data points labeled "0" indicate the state features of a disk that continues to operate stably.
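The balanced labeling described in this example can be sketched as follows. The function name, the random sampling of healthy data and the fixed seed are assumptions; the description only requires that the numbers of samples labeled "1" and "0" be equal.

```python
import random
from typing import List, Sequence, Tuple

FeatureVector = Sequence[float]


def build_training_set(
    failing_samples: List[FeatureVector],
    healthy_samples: List[FeatureVector],
    seed: int = 0,
) -> List[Tuple[FeatureVector, int]]:
    """Label failing samples 1 and an equal number of healthy samples 0."""
    if len(healthy_samples) < len(failing_samples):
        raise ValueError("not enough healthy samples to balance the training set")
    rng = random.Random(seed)
    # Draw as many healthy samples as there are failing samples (sampling is an assumption).
    chosen_healthy = rng.sample(healthy_samples, len(failing_samples))
    labelled = [(s, 1) for s in failing_samples] + [(s, 0) for s in chosen_healthy]
    rng.shuffle(labelled)
    return labelled


toy_failing = [[0.9, 0.8], [0.7, 0.95]]
toy_healthy = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.05], [0.3, 0.25]]
toy_training_set = build_training_set(toy_failing, toy_healthy)
```

In practice each feature vector would hold the 19 data items listed above rather than the two toy values used here.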
S12: training the expert model based on the training set to obtain a trained disk identification model.
Specific contents of S12: multiplying each data item in the training set by the preset ratio of failed servers to normal servers to obtain the training parameters corresponding to each disk; and training the expert model with the training parameters corresponding to each disk in sequence to obtain the trained disk identification model.
The expert model is a machine learning model whose optimization objective function is computed from the specified training data.
S104: judging whether the disk score is greater than a preset threshold; if so, executing step S105, and if not, executing step S106.
Specific contents of S104: judging whether the disk score of each target disk is greater than the preset threshold; if so, executing step S105, and if not, no hidden bad disk exists.
It should be noted that the preset threshold is a value set by a technician based on repeated experience, and is used to indicate whether a disk is a hidden bad disk.
A hidden bad disk is a disk that has already been damaged but whose damage is concealed and not easily discovered.
S105: determining that the target disk corresponding to the disk score is a hidden bad disk.
In the process of specifically implementing step S105, the target disk corresponding to a disk score greater than the preset threshold is determined to be a hidden bad disk.
S106: determining that no hidden bad disk exists among the target disks.
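Steps S104 to S106 amount to a threshold comparison on the disk scores, which can be sketched as below. The function name, the disk identifiers and the threshold value are hypothetical; an empty result corresponds to S106.

```python
from typing import Dict, List


def find_hidden_bad_disks(disk_scores: Dict[str, float], threshold: float) -> List[str]:
    """Return the target disks whose score is strictly greater than the threshold."""
    return sorted(disk for disk, score in disk_scores.items() if score > threshold)


toy_scores = {"sda": 0.9, "sdb": 0.4, "sdc": 0.7, "sdd": 0.6}
toy_hidden_bad = find_hidden_bad_disks(toy_scores, threshold=0.6)
```

A score equal to the threshold is not flagged, because the comparison in S104 is "greater than".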
Optionally, because the SMART log information obtained from the disk is of floating point and str types, in order to facilitate subsequent standardized parameter output, the SMART log information needs to be converted into a JSON file and stored on the bastion host, so as to establish a standard parameter list, that is, a host list. The host list holds key values covering all hard disk parameter types, and the key values are of the dict type.
Optionally, the server number corresponding to the hidden bad disk is obtained and matched against the cluster and host numbers in the database to determine whether SMART log information corresponding to the server number exists. Each target key value object target_dict in the json files under the host list directory is traversed to search for a key value matching the host number; if one exists, all matched key values are entered into the standard_dict object in sequence; if not, default values are entered into the standard_dict object. After the data have been entered, the final set of hidden bad disks is obtained and stored in memory in a numpy data structure to facilitate subsequent printing.
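The lookup with default values described above can be sketched as follows. The layout of the target dictionary (host number mapped to a dictionary of parameter values), the function name and the default value 0 are all assumptions made for illustration.

```python
from typing import Any, Dict, Sequence


def fill_standard_dict(
    target_dict: Dict[str, Dict[str, Any]],
    host_number: str,
    parameter_names: Sequence[str],
    default: Any = 0,
) -> Dict[str, Any]:
    """Copy the matched host's values into a standard dict, using defaults for gaps."""
    matched = target_dict.get(host_number)
    if matched is None:
        # No key value matches the host number: enter default values only.
        return {name: default for name in parameter_names}
    return {name: matched.get(name, default) for name in parameter_names}


toy_target = {"host-01": {"bytes_written": 12.5}}
toy_names = ["bytes_written", "non_medium_error_count"]
toy_found = fill_standard_dict(toy_target, "host-01", toy_names)
toy_missing = fill_standard_dict(toy_target, "host-02", toy_names)
```

Every returned dictionary has the same keys, which keeps the later output step uniform regardless of whether a host was found.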
In the embodiment of the invention, the acquired disk data are processed by the disk prediction model to determine the disks that may have hidden dangers, that is, the target disks. The disk identification model then performs secondary identification on each target disk and determines its disk score; when the disk score is greater than the preset limit value, the target disk corresponding to the disk score is determined to be a hidden bad disk. Problems such as computation skew caused by hidden bad disks can thus be avoided, and the robustness and availability of the disks can be improved.
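The two-stage flow summarized above can be sketched with the two models passed in as callables. The function name and the toy stand-ins for the disk prediction model and the disk identification model are hypothetical; only the order of the stages follows the description.

```python
from typing import Callable, Dict, List, Sequence


def identify_hidden_bad_disks(
    disk_data: Dict[str, Sequence[float]],
    may_have_hidden_danger: Callable[[Sequence[float]], bool],
    disk_score: Callable[[str], float],
    limit: float,
) -> List[str]:
    """Stage 1 selects target disks; stage 2 scores only those and applies the limit."""
    target_disks = [d for d, data in disk_data.items() if may_have_hidden_danger(data)]
    scores = {d: disk_score(d) for d in target_disks}
    return [d for d in target_disks if scores[d] > limit]


# Toy stand-ins for the two trained models (assumptions, not the real models).
toy_disk_data = {"sda": [0.9], "sdb": [0.1], "sdc": [0.8]}
toy_score_table = {"sda": 0.95, "sdb": 0.99, "sdc": 0.30}
toy_result = identify_hidden_bad_disks(
    toy_disk_data,
    may_have_hidden_danger=lambda data: data[0] > 0.5,
    disk_score=lambda disk: toy_score_table[disk],
    limit=0.5,
)
```

Here "sdb" is never scored because the first stage does not select it, which reflects the point that secondary identification is applied only to target disks.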
Based on the disk identification method shown in the above embodiment of the present invention, after step S102 is executed to process the disk data based on the pre-constructed disk prediction model and obtain the target disk, the method further includes the following steps:
S21: for each target disk, processing the SMART log information of the target disk to obtain disk parameters.
In the process of specifically implementing step S21, for each target disk, each data item in the SMART log information is multiplied by a preset ratio of failed servers to normal servers to obtain the disk parameters.
It should be noted that the disk parameters include the number of bytes written, the size of read and write commands, ECC deferred correction errors, non-media error count, and ECC deferred correction read and write errors.
S22: performing a calculation on the disk parameters corresponding to each target disk to obtain the disk score corresponding to each target disk.
In the process of specifically implementing step S22, for each target disk, the number of written bytes a (Gigabytes processed (write)), the size d of read and write commands, the ECC delayed-correction error c, the non-medium error count b (Non-medium error count) and the ECC delayed-correction read-write error x corresponding to the target disk are substituted into formula (3) to obtain the disk score L of the target disk.
Formula (3): [formula image not reproduced in this text; it gives the disk score L in terms of a, b, c, d and x]
wherein a is the number of written bytes processed; b is the non-medium error count; c is the ECC delayed-correction error; d is the size of read and write commands; and x is the ECC delayed-correction read-write error.
In the embodiment of the invention, the acquired disk data are processed by the disk prediction model to determine the disks that may have hidden dangers, that is, the target disks. The disk parameters used for calculation are then determined, and the disk parameters corresponding to each target disk are calculated to obtain the disk score of that target disk; when the disk score is greater than the preset limit value, the target disk corresponding to the disk score is determined to be a hidden bad disk. Problems such as computation skew caused by hidden bad disks can thus be avoided, and the robustness and availability of the disks can be improved.
Corresponding to the method shown in the above embodiment of the present invention, an embodiment of the present invention also discloses a disk identification apparatus. FIG. 4 is a schematic structural diagram of the disk identification apparatus shown in the embodiment of the present invention, and the apparatus includes:
the data access unit 401 is configured to acquire disk data of all disks currently running in the data center.
The disk prediction model 402 is configured to process disk data based on a disk prediction model that is constructed in advance to obtain a target disk, where the target disk is a disk that may have a hidden danger, and the disk prediction model is constructed by using the first construction unit 403, and the number of the target disks is at least one.
The disk identification model 404 is configured to, for each target disk, process a disk parameter based on a disk identification model that is constructed in advance to obtain a disk score of each target disk, where the disk parameter is obtained by processing SMART log information of the target disk, and the disk identification model is constructed by using the second construction unit 405.
The determining unit 406 is configured to determine that the target disk corresponding to the disk score is a hidden bad disk when the disk score is determined to be greater than the preset limit.
Optionally, based on the above-described magnetic disk identification apparatus, the first constructing unit 403 includes:
a first data access module, configured to acquire disk I/O data in the database in a first preset historical time period and divide the disk I/O data into a training set and a test set.
A processing module, configured to classify the training set and extract the characteristic data in each class of the training set.
A classification model training module, configured to train based on the characteristic data in each class of the training set to obtain a trained initial disk prediction model.
An optimization module, configured to perform a verification test on the initial disk prediction model by using the test set until the obtained test result is the same as the expected result, and determine the initial disk prediction model obtained by the current training as the disk prediction model, wherein the test result is obtained by predicting the test set with the initial disk prediction model.
Optionally, based on the above-described magnetic disk identification apparatus, the second constructing unit 405 includes:
a second data access module, configured to acquire SMART log information of the disk in a second preset historical time period and take the SMART log information as a training set.
An expert model training module, configured to train the expert model based on the training set to obtain a trained disk identification model.
Optionally, based on the above-described disk identification apparatus, the determining unit 406 is specifically configured to: judging whether the disk score of each target disk is larger than a preset limit value or not; and if the score of the magnetic disk is larger than the preset limit value, taking the target magnetic disk with the score larger than the preset limit value as a hidden bad disk.
It should be noted that the specific principle and execution process of each unit in the disk identification apparatus disclosed in the above embodiment of the present invention are the same as those of the disk identification method shown in the above embodiment of the present invention; reference may be made to the corresponding parts of the disk identification method, and details are not described here again.
In the embodiment of the invention, the acquired disk data are processed by the disk prediction model to determine the disks that may have hidden dangers, that is, the target disks. The disk identification model then performs secondary identification on each target disk and determines its disk score; when the disk score is greater than the preset limit value, the target disk corresponding to the disk score is determined to be a hidden bad disk. Problems such as computation skew caused by hidden bad disks can thus be avoided, and the robustness and availability of the disks can be improved.
Based on the disk identification apparatus shown above, referring to fig. 5 in conjunction with fig. 4, the disk identification apparatus further includes:
a calculating unit 407, configured to, after the disk prediction model 402 obtains the target disks, process the SMART log information of each target disk to obtain disk parameters, and perform a calculation on the disk parameters corresponding to each target disk to obtain the disk score corresponding to each target disk.
In the embodiment of the invention, the acquired disk data are processed by the disk prediction model to determine the disks that may have hidden dangers, that is, the target disks. The disk parameters used for calculation are then determined, and the disk parameters corresponding to each target disk are calculated to obtain the disk score of that target disk; when the disk score is greater than the preset limit value, the target disk corresponding to the disk score is determined to be a hidden bad disk. Problems such as computation skew caused by hidden bad disks can thus be avoided, and the robustness and availability of the disks can be improved.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, the system or system embodiments are substantially similar to the method embodiments and therefore are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described system and system embodiments are only illustrative, wherein the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A disk identification method, the method comprising:
acquiring disk data of all currently running disks in a data center;
processing the disk data based on a disk prediction model which is constructed in advance to obtain a target disk, wherein the target disk is a disk which possibly has hidden dangers, the disk prediction model is obtained by training disk input/output (I/O) data of a database in a first preset historical time period, and the number of the target disks is at least one;
for each target disk, processing disk parameters based on a disk identification model which is constructed in advance to obtain a disk score of each target disk, wherein the disk parameters are obtained by processing the hard disk self-inspection SMART log information of the target disk, and the disk identification model is obtained by training the SMART log information of the disk in a second preset historical time period;
and when the score of the disk is determined to be larger than a preset limit value, determining that the target disk corresponding to the score of the disk is a hidden bad disk.
2. The method according to claim 1, wherein the training with the disk I/O data of the database in the first preset historical time period to obtain the disk prediction model comprises:
acquiring disk I/O data in a database in a first preset historical time period, and dividing the disk I/O data into a training set and a test set;
classifying the training sets, and extracting characteristic data in each type of training set;
training based on the characteristic data in each type of training set to obtain an initial disk prediction model after training;
and performing verification test on the initial disk prediction model by using the test set until an obtained test result is the same as an expected result, and determining that the initial disk prediction model obtained by current training is a disk prediction model, wherein the test result is obtained by predicting the test set in the initial disk prediction model.
3. The method according to claim 1, wherein the training with SMART log information of the disk in the second preset historical time period to obtain the disk identification model comprises:
acquiring SMART log information of a magnetic disk in a second preset historical time period, and taking the SMART log information as a training set;
and training an expert model based on the training set to obtain a trained disk identification model.
4. The method of claim 1, wherein after the disk data is processed based on a disk prediction model that is constructed in advance to obtain a target disk, the method further comprises:
processing SMART log information of each target disk to obtain disk parameters;
and calculating the disk parameters corresponding to each target disk to obtain the disk score corresponding to each target disk.
5. The method according to claim 1, wherein when it is determined that the disk score is greater than a preset limit, determining that the target disk corresponding to the disk score is a hidden bad disk includes:
judging whether the disk score of each target disk is larger than a preset limit value or not;
and if the score of the magnetic disk is larger than the preset limit value, taking the target magnetic disk with the score larger than the preset limit value as a hidden bad disk.
6. A disk identification device, the device comprising:
the data access unit is used for acquiring the disk data of all the currently running disks in the data center;
the disk prediction model is used for processing the disk data based on a disk prediction model which is constructed in advance to obtain a target disk, the target disk is a disk which possibly has hidden danger, the disk prediction model is constructed by utilizing a first construction unit, and the number of the target disks is at least one;
the disk identification model is used for processing disk parameters on the basis of a pre-constructed disk identification model aiming at each target disk to obtain a disk score of each target disk, wherein the disk parameters are obtained by processing SMART log information of the target disk, and the disk identification model is constructed by utilizing the second construction unit;
and the determining unit is used for determining that the target disk corresponding to the disk score is a hidden bad disk when the disk score is determined to be larger than a preset limit value.
7. The apparatus of claim 6, wherein the first building unit comprises:
the first data access module is used for acquiring disk I/O data in a database in a first preset historical time period and dividing the disk I/O data into a training set and a test set;
the processing module is used for classifying the training sets and extracting the characteristic data in each type of training set;
the classification model training module is used for training based on the characteristic data in each class of training set to obtain an initial disk prediction model after training;
and the optimization module is used for performing verification test on the initial disk prediction model by using the test set until an obtained test result is the same as an expected result, and determining the initial disk prediction model obtained by current training as the disk prediction model, wherein the test result is obtained by predicting the test set in the initial disk prediction model.
8. The apparatus of claim 6, wherein the second building unit comprises:
the second data access module is used for acquiring SMART log information of the magnetic disk in a second preset historical time period and taking the SMART log information as a training set;
and the expert model training module is used for training an expert model based on the training set to obtain a trained disk identification model.
9. The apparatus according to claim 6, wherein the determining unit is specifically configured to:
judging whether the disk score of each target disk is larger than a preset limit value or not; and if the score of the magnetic disk is larger than the preset limit value, taking the target magnetic disk with the score larger than the preset limit value as a hidden bad disk.
10. The apparatus of claim 6, further comprising:
the computing unit is used for processing the SMART log information of each target disk to obtain disk parameters; and calculating the disk parameters corresponding to each target disk to obtain the disk score corresponding to each target disk.
CN202110718605.4A 2021-06-28 2021-06-28 Disk identification method and device Pending CN113434088A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110718605.4A CN113434088A (en) 2021-06-28 2021-06-28 Disk identification method and device

Publications (1)

Publication Number Publication Date
CN113434088A true CN113434088A (en) 2021-09-24

Family

ID=77755230

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110718605.4A Pending CN113434088A (en) 2021-06-28 2021-06-28 Disk identification method and device

Country Status (1)

Country Link
CN (1) CN113434088A (en)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
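董勇; 蒋艳凰; 卢宇彤; 周恩强: "Comparison of machine learning methods for disk failure prediction" (面向磁盘故障预测的机器学习方法比较), 计算机工程与科学 (Computer Engineering & Science), no. 12, 15 December 2015 (2015-12-15), pages 15 - 17 *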

Similar Documents

Publication Publication Date Title
US10031829B2 (en) Method and system for it resources performance analysis
CN107025153B (en) Disk failure prediction method and device
US8453027B2 (en) Similarity detection for error reports
CN110164501B (en) Hard disk detection method, device, storage medium and equipment
CN108052528A (en) A kind of storage device sequential classification method for early warning
US20180082215A1 (en) Information processing apparatus and information processing method
AU2017274576B2 (en) Classification of log data
CN107168995B (en) Data processing method and server
KR20180054992A (en) Failure prediction method of system resource for smart computing
WO2022001125A1 (en) Method, system and device for predicting storage failure in storage system
CN112596964A (en) Disk failure prediction method and device
CN111813585A (en) Prediction and processing of slow discs
Xu et al. General feature selection for failure prediction in large-scale SSD deployment
CN112951311A (en) Hard disk fault prediction method and system based on variable weight random forest
CN113239006A (en) Log detection model generation method and device and log detection method and device
CN112733897A (en) Method and equipment for determining abnormal reason of multi-dimensional sample data
CN111400122B (en) Hard disk health degree assessment method and device
CN113434088A (en) Disk identification method and device
CN115729761A (en) Hard disk fault prediction method, system, device and medium
CN116910526A (en) Model training method, device, communication equipment and readable storage medium
CN114661505A (en) Storage component fault processing method, device, equipment and storage medium
CN115981911A (en) Memory failure prediction method, electronic device and computer-readable storage medium
JP6666489B1 (en) Failure sign detection system
CN113806178A (en) Cluster node fault detection method and device
CN116610469B (en) Comprehensive quality performance test method and system for solid state disk

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination