CN111078440B

CN111078440B - Disk error detection method, device and storage medium

Info

Publication number: CN111078440B
Application number: CN201911243272.3A
Authority: CN
Inventors: 张
Original assignee: Tencent Technology Shenzhen Co Ltd; Huazhong University of Science and Technology
Current assignee: Tencent Technology Shenzhen Co Ltd; Huazhong University of Science and Technology
Priority date: 2019-12-06
Filing date: 2019-12-06
Publication date: 2022-03-08
Anticipated expiration: 2039-12-06
Also published as: CN111078440A

Abstract

The embodiment of the application provides a disk error detection method, a disk error detection device and a storage medium, wherein the method comprises the following steps: acquiring a historical characteristic vector sequence corresponding to a target area of a magnetic disk; the historical characteristic vector sequence comprises historical characteristic vectors arranged according to time sequence; acquiring error probability data of each historical error sector in the target area of the disk; determining probability data of sector errors corresponding to the candidate disk risk area according to the historical feature vector sequence and the error probability data; the candidate disk risk area is an area in which new sector errors occur in an area adjacent to the historical error sector in the disk; obtaining the prediction probability data of the disk with the sector errors according to the probability data of the candidate disk risk areas with the sector errors; and generating a prediction result of the sector error of the magnetic disk based on the prediction probability data. The scheme can predict the specific position of the sector error.

Description

Disk error detection method, device and storage medium

Technical Field

The embodiment of the application relates to the technical field of storage, in particular to a disk error detection method, a disk error detection device and a storage medium.

Background

Sector errors are a common type of error in modern magnetic disks. A sector error that occurs while the service input/output (I/O) layer is processing a request makes the service inaccessible. More seriously, it can cause permanent data loss when data is being reconstructed, which seriously affects the reliability of the storage system. Much existing research addresses this problem by proactively predicting latent sector errors (LSE) with machine learning methods.

During research on and practice of the prior art, the inventors of the embodiments of the present application found that these methods are not suited to time-series training data sets and therefore produce poor prediction results. Moreover, they use the prediction results only to speed up inspection of the entire disk, and ignore the local high-risk areas of the disk, whose prioritized inspection could further improve the reliability of the storage system. In addition, these methods never consider inspecting the high-risk areas in advance according to the I/O characteristics, so the overall inspection efficiency is low.

Disclosure of Invention

The embodiment of the application provides a disk error detection method, a disk error detection device and a storage medium, which can improve the inspection efficiency and predict the specific position of a sector error.

In a first aspect, an embodiment of the present application provides a disk error detection method, where the method includes:

acquiring a historical characteristic vector sequence corresponding to a target area of a magnetic disk; the historical characteristic vector sequence comprises historical characteristic vectors arranged according to time sequence; the historical characteristic vector is a characteristic vector which represents that sector errors occur in the target area of the magnetic disk; the target area of the disk comprises historical error sectors;

acquiring error probability data of each historical error sector in the target area of the disk;

determining probability data of sector errors corresponding to the candidate disk risk area according to the historical feature vector sequence and the error probability data; the candidate disk risk area is an area in which new sector errors occur in an area adjacent to the historical error sector in the disk;

obtaining the prediction probability data of the disk with the sector errors according to the probability data of the candidate disk risk areas with the sector errors;

and generating a prediction result of the sector error of the magnetic disk based on the prediction probability data.

In one possible design, the obtaining a historical feature vector sequence corresponding to a target region of a disk includes:

acquiring disk monitoring data of a disk to which the disk target area belongs within the historical time;

determining characteristics with dependency relationship in time from the disk monitoring data;

extracting a plurality of historical feature vectors from the magnetic disk monitoring data according to the features with the dependency relationship in time;

and obtaining the historical feature vector sequence according to the plurality of historical feature vectors.

In one possible design, the method further includes:

receiving a disk service I/O request for the disk target area;

executing access operation to the target area according to the disk service I/O request;

executing a first inspection strategy in the disk target area, and executing a second inspection strategy on fragment sectors of the disk, wherein the fragment sectors are the discontinuous sectors of the disk, other than the disk target area, that remain after inspection;

and after the access operation on the target area of the disk is finished, polling a residual area according to the sequence of sectors at an initial polling rate, wherein the residual area refers to the sectors of the disk except the target area of the disk and the fragment area.

In one possible design, the obtaining error probability data of a sector error occurring in each sector in the target area of the disk includes:

acquiring historical patrol data in the historical time, wherein the historical patrol data is sector patrol data of a magnetic disk with historical sector errors;

determining the probability of the occurrence of sector errors of each disk candidate area from the historical routing inspection data, wherein the disk candidate areas comprise at least two continuous sectors;

determining a common sector of the target area of the disk and each candidate area of the disk;

and obtaining the error probability data according to the probability of the sector error in the disk candidate area and the probability of the sector error of the public sector.
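
The steps of this design can be sketched as follows. This is only an illustration under stated assumptions, not the claimed implementation: the function name `sector_error_probabilities`, the representation of a disk candidate area as a (first sector, last sector, probability) tuple, and the rule of giving each common sector the highest probability among the candidate areas that contain it are all hypothetical choices made for the example.

```python
from typing import Dict, List, Tuple

# (first_sector, last_sector, probability of a sector error in this area)
CandidateArea = Tuple[int, int, float]


def sector_error_probabilities(
    target_area: Tuple[int, int],
    candidate_areas: List[CandidateArea],
) -> Dict[int, float]:
    """Map each common sector to an error probability (illustrative rule)."""
    t_start, t_end = target_area
    probs: Dict[int, float] = {}
    for c_start, c_end, p in candidate_areas:
        if c_end - c_start + 1 < 2:
            raise ValueError("a candidate area needs at least two continuous sectors")
        # Common sectors = overlap of the target area and the candidate area.
        lo, hi = max(t_start, c_start), min(t_end, c_end)
        for sector in range(lo, hi + 1):
            # Assumption: keep the highest probability seen for the sector.
            probs[sector] = max(probs.get(sector, 0.0), p)
    return probs


areas = [(100, 103, 0.2), (102, 110, 0.6), (500, 510, 0.9)]
result = sector_error_probabilities((101, 104), areas)
```

Here the third candidate area shares no sector with the target area, so it contributes nothing to `result`.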

In one possible design, the method further includes:

executing an initial polling strategy after the effective time of the disk service I/O request is over;

and executing an initial polling strategy in the interval time or idle time of the disk service I/O request.

In one possible design, after the outputting the indication information, the method further includes:

if the probability of the sector errors occurring in the target area of the disk is predicted to be higher than the first probability, increasing the polling rate of the target area of the disk;

and if the probability of the sector errors occurring in the target area of the disk is predicted to be lower than the second probability, reducing the polling rate of the target area of the disk.
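
The threshold rule above can be sketched as follows. The function name `adjust_inspection_rate`, the default values of the first and second probabilities, and the multiplicative adjustment factor are hypothetical; the text only states that the rate is increased above the first probability and reduced below the second probability.

```python
def adjust_inspection_rate(
    predicted_prob: float,
    current_rate: float,
    first_prob: float = 0.7,   # hypothetical upper threshold
    second_prob: float = 0.2,  # hypothetical lower threshold
    factor: float = 2.0,       # hypothetical adjustment multiple
) -> float:
    """Return the new polling rate for the disk target area."""
    if predicted_prob > first_prob:
        return current_rate * factor   # higher risk: inspect faster
    if predicted_prob < second_prob:
        return current_rate / factor   # lower risk: inspect slower
    return current_rate                # in between: keep the current rate
```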

In one possible design, after determining the candidate disk risk regions, the method further includes:

acquiring the number of first disks, the number of second disks, polling time, the extra polling time, polling time intervals and a risk ratio, wherein the risk ratio is the proportion of all disk target areas in the disks; the first disk number refers to the number of disks with historical sector errors and correct predicted sector errors; the second disk number refers to the number of disks with no historical sector errors and with correct predicted sector errors; the extra inspection time refers to the inspection time of the fragment area;

and calculating the average inspection time according to the number of the first disks, the number of the second disks, the inspection time, the extra inspection time, the inspection time interval and the risk ratio.

calculating the false positive rate and the false negative rate of predicting the candidate disk risk area;

taking the candidate disk risk area as a disk target area, and updating the candidate disk risk area into a disk target area set;

when the number of updated disk target areas in the disk target area set exceeds a preset threshold value, calculating a compensation factor of the average inspection time;

updating the average patrol time using the compensation factor.
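
The false positive rate and false negative rate used in this design follow the usual definitions and can be computed from prediction counts as sketched below. The function names and the zero-denominator convention are assumptions made for the example.

```python
def false_positive_rate(fp: int, tn: int) -> float:
    """Share of disks without sector errors that were predicted to have them."""
    total = fp + tn
    return fp / total if total else 0.0


def false_negative_rate(fn: int, tp: int) -> float:
    """Share of disks with sector errors that were predicted to have none."""
    total = fn + tp
    return fn / total if total else 0.0


fpr = false_positive_rate(fp=5, tn=95)
fnr = false_negative_rate(fn=2, tp=8)
```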

In one possible design, the indication information is stored on a blockchain node.

In a second aspect, an embodiment of the present application provides a disk error detection apparatus having the function of implementing the disk error detection method provided in the first aspect. The function can be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above function, and the modules may be software and/or hardware.

In one possible design, the disk error detection apparatus includes:

the acquisition module is used for acquiring a historical characteristic vector sequence corresponding to a target area of a disk; the historical characteristic vector sequence comprises historical characteristic vectors arranged according to time sequence; the historical characteristic vector is a characteristic vector which represents that sector errors occur in the target area of the magnetic disk; the target area of the disk comprises historical error sectors;

the acquisition module is further used for acquiring error probability data of each historical error sector in the target area of the disk;

the processing module is used for determining probability data of sector errors corresponding to the candidate disk risk area according to the historical characteristic vector sequence and the error probability data obtained by the obtaining module; the candidate disk risk area is an area in which new sector errors occur in an area adjacent to the historical error sector in the disk;

the processing module is further used for obtaining the predicted probability data of the disk with the sector errors according to the probability data of the candidate disk risk areas with the sector errors; and generating a prediction result of the sector error of the magnetic disk based on the prediction probability data.

In one possible design, the processing module is specifically configured to:

and extracting a plurality of historical feature vectors from the disk monitoring data according to the features with the time dependency relationship, wherein the historical feature vectors are time series in historical time.

In one possible design, the processing module is further configured to:

receiving, through the input/output module, a disk service I/O request for the disk target area;

executing access operation on the disk target area according to the disk service I/O request;

executing a first inspection strategy in the disk target area, and executing a second inspection strategy on fragment sectors of the disk, wherein the fragment sectors are the discontinuous sectors of the disk, other than the disk target area, that remain after inspection;

In one possible design, the processing module is specifically configured to:

In one possible design, the processing module is further configured to:

executing an initial polling strategy in the interval time or idle time of the disk service I/O request;

In one possible design, the processing module is further configured to, after the input/output module outputs the indication information:

In one possible design, after determining the candidate disk risk regions, the processing module is further configured to:

updating the average patrol time using the compensation factor.

In one possible design, the prediction results are stored on blockchain nodes.

In another aspect, the present invention provides a disk error detection apparatus, which includes at least one connected processor, a memory and an input/output unit, where the memory is used to store a computer program, and the processor is used to call the computer program in the memory to execute the method according to the first aspect.

Yet another aspect of the embodiments of the present application provides a computer-readable storage medium, which includes instructions that, when executed on a computer, cause the computer to perform the method of the first aspect.

Compared with the prior art, in the scheme provided by the embodiment of the application, on one hand, the probability of the sector error of the disk is used as the prediction result of the prediction model. The high-risk disk is predicted through the prediction model, manual feature selection can be avoided, and therefore the difficulty of engineering implementation is reduced, and the specific position of the sector error is predicted. On the other hand, since the long-term dependency between the state data of each sector in the disk is sufficiently taken into consideration by the history feature vector sequence, the prediction accuracy of the prediction model can be improved.

Drawings

Fig. 1 is a schematic flow chart of routing inspection of a disk inside a storage device in an embodiment of the present application;

FIG. 2 is a schematic diagram of a neural network model used in the storage system according to an embodiment of the present invention;

FIG. 3 is a schematic flow chart illustrating a disk error detection method according to an embodiment of the present application;

FIG. 4 is a schematic diagram of error probability data in an embodiment of the present application;

FIG. 5a is a graph illustrating a comparison of historical feature data of a first data set and a second data set in an embodiment of the present application;

FIG. 5b is a schematic diagram showing comparison between predicted results of predicted LSE scheme 1 and predicted LSE scheme 2 in the example of the present application;

FIG. 5c is a schematic diagram of the prediction performance of the LSTM prediction model as a function of the number of S.M.A.R.T. attributes in an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a distributed system in an embodiment of the present application;

FIG. 7 is a schematic diagram showing a structure of a disk error detecting apparatus according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a computer device for executing a disk error detection method according to an embodiment of the present application.

Detailed Description

The terms "first," "second," and the like in the description and in the claims of the embodiments of the application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprise" and "have," and any variations thereof, are intended to cover non-exclusive inclusions, such that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those steps or modules expressly listed, but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus, such that the division of modules presented in the present application is merely a logical division and may be implemented in a practical application in a different manner, such that multiple modules may be combined or integrated into another system or some features may be omitted or not implemented, and such that couplings or direct couplings or communicative connections shown or discussed may be through interfaces, indirect couplings or communicative connections between modules may be electrical or the like, the embodiments of the present application are not limited. Moreover, the modules or sub-modules described as separate components may or may not be physically separated, may or may not be physical modules, or may be distributed in a plurality of circuit modules, and some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiments of the present application.

The embodiment of the application provides a disk error detection method, a disk error detection device and a storage medium. The method can be used for a storage system, and particularly used for predicting the probability of sector errors of a disk in the storage system. In some embodiments, a storage system includes a disk layer, a sector layer, and an I/O layer. Fig. 1 is a flow of polling a disk inside a storage system.

The disk layer corresponds to a data center (DATA CENTER). It includes a plurality of disks, which are divided into low-risk disks (Low-risk disks) and high-risk disks (High-risk Areas in a disks). A high-risk disk is a disk whose risk value is above a first threshold, and a low-risk disk is a disk whose risk value is below a second threshold. Latent sector error prediction is performed at the disk layer.

The sector layer is a layer that manages a disk after logically dividing the disk. In the sector layer, the disk is divided into a high risk area and a low risk area. The high risk areas correspond to high risk disks and the low risk areas correspond to low risk disks.

The I/O layer responds to service I/O requests to access the data stored on the disks in the disk layer. When the disk of the embodiment of the present application supports Self-Monitoring, Analysis and Reporting Technology (S.M.A.R.T.), the data stored in the disk may be S.M.A.R.T. index data. Through monitoring instructions on the disk and monitoring software on the host, the running state of the magnetic head, the platter, the motor and the circuit can be analyzed and compared against the history record and the preset safety values. When a value falls outside the safe range, a warning is issued automatically.

In some embodiments, the storage system may predict the occurrence of sector errors in the disk through a neural network model, that is, the disk error detection method may be implemented through the neural network model. FIG. 2 is a schematic structural diagram of the neural network model adopted by the storage system. The neural network model includes an input layer, at least one Long Short-Term Memory (LSTM) layer, and at least one fully connected layer (Dense layer). The embodiments of the present application do not limit the number of LSTM layers in the prediction model; FIG. 2 shows 3 LSTM layers only as an example. Likewise, the number of fully connected layers is not limited; FIG. 2 shows 2 Dense layers only as an example. F_n, F_{n-1}, ..., F_{n-m+1} are the inputs of the prediction model, namely the historical feature vector sequence. Y_n is the output of the prediction model, i.e. the probability of a sector error on the disk. Each input of the prediction model corresponds to a feature vector of the disk at one historical time.
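
The data flow of such a model (a sequence of feature vectors passing through stacked LSTM layers and then fully connected layers to yield one probability) can be sketched in plain Python as follows. This is an illustration only: the weights are untrained random placeholders, biases are omitted, and the hidden size, the ReLU activation in the first fully connected layer, and all names (`LSTMLayer`, `predict_error_probability`) are assumptions rather than details taken from the embodiments.

```python
import math
import random
from typing import List

Vector = List[float]


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def _rand_matrix(rows: int, cols: int, rng: random.Random) -> List[Vector]:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def _matvec(w: List[Vector], v: Vector) -> Vector:
    return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]


class LSTMLayer:
    """One LSTM layer with untrained placeholder weights (no biases)."""

    def __init__(self, n_in: int, n_hidden: int, rng: random.Random) -> None:
        self.n_hidden = n_hidden
        # One weight matrix per gate: input (i), forget (f), output (o), candidate (g).
        self.w = {k: _rand_matrix(n_hidden, n_in + n_hidden, rng) for k in "ifog"}

    def forward(self, sequence: List[Vector]) -> List[Vector]:
        h = [0.0] * self.n_hidden  # hidden state
        c = [0.0] * self.n_hidden  # cell state
        outputs = []
        for x in sequence:  # oldest feature vector first
            z = x + h  # concatenate the input with the previous hidden state
            i = [_sigmoid(a) for a in _matvec(self.w["i"], z)]
            f = [_sigmoid(a) for a in _matvec(self.w["f"], z)]
            o = [_sigmoid(a) for a in _matvec(self.w["o"], z)]
            g = [math.tanh(a) for a in _matvec(self.w["g"], z)]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
            outputs.append(h)
        return outputs


def predict_error_probability(sequence: List[Vector], seed: int = 0) -> float:
    """Return Y_n in (0, 1) for a sequence F_{n-m+1}, ..., F_n."""
    rng = random.Random(seed)
    n_features, hidden = len(sequence[0]), 4
    layers = [LSTMLayer(n_features, hidden, rng),
              LSTMLayer(hidden, hidden, rng),
              LSTMLayer(hidden, hidden, rng)]
    out = sequence
    for layer in layers:
        out = layer.forward(out)
    last = out[-1]  # hidden state after the most recent feature vector
    # Two fully connected layers: hidden -> hidden (ReLU), hidden -> 1 (sigmoid).
    d1 = [max(0.0, a) for a in _matvec(_rand_matrix(hidden, hidden, rng), last)]
    return _sigmoid(_matvec(_rand_matrix(1, hidden, rng), d1)[0])


y_n = predict_error_probability([[0.1, 0.2, 0.3]] * 5)
```

Because the weights are random, `y_n` carries no predictive meaning; the sketch only shows how m feature vectors are reduced to a single probability.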

The embodiment of the application mainly provides the following technical scheme:

1. Based on an LSTM prediction model, a sequence of historical feature vectors of the disk is acquired and used as the input of the prediction model, and the probability of a sector error occurring on the disk is the output of the prediction model. High-risk disks are predicted by the prediction model. Manual feature selection can be avoided, which reduces the difficulty of engineering implementation. Furthermore, since LSTM is good at handling time-series training data sets, the accuracy of the prediction model can be improved.

2. The locality of sector errors is explored by analyzing a large number of disks with sector errors; the local high-risk areas of high-risk disks are then identified and inspected with priority, which improves the reliability of the storage system at the sector level.

3. For the high-risk areas, a piggyback inspection strategy (Piggyback polling Strategy) is used at the I/O layer, as shown in FIG. 1. That is, when service I/O accesses a high-risk area, inspection (i.e., a piggybacked read operation) can be performed on the remaining scattered sectors, while the other areas follow the initial inspection strategy (initial mirroring Strategy). The initial inspection strategy is thus optimized according to the characteristics of the service I/O, achieving a lower Mean Time To Detection (MTTD) with little or no additional inspection cost and improving inspection efficiency.

In the embodiment of the application, MTTD is used for estimating the reliability of the prediction model in the storage system, and the patrol cost is used for estimating the influence in the storage system.

Referring to fig. 3, a method for detecting a disk error provided in the embodiment of the present application is described below, where the embodiment of the present application takes predicting a sector error occurring in one disk as an example, and the operation for predicting a sector error occurring in another disk may refer to the embodiment of the present application and is not described in detail. The embodiment of the application comprises the following steps:

201. Acquire a historical feature vector sequence corresponding to the target area of the disk.

The historical feature vector sequence comprises historical feature vectors arranged according to a time sequence; the historical characteristic vector is a characteristic vector which represents that sector errors occur in the target area of the magnetic disk; the target area of the disk includes historical error sectors.

In some embodiments, the obtaining a plurality of historical feature vectors corresponding to the target area of the disk includes:

Therefore, the historical characteristic vector sequence of the embodiment of the application fully considers the change trend of the historical state of the disk in a long term and the time sequence dependence characteristics presented among the historical characteristic vectors, and when the embodiment of the application is implemented through a neural network model, on one hand, the neural network model of the embodiment of the application can process the historical characteristic vector sequence with time sequence or sequence, so that the prediction precision of the neural network model can be higher, and the prediction time can be reduced. On the other hand, the historical feature vector sequence reflects a time sequence dependency characteristic, so that the change trend of the sector state in the disk can be represented in time, and therefore, when the probability of the sector error of the disk is predicted based on the historical feature vector sequence, an effective basis can be provided, so that the prediction result is obtained based on the long-term historical state change trend of the disk, and the prediction result is more accurate.
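
As an illustration of how a time-ordered historical feature vector sequence might be assembled from disk monitoring data, consider the following sketch. The record layout (a dictionary with a `timestamp` key), the attribute names in the example, and the function name `build_feature_sequence` are hypothetical; which features have a temporal dependency is assumed to have been decided beforehand.

```python
from typing import Dict, List, Sequence


def build_feature_sequence(
    records: List[Dict[str, float]],
    feature_names: Sequence[str],
    window: int,
) -> List[List[float]]:
    """Return the last `window` feature vectors, oldest first."""
    if window < 1:
        raise ValueError("window must be at least 1")
    # Arrange the monitoring records in time order.
    ordered = sorted(records, key=lambda r: r["timestamp"])
    if len(ordered) < window:
        raise ValueError("not enough monitoring records for the requested window")
    # Keep only the selected features of the most recent `window` records.
    return [[rec[name] for name in feature_names] for rec in ordered[-window:]]


monitoring = [
    {"timestamp": 3, "reallocated": 30.0, "pending": 3.0},
    {"timestamp": 1, "reallocated": 10.0, "pending": 1.0},
    {"timestamp": 2, "reallocated": 20.0, "pending": 2.0},
]
sequence = build_feature_sequence(monitoring, ["reallocated", "pending"], window=2)
```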

202. Acquire error probability data of sector errors occurring in each sector in the target area of the disk.

The error probability data is the probability that a sector error occurs in each sector of the target area of the disk. A sector in the target area may or may not have experienced a sector error in the historical time; it is also possible that an adjacent sector has experienced a sector error, so that this sector may experience one as well. In each of these cases, the sector in the target area of the disk has some probability of a sector error. The error probability data may be represented as in the schematic diagram shown in FIG. 4.

In some embodiments, the obtaining error probability data of a sector error occurring in each sector in the target area of the magnetic disk includes:

203. Determine probability data of sector errors corresponding to the candidate disk risk area according to the historical feature vector sequence and the error probability data.

The candidate disk risk area refers to a sector with a new sector error in a target range; the target range refers to an area adjacent to a history error sector.

Specifically, the probability of occurrence of a sector error in a single disk is not uniformly distributed but has some locality. That is, the probability of a sector error occurring in some local areas of the disk (denoted as high risk areas) is much higher than in other areas. Therefore, these local areas can be regarded as candidate disk risk areas, which are high risk areas for preferentially inspecting high risk disks at the sector level.
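
The locality idea can be illustrated by deriving candidate areas from the neighborhoods of historical error sectors. The sketch below is an assumption-laden example: the fixed `radius` that defines "adjacent", the merging of overlapping or touching neighborhoods, and the function name `candidate_risk_areas` are not specified by the embodiments.

```python
from typing import List, Tuple


def candidate_risk_areas(
    error_sectors: List[int], radius: int, last_sector: int
) -> List[Tuple[int, int]]:
    """Merge the neighborhoods of historical error sectors into ranges."""
    areas: List[Tuple[int, int]] = []
    for sector in sorted(set(error_sectors)):
        # Neighborhood of the error sector, clamped to the disk boundaries.
        start = max(0, sector - radius)
        end = min(last_sector, sector + radius)
        if areas and start <= areas[-1][1] + 1:
            # Overlapping or touching neighborhoods form one area.
            areas[-1] = (areas[-1][0], max(areas[-1][1], end))
        else:
            areas.append((start, end))
    return areas


areas_found = candidate_risk_areas([100, 105, 1000], radius=4, last_sector=2000)
```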

204. Obtain the prediction probability data of sector errors occurring on the disk according to the probability data of sector errors occurring in the candidate disk risk areas.

The prediction probability data is the probability of predicting the sector error of the magnetic disk in the preset time. The preset time can be 1 day, one week or one year in the future, and the embodiment of the application does not limit the length of the preset time, and the preset time can be regular or have a certain randomness.

Optionally, there may be many candidate disk risk regions in one disk, so when calculating the probability of a sector error occurring in one disk, it needs to calculate according to the probability of a sector error occurring in all candidate disk risk regions in the disk, so that the obtained probability of a sector error occurring in the disk is more accurate, and the probability of a sector error occurring in the disk is reflected from the disk as a whole.
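
One simple way to combine the probabilities of all candidate disk risk areas into a disk-level probability is sketched below. It assumes that errors in different areas are independent, so that the disk is error-free only if every area is error-free; this independence assumption and the function name `disk_error_probability` are illustrative and are not stated in the embodiments.

```python
from typing import Iterable


def disk_error_probability(area_probs: Iterable[float]) -> float:
    """Probability that at least one candidate risk area has a sector error."""
    no_error = 1.0
    for p in area_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        no_error *= 1.0 - p  # probability that this area stays error-free
    return 1.0 - no_error


disk_prob = disk_error_probability([0.5, 0.5])
```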

205. Generate a prediction result of sector errors of the disk based on the prediction probability data.

The prediction result can be the indication information used for indicating the probability of the sector error occurring in the disk within the preset time.

In the embodiment of the application, on one hand, the probability of the sector error of the disk is used as the prediction result of the prediction model. The high-risk disk is predicted through the prediction model, manual feature selection can be avoided, and therefore the difficulty of engineering implementation is reduced, and the specific position of the sector error is predicted. On the other hand, since the long-term dependency between the state data of each sector in the disk is sufficiently taken into consideration by the history feature vector sequence, the prediction accuracy of the prediction model can be improved.

Optionally, in some embodiments of the present application, to implement a lower MTTD, the polling may be performed on the remaining fragment areas (also referred to as scattered areas) in parallel. Specifically, the method further comprises:

receiving a disk service I/O request for the disk target area;

And the second polling strategy is an initial polling strategy.

In some embodiments, the method further comprises:

when the effective time of the disk service I/O request (such as I/O_TIME shown in FIG. 1) is over, executing an initial polling strategy;

during the interval time (I/O_INTERVAL shown in FIG. 1) or idle time (I/O_exchange shown in FIG. 1) of the disk service I/O request, executing an initial polling strategy.

Therefore, when the service I/O accesses the target area (other areas in the disk execute the initial polling policy), the polling (i.e. piggybacking read operation) can be executed in the fragmented sectors, and the MTTD is further reduced under the condition that no or little extra polling cost is added on the premise that the target area of the disk is preferentially polled, so that the reliability of the storage system can be further improved. In addition, after the business I/O access target area is finished, the remaining area is patrolled.

After performing the service I/O operation, the head only needs to move within a small area. In addition, this strategy may also reduce the frequency of head movement because these fragmented sectors have been piggybacked without having to do other polling operations again. Thus, not only can the high risk regions be inspected further in advance to improve the reliability of the storage system (i.e., effectively reduce the MTTD), but also little or no inspection penalty can be added and the frequency of head movement can be reduced.
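
The division of sectors among the strategies described above can be sketched as a simple plan. The function name `plan_inspection`, the dictionary keys, and the representation of fragment sectors as a set of sector numbers are hypothetical; the sketch only shows which sectors are inspected while the service I/O accesses the target area and which are left for sequential inspection afterwards.

```python
from typing import Dict, List, Set, Tuple


def plan_inspection(
    total_sectors: int,
    target_area: Tuple[int, int],
    fragment_sectors: Set[int],
) -> Dict[str, List[int]]:
    """Split the sectors of one disk into three inspection groups."""
    t_start, t_end = target_area
    target = list(range(t_start, t_end + 1))
    # Fragment sectors exclude anything inside the target area.
    fragments = sorted(s for s in fragment_sectors if not t_start <= s <= t_end)
    covered = set(target) | set(fragments)
    remaining = [s for s in range(total_sectors) if s not in covered]
    return {
        "during_io_target": target,        # first inspection strategy
        "during_io_piggyback": fragments,  # piggybacked reads on fragment sectors
        "after_io_sequential": remaining,  # initial rate, in sector order
    }


plan = plan_inspection(total_sectors=10, target_area=(2, 4), fragment_sectors={0, 3, 7})
```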

Optionally, in some embodiments of the present application, after the outputting the indication information, the method further includes:

Optionally, in some embodiments of the present application, after predicting the sector errors, the disk with the historical sector errors will be verified first, and then not only the patrol speed is increased by X times, but also a second patrol policy is executed in the high-risk area. The following describes a calculation manner of MTTD, and after determining the candidate disk risk regions, the method further includes:

acquiring the number of first disks, the number of second disks, polling time, the extra polling time, polling time intervals and a risk ratio, wherein the risk ratio is the proportion of all target areas in the disks; the first disk number refers to the number of disks with historical sector errors and correct predicted sector errors; the second disk number refers to the number of disks with no historical sector errors and with correct predicted sector errors; the extra inspection time refers to the inspection time of the fragment area;

In some embodiments, the average patrol time is calculated by the following formula:

wherein X is the multiple by which the inspection speed is increased, r is the inspection speed of the disk, t is the time occupied by inspection under the specific service workload, TP_h(sf) is the number of disks with historical sector errors whose sector errors are correctly predicted, TP_nh(sf) is the number of disks without historical sector errors whose sector errors are correctly predicted, and TP_sf = TP_h(sf) + TP_nh(sf). o' is the proportion of the disk occupied by all high-risk areas, and ΔT is the additional inspection time introduced by the piggyback inspection. The larger the high-risk area, the larger ΔT. T is the inspection time interval. The calculation formula is only an example and may be modified; the embodiment of the present application is not limited in this respect.

In some embodiments, to evaluate the prediction accuracy of the prediction model, after determining the candidate disk risk regions, the method further comprises:

taking the candidate disk risk area as a disk target area, and updating the candidate disk risk area into a target area set;

when the number of updated target areas in the disk target area set exceeds a preset threshold value, calculating a compensation factor of the average inspection time;

updating the average patrol time using the compensation factor.

For ease of understanding, the disk error detection method is described below in an application scenario based on FIG. 1 and FIG. 2. The embodiment of the application can also calculate the inspection cost to evaluate the cost of each inspection, so that the inspection strategy can be continuously improved later. The inspection cost is denoted Cost_sf and can be obtained by the following formula:

Cost_sf = T × (X × r × FP_sf + Y × r × PN_sf) + (T + ΔT) × X × r × TP_sf

wherein X is the inspection speed, TP_h(sf) is the number of disks with historical sector errors whose sector errors are correctly predicted, TP_nh(sf) is the number of disks without historical sector errors whose sector errors are correctly predicted, and TP_sf = TP_h(sf) + TP_nh(sf). o' is the proportion of the disk occupied by all high-risk areas, and ΔT is the extra inspection time introduced by the piggyback inspection. The larger the high-risk area, the larger ΔT. T is the polling time interval.
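As an illustration only, the inspection cost formula above can be evaluated term by term as in the following sketch. The function name, the parameter names and all numeric values are hypothetical. Y, FP_sf and PN_sf are not defined in the surrounding text and are simply passed through as they appear in the formula.

```python
def inspection_cost(T, delta_T, X, Y, r, fp_sf, pn_sf, tp_sf):
    """Evaluate Cost_sf = T*(X*r*FP_sf + Y*r*PN_sf) + (T + dT)*X*r*TP_sf.

    All arguments are taken as given by the formula; Y, FP_sf and PN_sf
    are not defined in the text, so their meaning here is an assumption.
    """
    # Term for disks inspected at the plain polling interval T.
    baseline = T * (X * r * fp_sf + Y * r * pn_sf)
    # Term for correctly predicted disks, which also pay the extra
    # piggyback inspection time delta_T.
    piggyback = (T + delta_T) * X * r * tp_sf
    return baseline + piggyback


# Illustrative numbers only; they are not taken from the experiments.
cost = inspection_cost(T=10.0, delta_T=2.0, X=2.0, Y=1.0, r=0.5,
                       fp_sf=3, pn_sf=4, tp_sf=5)
print(cost)
```

Because ΔT enters only the second term, enlarging the high-risk area raises the cost in proportion to TP_sf, which matches the statement that a larger high-risk area brings a larger ΔT.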

The whole prediction process is described as follows:

(1) acquiring a data set

A first data set and a second data set are acquired. The first data set and the second data set are S.M.A.R.T. data from two data centers, respectively. The first data set is historical feature data of the data center of company A over a period T of 50 months. The second data set is historical feature data of the data center of company B over a period T of 26 months. FIG. 5a compares the historical feature data of the first data set and the second data set; each piece of historical feature data includes a model number of a prediction model used by the corresponding company (e.g., company A, company B), a disk status (sector error or non-sector error), a drive count, and a sample count.

An event in which a sector error occurs in a disk is declared a sector error, and an event in which any other error occurs is declared a non-sector error. Further, an experiment may be performed using three actual service workloads from company B, denoted W-A, W-B, and W-C, respectively.

(2) Calculating an evaluation index of a prediction model

In some embodiments, the prediction accuracy of the prediction model can be evaluated by four evaluation indexes: False Positive Rate (FPR), False Negative Rate (FNR), MTTD and patrol cost. FPR is the proportion of non-LSE disks that are incorrectly predicted as LSE disks. FNR is the proportion of LSE disks that are incorrectly predicted as non-LSE disks. The lower the FNR, the better the model. In order to make the MTTD more accurate, a corresponding improvement factor can be set for the MTTD and the inspection cost respectively. In some embodiments, the calculation formula of the improvement factor of MTTD and the improvement factor of patrol cost is as follows:

wherein MTTD_factor is the improvement factor of the MTTD and Cost_factor is the improvement factor of the inspection cost. MTTD_factor depends on the performance of the LSE prediction model, while Cost_factor depends on both the prediction performance and T. Both ΔT and o' depend on the size of the target area.

(3) Predicting sector errors in 4 disks using a predictive model

Two disks, M-A and M-B, are selected from the first data set, and two disks, M-C and M-D, are selected from the second data set. For these 4 disks, the predicted LSE scheme 1 introduced in the embodiment of the present application is denoted by SF, and the predicted LSE scheme 2 of the prior art is denoted by SU. Correspondingly, predicting the LSEs of disk M-A with predicted LSE scheme 1 of the embodiment of the present application is denoted SF (M-A), and predicting the LSEs of disk M-A with predicted LSE scheme 2 of the prior art is denoted SU (M-A); the other disks are denoted in the same way and are not described again.

The calculation of the evaluation indexes of the prediction model is described below.

In some embodiments, the prediction accuracy of the prediction model can be evaluated by four evaluation indexes: FPR, FNR, MTTD and patrol cost. FPR is the proportion of non-LSE disks that are incorrectly predicted as LSE disks. FNR is the proportion of LSE disks that are incorrectly predicted as non-LSE disks. The lower the FNR, the better the model. These indexes are introduced separately below:

(a) calculating FPR and FNR of each disk by a prediction model

In predicted LSE scheme 1, the predicted results (i.e., FPR and FNR) for the 4 disks M-A, M-B, M-C and M-D, respectively, are obtained, as shown by the dashed lines in FIG. 5 b.

In predicted LSE scheme 2, the predicted results (i.e., FPR and FNR) for the 4 disks M-A, M-B, M-C and M-D, respectively, are obtained, as shown by the solid line in FIG. 5 b.

As can be seen from fig. 5b, the prediction results of SF are closer to the lower left corner than those of SU, that is, lower FNR and FPR can be achieved by using the predicted LSE scheme 1 of the embodiment of the present application. Specifically, when the FPR is limited to 10%, SF correctly predicts 98% (SU: 93%), 91% (SU: 88%), 82% (SU: 71%) and 80% (SU: 68%) of all errors. The disks used are M-A and M-B from company A and M-C and M-D from company B. When the FPR is limited to 5%, 97.5% of all errors of M-A can still be correctly predicted.
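The FPR and FNR definitions used above can be made concrete with a short sketch. It assumes, purely for illustration, that the prediction model outputs a score per disk and that a disk is predicted as an LSE disk when its score reaches a threshold. The function names, the scores and the 20% FPR limit are hypothetical and are not taken from the experiments.

```python
def fpr_fnr(actual, predicted):
    """Return (FPR, FNR) for boolean sequences where True means 'LSE disk'."""
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)
    negatives = sum(1 for a in actual if not a)
    positives = sum(1 for a in actual if a)
    fpr = fp / negatives if negatives else 0.0
    fnr = fn / positives if positives else 0.0
    return fpr, fnr


def best_threshold(scores, actual, fpr_limit):
    """Pick the threshold with the lowest FNR among those with FPR <= fpr_limit.

    Returns (threshold, fpr, fnr), or None if no threshold meets the limit.
    """
    best = None
    for thr in sorted(set(scores)):
        predicted = [s >= thr for s in scores]
        fpr, fnr = fpr_fnr(actual, predicted)
        if fpr <= fpr_limit and (best is None or fnr < best[2]):
            best = (thr, fpr, fnr)
    return best


# Hypothetical per-disk scores: the first four disks really have LSEs.
actual = [True, True, True, True, False, False, False, False, False, False]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2, 0.1, 0.1, 0.05, 0.05]
print(best_threshold(scores, actual, fpr_limit=0.2))
```

Fixing the FPR limit first and then comparing the achievable FNR is what allows two schemes such as SF and SU to be compared at the same operating point.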

As can be seen, compared with the predicted LSE scheme 2, on the one hand, the prediction model of the embodiment of the present application exploits the fact that LSTM is good at processing data with time or sequence dependency, so that the prediction accuracy (also referred to as prediction performance or prediction effect) of the prediction model can be effectively improved. On the other hand, the predicted LSE scheme 1 in the embodiment of the present application does not require manual feature selection, so that the implementation difficulty can be reduced.
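To show what data with time or sequence dependency can look like in practice, the following sketch cuts time-ordered per-day feature vectors into fixed-length sequences of the kind a sequence model such as an LSTM consumes. The window length, the two-attribute vectors and the function name are assumptions made for illustration; the embodiment does not specify them here.

```python
def build_sequences(records, window):
    """Cut time-ordered feature vectors into overlapping sequences.

    records: list of feature vectors (lists of floats), oldest first.
    window:  number of consecutive vectors per sequence (assumed value).
    Returns a list of sequences; empty if there are fewer records than window.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [records[i:i + window] for i in range(len(records) - window + 1)]


# Hypothetical daily vectors with two monitored attributes each.
daily = [[5.0, 0.0], [5.0, 1.0], [6.0, 1.0], [6.0, 3.0]]
seqs = build_sequences(daily, window=3)
print(len(seqs), seqs[-1])
```

Each sequence keeps the original time order, so a model that reads it step by step can use earlier states of the disk when scoring the latest one.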

Further, in the predicted LSE scheme 1 of the embodiment of the present application, a different number of s.m.a.r.t attributes (the selected 4 attributes must include 2 of the last attribute) are randomly selected and FNRs are recorded. As shown in FIG. 5c, the prediction performance of the prediction model of LSTM is decreasing as the number of attributes increases from 2 to 32. In addition, after the number of attributes reaches a certain number, the FNR tends to be stable. This is because the attributes added later do not greatly affect the prediction performance of the prediction model. The experiment shows that the prediction LSE scheme 1 of the embodiment of the present application automatically completes the process of feature selection (i.e., feature extraction) through the deep neural network without performing other operations.

(b) Calculating MTTD and routing inspection cost of each disk through a prediction model

In some embodiments, in order to improve the accuracy of the MTTD and the inspection cost, a corresponding improvement factor may be set for the MTTD and the inspection cost, respectively. In some embodiments, the calculation formula of the improvement factor of MTTD and the improvement factor of patrol cost is as follows:

Since the candidate disk risk regions are continuously updated, the set of disk target regions is also continuously updated. Therefore, as the size of the high-risk area (i.e., the disk target area) increases, the patrol cost and the MTTD improvement factor of the predicted LSE scheme 1 both increase, since the larger the high-risk area, the larger ΔT. Note that even if the predicted LSE scheme 1 uses the largest high-risk area, the patrol cost does not increase significantly. When the high-risk area is 0 (that is, the piggyback patrol policy is not employed), the predicted LSE scheme 1 still achieves lower patrol cost and lower MTTD than the predicted LSE scheme 2.

In particular, disk-level LSEs are better predicted using LSTM (which is well suited to processing sequential S.M.A.R.T. data), which allows the predicted LSE scheme 1 to further accelerate the inspection speed appropriately. Furthermore, when the high-risk area is limited to around 104, the predicted LSE scheme 1 can reduce MTTD by about 33% compared to the predicted LSE scheme 2 without increasing the patrol cost. This is because the cost of the piggyback strategy is offset by the better prediction model. Furthermore, when the high-risk area is limited to around 107, the predicted LSE scheme 1 can also reduce MTTD by nearly 50% while increasing the cost by only 10%.

In the embodiment of the present application, the prediction result, the set of target areas of the disk, and other information may all be stored in the block chain. The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism and an encryption algorithm. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.

The blockchain underlying platform can comprise processing modules such as user management, basic service, intelligent contract and operation monitoring. The user management module is responsible for identity information management of all blockchain participants, including maintenance of public and private key generation (account management), key management, and maintenance of the correspondence between a user's real identity and blockchain address (authority management); where authorized, it supervises and audits the transactions of certain real identities and provides rule configuration for risk control (risk-control audit). The basic service module is deployed on all blockchain node devices and is used for verifying the validity of a service request and recording a valid request to storage after consensus on it is completed; for a new service request, the basic service first performs interface adaptation analysis and authentication processing (interface adaptation), then encrypts the service information through a consensus algorithm (consensus management), transmits the encrypted service information completely and consistently to a shared ledger (network communication), and records and stores it. The intelligent contract module is responsible for registering and issuing contracts, triggering contracts and executing contracts; developers can define contract logic through a programming language and issue it to the blockchain (contract registration), and execution is triggered by a key or another event according to the logic of the contract clauses to complete the contract logic; the module also provides functions for upgrading and cancelling contracts. The operation monitoring module is mainly responsible for deployment, configuration modification, contract setting and cloud adaptation during product release, and for visual output of real-time states during product operation, such as alarms, monitoring of network conditions, and monitoring of the health status of node devices.

The disk error detection apparatus (also referred to as a server or a storage system) executing the disk error detection method in the embodiment of the present application may be a node in a blockchain system as shown in fig. 6.

Any technical feature mentioned in the embodiment corresponding to any one of fig. 1 to 6 is also applicable to the embodiments corresponding to fig. 7 and 8 in the embodiment of the present application, and similar details are not repeated below.

The disk error detection method in the embodiment of the present application is described above, and the disk error detection apparatus that performs the method is described below.

Fig. 7 is a schematic structural diagram of a disk error detection apparatus, which can be applied to predicting sector errors occurring in a disk. The disk error detection apparatus in the embodiment of the present application can implement the steps corresponding to the disk error detection method executed in the embodiment corresponding to fig. 1. The functions realized by the disk error detection apparatus can be realized by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above functions, and the modules may be software and/or hardware. The disk error detection apparatus may include a processing module and an input/output module; for the implementation of the functions of the processing module and the input/output module, reference may be made to the operations performed in the embodiment corresponding to fig. 1, which are not described herein again. For example, the processing module may be used to control operations of the input/output module such as receiving, sending, input and output.

In some embodiments, the obtaining module may be configured to obtain a historical feature vector sequence corresponding to a target area of a disk; the historical characteristic vector sequence comprises historical characteristic vectors arranged according to time sequence; the historical characteristic vector is a characteristic vector which represents that sector errors occur in the target area of the magnetic disk; the target area of the disk comprises historical error sectors;

the processing module can be used for determining probability data of sector errors corresponding to the candidate disk risk area according to the historical characteristic vector sequence and the error probability data obtained by the obtaining module; the candidate disk risk area is an area in which a new sector error occurs in an area adjacent to the historical error sector in the disk;

In the embodiment of the application, a target area is determined, a plurality of historical characteristic vectors corresponding to the target area are obtained, and error probability data of sector errors of each sector in the target area of the magnetic disk are obtained; determining a candidate disk risk region according to the historical feature vector and the error probability data, wherein the candidate disk risk region is a sector with a new sector error in a target range; obtaining the probability of the disk generating the sector errors according to the probability of the candidate disk risk region generating the sector errors; and outputting the indication information. On one hand, the probability of sector errors occurring on the disk is used as a prediction result of the prediction model. And predicting the high-risk disk through a prediction model. Manual feature selection can be avoided, thereby reducing the difficulty of engineering implementation and predicting the specific location of the sector error. On the other hand, since the long-term dependency between the state data of each sector in the disk is sufficiently taken into consideration by the history feature vector sequence, the prediction accuracy of the prediction model can be improved.

In some embodiments, the processing module is specifically configured to:

In some embodiments, the processing module is further configured to:

In some embodiments, the processing module is specifically configured to:

In some embodiments, the processing module is further configured to:

In some embodiments, the processing module is further configured to, after the input/output module outputs the indication information:

In some embodiments, after determining the candidate disk risk regions, the processing module is further configured to:

updating the average patrol time using the compensation factor.

The disk error detection apparatus in the embodiment of the present application is described above from the perspective of modular functional entities, and is described below from the perspective of hardware processing. The apparatus shown in fig. 7 may have the structure shown in fig. 8. When the apparatus shown in fig. 7 has the structure shown in fig. 8, the processor and the input/output unit in fig. 8 can implement functions that are the same as or similar to those of the processing module and the input/output module provided in the foregoing apparatus embodiment, and the memory in fig. 8 stores the computer program that the processor calls when executing the disk error detection method. In this embodiment of the application, the entity device corresponding to the input/output module in the embodiment shown in fig. 7 may be an input/output interface, and the entity device corresponding to the processing module may be a processor.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the system, the apparatus and the module described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the embodiments of the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may be stored in a computer readable storage medium.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.

The computer program product includes one or more computer instructions. The procedures or functions described in accordance with the embodiments of the present application are generated in whole or in part when the computer program is loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or a wireless manner (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

The technical solutions provided by the embodiments of the present application are introduced in detail, and the principles and implementations of the embodiments of the present application are explained by applying specific examples in the embodiments of the present application, and the descriptions of the embodiments are only used to help understanding the method and core ideas of the embodiments of the present application; meanwhile, for a person skilled in the art, according to the idea of the embodiment of the present application, there may be a change in the specific implementation and application scope, and in summary, the content of the present specification should not be construed as a limitation to the embodiment of the present application.

Claims

1. A method for disk error detection, the method comprising:

generating a prediction result of the sector error of the disk based on the prediction probability data;

the method further comprises the following steps: the method comprises the steps that a piggybacked polling strategy is adopted for polling the disk, and the piggybacked polling strategy is used for executing polling of piggybacked reading operation in a fragment sector when a disk service I/O accesses a target area of the disk, and executing an initial polling strategy in the rest area;

the fragmented sectors are discontinuous sectors after inspection except the target area of the disk in the disk, and the remaining areas are sectors except the target area of the disk and the fragmented sectors in the disk.

2. The method of claim 1, wherein the obtaining the historical feature vector sequence corresponding to the target region of the disk comprises:

acquiring disk monitoring data of a disk to which the disk target area belongs within historical time;

3. The method according to claim 1 or 2, wherein the polling the disk by using the piggybacked polling policy comprises:

receiving a disk service I/O request for the disk target area;

when the target area of the disk is inspected, an initial inspection strategy is executed in the fragment sector at the same time so as to inspect in the disk;

and after the access operation on the target area of the disk is finished, polling the rest areas at the initial polling speed according to the sequence of the sectors.

4. The method of claim 3, further comprising:

5. The method of claim 3, wherein after generating the prediction of the disk having the sector error based on the prediction probability data, the method further comprises:

6. The method according to claim 4 or 5, wherein after determining the probability data of the occurrence of the sector error corresponding to the candidate disk risk region, the method further comprises:

acquiring the number of first disks, the number of second disks, routing inspection time, extra routing inspection time, routing inspection time intervals and a risk ratio, wherein the risk ratio is the proportion of all disk target areas in the disks; the first disk number refers to the number of disks with historical sector errors and correct predicted sector errors; the second disk number refers to the number of disks with no historical sector errors and with correct predicted sector errors; the extra polling time refers to polling time of the fragment sector;

7. The method of claim 6, wherein after determining the probability data of the occurrence of the sector error corresponding to the candidate disk risk region, the method further comprises:

updating the average patrol time using the compensation factor.

8. A disk error detection apparatus, characterized in that the disk error detection apparatus comprises:

the processing module is further used for obtaining the predicted probability data of the disk with the sector errors according to the probability data of the candidate disk risk areas with the sector errors; generating a prediction result of the sector error of the disk based on the prediction probability data;

the processing module is further configured to: the method comprises the steps that a piggybacked polling strategy is adopted for polling the disk, and the piggybacked polling strategy is used for executing polling of piggybacked reading operation in a fragment sector when a disk service I/O accesses a target area of the disk, and executing an initial polling strategy in the rest area;

9. A disk error detection apparatus, comprising:

at least one processor, a memory, and an input-output unit;

wherein the memory is for storing a computer program and the processor is for calling the computer program stored in the memory to perform the method of any one of claims 1-7.

10. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1-7.