WO2022227373A1 - Hard disk health assessment method and storage device - Google Patents

Hard disk health assessment method and storage device

Info

Publication number
WO2022227373A1
Authority
WO
WIPO (PCT)
Prior art keywords
hard disk
health
time
health degree
data
Prior art date
Application number
PCT/CN2021/118513
Other languages
English (en)
French (fr)
Inventor
王建星
李鹏
宋磊
党炜
周建华
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2022227373A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3037 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a memory, e.g. virtual memory, cache
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/22 Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
    • G06F11/26 Functional testing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/34 Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3447 Performance evaluation by modeling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Definitions

  • the present application relates to the field of storage technologies, and in particular, to a hard disk health assessment method and a storage device.
  • SMART: Self-Monitoring, Analysis and Reporting Technology
  • the health degree of the hard disk is usually evaluated by methods such as Euclidean distance method or linear evaluation method.
  • In the Euclidean distance method, the hard disk health is measured based on the distance between the SMART data of the hard disk and the threshold data.
  • In the linear evaluation method, the hard disk health is predicted from a fitted linear function of hard disk health over time.
  • the health degree of the hard disk obtained by the above method cannot stably indicate the actual health degree of the hard disk, and has a large error.
  • the embodiments of the present application aim to provide a hard disk health degree evaluation scheme, which obtains a stable and accurate health degree index by fusing the output values of multiple anomaly detection models based on hard disk SMART data.
  • a first aspect of the present application provides a method for evaluating the health of a hard disk.
  • the method is executed by a storage device and includes: acquiring data of multiple indicators related to the health degree of the hard disk at a specified usage time; inputting the data into a plurality of different models; and determining the health degree of the hard disk at the specified usage time according to the outputs of the plurality of models.
  • the determining the health degree of the hard disk at the specified usage time according to the outputs of the multiple models specifically includes: determining the health degree of the hard disk at the specified usage time based on a weighted sum of the outputs of the multiple models.
  • the outputs of multiple models are fused, and the weights of the outputs of each model can be equal or unequal, or can be dynamically adjusted according to different scenarios.
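The weighted-sum fusion described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `fuse_health`, the default of equal weights, and the normalization of the weights so that they sum to 1 are all assumptions made for this sketch.

```python
from typing import Optional, Sequence


def fuse_health(outputs: Sequence[float],
                weights: Optional[Sequence[float]] = None) -> float:
    """Fuse per-model outputs into one health degree via a weighted sum.

    Assumption: weights are normalized so they sum to 1; the source only
    says the weights may be equal, unequal, or adjusted per scenario.
    """
    if not outputs:
        raise ValueError("at least one model output is required")
    if weights is None:
        # Equal weights when none are supplied.
        weights = [1.0] * len(outputs)
    if len(weights) != len(outputs):
        raise ValueError("weights and outputs must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * o for w, o in zip(weights, outputs)) / total
```

For example, `fuse_health([1.0, 0.0], weights=[3, 1])` weights the first model three times as heavily as the second. Adjusting the weights for a given scenario only requires passing a different `weights` sequence.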
  • each of the multiple models is obtained by training based on an anomaly detection algorithm, and each model adopts a different anomaly detection algorithm.
  • Different models are obtained by training based on different anomaly detection algorithms. Since the anomaly detection algorithm is unsupervised learning, it is not necessary to manually label the samples, which saves labor costs. At the same time, the anomaly detection model can provide higher prediction accuracy.
  • the number of the multiple models is three, and the anomaly detection algorithms adopted by the three models are the isolation forest algorithm, the local outlier factor algorithm, and the K-means clustering algorithm, respectively.
  • the multiple models are either sent to the storage device by a training device that is used to train the multiple models, or obtained through training by the storage device itself.
  • the multiple models are trained by the following sampled data: sampled data of multiple indicators related to the health of the faulty hard disk within a preset usage period before the end of its life.
  • the anomaly detection model is trained by using the sampling data of multiple indicators of the faulty hard disk in a period of time before the end of its service life.
  • the storage device usually samples the SMART data of the second half of the service life of the hard disk, so the sampled data is easier to obtain.
  • the abnormality degree output by the abnormality detection model can be positively correlated with the health degree.
  • the abnormality degree can be directly used as the health degree, thereby reducing the computational cost of obtaining the health degree.
  • the hard disk includes a target disk, and the method further includes: acquiring or generating first data sets of a plurality of disks, each first data set including the health degree of the corresponding disk at multiple usage times; generating a second data set of the target disk, the second data set including the health degree of the target disk at multiple usage times, where the time span of the multiple usage times in the first data set is greater than the time span of the multiple usage times in the second data set;
  • selecting one of the plurality of disks according to the similarity between the health degrees at corresponding usage times in the first data sets and the second data set; and predicting the health degree of the target disk at a specified future time according to the first data set of the selected disk.
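The selection step above can be sketched as follows. The disks whose first data sets are compared against the target disk are called "reference disks" here, which is a label chosen for this sketch. The similarity measure (mean squared difference over shared usage times) and all names are assumptions; the source does not specify how similarity is computed.

```python
from typing import Dict

# A health curve maps a usage time (e.g. hours in service) to a health degree.
HealthCurve = Dict[int, float]


def curve_distance(reference: HealthCurve, target: HealthCurve) -> float:
    """Mean squared difference over usage times present in both curves.

    Assumption: a smaller distance means a higher similarity.
    """
    common = [t for t in target if t in reference]
    if not common:
        return float("inf")
    return sum((reference[t] - target[t]) ** 2 for t in common) / len(common)


def select_reference_disk(first_data_sets: Dict[str, HealthCurve],
                          second_data_set: HealthCurve) -> str:
    """Return the id of the reference disk whose curve best matches the target."""
    if not first_data_sets:
        raise ValueError("no reference disks supplied")
    return min(
        first_data_sets,
        key=lambda disk_id: curve_distance(first_data_sets[disk_id],
                                           second_data_set),
    )
```

Because the first data sets span more usage time than the second data set, the selected curve extends past the target disk's last observation, which is what makes it usable for prediction.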
  • the calculation cost can be reduced, and the health degree of the target disk at a certain time in the future can be accurately predicted.
  • the predicting the health degree of the target disk at a specified future time according to the first data set of the selected disk includes: fitting a mapping relationship between the health degree of the selected disk at a first usage time and the health degree of the target disk at a second usage time, the first usage time and the second usage time being corresponding times; and predicting the health degree of the target disk at multiple specified future times according to the mapping relationship and the first data set.
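One way to picture the fitting step is a least-squares line that maps the selected disk's health degree to the target disk's health degree at corresponding usage times. The linear form is an assumption made for this sketch (the source only speaks of a "mapping relationship"), and the names `fit_linear` and `predict_future` are hypothetical.

```python
from typing import Dict, Iterable, Sequence, Tuple

HealthCurve = Dict[int, float]  # usage time -> health degree


def fit_linear(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = a * x + b."""
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # All x values identical: fall back to a constant mapping.
        return 0.0, mean_y
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return a, mean_y - a * mean_x


def predict_future(selected: HealthCurve, target: HealthCurve,
                   future_times: Iterable[int]) -> HealthCurve:
    """Map the selected disk's later health values onto the target disk."""
    common = sorted(t for t in target if t in selected)
    a, b = fit_linear([selected[t] for t in common],
                      [target[t] for t in common])
    # Only times covered by the selected disk's (longer) data set are predicted.
    return {t: a * selected[t] + b for t in future_times if t in selected}
```

In this sketch "corresponding times" are taken to be identical usage times; an offset between the two disks' time axes would need an extra alignment step that is not shown.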
  • the accuracy of the predicted health degree of the target disk in the future time can be further improved.
  • the method further includes: determining the time when the health degree of the target disk reaches a threshold value according to the predicted health degree of the target disk at multiple specified times in the future, The time when the health degree reaches the threshold value is taken as the end-of-life time of the target disk.
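The end-of-life determination can be sketched as a scan over the predicted health degrees. This sketch assumes that health decreases toward failure, so "reaching the threshold" means falling to or below it; the function name and the integer time unit are hypothetical.

```python
from typing import Dict, Optional


def end_of_life_time(predicted: Dict[int, float],
                     threshold: float) -> Optional[int]:
    """Return the earliest predicted time at which health reaches the threshold.

    Assumption: lower health is worse, so the threshold is reached when
    health <= threshold. Returns None if it is never reached.
    """
    for t in sorted(predicted):
        if predicted[t] <= threshold:
            return t
    return None
```

A caller could schedule a data backup some margin before the returned time. If health were defined so that higher values are worse, the comparison would be reversed.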
  • the life of the target disk can be predicted more accurately, so that operations such as data backup can be performed in advance to prevent various problems caused by the failure of the target disk.
  • the hard disk is a solid-state hard disk.
  • a second aspect of the present application provides a storage device, comprising: an acquisition unit for acquiring data of multiple indicators related to the health degree of a hard disk at a specified usage time; an input unit for inputting the data into a plurality of different models; and a determining unit, configured to determine the health degree of the hard disk at the specified usage time according to the outputs of the multiple models.
  • the determining unit is specifically configured to, based on a weighted sum of outputs of the multiple models, determine the health degree of the hard disk at the specified usage time.
  • each of the multiple models is obtained by training based on an anomaly detection algorithm, and each model adopts a different anomaly detection algorithm.
  • the number of the multiple models is three, and the anomaly detection algorithms adopted by the three models are the isolation forest algorithm, the local outlier factor algorithm, and the K-means clustering algorithm, respectively.
  • the multiple models are either sent to the storage device by a training device that is used to train the multiple models, or obtained through training by the storage device itself.
  • the multiple models are trained by the following sampled data: sampled data of multiple indicators related to the health of the faulty hard disk within a preset usage period before the end of life of the faulty hard disk.
  • the hard disk includes a target disk, and the storage device further includes: an acquiring or generating unit, configured to acquire or generate first data sets of a plurality of disks, each first data set including the health degree of the corresponding disk at multiple usage times; a generating unit, configured to generate a second data set of the target disk, the second data set including the health degree of the target disk at multiple usage times, where the time span of the multiple usage times in the first data set is greater than the time span of the multiple usage times in the second data set; a selecting unit, configured to select one of the plurality of disks according to the similarity between the health degrees at corresponding usage times in the first data sets and the second data set; and a prediction unit, configured to predict the health degree of the target disk at a specified future time according to the first data set of the selected disk.
  • the prediction unit is specifically configured to: fit a mapping relationship between the health degree of the selected disk at a first usage time and the health degree of the target disk at a second usage time, the first usage time and the second usage time being corresponding times; and predict the health degree of the target disk at multiple specified future times according to the mapping relationship and the first data set.
  • the determining unit is further configured to determine, according to the predicted health degree of the target disk at multiple specified future times, the time when the health degree of the target disk reaches a threshold value, and to take that time as the end-of-life time of the target disk.
  • a third aspect of the present application provides a storage device, comprising a processor and a memory, wherein executable computer program instructions are stored in the memory, and the processor executes the executable computer program instructions to implement the method described in the first aspect or the possible implementations of the first aspect.
  • a fourth aspect of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores computer program instructions, and when the computer program instructions are executed in a computer or a processor, they cause the computer or processor to execute the method described in the first aspect or the possible implementations of the first aspect.
  • a fifth aspect of the present application provides a computer program product, comprising computer program instructions which, when run in a computer or a processor, cause the computer or processor to perform the method described in the first aspect or the possible implementations of the first aspect.
  • FIG. 1A is an architectural diagram of a centralized storage system 120 with a disk control separation structure applied in an embodiment of the present application;
  • FIG. 1B is an architectural diagram of a centralized storage system 120 with an integrated disk control structure applied in an embodiment of the present application;
  • FIG. 1C is an architectural diagram of a distributed storage system to which an embodiment of the application is applied;
  • FIG. 2 is a schematic diagram of a system architecture for training a hard disk anomaly detection model provided by an embodiment of the present application
  • FIG. 3 is a flowchart of a method for training an anomaly detection model provided by an embodiment of the present application
  • FIG. 4 is a flowchart of a hard disk health assessment method provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a health degree curve of each hard disk provided by an embodiment of the present application
  • Fig. 6 is an enlarged view of a health degree curve in Fig. 5;
  • FIG. 7 is a flowchart of a method for predicting a hard disk health degree provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a predicted hard disk health degree curve provided by an embodiment of the present application.
  • FIG. 9 is an architectural diagram of a storage device according to an embodiment of the present application.
  • Storage systems include centralized storage systems and distributed storage systems.
  • the centralized storage system refers to a central node composed of one or more master devices, data is centrally stored in this central node, and all data processing services of the entire system are centrally deployed on this central node.
  • the distributed storage system refers to a system in which data is distributed and stored on multiple independent storage nodes. Users can store data on and retrieve data from the storage nodes through applications. The computers that run these applications are called "application servers".
  • Application servers can be physical machines or virtual machines. Physical application servers include, but are not limited to, desktop computers, servers, laptops, and mobile devices.
  • Application servers can access storage nodes through fabric switches to access data. Among them, the switch is only an optional device, and the application server can also communicate with the storage node directly through the network.
  • FIG. 1A is an architectural diagram of a centralized storage system 120 with a disk control separation structure applied in an embodiment of the present application.
  • the storage system 120 is connected to a plurality of hosts 200, such as application servers, which store data in and retrieve data from the storage system 120.
  • the characteristic of the centralized storage system shown in FIG. 1A is that it has a unified entry, through which data from the host 200 must pass, for example, the engine 121 in the storage system 120 .
  • As shown in FIG. 1A, there are one or more controllers in the engine 121.
  • FIG. 1A is illustrated by taking the engine including two controllers as an example.
  • There is a mirror channel between controller 0 and controller 1. When controller 0 writes a piece of data to its memory 124, it can send a copy of the data to controller 1 through the mirror channel, and controller 1 stores the copy in its own local memory 124. Controller 0 and controller 1 therefore back each other up.
  • When controller 0 fails, controller 1 can take over the services of controller 0; when controller 1 fails, controller 0 can take over the services of controller 1, which prevents a hardware failure from making the entire storage system 120 unavailable.
  • When four controllers are deployed in the engine 121, there is a mirror channel between any two controllers, so any two controllers serve as backups for each other.
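The mirrored write and takeover behavior described above can be pictured with a toy model. This is only an illustrative sketch: the `Controller` class, the `pair` helper, and `read_with_takeover` are hypothetical names, and a real controller mirrors data over dedicated hardware channels rather than by assigning to a peer's dictionary.

```python
from typing import Dict, Optional


class Controller:
    """Toy controller whose dict stands in for the memory 124."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.memory: Dict[str, bytes] = {}
        self.alive = True
        self.peer: Optional["Controller"] = None

    def write(self, key: str, data: bytes) -> None:
        if not self.alive:
            raise RuntimeError(f"{self.name} has failed")
        self.memory[key] = data
        # Send a copy over the (simulated) mirror channel to the peer.
        if self.peer is not None and self.peer.alive:
            self.peer.memory[key] = data


def pair(a: Controller, b: Controller) -> None:
    """Connect two controllers so that each mirrors to the other."""
    a.peer = b
    b.peer = a


def read_with_takeover(primary: Controller, key: str) -> bytes:
    """Serve from the primary; if it has failed, its peer takes over."""
    serving = primary if primary.alive else primary.peer
    if serving is None or not serving.alive:
        raise RuntimeError("no controller available")
    return serving.memory[key]
```

With four controllers, the same idea would apply pairwise: every pair is connected by a mirror channel, so any surviving controller could serve as the takeover target.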
  • the engine 121 also includes a front-end interface 125 and a back-end interface 126, wherein the front-end interface 125 is used to communicate with the application server, thereby providing storage services for the application server.
  • the back-end interface 126 is used to communicate with the hard disk 134 to expand the capacity of the storage system. Through the back-end interface 126, the engine 121 can connect to more hard disks 134, thereby forming a very large storage resource pool.
  • the controller 0 at least includes a processor 123 and a memory 124 .
  • the processor 123 is a central processing unit (central processing unit, CPU), used for processing data access requests from outside the storage system (server or other storage systems), and also used for processing requests generated inside the storage system.
  • When the processor 123 receives a data write request sent by the server through the front-end port, it temporarily stores the data carried in the request in the memory 124.
  • the processor 123 sends the data stored in the memory 124 to the hard disk 134 through the back-end interface 126 for persistent storage.
  • the memory 124 refers to an internal memory that directly exchanges data with the processor 123 , which can read and write data at any time and is very fast, and serves as a temporary data storage for the operating system or other running programs.
  • the memory 124 includes at least two types of memory.
  • For example, the memory can be a random access memory or a read-only memory (ROM).
  • For example, the random access memory is dynamic random access memory (DRAM) or storage class memory (SCM).
  • DRAM is a semiconductor memory, and like most Random Access Memory (RAM), it belongs to a volatile memory (volatile memory) device.
  • SCM is a composite storage technology that combines the characteristics of traditional storage devices and memory.
  • Storage class memory provides faster read and write speeds than hard disks, but it is slower than DRAM and also cheaper than DRAM.
  • the DRAM and the SCM are only exemplary descriptions in this embodiment, and the memory may also include other random access memories, such as static random access memory (SRAM).
  • The read-only memory can be, for example, programmable read-only memory (PROM) or erasable programmable read-only memory (EPROM).
  • the memory 124 may also be a dual in-line memory module (DIMM), that is, a module composed of dynamic random access memory (DRAM), or a solid-state drive (SSD).
  • multiple memories 124 and different types of memories 124 may be configured in the controller 0 . This embodiment does not limit the quantity and type of the memory 124 .
  • the memory 124 can be configured to have a power-protection function. The power-protection function means that when the system is powered off and then powered on again, the data stored in the memory 124 is not lost. Memory with a power-protection function is called non-volatile memory.
  • the memory 124 stores software programs, and the processor 123 runs the software programs in the memory 124 to manage the hard disk.
  • Managing the hard disks means, for example, abstracting the hard disks into a storage resource pool and then dividing the pool into LUNs for use by the server.
  • the LUN here is what the server sees as a hard disk.
  • some centralized storage systems are also file servers themselves, which can provide shared file services for servers.
  • Controller 1 (and other controllers not shown in FIG. 1A) is similar to controller 0, and the details are not repeated here.
  • In the disk control separation structure, the engine 121 may not have a hard disk slot; the hard disk 134 needs to be placed in the hard disk array 130, and the back-end interface 126 communicates with the hard disk array 130.
  • the back-end interface 126 exists in the engine 121 in the form of an adapter card.
  • One engine 121 can use two or more back-end interfaces 126 at the same time to connect multiple hard disk arrays.
  • the adapter card may also be integrated on the motherboard, and at this time, the adapter card may communicate with the processor 123 through the PCIE bus.
  • the storage system may include two or more engines 121 , and redundancy or load balancing is performed among the multiple engines 121 .
  • the hard disk array 130 includes a control unit 131 and several hard disks 134 .
  • the control unit 131 may have various forms.
  • the hard disk array 130 belongs to a smart disk enclosure.
  • the control unit 131 includes a CPU and a memory.
  • the CPU is used to perform operations such as address translation and reading and writing data.
  • the memory is used to temporarily store data to be written to the hard disk 134 or read data from the hard disk 134 to be sent to the controller.
  • the control unit 131 is a programmable electronic component, such as a data processing unit (DPU).
  • the DPU has the generality and programmability of a CPU, but is more specialized and can operate efficiently on network packets, storage requests, or analytics requests.
  • DPUs are distinguished from CPUs by a greater degree of parallelism (the need to handle a large number of requests).
  • the DPU here can also be replaced by a graphics processing unit (graphics processing unit, GPU), an embedded neural network processor (neural-network processing units, NPU) and other processing chips.
  • the number of control units 131 may be one, or two or more.
  • When the hard disk array 130 includes at least two control units 131, each hard disk 134 belongs to a particular control unit 131, and each control unit can access only the hard disks that belong to it. Read and write requests therefore often have to be forwarded between the control units 131, which lengthens the data access path.
  • the functions of the control unit 131 can be offloaded to the network card 104 .
  • the hard disk array 130 does not have the control unit 131, and the network card 104 performs data reading and writing, address translation and other computing functions.
  • the network card 104 is an intelligent network card. It can contain CPU and memory. The CPU is used to perform operations such as address translation and reading and writing data.
  • the memory is used to temporarily store data to be written to the hard disk 134 or read data from the hard disk 134 to be sent to the controller. It can also be a programmable electronic component, such as a data processing unit (DPU).
  • the DPU has the generality and programmability of a CPU, but is more specialized and can operate efficiently on network packets, storage requests, or analytics requests. DPUs are distinguished from CPUs by a greater degree of parallelism (the need to handle a large number of requests).
  • the DPU here can also be replaced by a graphics processing unit (graphics processing unit, GPU), an embedded neural network processor (neural-network processing units, NPU) and other processing chips.
  • the hard disk 134 may be an SSD, or may be a mechanical hard disk (ie, a magnetic disk).
  • SSD has the characteristics of fast startup, fast reading and writing, fixed reading time, wide operating temperature range and no noise.
  • solid-state disks do not suffer mechanical failures caused by moving parts and are resistant to shock, vibration and collision, so they offer high security and reliability.
  • the hard disk health assessment method provided in the embodiment of the present application is suitable for performing health assessment on an SSD. It can be understood that the method is also suitable for performing health assessment on a magnetic disk.
  • the hard disk array 130 may be a SAS hard disk array, or an NVMe hard disk array or other types of hard disk arrays.
  • the SAS hard disk array adopts the SAS3.0 protocol, and each frame supports 25 SAS hard disks.
  • the engine 121 is connected to the hard disk array 130 through an onboard SAS interface or a SAS interface module.
  • the NVMe hard disk array is more like a complete computer system, and the NVMe hard disk is inserted into the NVMe hard disk array. The NVMe hard disk array is then connected to the engine 121 through the RDMA port.
  • FIG. 1A shows a centralized storage system with a disk control separation structure, and FIG. 1B shows a centralized storage system with an integrated disk control structure.
  • In a centralized storage system with an integrated disk control structure, the difference from the disk control separation structure is that the engine 121 has hard disk slots, the hard disks 134 can be deployed directly in the engine 121, and the back-end interface 126 is optional. When the storage space is insufficient, more hard disks or hard disk arrays can be connected through the back-end interface 126.
  • the embodiments of the present application can also be applied to the distributed storage system shown in FIG. 1C .
  • the distributed storage system includes a cluster of storage nodes.
  • the storage node cluster includes one or more storage nodes 20 (three storage nodes 20a, 20b and 20c are shown in FIG. 1C, but the cluster is not limited to three storage nodes), and the storage nodes 20 can be interconnected.
  • Each storage node 20 is connected to a plurality of hosts 200 .
  • Each host 200 is connected to a plurality of storage nodes 20 and interacts with the plurality of storage nodes 20 to distribute and store data in the plurality of storage nodes 20, thereby realizing reliable storage of data.
  • Each storage node 20 includes at least a processor 201 , a memory 202 , and a hard disk 203 .
  • the structure and function of the processor 201, the memory 202, and the hard disk 203 are the same as those of the processor 123, the memory 124, and the hard disk 134 in FIG. 1A.
  • SMART data can be regularly sampled for each hard disk, so that the health of the hard disk can be monitored according to the SMART data, and the failure of the hard disk can be predicted.
  • the SMART data includes, for example, values of multiple indicators such as power-on time, switch count, number of uncorrectable errors, number of newly added bad sectors, number of newly added bad blocks, and total number of erasures.
  • Each SMART indicator is usually set with a threshold. If a certain SMART data of a hard disk is close to the threshold, it means that the hard disk will become unreliable, for example, it may lead to data loss or hard disk failure.
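The per-indicator threshold check can be sketched as follows. The indicator names, the 10% margin, and the assumption that larger values are worse (as with error or bad-block counts) are all illustrative choices for this sketch, not values taken from the source or from any vendor's SMART definitions.

```python
from typing import Dict, List


def indicators_near_threshold(smart: Dict[str, float],
                              thresholds: Dict[str, float],
                              margin: float = 0.1) -> List[str]:
    """List indicators whose value is within `margin` of their threshold.

    Assumption: every indicator is a count where larger is worse, so a
    value is "close" once it reaches (1 - margin) * threshold.
    """
    flagged = []
    for name, value in smart.items():
        limit = thresholds.get(name)
        if limit is None or limit <= 0:
            continue  # no usable threshold for this indicator
        if value >= (1.0 - margin) * limit:
            flagged.append(name)
    return sorted(flagged)
```

A check like this yields only a per-indicator warning; it does not by itself produce a single quantitative health degree, which is the gap the evaluation methods discussed next try to fill.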
  • methods for evaluating the health of a hard disk include a binary method, a Euclidean distance method, and a linear evaluation method.
  • In the binary method, the health of the hard disk has only two states, healthy and failed. This method can predict whether and when the hard disk fails, but it cannot quantitatively predict the real-time health degree of the hard disk.
  • In the Euclidean distance method, the health of the hard disk is determined according to the distance between the SMART data of the hard disk and the thresholds of the corresponding SMART indicators.
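A minimal sketch of a Euclidean-distance health measure is shown below. Dividing each difference by its threshold so that indicators with different units are comparable is an assumption of this sketch, as is the reading that a larger distance from the thresholds means a healthier disk.

```python
import math
from typing import Sequence


def euclidean_health(values: Sequence[float],
                     thresholds: Sequence[float]) -> float:
    """Distance between SMART values and their thresholds.

    Assumption: each difference is scaled by its threshold, and a larger
    distance means the disk is further from the unreliable region.
    """
    if len(values) != len(thresholds):
        raise ValueError("values and thresholds must have the same length")
    if any(limit == 0 for limit in thresholds):
        raise ValueError("thresholds must be non-zero")
    return math.sqrt(sum(((limit - v) / limit) ** 2
                         for v, limit in zip(values, thresholds)))
```

A disk whose indicators all sit exactly on their thresholds gets a distance of 0 under this definition.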
  • the embodiment of the present application provides a method for effectively evaluating the health status of a hard disk.
  • Multiple anomaly detection models are used in a storage system to make predictions based on SMART data of the hard disk, and the outputs of the multiple anomaly detection models are fused to obtain a more accurate hard disk health degree, wherein the multiple anomaly detection models are obtained by training based on the SMART data of hard disks.
  • the method may be performed by the storage system 120 shown in FIG. 1A , the storage system 120 shown in FIG. 1B , or the storage system shown in FIG. 1C , and the following description will take the storage system 120 shown in FIG. 1A as an example.
  • FIG. 2 is a schematic diagram of a system architecture for training an anomaly detection model provided by an embodiment of the present application.
  • the anomaly detection model has model parameters corresponding to the anomaly detection algorithm, and the model parameters of the anomaly detection model can be determined by using a plurality of training samples for training based on the anomaly detection algorithm.
  • the trained anomaly detection model can output the degree of anomaly of the sample to be tested based on the characteristics of the sample, and the degree of anomaly is the degree of difference between the sample to be tested and most of the samples in the training sample set.
  • When the anomaly detection model is trained with multiple hard disk samples corresponding to high health degrees, the anomaly degree that the model outputs for a hard disk sample under test is negatively correlated with the hard disk health degree. Alternatively, when the model is trained with multiple hard disk samples corresponding to low health degrees, the anomaly degree it outputs is positively correlated with the hard disk health degree. In either case, the trained anomaly detection model can be used to predict the health degree of the hard disk.
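The relationship between anomaly degree and health degree can be illustrated with a deliberately simple stand-in model. The centroid-distance scorer below is not one of the algorithms named in the source (isolation forest, local outlier factor, K-means); it is an assumption chosen only because it fits in a few lines, and the names `CentroidAnomalyModel` and `health_from_anomaly` are hypothetical.

```python
import math
from typing import List, Sequence


class CentroidAnomalyModel:
    """Scores a sample by its scaled distance from the training centroid."""

    def __init__(self) -> None:
        self.mean: List[float] = []
        self.scale: List[float] = []

    def fit(self, samples: Sequence[Sequence[float]]) -> "CentroidAnomalyModel":
        if not samples:
            raise ValueError("training samples are required")
        n, dims = len(samples), len(samples[0])
        self.mean = [sum(s[d] for s in samples) / n for d in range(dims)]
        self.scale = []
        for d in range(dims):
            var = sum((s[d] - self.mean[d]) ** 2 for s in samples) / n
            # Use 1.0 when a feature is constant to avoid dividing by zero.
            self.scale.append(math.sqrt(var) or 1.0)
        return self

    def anomaly_degree(self, sample: Sequence[float]) -> float:
        """Degree of difference between the sample and the training set."""
        return math.sqrt(sum(((x - m) / s) ** 2
                             for x, m, s in zip(sample, self.mean, self.scale)))


def health_from_anomaly(anomaly: float, trained_on_low_health: bool) -> float:
    """Turn an anomaly degree into a health degree.

    Trained on low-health samples: anomaly is positively correlated with
    health and is used directly. Trained on high-health samples: the
    correlation is negative, so the sign is flipped (an assumed mapping).
    """
    return anomaly if trained_on_low_health else -anomaly
```

When the model is trained on samples from faulty disks near the end of their life, a sample far from that cluster is "abnormal" in the sense of being unlike a failing disk, which is why its anomaly degree can serve directly as a health degree.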
  • the system architecture includes: a training device 210 for acquiring SMART data of the hard disk from the storage system 120, and using the SMART data to train a plurality of different anomaly detection models. The training process will be described below in conjunction with FIG. 3.
  • the training device 210 After training the multiple anomaly detection models, the training device 210 sends the multiple models to the storage system 120, so that the storage system 120 can use the multiple anomaly detection models to predict the health of the hard disk. It can be understood that the training device 210 can be connected to multiple storage systems to obtain SMART data of the hard disks in each storage system for model training.
  • the storage system 120 may transmit the SMART data of the hard disk to the database of the storage device, and the training device 210 may read the hard disk data from the database to perform model training.
  • the storage system 120 may train the model through its own CPU 123 to obtain an anomaly detection model.
  • The storage system 120 can perform model training through a computing chip, such as a Field Programmable Gate Array (FPGA) chip, plugged into the storage system 120 to obtain the anomaly detection model.
  • The CPU 123 and the memory 124 may be the CPU and memory in the controller 0 in FIG. 1A, or may be the CPU and memory in the controller 1.
  • the storage system 120 can periodically collect SMART data of each hard disk 134 it includes during operation, and can send specific SMART data to the training device 210 according to the needs of model training, which will be described in detail below with reference to FIG. 3 .
  • the storage system 120 may store the plurality of anomaly detection models in persistent storage, such as the hard disk 134 .
  • The storage system 120 can read the multiple anomaly detection models from the hard disk 134 and store them in the memory 124 for the CPU 123 to read and run.
  • The CPU 123 can periodically acquire the SMART data of each hard disk 134 at a specified usage time (e.g., the current time) and input the SMART data into the plurality of anomaly detection models respectively; the CPU 123 then fuses the outputs of the plurality of anomaly detection models to obtain the health degree of each hard disk 134 at the specified usage time. Alternatively, when a new hard disk 134 is added to the storage system 120, the CPU 123 can acquire the SMART data of the new hard disk 134 at the current time multiple times and input the SMART data into the plurality of anomaly detection models respectively, so that the CPU 123 can predict the health degree of the newly added hard disk 134 at the specified usage time based on the outputs of the plurality of anomaly detection models, as described in detail below with reference to FIG. 4.
  • FIG. 3 is a flowchart of a method for training an anomaly detection model provided by an embodiment of the present application. The method can be executed by the training device 210 in FIG. 2 , and includes the following steps:
  • Step S301 receive the SMART data of the hard disk in at least one usage time from the storage system
  • Step S302 an anomaly detection model is trained based on the SMART data.
  • In step S301, the SMART data of the hard disk at at least one usage time is received from the storage system 120.
  • the storage system 120 can select the SMART data of the hard disk according to the needs of model training and send it to the training device 210 .
  • All SMART data of the hard disk includes values of multiple indicators, for example, more than one hundred indicators, and the storage system 120 may select data of indicators related to the health degree and lifespan degradation of the hard disk from all the SMART data.
  • the selected indicators include, for example, power-on time, switch count, number of uncorrectable errors, number of newly added bad sectors, number of newly added bad blocks, total erasure times, and the like.
  • the storage system 120 may also select SMART data of multiple usage times in the predetermined usage period of the hard disk according to the model training strategy and send it to the training device 210 .
  • The storage system 120 may select SMART data of a plurality of hard disks in a predetermined period after activation to train an anomaly detection model. Since the health degree of a hard disk is usually highest when it is newly activated (for example, the health degree at this time can be expressed as 1), training the anomaly detection model with the SMART data of this period means that SMART data corresponding to a high health degree serves as the majority of normal samples.
  • In this case, the abnormality degree of the hard disk output by the trained anomaly detection model is negatively correlated with the health degree of the hard disk: the larger the abnormality degree, the worse the health state of the hard disk, that is, the smaller the health degree.
  • The storage system 120 may instead, after a hard disk fails, select the SMART data of the faulty hard disk in a predetermined usage period before the failure to train the anomaly detection model. Since the health degree of the hard disk is lowest at the time of failure (for example, the health degree at this time can be expressed as 0), training the anomaly detection model with the SMART data of this period means that SMART data corresponding to a low health degree serves as the majority of normal samples, so the abnormality degree of the hard disk output by the trained anomaly detection model is positively correlated with the health degree of the hard disk.
  • In practice, since a hard disk is usually in a healthy state when it is just activated, the storage system 120 usually does not collect SMART data within a predetermined period after the hard disk is activated, but starts collecting SMART data for monitoring the health status only after the hard disk has been used for a long period (e.g., in the middle of the hard disk lifespan). Therefore, the hard disk SMART data stored in the storage system 120 usually lacks data from when the hard disk was just activated and has more data from before a failure occurs. In view of this, the method for training an anomaly detection model in this embodiment is more suitable.
  • In this embodiment, the abnormality degree output by the anomaly detection model is positively correlated with the health degree; for example, the abnormality degree can be directly regarded as the health degree, which reduces the cost of calculating the health degree.
  • The following description takes this embodiment as an example.
  • the storage system 120 may further preprocess the selected SMART data, and send the preprocessed SMART data to the training device 210 .
  • the sampling time of the SMART data can be uniformly distributed through processing.
  • For example, if one set of SMART data per day is required for the hard disk and there are multiple sampling time points in a day, the average value of the SMART data sampled at those time points is taken as the SMART data for that day.
  • If no SMART data was sampled on a given day, the SMART data of that day can be supplemented by interpolation.
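  • The preprocessing above can be sketched as follows. This is a minimal illustration, not the application's implementation: the function names are hypothetical, a single numeric indicator is assumed, and linear interpolation is assumed because the text does not say which interpolation method is used.

```python
from collections import defaultdict
from datetime import datetime

def daily_average(samples):
    """Collapse (timestamp, value) samples to one mean value per calendar day."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts.date().toordinal()].append(value)
    return {day: sum(vals) / len(vals) for day, vals in buckets.items()}

def fill_missing_days(daily):
    """Linearly interpolate days with no sample (assumed interpolation method)."""
    days = sorted(daily)
    filled = {}
    for left, right in zip(days, days[1:]):
        span = right - left
        for day in range(left, right):
            frac = (day - left) / span
            filled[day] = daily[left] + frac * (daily[right] - daily[left])
    if days:
        filled[days[-1]] = daily[days[-1]]
    return filled

# Two samples on day 1, none on day 2, one on day 3.
samples = [
    (datetime(2021, 1, 1, 8), 10.0),
    (datetime(2021, 1, 1, 20), 14.0),
    (datetime(2021, 1, 3, 9), 20.0),
]
series = fill_missing_days(daily_average(samples))
```

  • The result has exactly one value per day, which gives the uniformly distributed sampling times that the training step expects.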
  • step S302 an anomaly detection model is trained based on the SMART data.
  • multiple anomaly detection algorithms based on different principles are adopted, and multiple anomaly detection models are trained by using the SMART data obtained above.
  • the iforest anomaly detection model is trained by the Isolation Forest (iforest) algorithm.
  • the Isolation Forest algorithm is an unsupervised anomaly detection algorithm that does not require labeled samples for training.
  • the training samples may be a set of SMART data for a predetermined period of time before the hard disk fails, and the predetermined period may be a period of time less than a predetermined time period before the hard disk fails.
  • the training device 210 may use the N samples X to train a plurality of isolation trees, thereby obtaining the iforest anomaly detection model.
  • The samples in each newly generated node are divided through a process similar to the above to generate new child nodes, until a generated child node contains only one sample (such a node cannot be divided further and is a leaf node) or the isolation tree has grown to a set layer height, at which point the growth of the isolation tree stops. The layer height is the number of connecting edges on the path from the leaf node to the root node.
  • The training of the iforest anomaly detection model is then completed. Based on the trained iforest anomaly detection model, the anomaly degree of a sample to be tested can be predicted by the following formula (1):

  $$s(x, \psi) = 2^{-\frac{E(h(x))}{c(\psi)}} \tag{1}$$

  • where x represents the sample to be tested, which, like X, includes SMART data of multiple indicators related to health; $\psi$ is the number of samples used to train each isolation tree; h(x) is the layer height of the sample x in each isolation tree; E(h(x)) is the expected value of the layer height of the sample x over the t isolation trees; and $c(\psi)$ is the average layer height of an isolation tree given $\psi$ training samples, which is used to normalize the expected layer height E(h(x)) of the sample x.
  • It can be seen from formula (1) that the smaller the expected layer height of the sample to be tested in the isolation forest model, the higher its abnormality. This is because a smaller expected layer height means that the sample to be tested falls in a region where the training samples are sparsely distributed, so the sample to be tested is more abnormal relative to the training samples.
  • The output of the anomaly detection model indicates the degree of anomaly of the sample to be tested compared to the plurality of samples with a lower degree of health; therefore, the higher the abnormality, the higher the health degree of the sample to be tested.
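  • A minimal sketch of the scoring step follows. It assumes the commonly published isolation forest form, in which the anomaly degree is 2^(-E(h(x))/c(ψ)) and c(ψ) is approximated with a harmonic number; both forms and the function names are assumptions for illustration, not details stated in this application.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant, used to approximate H(n)

def average_path_length(psi):
    """c(psi): average layer height for psi training samples (assumed form)."""
    if psi <= 1:
        return 0.0
    if psi == 2:
        return 1.0
    harmonic = math.log(psi - 1) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (psi - 1) / psi

def anomaly_degree(layer_heights, psi):
    """Score from the layer heights h(x) of sample x in each of the t trees."""
    expected = sum(layer_heights) / len(layer_heights)  # E(h(x))
    return 2.0 ** (-expected / average_path_length(psi))

# A sample isolated after few splits scores higher than one buried deep in the trees.
shallow = anomaly_degree([2, 3, 2], psi=256)
deep = anomaly_degree([11, 12, 10], psi=256)
```

  • With this normalization, a sample whose expected layer height equals c(ψ) scores 0.5, and the score approaches 1 as the expected layer height approaches 0.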
  • the LOF anomaly detection model is trained by a Local Outlier Factor (LOF) algorithm.
  • the sample x has a corresponding point x in the above space
  • $N_k(x)$ represents all points within the k-th distance from the point x, which can be called the k-th neighborhood of the point x
  • $\rho_k(p)$ is the density of points in the k-th neighborhood of a point p belonging to $N_k(x)$
  • $\rho_k(x)$ is the density of points in the k-th neighborhood of the point x.
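  • The idea of comparing the density around x with the density around its neighbors can be sketched as below. This is a simplified illustration under stated assumptions: density is taken as the inverse of the mean distance to the k nearest training points (the standard LOF algorithm uses a reachability-based density instead), exactly k neighbors are used, and all function names are hypothetical.

```python
import math

def k_nearest(point, others, k):
    """Return the k entries of `others` closest to `point` as (distance, point)."""
    return sorted((math.dist(point, o), o) for o in others)[:k]

def density(point, others, k):
    """Simplified density: inverse of the mean distance to the k nearest points."""
    dists = [d for d, _ in k_nearest(point, others, k)]
    return 1.0 / (sum(dists) / len(dists) + 1e-12)

def local_outlier_factor(x, training, k=3):
    """Mean ratio of each neighbour's density to the density at x."""
    neighbours = [p for _, p in k_nearest(x, training, k)]
    rho_x = density(x, training, k)
    ratios = []
    for p in neighbours:
        rest = [q for q in training if q is not p]  # exclude p itself
        ratios.append(density(p, rest, k) / rho_x)
    return sum(ratios) / len(ratios)

training = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
inside = local_outlier_factor((0.5, 0.4), training)
outside = local_outlier_factor((5.0, 5.0), training)
```

  • A value well above 1 means that x sits in a much sparser region than its neighbors, which is the sense of "degree of anomaly" used here.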
  • the K-means anomaly detection model is trained by an anomaly detection algorithm based on a K-means clustering algorithm.
  • The distance between the sample x to be tested and the centroid point can be calculated, and the degree of anomaly of the sample x can be determined based on the distance. Specifically, the larger the distance, the farther the sample x is from the centroid and the higher the degree of anomaly.
  • the abnormality can be calculated by the following formula (3):
  • $p_i$ is the sample to be tested
  • $p_j$ are the N training samples
  • $\mathrm{Dis}(p_i)$ represents the distance of the sample $p_i$ from the centroid
  • $\mathrm{var}(\mathrm{Dis}(p_j))$ is the variance of the distances between the N training samples $p_j$ and the centroid.
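  • A hedged sketch of a centroid-distance anomaly degree follows. How formula (3) combines the two quantities is not shown in the text, so the ratio used here (distance divided by the variance of the training distances) is only an assumption, as are the use of the nearest centroid and the function names; the centroids are taken as already computed by K-means.

```python
import math
from statistics import pvariance

def distance_to_centroid(sample, centroids):
    """Dis(p): distance from a sample to its nearest centroid (assumed choice)."""
    return min(math.dist(sample, c) for c in centroids)

def kmeans_anomaly_degree(sample, training, centroids):
    """Assumed form: Dis(p_i) scaled by var(Dis(p_j)) over the training samples."""
    spread = pvariance([distance_to_centroid(p, centroids) for p in training])
    return distance_to_centroid(sample, centroids) / (spread + 1e-12)

centroids = [(0.0, 0.0)]
training = [(1.0, 0.0), (0.0, 2.0), (3.0, 0.0)]
near = kmeans_anomaly_degree((1.0, 0.0), training, centroids)
far = kmeans_anomaly_degree((5.0, 0.0), training, centroids)
```

  • Whatever the exact normalization, the property the text relies on is preserved: the farther a sample lies from the centroid, the higher its anomaly degree.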
  • the anomaly detection model trained in the embodiment of the present application is not limited to the above three anomaly detection models, but may be any other type of anomaly detection model.
  • the training device 210 sends the multiple anomaly detection models to the storage system 120 , so that the storage system 120 can predict the health of the hard disk through the multiple anomaly detection models.
  • the training device 210 may send the model structure and model parameters included in each anomaly detection model to the storage system 120 .
  • The training device 210 may send model data, such as the node composition of each isolation tree included in the model, the SMART indicator and split value corresponding to each non-leaf node, and the number of training samples $\psi$, to the storage system 120, so that the storage system 120 can use the iforest anomaly detection model through the model data.
  • FIG. 4 is a flowchart of a hard disk health assessment method provided by an embodiment of the present application. The method can be executed by the storage system 120 in FIG. 2 , and the method includes:
  • Step S401 obtaining the data of multiple indicators related to the health degree of the hard disk at the specified use time
  • Step S402 inputting the data into multiple different anomaly detection models
  • Step S403 Determine the health of the hard disk at the specified usage time according to the outputs of the multiple abnormality detection models.
  • In step S401, data of multiple indicators related to the health degree of the hard disk at the specified usage time is obtained.
  • Each time the storage system 120 collects a set of SMART data of the hard disk corresponding to the current usage time, it may select from that set the data of multiple indicators related to the health degree of the hard disk at the current usage time. After acquiring the data of the multiple indicators, the storage system 120 may further preprocess the data as described above, for example to obtain data with a uniform time distribution. It can be understood that the storage system 120 is not limited to selecting the data of the multiple indicators immediately after collecting the SMART data of the hard disk; it can also, at any time, select the data of the multiple indicators at a specified usage time from previously collected SMART data of the hard disk to predict the health degree of the hard disk at that specified usage time.
  • In step S402, the data is input into a plurality of different anomaly detection models.
  • the storage system 120 can input the data of multiple indicators of the hard disk 134 selected in the previous step at the specified usage time into multiple anomaly detection models, so that the anomaly degree of the data output by each model can be obtained.
  • In step S403, the health degree of the hard disk at the specified usage time is determined according to the outputs of the multiple anomaly detection models.
  • The abnormality degree of the hard disk output by each anomaly detection model is correlated with the health degree of the hard disk, so the health degree of the hard disk at the specified usage time can be determined from the abnormality degree that each anomaly detection model outputs for that usage time.
  • the storage system 120 may directly use the abnormality degree of the hard disk output by the abnormality detection model at the specified usage time as the health degree of the hard disk at the specified usage time.
  • the storage system 120 may convert the abnormality degree output by the abnormality detection model into a health degree or a value positively correlated with the health degree based on the correlation.
  • the three anomaly detection models obtained from the above training have their own advantages and disadvantages.
  • The advantage of the iForest anomaly detection model is that, by integrating multiple binary trees, the algorithm is robust, suits large-scale data sets and parallel computing, and is not sensitive to hyperparameters; its disadvantage is lower accuracy on data sets with more samples or higher feature dimensions.
  • The advantage of the LOF anomaly detection model is that it makes few prior assumptions about the original data distribution and has a strong ability to discriminate local anomalies; its disadvantage is excessive sensitivity.
  • The advantage of the K-Means anomaly detection model is that the algorithm is simple and intuitive and adapts to some extent to both local and global anomalies; its disadvantage is that the algorithm itself is designed for data clustering, is better suited to scenarios where the data set has a spherical distribution, and is relatively sensitive to hyperparameters.
  • The storage system 120 may fuse the outputs of the multiple models and obtain the health degree of the hard disk according to the fusion result. In this way, the shortcomings of the individual anomaly detection models are balanced, and a stable, smooth curve of the health degree of the hard disk over time can be obtained.
  • The storage system 120 may calculate a weighted sum of the outputs of the multiple anomaly detection models as shown in formula (4), and use the result of the weighted sum as the health degree of the hard disk:

  $$\mathrm{Score} = \sum_{i} a_i \cdot \mathrm{Score}_i \tag{4}$$

  • where $\mathrm{Score}_i$ is the output of each anomaly detection model, $a_i$ is the weight of each anomaly detection model, and Score is the health degree of the hard disk obtained by fusing the outputs of the anomaly detection models. It can be understood that, when the output of a model is negatively correlated with the health degree, $\mathrm{Score}_i$ in formula (4) may be a value positively correlated with the health degree, obtained by converting the output of that anomaly detection model.
  • The strength of the correlation between the output of each anomaly detection model and the health degree of the hard disk can be predetermined according to characteristics of the model input data (such as its dimensions, the proportion of abnormal data, and the data distribution) and the characteristics of each anomaly detection model, so as to determine the weight $a_i$ of each anomaly detection model in formula (4).
  • For example, when the distribution of the SMART data of a hard disk at multiple usage times approaches a spherical distribution, it can be determined from the characteristics of each anomaly detection model that the K-Means anomaly detection model predicts more accurately, so the weight of the K-Means anomaly detection model is set higher.
  • In another case, the weight of the iForest anomaly detection model can be set higher.
  • For different hard disks, the weight $a_i$ of each anomaly detection model in formula (4) can be dynamically adjusted according to the characteristics of the hard disk, such as the proportion of abnormal data and the data distribution, to improve prediction accuracy.
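  • The weighted fusion of model outputs can be sketched as below. The function names, the example weights, and the `1 - anomaly` conversion for negatively correlated outputs are illustrative assumptions; the text only requires that converted values be positively correlated with the health degree.

```python
def to_health_score(anomaly_degree, positively_correlated=True):
    """Map an anomaly degree in [0, 1] to a value that rises with health.

    The `1 - anomaly_degree` conversion is an assumed example, not the
    application's stated conversion.
    """
    return anomaly_degree if positively_correlated else 1.0 - anomaly_degree

def fuse_health(scores, weights):
    """Weighted sum of per-model scores: Score = sum(a_i * Score_i)."""
    if len(scores) != len(weights):
        raise ValueError("one weight per model is required")
    return sum(a * s for a, s in zip(weights, scores))

# Hypothetical outputs of the iForest, LOF and K-Means models for one disk-day.
scores = [0.8, 0.6, 0.7]
weights = [0.5, 0.25, 0.25]  # example weights favouring the first model
health = fuse_health(scores, weights)
```

  • Keeping the weights as a separate argument makes it straightforward to adjust them per hard disk, as described above.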
  • the storage system 120 may determine the health of each hard disk at each usage time by using the method shown in FIG. 4 .
  • The storage system 120 can construct a complete health degree curve of a hard disk from activation to failure (i.e., the end of life) according to the health degree of the hard disk at each usage time, so that, as described below with reference to FIG. 7, the health degree curve of the faulty hard disk can be used as a benchmarking health degree curve (or comparison health degree curve) to predict the future health degree and lifespan of other hard disks (that is, target disks). Hereinafter, a faulty hard disk that provides a benchmarking health degree curve is called a benchmarking disk.
  • The storage system 120 may also use a Gaussian smoothing method to smooth the health degree data of each benchmarking disk.
  • The time length of the smoothing window can be set, and a smoothing weight can be set empirically for the sample at each usage time in the smoothing window. The smoothing window is then slid over the health degree data of the hard disk along the usage time, and the health degree data in the window is modified according to the smoothing weights to smooth the data.
  • Table 1 shows an example form of smoothing window:
  • The time length of the sliding window is set to 5 days, where the weight of the data on day 1 in the sliding window is set to 2.28%, the weight of the data on day 2 is set to 13.59%, the weight of the data on day 3 is set to 68.27%, and so on.
  • The Gaussian smoothing method assumes that the center point of each piece of data in the health degree curve is most closely related to the smoothing result, so it has the highest weight. As the distance from the center point increases, the relationship weakens, that is, the weight gradually becomes smaller.
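  • A minimal sketch of the sliding-window smoothing follows. The weights for days 1 to 3 come from the text; the weights for days 4 and 5 are assumed to mirror days 2 and 1, and renormalizing the weights at the two ends of the series is also an assumption. The function name is hypothetical.

```python
# Days 1-3 are from the text; days 4-5 are assumed symmetric.
WEIGHTS = [0.0228, 0.1359, 0.6827, 0.1359, 0.0228]

def gaussian_smooth(health, weights=WEIGHTS):
    """Replace each value by the weighted mean of the window centred on it."""
    half = len(weights) // 2
    smoothed = []
    for i in range(len(health)):
        total = 0.0
        weight_sum = 0.0
        for offset, w in enumerate(weights, start=-half):
            j = i + offset
            if 0 <= j < len(health):  # skip window positions outside the series
                total += w * health[j]
                weight_sum += w
        smoothed.append(total / weight_sum)
    return smoothed

# A one-day dip is damped but not removed.
smoothed = gaussian_smooth([1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0])
```

  • Dividing by the sum of the weights actually used keeps a constant series unchanged, including near the start and end of the data.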
  • The storage system 120 may draw a health degree curve for the benchmarking disk based on the health degree of the benchmarking disk at each usage time after the above smoothing process.
  • FIG. 5 is a schematic diagram of the health degree curve of each hard disk obtained through the above process.
  • FIG. 6 is an enlarged view of a health degree curve in FIG. 5.
  • the horizontal axis of the coordinates represents the time from when the hard disk is enabled (for example, the time unit is “day”)
  • the vertical axis of the coordinates represents the health degree of the hard disk. It can be seen from the plurality of health degree curves that the health degree of the hard disk obtained by the method provided by the embodiment of the present application basically decreases smoothly and stably with time.
  • After the storage system 120 obtains the complete health degree curves, from activation to failure, of each benchmarking disk, these health degree curves can be added to the benchmarking data set, which is used to predict the future health degree and lifespan of hard disks in use. It can be understood that the storage system 120 is not limited to obtaining the health degree curves of its own faulty hard disks as described above. For example, the storage system 120 may receive from other storage systems the health degree curves of the faulty hard disks in those storage systems and add those health degree curves to the benchmarking data set.
  • The storage system 120 can obtain the health degree curve of a hard disk 134 in use in the storage system 120 for a period of time, for example, the period from the activation of the hard disk 134 to the present. This partial health degree curve of the hard disk 134 is compared with the benchmarking health degree curves in the above benchmarking data set to predict the future health degree and lifespan of the hard disk 134.
  • FIG. 7 is a flowchart of a method for predicting a hard disk health degree provided by an embodiment of the present application. The method can be executed by the storage system 120 in FIG. 2, and includes the following steps:
  • Step S701 in the benchmarking data set, select the benchmarking health degree curve according to the similarity between the benchmarking health degree curve and the health degree corresponding to the usage time in the partial health degree curve of the hard disk 134 to be predicted;
  • Step S702 fitting the mapping relationship between the health degree in the selected benchmark health degree curve and the health degree corresponding to the usage time in the partial health degree curve of the hard disk 134 to be predicted;
  • Step S703 according to the selected benchmarking health degree curve and the mapping relationship, predict the health degree of the hard disk 134 at a certain time in the future.
  • In step S701, the benchmarking health degree curve is selected according to the similarity between the health degrees at corresponding usage times in the benchmarking health degree curve and in the partial health degree curve of the hard disk 134 to be predicted.
  • The CPU 123 may calculate the similarity between the partial health degree curve of the hard disk 134 and each benchmarking health degree curve in the benchmarking data set.
  • Specifically, the CPU 123 can calculate the Euclidean distance between the health degrees in the partial health degree curve of the hard disk 134 and the health degrees at the corresponding usage times in each benchmarking health degree curve, and from it the similarity between the partial health degree curve and each benchmarking health degree curve.
  • The partial health degree curve of the hard disk 134 and any benchmarking health degree curve in the benchmarking data set may contain two similar curves, one in each. If the two similar curves are aligned in time, the health degrees at corresponding usage times are the health degrees at the same usage time in the two similar curves.
  • The two similar curves may also be misaligned in time (i.e., along the x-axis), that is, their time lengths are not equal.
  • For example, for hard disks of different capacities, the health degrees may decay at different rates over time, so the time spans of the similar curves in the health degree curves of hard disks of different capacities differ.
  • For a hard disk with a large capacity, because each storage unit in the hard disk has a lower probability of being used, its health degree may decline more slowly than that of a hard disk with a small capacity, so the similar curve in the health degree curve of the large-capacity hard disk has a longer time span.
  • In this case, the CPU 123 may determine the health degrees at corresponding usage times in the two similar curves through a dynamic time warping (DTW) algorithm. Specifically, the CPU 123 shortens or extends one of the two similar curves on the time axis so that the two similar curves are aligned in time. After this processing, the health degrees at the same usage time in the two time-aligned similar curves are the health degrees at corresponding usage times.
  • After determining the health degrees at corresponding usage times, the CPU 123 can calculate the Euclidean distance between them, thereby calculating the similarity between the two similar curves, that is, the similarity of the health degrees at corresponding usage times between the partial health degree curve of the hard disk 134 and the benchmarking health degree curve.
  • one or more benchmarking health degree curves with the highest similarity may be selected from the benchmarking data set.
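  • The selection step can be sketched with a textbook dynamic time warping cost. This is an illustration under assumptions: each benchmarking entry is taken to be the candidate similar segment already cut from a benchmarking curve, the per-point distance is the absolute difference of two health degrees (the one-dimensional Euclidean distance), and the disk identifiers and function names are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping cost between two health degree sequences."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[len(a)][len(b)]

def select_benchmarks(partial, benchmark_segments, top_m=1):
    """Return the ids of the top_m segments with the lowest DTW cost."""
    ranked = sorted(
        benchmark_segments,
        key=lambda disk_id: dtw_distance(partial, benchmark_segments[disk_id]),
    )
    return ranked[:top_m]

partial = [1.0, 0.9, 0.8]
benchmark_segments = {
    "disk_A": [1.0, 1.0, 0.9, 0.9, 0.8],  # same shape, decays half as fast
    "disk_B": [1.0, 0.7, 0.4],            # decays much faster
}
best = select_benchmarks(partial, benchmark_segments)
```

  • Here `disk_A` is selected even though its segment is longer, because warping the time axis lets every point of the partial curve match a point with the same health degree.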
  • In step S702, the mapping relationship between the health degree in the selected benchmarking health degree curve and the health degree at the corresponding usage time in the partial health degree curve of the hard disk 134 to be predicted is fitted.
  • The health degree at a usage time t can be obtained from the benchmarking health degree curve, and the health degree at the usage time t' corresponding to the usage time t can be obtained from the partial health degree curve of the hard disk 134. The health degree at the time t and the health degree at the time t' constitute one training sample, so multiple training samples corresponding to multiple usage time pairs (t, t') can be obtained to train a regression model. The regression model fits the mapping relationship between the health degree x at the usage time t in the benchmarking health degree curve and the health degree y at the corresponding usage time t' in the health degree curve of the hard disk 134.
  • For example, the regression model is the linear regression model of the following formula (6):

  $$y = a x + b \tag{6}$$

  • where a and b are coefficients that need to be determined through the training samples.
  • the regression model can be trained through the least squares method to determine the coefficients a and b. It can be understood that, in this embodiment of the present application, the regression model is not limited to a linear regression model, but may be any other regression model, such as a polynomial regression model.
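  • For the linear case, with y = a·x + b, the least squares coefficients have a closed form, sketched below. The function name and the sample values are hypothetical; x is the benchmarking health degree and y the health degree of the disk to be predicted at the corresponding usage time.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx  # raises ZeroDivisionError if all xs are identical
    b = mean_y - a * mean_x
    return a, b

# Hypothetical training pairs taken at corresponding usage times (t, t').
benchmark_health = [1.0, 0.9, 0.8, 0.7]
target_health = [0.95, 0.86, 0.77, 0.68]
a, b = fit_linear(benchmark_health, target_health)
```

  • A polynomial or other regression model could be substituted without changing how the fitted mapping is used afterwards.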
  • In step S703, the health degree of the hard disk 134 at a certain time in the future is predicted according to the selected benchmarking health degree curve and the mapping relationship.
  • When the two similar curves are aligned in time, the above-mentioned times t1 and t2 may be the same time; when the two similar curves are not aligned in time, the time t2 corresponding to the time t1 can be determined by the DTW method.
  • When m benchmarking health degree curves are selected, m health degrees $y_i$ (where i is 1 to m) can be calculated according to the m benchmarking health degree curves and the corresponding mapping relationships, and the m health degrees $y_i$ can be weighted and summed as shown in formula (7) to obtain the health degree Y1 of the hard disk 134:

  $$Y_1 = \sum_{i=1}^{m} k_i \, y_i \tag{7}$$

  • where $k_i$ is the preset weight corresponding to each benchmarking health degree curve.
  • The weight corresponding to each benchmarking health degree curve can be determined according to the ranking of the similarity between that benchmarking health degree curve and the partial health degree curve of the hard disk 134.
  • the regression model obtained by the above training can transfer the process of changing the health degree over time in the benchmark health degree curve to the health degree curve of the target disk (ie the hard disk 134), Therefore, the regression model plays the role of transferring knowledge, which can also be called a transfer model.
  • the storage system 120 can predict the health of the hard disk 134 at multiple times in the future through the method shown in FIG. 7 . For example, it can predict the health of the hard disk 134 every day in the future, so as to predict the health curve of the hard disk 134 in the future.
  • the storage system 120 may preset a health degree threshold of the hard disk 134, where the health degree threshold corresponds to the health degree of the hard disk 134 when a failure occurs. Therefore, the storage system can determine the time when the health degree of the hard disk 134 reaches the threshold in the predicted future time health degree curve of the hard disk 134, and determine the remaining life of the hard disk according to the time.
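  • Prediction and remaining-life estimation can be sketched as below, assuming a linear mapping y = a·x + b has been fitted for each selected benchmarking curve and that the health degree falls over time. The function names, weights, and threshold value are illustrative assumptions.

```python
def predict_health(benchmark_health, models, weights):
    """Weighted sum of per-benchmark predictions: sum(k_i * (a_i * x_i + b_i))."""
    return sum(
        k * (a * x + b)
        for k, (a, b), x in zip(weights, models, benchmark_health)
    )

def remaining_life(predicted_curve, threshold, now):
    """Days from `now` until the predicted health first reaches the threshold.

    `predicted_curve` is a list of (day, health) pairs in time order; returns
    None if the threshold is never reached within the predicted horizon.
    """
    for day, health in predicted_curve:
        if day >= now and health <= threshold:
            return day - now
    return None

# Two benchmarking curves give x1 = 0.5 and x2 = 0.6 at the corresponding time.
y = predict_health([0.5, 0.6], models=[(1.0, 0.0), (0.5, 0.2)], weights=[0.5, 0.5])

curve = [(100, 0.4), (101, 0.3), (102, 0.19), (103, 0.1)]
days_left = remaining_life(curve, threshold=0.2, now=100)
```

  • Repeating `predict_health` for each future day yields the predicted curve that `remaining_life` scans for the first threshold crossing.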
  • FIG. 8 is a schematic diagram of a predicted hard disk health degree curve provided by an embodiment of the present application. As shown in FIG. 8, the horizontal axis represents the time since the hard disk was activated, and the vertical axis represents the health degree of the hard disk.
  • The lower line of connected points in FIG. 8 is the health degree curve C1 of the benchmarking disk, and the upper line of connected points is the health degree curve C2 of the hard disk 134 to be predicted. The solid part of the curve C2 is the health degree curve determined from the SMART data of the hard disk 134 itself for a period of time after activation, and the dotted part of the curve C2 is the future health degree curve of the hard disk 134 predicted by the method shown in FIG. 7.
  • To predict the health degree at a future time t1 in the curve C2, first determine the time t2 in the curve C1 corresponding to the time t1, obtain the health degree x1 at the time t2 in the curve C1, and substitute x1 into the above formula (6) to calculate the health degree y1 at the time t1 in the curve C2.
  • the health degree Y1 of the hard disk 134 can also be calculated by the above formula (7).
  • A health degree threshold is set for the curve C2, corresponding to the time when the lifespan of the hard disk 134 to be predicted ends. After the future health degree curve of the hard disk 134 is predicted as described above, the time t3 corresponding to the threshold can be determined in the curve C2, and the time t3 can be regarded as the end-of-life time of the hard disk 134.
  • the method shown in FIG. 7 is only an implementation manner for predicting the health degree and lifespan of the hard disk in the future time in the embodiment of the present application, and the embodiment of the present application is not limited thereto.
  • When predicting the health degree of the hard disk 134 at the future time t1, the storage system 120 can directly use the health degree at the time t2 corresponding to the time t1 in the benchmarking health degree curve as the health degree of the hard disk 134 at the future time t1.
  • FIG. 9 is an architecture diagram of a storage device provided by an embodiment of the present application.
  • the storage device can be used to execute any of the methods shown in FIG. 3 , FIG. 4 , or FIG. 7 , and the storage device includes:
  • an obtaining unit 91 configured to obtain data of multiple indicators related to the health of the hard disk at the specified usage time
  • an input unit 92 for inputting the data into a plurality of different models
  • the determining unit 93 is configured to determine the health degree of the hard disk at the specified usage time according to the outputs of the multiple models.
  • the determining unit 93 is specifically configured to determine the health of the hard disk at the specified usage time based on the weighted sum of the outputs of the multiple models.
  • each of the multiple models is obtained by training based on an anomaly detection algorithm, and each model adopts a different anomaly detection algorithm.
  • the number of the multiple models is three, and the anomaly detection algorithms adopted by the three models are the isolation forest algorithm, the local outlier factor algorithm, and the K-means clustering algorithm, respectively.
  • the multiple models are sent to the storage device by a training device or are trained by the storage device, and the training device is used for training the multiple models.
  • the multiple models are trained on the following sampled data: sampled data of multiple health-related indicators of a failed hard disk within a preset period before the end of its life.
  • the hard disk includes a target disk
  • the storage device further includes:
  • an acquiring or generating unit configured to acquire or generate a first data set of multiple benchmarking disks, the first data set including the health degrees of the benchmarking disks at multiple usage times;
  • a generating unit configured to generate a second data set of the target disk, where the second data set includes the health degrees of the target disk at multiple usage times, and the time span of the usage times in the first data set is greater than the time span of the usage times in the second data set;
  • a selection unit configured to select a benchmarking disk according to the similarity between the health degrees at corresponding usage times in the first data set and the second data set;
  • a prediction unit configured to predict the health degree of the target disk at a specified future time according to the first data set of the selected benchmarking disk.
  • the prediction unit is specifically configured to: fit a mapping relationship between the health degree of the selected benchmarking disk at a first usage time and the health degree of the target disk at a second usage time, the first usage time and the second usage time being corresponding times; and predict the health degree of the target disk at multiple specified future times according to the mapping relationship and the first data set.
  • the determining unit 93 is further configured to determine, according to the predicted health degrees of the target disk at the multiple specified future times, the time at which the health degree of the target disk reaches a threshold, and to take that time as the end-of-life time of the target disk.
  • a third aspect of the present application provides a storage device, including a processor and a memory, where executable computer program instructions are stored in the memory, and the processor executes the executable computer program instructions to perform any of the methods shown in FIG. 3, FIG. 4, or FIG. 7.
  • a fourth aspect of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores computer program instructions, and when the computer program instructions are executed in a computer or a processor, they cause the computer or processor to perform any of the methods shown in FIG. 3, FIG. 4, or FIG. 7.
  • a fifth aspect of the present application provides a computer program product, comprising computer program instructions, where when the computer program instructions are run in a computer or a processor, they cause the computer or processor to perform the method described in the first aspect or a possible implementation of the first aspect.
  • In the above-mentioned embodiments, the implementation may be in whole or in part by software, hardware, firmware, or any combination thereof.
  • When software is used, the implementation may be in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present invention are generated.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, radio, microwave) means.
  • the computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media.
  • the disclosed apparatus and method may be implemented in other manners without exceeding the scope of the present application.
  • the above-described embodiments are only illustrative.
  • the division of the modules or units is only a logical function division, and there may be other ways of dividing them in actual implementation;
  • for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed.
  • the unit described as a separate component may or may not be physically separated, and the component displayed as a unit may or may not be a physical unit, that is, it may be located in one place or distributed across multiple network units.
  • Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Evolutionary Biology (AREA)
  • Debugging And Monitoring (AREA)

Abstract

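A hard disk health assessment method and apparatus. The method includes: obtaining data of multiple health-related indicators of a hard disk at a specified usage time (S401); inputting the data into multiple different models (S402); and determining the health degree of the hard disk at the specified usage time according to the outputs of the multiple models (S403). By fusing the outputs of the multiple models, the method obtains a stable and accurate health degree indicator.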

Description

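A hard disk health assessment method and storage device
This application claims priority to Chinese patent application No. 202110453844.1, filed with the China Patent Office on April 26, 2021 and entitled "Solid-state drive health degree evaluation and life prediction method", and to Chinese patent application No. 202110812127.3, filed with the China Patent Office on July 16, 2021 and entitled "A hard disk health assessment method and storage device", both of which are incorporated herein by reference in their entirety.
Technical Field
This application relates to the field of storage technologies, and in particular to a hard disk health assessment method and a storage device.
Background
With the rapid development of big data, cloud computing, and artificial intelligence, demand for highly reliable storage systems keeps growing. The reliability of the hard disks in a storage system is one of the main factors limiting the reliability of the system: a hard disk failure can cause loss or corruption of user data, degraded read/write performance, or an outage of the storage system. How to accurately predict the health degree and lifespan of a hard disk has therefore become a growing concern.
Hard disks commonly use Self-Monitoring, Analysis and Reporting Technology (SMART) to monitor multiple parameters of the disk in real time and record them as SMART data, so that each hard disk can be monitored on the basis of its SMART data.
In the related art, the health degree of a hard disk is usually evaluated with methods such as the Euclidean distance method or the linear evaluation method. The Euclidean distance method measures health by the distance between the disk's SMART data and threshold data; the linear evaluation method predicts health from a constructed linear function of health degree over time. However, the health degree obtained by these methods does not stably indicate the actual health of the hard disk and carries a large error.
Summary
The embodiments of this application aim to provide a hard disk health degree assessment solution that obtains a stable and accurate health degree indicator by fusing the output values that multiple anomaly detection models produce from a hard disk's SMART data.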
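To achieve the above objective, a first aspect of this application provides a hard disk health assessment method. The method is performed by a storage device and includes: obtaining data of multiple health-related indicators of a hard disk at a specified usage time; inputting the data into multiple different models; and determining the health degree of the hard disk at the specified usage time according to the outputs of the multiple models.
Determining the health degree from the outputs of multiple different models combines the characteristics of each model and provides a stable and accurate health degree indicator.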
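In a possible implementation of the first aspect, determining the health degree of the hard disk at the specified usage time according to the outputs of the multiple models specifically includes determining the health degree based on a weighted sum of the outputs of the multiple models.
Computing a weighted sum fuses the outputs of the multiple models. The weights of the model outputs may be equal, may be unequal, or may be adjusted dynamically for different scenarios.
In a possible implementation of the first aspect, each of the multiple models is trained based on an anomaly detection algorithm, and each model uses a different anomaly detection algorithm.
Because anomaly detection algorithms are unsupervised, training different models with different anomaly detection algorithms requires no manual labeling of samples, which saves labor cost; anomaly detection models can also provide high prediction accuracy.
In a possible implementation of the first aspect, there are three models, and the anomaly detection algorithms they use are the isolation forest algorithm, the local outlier factor algorithm, and the K-means clustering algorithm, respectively.
In a possible implementation of the first aspect, the multiple models are sent to the storage device by a training device or are trained by the storage device, the training device being used to train the multiple models.
In a possible implementation of the first aspect, the multiple models are trained on the following sampled data: sampled data of multiple health-related indicators of a failed hard disk within a preset usage period before the end of its life.
The anomaly detection models are trained on sampled data of multiple indicators from a period before the end of life of a failed hard disk. In practice, storage devices usually sample SMART data during the second half of a disk's service life, so such data is easier to obtain. Moreover, because the sampled data corresponds to low health degrees, the anomaly degree output by the models can be positively correlated with the health degree; for example, the anomaly degree can be used directly as the health degree, which reduces the cost of computing it.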
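In a possible implementation of the first aspect, the hard disk includes a target disk, and the method further includes: obtaining or generating a first data set of multiple benchmarking disks, the first data set including the health degrees of the benchmarking disks at multiple usage times; generating a second data set of the target disk, the second data set including the health degrees of the target disk at multiple usage times, where the time span of the usage times in the first data set is greater than that in the second data set; selecting a benchmarking disk according to the similarity between the health degrees at corresponding usage times in the first data set and the second data set; and predicting the health degree of the target disk at a specified future time according to the first data set of the selected benchmarking disk.
Predicting the health degree of the target disk from the data of a benchmarking disk reduces computational cost and allows the health degree of the target disk at a future time to be predicted accurately.
In a possible implementation of the first aspect, predicting the health degree of the target disk at the specified future time according to the first data set of the selected benchmarking disk includes: fitting a mapping relationship between the health degree of the selected benchmarking disk at a first usage time and the health degree of the target disk at a second usage time, the first usage time and the second usage time being corresponding times; and predicting the health degree of the target disk at multiple specified future times according to the mapping relationship and the first data set.
Fitting the mapping between the health degree of the benchmarking disk and that of the target disk at corresponding usage times further improves the accuracy of the predicted future health degree of the target disk.
In a possible implementation of the first aspect, the method further includes determining, according to the predicted health degrees of the target disk at the multiple specified future times, the time at which the health degree of the target disk reaches a threshold, and taking that time as the end-of-life time of the target disk.
Predicting the lifespan of the target disk from its predicted future health degree gives a reasonably accurate lifespan estimate, so that operations such as data backup can be carried out in advance to prevent the problems a target disk failure would cause.
In a possible implementation of the first aspect, the hard disk is a solid-state drive.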
本申请第二方面提供一种存储设备,包括:获取单元,用于获取硬盘在指定使用时间的与健康度相关的多个指标的数据;输入单元,用于将所述数据输入多个不同的模型;确定单元,用于根据所述多个模型的输出确定所述硬盘在所述指定使用时间的健康度。
在第二方面的一种可能的实现方式中,所述确定单元具体用于,基于所述多个模型的输出的加权和确定所述硬盘在所述指定使用时间的健康度。
在第二方面的一种可能的实现方式中,所述多个模型中每个模型基于异常检测算法训练得到,且每个模型所采用的异常检测算法不同。
在第二方面的一种可能的实现方式中,所述多个模型的数量为三个,所述三个模型所采用的异常检测算法分别为孤立森林算法、局部异常因子算法、K均值聚类算法。
在第二方面的一种可能的实现方式中,所述多个模型由训练设备发送给所述存储设备或者由所述存储设备训练得到,所述训练设备用于训练所述多个模型。
在第二方面的一种可能的实现方式中,所述多个模型通过以下采样数据进行训练:故障硬盘在寿命结束之前的预设使用时段内的与健康度相关的多个指标的采样数据。
在第二方面的一种可能的实现方式中,所述硬盘包括目标盘,所述存储设备还包括:获取或生成单元,用于获取或生成多个对标盘的第一数据集,所述第一数据集包括所述对标盘在多个使用时间的健康度;生成单元,用于生成所述目标盘的第二数据集,所述第二数据集包括所述目标盘在多个使用时间的健康度,所述第一数据集中的多个使用时间的时间跨度大于所述第二数据集中的多个使用时间的时间跨度;选取单元,用于根据所述第一数据集与所述第二数据集对应的多个使用时间的健康度的相似度,选取对标盘;预测单元,用于根据所述选取的对标盘的第一数据集,预测所述目标盘在未来的指定时间的健康度。
在第二方面的一种可能的实现方式中,所述预测单元具体用于:拟合所述选取的对标盘的第一使用时间的健康度与所述目标盘的第二使用时间的健康度之间的映射关系,所述第一使用时间与所述第二使用时间为对应的时间;根据所述映射关系和所述第一数据集预测所述目标盘在未来多个指定时间的健康度。
在第二方面的一种可能的实现方式中,所述确定单元还用于,根据所预测的所述目标盘在未来多个指定时间的健康度,确定所述目标盘的健康度达到阈值的时间,将健康度达到阈值的时间作为所述目标盘的寿命结束时间。
本申请第三方面提供一种存储设备,其特征在于,包括处理器和存储器,所述存储器中存储有可执行计算机程序指令,所述处理器执行所述可执行计算机程序指令以实现第一方面或第一方面可能的实施方式所述的方法。
本申请第四方面提供一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储有计算机程序指令,当所述计算机程序指令在计算机或处理器中执行时,使得所述计算机或处理器执行第一方面或第一方面可能的实施方式所述的方法。
本申请第五方面提供一种计算机程序产品,包括计算机程序指令,当所述计算机程序指令在计算机或处理器中运行时,使得所述计算机或处理器执行第一方面或第一方面可能的实施方式所述的方法。
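Brief Description of Drawings
The embodiments of this application can be made clearer by describing them with reference to the accompanying drawings:
FIG. 1A is an architecture diagram of a centralized storage system 120 with a disk-controller separated structure to which the embodiments of this application are applied;
FIG. 1B is an architecture diagram of a centralized storage system 120 with a disk-controller integrated structure to which the embodiments of this application are applied;
FIG. 1C is an architecture diagram of a distributed storage system to which the embodiments of this application are applied;
FIG. 2 is a schematic diagram of a system architecture for training hard disk anomaly detection models according to an embodiment of this application;
FIG. 3 is a flowchart of a method for training an anomaly detection model according to an embodiment of this application;
FIG. 4 is a flowchart of a hard disk health assessment method according to an embodiment of this application;
FIG. 5 is a schematic diagram of the health degree curves of individual hard disks according to an embodiment of this application;
FIG. 6 is an enlarged view of one health degree curve in FIG. 5;
FIG. 7 is a flowchart of a method for predicting hard disk health degree according to an embodiment of this application;
FIG. 8 is a schematic diagram of a predicted hard disk health degree curve according to an embodiment of this application;
FIG. 9 is an architecture diagram of a storage device according to an embodiment of this application.
Detailed Description
The technical solutions in the embodiments of this application are described below with reference to the accompanying drawings.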
本申请实施例提供的硬盘健康评估方案可应用于存储系统中。存储系统包括集中式存储系统和分布式存储系统。所述集中式存储系统是指由一台或多台主设备组成中心节点,数据集中存储于这个中心节点中,并且整个系统的所有数据处理业务都集中部署在这个中心节点上。所述分布式存储系统是指将数据分散存储在多台独立的存储节点上的系统。用户可通过应用程序来向存储节点存取数据。运行这些应用程序的计算机被称为“应用服务器”。应用服务器可以是物理机,也可以是虚拟机。物理应用服务器包括但不限于桌面电脑、服务器、笔记本电脑以及移动设备。应用服务器可通过光纤交换机访问存储节点以存取数据。其中,交换机只是一个可选设备,应用服务器也可以直接通过网络与存储节点通信。
图1A为本申请实施例所应用的盘控分离结构的集中式存储系统120的架构图。该存储系统120与多个主机200连接,所述多个主机200例如为应用服务器,其都与存储系统120连接,以向存储系统120存取数据。图1A所示的集中式存储系统的特点是具有统一的入口,从主机200来的数据都要经过这个入口,这个入口例如为存储系统120中的引擎121。
如图1A所示,引擎121中有一个或多个控制器,图1A以引擎包含两个控制器为例予以说明。控制器0与控制器1之间具有镜像通道,那么当控制器0将一份数据写入其内存124后,可以通过所述镜像通道将所述数据的副本发送给控制器1,控制器1将所述副本存储在自己本地的内存124中。由此,控制器0和控制器1互为备份,当控制器0发生故障时,控制器1可以接管控制器0的业务,当控制器1发生故障时,控制器0可以接管控制器1的业务,从而避免硬件故障导致整个存储系统120的不可用。当引擎121中部署有4个控制器时,任意两个控制器之间都具有镜像通道,因此任意两个控制器互为备份。
引擎121还包含前端接口125和后端接口126,其中前端接口125用于与应用服 务器通信,从而为应用服务器提供存储服务。而后端接口126用于与硬盘134通信,以扩充存储系统的容量。通过后端接口126,引擎121可以连接更多的硬盘134,从而形成一个非常大的存储资源池。
在硬件上,如图1A所示,控制器0至少包括处理器123、内存124。处理器123是一个中央处理器(central processing unit,CPU),用于处理来自存储系统外部(服务器或者其他存储系统)的数据访问请求,也用于处理存储系统内部生成的请求。示例性的,处理器123通过前端端口接收服务器发送的写数据请求时,会将这些写数据请求中的数据暂时保存在内存124中。当内存124中的数据总量达到一定阈值时,处理器123通过后端端口126将内存124中存储的数据发送给硬盘134进行持久化存储。
内存124是指与处理器123直接交换数据的内部存储器,它可以随时读写数据,而且速度很快,作为操作系统或其他正在运行中的程序的临时数据存储器。内存124包括至少两种存储器,例如内存既可以是随机存取存储器,也可以是只读存储器(Read Only Memory,ROM)。举例来说,随机存取存储器是动态随机存取存储器(Dynamic Random Access Memory,DRAM),或者存储级存储器(Storage Class Memory,SCM)。DRAM是一种半导体存储器,与大部分随机存取存储器(Random Access Memory,RAM)一样,属于一种易失性存储器(volatile memory)设备。SCM是一种同时结合传统储存装置与存储器特性的复合型储存技术,存储级存储器能够提供比硬盘更快速的读写速度,但运算速度上比DRAM慢,在成本上也比DRAM更为便宜。然而,DRAM和SCM在本实施例中只是示例性的说明,内存还可以包括其他随机存取存储器,例如静态随机存取存储器(Static Random Access Memory,SRAM)等。而对于只读存储器,举例来说,可以是可编程只读存储器(Programmable Read Only Memory,PROM)、可抹除可编程只读存储器(Erasable Programmable Read Only Memory,EPROM)等。另外,内存124还可以是双列直插式存储器模块或双线存储器模块(Dual In-lineMemoryModule,简称DIMM),即由动态随机存取存储器(DRAM)组成的模块,还可以是固态硬盘(Solid State Disk,SSD)。实际应用中,控制器0中可配置多个内存124,以及不同类型的内存124。本实施例不对内存124的数量和类型进行限定。此外,可对内存124进行配置使其具有保电功能。保电功能是指系统发生掉电又重新上电时,内存124中存储的数据也不会丢失。具有保电功能的内存被称为非易失性存储器。
内存124中存储有软件程序,处理器123运行内存124中的软件程序可实现对硬盘的管理。所述对硬盘的管理例如将硬盘抽象化为存储资源池,然后划分为LUN提供给服务器使用等。这里的LUN其实就是在服务器上看到的硬盘。当然,一些集中式存储系统本身也是文件服务器,可以为服务器提供共享文件服务。
控制器1(以及其他图1A中未示出的控制器)的硬件组件和软件结构与控制器0类似,这里不再赘述。
在图1A所示的盘控分离的存储式系统中,引擎121可以不具有硬盘槽位,硬盘134需要放置在硬盘阵列130中,后端接口126与硬盘阵列130通信。后端接口126以适配卡的形态存在于引擎121中,一个引擎121上可以同时使用两个或两个以上后端接口126来连接多个硬盘阵列。或者,适配卡也可以集成在主板上,此时适配卡可 通过PCIE总线与处理器123通信。
需要说明的是,图1A中只示出了一个引擎121,然而在实际应用中,存储系统中可包含两个或两个以上引擎121,多个引擎121之间做冗余或者负载均衡。
硬盘阵列130包括控制单元131和若干个硬盘134。控制单元131可具有多种形态。一种情况下,硬盘阵列130属于智能盘框,如图1A所示,控制单元131包括CPU和内存。CPU用于执行地址转换以及读写数据等操作。内存用于临时存储将要写入硬盘134的数据,或者从硬盘134读取出来将要发送给控制器的数据。另一种情况下,控制单元131是一个可编程的电子部件,例如数据处理单元(data processing unit,DPU)。DPU具有CPU的通用性和可编程性,但更具有专用性,可以在网络数据包,存储请求或分析请求上高效运行。DPU通过较大程度的并行性(需要处理大量请求)与CPU区别开来。可选的,这里的DPU也可以替换成图形处理单元(graphics processing unit,GPU)、嵌入式神经网络处理器(neural-network processing units,NPU)等处理芯片。通常情况下,控制单元131的数量可以是一个,也可以是两个或两个以上。当硬盘阵列130包含至少两个控制单元131时,硬盘134与控制单元131之间具有归属关系,每个控制单元只能访问归属于它的硬盘,因此这往往涉及到在控制单元131之间转发读/写数据请求,导致数据访问的路径较长。另外,如果存储空间不足,在硬盘阵列130中增加新的硬盘134时需要重新绑定硬盘134与控制单元131之间的归属关系,操作复杂,导致存储空间的扩展性较差。因此在另一种实施方式中,控制单元131的功能可以卸载到网卡104上。换言之,在该种实施方式中,硬盘阵列130内部不具有控制单元131,而是由网卡104来完成数据读写、地址转换以及其他计算功能。此时,网卡104是一个智能网卡。它可以包含CPU和内存。CPU用于执行地址转换以及读写数据等操作。内存用于临时存储将要写入硬盘134的数据,或者从硬盘134读取出来将要发送给控制器的数据。也可以是一个可编程的电子部件,例如数据处理单元(data processing unit,DPU)。DPU具有CPU的通用性和可编程性,但更具有专用性,可以在网络数据包,存储请求或分析请求上高效运行。DPU通过较大程度的并行性(需要处理大量请求)与CPU区别开来。可选的,这里的DPU也可以替换成图形处理单元(graphics processing unit,GPU)、嵌入式神经网络处理器(neural-network processing units,NPU)等处理芯片。硬盘阵列130中的网卡104和硬盘134之间没有归属关系,网卡104可访问该硬盘阵列130中任意一个硬盘134,因此在存储空间不足时扩展硬盘会较为便捷。
所述硬盘134可以为SSD,或者可以为机械硬盘(即磁盘)。SSD与传统磁盘相比具有启动快、快速读写、读取时间固定、工作温度范围广及无噪音等特点,同时固态硬盘不会出现由机械部件活动导致的机械故障,耐冲击、振动和碰撞,具有较高的安全性和可靠性。本申请实施例提供的硬盘健康评估方法适用于对SSD进行健康评估,可以理解,本申请实施例提供的硬盘健康评估方法也适用于对磁盘进行健康评估。
按照引擎121与硬盘阵列130之间通信协议的类型,硬盘阵列130可能是SAS硬盘阵列,也可能是NVMe硬盘阵列以及其他类型的硬盘阵列。SAS硬盘阵列,采用SAS3.0协议,每个框支持25块SAS硬盘。引擎121通过板载SAS接口或者SAS接口模块与硬盘阵列130连接。NVMe硬盘阵列,更像一个完整的计算机系统,NVMe硬盘插在NVMe 硬盘阵列内。NVMe硬盘阵列再通过RDMA端口与引擎121连接。
可以理解,图1A中虽然示出了具有盘控分离结构的集中式存储系统,但并不用于限制本申请实施例的应用范围,例如,本申请实施例还可以应用于图1B所示的盘控一体结构的集中式存储系统。在盘控一体结构的集中式存储系统中,与盘控分离结构不同在于,引擎121具有硬盘槽位,硬盘134可直接部署在引擎121中,后端接口126属于可选配置,当系统的存储空间不足时,可通过后端接口126连接更多的硬盘或硬盘阵列。
本申请实施例还可以应用于图1C所示的分布式的存储系统。所述分布式的存储系统包括存储节点集群。其中,存储节点集群包括一个或多个存储节点20(图1C中示出了三个存储节点20a、20b及20C,但不限于三个存储节点),各个存储节点20之间可以互联。每个存储节点20与多个主机200连接。每个主机200连接多个存储节点20,并且与多个存储节点20交互,以将数据分布存储在多个存储节点中20,从而实现数据的可靠存储。每个存储节点20至少包括处理器201、内存202、及硬盘203。其中,处理器201、内存202级硬盘203与图1A中的处理器123、内存124及硬盘134的结构及功能相同,具体请参看图1A中的相关描述,在此不再赘述。
在存储系统中,可为每个硬盘定期采样SMART数据,从而可根据SMART数据监控硬盘的健康度,预测硬盘故障。其中,该SMART数据例如包括通电时间、开关计数、不可校正错误数、新增坏扇区数、新增坏块数、总擦除次数等多项指标的值。每项SMART指标通常设置有阈值,如果硬盘的某项SMART数据接近该阈值,则表示硬盘将变得不可靠,例如可能导致数据丢失或硬盘故障。
在相关技术中,评估硬盘健康度的方法包括二值法、欧氏距离法及线性评估法等方法。其中,在二值法中,硬盘的健康度包括健康和故障两个状态,即该方法可以预测硬盘是否发生故障、何时发生故障等,但是该方法不能定量的预测硬盘实时的健康度。在欧式距离法中,根据硬盘的SMART数据与对应SMART指标的阈值之间的距离确定硬盘的健康度,然而,限于硬盘实际运行数据的特性,单一的基于距离度量的方法难以获得稳定且准确的健康评价指标。在线性评估方法中,建立硬盘健康度和时间的线性相关函数,然而,硬盘的实际健康度同时受到读写频次、环境温度等工况因素的影响,与时间因素没有严格意义上的强相关性,因此,该方法得到的健康度具有较大的误差。
本申请实施例提供了一种有效地评估硬盘健康状况的方法,在存储系统中使用多个异常检测模型基于硬盘的SMART数据进行预测,并对该多个异常检测模型的输出进行融合,从而得到较准确的硬盘健康度,其中,所述多个异常检测模型基于硬盘的SMART数据训练获得。该方法可由图1A所示的存储系统120、图1B所示的存储系统120或者图1C所示的存储系统执行,下文中以图1A中存储系统120为例进行描述。
图2是本申请实施例提供的一种用于训练异常检测模型的系统架构示意图。所述异常检测模型具有与异常检测算法对应的模型参数,通过基于异常检测算法使用多个训练样本进行训练,从而可确定异常检测模型的模型参数。训练好的异常检测模型可基于待测样本的特征输出该样本的异常度,该异常度也即该待测样本与训练样本集中的大多数样本的差异程度。从而,在通过使用与高健康度对应的多个硬盘样本来训练 异常检测模型的情况中,该异常检测模型输出的待测硬盘样本的异常度与硬盘健康度负相关,或者,在通过使用与低健康度对应的多个硬盘样本来训练异常检测模型的情况中,该异常检测模型输出的待测硬盘样本的异常度与硬盘健康度正相关,从而可使用该经训练的异常检测模型预测硬盘的健康度。
如图2所示,该系统架构包括:训练设备210,用于从存储系统120获取硬盘的SMART数据,并使用该SMART数据训练多个不同的异常检测模型,其中,对多个异常检测模型的训练过程将在下文结合图3进行描述。训练设备210在训练得到多个异常检测模型之后,将该多个模型发送给存储系统120,从而存储系统120可以使用该多个异常检测模型对硬盘健康度预测。可以理解,训练设备210可与多个存储系统连接,以获取各个存储系统中硬盘的SMART数据,以进行模型训练。在另一种实施方式中,存储系统120可将硬盘的SMART数据传输至存储设备的数据库中,训练设备210可从数据库中读取硬盘数据以进行模型训练。在另一种实施方式中,存储系统120可通过自身的CPU123对模型训练,从而得到异常检测模型。在另一种实施方式中,存储系统120可通过插接在存储系统120上的计算芯片(如现场可编程门阵列(Field Programmable Gate Array,FPGA)芯片)进行对模型的训练,从而得到异常检测模型。下文中,将以图2所示的系统架构为例进行描述。
在图2所示的存储系统120中,CPU123和内存124可以为图1A中控制器0中的CPU和内存,也可以为控制器1中的CPU和内存,下文中以控制器0为例进行描述。
存储系统120在运行期间可以定期采集其包括的每块硬盘134的SMART数据,并可根据模型训练的需要将特定SMART数据发送给训练设备210,这将在下文参考图3详细描述。
存储系统120在从训练设备210接收到多个异常检测模型之后,可将该多个异常检测模型存储在持久性存储,例如硬盘134中。存储系统120当需要使用多个异常检测模型220进行对硬盘健康度的预测时,可从硬盘134读取多个异常检测模型,并将该多个异常检测模型存入内存124中,以供CPU123读取并运行该多个模型。CPU123可定期获取各个硬盘134的在指定使用时间(例如当前时间)的SMART数据,并将该SMART数据分别输入多个异常检测模型,之后,CPU123对该多个异常检测模型的输出进行融合,从而可获取每个硬盘134在指定使用时间的健康度。或者CPU123可在存储系统120增加新的硬盘134时,多次获取该新的硬盘134在当前时间的SMART数据,并将该SMART数据分别输入多个异常检测模型,从而CPU123可基于该多个异常检测模型的输出预测该新增加的硬盘134在指定使用时间的健康度,该部分内容将在下文参考图4详细描述。
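FIG. 3 is a flowchart of a method for training an anomaly detection model according to an embodiment of this application. The method may be performed by the training device 210 in FIG. 2 and includes the following steps:
Step S301: receive, from the storage system, SMART data of a hard disk at at least one usage time;
Step S302: train an anomaly detection model based on the SMART data.
The steps of the method shown in FIG. 3 are described in detail below.
First, in step S301, SMART data of a hard disk at at least one usage time is received from the storage system 120.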
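The storage system 120 may select the SMART data of hard disks to send to the training device 210 according to the needs of model training. The full SMART data of a hard disk contains the values of many indicators, for example more than one hundred; from these, the storage system 120 may select the data of the indicators related to hard disk health and lifespan degradation. The selected indicators include, for example, power-on time, power cycle count, uncorrectable error count, newly added bad sector count, newly added bad block count, and total erase count.
The storage system 120 may also, according to the model training strategy, select the SMART data at multiple usage times within a predetermined usage period of the hard disk and send it to the training device 210.
Specifically, in one implementation, the storage system 120 may select the SMART data of multiple hard disks from a predetermined period after they were first put into use to train the anomaly detection models. A newly enabled hard disk usually has its highest health degree (which may be expressed as 1, for example). Training on the SMART data of this period therefore treats the SMART data corresponding to high health as the majority of normal samples, and the anomaly degree output by the trained model is negatively correlated with the health degree of the disk: the larger the anomaly degree, the worse the health state of the disk, that is, the lower the health degree.
In another implementation, after a hard disk fails, the storage system 120 may select the SMART data of the failed disk from a predetermined usage period before the failure to train the anomaly detection models. A hard disk has its lowest health degree at failure (which may be expressed as 0, for example). Training on the SMART data of this period therefore treats the SMART data corresponding to low health as the majority of normal samples, and the anomaly degree output by the trained model is positively correlated with the health degree of the disk: the larger the anomaly degree, the better the health state of the disk, that is, the higher the health degree. In practice, because a hard disk is usually in good health when first enabled, the storage system 120 typically does not collect SMART data during a predetermined period after the disk is enabled, and only starts collecting it after the disk has been in use for a fairly long time (for example, around the middle of its life) in order to monitor its health state. The SMART data stored in the storage system 120 therefore usually lacks data from when the disk was new and contains more data from before failure, which makes the training method of this implementation the better fit. In addition, because the anomaly degree output by the model in this implementation is positively correlated with the health degree, the anomaly degree can, for example, be used directly as the health degree, which reduces the cost of computing the health degree. The following description uses this implementation as the example.
After selecting the SMART data of a hard disk, the storage system 120 may also preprocess it and send the preprocessed SMART data to the training device 210. For example, the data may be processed so that the sampling times are evenly distributed. To obtain one set of SMART data per day, when a day has several sampling points the SMART data sampled at those points is averaged to give the data for that day, and when a day has no samples the SMART data for that day can be filled in by interpolation.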
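The daily resampling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `to_daily_series` is hypothetical, a single scalar indicator is used instead of a full SMART record, and linear interpolation is assumed because the source only says "an interpolation method".

```python
from collections import defaultdict

def to_daily_series(samples):
    """samples: iterable of (day_index, value) pairs for one SMART indicator.

    Returns a list of (day, value) with exactly one value per day, from the
    first to the last observed day.
    """
    by_day = defaultdict(list)
    for day, value in samples:
        by_day[day].append(value)
    if not by_day:
        return []
    # Several samples on the same day: use their mean.
    means = {day: sum(vals) / len(vals) for day, vals in by_day.items()}
    days = sorted(means)
    result = []
    for prev, nxt in zip(days, days[1:]):
        result.append((prev, means[prev]))
        # Days with no sample: linear interpolation (an assumption).
        for d in range(prev + 1, nxt):
            frac = (d - prev) / (nxt - prev)
            result.append((d, means[prev] + frac * (means[nxt] - means[prev])))
    result.append((days[-1], means[days[-1]]))
    return result
```

For example, two samples on day 0 and one on day 2 yield three daily values, with day 1 filled in between the two observed days.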
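In step S302, the anomaly detection models are trained based on the SMART data.
The embodiments of this application use several anomaly detection algorithms based on different principles and train multiple anomaly detection models with the SMART data obtained above.
In one implementation, an iForest anomaly detection model is trained with the isolation forest (iForest) algorithm. Isolation forest is an unsupervised anomaly detection algorithm, that is, it needs no labeled samples for training. Each training sample is, for example, a set of SMART data $X=\{x_1, x_2, \dots, x_n\}$ collected from a hard disk at some time; the set contains the data of multiple health-related indicators (indicators 1 to n). As described above, a training sample may be a set of SMART data from some time within a predetermined period before the hard disk fails, where the predetermined period may be a period whose distance from the failure time is less than a predetermined length.
The training device 210 may use N samples X to train multiple isolation trees and thereby obtain the iForest anomaly detection model. To train one isolation tree, a subset of the N samples (for example, ψ samples) is selected at random and placed in the root node of the tree. A SMART indicator q is chosen at random for the root node, and a split value p is chosen at random for that indicator, where p lies between the maximum and minimum values of indicator q among the ψ samples currently in the root node. Based on q and p, the ψ samples are split into the two child nodes of the root: for example, samples whose value of q is less than p go to the left child, and samples whose value of q is greater than p go to the right child. The samples in each newly generated node are then split by the same procedure to produce new child nodes, until the last generated child node contains only one sample (it cannot be split further and is a leaf node) or the isolation tree has grown to a set height, at which point growth stops. The height is the number of edges between a leaf node and the root node.
After t isolation trees have been obtained as described above (t is a predetermined number, for example 100), training of the iForest anomaly detection model is complete. Based on the trained model, the anomaly degree of a sample under test can be predicted with formula (1):
$$s(x, \psi) = 2^{-\frac{E(h(x))}{c(\psi)}} \qquad (1)$$
where x denotes the sample under test, which, like X, contains SMART data of multiple health-related indicators; ψ is the number of samples used to train each isolation tree; h(x) is the height of sample x in each isolation tree; E(h(x)) is the expected height of sample x over the t isolation trees; and c(ψ) is the average height of an isolation tree for a given number of training samples ψ, used to normalize the expected height E(h(x)). Formula (1) shows that the smaller the expected height of the sample under test in the isolation forest model, the higher its anomaly degree. This is because a smaller expected height means that the sample falls in a region where the training samples are sparsely distributed, so the sample is more anomalous relative to the training samples.
In other words, when the anomaly detection model is trained with multiple samples from the predetermined period before hard disk failure, the model's output indicates how anomalous the sample under test is relative to those low-health samples. The higher the anomaly degree, therefore, the higher the health degree of the sample under test.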
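The tree construction and the score of formula (1) can be sketched in a few lines. This is an illustrative sketch, not the patent's code: all function names are hypothetical, the closed form used for c(ψ) is taken from the standard isolation forest formulation (the source only describes c(ψ) as an average tree height), the height limit of ceil(log2 ψ) is likewise an assumption, and a node whose chosen indicator has a single value is simply turned into a leaf.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c(psi):
    """Normaliser c(psi): assumed closed form from the standard iForest formulation."""
    if psi <= 1:
        return 0.0
    if psi == 2:
        return 1.0
    return 2.0 * (math.log(psi - 1) + EULER_GAMMA) - 2.0 * (psi - 1) / psi

def build_tree(samples, depth, max_depth, rng):
    """Split recursively on a random indicator q at a random split value p."""
    if len(samples) <= 1 or depth >= max_depth:
        return {"size": len(samples)}          # leaf node
    q = rng.randrange(len(samples[0]))
    lo = min(s[q] for s in samples)
    hi = max(s[q] for s in samples)
    if lo == hi:                               # simplification: stop instead of retrying
        return {"size": len(samples)}
    p = rng.uniform(lo, hi)
    left = [s for s in samples if s[q] < p]
    right = [s for s in samples if s[q] >= p]
    return {"q": q, "p": p,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(x, node, depth=0):
    """Edges from the root to the leaf reached by x, plus c(size) for unsplit leaves."""
    if "q" not in node:
        return depth + c(node["size"])
    child = node["left"] if x[node["q"]] < node["p"] else node["right"]
    return path_length(x, child, depth + 1)

def fit_forest(train, n_trees, psi, rng):
    max_depth = math.ceil(math.log2(psi))      # assumed height limit
    return [build_tree(rng.sample(train, psi), 0, max_depth, rng)
            for _ in range(n_trees)]

def anomaly_score(x, trees, psi):
    """Formula (1): 2 ** (-E(h(x)) / c(psi))."""
    mean_h = sum(path_length(x, t) for t in trees) / len(trees)
    return 2.0 ** (-mean_h / c(psi))
```

A point far from the training cloud is isolated after few splits and scores closer to 1. In the setting of the text, where the training samples come from failing disks, such a high score would be read as high health.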
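In another implementation, an LOF anomaly detection model is trained with the local outlier factor (LOF) algorithm. LOF is based on the density of points in space. As with the iForest algorithm above, to train the LOF model the training device 210 may obtain multiple samples $X=\{x_1, x_2, \dots, x_n\}$ and place them in a space with dimensions 1 to n. For a sample x under test, the LOF score of x can then be computed with formula (2):
$$\mathrm{LOF}_k(x) = \frac{1}{|N_k(x)|}\sum_{p \in N_k(x)} \frac{\rho_k(p)}{\rho_k(x)} \qquad (2)$$
where sample x has a corresponding point x in the space; $N_k(x)$ denotes all points whose distance from x is within the k-th distance, called the k-th neighborhood of x; $\rho_k(p)$ is the density of points in the k-th neighborhood of a point p belonging to $N_k(x)$; and $\rho_k(x)$ is the density of points in the k-th neighborhood of x. According to formula (2), if the LOF score of sample x is greater than 1, the density at x is lower than the density at the points in its neighborhood, and x may be an anomalous sample, that is, its anomaly degree is high; if the LOF score is less than or equal to 1, the density at x is greater than or equal to that of its neighbors, and the anomaly degree of x is low.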
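The density ratio behind the LOF score can be illustrated with a small sketch. It rests on assumptions that go beyond the text: the density of a point is approximated as k divided by the summed distance to its k nearest training points (the usual LOF definition uses a reachability-based density instead), and the helper names are hypothetical.

```python
import math

def k_nearest(point, data, k, skip=None):
    """Return the k nearest (distance, index) pairs of `point` among `data`."""
    dists = sorted((math.dist(point, q), i) for i, q in enumerate(data) if i != skip)
    return dists[:k]

def density(point, data, k, skip=None):
    """Simplified local density: k over the summed distance to the k nearest points."""
    total = sum(d for d, _ in k_nearest(point, data, k, skip))
    return k / max(total, 1e-12)               # guard against duplicate points

def lof(x, data, k):
    """Mean ratio of each neighbour's density to the density at x."""
    neighbours = k_nearest(x, data, k)
    rho_x = density(x, data, k)
    ratios = [density(data[i], data, k, skip=i) / rho_x for _, i in neighbours]
    return sum(ratios) / len(ratios)
```

On a regular 5 by 5 grid of training points, a test point far outside the grid gets a score well above 1, while a point in the middle of the grid gets a score below 1.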
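In yet another implementation, a K-means anomaly detection model is trained with an anomaly detection algorithm based on K-means clustering. Similarly, the training device 210 may obtain N samples $X=\{x_1, x_2, \dots, x_n\}$ from the database 220 and cluster them to obtain the centroid points (centroid samples) of the clusters, which yields the anomaly detection model. When the model is used to predict a sample x under test, the distance between x and the centroid is computed and the anomaly degree of x is determined from that distance: the larger the distance, the farther x is from the centroid and the higher its anomaly degree. In another implementation, to compute a consistent similarity for samples under test in clusters of different sizes, the anomaly degree may be computed with formula (3):
[Formula (3): equation image in the source, not recoverable from the extracted text; it is expressed in terms of the quantities defined below.]
where $p_i$ is the sample under test, $p_j$ are the N training samples, $\mathrm{Dis}(p_i)$ is the distance from sample $p_i$ to the centroid, $\overline{\mathrm{Dis}(p_j)}$ is the mean of the distances from the N training samples $p_j$ to the centroid, and $\mathrm{var}(\mathrm{Dis}(p_j))$ is the variance of those distances.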
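A distance-to-centroid score of this kind can be sketched as below. Formula (3) itself did not survive extraction, so the normalisation used here (subtract the mean training distance, divide by the variance) is only one possible reading of the symbol definitions and should be treated as an assumption. The function names and the choice of the first k points as initial centroids are also hypothetical.

```python
import math
from statistics import mean, pvariance

def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids (assumed init)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Recompute each centroid as the per-dimension mean of its cluster.
        centroids = [
            tuple(mean(col) for col in zip(*cl)) if cl else centroids[c]
            for c, cl in enumerate(clusters)
        ]
    return centroids

def centroid_distance(p, centroids):
    """Dis(p): distance from p to its nearest centroid."""
    return min(math.dist(p, c) for c in centroids)

def normalised_score(p, train, centroids):
    """Assumed normalisation: (Dis(p) - mean training distance) / variance."""
    train_d = [centroid_distance(q, centroids) for q in train]
    var = pvariance(train_d)
    if var == 0:
        raise ValueError("training distances have zero variance")
    return (centroid_distance(p, centroids) - mean(train_d)) / var
```

With two well separated training clusters, a test point lying between them scores higher than a point sitting on a centroid, matching the statement that a larger distance means a higher anomaly degree.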
可以理解,本申请实施例中所训练的异常检测模型不限于上述三种异常检测模型,而可以为任意其他类型的异常检测模型。
训练设备210在通过图3所示方法得到多个异常检测模型之后,将该多个异常检测模型发送给存储系统120,从而存储系统120可通过该多个异常检测模型预测硬盘的健康度。具体是,训练设备210可将各个异常检测模型中包括的模型结构和模型参数发送给存储系统120。例如,对于上述iforest异常检测模型,训练设备210可将该模型中包括的各个孤立树中的节点构成、各个非叶子节点对应的SMART指标和分裂值、训练样本数ψ等模型数据发送给存储系统120,从而存储系统120可通过该模型数据进行对iforest异常检测模型的使用。
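FIG. 4 is a flowchart of a hard disk health assessment method according to an embodiment of this application. The method may be performed by the storage system 120 in FIG. 2 and includes:
Step S401: obtain data of multiple health-related indicators of a hard disk at a specified usage time;
Step S402: input the data into multiple different anomaly detection models;
Step S403: determine the health degree of the hard disk at the specified usage time according to the outputs of the multiple anomaly detection models.
The steps of the method shown in FIG. 4 are described in detail below.
First, in step S401, data of multiple health-related indicators of the hard disk at the specified usage time is obtained.
Each time the storage system 120 collects a set of SMART data of a hard disk corresponding to the current usage time, it may select from that set the data of multiple health-related indicators of the hard disk at the current usage time. After obtaining the data of these indicators, the storage system 120 may also preprocess it as described above, for example to obtain data that is evenly distributed in time. The storage system 120 is not limited to selecting the indicator data immediately after collecting the SMART data; at any time it may select, from previously collected SMART data, the indicator data of a specified usage time in order to predict the health degree of the hard disk at that time.
In step S402, the data is input into multiple different anomaly detection models.
As shown in FIG. 2, the storage system 120 may input the data of the multiple indicators of the hard disk 134 at the specified usage time, selected in the previous step, into the multiple anomaly detection models, and thereby obtain the anomaly degree that each model outputs for the data.
In step S403, the health degree of the hard disk at the specified usage time is determined according to the outputs of the multiple anomaly detection models.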
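As the description of FIG. 3 shows, by choosing the training samples, the anomaly degree that each anomaly detection model outputs for a hard disk can be made to correlate with the disk's health degree, so that the health degree of the hard disk at a specified usage time can be determined from the anomaly degrees the models output for that time. For example, when the anomaly degree output by an anomaly detection model is positively correlated with the health degree, the storage system 120 may use the anomaly degree output for the specified usage time directly as the health degree of the hard disk at that time. Alternatively, when the anomaly degree is negatively correlated with the health degree, the storage system 120 may, based on that correlation, convert the anomaly degree into a health degree or into a value positively correlated with the health degree.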
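In addition, each of the three trained anomaly detection models has its own strengths and weaknesses. The iForest model ensembles many binary trees, so the algorithm is robust, suits large data sets and parallel computation, and is insensitive to hyperparameters; its weakness is lower accuracy on data sets with certain special distributions, many anomalous samples, or high feature dimensionality. The LOF model makes few prior assumptions about the distribution of the raw data and is strong at identifying local anomalies; its weaknesses are a high computational cost, which makes it unsuitable for large data sets, and high sensitivity to hyperparameters. The K-means model is simple and intuitive and adapts to some degree to both local and global anomalies; its weaknesses are that the algorithm was designed for clustering, is better suited to data sets with a spherical distribution, and is relatively sensitive to hyperparameters.
In the embodiments of this application, after inputting the data of the multiple indicators of a hard disk at a specified usage time into each of the anomaly detection models, the storage system 120 may fuse the outputs of the models and derive the health degree of the hard disk from the fused result. This balances out the weaknesses of the individual models and yields a stable, smooth curve of health degree over time.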
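Specifically, when the output of each anomaly detection model is positively correlated with the health degree, the storage system 120 may compute a weighted sum of the model outputs as shown in formula (4) and use the result as the health degree of the hard disk:
$$\mathrm{Score} = \sum_{i=1}^{M} a_i \cdot \mathrm{Score}_i \qquad (4)$$
where $\mathrm{Score}_i$ is the output of each anomaly detection model, $a_i$ is the weight of each model, M is the number of models, and Score is the health degree of the hard disk obtained by fusing the model outputs. When the model outputs are negatively correlated with the health degree, $\mathrm{Score}_i$ in formula (4) may be a value positively correlated with the health degree obtained by converting the output of each model.
In one implementation, assuming that the outputs of the models correlate with hard disk health to roughly the same degree, the weights $a_i$ in formula (4) may be set equal, so that formula (4) becomes formula (5):
$$\mathrm{Score} = \frac{1}{M}\sum_{i=1}^{M} \mathrm{Score}_i \qquad (5)$$
That is, the outputs of the multiple anomaly detection models are averaged.
In another implementation, the degree to which each model's output correlates with hard disk health may be determined in advance from characteristics such as the dimensionality of the model input data, the proportion of anomalous data, and the data distribution, together with the properties of each model, and the weights $a_i$ in formula (4) are set accordingly. For example, when the distribution of a hard disk's SMART data over multiple usage times is close to spherical, the properties of the models indicate that the K-means model predicts more accurately, so its weight may be set higher. When the hard disk to be predicted has a large amount of data with many dimensions, the iForest model is better suited, so its weight may be set higher.
In yet another implementation, when predicting for different hard disks, the weights $a_i$ in formula (4) may be adjusted dynamically according to characteristics of each disk, such as the proportion of anomalous data and the data distribution, to improve prediction accuracy.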
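The fusion of formulas (4) and (5) amounts to a weighted sum with an equal-weight default. The sketch below is illustrative: the function name `fuse_scores` is hypothetical, and the model outputs are assumed to be already positively correlated with health, as the text requires.

```python
def fuse_scores(scores, weights=None):
    """Fuse per-model outputs into one health degree.

    scores  : outputs Score_i of the anomaly detection models
    weights : weights a_i; when omitted, equal weights are used (formula (5))
    """
    if not scores:
        raise ValueError("at least one model output is required")
    if weights is None:
        weights = [1.0 / len(scores)] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("one weight per model output is required")
    # Formula (4): weighted sum of the model outputs.
    return sum(a * s for a, s in zip(weights, scores))
```

For instance, outputs of 0.8, 0.6 and 0.7 fuse to about 0.7 with equal weights, and to about 0.725 when the first model is given weight 0.5 and the others 0.25 each.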
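The storage system 120 can determine the health degree of each hard disk at each usage time with the method shown in FIG. 4. After a hard disk fails, the storage system 120 can use its health degrees at the various usage times to build a complete health degree curve from first use to failure (end of life). As described below with reference to FIG. 7, the health degree curve of the failed disk can then serve as a benchmarking health degree curve (or comparison health degree curve) for predicting the future health degree and lifespan of other hard disks (target disks). A failed hard disk that provides a benchmarking health degree curve is referred to below as a benchmarking disk.
Specifically, after obtaining the health degrees of each benchmarking disk at each usage time, the storage system 120 may apply Gaussian smoothing to the health degree data, because the data contains fluctuations caused by inaccurate measurement or measurement noise. A time length is set for the smoothing window, smoothing weights are assigned empirically to the samples at each usage time within the window, the window is slid over the disk's health degree data along the usage time axis, and the health degree data inside the window is modified according to the smoothing weights, which smooths the data. Table 1 shows an example smoothing window:
Table 1
[Table 1 is an image in the source. Weights stated in the text: day 1, 2.28%; day 2, 13.59%; day 3, 68.27%.]
As shown in Table 1, suppose the sliding window spans 5 days of usage time, with the weight of the data on day 1 of the window set to 2.28%, on day 2 to 13.59%, on day 3 to 68.27%, and so on. Table 1 shows that Gaussian smoothing assumes the center point of each segment of the health degree curve is most closely related to the smoothed result and therefore has the highest weight; as the distance from the center point grows, the relation weakens and the weight decreases. The storage system 120 can then plot the health degree curve of the benchmarking disk from the smoothed health degrees at each usage time.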
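A sliding-window smoother of this kind might look like the sketch below. The weights for days 1 to 3 are the ones given in the text; the weights for days 4 and 5 are assumed to mirror days 2 and 1, and renormalising the weights at the ends of the series is also an assumption, since the source does not say how edges are handled.

```python
# Days 1-3 come from the text; days 4-5 are mirrored (assumption).
WINDOW = [0.0228, 0.1359, 0.6827, 0.1359, 0.0228]

def smooth(health, window=WINDOW):
    """Weighted moving average centred on each day of a daily health series."""
    half = len(window) // 2
    out = []
    for i in range(len(health)):
        num = den = 0.0
        for offset, w in enumerate(window, start=-half):
            j = i + offset
            if 0 <= j < len(health):           # skip positions outside the series
                num += w * health[j]
                den += w
        out.append(num / den)                  # renormalise at the edges (assumption)
    return out
```

A constant series is left unchanged, while a one-day dip to 0 in a series of 1s is pulled up to roughly 0.32, which is the noise-damping effect described above.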
图5为通过上述过程获取的各个硬盘的健康度曲线的示意图。图6为图5中的一个健康度曲线的放大图。如图6所示,在该健康度曲线中,坐标的横轴表示从硬盘启用时开始计时的时间(例如时间单位为“天”),坐标的纵轴表示硬盘的健康度。从该多个健康度曲线可以看出,通过本申请实施例提供的方法获取的硬盘的健康度基本上随着时间平滑而稳定的下降。
存储系统120在获取各个对标盘的从启用到发生故障的完整健康度曲线之后,可将这些健康度曲线加入到对标数据集中,以用于预测正在使用的硬盘在未来时间的健康度和寿命。可以理解,存储系统120不限于如上文所述获取自身包括的故障硬盘的健康度曲线,例如,存储系统120可从其他存储系统接收该其他存储系统中的故障硬盘的健康度曲线,并将该健康度曲线加入到对标数据集中。
存储系统120通过同样的过程可获取存储系统120中正在使用的硬盘134所使用 的一段时间的健康度曲线,该一段时间例如为从硬盘134启用开始到当前的一段时间,存储系统120可通过将硬盘134的部分健康度曲线与上述对标数据集中的对标健康度曲线进行比对,从而预测硬盘134在未来时间的健康度和寿命。
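FIG. 7 is a flowchart of a method for predicting hard disk health degree according to an embodiment of this application. The method may be performed by the storage system 120 in FIG. 2 and includes the following steps:
Step S701: in the benchmarking data set, select a benchmarking health degree curve according to the similarity between the health degrees at corresponding usage times in the benchmarking health degree curves and in the partial health degree curve of the hard disk 134 to be predicted;
Step S702: fit a mapping relationship between the health degrees in the selected benchmarking health degree curve and the health degrees at the corresponding usage times in the partial health degree curve of the hard disk 134 to be predicted;
Step S703: predict the health degree of the hard disk 134 at a future time according to the selected benchmarking health degree curve and the mapping relationship.
The steps of the method shown in FIG. 7 are described in detail below.
First, in step S701, a benchmarking health degree curve is selected in the benchmarking data set according to the similarity between the health degrees at corresponding usage times in the benchmarking health degree curves and in the partial health degree curve of the hard disk 134 to be predicted.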
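After the storage system 120 has obtained the partial health degree curve of the hard disk 134 and the multiple benchmarking health degree curves in the benchmarking data set as described above, the CPU 123 may compute the similarity between the partial curve and each benchmarking curve. The CPU 123 may compute the Euclidean distance between the health degrees in the partial curve of the hard disk 134 and the health degrees at the corresponding usage times in each benchmarking curve, and derive from it the similarity between the partial curve and each benchmarking curve.
Specifically, the partial health degree curve of the hard disk 134 and any benchmarking health degree curve in the data set will usually each contain a segment that resembles the other. If the two similar segments are aligned in time, the health degrees at corresponding usage times are simply the health degrees at the same usage time in the two segments.
In some practical scenarios, the two similar segments may not be aligned on the time (x) axis, that is, their time lengths differ. For example, the health degree of hard disks of different capacities may decay at different rates, so the similar segments in their health degree curves span different lengths of time. For a larger-capacity disk, each storage cell is less likely to be used, so its health may decay more slowly than that of a smaller disk, and the similar segment in its curve spans a longer time. The CPU 123 may therefore use dynamic time warping (DTW) to determine the health degrees at corresponding usage times in the two similar segments. Specifically, the CPU 123 shortens or stretches one of the two segments along the time axis so that they are aligned in time; after this processing, the health degrees at the same usage time in the two aligned segments are the health degrees at corresponding usage times. Having obtained them, the CPU 123 may compute the Euclidean distance between these health degrees and thereby the similarity between the two segments, which is the similarity of the health degrees at corresponding usage times between the partial curve of the hard disk 134 and the benchmarking curve.
After the similarity between the partial health degree curve of the hard disk 134 and each benchmarking health degree curve in the data set has been computed, the one or more benchmarking health degree curves with the highest similarity may be selected from the benchmarking data set.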
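The time alignment can be illustrated with the classic dynamic-programming form of DTW. This is a generic textbook sketch rather than the patent's procedure: the function name `dtw_path` is hypothetical, and the absolute difference of health degrees is assumed as the per-point cost.

```python
def dtw_path(a, b):
    """Align two health series; return (total cost, list of index pairs (i, j))."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])       # assumed per-point cost
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover which usage times correspond.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path
```

If one curve is the other stretched to twice the duration, the total cost is 0 and each day of the shorter curve is paired with two days of the longer one. A lower cost can then be treated as a higher similarity when ranking benchmarking curves.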
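In step S702, a mapping relationship is fitted between the health degrees in the selected benchmarking health degree curve and the health degrees at the corresponding usage times in the partial health degree curve of the hard disk 134 to be predicted.
The health degree at usage time t can be taken from the benchmarking health degree curve, and the health degree at the corresponding usage time t' from the partial health degree curve of the hard disk 134; the two form one training sample. Multiple training samples corresponding to multiple usage-time pairs (t, t') can be obtained in this way and used to train a regression model that fits the mapping between the health degree x at usage time t in the benchmarking curve and the health degree y at the corresponding usage time t' in the curve of the hard disk 134. As described above, when the two similar segments in the benchmarking curve and the partial curve of the hard disk 134 are aligned in time, or have been aligned by DTW, t and t' are the same time. The regression model is, for example, the linear regression model of formula (6):
y = a + b·x    (6)
where a and b are coefficients to be determined by training on the samples, for example by the least squares method. The regression model in the embodiments of this application is not limited to a linear regression model and may be any other form of regression model, such as a polynomial regression model.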
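The least squares fit of formula (6) has a closed form for a single input variable. The sketch below is illustrative: the function name `fit_line` and the sample values are made up, with xs standing for benchmarking-disk health degrees and ys for target-disk health degrees at corresponding usage times.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("benchmarking health degrees are all equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                              # slope
    a = mean_y - b * mean_x                    # intercept
    return a, b
```

For perfectly linear data the fit recovers the generating coefficients; with real health curves the residual scatter indicates how well the benchmarking disk explains the target disk.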
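In step S703, the health degree of the hard disk 134 at a future time is predicted according to the selected benchmarking health degree curve and the mapping relationship.
After the mapping between the health degrees in the partial curve of the hard disk 134 and those at corresponding usage times in the benchmarking curve has been fitted, the health degree of the hard disk 134 at a future time t1 can be predicted from the mapping and the benchmarking curve. Specifically, to predict the health degree y1 of the hard disk 134 at a future time t1, the health degree x1 at the usage time t2 corresponding to t1 is taken from the benchmarking curve and substituted into formula (6), giving the predicted health degree y1 = a + b·x1. As above, when the two similar segments in the partial curve of the hard disk 134 and the benchmarking curve are aligned in time, t1 and t2 may be the same time; when they are not aligned, the time t2 corresponding to t1 can be determined with DTW.
When multiple (for example, m) benchmarking health degree curves were selected in the preceding steps, m health degrees $y_i$ (i from 1 to m) can similarly be computed from the m benchmarking curves and their mappings, and a weighted sum of the m health degrees can be taken as shown in formula (7) to obtain the health degree Y1 of the hard disk 134:
$$Y1 = \sum_{i=1}^{m} k_i \cdot y_i \qquad (7)$$
where $k_i$ is the preset weight of each benchmarking health degree curve; for example, the weights may be determined by ranking the benchmarking curves by their similarity to the partial health degree curve of the hard disk 134.
The prediction process above shows that the regression model trained as described transfers the way health degree changes over time in the benchmarking curve onto the health degree curve of the target disk (the hard disk 134). The regression model thus transfers knowledge and may also be called a transfer model.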
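Combining formula (6) with the weighted sum over several benchmarking curves can be sketched as below. The function name `predict_health` and the tuple layout are hypothetical; each entry holds the fitted coefficients a and b for one benchmarking curve and that curve's health degree x1 at the time corresponding to t1.

```python
def predict_health(benchmarks, weights):
    """Predict the target disk's health at one future time.

    benchmarks : list of (a, b, x1) tuples, one per selected benchmarking curve
    weights    : preset weights k_i, one per benchmarking curve
    """
    if len(benchmarks) != len(weights):
        raise ValueError("one weight per benchmarking curve is required")
    # Formula (6) applied per curve: y_i = a + b * x1.
    preds = [a + b * x1 for a, b, x1 in benchmarks]
    # Weighted sum of the per-curve predictions.
    return sum(k * y for k, y in zip(weights, preds))
```

With two curves predicting 0.6 and 0.7 and equal weights of 0.5, the combined prediction is about 0.65.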
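With the method shown in FIG. 7, the storage system 120 can predict the health degree of the hard disk 134 at multiple future times, for example for each future day, and thereby predict the future health degree curve of the hard disk 134. The storage system 120 may preset a health degree threshold for the hard disk 134 that corresponds to the health degree of the disk at failure. The storage system can then determine, in the predicted future health degree curve, the time at which the health degree of the hard disk 134 reaches the threshold, and determine the remaining life of the disk from that time. FIG. 8 is a schematic diagram of a predicted hard disk health degree curve according to an embodiment of this application. In FIG. 8, the horizontal axis is the time counted from when the hard disk was first used, and the vertical axis is the health degree. Suppose the lower line connecting points in FIG. 8 is the health degree curve C1 of the benchmarking disk and the upper one is the health degree curve C2 of the hard disk 134 to be predicted; the solid part of C2 is the health degree curve determined from the disk's own SMART data over a period after first use, and the dashed part of C2 is the future health degree curve predicted with the method of FIG. 7. For example, to predict the health degree at time t1 on curve C2, the time t2 on curve C1 corresponding to t1 is first determined, the health degree x1 at t2 on curve C1 is obtained, and x1 is substituted into formula (6) to compute the health degree y1 at t1 on curve C2. When the health degree of the hard disk 134 is predicted from multiple selected benchmarking curves, the health degree Y1 can also be computed with formula (7).
As shown in FIG. 8, suppose a health degree threshold is set on curve C2 that corresponds to the time at which the life of the hard disk 134 to be predicted ends. After the future health degree curve of the hard disk 134 has been predicted as described above, the time t3 corresponding to the threshold can be determined on curve C1 and regarded as the end-of-life time of the hard disk 134.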
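Finding the end-of-life time from a predicted curve reduces to locating the first point at or below the threshold. The sketch below is illustrative: the function name `end_of_life` is hypothetical, and "reaches the threshold" is assumed to mean that the health degree, which declines over time, becomes less than or equal to it.

```python
def end_of_life(predicted, threshold):
    """Return the first time whose predicted health is at or below the threshold.

    predicted : list of (time, health) pairs sorted by time
    threshold : health degree that corresponds to failure
    Returns None when the curve never reaches the threshold.
    """
    for t, health in predicted:
        if health <= threshold:                # assumed meaning of "reaches"
            return t
    return None
```

Subtracting the current usage time from the returned value gives the remaining life mentioned in the text; a result of None means the prediction horizon was too short to reach the threshold.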
可以理解,图7所示的方法仅仅为本申请实施例中用于预测硬盘在未来时间的健康度和寿命的一种实施方式,本申请实施例不限于此。例如,在另一种实施方式中,存储系统120在选取与待预测的硬盘134对应的对标健康度曲线之后,在预测硬盘134在未来的时间t1的健康度时,可直接以该对标健康度曲线中与时间t1对应的时间t2的健康度作为硬盘134在未来的时间t1的健康度。
图9为本申请实施例提供的存储设备的架构图,所述存储设备可用于执行图3、图4或图7所示的任一方法,所述存储设备包括:
获取单元91,用于获取硬盘在指定使用时间的与健康度相关的多个指标的数据;
输入单元92,用于将所述数据输入多个不同的模型;
确定单元93,用于根据所述多个模型的输出确定所述硬盘在所述指定使用时间的健康度。
在一种实施方式中,所述确定单元93具体用于,基于所述多个模型的输出的加权和确定所述硬盘在所述指定使用时间的健康度。
在一种实施方式中,所述多个模型中每个模型基于异常检测算法训练得到,且每个模型所采用的异常检测算法不同。
在一种实施方式中,所述多个模型的数量为三个,所述三个模型所采用的异常检测算法分别为孤立森林算法、局部异常因子算法、K均值聚类算法。
在一种实施方式中,所述多个模型由训练设备发送给所述存储设备或者由所述存储设备训练得到,所述训练设备用于训练所述多个模型。
在一种实施方式中,所述多个模型通过以下采样数据进行训练:故障硬盘在寿命结束之前的预设时段内的与健康度相关的多个指标的采样数据。
在一种实施方式中,所述硬盘包括目标盘,所述存储设备还包括:
获取或生成单元,用于获取或生成多个对标盘的第一数据集,所述第一数据集包括所述对标盘在多个使用时间的健康度;
生成单元,用于生成所述目标盘的第二数据集,所述第二数据集包括所述目标盘在多个使用时间的健康度,所述第一数据集中的多个使用时间的时间跨度大于所述第二数据集中的多个使用时间的时间跨度;
选取单元,用于根据所述第一数据集与所述第二数据集对应的多个使用时间的健康度的相似度,选取对标盘;
预测单元,用于根据所述选取的对标盘的第一数据集,预测所述目标盘在未来的指定时间的健康度。
在一种实施方式中,所述预测单元具体用于:拟合所述选取的对标盘的第一使用时间的健康度与所述目标盘的第二使用时间的健康度之间的映射关系,所述第一使用时间与所述第二使用时间为对应的时间;根据所述映射关系和所述第一数据集预测所述目标盘在未来多个指定时间的健康度。
在一种实施方式中,所述确定单元93还用于,根据所预测的所述目标盘在未来多个指定时间的健康度,确定所述目标盘的健康度达到阈值的时间,将健康度达到阈值的时间作为所述目标盘的寿命结束时间。
本申请第三方面提供一种存储设备,包括处理器和存储器,所述存储器中存储有可执行计算机程序指令,所述处理器执行所述可执行计算机程序指令以用于执行图3、图4或图7所示的任一方法。
本申请第四方面提供一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储有计算机程序指令,当所述计算机程序指令在计算机或处理器中执行时,使得所述计算机或处理器执行图3、图4或图7所示的任一方法。
本申请第五方面提供一种计算机程序产品,包括计算机程序指令,当所述计算机程序指令在计算机或处理器中运行时,使得所述计算机或处理器执行第一方面或第一方面可能的实现方式所述的方法。
需要理解,本文中的“第一”,“第二”等描述,仅仅为了描述的简单而对相似概念进行区分,并不具有其他限定作用。
本领域的技术人员可以清楚地了解到,本申请提供的各实施例的描述可以相互参照,为描述的方便和简洁,例如关于本申请实施例提供的各装置、设备的功能以及执行的步骤可以参照本申请方法实施例的相关描述,各方法实施例之间、各装置实施例之间也可以互相参照。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本发明实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置和方法,在没有超过本申请的范围内,可以通过其他的方式实现。例如,以上所描述的实施例仅仅是示 意性的,例如,所述模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。本领域普通技术人员在不付出创造性劳动的情况下,即可以理解并实施。
另外,所描述装置和方法以及不同实施例的示意图,在不超出本申请的范围内,可以与其它系统,模块,技术或方法结合或集成。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电子、机械或其它的形式。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应所述以权利要求的保护范围为准。

Claims (20)

  1. 一种硬盘健康评估方法,其特征在于,所述方法由存储设备执行,包括:
    获取硬盘在指定使用时间的与健康度相关的多个指标的数据;
    将所述数据输入多个不同的模型;
    根据所述多个模型的输出确定所述硬盘在所述指定使用时间的健康度。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述多个模型的输出确定所述硬盘在所述指定使用时间的健康度具体包括,基于所述多个模型的输出的加权和确定所述硬盘在所述指定使用时间的健康度。
  3. 根据权利要求1或2所述的方法,其特征在于,所述多个模型中每个模型基于异常检测算法训练得到,且每个模型所采用的异常检测算法不同。
  4. 根据权利要求1-3任一项所述的方法,其特征在于,所述多个模型的数量为三个,所述三个模型所采用的异常检测算法分别为孤立森林算法、局部异常因子算法、K均值聚类算法。
  5. 根据权利要求1-4任一项所述的方法,其特征在于,所述多个模型由训练设备发送给所述存储设备或者由所述存储设备训练得到,所述训练设备用于训练所述多个模型。
  6. 根据权利要求5所述的方法,其特征在于,所述多个模型通过以下采样数据进行训练:故障硬盘在寿命结束之前的预设使用时段内的与健康度相关的多个指标的采样数据。
  7. 根据权利要求1-6任一项所述的方法,所述硬盘包括目标盘,其特征在于,所述方法还包括:
    获取或生成多个对标盘的第一数据集,所述第一数据集包括所述对标盘在多个使用时间的健康度;
    生成所述目标盘的第二数据集,所述第二数据集包括所述目标盘在多个使用时间的健康度,所述第一数据集中的多个使用时间的时间跨度大于所述第二数据集中的多个使用时间的时间跨度;
    根据所述第一数据集与所述第二数据集对应的多个使用时间的健康度的相似度,选取对标盘;
    根据所述选取的对标盘的第一数据集,预测所述目标盘在未来的指定时间的健康度。
  8. 根据权利要求7所述的方法,其特征在于,所述根据所述选取的对标盘的第一数据集,预测所述目标盘在未来的指定时间的健康度包括:
    拟合所述选取的对标盘的第一使用时间的健康度与所述目标盘的第二使用时间的健康度之间的映射关系,所述第一使用时间与所述第二使用时间为对应的时间;根据所述映射关系和所述第一数据集预测所述目标盘在未来多个指定时间的健康度。
  9. 根据权利要求8所述的方法,其特征在于,所述方法还包括,根据所预测的所述目标盘在未来多个指定时间的健康度,确定所述目标盘的健康度达到阈值的时间,将健康度达到阈值的时间作为所述目标盘的寿命结束时间。
  10. 一种存储设备,其特征在于,所述存储设备包括:
    获取单元,用于获取硬盘在指定使用时间的与健康度相关的多个指标的数据;
    输入单元,用于将所述数据输入多个不同的模型;
    确定单元,用于根据所述多个模型的输出确定所述硬盘在所述指定使用时间的健康度。
  11. 根据权利要求10所述的存储设备,其特征在于,所述确定单元具体用于,基于所述多个模型的输出的加权和确定所述硬盘在所述指定使用时间的健康度。
  12. 根据权利要求10或11所述的存储设备,其特征在于,所述多个模型中每个模型基于异常检测算法训练得到,且每个模型所采用的异常检测算法不同。
  13. 根据权利要求10-12任一项所述的存储设备,其特征在于,所述多个模型的数量为三个,所述三个模型所采用的异常检测算法分别为孤立森林算法、局部异常因子算法、K均值聚类算法。
  14. 根据权利要求10-13任一项所述的存储设备,其特征在于,所述多个模型由训练设备发送给所述存储设备或者由所述存储设备训练得到,所述训练设备用于训练所述多个模型。
  15. 根据权利要求14所述的存储设备,其特征在于,所述多个模型通过以下采样数据进行训练:故障硬盘在寿命结束之前的预设使用时段内的与健康度相关的多个指标的采样数据。
  16. 根据权利要求10-15任一项所述的存储设备,所述硬盘包括目标盘,其特征在于,所述存储设备还包括:
    获取或生成单元,用于获取或生成多个对标盘的第一数据集,所述第一数据集包括所述对标盘在多个使用时间的健康度;
    生成单元,用于生成所述目标盘的第二数据集,所述第二数据集包括所述目标盘在多个使用时间的健康度,所述第一数据集中的多个使用时间的时间跨度大于所述第二数据集中的多个使用时间的时间跨度;
    选取单元,用于根据所述第一数据集与所述第二数据集对应的多个使用时间的健康度的相似度,选取对标盘;
    预测单元,用于根据所述选取的对标盘的第一数据集,预测所述目标盘在未来的指定时间的健康度。
  17. 根据权利要求16所述的存储设备,其特征在于,所述预测单元具体用于:
    拟合所述选取的对标盘的第一使用时间的健康度与所述目标盘的第二使用时间的健康度之间的映射关系,所述第一使用时间与所述第二使用时间为对应的时间;根据所述映射关系和所述第一数据集预测所述目标盘在未来多个指定时间的健康度。
  18. 根据权利要求17所述的存储设备,其特征在于,所述确定单元还用于,根据所预测的所述目标盘在未来多个指定时间的健康度,确定所述目标盘的健康度达到阈值的时间,将健康度达到阈值的时间作为所述目标盘的寿命结束时间。
  19. 一种存储设备,其特征在于,包括处理器和存储器,所述存储器中存储有可执行计算机程序指令,所述处理器执行所述可执行计算机程序指令以实现权利要求1-9任意一项所述的方法。
  20. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储有计 算机程序指令,当所述计算机程序指令在计算机或处理器中执行时,使得所述计算机或处理器执行权利要求1-9中任一项的所述的方法。
PCT/CN2021/118513 2021-04-26 2021-09-15 一种硬盘健康评估方法和存储设备 WO2022227373A1 (zh)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202110453844.1 2021-04-26
CN202110453844 2021-04-26
CN202110812127.3A CN115248757A (zh) 2021-04-26 2021-07-16 Hard disk health assessment method and storage device
CN202110812127.3 2021-07-16

Publications (1)

Publication Number Publication Date
WO2022227373A1 (zh) 2022-11-03

Family

ID=83697123

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/118513 WO2022227373A1 (zh) 2021-04-26 2021-09-15 Hard disk health assessment method and storage device

Country Status (2)

Country Link
CN (1) CN115248757A (zh)
WO (1) WO2022227373A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
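CN115774652A (zh) * 2023-02-13 2023-03-10 浪潮通用软件有限公司 Clustering-algorithm-based health monitoring method, device, and medium for group-controlled equipment
CN117407661A (zh) * 2023-12-14 2024-01-16 深圳前海慧联科技发展有限公司 Data augmentation method for equipment state detection
CN117520104A (zh) * 2024-01-08 2024-02-06 中国民航大学 System for predicting abnormal hard disk states
CN117573420A (zh) * 2024-01-16 2024-02-20 武汉麓谷科技有限公司 Method for saving data on power loss in a ZNS solid-state drive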

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150277797A1 (en) * 2014-03-31 2015-10-01 Emc Corporation Monitoring health condition of a hard disk
US20160125959A1 (en) * 2014-10-31 2016-05-05 Infineon Technologies Ag Health state of non-volatile memory
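CN108845760A (zh) * 2018-05-28 2018-11-20 郑州云海信息技术有限公司 Hard disk maintenance method, apparatus, device, and readable storage medium
CN110119344A (zh) * 2019-04-10 2019-08-13 河南文正电子数据处理有限公司 Hard disk health state analysis method based on S.M.A.R.T. parameters
CN111966569A (zh) * 2019-05-20 2020-11-20 中国电信股份有限公司 Hard disk health degree assessment method and apparatus, and computer-readable storage medium
CN112214369A (zh) * 2020-10-23 2021-01-12 华中科技大学 Method for building a hard disk failure prediction model based on model fusion, and application thereof
CN112364567A (zh) * 2020-11-18 2021-02-12 浙江大学 Remaining-life prediction method based on consistency testing of degradation trajectory similarity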

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
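CN115774652A (zh) * 2023-02-13 2023-03-10 浪潮通用软件有限公司 Clustering-algorithm-based health monitoring method, device, and medium for group-controlled equipment
WO2024169123A1 (zh) * 2023-02-13 2024-08-22 浪潮通用软件有限公司 Clustering-algorithm-based health monitoring method, device, and medium for group-controlled equipment
CN117407661A (zh) * 2023-12-14 2024-01-16 深圳前海慧联科技发展有限公司 Data augmentation method for equipment state detection
CN117407661B (zh) * 2023-12-14 2024-02-27 深圳前海慧联科技发展有限公司 Data augmentation method for equipment state detection
CN117520104A (zh) * 2024-01-08 2024-02-06 中国民航大学 System for predicting abnormal hard disk states
CN117520104B (zh) * 2024-01-08 2024-03-29 中国民航大学 System for predicting abnormal hard disk states
CN117573420A (zh) * 2024-01-16 2024-02-20 武汉麓谷科技有限公司 Method for saving data on power loss in a ZNS solid-state drive
CN117573420B (zh) * 2024-01-16 2024-06-04 武汉麓谷科技有限公司 Method for saving data on power loss in a ZNS solid-state drive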

Also Published As

Publication number Publication date
CN115248757A (zh) 2022-10-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21938837

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21938837

Country of ref document: EP

Kind code of ref document: A1