WO2016101786A1 - Method and apparatus for predicting failure of a non-volatile storage medium - Google Patents

Method and apparatus for predicting failure of a non-volatile storage medium

Info

Publication number
WO2016101786A1
Authority
WO
WIPO (PCT)
Prior art keywords
volatile storage
storage media
preset fault
storage medium
fault threshold
Prior art date
Application number
PCT/CN2015/096690
Other languages
English (en)
French (fr)
Inventor
孔伟康
李定
李强
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2016101786A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16Z INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS, NOT OTHERWISE PROVIDED FOR
    • G16Z99/00 Subject matter not provided for in other main groups of this subclass

Definitions

  • The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for predicting failure of a non-volatile storage medium.
  • Hard disks are still the most commonly used medium for storing data and are widely used in data centers of all kinds. Predicting hard disk failures has therefore become an important means of ensuring data reliability, and has gradually become an important part of data center management software.
  • The data center monitors the operating condition of each hard disk; when a hard disk has failed or is about to fail, it raises an alarm for the hard disk, isolates it, and then starts data reconstruction.
  • Current DFP (Disk Failure Prediction) technology determines whether certain indicators of a hard disk reach preset thresholds. If an indicator does not meet the standard, an alarm is issued and the hard disk is considered about to fail. To reduce the return-for-repair rate, hard disk manufacturers generally set a very low alarm threshold, which results in a very low overall failure prediction rate; predictions that rely on the manufacturer's alarm threshold are therefore of low accuracy. To improve the accuracy of predicting hard disk failures, the data center that uses the hard disks resets the alarm threshold.
  • In the above method, all hard disks in the data center share the same alarm threshold, but their conditions differ: some hard disks have been in use for a long time and others for a short time. The above method therefore still suffers from low accuracy.
  • The embodiments of the invention provide a method and an apparatus for predicting failure of a non-volatile storage medium, to solve the defect in the prior art that the accuracy of predicting hard disk failures is low.
  • a method for predicting a failure of a non-volatile storage medium comprising:
  • the initial preset fault thresholds corresponding to any two non-volatile storage media having different status values are different.
  • the method further includes:
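  • when the total number of non-volatile storage media predicted to fail is less than or equal to the number of hot standby non-volatile storage media of the data center, hot standby non-volatile storage media take over the work of all the non-volatile storage media predicted to fail;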
  • the number of hot spare non-volatile storage media that replaces all of the predicted non-volatile storage media that will fail is the same as the total number of all non-volatile storage media.
  • the method further includes:
  • for any two non-volatile storage media among all the non-volatile storage media of the data center, the initial preset fault thresholds respectively corresponding to the two are reduced by the same magnitude.
  • after the hot standby non-volatile storage media are used to take over from all non-volatile storage media predicted to fail according to their respective first preset fault thresholds, the method further includes:
  • when the status value of any non-volatile storage medium is smaller than the second preset fault threshold corresponding to that non-volatile storage medium, it is predicted that the non-volatile storage medium will fail;
  • the supplemented preset number of hot standby non-volatile storage media are used to take over from all non-volatile storage media predicted to fail according to their respective second preset fault thresholds.
  • the second preset fault threshold is less than or equal to an initial preset fault threshold of the corresponding non-volatile storage medium for each non-volatile storage medium corresponding to the second preset fault threshold.
  • an apparatus for predicting a failure of a non-volatile storage medium including:
  • a computing unit configured to calculate, for each of the at least two non-volatile storage media of the data center, a status value of the non-volatile storage medium, where the status value is used to characterize the operating condition of the non-volatile storage medium;
  • a predicting unit configured to predict that the non-volatile storage medium will fail when the status value is less than the initial preset fault threshold corresponding to the non-volatile storage medium;
  • the initial preset fault thresholds corresponding to any two non-volatile storage media having different status values are different.
  • the apparatus further includes a determining unit, a judging unit, and a takeover unit:
  • the determining unit is configured to determine the total number of all non-volatile storage media predicted to fail;
  • the judging unit is configured to judge whether the determined total number is less than or equal to the number of hot standby non-volatile storage media of the data center;
  • the takeover unit is configured to, when the judging unit judges that the determined total number is less than or equal to the number of hot standby non-volatile storage media of the data center, use hot standby non-volatile storage media to take over the work of all the non-volatile storage media predicted to fail;
  • the number of hot standby non-volatile storage media that take over the work of the non-volatile storage media predicted to fail is the same as the total number of non-volatile storage media predicted to fail.
  • the judging unit is further configured to: when judging that the determined total number is greater than the number of hot standby non-volatile storage media of the data center, for each non-volatile storage medium respectively, reduce the initial preset fault threshold corresponding to the non-volatile storage medium to obtain a first preset fault threshold;
  • the prediction unit is further configured to predict that the non-volatile storage medium will fail when its status value is less than the corresponding first preset fault threshold;
  • the takeover unit is further configured to, when the total number of all non-volatile storage media predicted to fail according to their respective first preset fault thresholds is less than or equal to the number of hot standby non-volatile storage media of the data center, use the hot standby non-volatile storage media to take over from all non-volatile storage media predicted to fail according to their respective first preset fault thresholds.
  • for any two non-volatile storage media among all the non-volatile storage media of the data center, the initial preset fault thresholds respectively corresponding to the two are reduced by the same magnitude.
  • the apparatus further includes a supplementing unit configured to supplement a preset number of hot standby non-volatile storage media, and to raise the first preset fault threshold corresponding to each non-volatile storage medium of the data center whose initial preset fault threshold was reduced, to obtain a second preset fault threshold;
  • the prediction unit is further configured to: for any non-volatile storage medium whose first preset fault threshold has been raised, predict that the non-volatile storage medium will fail when its status value is less than the corresponding second preset fault threshold;
  • the judging unit is further configured to judge whether the total number of all non-volatile storage media predicted to fail according to their respective second preset fault thresholds is less than or equal to the supplemented preset number of hot standby non-volatile storage media;
  • the takeover unit is further configured to: when the judging unit judges that the total number of all non-volatile storage media predicted to fail according to their respective second preset fault thresholds is less than or equal to the supplemented preset number of hot standby non-volatile storage media, use the supplemented hot standby non-volatile storage media to take over from all non-volatile storage media predicted to fail according to their respective second preset fault thresholds;
  • the second preset fault threshold is less than or equal to an initial preset fault threshold of the corresponding non-volatile storage medium for each non-volatile storage medium corresponding to the second preset fault threshold.
  • In the prior art, all non-volatile storage media in the data center share the same alarm threshold, but the operating conditions of different non-volatile storage media may differ; if all non-volatile storage media use the same alarm threshold, the accuracy of predicting which non-volatile storage media will fail is low.
  • In the embodiments of the present invention, non-volatile storage media having different status values correspond to different initial preset fault thresholds, that is, to different alarm thresholds, which improves the accuracy of predicting which non-volatile storage media will fail.
  • FIG. 1 is a flowchart of predicting a failure of a non-volatile storage medium in an embodiment of the present invention
  • FIG. 3A is a schematic structural diagram of an apparatus for predicting a failure of a nonvolatile storage medium according to an embodiment of the present invention
  • FIG. 3B is another schematic structural diagram of an apparatus for predicting a failure of a nonvolatile storage medium according to an embodiment of the present invention.
  • system and “network” are used interchangeably herein.
  • the term “and/or” herein merely describes an association between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate three cases: A exists alone, both A and B exist, and B exists alone.
  • the character “/” herein generally indicates an "or" relationship between the associated objects before and after it.
  • a process for predicting a failure of a non-volatile storage medium is as follows:
  • Step 100: Perform the following separately for each of at least two non-volatile storage media of the data center;
  • Step 110: Calculate a status value of the non-volatile storage medium, where the status value is used to characterize the operating condition of the non-volatile storage medium;
  • Step 120: When the status value is determined to be less than the initial preset fault threshold corresponding to the non-volatile storage medium, predict that the non-volatile storage medium will fail; any two non-volatile storage media having different status values correspond to different initial preset fault thresholds.
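  • Steps 100 to 120 can be illustrated with a minimal Python sketch. The `Medium` class, the `predict_failures` function, and all numeric values are hypothetical and are not taken from the patent; the sketch only assumes that each medium carries its own threshold and that a status value below that threshold means a predicted failure.

```python
from dataclasses import dataclass


@dataclass
class Medium:
    name: str
    status: float      # hypothetical health score; higher means healthier
    threshold: float   # initial preset fault threshold for this medium


def predict_failures(media):
    """Return the names of media whose status value is below their own threshold."""
    return [m.name for m in media if m.status < m.threshold]


# Illustrative values only: each medium has a different status and threshold.
media = [
    Medium("disk1", status=40.0, threshold=50.0),
    Medium("disk2", status=70.0, threshold=60.0),
    Medium("disk3", status=55.0, threshold=58.0),
]
predicted = predict_failures(media)
```

  • The point of the sketch is that the comparison is made per medium rather than against one threshold shared by the whole data center.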
  • A certain number of hot standby non-volatile storage media are reserved in the data center in advance. Therefore, in the embodiment of the present invention, after predicting that any non-volatile storage medium will fail, the following operations are also included:
  • hot standby non-volatile storage media among the hot standby non-volatile storage media of the data center are used to take over the work of all the non-volatile storage media predicted to fail;
  • the number of hot standby non-volatile storage media that take over the work of the non-volatile storage media predicted to fail is the same as the total number of non-volatile storage media predicted to fail.
  • For example, hot spare hard disks take over the work of hard disk 1 and hard disk 2.
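  • The takeover rule can be sketched as follows, assuming one hot spare is paired with each medium predicted to fail. The function name `assign_hot_spares` and the disk and spare names are hypothetical.

```python
def assign_hot_spares(predicted, hot_spares):
    """Pair each medium predicted to fail with one hot spare.

    Returns None when there are more predicted failures than hot spares,
    which is the case the text handles by lowering the thresholds.
    """
    if len(predicted) > len(hot_spares):
        return None
    # zip stops at the shorter list, so exactly len(predicted) spares are used.
    return dict(zip(predicted, hot_spares))


plan = assign_hot_spares(["hd1", "hd2"], ["spare1", "spare2", "spare3"])
too_many = assign_hot_spares(
    ["hd1", "hd2", "hd3", "hd4"], ["spare1", "spare2", "spare3"]
)
```

  • In this sketch the number of spares put to work always equals the number of media predicted to fail, matching the condition stated above.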
  • The total number of all non-volatile storage media predicted to fail may instead be greater than the number of hot standby non-volatile storage media in the data center. The operations performed in that case differ from those performed when the total number is less than or equal to the number of hot standby non-volatile storage media in the data center.
  • the specific implementation process is as follows:
  • All non-volatile storage media that are predicted to fail according to the corresponding first preset fault threshold are respectively taken over using the hot standby non-volatile storage medium.
  • For example, there are 10 hard disks in the data center, and 5 of them (hard disk 1, hard disk 2, hard disk 3, hard disk 4, and hard disk 5) are predicted to fail. If the data center has only 3 hot spare hard disks, the initial preset fault threshold corresponding to each of the 10 hard disks is reduced.
  • The initial preset fault thresholds of the 10 hard disks are X1, X2, X3, X4, X5, X6, X7, X8, X9, and X10.
  • The first preset fault thresholds obtained after the first reduction are Y1, Y2, Y3, Y4, Y5, Y6, Y7, Y8, Y9, and Y10, where each is smaller than the corresponding initial value (Y1 is smaller than X1, Y2 is smaller than X2, and so on up to Y10 and X10). The total number of hard disks predicted to fail is then determined according to the first preset fault thresholds.
  • If the predicted total number of failed hard disks is still greater than the number of hot spare hard disks, the first preset fault thresholds are lowered again, until the predicted total number of failed hard disks is less than or equal to the number of hot spare hard disks. The hot spare hard disks then take over all the hard disks predicted to fail under the final thresholds.
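  • The repeated lowering can be sketched as a loop. This is a minimal illustration, not the patent's implementation: the function name, the multiplicative `factor` of 0.9, the `max_rounds` guard, and all status and threshold values are assumptions made for the example.

```python
def lower_until_spares_suffice(status, thresholds, spares, factor=0.9, max_rounds=100):
    """Scale every threshold by `factor` until predicted failures fit the spares.

    `status` and `thresholds` map disk name -> value. Returns the disks
    predicted to fail under the final thresholds, and those thresholds.
    """
    if not 0 < factor < 1:
        raise ValueError("factor must be between 0 and 1")
    current = dict(thresholds)
    for _ in range(max_rounds):
        failing = sorted(d for d in status if status[d] < current[d])
        if len(failing) <= spares:
            return failing, current
        # Every threshold is reduced by the same proportion.
        current = {d: t * factor for d, t in current.items()}
    raise RuntimeError("thresholds did not settle within max_rounds")


# Hypothetical data: 10 disks, 5 below the initial threshold, 3 hot spares.
status = {"disk1": 30, "disk2": 35, "disk3": 40, "disk4": 52, "disk5": 55,
          "disk6": 90, "disk7": 90, "disk8": 90, "disk9": 90, "disk10": 90}
thresholds = {name: 60.0 for name in status}
failing, final = lower_until_spares_suffice(status, thresholds, spares=3)
```

  • With these illustrative numbers the thresholds are lowered twice (60, then about 54, then about 48.6) before the predicted failures fit the three spares.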
  • Optionally, for any two non-volatile storage media among all the non-volatile storage media of the data center, the initial preset fault thresholds respectively corresponding to the two are reduced by the same magnitude.
  • For example, the data center has five hard disks: hard disk 1, hard disk 2, hard disk 3, hard disk 4, and hard disk 5.
  • The corresponding initial preset fault thresholds are X1, X2, X3, X4, and X5, respectively.
  • After each initial preset fault threshold is reduced by the same proportion, the first preset fault thresholds are 70% X1, 70% X2, 70% X3, 70% X4, and 70% X5, respectively.
  • In the embodiment of the present invention, the initial preset fault threshold corresponding to each non-volatile storage medium is first reduced to obtain the first preset fault threshold, so that some of the non-volatile storage media predicted to fail are screened out first and the existing hot standby non-volatile storage media take over from them. Hot standby non-volatile storage media are then supplemented, and the reduced threshold is raised (that is, the first preset fault threshold is increased), so that non-volatile storage media predicted to fail that were not screened out the first time are screened out. This is repeated until all the non-volatile storage media predicted to fail according to the initial preset fault thresholds have been screened out. Specifically, the following method may be adopted:
  • After the hot standby non-volatile storage media are used to take over from all non-volatile storage media predicted to fail according to their respective first preset fault thresholds, the following operations are also included:
  • when the status value of any non-volatile storage medium is smaller than the second preset fault threshold corresponding to that non-volatile storage medium, it is predicted that the non-volatile storage medium will fail;
  • the second preset fault threshold is less than or equal to the initial preset fault threshold of the corresponding non-volatile storage medium.
  • For example, there are 10 hard disks in the data center: hard disk 1, hard disk 2, hard disk 3, hard disk 4, hard disk 5, hard disk 6, hard disk 7, hard disk 8, hard disk 9, and hard disk 10.
  • The corresponding initial preset fault thresholds are X1, X2, X3, X4, X5, X6, X7, X8, X9, and X10, respectively, and there are 3 hot spare hard disks.
  • Hard disk 1 through hard disk 8 are predicted to fail, so the initial preset fault thresholds are lowered. The first preset fault thresholds obtained are 50% X1, 50% X2, 50% X3, 50% X4, 50% X5, 50% X6, 50% X7, 50% X8, 50% X9, and 50% X10, respectively.
  • Three hard disks are predicted to fail according to the first preset fault thresholds: hard disk 1, hard disk 2, and hard disk 3. The hot spare hard disks take over from hard disk 1, hard disk 2, and hard disk 3, and three hot spare hard disks are then added.
  • The first preset fault thresholds are then raised to obtain the second preset fault thresholds. Three hard disks are predicted to fail according to the second preset fault thresholds: hard disk 4, hard disk 5, and hard disk 6. The hot spare hard disks take over from hard disk 4, hard disk 5, and hard disk 6, three more hot spare hard disks are added, and the second preset fault thresholds are raised to obtain the third preset fault thresholds: 80% X1, 80% X2, 80% X3, 80% X4, 80% X5, 80% X6, 80% X7, 80% X8, 80% X9, and 80% X10. Two hard disks are predicted to fail according to the third preset fault thresholds: hard disk 7 and hard disk 8. The hot spare hard disks take over from hard disk 7 and hard disk 8.
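  • The staged example above can be replayed with a short sketch. The 50% and 80% stages come from the example; the 65% value for the second stage, the final 100% stage, the status values, and the common initial threshold of 100 are assumptions chosen only so that the stages reproduce the same groups of disks.

```python
def staged_takeover(status, initial, scales, spares_per_stage):
    """Replace predicted-failing disks stage by stage as thresholds are raised.

    `scales` lists the fraction of the initial threshold used at each stage.
    Spares are assumed to be replenished to `spares_per_stage` between stages.
    """
    in_service = set(status)
    stages = []
    for scale in scales:
        failing = sorted(d for d in in_service if status[d] < scale * initial[d])
        if len(failing) > spares_per_stage:
            raise RuntimeError("more predicted failures than hot spares at this stage")
        in_service -= set(failing)   # hot spares take over; these disks leave service
        stages.append((scale, failing))
    return stages


# Hypothetical status values for the 10 disks; higher means healthier.
status = {"hd1": 40, "hd2": 45, "hd3": 48, "hd4": 55, "hd5": 60,
          "hd6": 62, "hd7": 70, "hd8": 75, "hd9": 120, "hd10": 130}
initial = {name: 100.0 for name in status}
stages = staged_takeover(status, initial, scales=[0.5, 0.65, 0.8, 1.0],
                         spares_per_stage=3)
```

  • Each stage retires at most three disks, so the three hot spares available per stage are never exceeded, and the last stage at the initial threshold finds nothing further to replace.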
  • The initial preset fault threshold corresponding to a non-volatile storage medium is related to the power-on time of that medium, and the judging condition is relaxed as the power-on time increases. If a higher initial preset fault threshold means a looser judging condition, the initial preset fault threshold increases as the power-on time increases; if a lower initial preset fault threshold means a looser judging condition, the initial preset fault threshold decreases as the power-on time increases.
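  • One possible shape for such a dependence is sketched below. The patent does not give a formula; the linear form, the function name, and the parameters `base`, `drop_per_1000h`, and `floor` are all hypothetical. The sketch covers the case where a status value below the threshold means a predicted failure, so a lower threshold is the looser condition.

```python
def initial_threshold(power_on_hours, base=60.0, drop_per_1000h=2.0, floor=30.0):
    """Hypothetical initial preset fault threshold as a function of power-on time.

    The threshold falls linearly with power-on hours and never goes below
    `floor`, so the judging condition loosens as the disk ages.
    """
    if power_on_hours < 0:
        raise ValueError("power_on_hours must be non-negative")
    return max(floor, base - drop_per_1000h * power_on_hours / 1000.0)


new_disk = initial_threshold(0)
five_thousand_hours = initial_threshold(5000)
very_old_disk = initial_threshold(50000)
```

  • Any monotonically decreasing function would serve the same purpose here; the linear form is used only because it is easy to read.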
  • In summary, for each of the at least two non-volatile storage media of the data center, the following is performed: a status value of the non-volatile storage medium is calculated, where the status value is used to characterize the operating condition of the non-volatile storage medium; when the status value is determined to be less than the initial preset fault threshold corresponding to the non-volatile storage medium, the non-volatile storage medium is predicted to fail; any two non-volatile storage media with different status values correspond to different initial preset fault thresholds.
  • In this solution, non-volatile storage media with different status values correspond to different initial preset fault thresholds, that is, to different alarm thresholds, which improves the accuracy of predicting which non-volatile storage media will fail.
  • The detailed flow is shown in FIG. 2:
  • Step 200: The data center has 10 hard disks: hard disk 1, hard disk 2, ..., hard disk 10. Calculate the status value of each of the 10 hard disks;
  • Step 210: For each of the 10 hard disks, treat a hard disk whose status value is smaller than its corresponding initial preset fault threshold as a hard disk predicted to fail; any two hard disks with different status values correspond to different initial preset fault thresholds;
  • Step 220: Determine the total number of hard disks predicted to fail, and judge whether that total is less than or equal to the number of hot spare hard disks in the data center; if yes, go to step 230; otherwise, go to step 240;
  • Step 230: Use hot spare hard disks to take over the work of all the hard disks predicted to fail;
  • The number of hot spare hard disks that take over the work of the hard disks predicted to fail is the same as the total number of hard disks predicted to fail.
  • Step 240: Reduce the initial preset fault threshold corresponding to each of the 10 hard disks to obtain the first preset fault thresholds;
  • The initial preset fault thresholds corresponding to any two hard disks are reduced by the same magnitude.
  • Step 250: Judge whether the number of hard disks predicted to fail according to the first preset fault thresholds is less than or equal to the number of hot spare hard disks of the data center; if yes, go to step 260; otherwise, return to step 240;
  • Step 260: Use the hot spare hard disks to take over the work of the hard disks predicted to fail according to the first preset fault thresholds, and supplement a preset number of hot spare hard disks;
  • Step 270: Raise the first preset fault thresholds to obtain the second preset fault thresholds, and treat each hard disk whose status value is smaller than its corresponding second preset fault threshold as a hard disk predicted to fail;
  • Step 280: Judge whether the predicted number of failed hard disks is 0 and/or the second preset fault threshold equals the initial preset fault threshold. If yes, the process ends; otherwise, return to step 220.
  • an embodiment of the present invention provides an apparatus for predicting a failure of a non-volatile storage medium, where the apparatus includes a computing unit 30 and a prediction unit 31, where:
  • the computing unit 30 is configured to calculate, for each of the at least two non-volatile storage media of the data center, a status value of the non-volatile storage medium, where the status value is used to characterize the operating condition of the non-volatile storage medium;
  • the prediction unit 31 is configured to predict that the non-volatile storage medium will fail when the status value is less than the initial preset fault threshold corresponding to the non-volatile storage medium;
  • the initial preset fault thresholds corresponding to any two non-volatile storage media having different status values are different.
  • the apparatus further includes a determining unit, a judging unit, and a takeover unit:
  • the determining unit is configured to determine the total number of all non-volatile storage media predicted to fail;
  • the judging unit is configured to judge whether the determined total number is less than or equal to the number of hot standby non-volatile storage media of the data center;
  • the takeover unit is configured to, when the judging unit judges that the determined total number is less than or equal to the number of hot standby non-volatile storage media of the data center, use hot standby non-volatile storage media to take over the work of all the non-volatile storage media predicted to fail;
  • the number of hot standby non-volatile storage media that take over the work of the non-volatile storage media predicted to fail is the same as the total number of non-volatile storage media predicted to fail.
  • the judging unit is further configured to: when judging that the determined total number is greater than the number of hot standby non-volatile storage media of the data center, for each non-volatile storage medium respectively, reduce the initial preset fault threshold corresponding to the non-volatile storage medium to obtain a first preset fault threshold;
  • the prediction unit 31 is further configured to predict that the non-volatile storage medium will fail when its status value is less than the corresponding first preset fault threshold;
  • the takeover unit is further configured to, when the total number of all non-volatile storage media predicted to fail according to their respective first preset fault thresholds is less than or equal to the number of hot standby non-volatile storage media of the data center, use the hot standby non-volatile storage media to take over from all non-volatile storage media predicted to fail according to their respective first preset fault thresholds.
  • optionally, for any two non-volatile storage media among all the non-volatile storage media of the data center, the initial preset fault thresholds respectively corresponding to the two are reduced by the same magnitude.
  • the apparatus further includes a supplementing unit configured to supplement a preset number of hot standby non-volatile storage media, and to raise the first preset fault threshold corresponding to each non-volatile storage medium of the data center whose initial preset fault threshold was reduced, to obtain a second preset fault threshold;
  • the prediction unit 31 is further configured to: for any non-volatile storage medium whose first preset fault threshold has been raised, predict that the non-volatile storage medium will fail when its status value is less than the corresponding second preset fault threshold;
  • the judging unit is further configured to judge whether the total number of all non-volatile storage media predicted to fail according to their respective second preset fault thresholds is less than or equal to the supplemented preset number of hot standby non-volatile storage media;
  • the takeover unit is further configured to: when the judging unit judges that the total number of all non-volatile storage media predicted to fail according to their respective second preset fault thresholds is less than or equal to the supplemented preset number of hot standby non-volatile storage media, use the supplemented hot standby non-volatile storage media to take over from all non-volatile storage media predicted to fail according to their respective second preset fault thresholds;
  • the second preset fault threshold is less than or equal to the initial preset fault threshold of the corresponding non-volatile storage medium.
  • FIG. 3B is another schematic structural diagram of an apparatus for predicting failure of a non-volatile storage medium according to an embodiment of the present invention, including at least one processor 301, a communication bus 302, a memory 303, and at least one communication interface 304.
  • the communication bus 302 is used to implement the connection and communication between the above components, and the communication interface 304 is used to connect and communicate with external devices.
  • the memory 303 is configured to store executable program code, and the processor 301 executes the program code for:
  • the initial preset fault thresholds corresponding to any two non-volatile storage media having different status values are different.
  • the processor 301 is further configured to perform the following after predicting that any non-volatile storage medium will fail:
  • hot standby non-volatile storage media among the hot standby non-volatile storage media of the data center are used to take over the work of all the non-volatile storage media predicted to fail;
  • the number of hot standby non-volatile storage media that take over the work of the non-volatile storage media predicted to fail is the same as the total number of non-volatile storage media predicted to fail.
  • the processor 301 is further configured to perform the following after determining the total number of non-volatile storage media predicted to fail:
  • when the determined total number is greater than the number of hot standby non-volatile storage media of the data center, perform the following for each non-volatile storage medium respectively:
  • all non-volatile storage media predicted to fail according to their respective first preset fault thresholds are taken over by the hot standby non-volatile storage media.
  • the processor 301 is further configured to: for any two non-volatile storage media among all the non-volatile storage media of the data center, reduce the initial preset fault thresholds respectively corresponding to the two by the same magnitude.
  • the processor 301 is further configured to perform the following after using the hot standby non-volatile storage media to take over from all non-volatile storage media predicted to fail according to their respective first preset fault thresholds:
  • when the status value of any non-volatile storage medium is smaller than the second preset fault threshold corresponding to that non-volatile storage medium, predict that the non-volatile storage medium will fail;
  • the second preset fault threshold is less than or equal to an initial preset fault threshold of the corresponding non-volatile storage medium for each non-volatile storage medium corresponding to the second preset fault threshold.
  • The computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing device such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Techniques For Improving Reliability Of Storages (AREA)
  • Debugging And Monitoring (AREA)

Abstract

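A method and apparatus for predicting failure of a non-volatile storage medium. For each of at least two non-volatile storage media of a data center, the following is performed separately (100): a status value of the non-volatile storage medium is calculated, the status value being used to characterize the operating condition of the non-volatile storage medium (110); when the status value is determined to be less than the initial preset fault threshold corresponding to the non-volatile storage medium, the non-volatile storage medium is predicted to fail; any two non-volatile storage media with different status values correspond to different initial preset fault thresholds (120). In this solution, non-volatile storage media with different status values correspond to different initial preset fault thresholds, that is, to different alarm thresholds, which improves the accuracy of predicting which non-volatile storage media will fail.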

Description

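Method and apparatus for predicting a failure of a non-volatile storage medium
This application claims priority to Chinese Patent Application No. 201410822384.5, filed with the Chinese Patent Office on December 25, 2014 and entitled "Method and apparatus for predicting a failure of a non-volatile storage medium", which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for predicting a failure of a non-volatile storage medium.
Background
Data storage is becoming increasingly important, and so is ensuring data reliability. The hard disk is still the most commonly used medium for storing data and is widely deployed in data centers of all kinds. Hard disk failure prediction has therefore become an important means of ensuring data reliability and is gradually becoming an important part of data center management software. A data center monitors the operating condition of each hard disk; when a hard disk has failed or is about to fail, the data center raises an alarm for the disk, isolates it, and then starts data reconstruction.
Current DFP (Disk Failure Prediction) technology checks whether certain indicators of a hard disk have reached preset thresholds; if an indicator falls short, an alarm is raised and the hard disk is considered to be about to fail. To reduce their return rate, hard disk vendors generally set the alarm threshold very low, so the overall failure prediction rate for hard disks is extremely low, and failure prediction that relies on the vendor's alarm threshold has low accuracy. To improve the accuracy of hard disk failure prediction, a data center that uses the hard disks resets the alarm threshold.
In the foregoing method, all hard disks in the data center share the same alarm threshold. However, the hard disks in a data center are in different conditions: some have been in use for a long time and others for a short time. The foregoing method therefore still suffers from low accuracy.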
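Summary
Embodiments of the present invention provide a method and apparatus for predicting a failure of a non-volatile storage medium, to overcome the low accuracy of hard disk failure prediction in the prior art.
The specific technical solutions provided by the embodiments of the present invention are as follows:
According to a first aspect, a method for predicting a failure of a non-volatile storage medium is provided, including:
for any one non-volatile storage medium of at least two non-volatile storage media in a data center, performing:
calculating a status value of the non-volatile storage medium, where the status value characterizes the operating condition of the medium; and
when determining that the status value is smaller than an initial preset fault threshold corresponding to the medium, predicting that the medium will fail;
where any two non-volatile storage media with different status values correspond to different initial preset fault thresholds.
With reference to the first aspect, in a first possible implementation, after predicting that the non-volatile storage medium will fail, the method further includes:
determining the total number of non-volatile storage media predicted to fail; and
when determining that the total number is less than or equal to the number of hot spare non-volatile storage media in the data center, using hot spare media from among them to take over the work of all the media predicted to fail;
where the number of hot spare media that take over the work equals the total number of media predicted to fail.
With reference to the first possible implementation of the first aspect, in a second possible implementation, after determining the total number of non-volatile storage media predicted to fail, the method further includes:
when determining that the total number is greater than the number of hot spare non-volatile storage media in the data center, performing, for each non-volatile storage medium:
lowering the initial preset fault threshold corresponding to the medium to obtain a first preset fault threshold, and
when determining that the status value of the medium is smaller than the first preset fault threshold corresponding to the medium, further predicting that the medium will fail; and
when determining that the total number of media predicted to fail according to their respective first preset fault thresholds is less than or equal to the number of hot spare media in the data center, using the hot spare media to take over from all the media predicted to fail according to their respective first preset fault thresholds.
With reference to the second possible implementation of the first aspect, in a third possible implementation, for any two non-volatile storage media among all the non-volatile storage media in the data center, the initial preset fault thresholds corresponding to the two media are lowered by the same magnitude.
With reference to the second or third possible implementation of the first aspect, in a fourth possible implementation, after using the hot spare media to take over from all the media predicted to fail according to their respective first preset fault thresholds, the method further includes:
replenishing a preset number of hot spare non-volatile storage media;
raising the first preset fault threshold corresponding to each non-volatile storage medium in the data center whose initial preset fault threshold was lowered, to obtain a second preset fault threshold;
for any medium whose first preset fault threshold was raised, when the status value of the medium is smaller than the second preset fault threshold corresponding to the medium, predicting that the medium will fail; and
when determining that the total number of media predicted to fail according to their respective second preset fault thresholds is less than or equal to the preset number of replenished hot spare media, using the replenished hot spare media to take over from all the media predicted to fail according to their respective second preset fault thresholds;
where, for each non-volatile storage medium that has a second preset fault threshold, the second preset fault threshold is less than or equal to the initial preset fault threshold of that medium.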
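According to a second aspect, an apparatus for predicting a failure of a non-volatile storage medium is provided, including:
a computing unit, configured to perform, for any one non-volatile storage medium of at least two non-volatile storage media in a data center: calculating a status value of the medium, where the status value characterizes the operating condition of the medium; and
a prediction unit, configured to predict that the medium will fail when determining that the status value is smaller than an initial preset fault threshold corresponding to the medium;
where any two non-volatile storage media with different status values correspond to different initial preset fault thresholds.
With reference to the second aspect, in a first possible implementation, the apparatus further includes a determining unit, a judging unit, and a takeover unit:
the determining unit is configured to determine the total number of non-volatile storage media predicted to fail;
the judging unit is configured to determine that the total number is less than or equal to the number of hot spare non-volatile storage media in the data center; and
the takeover unit is configured to, when the judging unit determines that the total number is less than or equal to the number of hot spare media in the data center, use hot spare media from among them to take over the work of all the media predicted to fail;
where the number of hot spare media that take over the work equals the total number of media predicted to fail.
With reference to the first possible implementation of the second aspect, in a second possible implementation, the determining unit is further configured to perform, for each non-volatile storage medium, when determining that the total number is greater than the number of hot spare non-volatile storage media in the data center:
lowering the initial preset fault threshold corresponding to the medium to obtain a first preset fault threshold;
the prediction unit is configured to further predict that the medium will fail when determining that the status value of the medium is smaller than the first preset fault threshold corresponding to the medium; and
the takeover unit is configured to, when determining that the total number of media predicted to fail according to their respective first preset fault thresholds is less than or equal to the number of hot spare media in the data center, use the hot spare media to take over from all the media predicted to fail according to their respective first preset fault thresholds.
With reference to the second possible implementation of the second aspect, in a third possible implementation, for any two non-volatile storage media among all the non-volatile storage media in the data center, the initial preset fault thresholds corresponding to the two media are lowered by the same magnitude.
With reference to the second or third possible implementation of the second aspect, in a fourth possible implementation, the apparatus further includes a replenishing unit, configured to replenish a preset number of hot spare non-volatile storage media, and to raise the first preset fault threshold corresponding to each non-volatile storage medium in the data center whose initial preset fault threshold was lowered, to obtain a second preset fault threshold;
the prediction unit is further configured to, for any medium whose first preset fault threshold was raised, predict that the medium will fail when the status value of the medium is smaller than the second preset fault threshold corresponding to the medium;
the judging unit is further configured to determine that the total number of media predicted to fail according to their respective second preset fault thresholds is less than or equal to the preset number of replenished hot spare media; and
the takeover unit is further configured to, when the judging unit determines that the total number of media predicted to fail according to their respective second preset fault thresholds is less than or equal to the preset number of replenished hot spare media, use the replenished hot spare media to take over from all the media predicted to fail according to their respective second preset fault thresholds;
where, for each non-volatile storage medium that has a second preset fault threshold, the second preset fault threshold is less than or equal to the initial preset fault threshold of that medium.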
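The beneficial effects of the present invention are as follows:
In the prior art, all non-volatile storage media in a data center share the same alarm threshold, although different media may be in different operating conditions; with a single alarm threshold for all media, failure prediction has low accuracy. In the embodiments of the present invention, non-volatile storage media with different status values correspond to different initial preset fault thresholds, that is, to different alarm thresholds, which improves the accuracy of predicting which non-volatile storage media will fail.
Brief Description of the Drawings
FIG. 1 is a flowchart of predicting a failure of a non-volatile storage medium according to an embodiment of the present invention;
FIG. 2 shows an embodiment of predicting a hard disk failure according to an embodiment of the present invention;
FIG. 3A is a schematic structural diagram of an apparatus for predicting a failure of a non-volatile storage medium according to an embodiment of the present invention;
FIG. 3B is another schematic structural diagram of an apparatus for predicting a failure of a non-volatile storage medium according to an embodiment of the present invention.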
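Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
In addition, the terms "system" and "network" are often used interchangeably herein. The term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate A alone, both A and B, or B alone. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects.
The preferred implementations of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here are used only to illustrate and explain the present invention, not to limit it, and that, provided there is no conflict, the embodiments in this application and the features in the embodiments may be combined with one another.
Referring to FIG. 1, in an embodiment of the present invention, a procedure for predicting a failure of a non-volatile storage medium is as follows:
Step 100: For any one non-volatile storage medium of at least two non-volatile storage media in a data center, perform the following steps.
Step 110: Calculate a status value of the medium, where the status value characterizes the operating condition of the medium.
Step 120: When determining that the status value is smaller than an initial preset fault threshold corresponding to the medium, predict that the medium will fail. Any two non-volatile storage media with different status values correspond to different initial preset fault thresholds.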
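Steps 100 to 120 can be sketched as follows. This is a minimal illustration rather than the implementation described in the application: the `Medium` class, its field names, and the numeric values are assumptions introduced here, and the sketch assumes that a larger status value means a healthier medium.

```python
from dataclasses import dataclass


@dataclass
class Medium:
    name: str                 # hypothetical identifier
    status_value: float       # assumed: higher means healthier
    initial_threshold: float  # per-medium initial preset fault threshold


def predict_failures(media):
    """Step 120: a medium is predicted to fail when its status value
    is smaller than its own initial preset fault threshold."""
    return [m for m in media if m.status_value < m.initial_threshold]


# Two media with different status values and different thresholds.
media = [
    Medium("disk1", status_value=40.0, initial_threshold=50.0),
    Medium("disk2", status_value=70.0, initial_threshold=60.0),
]
failing = predict_failures(media)
print([m.name for m in failing])
```

Each medium is compared only against its own threshold, which is the point of the scheme: a single data-center-wide alarm threshold is replaced by one threshold per medium.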
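A data center keeps a certain number of hot spare non-volatile storage media in reserve. Therefore, in this embodiment of the present invention, after a medium is predicted to fail, the following operations are further performed:
determining the total number of non-volatile storage media predicted to fail; and
when determining that the total number is less than or equal to the number of hot spare media in the data center, using hot spare media from among them to take over the work of all the media predicted to fail;
where the number of hot spare media that take over the work equals the total number of media predicted to fail.
For example, a data center has 10 hard disks, and two of them, hard disk 1 and hard disk 2, are predicted to fail. If the data center has 3 hot spare hard disks, any two of the 3 hot spare hard disks are used to take over the work of hard disk 1 and hard disk 2.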
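The hot spare check above can be sketched as follows. The function name `assign_spares` and the disk and spare names are hypothetical; the sketch only shows that one spare is assigned per predicted failure when enough spares exist.

```python
def assign_spares(predicted, spares):
    """Pair each medium predicted to fail with one hot spare.

    Returns None when there are more predicted failures than spares,
    which is the case handled by lowering the thresholds.
    """
    if len(predicted) > len(spares):
        return None
    # zip stops at the shorter list, so exactly len(predicted) spares are used.
    return dict(zip(predicted, spares))


plan = assign_spares(["disk1", "disk2"], ["spare1", "spare2", "spare3"])
print(plan)
```

Here two of the three spares are used, matching the example of hard disk 1 and hard disk 2; which two spares are picked is arbitrary, and this sketch simply takes the first two.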
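Of course, in practice, the total number of media predicted to fail may be greater than the number of hot spare media in the data center. The operations performed in that case differ from those performed when the total number is less than or equal to the number of hot spare media, and are implemented as follows:
After the total number of non-volatile storage media predicted to fail is determined, the following operations are further performed:
when determining that the total number is greater than the number of hot spare media in the data center, performing, for each non-volatile storage medium:
lowering the initial preset fault threshold corresponding to the medium to obtain a first preset fault threshold, and
when determining that the status value of the medium is smaller than the lowered threshold, that is, the first preset fault threshold corresponding to the medium, further predicting that the medium will fail; and
when determining that the total number of media predicted to fail according to their respective first preset fault thresholds is less than or equal to the number of hot spare media in the data center, using the hot spare media to take over from all the media predicted to fail according to their respective first preset fault thresholds.
For example, a data center has 10 hard disks, and five of them, hard disks 1 to 5, are predicted to fail. If the data center has 3 hot spare hard disks, the initial preset fault threshold of each of the 10 hard disks is lowered. Suppose the initial preset fault thresholds of the 10 hard disks before lowering are X1, X2, X3, X4, X5, X6, X7, X8, X9, and X10, and the first preset fault thresholds after the first lowering are Y1, Y2, Y3, Y4, Y5, Y6, Y7, Y8, Y9, and Y10, where each Yi is less than the corresponding Xi (Y1 is less than X1, Y2 is less than X2, and so on up to Y10 less than X10). If the number of hard disks predicted to fail according to the first preset fault thresholds is still greater than the number of hot spare hard disks, the first preset fault thresholds are lowered again. This is repeated until the number of hard disks predicted to fail is less than or equal to the number of hot spare hard disks, at which point the hot spare hard disks directly take over from all the hard disks finally predicted to fail.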
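The repeated lowering described above can be sketched as a loop. This is an illustrative sketch under stated assumptions: the function name, the halving factor, the iteration cap, and the sample values are all introduced here, thresholds are assumed positive so that multiplying by a factor below 1 lowers them, and the sample uses 5 disks instead of the 10 in the example.

```python
def lower_until_spares_suffice(status, thresholds, spare_count,
                               factor=0.5, max_rounds=50):
    """Lower every threshold by the same factor until the number of
    predicted failures no longer exceeds the number of hot spares."""
    current = dict(thresholds)
    for _ in range(max_rounds):
        predicted = [d for d in status if status[d] < current[d]]
        if len(predicted) <= spare_count:
            return current, predicted
        # Still too many predictions: lower all thresholds again.
        current = {d: t * factor for d, t in current.items()}
    raise RuntimeError("thresholds could not be lowered far enough")


status = {"disk1": 10, "disk2": 20, "disk3": 30, "disk4": 40, "disk5": 45}
initial = {"disk1": 50, "disk2": 52, "disk3": 54, "disk4": 56, "disk5": 58}
lowered, predicted = lower_until_spares_suffice(status, initial, spare_count=3)
print(predicted, lowered)
```

With the initial thresholds all five disks are predicted to fail, which exceeds the three spares. After one halving only `disk1` and `disk2` remain below their thresholds, so the loop stops and those two are the disks handed to the spares.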
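Optionally, in this embodiment of the present invention, to reduce implementation complexity, for any two non-volatile storage media among all the non-volatile storage media in the data center, the initial preset fault thresholds corresponding to the two media are lowered by the same magnitude.
For example, a data center has 5 hard disks, hard disks 1 to 5, whose initial preset fault thresholds are X1, X2, X3, X4, and X5, respectively. The first preset fault thresholds obtained by lowering the initial preset fault thresholds are 70%X1, 70%X2, 70%X3, 70%X4, and 70%X5, respectively; that is, every threshold is scaled to the same 70% of its initial value.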
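Read this way, "the same magnitude" in the example is the same proportion rather than the same absolute amount. A small sketch of that reading, with hypothetical names and values:

```python
def scale_thresholds(initial, percent):
    """Scale every initial threshold to the same percentage of itself."""
    return {disk: x * percent / 100 for disk, x in initial.items()}


first = scale_thresholds({"disk1": 80, "disk2": 60}, 70)
print(first)
```

Because one factor is applied to every disk, only a single number needs to be stored and adjusted, which is presumably the reduction in implementation complexity that the text refers to.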
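In this embodiment of the present invention, when the total number of media predicted to fail is greater than the number of hot spare media in the data center, the initial preset fault thresholds are lowered to obtain first preset fault thresholds. In this way, a subset of the media that will fail is screened out first, and the existing hot spare media replace that subset. Hot spare media are then replenished, and the lowered thresholds, that is, the first preset fault thresholds, are raised, so that media that will fail but were not screened out the first time are screened out. This cycle repeats until all media predicted to fail according to the initial preset fault thresholds have been screened out. A specific implementation may be as follows:
For example, after the hot spare media take over from all the media predicted to fail according to their respective first preset fault thresholds, the following operations are further performed:
replenishing a preset number of hot spare non-volatile storage media;
raising the first preset fault threshold corresponding to each non-volatile storage medium in the data center whose initial preset fault threshold was lowered, to obtain a second preset fault threshold;
for any medium whose first preset fault threshold was raised, when the status value of the medium is smaller than the second preset fault threshold corresponding to the medium, predicting that the medium will fail; and
when determining that the total number of media predicted to fail according to their respective second preset fault thresholds is less than or equal to the preset number of replenished hot spare media, using the replenished hot spare media to take over from all the media predicted to fail according to their respective second preset fault thresholds;
where, for each non-volatile storage medium that has a second preset fault threshold, the second preset fault threshold is less than or equal to the initial preset fault threshold of that medium.
For example, a data center has 10 hard disks, hard disks 1 to 10, whose initial preset fault thresholds are X1 through X10, respectively, and has 3 hot spare disks. Eight hard disks, hard disks 1 to 8, are screened out as about to fail according to the initial preset fault thresholds, so the initial preset fault thresholds are lowered. The resulting first preset fault thresholds are 50% of X1 through X10, respectively. Three hard disks, hard disks 1, 2, and 3, are predicted to fail according to the first preset fault thresholds, so the hot spare hard disks replace hard disks 1, 2, and 3, and 3 hot spare hard disks are replenished after the replacement. The first preset fault thresholds are then raised to obtain second preset fault thresholds, which are 60% of X1 through X10, respectively. Three hard disks, hard disks 4, 5, and 6, are predicted to fail according to the second preset fault thresholds, so the replenished hot spare hard disks replace hard disks 4, 5, and 6. Then 3 more hot spare hard disks are replenished, and the second preset fault thresholds are raised to obtain third preset fault thresholds, which are 80% of X1 through X10, respectively. Two hard disks, hard disks 7 and 8, are predicted to fail according to the third preset fault thresholds, so the replenished hot spare hard disks replace hard disks 7 and 8.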
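The 50%, 60%, 80% example can be replayed with a short sketch. The status values and thresholds below are made up so that the rounds come out as in the example; the function name and the integer-percent comparison are likewise choices made here, and the sketch assumes the spares are topped up to the same count before every round.

```python
def staged_replacement(status, initial, percents, spares_per_round):
    """Replace predicted failures round by round while the thresholds
    are raised back toward their initial values."""
    active = dict(status)  # media still in service
    rounds = []
    for percent in percents:
        # status < percent% of X, compared in integers to avoid rounding.
        predicted = [d for d, s in active.items()
                     if s * 100 < initial[d] * percent]
        if len(predicted) > spares_per_round:
            raise ValueError("not enough hot spares for this round")
        for d in predicted:
            del active[d]  # taken over by a hot spare
        rounds.append((percent, predicted))
    return rounds


names = [f"disk{i}" for i in range(1, 11)]
initial = dict(zip(names, [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]))
status = dict(zip(names, [40, 50, 55, 70, 75, 80, 110, 120, 200, 210]))
rounds = staged_replacement(status, initial, percents=(50, 60, 80),
                            spares_per_round=3)
print(rounds)
```

With these numbers, disks 1 to 8 are below their initial thresholds, disks 1 to 3 are replaced at 50%, disks 4 to 6 at 60%, and disks 7 and 8 at 80%, so no round needs more than the 3 available spares.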
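In this embodiment of the present invention, the initial preset fault threshold corresponding to a non-volatile storage medium is related to the power-on time of the medium: the longer the power-on time, the more lenient the judging criterion. If raising the initial preset fault threshold makes the criterion more lenient, the initial preset fault threshold increases with power-on time; if lowering the initial preset fault threshold makes the criterion more lenient, the initial preset fault threshold decreases with power-on time.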
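One way to picture this relationship is sketched below. The application does not give a formula, so the linear form, the function name, and every constant here are assumptions. Under the rule used elsewhere in this description (failure is predicted when the status value is smaller than the threshold), a more lenient criterion means a lower threshold.

```python
def initial_threshold(power_on_hours, base=100.0,
                      drop_per_1000h=2.0, floor=60.0):
    """Hypothetical threshold that relaxes linearly with power-on time
    and never drops below a floor."""
    return max(floor, base - drop_per_1000h * power_on_hours / 1000)


print(initial_threshold(0), initial_threshold(10_000), initial_threshold(50_000))
```

A new disk is held to the strict base threshold, while a disk with a long power-on time is judged against a lower one, so two disks of different age get different initial preset fault thresholds.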
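In summary, in the embodiments of the present invention, the following is performed for any one non-volatile storage medium of at least two non-volatile storage media in a data center: calculating a status value of the medium, where the status value characterizes the operating condition of the medium; and, when determining that the status value is smaller than an initial preset fault threshold corresponding to the medium, predicting that the medium will fail, where any two non-volatile storage media with different status values correspond to different initial preset fault thresholds. In this solution, non-volatile storage media with different status values correspond to different initial preset fault thresholds, that is, to different alarm thresholds, which improves the accuracy of predicting which non-volatile storage media will fail.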
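To aid understanding of the embodiments of the present invention, a specific application scenario is given below to describe the process of predicting a failure of a non-volatile storage medium in further detail, as shown in FIG. 2:
Step 200: A data center has 10 hard disks: hard disk 1, hard disk 2, ..., hard disk 10. Calculate the status value of each of the 10 hard disks.
Step 210: For each of the 10 hard disks, treat a hard disk whose status value is smaller than its corresponding initial preset fault threshold as a hard disk predicted to fail. Any two non-volatile storage media with different status values correspond to different initial preset fault thresholds.
Step 220: Determine the total number of hard disks predicted to fail, and judge whether that total is less than or equal to the number of hot spare hard disks in the data center. If yes, go to step 230; otherwise, go to step 240.
Step 230: Use hot spare hard disks to take over the work of all the hard disks predicted to fail.
In this step, the number of hot spare hard disks that take over the work equals the total number of hard disks predicted to fail.
Step 240: Lower the initial preset fault thresholds corresponding to the 10 hard disks to obtain first preset fault thresholds.
In this step, for any two hard disks among all the hard disks in the data center, the initial preset fault thresholds corresponding to the two hard disks are lowered by the same magnitude.
Step 250: Judge whether the number of hard disks predicted to fail according to the first preset fault thresholds is less than or equal to the number of hot spare hard disks in the data center. If yes, go to step 260; otherwise, return to step 240.
Step 260: Use the hot spare hard disks to take over the work of the hard disks predicted to fail according to the first preset fault thresholds, and replenish a preset number of hot spare hard disks.
Step 270: Raise the first preset fault thresholds to obtain second preset fault thresholds, and treat each hard disk whose status value is smaller than its corresponding second preset fault threshold as a hard disk predicted to fail.
Step 280: Judge whether the number of hard disks predicted to fail is 0 and/or the second preset fault thresholds equal the initial preset fault thresholds. If yes, end the procedure; otherwise, return to step 220.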
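The loop formed by steps 210 to 280 can be sketched as one function. This is a loose sketch, not the flow of FIG. 2 verbatim: the fixed 10-point step for lowering and raising, the choice to stop only once the thresholds are back at 100% of their initial values, the iteration cap, and all names and sample values are assumptions, and the spares are assumed to be replenished to `spare_count` after every replacement.

```python
def run_flow(status, initial, spare_count, step=10, max_iterations=1000):
    """Lower thresholds until predictions fit the spares, replace,
    then raise thresholds step by step back to their initial values."""
    active = dict(status)  # hard disks still in service
    percent = 100          # thresholds as a percentage of the initial ones
    replaced = []
    for _ in range(max_iterations):
        predicted = [d for d, s in active.items()
                     if s * 100 < initial[d] * percent]   # steps 210 / 270
        if len(predicted) > spare_count:                  # steps 220 / 250
            if percent <= step:
                raise RuntimeError("cannot lower thresholds further")
            percent -= step                               # step 240
            continue
        for d in predicted:                               # steps 230 / 260
            del active[d]
        replaced.append((percent, predicted))
        if percent >= 100:                                # step 280
            return replaced
        percent = min(100, percent + step)                # step 270
    raise RuntimeError("flow did not converge")


names = [f"disk{i}" for i in range(1, 11)]
initial = dict(zip(names, [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]))
status = dict(zip(names, [40, 50, 55, 70, 75, 80, 110, 120, 200, 210]))
log = run_flow(status, initial, spare_count=3)
print(log)
```

The iteration cap is a safeguard added here. The flow as described does not say what happens if raising the thresholds by one step again yields more predicted failures than spares, and without the cap this sketch could then alternate between lowering and raising indefinitely.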
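Based on the technical solutions of the foregoing method, and referring to FIG. 3A, an embodiment of the present invention provides an apparatus for predicting a failure of a non-volatile storage medium. The apparatus includes a computing unit 30 and a prediction unit 31, where:
the computing unit 30 is configured to perform, for any one non-volatile storage medium of at least two non-volatile storage media in a data center: calculating a status value of the medium, where the status value characterizes the operating condition of the medium; and
the prediction unit 31 is configured to predict that the medium will fail when determining that the status value is smaller than an initial preset fault threshold corresponding to the medium;
where any two non-volatile storage media with different status values correspond to different initial preset fault thresholds.
Further, in this embodiment of the present invention, the apparatus includes a determining unit, a judging unit, and a takeover unit:
the determining unit is configured to determine the total number of non-volatile storage media predicted to fail;
the judging unit is configured to determine that the total number is less than or equal to the number of hot spare non-volatile storage media in the data center; and
the takeover unit is configured to, when the judging unit determines that the total number is less than or equal to the number of hot spare media in the data center, use hot spare media from among them to take over the work of all the media predicted to fail;
where the number of hot spare media that take over the work equals the total number of media predicted to fail.
Further, in this embodiment of the present invention, the determining unit is further configured to perform, for each non-volatile storage medium, when determining that the total number is greater than the number of hot spare media in the data center:
lowering the initial preset fault threshold corresponding to the medium to obtain a first preset fault threshold;
the prediction unit 31 is configured to further predict that the medium will fail when determining that the status value of the medium is smaller than the first preset fault threshold corresponding to the medium; and
the takeover unit is configured to, when determining that the total number of media predicted to fail according to their respective first preset fault thresholds is less than or equal to the number of hot spare media in the data center, use the hot spare media to take over from all the media predicted to fail according to their respective first preset fault thresholds.
Optionally, in this embodiment of the present invention, for any two non-volatile storage media among all the non-volatile storage media in the data center, the initial preset fault thresholds corresponding to the two media are lowered by the same magnitude.
Further, in this embodiment of the present invention, the apparatus includes a replenishing unit, configured to replenish a preset number of hot spare non-volatile storage media, and to raise the first preset fault threshold corresponding to each non-volatile storage medium in the data center whose initial preset fault threshold was lowered, to obtain a second preset fault threshold;
the prediction unit 31 is further configured to, for any medium whose first preset fault threshold was raised, predict that the medium will fail when the status value of the medium is smaller than the second preset fault threshold corresponding to the medium;
the judging unit is further configured to determine that the total number of media predicted to fail according to their respective second preset fault thresholds is less than or equal to the preset number of replenished hot spare media; and
the takeover unit is further configured to, when the judging unit determines that the total number of media predicted to fail according to their respective second preset fault thresholds is less than or equal to the preset number of replenished hot spare media, use the replenished hot spare media to take over from all the media predicted to fail according to their respective second preset fault thresholds;
where, for each non-volatile storage medium that has a second preset fault threshold, the second preset fault threshold is less than or equal to the initial preset fault threshold of that medium.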
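FIG. 3B is another schematic structural diagram of an apparatus for predicting a failure of a non-volatile storage medium according to an embodiment of the present invention. The apparatus includes at least one processor 301, a communication bus 302, a memory 303, and at least one communication interface 304.
The communication bus 302 is configured to connect the foregoing components and carry communication between them, and the communication interface 304 is configured to connect to and communicate with external devices.
The memory 303 is configured to store executable program code, and the processor 301 executes the program code to:
perform, for any one non-volatile storage medium of at least two non-volatile storage media in a data center:
calculating a status value of the medium, where the status value characterizes the operating condition of the medium; and
when determining that the status value is smaller than an initial preset fault threshold corresponding to the medium, predicting that the medium will fail;
where any two non-volatile storage media with different status values correspond to different initial preset fault thresholds.
Further, in this embodiment of the present invention, the processor 301 is further configured to perform the following after predicting that the medium will fail:
determining the total number of non-volatile storage media predicted to fail; and
when determining that the total number is less than or equal to the number of hot spare media in the data center, using hot spare media from among them to take over the work of all the media predicted to fail;
where the number of hot spare media that take over the work equals the total number of media predicted to fail.
Further, in this embodiment of the present invention, the processor 301 is further configured to perform the following after determining the total number of non-volatile storage media predicted to fail:
when determining that the total number is greater than the number of hot spare media in the data center, performing, for each non-volatile storage medium:
lowering the initial preset fault threshold corresponding to the medium to obtain a first preset fault threshold, and
when determining that the status value of the medium is smaller than the first preset fault threshold corresponding to the medium, further predicting that the medium will fail; and
when determining that the total number of media predicted to fail according to their respective first preset fault thresholds is less than or equal to the number of hot spare media in the data center, using the hot spare media to take over from all the media predicted to fail according to their respective first preset fault thresholds.
Optionally, in this embodiment of the present invention, the processor 301 is further configured so that, for any two non-volatile storage media among all the non-volatile storage media in the data center, the initial preset fault thresholds corresponding to the two media are lowered by the same magnitude.
Further, in this embodiment of the present invention, the processor 301 is further configured to perform the following after the hot spare media take over from all the media predicted to fail according to their respective first preset fault thresholds:
replenishing a preset number of hot spare non-volatile storage media;
raising the first preset fault threshold corresponding to each non-volatile storage medium in the data center whose initial preset fault threshold was lowered, to obtain a second preset fault threshold;
for any medium whose first preset fault threshold was raised, when the status value of the medium is smaller than the second preset fault threshold corresponding to the medium, predicting that the medium will fail; and
when determining that the total number of media predicted to fail according to their respective second preset fault thresholds is less than or equal to the preset number of replenished hot spare media, using the replenished hot spare media to take over from all the media predicted to fail according to their respective second preset fault thresholds;
where, for each non-volatile storage medium that has a second preset fault threshold, the second preset fault threshold is less than or equal to the initial preset fault threshold of that medium.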
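The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present invention. It should be understood that computer program instructions can implement each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in them. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction apparatus, where the instruction apparatus implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Although preferred embodiments of the present invention have been described, persons skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, persons skilled in the art can make various changes and variations to the embodiments of the present invention without departing from the spirit and scope of the embodiments. The present invention is intended to cover these changes and variations provided that they fall within the scope of the claims of the present invention and their equivalent technologies.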

Claims (10)

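  1. A method for predicting a failure of a non-volatile storage medium, comprising:
    for any one non-volatile storage medium of at least two non-volatile storage media in a data center, performing:
    calculating a status value of the non-volatile storage medium, wherein the status value characterizes the operating condition of the non-volatile storage medium; and
    when determining that the status value is smaller than an initial preset fault threshold corresponding to the non-volatile storage medium, predicting that the non-volatile storage medium will fail;
    wherein any two non-volatile storage media with different status values correspond to different initial preset fault thresholds.
  2. The method according to claim 1, wherein after predicting that the non-volatile storage medium will fail, the method further comprises:
    determining the total number of non-volatile storage media predicted to fail; and
    when determining that the total number is less than or equal to the number of hot spare non-volatile storage media in the data center, using hot spare non-volatile storage media from among them to take over the work of all the non-volatile storage media predicted to fail;
    wherein the number of hot spare non-volatile storage media that take over the work equals the total number of non-volatile storage media predicted to fail.
  3. The method according to claim 2, wherein after determining the total number of non-volatile storage media predicted to fail, the method further comprises:
    when determining that the total number is greater than the number of hot spare non-volatile storage media in the data center, performing, for each non-volatile storage medium:
    lowering the initial preset fault threshold corresponding to the non-volatile storage medium to obtain a first preset fault threshold, and
    when determining that the status value of the non-volatile storage medium is smaller than the first preset fault threshold corresponding to the non-volatile storage medium, further predicting that the non-volatile storage medium will fail; and
    when determining that the total number of non-volatile storage media predicted to fail according to their respective first preset fault thresholds is less than or equal to the number of hot spare non-volatile storage media in the data center, using the hot spare non-volatile storage media to take over from all the non-volatile storage media predicted to fail according to their respective first preset fault thresholds.
  4. The method according to claim 3, wherein for any two non-volatile storage media among all the non-volatile storage media in the data center, the initial preset fault thresholds corresponding to the two non-volatile storage media are lowered by the same magnitude.
  5. The method according to claim 3 or 4, wherein after using the hot spare non-volatile storage media to take over from all the non-volatile storage media predicted to fail according to their respective first preset fault thresholds, the method further comprises:
    replenishing a preset number of hot spare non-volatile storage media;
    raising the first preset fault threshold corresponding to each non-volatile storage medium in the data center whose initial preset fault threshold was lowered, to obtain a second preset fault threshold;
    for any non-volatile storage medium whose first preset fault threshold was raised, when the status value of the non-volatile storage medium is smaller than the second preset fault threshold corresponding to the non-volatile storage medium, predicting that the non-volatile storage medium will fail; and
    when determining that the total number of non-volatile storage media predicted to fail according to their respective second preset fault thresholds is less than or equal to the preset number of replenished hot spare non-volatile storage media, using the replenished hot spare non-volatile storage media to take over from all the non-volatile storage media predicted to fail according to their respective second preset fault thresholds;
    wherein, for each non-volatile storage medium that has a second preset fault threshold, the second preset fault threshold is less than or equal to the initial preset fault threshold of that non-volatile storage medium.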
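  6. An apparatus for predicting a failure of a non-volatile storage medium, comprising:
    a computing unit, configured to perform, for any one non-volatile storage medium of at least two non-volatile storage media in a data center: calculating a status value of the non-volatile storage medium, wherein the status value characterizes the operating condition of the non-volatile storage medium; and
    a prediction unit, configured to predict that the non-volatile storage medium will fail when determining that the status value is smaller than an initial preset fault threshold corresponding to the non-volatile storage medium;
    wherein any two non-volatile storage media with different status values correspond to different initial preset fault thresholds.
  7. The apparatus according to claim 6, further comprising a determining unit, a judging unit, and a takeover unit, wherein:
    the determining unit is configured to determine the total number of non-volatile storage media predicted to fail;
    the judging unit is configured to determine that the total number is less than or equal to the number of hot spare non-volatile storage media in the data center; and
    the takeover unit is configured to, when the judging unit determines that the total number is less than or equal to the number of hot spare non-volatile storage media in the data center, use hot spare non-volatile storage media from among them to take over the work of all the non-volatile storage media predicted to fail;
    wherein the number of hot spare non-volatile storage media that take over the work equals the total number of non-volatile storage media predicted to fail.
  8. The apparatus according to claim 7, wherein the determining unit is further configured to perform, for each non-volatile storage medium, when determining that the total number is greater than the number of hot spare non-volatile storage media in the data center:
    lowering the initial preset fault threshold corresponding to the non-volatile storage medium to obtain a first preset fault threshold;
    the prediction unit is configured to further predict that the non-volatile storage medium will fail when determining that the status value of the non-volatile storage medium is smaller than the first preset fault threshold corresponding to the non-volatile storage medium; and
    the takeover unit is configured to, when determining that the total number of non-volatile storage media predicted to fail according to their respective first preset fault thresholds is less than or equal to the number of hot spare non-volatile storage media in the data center, use the hot spare non-volatile storage media to take over from all the non-volatile storage media predicted to fail according to their respective first preset fault thresholds.
  9. The apparatus according to claim 8, wherein for any two non-volatile storage media among all the non-volatile storage media in the data center, the initial preset fault thresholds corresponding to the two non-volatile storage media are lowered by the same magnitude.
  10. The apparatus according to claim 8 or 9, further comprising a replenishing unit, configured to replenish a preset number of hot spare non-volatile storage media, and to raise the first preset fault threshold corresponding to each non-volatile storage medium in the data center whose initial preset fault threshold was lowered, to obtain a second preset fault threshold;
    the prediction unit is further configured to, for any non-volatile storage medium whose first preset fault threshold was raised, predict that the non-volatile storage medium will fail when the status value of the non-volatile storage medium is smaller than the second preset fault threshold corresponding to the non-volatile storage medium;
    the judging unit is further configured to determine that the total number of non-volatile storage media predicted to fail according to their respective second preset fault thresholds is less than or equal to the preset number of replenished hot spare non-volatile storage media; and
    the takeover unit is further configured to, when the judging unit determines that the total number of non-volatile storage media predicted to fail according to their respective second preset fault thresholds is less than or equal to the preset number of replenished hot spare non-volatile storage media, use the replenished hot spare non-volatile storage media to take over from all the non-volatile storage media predicted to fail according to their respective second preset fault thresholds;
    wherein, for each non-volatile storage medium that has a second preset fault threshold, the second preset fault threshold is less than or equal to the initial preset fault threshold of that non-volatile storage medium.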
PCT/CN2015/096690 2014-12-25 2015-12-08 Method and apparatus for predicting a failure of a non-volatile storage medium WO2016101786A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410822384.5 2014-12-25
CN201410822384.5A CN105787242B (zh) 2014-12-25 2014-12-25 Method and apparatus for predicting a failure of a non-volatile storage medium

Publications (1)

Publication Number Publication Date
WO2016101786A1 true WO2016101786A1 (zh) 2016-06-30

Family

ID=56149223

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/096690 WO2016101786A1 (zh) 2014-12-25 2015-12-08 Method and apparatus for predicting a failure of a non-volatile storage medium

Country Status (2)

Country Link
CN (2) CN105787242B (zh)
WO (1) WO2016101786A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060034008A1 (en) * 2004-08-02 2006-02-16 Hitachi Global Storage Technologies Netherlands B.V. Failure prediction method for magnetic disk devices, and a magnetic disk device using the same
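CN101201786A (zh) * 2006-12-13 2008-06-18 中兴通讯股份有限公司 Fault log monitoring method and apparatus
CN101872641A (zh) * 2009-12-28 2010-10-27 杭州海康威视数字技术股份有限公司 Hard disk failure early-warning method and apparatus for a hard disk video recorder
CN102129397A (zh) * 2010-12-29 2011-07-20 深圳市永达电子股份有限公司 Adaptive disk array failure prediction method and system
CN103197995A (zh) * 2012-01-04 2013-07-10 百度在线网络技术(北京)有限公司 Hard disk fault detection method and apparatus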

Family Cites Families (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6574754B1 (en) * 2000-02-14 2003-06-03 International Business Machines Corporation Self-monitoring storage device using neural networks
US7480828B2 (en) * 2004-06-10 2009-01-20 International Business Machines Corporation Method, apparatus and program storage device for extending dispersion frame technique behavior using dynamic rule sets
CN100498961C (zh) * 2004-07-01 2009-06-10 华为技术有限公司 Hard disk detection apparatus and method
US7523359B2 (en) * 2005-03-31 2009-04-21 International Business Machines Corporation Apparatus, system, and method for facilitating monitoring and responding to error events
US7376499B2 (en) * 2005-09-16 2008-05-20 Gm Global Technology Operations, Inc. State-of-health monitoring and fault diagnosis with adaptive thresholds for integrated vehicle stability system
US7627405B2 (en) * 2006-11-17 2009-12-01 Gm Global Technology Operations, Inc. Prognostic for loss of high-voltage isolation
CN101604548B (zh) * 2009-03-26 2012-06-27 成都市华为赛门铁克科技有限公司 Solid-state drive and data storage method
CN101764846B (zh) * 2009-12-18 2012-07-11 西南交通大学 Method for implementing a remote centralized disk array operation monitoring system
US20120102367A1 (en) * 2010-10-26 2012-04-26 International Business Machines Corporation Scalable Prediction Failure Analysis For Memory Used In Modern Computers
CN102033717B (zh) * 2010-12-07 2013-05-08 清华大学 Disk-array-based data storage method and system
US9146855B2 (en) * 2012-01-09 2015-09-29 Dell Products Lp Systems and methods for tracking and managing non-volatile memory wear
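CN103580934B (zh) * 2012-07-18 2018-09-04 深圳市腾讯计算机系统有限公司 Cloud service monitoring method and apparatus
CN103455397A (zh) * 2013-09-06 2013-12-18 杭州华为数字技术有限公司 System self-check method, device, and system
CN104020963B (zh) * 2014-06-04 2017-05-17 浙江宇视科技有限公司 Method and apparatus for preventing misjudgment of hard disk read/write errors
CN104092440B (zh) * 2014-07-21 2017-07-28 阳光电源股份有限公司 DC arc fault detection method, apparatus, processor, and system for a photovoltaic system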

Also Published As

Publication number Publication date
CN105787242A (zh) 2016-07-20
CN109933448B (zh) 2021-04-20
CN109933448A (zh) 2019-06-25
CN105787242B (zh) 2019-02-26

Similar Documents

Publication Publication Date Title
US9733844B2 (en) Data migration method, data migration apparatus, and storage device
CN105843699B (zh) Dynamic random access memory device and method for error monitoring and correction
US8984335B2 (en) Core diagnostics and repair
CN109284207A (zh) Hard disk fault handling method, apparatus, server, and computer-readable medium
US9734015B2 (en) Pre-boot self-healing and adaptive fault isolation
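WO2021135272A1 (zh) Memory exception handling method, system, electronic device, and storage medium
CN110908838B (zh) Data processing method and apparatus, electronic device, and storage medium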
US10176065B2 (en) Intelligent failure prediction and redundancy management in a data storage system
CN110704228B (zh) Solid-state drive exception handling method and system
CN106371807A (zh) Method and apparatus for extending a processor instruction set
US20140095921A1 (en) Information processing apparatus, startup program, and startup method
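JP2006079418A (ja) Storage control device, control method, and program
CN107967195A (zh) Fault repair method and system based on dual-controller storage
CN106168918A (zh) Extended error correction coded data storage
JP6794805B2 (ja) Failure information management program, startup test method, and parallel processing device
JP5918661B2 (ja) Equipment diagnosis device and setting change prompting method
CN109271270A (zh) Troubleshooting method, system, and related apparatus for underlying hardware in a storage system
WO2016101786A1 (zh) Method and apparatus for predicting a failure of a non-volatile storage medium
CN107861829A (zh) Disk fault detection method, system, apparatus, and storage medium
CN103890713A (zh) Apparatus and method for managing register information within a processing system
CN104571943A (zh) Method and system for controlling a non-volatile data storage subsystem
JP2011034219A (ja) Failure detection method and monitoring device
CN111130856A (zh) Server configuration method, system, device, and computer-readable storage medium
CN113590287B (zh) Task processing method, apparatus, device, storage medium, and scheduling system
CN109542687B (zh) RAID level conversion method and apparatus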

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15871855

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15871855

Country of ref document: EP

Kind code of ref document: A1