CN109658975A - Active data repair method and system for erasure codes - Google Patents

Active data repair method and system for erasure codes

Info

Publication number
CN109658975A
Authority
CN
China
Prior art keywords
hard disk
disk
time
hard
fault
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811300732.7A
Other languages
Chinese (zh)
Other versions
CN109658975B (en)
Inventor
杨雅辉
杨洪章
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN201811300732.7A
Publication of CN109658975A
Application granted
Publication of CN109658975B
Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/38Response verification devices
    • G11C29/42Response verification devices using error correcting codes [ECC] or parity check
    • GPHYSICS
    • G11INFORMATION STORAGE
    • G11CSTATIC STORES
    • G11C29/00Checking stores for correct operation ; Subsequent repair; Testing stores during standby or offline operation
    • G11C29/04Detection or location of defective memory elements, e.g. cell construction details, timing of test signals
    • G11C29/08Functional testing, e.g. testing during refresh, power-on self testing [POST] or distributed testing
    • G11C29/12Built-in arrangements for testing, e.g. built-in self testing [BIST] or interconnection details
    • G11C29/44Indication or identification of errors, e.g. for repair
    • G11C29/4401Indication or identification of errors, e.g. for repair for self repair

Abstract

The present invention discloses an active data repair method and system for erasure codes. The method includes: periodically collecting information from each designated hard disk, extracting features from it and labeling it to obtain a training dataset; training a hard disk fault degree identification model on the training dataset; collecting information from each hard disk in an erasure-coded system, extracting features, and feeding them into the hard disk fault degree identification model to obtain the fault degree of each hard disk; and, if the fault degree of a hard disk A is higher than a fault threshold x, starting active repair of hard disk A: traversing all data blocks in hard disk A and reading the hard disk serial numbers of the stripes they belong to, the hard disks with those serial numbers jointly taking part in the repair as disk group B; then repairing all data blocks in set T from hard disk A to a disk C that has the lowest fault degree, and repairing all data blocks in set R from disk group B to disk C by encoding/decoding computation.

Description

Active data repair method and system for erasure codes
Technical field
The present invention relates to the field of storage reliability, and specifically to an active data repair method and system for erasure codes.
Background
Replication (Replica) and erasure coding (Erasure Code) are the two most common fault tolerance (Fault Tolerance) methods for storage systems. The advantage of replication is that data repair (recovery) is simple and the degraded period is short: when a disk fails, the contents of another replica are copied directly to a healthy disk. Its disadvantage is high cost. The advantage of erasure coding is lower cost; its disadvantage is that data repair is complicated and the degraded period is long: when a disk fails, the lost (damaged) data must be reconstructed on a healthy disk through encoding/decoding operations. Both are traditional passive fault tolerance methods, that is, data repair takes place only after a disk has failed. At that point the system enters a degraded state and normal read/write service is affected. In the degraded state the system must concentrate on repairing data; in a typical two-copy configuration, for instance, if another disk fails, the system faces the risk of permanent data loss. Clearly, the longer the system stays degraded, the higher the risk that another disk fails and the lower the reliability of the system.
In recent years, active fault tolerance methods (Chinese invention patents CN102521058, CN107391301, CN108205424, CN102981930) have become a new research focus and technology trend. Existing active fault tolerance methods copy the data on a near-failure disk (pre-failure disk) to a healthy disk, repairing the data before the disk fails and so keeping the system out of the degraded state. However, the prior art struggles to meet real commercial needs, mainly because: (1) The prior art lets the near-failure disk carry the entire data repair on its own. Intensive, continuous data access within a short time accelerates the failure, so the disk often fails before the data has been fully copied; the system still enters the degraded state and must fall back on traditional passive fault tolerance, so active fault tolerance has little effect. (2) The prior art can predict to some extent whether a hard disk will fail, but its conclusion is only "good disk" or "failed disk"; it cannot predict when the failure will occur (that is, how severe the fault is). When several near-failure disks are predicted at the same time, it therefore cannot assign a reasonable repair priority. (3) The prior art cannot achieve 100% failure prediction accuracy, and when a misjudgment occurs it cannot correct the error promptly; it must wait for some time and then update the prediction model with newly collected system metrics. This lag makes predictions inaccurate and the timing of active repair inappropriate.
Summary of the invention
Technical problem to be solved by the invention: existing active fault tolerance techniques cannot effectively avoid system degradation, lack repair priorities, and time repairs poorly; as a result they perform poorly and struggle to meet practical commercial needs.
In view of these problems in the prior art, the purpose of the present invention is to provide an active data repair method and system for erasure codes. The main innovations are: (1) Repair is carried out jointly by multiple disks, combining copying with encoding/decoding computation, which speeds up repair, effectively keeps the system out of the degraded state, and balances network transmission load. (2) A hard disk fault degree model is established to control the timing of active repair precisely, so that when several near-failure disks are predicted at the same time, the most severe one is handled first. (3) A feedback mechanism adjusts parameters promptly when active repair is triggered at an unreasonable or inappropriate time (including, but not limited to, the case where the hard disk still has remaining life after repair and the case where the disk fails before repair completes), so that the remaining life of a near-failure disk is used precisely while system degradation is still avoided. Taken together, these innovations allow the invention to achieve the ideal effect of active fault tolerance: the near-failure disk is predicted accurately and the last part of its life is spent entirely on data migration, so the system avoids degradation and no hard disk life is wasted.
Remark 1: In a replication system, both active and passive fault tolerance are direct copies, so active fault tolerance has no obvious advantage. In an erasure-coded system, active fault tolerance can copy directly whereas passive fault tolerance requires encoding/decoding computation, so active fault tolerance has a clear advantage and is more significant for erasure-coded systems. The present invention therefore addresses only erasure-coded systems. It is applicable to all erasure code algorithms, including but not limited to RS codes, LRC codes, SHEC codes, MSR codes, MBR codes, and Hitchhiker-XOR codes.
Remark 2: The hard disks referred to in the present invention include, but are not limited to, mechanical hard disks and solid-state drives. Applicable hardware interfaces include, but are not limited to, SATA, SAS, PCIe, M.2, and U.2.
Technical solution of the present invention:
An active data repair method for erasure codes, comprising the steps of:
1) within a set period of time, periodically collecting information from each designated hard disk;
2) extracting features from the information collected for each hard disk at each time point to generate one sample;
3) labeling the samples of each hard disk according to the actual state of that disk to obtain a training dataset; the labels comprise healthy disk, failed disk, and near-failure disk, where a healthy disk corresponds to a label value a, a failed disk corresponds to a label value b, and a near-failure disk corresponds to multiple label values between a and b; if the sample taken when the disk state is failed was collected at time t, the samples of that hard disk within a set period before time t are labeled near-failure, and the closer a sample is to time t, the closer its label value is to the failed-disk label value; the sample taken when the disk state is failed is labeled with the failed-disk value b, and the remaining samples are labeled with the healthy-disk value a;
4) setting a fault threshold x and a repair ratio y, the share of data repair undertaken by the near-failure disk, and training on the training dataset to obtain a hard disk fault degree identification model;
5) collecting information from each hard disk in the erasure-coded system, extracting features, and feeding them into the hard disk fault degree identification model to obtain the fault degree of each hard disk;
6) if there is a hard disk A whose fault degree is higher than the fault threshold x, starting active repair of hard disk A, that is, performing steps 7) and 8); and selecting a hard disk with the lowest fault degree, denoted disk C;
7) traversing all data blocks in hard disk A and reading the hard disk serial numbers of the stripes they belong to; the hard disks with these serial numbers jointly take part in the repair and are called disk group B; the number of hard disks in disk group B is determined by the erasure code algorithm used by the erasure-coded system;
8) repairing all data blocks in set T from hard disk A to disk C, and repairing all data blocks in set R from disk group B to disk C by encoding/decoding computation; where hard disk A has p data blocks to be repaired in total, hard disk A undertakes the repair of p*y data blocks, denoted set T, and disk group B undertakes the repair of p-p*y data blocks, denoted set R.
Further, after the active repair of hard disk A is complete, hard disk A is written continuously in overwrite mode until it fails or the write time reaches the remaining-life threshold z, and the write time h is recorded; if h=z, the fault degree of hard disk A is judged to have been misidentified, and this is fed back to the hard disk fault degree identification model; if h>z/n, the fault threshold x in the hard disk fault degree identification model is raised and/or the repair ratio threshold y is raised; if h<z/n, the parameters of the hard disk fault degree identification model are kept unchanged; n is an integer greater than 1.
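The feedback rule above can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name and the returned action strings are hypothetical, the h=z case is checked first because it also satisfies h>z/n, and the boundary h=z/n, which the text does not cover, is treated here as "keep parameters".

```python
def feedback_action(h: float, z: float, n: int) -> str:
    """Map the recorded write time h to a feedback action (hypothetical names).

    h: write time until disk A failed or the limit z was reached
    z: remaining-life threshold
    n: integer greater than 1
    """
    if n <= 1:
        raise ValueError("n must be an integer greater than 1")
    if h >= z:
        # The disk survived the whole window: the fault degree was misidentified.
        return "report_misidentification"
    if h > z / n:
        # The disk still had substantial life left: repair started too early.
        return "raise_x_and_or_y"
    # Little life was wasted (h = z/n is an assumption of this sketch).
    return "keep_parameters"
```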
Further, the hard disk information includes hard disk SMART information; the features include one or a combination of the following: raw read error rate, average spin-up time, reallocated sector count, seek error rate, power-on hours, count of uncorrectable hardware errors, count of write failures caused by excessive head flying height, internal disk temperature, count of errors recovered by hardware ECC, and count of unstable (pending) sectors.
Further, the forms of the hard disk SMART information include, but are not limited to: the current raw value, the current normalized value, the difference between the current raw value and the raw value a time L earlier, and the difference between the current normalized value and the normalized value a time L earlier; where time L is an integer multiple of time N, and N is the information collection interval.
Further, if another hard disk in the erasure-coded system fails while hard disk A is being actively repaired, the erasure-coded system enters the degraded state, the active repair of hard disk A is suspended, and passive fault tolerance is used until the degraded state ends; if hard disk A itself fails during active repair, the set S of data blocks in set T whose repair has not been completed is determined, the data blocks of set S are handed over to disk group B for repair, and the fault threshold x and the repair ratio threshold y are lowered.
Further, after the active repair of hard disk A is complete, all metadata that refers to hard disk A is updated to refer to disk C; in step 6), if the fault degrees of multiple hard disks are higher than the fault threshold x, they are handled one by one in order of fault degree, and the disk with the highest fault degree is taken as hard disk A and repaired first.
An active data repair system for erasure codes, characterized by comprising an information collection module, a data processing module, a hard disk fault degree model training module, and an active repair module; wherein:
the information collection module is configured to periodically collect information from each hard disk;
the data processing module is configured to extract features from the information collected for each hard disk at each time point to generate one sample, and to label the samples of each hard disk according to the actual state of that disk to obtain a training dataset; the labels comprise healthy disk, failed disk, and near-failure disk, where a healthy disk corresponds to a label value a, a failed disk corresponds to a label value b, and a near-failure disk corresponds to multiple label values between a and b; if the sample taken when the disk state is failed was collected at time t, the samples of that hard disk within a set period before time t are labeled near-failure, and the closer a sample is to time t, the closer its label value is to the failed-disk label value; the sample taken when the disk state is failed is labeled with the failed-disk value b, and the remaining samples are labeled with the healthy-disk value a;
the hard disk fault degree model training module is configured to train on the training dataset, according to the set fault threshold x and the repair ratio y undertaken by the near-failure disk, to obtain a hard disk fault degree identification model;
the active repair module is configured to use the hard disk fault degree identification model to identify the fault degree of each hard disk in the erasure-coded system and, if there is a hard disk A whose fault degree is higher than the fault threshold x, to start active repair of hard disk A and select a hard disk with the lowest fault degree, denoted disk C; the active repair method is: traversing all data blocks in hard disk A and reading the hard disk serial numbers of the stripes they belong to, the hard disks with these serial numbers jointly taking part in the repair as disk group B; repairing all data blocks in set T from hard disk A to disk C, and repairing all data blocks in set R from disk group B to disk C by encoding/decoding computation; where hard disk A has p data blocks to be repaired in total, hard disk A undertakes the repair of p*y data blocks, denoted set T, and disk group B undertakes the repair of p-p*y data blocks, denoted set R; the number of hard disks in disk group B is determined by the erasure code algorithm used by the erasure-coded system.
Further, after the active repair module completes the active repair of hard disk A, hard disk A is written continuously in overwrite mode until it fails or the write time reaches the remaining-life threshold z, and the write time h is recorded; if h=z, the fault degree of hard disk A is judged to have been misidentified, and this is fed back to the hard disk fault degree identification model; if h>z/n, the fault threshold x in the hard disk fault degree identification model is raised and/or the repair ratio threshold y is raised; if h<z/n, the parameters of the hard disk fault degree identification model are kept unchanged; n is an integer greater than 1.
Further, the hard disk information includes hard disk SMART information; the features include one or a combination of the following: raw read error rate, average spin-up time, reallocated sector count, seek error rate, power-on hours, count of uncorrectable hardware errors, count of write failures caused by excessive head flying height, internal disk temperature, count of errors recovered by hardware ECC, and count of unstable (pending) sectors; the forms of the hard disk SMART information include, but are not limited to: the current raw value, the current normalized value, the difference between the current raw value and the raw value a time L earlier, and the difference between the current normalized value and the normalized value a time L earlier; where time L is an integer multiple of time N, and N is the information collection interval.
Further, the system also comprises a system metadata module configured to update all metadata that refers to hard disk A so that it refers to disk C; if the fault degrees of multiple hard disks are higher than the fault threshold x, they are handled one by one in order of fault degree, and the disk with the highest fault degree is taken as hard disk A and repaired first.
The active data repair method for erasure codes of the present invention comprises the following steps:
Step 1: Information collection.
Collect hard disk and system information at time interval N.
The time interval N may be fixed (typical values of N include, but are not limited to, 1 second, 1 minute, 30 minutes, 1 hour, 3 hours, 6 hours, 12 hours, 24 hours, 48 hours, 72 hours, and 1 week) or variable (for example, collecting less often during peak business hours and more often when the system is idle). Preferably, information is collected densely at a fixed interval, with N no greater than 3 hours.
The collected information must include at least hard disk SMART information; preferably, it may also include hard disk capacity, hard disk I/O information, CPU information, memory information, and so on.
Collection tools include, but are not limited to, the open-source package smartmontools, the fdisk command, the iostat command, /proc/stat, and vmstat.
Collection modes include, but are not limited to, automatic collection, manual collection, and a combination of automatic and manual collection.
Step 2: Data processing.
Step 2.1: Data preprocessing.
Take the information collected in step 1 over a continuous period M as the raw dataset. Assuming the system has g hard disks, there are g*M/N samples in total, which are then preprocessed. Preferably, the continuous period M is at least 2 weeks.
Preprocessing methods include, but are not limited to: deleting incomplete, abnormal, or erroneous samples, and filling in missing data for incomplete samples (using the mode, the mean, the value at the previous time point, the value at the next time point, the median of the surrounding time points, and so on).
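As a rough illustration of step 2.1, the sketch below computes the g*M/N sample count and applies one of the listed fill strategies (taking the value at the previous time point). The function names are hypothetical, hours are an assumed unit, and leaving a leading gap unfilled is a choice of this sketch rather than something the text specifies.

```python
from typing import List, Optional


def expected_sample_count(g: int, m_hours: int, n_hours: int) -> int:
    """Number of raw samples: g disks, period M, interval N (both in hours)."""
    return g * (m_hours // n_hours)


def fill_previous(series: List[Optional[float]]) -> List[Optional[float]]:
    """Fill each missing value (None) with the most recent earlier value.

    A missing value at the very start has no predecessor and stays None.
    """
    filled: List[Optional[float]] = []
    last: Optional[float] = None
    for value in series:
        if value is None:
            filled.append(last)
        else:
            filled.append(value)
            last = value
    return filled
```

For example, 1000 disks sampled every 3 hours for 2 weeks (336 hours) give 1000 * 336 / 3 = 112000 samples.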
Step 2.2: Feature selection.
SMART attribute IDs: select one or a combination of at least the following 10 SMART attributes: #1 (Raw Read Error Rate), #3 (Spin Up Time, the average spin-up time of the platters), #5 (Reallocated Sector Count, the number of remapped sectors), #7 (Seek Error Rate), #9 (Power On Hours), #187 (Reported Uncorrectable Errors, the count of errors the hardware could not correct), #189 (High Fly Write, the count of write failures caused by excessive head flying height), #194 (Temperature Celsius, the internal temperature of the disk), #195 (Hardware ECC Recovered, the count of errors recovered by hardware ECC), and #197 (Current Pending Sector Count, the number of unstable sectors). SMART information is the real-time status of various hard disk indicators provided by the manufacturer. Some attributes are correlated with disk health and some are not: for example, before a disk is about to fail, #3 fluctuates noticeably and #1 rises sharply, whereas #11 and #14 never change. The present invention therefore selects the features that can indicate disk failure, which improves prediction accuracy while reducing algorithm complexity. Because SMART attribute IDs differ slightly between production years and vendors, similar selections and combinations should also fall within the scope of the present invention.
The forms of the selected SMART information include, but are not limited to: the current raw value, the current normalized value, the difference between the current raw value and the raw value a time L earlier, and the difference between the current normalized value and the normalized value a time L earlier. Time L is an integer multiple of time N.
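The four forms listed above can be derived from a per-attribute time series as in the sketch below. The function and key names are hypothetical, and L is expressed as a number of collection intervals (`steps_back`, so L = steps_back * N).

```python
from typing import Dict, List


def smart_feature_forms(raw: List[float], norm: List[float],
                        steps_back: int) -> Dict[str, float]:
    """Build the four feature forms for one SMART attribute.

    raw, norm: chronological raw and normalized values, one per interval N.
    steps_back: L / N, the number of intervals to look back (at least 1).
    """
    if steps_back < 1 or len(raw) != len(norm) or len(raw) <= steps_back:
        raise ValueError("need matching series longer than steps_back")
    return {
        "raw": raw[-1],
        "norm": norm[-1],
        "raw_delta": raw[-1] - raw[-1 - steps_back],
        "norm_delta": norm[-1] - norm[-1 - steps_back],
    }
```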
Preferably, step 2.3: Numerical normalization.
Features with a large range of variation may carry disproportionate weight during training, so numerical normalization is needed to put the features on a comparable footing.
Let a be the value of a feature before normalization, a_normal the normalized value, and max a and min a the maximum and minimum values of that feature in the dataset. Normalization methods include, but are not limited to:
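The normalization formula itself is missing from this text. Given the symbols defined above (a, max a, min a), one common choice consistent with them is min-max scaling to [0, 1]; the sketch below shows that choice as an assumption, not as the formula the patent actually gives.

```python
from typing import List


def min_max_normalize(values: List[float]) -> List[float]:
    """Scale one feature to [0, 1]: (a - min a) / (max a - min a).

    Assumed formula; a constant feature is mapped to 0.0 to avoid
    division by zero.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```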
Optionally, step 2.4: Sample merging.
Samples may take forms including, but not limited to: (1) The information for a single hard disk at a single time point forms one sample, and multiple consecutive samples are used to judge whether the disk is near failure. (2) The information for a single hard disk at multiple consecutive time points is merged into one sample, using one or a combination of the mean, variance, standard deviation, mode, maximum, minimum, entropy, and so on; a single sample then summarizes multiple time points and is used to judge whether the disk is near failure.
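Form (2) can be illustrated by collapsing a window of consecutive values of one feature into summary statistics. The function name, the choice of population (rather than sample) variance, and the subset of statistics shown are assumptions of this sketch.

```python
import statistics
from typing import Dict, List


def merge_samples(window: List[float]) -> Dict[str, float]:
    """Merge consecutive values of one feature into a single summary sample."""
    return {
        "mean": statistics.fmean(window),
        "variance": statistics.pvariance(window),  # population variance
        "std": statistics.pstdev(window),
        "max": max(window),
        "min": min(window),
    }
```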
Step 2.5: Data labeling.
Label the samples according to the actual state of each hard disk. Labeling methods include, but are not limited to: (1) Samples of healthy disks are labeled 0. (2) Samples of a failed disk taken more than time K before the failure are labeled 0. (3) Samples of a failed disk taken within time K before the failure are labeled 1 or, preferably, are given label values evenly distributed between 0 and u (u is usually 1). The closer a sample is to the moment of failure, the higher the corresponding fault degree and the closer the state value is to 1.
A typical value of K is 10 days. Typical values of u are 0.5 and 1.
For example, samples of a failed disk taken 10 days or more before the failure have state 0, samples taken 1 day before the failure have state 1, samples taken 5 days before the failure have state 0.5, and so on.
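A minimal sketch of the graded labeling, assuming a linear ramp u*(K-d)/K for a sample taken d days before failure. The function name and the linear form are assumptions; the text's example assigns 1 to a sample one day before failure, whereas this ramp gives 0.9 there, so the exact discretization intended by the patent may differ.

```python
from typing import Optional


def label_sample(days_before_failure: Optional[float],
                 k_days: float = 10.0, u: float = 1.0) -> float:
    """Return the label value for one sample.

    days_before_failure: None for a disk that never failed, otherwise the
    time in days between the sample and the failure.
    """
    if days_before_failure is None or days_before_failure >= k_days:
        return 0.0  # healthy disk, or too far from the failure
    if days_before_failure < 0:
        raise ValueError("sample must not be later than the failure")
    # Linear ramp: 0 at K days before failure, u at the failure itself.
    return u * (k_days - days_before_failure) / k_days
```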
Remark: Samples of healthy disks may also be labeled with other numbers such as 10, 100, or 254354. The convention may also be reversed, so that samples of healthy disks are labeled 1 and samples of failed disks are labeled 0. Labeling schemes such as these fall within the spirit and essence of this patent and, although not enumerated here, should all be included in its scope of protection.
This finally yields the training dataset.
Step 3: Train the hard disk fault degree model.
Step 3.1: Model training.
Model training may be carried out in one or more rounds. Training algorithms include, but are not limited to, one or a combination of: artificial neural networks, random forests, hidden Markov models, Bayesian classification, logistic regression, support vector machines, and so on.
Separate fault degree identification models may be built for hard disks of different brands and interfaces, or a single model may be built for all of them.
This finally yields the hard disk fault degree identification model.
Step 3.2: Parameter setting.
Set the fault threshold x: if the fault degree is higher than x, the disk is considered near failure. Decision rules include, but are not limited to: the disk is near failure if any one of its samples is identified as near-failure; the disk is near failure if more than half of multiple samples are identified as near-failure; the disk is near failure if all of multiple samples are identified as near-failure.
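The three decision rules can be written as one function over the fault degrees predicted for a disk's recent samples. The function name and the `mode` strings are hypothetical, and returning False for an empty list is an assumption of this sketch.

```python
from typing import List


def is_near_failure(degrees: List[float], x: float, mode: str = "any") -> bool:
    """Decide whether a disk is near failure from its per-sample fault degrees."""
    flags = [d > x for d in degrees]
    if not flags:
        return False
    if mode == "any":
        return any(flags)
    if mode == "majority":
        return sum(flags) * 2 > len(flags)  # strictly more than half
    if mode == "all":
        return all(flags)
    raise ValueError(f"unknown mode: {mode}")
```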
Set the repair ratio threshold y, the share of data repair undertaken by the near-failure disk. y takes values in [0,1], and the share undertaken by the other healthy disks is 1-y. Ideally, y is chosen so that the repair time of the near-failure disk exactly equals the repair time of the other healthy disks. Active data repair runs these two kinds of repair simultaneously, so its duration clearly equals the longer of the two.
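The ideal value of y can be worked out under a simple assumed model in which each path has a constant throughput: if the copy path handles `copy_rate` blocks per unit time and the encode/decode path handles `decode_rate`, setting p*y/copy_rate = p*(1-y)/decode_rate gives y = copy_rate / (copy_rate + decode_rate). The constant-rate model and all names below are assumptions of this sketch.

```python
def balanced_repair_ratio(copy_rate: float, decode_rate: float) -> float:
    """Ratio y at which both repair paths finish at the same time."""
    return copy_rate / (copy_rate + decode_rate)


def repair_time(p: int, y: float, copy_rate: float, decode_rate: float) -> float:
    """Total repair time: the slower of the two concurrent paths."""
    return max(p * y / copy_rate, p * (1 - y) / decode_rate)
```

With 1000 blocks, a copy rate of 30 and a decode rate of 10, the balanced ratio is 0.75 and both paths take 25 time units, whereas y=1 (the near-failure disk alone) would take about 33.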
Set the remaining-life threshold z. After active data repair, the smaller the remaining life of the hard disk the better; ideally, the disk reaches the end of its life exactly when repair finishes. However, hard disk failure prediction can hardly be 100% accurate, so if the remaining life is greater than the threshold z, adjust x or y, or both, as appropriate.
Remark: The model training of step 3 and the prediction of step 4 may run on the same system or on separate, independent systems. Preferably, the system used for model training should have more than 1000 hard disks.
Step 4: Identify the hard disks that take part in active repair, including the disk to be repaired, the group of hard disks that jointly take part in the repair, and the destination disk.
Step 4.1: Generate the fault degree list.
Extract features from the latest information collected in step 1 for all hard disks, feed the extracted features into the model from step 3 to obtain the fault degree of every hard disk in the system one by one, and finally generate the fault degree list.
Step 4.2: Determine the near-failure disk with the highest fault degree as the disk to be repaired.
If no hard disk has a fault degree higher than the fault threshold x, return to step 4.1.
If one hard disk has a fault degree higher than the fault threshold x, start active repair of that disk, called disk A.
If multiple hard disks have fault degrees higher than the fault threshold x, handle them one by one, starting active repair first on the disk with the highest fault degree, called disk A.
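Step 4.2 amounts to a thresholded arg-max over the fault degree list. In the sketch below the list is represented as a dictionary from a disk identifier to its fault degree; that representation and the function name are assumptions.

```python
from typing import Dict, Optional


def select_disk_a(fault_degrees: Dict[str, float], x: float) -> Optional[str]:
    """Return the disk with the highest fault degree above x, or None."""
    candidates = {disk: d for disk, d in fault_degrees.items() if d > x}
    if not candidates:
        return None  # nothing to repair: go back to step 4.1
    return max(candidates, key=candidates.get)
```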
Step 4.3: Determine the group of hard disks that jointly take part in the repair.
Access the system metadata, traverse all p data blocks (strips) in disk A, and read the hard disk serial numbers of the stripes they belong to; these hard disks jointly take part in the repair and are called disk group B.
The number of hard disks in disk group B depends on the erasure code algorithm. For example, with a typical 4+2 MSR code there are 5 hard disks in disk group B; with a typical 3+1 RS code there are 3 hard disks in disk group B.
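Step 4.3 can be sketched as a metadata traversal. The two dictionaries below stand in for the system metadata and are hypothetical, as is the decision to take the union over all stripes and to exclude disk A itself from the group.

```python
from typing import Dict, List, Set


def disk_group_b(disk_a: str,
                 block_to_stripe: Dict[str, str],
                 stripe_to_disks: Dict[str, List[str]]) -> Set[str]:
    """Collect the disks that share a stripe with any data block on disk A.

    block_to_stripe: for each data block on disk A, the stripe it belongs to.
    stripe_to_disks: for each stripe, the disks that hold its blocks.
    """
    group: Set[str] = set()
    for stripe in block_to_stripe.values():
        group.update(stripe_to_disks[stripe])
    group.discard(disk_a)  # disk A repairs its own share by copying
    return group
```

With a 3+1 layout in which every stripe spans disk A and the same three other disks, the group contains those three disks, matching the RS example in the text.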
Step 4.4: Select the hard disk with the lowest fault degree as the repair destination disk.
If exactly one hard disk has the lowest fault degree, it is called disk C.
If several hard disks share the lowest fault degree, one of them must be chosen and is called disk C. Preferably, among the disks that share the lowest fault degree, the one with the largest remaining capacity is chosen. Other selection methods include, but are not limited to: random selection, lowest disk number, highest disk number, and smallest remaining capacity.
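The preferred rule for choosing disk C (lowest fault degree, ties broken by largest remaining capacity) can be expressed with a composite sort key. The input representation, a dictionary from disk identifier to a (fault degree, remaining capacity) pair, is an assumption of this sketch.

```python
from typing import Dict, Tuple


def select_disk_c(disks: Dict[str, Tuple[float, float]]) -> str:
    """Pick the destination disk.

    disks: disk id -> (fault degree, remaining capacity).
    """
    if not disks:
        raise ValueError("no candidate disks")
    # Lower fault degree first; among equals, larger remaining capacity first.
    return min(disks, key=lambda d: (disks[d][0], -disks[d][1]))
```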
Step 5: Perform active data repair.
Step 5.1: Determine the sets of data blocks that the A disk and the B disk group are each responsible for repairing.
Assume that p data blocks in total are to be repaired; they form set Q.
The A disk undertakes the repair of p*y data blocks, which form set T.
The B disk group undertakes the repair of p-p*y data blocks, which form set R.
Set T is selected from set Q; selection methods include but are not limited to: in forward order, in reverse order, with a stride, and at random.
Set R is obtained by subtracting set T from set Q.
Steps 5.2 and 5.3 below are then carried out simultaneously.
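The split of set Q into sets T and R in step 5.1 can be illustrated with forward-order selection. The function name split_repair_work is hypothetical, and rounding p*y to the nearest integer is an assumption, since the text does not say how a fractional block count is handled.

```python
from typing import List, Tuple


def split_repair_work(q: List[int], y: float) -> Tuple[List[int], List[int]]:
    """Split the blocks to repair (Q) into T (copied from the A disk) and R (decoded by the B group).

    Forward-order selection: the first p*y blocks go to T, the rest to R.
    """
    n_copy = round(len(q) * y)  # assumption: round p*y to the nearest integer
    return q[:n_copy], q[n_copy:]
```

With 2560 blocks and y = 0.5, T holds blocks 1 to 1280 and R holds blocks 1281 to 2560, which matches the sector ranges used in the embodiment.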
Step 5.2: Repair all data blocks in set T from the A disk to the C disk by copying.
The blocks are repaired one at a time, and each data block in set T is marked once its repair is finished.
Step 5.3: Repair all data blocks in set R from the B disk group to the C disk by encoding/decoding computation.
The blocks are repaired one at a time, and each data block in set R is marked once its repair is finished.
The encoding/decoding algorithm is the erasure code algorithm actually used in the system.
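As an illustration of the decoding path in step 5.3, the sketch below uses the simplest possible erasure code, single XOR parity as in RAID 5. This is a stand-in chosen for brevity, not the MSR or RS codes mentioned earlier, and the function name xor_blocks is hypothetical.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length blocks byte by byte.

    With single XOR parity, the same operation both computes the parity strip
    and rebuilds any one missing strip from the surviving strips of the stripe.
    """
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Build a stripe of three data strips plus parity, then rebuild the second strip
# from the other three, as the B disk group would do for a block in set R.
d_a, d_b, d_c = b"\x01\x02", b"\x10\x20", b"\xff\x00"
parity = xor_blocks([d_a, d_b, d_c])
rebuilt_b = xor_blocks([d_a, d_c, parity])
```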
Step 5.4: Wait for the repair to finish.
If a hard disk other than the A disk fails while the A disk is being repaired, the system enters the degraded state; active repair is suspended and the traditional passive fault-tolerance method is used until the degraded state ends.
If the A disk itself fails during the repair, go to step 6.
If no disk fails and the active repair completes successfully, go to step 7.
Step 6: Handle a failure of the A disk during the repair.
Step 6.1: Determine the set S of data blocks in set T whose repair has not been completed.
Step 6.2: Hand the data blocks of set S over to the B disk group for repair.
Step 6.3: Adjust the parameters, including but not limited to: lowering the fault threshold x and lowering the repair ratio threshold y.
Step 6.4: Wait for the repair to finish, then go to step 8.
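Step 6 can be sketched as below. The function name handle_source_failure and the adjustment steps x_step and y_step are assumptions; the text only says that x and y are lowered, not by how much.

```python
from typing import List, Set, Tuple


def handle_source_failure(
    t: List[int],
    repaired: Set[int],
    r: List[int],
    x: float,
    y: float,
    x_step: float = 0.02,  # hypothetical step size
    y_step: float = 0.1,   # hypothetical step size
) -> Tuple[List[int], List[int], float, float]:
    """Move the unfinished blocks of T to the B disk group and lower x and y."""
    s = [blk for blk in t if blk not in repaired]  # step 6.1: unfinished blocks of T
    new_r = r + s                                  # step 6.2: hand S to the B disk group
    # Step 6.3: lower both parameters, never below zero.
    return s, new_r, max(0.0, x - x_step), max(0.0, y - y_step)
```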
Step 7: Use up the remaining life cycle of the A disk.
Write to the A disk continuously in overwrite mode; the written content includes but is not limited to: random values, all 0s, and all 1s. Writing continues until the disk fails or the write time reaches the remaining life cycle threshold z. Record the write time h; clearly h ∈ (0, z].
If h = z, the prediction is regarded as a misjudgment, and the misjudgment is fed back to the training model, which adjusts the sample labels of the A disk accordingly. Without this feedback, the system would treat the A disk as a failed disk and its samples from the most recent 10 days as positive samples (usually labeled 1). Because step 7 has verified that the A disk is not a failed disk and was misjudged, the feedback causes the training model to relabel the samples of the A disk as negative samples (usually labeled 0).
If h > z/4, raise the fault threshold x appropriately and/or raise the repair ratio threshold y appropriately.
If h < z/4, leave the parameters unchanged.
The value 1/4 is empirical and can be adjusted as needed.
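The decision made at the end of step 7 can be written as a small function. The name evaluate_remaining_life and the returned strings are hypothetical. Two points are assumptions because the text leaves them open: the case h = z is reported only as a misjudgment (the text does not say whether the thresholds are also raised), and the boundary h = z/n is treated like h < z/n.

```python
def evaluate_remaining_life(h: float, z: float, n: int = 4) -> str:
    """Classify the outcome of the overwrite test on the A disk.

    h is the recorded write time, z the remaining life cycle threshold,
    and 1/n the empirical fraction (1/4 in the text).
    """
    if h >= z:
        return "misjudged"  # disk survived the whole test: feed back to the training model
    if h > z / n:
        return "raise"      # disk lasted fairly long: raise x and/or y
    return "keep"           # disk failed quickly: the parameters were appropriate
```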
Step 8: Handle completion of the repair.
Step 8.1: Modify the system metadata so that all metadata referring to the A disk refers to the C disk instead.
Step 8.2: Update the training model with the information newly collected over the recent period.
Step 8.3: If the fault threshold x oscillates three times in a row (raised, lowered, raised; or lowered, raised, lowered), x is no longer adjusted.
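The oscillation check of step 8.3 could be implemented as in the sketch below, assuming that the system keeps a history of threshold adjustments encoded as the strings "up" and "down"; both the encoding and the name is_oscillating are hypothetical.

```python
from typing import List


def is_oscillating(adjustments: List[str]) -> bool:
    """Return True if the last three adjustments of x alternate direction."""
    last_three = adjustments[-3:]
    return last_three in (["up", "down", "up"], ["down", "up", "down"])
```

Once this check returns True, the fault threshold x would be frozen at its current value.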
Beneficial effects of the present invention:
Compared with the prior art, the method of the invention performs active repair quickly and accurately, effectively prevents the system from entering the degraded state, and makes maximum use of the remaining life cycle of hard disks on the verge of failure.
Description of the drawings
Fig. 1 is a flow chart of the invention.
Fig. 2 is a module relationship diagram of the invention.
Fig. 3 is a diagram of an implementation example of the invention.
Detailed description of the embodiments
The implementation of the technical solution is described in further detail below with reference to Fig. 3.
(1) Training system.
Overview: Training is carried out in a large data center with 10182 hard disks. The SMART information of all hard disks and the CPU usage of the system are collected once every hour for 3 months, during which 452 disks fail.
Feature selection: SMART attributes #1, #3, #5, #7, #9, #187, #189, #194, #195 and #197. The remaining attributes are discarded.
Labeling: Every sample of the 9730 healthy disks is labeled 0. The samples of the 452 failed disks from the 10 days before failure are labeled 1, 0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6 and 0.55 respectively. The remaining samples are discarded.
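The graded labels for failed disks can be generated as in the sketch below. It assumes that the day closest to the failure receives the label 1 and that the label drops by 0.05 for each earlier day, which is consistent with the claims (the closer a sample is to the failure, the closer its label is to the failed-disk value). The function name sample_label is hypothetical.

```python
from typing import Optional


def sample_label(days_before_failure: int) -> Optional[float]:
    """Label for a sample of a failed disk taken the given number of days before failure.

    Day 1 (closest to the failure) maps to 1.0 and day 10 to 0.55.
    Samples outside the 10-day window return None, meaning they are discarded.
    """
    if not 1 <= days_before_failure <= 10:
        return None
    return round(1.0 - 0.05 * (days_before_failure - 1), 2)
```

Samples of healthy disks are not covered by this function; they are all labeled 0.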
Training: The above samples are used to train an artificial neural network, which yields the fault prediction model.
(2) Prediction system.
Overview: Prediction is carried out on a small storage server with 7 hard disks, of which four (A, B, C and D) form a RAID 5 array.
Sampling: The SMART information of disks A, B, C, D, E, F and G and the CPU usage of the system are collected once every 6 hours, after which prediction starts.
Feature selection: SMART attributes #1, #3, #5, #7, #9, #187, #189, #194, #195 and #197. The remaining attributes are discarded, which yields 7 samples.
Prediction: The 7 samples are fed into the model one by one; the computed fault degrees are 0.31, 0.95, 0.43, 0.42, 0.72, 0.01 and 0.04 respectively.
Repair: The fault degree of the B disk exceeds the fault threshold of 0.9, so the B disk is marked as a disk on the verge of failure. The other three disks in the same RAID array (A, C and D) participate in the repair. The F disk has the lowest fault degree, so it is the repair target disk. The B disk holds 2560 sectors of data to be repaired to the F disk: the B disk copies sectors 1 to 1280 to the F disk, while disks A, C and D generate sectors 1281 to 2560 on the F disk by encoding/decoding computation. The system metadata that refers to the B disk is then changed to refer to the F disk.
Finally, the B disk takes 103 seconds to finish copying, while disks A, C and D take 59 seconds to finish the encoding/decoding computation. The repair ratio threshold is therefore lowered to 40%. The B disk is then written continuously with all 0s in overwrite mode until, after 903 seconds, it is completely damaged and can no longer be written. Since 1000 seconds > 903 seconds > 1000 seconds/4, the fault threshold is raised to 0.92.
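The text does not state how the new repair ratio of 40% is derived. As a rough plausibility check only, the sketch below computes the share of blocks at which the copy path and the decoding path would finish at the same time, assuming that both paths handled the same number of blocks when timed (1280 sectors each in this embodiment) and that their speeds stay constant. The function name balanced_copy_ratio is hypothetical.

```python
def balanced_copy_ratio(copy_seconds: float, decode_seconds: float) -> float:
    """Share of blocks for the copy path so that both paths finish together.

    Assumes the two measured times refer to equal numbers of blocks,
    so each path's speed is proportional to 1 / its measured time.
    """
    copy_rate = 1.0 / copy_seconds
    decode_rate = 1.0 / decode_seconds
    return copy_rate / (copy_rate + decode_rate)


ratio = balanced_copy_ratio(103, 59)  # 59 / (59 + 103), roughly 0.36
```

Under these assumptions the balanced share is somewhat below 40%, so lowering the repair ratio from 50% to 40% moves in the direction that shortens the overall repair time.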
Specific embodiments and drawings of the invention are disclosed for the purpose of illustration, to help readers understand the invention and practice it accordingly. Those skilled in the art will appreciate that various substitutions, changes and modifications are possible without departing from the spirit and scope of the invention and the appended claims. The invention is therefore not limited to what is disclosed in the preferred embodiments and drawings of this specification; the scope of protection is defined by the claims.

Claims (10)

1. An active data repair method for erasure codes, the steps of which comprise:
1) within a set time period, periodically collecting information of each selected hard disk;
2) performing feature extraction on the information collected for each hard disk at each time point to generate one sample data item;
3) labeling the sample data of each hard disk according to the actual state of that hard disk to obtain a training data set; the labels comprise healthy disk, failed disk and disk on the verge of failure; wherein a healthy disk corresponds to a label value a, a failed disk corresponds to a label value b, and a disk on the verge of failure corresponds to multiple label values between a and b; if the sample data collected when the disk state is failed disk has collection time t, the sample data of that hard disk within a set time period before time t is labeled as disk on the verge of failure, and the closer a sample is to time t, the closer its label value is to the failed-disk label value; sample data collected when the disk state is failed disk is labeled with the failed-disk label value b, and the remaining sample data is labeled with the healthy-disk label value a;
4) setting a fault threshold x and a data repair ratio y undertaken by the disk on the verge of failure, and training a hard disk fault degree identification model with the training data set;
5) collecting information of each hard disk in the erasure code system, performing feature extraction, and inputting the features into the hard disk fault degree identification model to obtain the fault degree of each hard disk;
6) if the fault degree of a hard disk A is higher than the fault threshold x, starting active repair of hard disk A, that is, executing steps 7) and 8); and selecting the hard disk with the lowest fault degree, denoted disk C;
7) traversing all data blocks in hard disk A, reading the serial numbers of the hard disks in the stripe to which each block belongs, and taking the hard disks with those serial numbers as the hard disks that participate in the repair, referred to as the B disk group; the number of hard disks in the B disk group is determined by the erasure code algorithm used by the erasure code system;
8) repairing all data blocks in set T from hard disk A to disk C, and repairing all data blocks in set R from the B disk group to disk C by encoding/decoding computation; wherein p data blocks in total are to be repaired in hard disk A, hard disk A undertakes the repair of p*y data blocks, denoted set T, and the B disk group undertakes the repair of p-p*y data blocks, denoted set R.
2. The method of claim 1, characterized in that after the active repair of hard disk A is completed, hard disk A is written continuously in overwrite mode until hard disk A fails or the write time reaches the remaining life cycle threshold z, and the write time h is recorded; if h = z, the fault degree identification of hard disk A is judged to be wrong, and this is fed back to the hard disk fault degree identification model; if h > z/n, the fault threshold x in the hard disk fault degree identification model is raised and/or the repair ratio threshold y is raised; if h < z/n, the parameters of the hard disk fault degree identification model are kept unchanged; n is an integer greater than 1.
3. The method of claim 1, characterized in that the information of the hard disk comprises hard disk SMART information; the features comprise one or more of the following: raw read error rate, average spin-up time of the platters, number of reallocated sectors, seek error rate, power-on hours, count of uncorrectable hardware errors, count of write failures caused by excessive head flying height, internal temperature of the hard disk, count of errors corrected by hardware ECC, and number of unstable sectors.
4. The method of claim 3, characterized in that the forms of the hard disk SMART information include but are not limited to: the current Raw value, the current Normal value, the difference between the current Raw value and the Raw value a time L earlier, and the difference between the current Normal value and the Normal value a time L earlier; wherein time L is an integer multiple of time N, and N is the time interval of information collection.
5. The method of claim 1, characterized in that if another hard disk in the erasure code system fails during the active repair of hard disk A, the erasure code system enters the degraded state, the active repair of hard disk A is suspended, and passive fault tolerance is used until the degraded state ends; if hard disk A itself fails during the active repair, the set S of data blocks in set T whose repair has not been completed is determined, the data blocks of set S are handed over to the B disk group for repair, and the fault threshold x and the repair ratio threshold y are lowered.
6. The method of claim 1, characterized in that after the active repair of hard disk A is completed, the metadata information referring to hard disk A is uniformly changed to refer to disk C; in step 6), if several hard disks have a fault degree higher than the fault threshold x, they are handled one by one according to fault degree, and the disk with the highest fault degree is taken as hard disk A and actively repaired first.
7. An active data repair system for erasure codes, characterized by comprising an information collection module, a data processing module, a hard disk fault degree model training module and an active repair module; wherein
the information collection module is configured to periodically collect information of each hard disk;
the data processing module is configured to perform feature extraction on the information collected for each hard disk at each time point to generate one sample data item, and to label the sample data of each hard disk according to the actual state of that hard disk to obtain a training data set; the labels comprise healthy disk, failed disk and disk on the verge of failure; wherein a healthy disk corresponds to a label value a, a failed disk corresponds to a label value b, and a disk on the verge of failure corresponds to multiple label values between a and b; if the sample data collected when the disk state is failed disk has collection time t, the sample data of that hard disk within a set time period before time t is labeled as disk on the verge of failure, and the closer a sample is to time t, the closer its label value is to the failed-disk label value; sample data collected when the disk state is failed disk is labeled with the failed-disk label value b, and the remaining sample data is labeled with the healthy-disk label value a;
the hard disk fault degree model training module is configured to train a hard disk fault degree identification model with the training data set, according to the set fault threshold x and the data repair ratio y undertaken by the disk on the verge of failure;
the active repair module is configured to identify the fault degree of each hard disk in the erasure code system with the hard disk fault degree identification model, and, if the fault degree of a hard disk A is higher than the fault threshold x, to start active repair of hard disk A and select the hard disk with the lowest fault degree, denoted disk C; wherein the active repair method is: traversing all data blocks in hard disk A, reading the serial numbers of the hard disks in the stripe to which each block belongs, and taking the hard disks with those serial numbers as the hard disks that participate in the repair, referred to as the B disk group; repairing all data blocks in set T from hard disk A to disk C, and repairing all data blocks in set R from the B disk group to disk C by encoding/decoding computation; wherein p data blocks in total are to be repaired in hard disk A, hard disk A undertakes the repair of p*y data blocks, denoted set T, and the B disk group undertakes the repair of p-p*y data blocks, denoted set R; the number of hard disks in the B disk group is determined by the erasure code algorithm used by the erasure code system.
8. The system of claim 7, characterized in that after the active repair module completes the active repair of hard disk A, hard disk A is written continuously in overwrite mode until hard disk A fails or the write time reaches the remaining life cycle threshold z, and the write time h is recorded; if h = z, the fault degree identification of hard disk A is judged to be wrong, and this is fed back to the hard disk fault degree identification model; if h > z/n, the fault threshold x in the hard disk fault degree identification model is raised and/or the repair ratio threshold y is raised; if h < z/n, the parameters of the hard disk fault degree identification model are kept unchanged; n is an integer greater than 1.
9. The system of claim 7, characterized in that the information of the hard disk comprises hard disk SMART information; the features comprise one or more of the following: raw read error rate, average spin-up time of the platters, number of reallocated sectors, seek error rate, power-on hours, count of uncorrectable hardware errors, count of write failures caused by excessive head flying height, internal temperature of the hard disk, count of errors corrected by hardware ECC, and number of unstable sectors; the forms of the hard disk SMART information include but are not limited to: the current Raw value, the current Normal value, the difference between the current Raw value and the Raw value a time L earlier, and the difference between the current Normal value and the Normal value a time L earlier; wherein time L is an integer multiple of time N, and N is the time interval of information collection.
10. The system of claim 7, characterized by further comprising a system metadata module configured to uniformly change the metadata information referring to hard disk A to refer to disk C; if several hard disks have a fault degree higher than the fault threshold x, they are handled one by one according to fault degree, and the disk with the highest fault degree is taken as hard disk A and actively repaired first.
CN201811300732.7A 2018-11-02 2018-11-02 A kind of active data restorative procedure and system towards correcting and eleting codes Active CN109658975B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811300732.7A CN109658975B (en) 2018-11-02 2018-11-02 A kind of active data restorative procedure and system towards correcting and eleting codes

Publications (2)

Publication Number Publication Date
CN109658975A true CN109658975A (en) 2019-04-19
CN109658975B CN109658975B (en) 2019-12-03

Family

ID=66110596

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811300732.7A Active CN109658975B (en) 2018-11-02 2018-11-02 A kind of active data restorative procedure and system towards correcting and eleting codes

Country Status (1)

Country Link
CN (1) CN109658975B (en)

Also Published As

Publication number Publication date
CN109658975B (en) 2019-12-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant