WO2014094572A1 - Method and apparatus for repairing a read IO failure in a RAID5 array - Google Patents
Method and apparatus for repairing a read IO failure in a RAID5 array
- Publication number
- WO2014094572A1 (PCT/CN2013/089373)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- read
- repair
- failed
- failure
- write
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/08—Error detection or correction by redundancy in data representation, e.g. by using checking codes
- G06F11/10—Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
- G06F11/1076—Parity data used in redundant arrays of independent storages, e.g. in RAID systems
- G06F11/1084—Degraded mode, e.g. caused by single or multiple storage removals or disk failures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2211/00—Indexing scheme relating to details of data-processing equipment not covered by groups G06F3/00 - G06F13/00
- G06F2211/10—Indexing scheme relating to G06F11/10
- G06F2211/1002—Indexing scheme relating to G06F11/1076
- G06F2211/1059—Parity-single bit-RAID5, i.e. RAID 5 implementations
Definitions
- The present invention relates to storage array technology, and more particularly to a method and apparatus for repairing a read IO failure in a RAID5 array.
- Background
- In degraded mode (a single disk of the array has failed), the stripes of a traditional RAID5 array no longer have data redundancy protection, so fault tolerance for disk read errors is weak and the following problems occur. If, during array rebuild, a read issued by the rebuild itself or by an external service hits a read error on a disk, that disk is kicked out of the array, the rebuild is aborted, and the array becomes unavailable. If the array is degraded but has not yet entered the rebuild state, for example because the system has no hot spare disk, a read error on a disk caused by an external service likewise gets the disk kicked out of the array and leaves the array unavailable.
- The present invention provides the following technical solutions:
- A method for repairing a read IO failure in a RAID5 array, the method being applied to a degraded RAID5 array and comprising: A. placing a failed read IO in a failure repair thread queue; B. for the failed read IO in the failure repair thread queue, constructing a write IO of invalid data and executing the write IO, the start position and size of the write IO being the same as those of the failed read IO; C. after the write IO succeeds, writing the invalid data into the data cache of the failed read IO and returning read IO success.
- The method further includes step D1: determining whether the failed read IO targets the index area; if so, returning read IO failure; if not, executing step B.
- Step C further comprises: if the write IO fails, returning read IO failure.
- The method further includes step D2: setting a flag on the failed read IO to indicate that the read IO has been through failure repair processing; and, before step A, step D3: determining whether the failed read IO carries the flag indicating that it has been through failure repair processing; if not, going to step A; if so, entering the normal processing flow of the RAID5 array.
- Step D3 further comprises: if the failed read IO does not carry the flag, determining whether the read IO failure was caused by sector corruption; if so, executing step A; if not, entering the normal processing flow of the RAID5 array.
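- The decision flow of steps A to C and D1 to D3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `ReadIO`, `Outcome`, `repair_failed_read`, and the `write_fn` callback are hypothetical, the repair thread queue is collapsed into a direct function call, and all-zero bytes stand in for the "invalid data".

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Outcome(Enum):
    READ_OK = auto()          # read IO reported as successful (step C)
    READ_FAILED = auto()      # read IO failure returned to the caller
    NORMAL_HANDLING = auto()  # fall back to the array's normal processing flow


@dataclass
class ReadIO:
    """Hypothetical descriptor of a failed read IO."""
    start: int                   # starting logical address
    size: int                    # length in bytes
    bad_sector: bool = True      # was the failure caused by sector corruption?
    in_index_area: bool = False  # does the read target the index area?
    repaired_flag: bool = False  # set once the IO has been through repair (D2)
    buffer: bytearray = field(default_factory=bytearray)  # data cache of the read


def repair_failed_read(io: ReadIO, write_fn) -> Outcome:
    """Run one failed read IO through the repair flow.

    write_fn(start, data) -> bool is an assumed stand-in for issuing the
    constructed write IO to the array; it returns True on success.
    """
    # Step D3: skip IOs already repaired once, and failures that were not
    # caused by sector corruption.
    if io.repaired_flag or not io.bad_sector:
        return Outcome.NORMAL_HANDLING

    # Steps A and D2: the IO enters the repair queue and is flagged, so a
    # second failure of the same IO cannot loop back into repair.
    io.repaired_flag = True

    # Step D1: never overwrite the index area with invalid data.
    if io.in_index_area:
        return Outcome.READ_FAILED

    # Step B: write invalid data with the same start and size as the read.
    invalid = bytes(io.size)
    if not write_fn(io.start, invalid):
        return Outcome.READ_FAILED

    # Step C: hand the invalid data back as the data that was "read".
    io.buffer = bytearray(invalid)
    return Outcome.READ_OK
```

- Setting the flag before the index-area check follows the device description, in which the preparation module flags the IO once it is in the queue; other orderings would also satisfy the claims as worded.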
- The present invention also provides a repair device for read IO failures in a RAID5 array, for use in a degraded RAID5 array, the device comprising a repair preparation module and a repair execution module.
- The repair preparation module is configured to place a failed read IO in a failure repair thread queue.
- The repair execution module is configured to, for the failed read IO in the failure repair thread queue, construct a write IO of invalid data and execute the write IO, the start position and size of the write IO being the same as those of the failed read IO; and, after the write IO succeeds, write the invalid data into the data cache of the failed read IO and return read IO success.
- The repair preparation module is further configured to determine whether the failed read IO in the failure repair thread queue targets the index area; if so, no repair is performed; if not, the repair is performed by the repair execution module.
- The repair execution module returns read IO failure when the write IO fails.
- The repair preparation module is further configured to set a flag on the failed read IO in the failure repair thread queue, indicating that the read IO has been through failure repair processing.
- Before placing a failed read IO in the failure repair thread queue, the repair preparation module determines whether the failed read IO carries the flag indicating failure repair processing; if not, it places the failed read IO in the failure repair thread queue; if so, the normal processing flow of the RAID5 array is entered.
- The repair preparation module is further configured to determine, if the failed read IO does not carry the flag, whether the read IO failure was caused by sector corruption; if so, the failed read IO is placed in the failure repair thread queue; if not, the normal processing flow of the RAID5 array is entered.
- The present invention implements an immediate repair mechanism for read errors in the recorded-data area of a RAID5 array in degraded mode.
- Figure 3 is a third flow chart of an embodiment of the present invention.
- Figure 4 is a logical block diagram of the apparatus of the present invention.
- Detailed description
- A careful study of the characteristics of surveillance services shows that, for a surveillance storage service, losing some old video recording data is acceptable when a small number of bad sectors appear on a disk: video surveillance data is massive, and much of it is effectively useless. For example, the image captured by a camera may not change for hours, or may change very little. If a small number of bad sectors appear on a disk but the disk is otherwise usable, an immediate and effective error handling mechanism is needed to keep the disk from being kicked out of the array, the rebuild from being aborted, and the array from becoming unavailable, so that recorded data can continue to be stored in the array normally.
- The present invention proposes a read IO failure repair method for a degraded RAID5 array, comprising the following steps: Step A: place the failed read IO in the error repair thread queue;
- Step B: for the failed read IO in the error repair thread queue, construct a write IO with invalid data and execute it; the start logical address and length of the write IO are the same as those of the failed read IO;
- Step C: write the invalid data into the data cache of the failed read IO, and return that the read IO succeeded.
- When a read IO failure occurs on a RAID array in the degraded state, the array does not immediately report the read failure. Instead, it constructs a new write IO command for the logical address that the read IO targets. Relying on the disk's built-in bad sector reallocation mechanism (or the storage device's bad block remapping mechanism), the new write IO writes invalid data to that logical address: the logical address is the same as that of the read IO, but the physical space behind it is different. The constructed invalid data is also written into the read buffer as the data actually read.
- The read IO thus succeeds, although the data read differs from the real data (the sector damage has caused the real data to be lost).
- Because the read IO completes successfully, the disk containing the bad sector is not kicked out of the array and the array does not become unavailable; if a rebuild is in progress, it is not aborted.
- When a read IO fails on a RAID5 array in degraded mode, the failed read IO is queued to the IO queue of the error repair thread.
- The error repair thread then repairs the failed IO.
- Figure 3 shows the response flow for the repair write IO.
- Figures 1 through 3 show the repair process for a failed read IO.
- When a read IO failure occurs, the failure is not reported immediately, since doing so would cause the disk to be kicked out and the rebuild to be aborted; instead, the repair process is performed for the failed read IO.
- Sector corruption on the disk is one cause of read failures.
- The present invention mainly repairs read IO failures arising from this cause.
- When a read IO fails because of a bad sector, the error code for sector corruption is returned, from which it can be determined that the read IO failed due to sector corruption. In this case the failed read IO can be repaired.
- The failed read IO is first put into the failure repair thread queue, and the repair thread is woken to process it. The repair can itself be unsuccessful, in which case read IO failure is still returned; but because that failed read IO has already been through repair, it must not be repaired again, otherwise an infinite loop results. Therefore, before a failed read IO is put into the failure repair thread queue, it is necessary to determine whether it has already been through failure repair processing. If it has not, it is placed in the failure repair thread queue; if it has, it can only be handled by the existing RAID array processing flow, such as kicking out the disk or aborting the rebuild.
- The main reason for determining whether the read targets the index area or the data area is this: damage to the index area makes all surveillance video on one or more disks unusable, and a repair that writes invalid data into the index area would cause the same problem. Disks have a bad sector reallocation mechanism, and some storage devices also support their own bad block remapping mechanism. Therefore, when a write IO is issued to a logical address whose current sector is damaged, a new sector is automatically allocated for that logical address, and the write IO actually writes to the newly allocated sector. The written content is invalid data, which can be all zeros or any other data.
- The write IO constructed above generally succeeds when executed, though failure in some circumstances cannot be ruled out. If the write fails, the repair of the read IO has failed, and read IO failure is returned. If the write succeeds, the repair of the read IO has succeeded, and read IO success is returned. The bad sector is isolated immediately and transparently to the storage service (subsequent reads or writes to the original bad sector become reads and writes to the newly allocated sector). After a successful repair, the constructed invalid data, such as all zeros, must be written into the data cache of the read IO.
- The purpose of the write IO is not to actually write data but to isolate the bad sector, so that the originally failed read IO can succeed without causing a disk kick-out and similar consequences; this is what is meant by repair here.
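- The way a write isolates a bad sector can be illustrated with a toy model of a disk that remaps on write. The sketch below is an assumption-laden simulation, not real drive firmware or any storage API: the class `ToyDisk`, its spare pool, and the 512-byte sector size are all hypothetical and chosen only to show why a read of the same logical address succeeds after the repair write.

```python
class ToyDisk:
    """Toy model of a disk with write-triggered bad sector reallocation."""

    SECTOR = 512  # assumed sector size in bytes

    def __init__(self, sectors: int, spares: int):
        # Logical address -> physical sector; identity mapping at first.
        self.mapping = {lba: lba for lba in range(sectors)}
        # Physical sectors held in reserve for reallocation.
        self.spare_pool = list(range(sectors, sectors + spares))
        self.media = {}   # physical sector -> stored bytes
        self.bad = set()  # physical sectors that can no longer be read

    def read(self, lba: int) -> bytes:
        phys = self.mapping[lba]
        if phys in self.bad:
            raise OSError(f"unrecoverable read error at logical address {lba}")
        return self.media.get(phys, bytes(self.SECTOR))

    def write(self, lba: int, data: bytes) -> bool:
        phys = self.mapping[lba]
        if phys in self.bad:
            if not self.spare_pool:
                return False  # no spare left: the repair write fails
            # Reallocate: same logical address, new physical sector.
            phys = self.spare_pool.pop(0)
            self.mapping[lba] = phys
        self.media[phys] = bytes(data)
        return True


disk = ToyDisk(sectors=8, spares=1)
disk.write(3, b"\x01" * ToyDisk.SECTOR)
disk.bad.add(3)                      # physical sector 3 goes bad
try:
    disk.read(3)                     # the original read IO fails
except OSError:
    repaired = disk.write(3, bytes(ToyDisk.SECTOR))  # repair write of zeros
```

- After the repair write, `disk.read(3)` returns the all-zero data from the spare sector; the original contents are lost, which is the trade-off the description accepts for surveillance recordings.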
- Although the data in the damaged sector has indeed been lost, losing a small amount of stored data sometimes does not affect the actual business, as with video storage services.
- For the stripe on which the invalid-data write IO was executed, consistency between the stripe checksum and the data must be restored by a follow-up mechanism. If the read IO was issued by an external service and the stripe it targets has already been rebuilt, the stripe checksum must be recalculated and the new checksum written to disk. If the read IO was issued by an external service and the stripe it targets has not yet been rebuilt, consistency between the stripe checksum and the data is restored naturally during the subsequent rebuild. If the read IO was issued by the rebuild itself, the invalid data and the other stripe data are used directly to calculate the checksum written to disk, ensuring that the checksum is consistent with the data.
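- Recalculating the stripe checksum after an invalid-data write can be sketched as follows, assuming the "checksum" is the byte-wise XOR parity used by standard RAID 5. The function name `xor_parity` and the tiny two-byte chunks are hypothetical and for illustration only.

```python
from functools import reduce


def xor_parity(chunks):
    """Byte-wise XOR of equally sized data chunks from one stripe."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


# A stripe with three data chunks; the second was overwritten with
# invalid (all-zero) data by the repair write.
stripe = [b"\x0f\xf0", bytes(2), b"\xff\x00"]
parity = xor_parity(stripe)

# Consistency check: XOR of all data chunks and the parity is all zeros.
consistent = xor_parity(stripe + [parity]) == bytes(2)
```

- Because XOR with zero leaves a value unchanged, an all-zero chunk contributes nothing to the parity, but the parity still has to be rewritten so that it no longer reflects the lost original data.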
- The present invention also provides a repair device for read IO failures in a RAID5 array.
- The device is applied to a degraded RAID5 array and includes a repair preparation module and a repair execution module.
- The repair device is implemented as a computer program; the repair preparation module and the repair execution module are stored in memory and direct the CPU to perform the processing.
- The repair execution module is configured to, for the failed read IO in the failure repair thread queue, construct a write IO of invalid data and execute the write IO, the start position and size of the write IO being the same as those of the failed read IO; and, after the write IO succeeds, write the invalid data into the data cache of the failed read IO and return read IO success.
- The repair preparation module is further configured to determine whether the failed read IO in the failure repair thread queue targets the index area; if so, no repair is performed; if not, the repair is performed by the repair execution module.
- The repair execution module returns read IO failure when the write IO fails.
- The repair preparation module is further configured to set a flag on the failed read IO in the failure repair thread queue, indicating that the read IO has been through failure repair processing.
- Before placing a failed read IO in the failure repair thread queue, the repair preparation module determines whether the failed read IO carries the flag indicating failure repair processing; if not, it places the failed read IO in the failure repair thread queue; if so, the normal processing flow of the RAID5 array is entered.
- The repair preparation module is further configured to determine, if the failed read IO does not carry the flag, whether the read IO failure was caused by sector corruption; if so, the failed read IO is placed in the failure repair thread queue; if not, the normal processing flow of the RAID5 array is entered.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Debugging And Monitoring (AREA)
- Techniques For Improving Reliability Of Storages (AREA)
- Memory System Of A Hierarchy Structure (AREA)
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210550368.6 | 2012-12-17 | ||
CN2012105503686A CN102981921A (zh) | 2012-12-17 | 2012-12-17 | Method and apparatus for repairing a read IO failure in a RAID5 array |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2014094572A1 (zh) | 2014-06-26 |
Family
ID=47855977
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2013/089373 WO2014094572A1 (zh) | 2012-12-17 | 2013-12-13 | Method and apparatus for repairing a read IO failure in a RAID5 array |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN102981921A (zh) |
WO (1) | WO2014094572A1 (zh) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
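CN102981921A (zh) * | 2012-12-17 | 2013-03-20 | 浙江宇视科技有限公司 | Method and apparatus for repairing a read IO failure in a RAID5 array |
CN103678048B (zh) * | 2013-11-29 | 2015-11-25 | 华为技术有限公司 | Redundant array of independent disks repair method, apparatus, and storage device |
CN109840163B (zh) * | 2018-12-27 | 2022-05-24 | 西安紫光国芯半导体有限公司 | Redundancy replacement method for erroneous data in Nand-Flash |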
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1866226A (zh) * | 2005-05-17 | 2006-11-22 | 英业达股份有限公司 | Data protection method |
CN101887351A (zh) * | 2010-06-22 | 2010-11-17 | 杭州华三通信技术有限公司 | Disk array fault tolerance method and system |
CN102184129A (zh) * | 2011-04-27 | 2011-09-14 | 杭州华三通信技术有限公司 | Fault tolerance method and apparatus for a disk array |
CN102981921A (zh) * | 2012-12-17 | 2013-03-20 | 浙江宇视科技有限公司 | Method and apparatus for repairing a read IO failure in a RAID5 array |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
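CN1253791C (zh) * | 2002-11-22 | 2006-04-26 | 华为技术有限公司 | Read/write operation method for multiple disk failures in a level-5 redundant array of independent disks |
CN100495313C (zh) * | 2007-10-19 | 2009-06-03 | 杭州华三通信技术有限公司 | Method for rebuilding a redundant array of disks, and redundant array of disks |
CN102637141A (zh) * | 2011-02-14 | 2012-08-15 | 鸿富锦精密工业(深圳)有限公司 | RAID automated test system and method |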
-
2012
- 2012-12-17 CN CN2012105503686A patent/CN102981921A/zh active Pending
-
2013
- 2013-12-13 WO PCT/CN2013/089373 patent/WO2014094572A1/zh active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN102981921A (zh) | 2013-03-20 | Method and apparatus for repairing a read IO failure in a RAID5 array |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10776267B2 (en) | Mirrored byte addressable storage | |
US8156392B2 (en) | Apparatus, system, and method for bad block remapping | |
US8589724B2 (en) | Rapid rebuild of a data set | |
US7809979B2 (en) | Storage control apparatus and method | |
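JP5607725B2 (ja) | Apparatus, method, and computer program for controlling a solid-state disk | |
CN102184129B (zh) | Fault tolerance method and apparatus for a disk array | |
US20130339784A1 (en) | Error recovery in redundant storage systems | |
JP4886209B2 (ja) | Array controller, information processing apparatus including the array controller, and disk array control method | |
WO2013159503A1 (zh) | Hard disk data recovery method, apparatus, and system | |
JP4792490B2 (ja) | Storage control device and RAID group expansion method | |
TW201535382A (zh) | Dynamic random access memory (DRAM) row sparing techniques | |
US8074113B2 (en) | System and method for data protection against power failure during sector remapping | |
JP2006079418A (ja) | Storage control device, control method, and program | |
WO2024113685A1 (zh) | Data recovery method for a RAID array and related apparatus | |
US20070036055A1 (en) | Device, method and program for recovering from media error in disk array device | |
WO2014094572A1 (zh) | Method and apparatus for repairing a read IO failure in a RAID5 array | |
US7308601B2 (en) | Program, method and apparatus for disk array control | |
JP5040331B2 (ja) | Storage device, storage device control method, and storage device control program | |
JP4203034B2 (ja) | Array controller, media error repair method, and program | |
TW201329701A (zh) | Disk array with automatic remapping function and automatic remapping method thereof | |
JP4143040B2 (ja) | Disk array control device, processing method applied to the device upon data loss detection, and program | |
JP4248164B2 (ja) | Disk array error recovery method, disk array control device, and disk array device | |
US20140173337A1 (en) | Storage apparatus, control method, and control program | |
US20200286577A1 (en) | Storage area retirement in a storage device | |
JP6175771B2 (ja) | Disk array device, bad sector repair method, and repair program |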
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 13866156 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 13866156 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 18.12.2015) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 13866156 Country of ref document: EP Kind code of ref document: A1 |