A Method for Rebuilding a Redundant Array of Independent Disks
Technical Field
The present application relates to the technical field of computer storage, in particular to Redundant Array of Independent Disks (RAID) technology, and more particularly to a method for rebuilding a RAID array.
Background
RAID is a technology that combines multiple independent disks in different ways to form a disk group (a logical disk), thereby providing higher storage performance than a single disk as well as data redundancy protection. The principle of RAID is to store data and the corresponding parity information across the disks that make up the RAID system, with the parity information and the data it protects stored on different disks. When the data on one disk of the RAID system is damaged, the remaining data and the corresponding parity information are used to recover the damaged data. As a foundation and key component of network storage systems, RAID is known for its speed, large capacity, and high reliability. Since its introduction, RAID has been in wide demand in fields such as industry, the military, and education, and research on RAID technology has remained a focus of the industry.
The different ways of composing a disk array are called RAID levels. Common RAID levels include RAID0, RAID1, RAID5, and RAID6. Different RAID levels provide different data protection schemes.
Take a RAID5 array composed of 4 disks as an example. It tolerates the failure of only one disk; once one disk has failed, the RAID5 array no longer provides data redundancy protection, so the failed disk needs to be replaced as soon as possible. After the failed disk is replaced, the disk controller computes the lost contents from the data and parity on the healthy disks and writes the results to the new disk. This process is called RAID rebuilding.
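The calculation performed by the controller can be illustrated with a minimal sketch. It assumes the XOR parity commonly used by RAID5 (the text above does not name the parity function); the block contents and the helper name `xor_blocks` are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe on a 4-disk RAID5: three data blocks plus one parity block.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_blocks(data)

# Suppose the disk holding data[1] fails: XOR the surviving blocks with the parity.
rebuilt = xor_blocks([data[0], data[2], parity])
assert rebuilt == data[1]
```

Because XOR is its own inverse, the same routine serves both to compute the parity and to regenerate any single missing block, which is why only one failed disk can be tolerated.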
The purpose of rebuilding is to restore the data redundancy protection of the RAID. When a disk in a RAID fails, disk array vendors generally use hot spare technology to rebuild the RAID automatically. Put simply, hot spare technology means designating a disk as the hot spare for a RAID system when the RAID system is created; when a member disk of the RAID system fails, the hot spare automatically replaces the failed disk and triggers a RAID rebuild. As the name suggests, a "hot" spare replaces the failed disk without interrupting the read and write traffic on the RAID system; that is, read and write operations can still be performed on the RAID system while it is being rebuilt.
In the prior art, when an input/output (I/O) request from the upper layer cannot be answered by a member disk of the RAID system, that member disk is generally considered to have failed, and the RAID system automatically starts the rebuild process. A RAID rebuild is expensive and takes a long time, and it degrades the performance of normal data I/O. Moreover, if another disk fails during the rebuild, the RAID system generally collapses outright, which makes the RAID system very fragile during this period. Rebuild operations should therefore be avoided wherever possible.
Summary of the Invention
The present application provides a method for rebuilding a RAID that reduces, as far as possible, the probability that a RAID rebuild has to be carried out.
The method for rebuilding a RAID provided by the embodiments of the present application comprises:
A. the controller of the RAID system detects that a first disk in the RAID system cannot respond to I/O operations, switches off the power supply of the first disk alone, and starts a timer of a predetermined duration;
B. while the timer is running, the RAID system carries out normal read and write operations and records the numbers of all stripes on which write operations occur during this period;
C. when the timer expires, the power supply of the first disk is switched on, powering the first disk up;
D. after the first disk has powered up, a read/write test is performed on the first disk;
E. it is determined whether the first disk reads and writes normally; if so, step F is performed; otherwise, step G is performed;
F. according to the numbers of all stripes on which write operations occurred, as recorded while the first disk was powered off, the data in the corresponding stripes of the first disk is recovered, and the procedure ends once the recovery is complete;
G. the first disk is marked as a bad disk, a second disk serving as the hot spare replaces the first disk, a calculation is performed on the data and parity of the other disks in the RAID system, and the results of the calculation are written to the second disk.
Preferably, the read/write test comprises:
D1. checking whether the first disk is online and has been loaded by the driver into the operating system; if it is not online, the first disk is a bad disk; if it is online, proceeding to step D2;
D2. sending the SCSI command "TEST UNIT READY" to the disk to check whether the disk is ready for reading and writing; if it cannot be read or written, the disk is a bad disk; if it can, performing step D3;
D3. writing the RAID metadata for the first disk, as recorded in the operating system, to the location of that metadata on the disk; if the write fails, the first disk is judged to be a bad disk; if the write succeeds, proceeding to step D4;
D4. performing a read operation on the RAID metadata of the first disk; if the read succeeds, the first disk is confirmed to be a good disk; if the read fails, the first disk is judged to be a bad disk.
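The four checks D1 to D4 can be sketched as a short-circuiting sequence. The `DiskProbe` hooks below are hypothetical placeholders for whatever driver or SCSI calls a real controller would use; only the order of the checks and the pass/fail logic are taken from the steps above.

```python
from typing import Callable, NamedTuple


class DiskProbe(NamedTuple):
    """Hypothetical hooks; a real controller would back these with driver calls."""
    is_online: Callable[[], bool]            # D1: disk visible and driver loaded
    test_unit_ready: Callable[[], bool]      # D2: SCSI TEST UNIT READY succeeded
    write_metadata: Callable[[bytes], bool]  # D3: True if the metadata write succeeded
    read_metadata: Callable[[], bool]        # D4: True if the metadata read succeeded


def disk_passes_test(probe: DiskProbe, metadata: bytes) -> bool:
    """Run D1 to D4 in order; the first failing check marks the disk as bad."""
    if not probe.is_online():
        return False
    if not probe.test_unit_ready():
        return False
    if not probe.write_metadata(metadata):
        return False
    return probe.read_metadata()


# Usage with stubbed hooks: one healthy disk, one that never becomes ready.
healthy = DiskProbe(lambda: True, lambda: True, lambda m: True, lambda: True)
not_ready = healthy._replace(test_unit_ready=lambda: False)
print(disk_passes_test(healthy, b"raid-meta"), disk_passes_test(not_ready, b"raid-meta"))
```

Ordering the checks from cheapest to most intrusive means a disk that is not even visible to the operating system is rejected before any write is attempted.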
As can be seen from the above technical solution, when a disk of the RAID system cannot respond to I/O operations, the disk is first powered off. While it is powered off, the application layer is allowed to read and write the RAID normally, and the numbers of all stripes on which write operations occur during this period are recorded. The disk is then powered on and tested to determine whether it can read and write normally. If so, the data in the corresponding stripes of the disk is recovered according to the recorded stripe numbers; otherwise, the disk is marked as a bad disk and the conventional rebuild process is started. In this way, a disk of the RAID system can in most cases be returned to normal operation without a rebuild.
Brief Description of the Drawings
Fig. 1 is a flowchart of a method for rebuilding a RAID array provided by an embodiment of the present application.
Detailed Description
In most cases, the fact that an upper-layer I/O request cannot be answered by a member disk of the RAID system does not mean that the member disk is actually damaged. According to statistics from the disk manufacturer Seagate, when a disk cannot respond to I/O requests, 95% of cases are caused by software errors such as firmware or verification errors, and in these cases the disk can be made usable again by a simple repair operation; only in 5% of cases is the disk actually damaged. Therefore, starting the rebuild process for the RAID system when the disk is not actually damaged greatly increases the operation and maintenance cost of the RAID system.
The present application provides a method for rebuilding a RAID array. Its basic idea is as follows: each disk slot interface of the RAID system is provided with a circuit that can power the disk in that slot on and off independently. When a disk of the RAID system cannot respond to I/O operations, the disk is first powered off. While it is powered off, the application layer is allowed to read and write the RAID normally, and the numbers of all stripes on which write operations occur during this period are recorded. The disk is then powered on and tested to determine whether it can read and write normally. If so, the data in the corresponding stripes of the disk is recovered according to the recorded stripe numbers; otherwise, the disk is marked as a bad disk and the conventional rebuild process is started.
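One way to obtain the stripe numbers to record is to map the logical byte range of each write onto stripes. The sketch below assumes a byte-addressed logical volume and a fixed number of data bytes per stripe; `stripes_touched`, `DirtyStripeLog`, and the 192 KiB stripe size are illustrative assumptions, not details taken from the application.

```python
def stripes_touched(offset: int, length: int, stripe_size: int) -> range:
    """Stripe numbers covered by a write of `length` bytes at logical byte `offset`."""
    if length <= 0:
        return range(0)
    first = offset // stripe_size
    last = (offset + length - 1) // stripe_size
    return range(first, last + 1)


class DirtyStripeLog:
    """Records the stripes written while a member disk is powered off."""

    def __init__(self, stripe_size: int) -> None:
        self.stripe_size = stripe_size
        self.dirty = set()

    def record_write(self, offset: int, length: int) -> None:
        self.dirty.update(stripes_touched(offset, length, self.stripe_size))


# Usage: assumed stripe of 3 data blocks x 64 KiB = 192 KiB of data per stripe.
log = DirtyStripeLog(stripe_size=192 * 1024)
log.record_write(offset=196600, length=16)  # straddles the boundary of stripes 0 and 1
log.record_write(offset=0, length=10)       # stripe 0 again; the set deduplicates it
print(sorted(log.dirty))
```

Keeping the log as a set means repeated writes to the same stripe cost nothing extra, so the later recovery work is bounded by the number of distinct stripes written rather than by the number of writes.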
To make the technical principles, features, and technical effects of the technical solution of the present application clearer, the technical solution is described in detail below with reference to a specific embodiment.
The flow of the method for rebuilding a RAID array provided by the embodiment of the present application is shown in Fig. 1 and comprises the following steps:
Step 101: The controller of the RAID system detects that a disk in the RAID system cannot respond to I/O operations, switches off the power supply of that disk alone so that the disk is powered down, and starts a timer of a predetermined duration. This disk is referred to below as the first disk.
Step 102: While the timer is running (that is, while the first disk is powered off), the RAID system carries out normal read and write operations and records the numbers of all stripes on which write operations occur during this period.
Step 103: When the timer expires, the power supply of the first disk is switched on, powering the first disk up.
Step 104: After the first disk has powered up, a read/write test is performed on the first disk.
In the embodiment of the present application, the read/write test comprises the following operations:
D1. Check whether the first disk is online and has been loaded by the driver into the operating system; if it is not online, the first disk is a bad disk; if it is online, proceed to step D2.
D2. Send the SCSI command "TEST UNIT READY" to the disk to check whether the disk is ready for reading and writing; if it cannot be read or written, the disk is a bad disk; if it can, perform step D3.
D3. Write the RAID metadata for the first disk, as recorded in the operating system, to the location of that metadata on the disk; if the write fails, the first disk is judged to be a bad disk; if the write succeeds, proceed to step D4.
D4. Perform a read operation on the RAID metadata of the first disk; if the read succeeds, the first disk is confirmed to be a good disk; if the read fails, the first disk is judged to be a bad disk.
Step 105: Determine whether the first disk reads and writes normally; if so, perform step 106; otherwise, perform step 107.
Step 106: According to the numbers of all stripes on which write operations occurred, as recorded while the first disk was powered off, recover the data in the corresponding stripes of the first disk, and end the procedure once the recovery is complete.
Step 107: Mark the first disk as a bad disk, replace the first disk with a second disk serving as the hot spare, perform a calculation on the data and parity of the other disks in the RAID system, and write the results of the calculation to the second disk.
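Steps 101 to 107 can be summarized in a minimal control-flow sketch. All of the callables passed in are hypothetical hooks (power control, the read/write test, the stripe log, and the two recovery paths); the sketch only shows the order of operations and the branch at step 105.

```python
import time
from typing import Callable, Iterable


def handle_unresponsive_disk(
    power_off: Callable[[], None],
    power_on: Callable[[], None],
    passes_test: Callable[[], bool],
    dirty_stripes: Callable[[], Iterable[int]],
    resync_stripe: Callable[[int], None],
    rebuild_to_spare: Callable[[], None],
    delay_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    power_off()               # step 101: cut power to the first disk only
    sleep(delay_seconds)      # steps 101-102: timer runs; the array keeps serving
                              # I/O and written stripes are logged elsewhere
    power_on()                # step 103: timer expired, power the disk up again
    if passes_test():         # steps 104-105: read/write test
        for stripe in sorted(dirty_stripes()):
            resync_stripe(stripe)   # step 106: recover only the logged stripes
        return "resynced"
    rebuild_to_spare()        # step 107: mark bad, full rebuild onto the hot spare
    return "rebuilt"


# Usage with stubbed hooks and no real waiting.
calls = []
outcome = handle_unresponsive_disk(
    power_off=lambda: calls.append("off"),
    power_on=lambda: calls.append("on"),
    passes_test=lambda: True,
    dirty_stripes=lambda: {7, 2},
    resync_stripe=lambda s: calls.append(("resync", s)),
    rebuild_to_spare=lambda: calls.append("rebuild"),
    delay_seconds=0,
    sleep=lambda seconds: None,
)
print(outcome, calls)
```

Passing the hooks in as parameters keeps the sketch independent of any particular controller hardware; the full rebuild is reached only when the power-cycled disk fails its test.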
The foregoing describes only preferred embodiments of the present application and is not intended to limit its scope of protection. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the technical solution of the present application shall fall within the scope of protection of the present application.