CN107844381A

CN107844381A - Fault handling method and device for a storage system

Info

Publication number: CN107844381A
Application number: CN201610837841.7A
Authority: CN
Inventors: 郑文武; 李先绪; 黄植勤; 吴家隐; 邱红飞; 陈泳
Original assignee: China Telecom Corp Ltd
Current assignee: China Telecom Corp Ltd
Priority date: 2016-09-21
Filing date: 2016-09-21
Publication date: 2018-03-27

Abstract

The invention discloses a fault handling method and device for a storage system, and relates to the field of computer technology. The invention assesses the health of a disk before the disk fails. Once an at-risk disk is found, a hot spare disk is enabled immediately, but it is not yet added to the disk array. Because the at-risk disk still works normally at this point, its data can be copied online to the hot spare disk within a relatively short time, and subsequent write operations keep the hot spare disk consistent with the at-risk disk. As soon as the at-risk disk becomes a failed disk, the hot spare disk is added to the disk array. Because the data on the hot spare disk and the at-risk disk are fully consistent, the hot spare disk can take over from the failed disk immediately. This avoids the lengthy reconstruction process of recovering data from parity data, which reduces the time during which data is unprotected and improves data security.

Description

Fault handling method and device for a storage system

Technical field

The present invention relates to the field of computer technology, and in particular to a fault handling method and device for a storage system.

Background art

Disk array technology (Redundant Arrays of Independent Disks, RAID) is commonly used in current storage systems to keep data safe and reliable. A RAID group consists of two or more disks. In addition to business data, the disks also hold parity data. After one or two disks in a RAID group fail, a new disk can be added to the RAID group manually or automatically, and the system uses the parity data on the healthy disks to recover the lost data onto the new disk. Today the new disk is usually added automatically, and such a disk is called a hot spare disk. A hot spare disk is normally idle and is added to the RAID group only after a disk in the group fails. Once the hot spare disk has been added to the RAID group, the system restores the data onto it from the parity data and rebuilds the RAID group; this process is called reconstruction.

In a RAID group, each disk is divided into multiple data blocks by stripe, and reconstruction likewise recovers data block by block along the stripes. The system reads the parity data blocks, computes the recovered data, and then writes that recovered data to the hot spare disk. After one block has been recovered, the system moves on to the remaining blocks. This process is generally lengthy: in real production environments, reconstruction takes at least 2 hours, and with larger data volumes it takes more than 10 hours or even several days. Even with the newer RAID2.0 technology, reconstruction often takes several hours.
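The stripe-by-stripe recovery described above can be illustrated with a minimal sketch. It assumes single XOR parity per stripe (as in RAID 5), which the text does not specify; the function names `xor_blocks`, `rebuild_stripe` and `rebuild_disk` are hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_stripe(surviving_blocks):
    """Recover the one missing block of a stripe from the surviving
    data and parity blocks (assumption: single XOR parity)."""
    return xor_blocks(surviving_blocks)


def rebuild_disk(stripes_of_survivors):
    """Rebuild a failed disk one stripe at a time, as reconstruction does."""
    return [rebuild_stripe(stripe) for stripe in stripes_of_survivors]


# Example: one stripe with two data blocks and one parity block.
d0, d1 = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40"
parity = xor_blocks([d0, d1])
recovered_d1 = rebuild_stripe([d0, parity])  # pretend the disk holding d1 failed
```

Every stripe requires reading all surviving disks and writing the result to the hot spare disk, which is why reconstruction time grows with data volume.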

Current disk array reconstruction is time-consuming, and no further disk failure may occur during reconstruction; otherwise the data recovery process is interrupted and data is lost. During reconstruction, data is therefore unprotected for a long time, and data safety is seriously threatened.

Summary of the invention

One technical problem to be solved by the present invention is how to ensure that data is not lost when a disk fails while reducing the time during which data is unprotected during reconstruction, thereby improving data security.

According to one aspect of the present invention, a fault handling method for a storage system is provided, comprising: obtaining operating parameters of a disk; determining, according to the operating parameters of the disk, whether the disk is at risk; synchronizing the data of an at-risk disk to a hot spare disk in real time; and, when the at-risk disk fails, using the hot spare disk to work in place of the damaged disk.

In one embodiment, the operating parameters of the disk are obtained from the disk's Self-Monitoring, Analysis and Reporting Technology (SMART) data.

In one embodiment, the operating parameters of the disk include the read error rate and/or the write error rate of the disk. Determining whether the disk is at risk according to its operating parameters includes: comparing the read error rate of the disk with a read error rate threshold, and/or comparing the write error rate with a write error rate threshold; and, if the read error rate exceeds the read error rate threshold and/or the write error rate exceeds the write error rate threshold, determining that the disk is at risk.

In one embodiment, the read error rate threshold is in the range of 15% to 25%, and the write error rate threshold is in the range of 30% to 50%.

In one embodiment, the operating parameters of the disk include the read rate and/or the write rate. Determining whether the disk is at risk according to its operating parameters includes: comparing the read rate of the disk with a read rate threshold, and/or comparing the write rate with a write rate threshold; and, if the read rate is below the read rate threshold and/or the write rate is below the write rate threshold, determining that the disk is at risk.

According to another aspect of the present invention, a fault handling device for a storage system is provided, comprising: a disk parameter acquiring unit for obtaining operating parameters of a disk; a disk state judging unit for determining, according to the operating parameters of the disk, whether the disk is at risk; a disk data synchronization unit for synchronizing the data of an at-risk disk to a hot spare disk in real time; and a disk replacement unit for using the hot spare disk to work in place of the damaged disk when the at-risk disk fails.

In one embodiment, the disk parameter acquiring unit is configured to obtain the operating parameters of the disk from the disk's Self-Monitoring, Analysis and Reporting Technology (SMART) data.

In one embodiment, the operating parameters of the disk include the read error rate and/or the write error rate of the disk. The disk state judging unit is configured to compare the read error rate of the disk with a read error rate threshold, and/or compare the write error rate with a write error rate threshold, and to determine that the disk is at risk if the read error rate exceeds the read error rate threshold and/or the write error rate exceeds the write error rate threshold.

In one embodiment, the operating parameters of the disk include the read rate and/or the write rate. The disk state judging unit is configured to compare the read rate of the disk with a read rate threshold, and/or compare the write rate with a write rate threshold, and to determine that the disk is at risk if the read rate is below the read rate threshold and/or the write rate is below the write rate threshold.

The present invention assesses the health of a disk before the disk fails. Once an at-risk disk is found, a hot spare disk is enabled immediately, but it is not yet added to the disk array. Because the at-risk disk still works normally at this point, its data can be copied online to the hot spare disk within a relatively short time, and subsequent write operations keep the hot spare disk consistent with the at-risk disk. As soon as the at-risk disk becomes a failed disk, the hot spare disk is added to the disk array. Because the data on the hot spare disk and the at-risk disk are fully consistent, the hot spare disk can take over from the failed disk immediately. This avoids the lengthy reconstruction process of recovering data from parity data, which reduces the time during which data is unprotected and improves data security.

Other features and advantages of the present invention will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 shows a schematic flowchart of the fault handling method for a storage system according to one embodiment of the present invention.

Fig. 2 shows a schematic flowchart of the fault handling method for a storage system according to an application example of the present invention.

Fig. 3 shows a schematic structural diagram of the fault handling device for a storage system according to one embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. The following description of at least one exemplary embodiment is merely illustrative and in no way limits the present invention, its application, or its use. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

This solution is proposed to address the problem that, in the prior art, disk array reconstruction is time-consuming, so that data is unprotected for a long time during reconstruction and data safety is seriously threatened.

The fault handling method for a storage system of the present invention is described below with reference to Fig. 1.

Fig. 1 is a flowchart of one embodiment of the fault handling method for a storage system of the present invention. As shown in Fig. 1, the method of this embodiment includes:

Step S102, obtain the operating parameters of the disk.

The operating parameters of the disk are obtained, for example, from SMART (Self-Monitoring, Analysis and Reporting Technology) data. SMART data records information about the disk such as its error rates and read/write speeds.

Step S104, determine whether the disk is at risk according to its operating parameters.

The present invention provides two reference schemes for determining whether a disk is at risk.

First, whether the disk is at risk is determined from the read error rate and/or the write error rate of the disk. Specifically, the read error rate of the disk is compared with a read error rate threshold, and/or the write error rate is compared with a write error rate threshold; if the read error rate exceeds the read error rate threshold and/or the write error rate exceeds the write error rate threshold, the disk is determined to be at risk. The determination can be based on either the read error rate or the write error rate alone, or on the two combined, which is more accurate than using only one of them as the criterion. The inventors found that if the criterion is too lenient (the permitted read and write error rates are too large), failures may be missed, so that some disks that are about to fail are not identified as at-risk disks. If the criterion is too strict (the permitted read and write error rates are too small), the system may flag too many at-risk disks, and a flagged disk may still run normally for a long time, so that hot spare disks are consumed prematurely. When the read error rate threshold is set to a value between 15% and 25%, for example 20%, and the write error rate threshold is set to a value between 30% and 50%, for example 40%, an impending disk failure can be predicted more accurately.

Second, whether the disk is at risk is determined from the read rate and/or the write rate of the disk. Specifically, the read rate of the disk is compared with a read rate threshold, and/or the write rate is compared with a write rate threshold; if the read rate is below the read rate threshold and/or the write rate is below the write rate threshold, the disk is determined to be at risk. The determination can be based on either the read rate or the write rate alone, or on the two combined, which is more accurate than using only one of them as the criterion.

In addition to the two schemes above, other operating parameters can be chosen, according to requirements or practical experience, to determine whether a disk is at risk.
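The two determination schemes can be sketched as follows. This is a minimal illustration, not the patented implementation: the `DiskParams` container, the function names, and the `require_both` flag (which selects between the "and" and "or" readings of the criterion) are assumptions. The default error rate thresholds of 20% and 40% are the example values given in the text; the text gives no values for the rate thresholds, so the caller must supply them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiskParams:
    """Hypothetical container for operating parameters; None means 'not available'."""
    read_error_rate: Optional[float] = None   # fraction, e.g. 0.22 for 22%
    write_error_rate: Optional[float] = None  # fraction
    read_rate: Optional[float] = None         # any consistent unit, e.g. MB/s
    write_rate: Optional[float] = None


def _combine(checks, require_both):
    """Combine the available checks with 'and' or 'or'; no data means not at risk."""
    if not checks:
        return False
    return all(checks) if require_both else any(checks)


def at_risk_by_error_rate(p, read_thr=0.20, write_thr=0.40, require_both=False):
    """Scheme 1: at risk when an error rate exceeds its threshold."""
    checks = []
    if p.read_error_rate is not None:
        checks.append(p.read_error_rate > read_thr)
    if p.write_error_rate is not None:
        checks.append(p.write_error_rate > write_thr)
    return _combine(checks, require_both)


def at_risk_by_rate(p, read_thr, write_thr, require_both=False):
    """Scheme 2: at risk when a read or write rate falls below its threshold."""
    checks = []
    if p.read_rate is not None:
        checks.append(p.read_rate < read_thr)
    if p.write_rate is not None:
        checks.append(p.write_rate < write_thr)
    return _combine(checks, require_both)
```

Setting `require_both=True` corresponds to combining the two parameters, which the text describes as more accurate than relying on one of them alone.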

The inventors found that when an operating parameter crosses its defined threshold, the disk can be considered at risk even though it can still be read and written correctly: it will most likely become a failed disk that cannot be read or written normally within a relatively short time. Once a new disk has entered a stable working state in which accidental failures are rare, disk failure prediction based on SMART data has a high accuracy rate; the better the disk quality and the more stable its operation, the higher the prediction accuracy. In practice, the accuracy exceeds 90%.

Step S106, synchronize the data of the at-risk disk to the hot spare disk in real time.

Real-time synchronization includes fully copying the data already on the disk to the hot spare disk and, whenever the disk performs a write operation, synchronizing the data being written to the hot spare disk in real time.

Step S108, when the at-risk disk fails, use the hot spare disk to work in place of the damaged disk.

Because the hot spare disk has already synchronized the disk's data before the disk fails, the hot spare disk can take over promptly once the disk fails, and no data reconstruction is needed.
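The two parts of real-time synchronization, a full copy of existing data plus mirroring of new writes, can be sketched with in-memory stand-ins for the disks. The classes `BlockDevice` and `HotSpareSync` and their methods are hypothetical, and the sketch is single-threaded; a real system would also need to order the background copy against concurrent writes.

```python
class BlockDevice:
    """Hypothetical in-memory stand-in for a disk made of fixed-size blocks."""

    def __init__(self, num_blocks, block_size=4):
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]

    def read(self, index):
        return self.blocks[index]

    def write(self, index, data):
        self.blocks[index] = data


class HotSpareSync:
    """Keeps a hot spare consistent with an at-risk disk."""

    def __init__(self, at_risk, spare):
        self.at_risk = at_risk
        self.spare = spare
        self.next_block = 0  # blocks below this index have been copied

    def copy_step(self):
        """Copy one existing block; return False once the full copy is done."""
        if self.next_block >= len(self.at_risk.blocks):
            return False
        self.spare.write(self.next_block, self.at_risk.read(self.next_block))
        self.next_block += 1
        return True

    def write(self, index, data):
        """Apply a host write to both disks so they stay consistent."""
        self.at_risk.write(index, data)
        self.spare.write(index, data)

    def in_sync(self):
        return self.next_block >= len(self.at_risk.blocks)
```

Mirroring every write to both disks, even for blocks not yet reached by the copy, keeps the logic simple: a later `copy_step` only rewrites the same data onto the hot spare.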

If the operating data of a disk does not accurately predict that the disk is about to fail, data is restored after the disk fails using the original reconstruction technique. In addition, because a disk's operating data may fluctuate, it can be monitored periodically. If a disk has been identified as at risk but, after a preset time, its operating data returns to normal and the disk no longer qualifies as at risk, the hot spare disk no longer needs to synchronize its data and is released, which avoids tying up hot spare resources unnecessarily.
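The behavior described above, including the fallback to conventional reconstruction and the release of the hot spare disk, can be viewed as a small state machine evaluated at each periodic check. The sketch below is one possible reading: the `DiskStatus` names and `next_state` function are hypothetical, the "preset time" is modeled as a number of consecutive normal checks (`recovery_checks`, an assumed parameter), and synchronization is assumed to have completed before an at-risk disk fails.

```python
from enum import Enum


class DiskStatus(Enum):
    HEALTHY = "healthy"
    AT_RISK = "at_risk"              # hot spare enabled and being synchronized
    REPLACED = "replaced_by_spare"   # predicted failure: spare takes over at once
    REBUILDING = "rebuilding"        # unpredicted failure: conventional reconstruction


def next_state(state, at_risk_now, failed, healthy_checks, recovery_checks=3):
    """Return (new_state, healthy_checks) after one periodic check."""
    if state in (DiskStatus.REPLACED, DiskStatus.REBUILDING):
        return state, 0  # terminal for this sketch
    if failed:
        if state is DiskStatus.AT_RISK:
            return DiskStatus.REPLACED, 0
        return DiskStatus.REBUILDING, 0
    if state is DiskStatus.HEALTHY:
        return (DiskStatus.AT_RISK, 0) if at_risk_now else (DiskStatus.HEALTHY, 0)
    # state is AT_RISK and the disk has not failed
    if at_risk_now:
        return DiskStatus.AT_RISK, 0
    healthy_checks += 1
    if healthy_checks >= recovery_checks:
        return DiskStatus.HEALTHY, 0  # release the hot spare
    return DiskStatus.AT_RISK, healthy_checks
```

Requiring several consecutive normal checks before releasing the hot spare is a design choice of this sketch that guards against fluctuating operating data.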

The method of the above embodiment assesses the health of a disk before the disk fails. Once an at-risk disk is found, a hot spare disk is enabled immediately, but it is not yet added to the disk array. Because the at-risk disk still works normally at this point, its data can be copied online to the hot spare disk within a relatively short time, and subsequent write operations keep the hot spare disk consistent with the at-risk disk. As soon as the at-risk disk becomes a failed disk, the hot spare disk is added to the disk array. Because the data on the hot spare disk and the at-risk disk are fully consistent, the hot spare disk can take over from the failed disk immediately. This avoids the lengthy reconstruction process of recovering data from parity data, which reduces the time during which data is unprotected and improves data security.

An application example of the fault handling method for a storage system of the present invention is described below with reference to Fig. 2.

Fig. 2 is a flowchart of an application example of the fault handling method for a storage system of the present invention. As shown in Fig. 2, the method of this application example includes:

Step S202, read the SMART parameters of one disk in the disk array.

Specifically, a function DiskState (Re, We) can be written to read the read error rate and the write error rate from the SMART parameters, and the read error rate threshold and write error rate threshold are set.

Step S204, determine whether the disk is an at-risk disk, that is, whether the read error rate exceeds the read error rate threshold of 20% and whether the write error rate exceeds the write error rate threshold of 40%. If the read error rate exceeds 20% and the write error rate exceeds 40%, perform step S206; otherwise, determine that the disk is not an at-risk disk and return to step S202 to read the SMART parameters of the next disk in the disk array.

Step S206, copy the data on the at-risk disk to the hot spare disk.

Step S208, keep the data on the hot spare disk synchronized with the at-risk disk.

Step S210, determine whether the at-risk disk has failed. If it has failed, perform step S212; otherwise, continue with step S208.

Step S212, add the hot spare disk to the disk array in place of the failed disk.
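Steps S202 to S212 can be sketched end to end with simple in-memory data. Everything here is illustrative: the text names DiskState (Re, We) but does not define it, so `disk_state` below is only a guess at its role; the dictionary layout, the `events` sequence and the helper names are assumptions. The 20% and 40% thresholds and the "and" combination follow the application example.

```python
def disk_state(read_error_rate, write_error_rate, re_thr=0.20, we_thr=0.40):
    """S204: at risk only if both error rates exceed their thresholds."""
    return read_error_rate > re_thr and write_error_rate > we_thr


def find_at_risk_disk(smart_by_disk):
    """S202/S204: check disks one by one; return the first at-risk disk or None."""
    for name, (read_err, write_err) in smart_by_disk.items():
        if disk_state(read_err, write_err):
            return name
    return None


def protect_and_replace(array, data, at_risk, spare, events):
    """S206 to S212 for one at-risk disk, driven by a sequence of events."""
    data[spare] = list(data[at_risk])            # S206: full copy to the hot spare
    for event in events:
        if event[0] == "write":                  # S208: mirror each write
            _, index, value = event
            data[at_risk][index] = value
            data[spare][index] = value
        elif event[0] == "fail":                 # S210: failure detected
            array[array.index(at_risk)] = spare  # S212: spare joins the array
            break
    return array


smart = {"disk0": (0.05, 0.10), "disk1": (0.25, 0.45), "disk2": (0.30, 0.10)}
array = ["disk0", "disk1", "disk2"]
data = {name: ["a", "b", "c"] for name in array}
risky = find_at_risk_disk(smart)
events = [("write", 1, "B"), ("fail",)]
array = protect_and_replace(array, data, risky, "spare", events)
```

In this run only `disk1` exceeds both thresholds; after the failure event the hot spare holds the same blocks as `disk1` and occupies its slot in the array.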

The present invention also provides a fault handling device for a storage system, described below with reference to Fig. 3.

Fig. 3 is a structural diagram of one embodiment of the fault handling device for a storage system of the present invention. As shown in Fig. 3, the device 30 includes:

A disk parameter acquiring unit 302, for obtaining the operating parameters of the disk.

Specifically, the disk parameter acquiring unit 302 is configured to obtain the operating parameters of the disk from the disk's Self-Monitoring, Analysis and Reporting Technology (SMART) data.

A disk state judging unit 304, for determining whether the disk is at risk according to its operating parameters.

Where the operating parameters of the disk include the read error rate and/or the write error rate of the disk, the disk state judging unit 304 is configured to compare the read error rate of the disk with a read error rate threshold, and/or compare the write error rate with a write error rate threshold, and to determine that the disk is at risk if the read error rate exceeds the read error rate threshold and/or the write error rate exceeds the write error rate threshold. The read error rate threshold is in the range of 15% to 25%, for example 20%; the write error rate threshold is in the range of 30% to 50%, for example 40%.

Where the operating parameters of the disk include the read rate and/or the write rate, the disk state judging unit 304 is configured to compare the read rate of the disk with a read rate threshold, and/or compare the write rate with a write rate threshold, and to determine that the disk is at risk if the read rate is below the read rate threshold and/or the write rate is below the write rate threshold.

A disk data synchronization unit 306, for synchronizing the data of an at-risk disk to the hot spare disk in real time.

A disk replacement unit 308, for using the hot spare disk to work in place of the damaged disk when the at-risk disk fails.

A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented in hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may be a read-only memory, a magnetic disk, an optical disc, or the like.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A fault handling method for a storage system, characterized by comprising:

obtaining operating parameters of a disk;

determining, according to the operating parameters of the disk, whether the disk is at risk;

synchronizing the data of an at-risk disk to a hot spare disk in real time;

when the at-risk disk fails, using the hot spare disk to work in place of the damaged disk.
2. The method according to claim 1, characterized in that:

the operating parameters of the disk are obtained from the disk's Self-Monitoring, Analysis and Reporting Technology data.
3. The method according to claim 1, characterized in that:

the operating parameters of the disk include the read error rate and/or the write error rate of the disk;

determining, according to the operating parameters of the disk, whether the disk is at risk includes:

comparing the read error rate of the disk with a read error rate threshold, and/or comparing the write error rate with a write error rate threshold;

if the read error rate exceeds the read error rate threshold and/or the write error rate exceeds the write error rate threshold, determining that the disk is at risk.
4. The method according to claim 3, characterized in that:

the read error rate threshold is in the range of 15% to 25%, and the write error rate threshold is in the range of 30% to 50%.
5. The method according to claim 1, characterized in that:

the operating parameters of the disk include the read rate and/or the write rate;

determining, according to the operating parameters of the disk, whether the disk is at risk includes:

comparing the read rate of the disk with a read rate threshold, and/or comparing the write rate with a write rate threshold;

if the read rate is below the read rate threshold and/or the write rate is below the write rate threshold, determining that the disk is at risk.
6. A fault handling device for a storage system, characterized by comprising:

a disk parameter acquiring unit, for obtaining operating parameters of a disk;

a disk state judging unit, for determining, according to the operating parameters of the disk, whether the disk is at risk;

a disk data synchronization unit, for synchronizing the data of an at-risk disk to a hot spare disk in real time;

a disk replacement unit, for using the hot spare disk to work in place of the damaged disk when the at-risk disk fails.
7. The device according to claim 6, characterized in that:

the disk parameter acquiring unit is configured to obtain the operating parameters of the disk from the disk's Self-Monitoring, Analysis and Reporting Technology data.
8. The device according to claim 6, characterized in that:

the operating parameters of the disk include the read error rate and/or the write error rate of the disk;

the disk state judging unit is configured to compare the read error rate of the disk with a read error rate threshold, and/or compare the write error rate with a write error rate threshold, and to determine that the disk is at risk if the read error rate exceeds the read error rate threshold and/or the write error rate exceeds the write error rate threshold.
9. The device according to claim 8, characterized in that:

the read error rate threshold is in the range of 15% to 25%, and the write error rate threshold is in the range of 30% to 50%.
10. The device according to claim 6, characterized in that:

the operating parameters of the disk include the read rate and/or the write rate;

the disk state judging unit is configured to compare the read rate of the disk with a read rate threshold, and/or compare the write rate with a write rate threshold, and to determine that the disk is at risk if the read rate is below the read rate threshold and/or the write rate is below the write rate threshold.