CN106990918A - Method and device for triggering a RAID array rebuild - Google Patents

Info

Publication number
CN106990918A
Authority
CN
China
Prior art keywords
disk
response time
raid array
average response
exception
Legal status
Pending
Application number
CN201710125115.7A
Other languages
Chinese (zh)
Inventor
上官应兰
张学东
Current Assignee
Hangzhou Sequoia Polytron Technologies Inc
Original Assignee
Hangzhou Sequoia Polytron Technologies Inc
Application filed by Hangzhou Sequoia Polytron Technologies Inc
Priority to CN201710125115.7A
Publication of CN106990918A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G06F3/0611 Improving I/O performance in relation to response time
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629 Configuration or reconfiguration of storage systems
    • G06F3/0632 Configuration or reconfiguration of storage systems by initialisation or re-initialisation of storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671 In-line storage system
    • G06F3/0683 Plurality of storage devices
    • G06F3/0689 Disk arrays, e.g. RAID, JBOD

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application provides a method and device for triggering a RAID array rebuild. The method is applied to the RAID subsystem of a storage device and includes: issuing received IO read/write commands to each member disk of the RAID array; computing the average response time of each member disk's IO read/write commands within a preset statistical period; searching the non-hot-spare, non-faulty member disks of the RAID array for member disks whose average response time reaches an abnormal response time threshold; and, among the member disks found, marking the top N member disks with the largest average response times as faulty member disks and notifying the RAID array to rebuild, where N is not greater than the number of member disks the RAID array supports rebuilding simultaneously. With this method, a rebuild of the RAID array to which a member disk belongs can be triggered based on the response time of that member disk's IO read/write commands.

Description

Method and device for triggering a RAID array rebuild
Technical field
The application relates to the field of computer communications, and in particular to a method and device for triggering a RAID array rebuild.
Background
A RAID array (Redundant Array of Independent Disks, RAID) is a technology that combines multiple independent disks (physical disks) in different ways into a disk group (logical disk), providing higher storage performance and data reliability than a single disk.
In the field of computer communications, RAID technology is commonly used to provide redundancy protection for data on disks. When data is written, it is split across multiple member disks according to the RAID algorithm. Depending on the RAID level, the array can tolerate one or more disks failing or going offline. When a disk IO error or an offline disk is detected, a dedicated hot-spare disk or a global hot-spare disk can be used for the rebuild, restoring the data redundancy of the RAID array.
However, existing methods for triggering a RAID array rebuild consider only disk IO errors and disks going offline. They do not consider the case where a disk's response time slows down as it ages, eventually causing service interruptions. How to trigger a RAID array rebuild when a disk responds slowly has therefore become an urgent problem.
Summary of the invention
In view of this, the application provides a method and device for triggering a RAID array rebuild, so that a rebuild of the RAID array to which a member disk belongs can be triggered based on the response time of the member disk's IO read/write commands.
Specifically, the application is implemented through the following technical solutions:
According to a first aspect of the application, a method for triggering a RAID array rebuild is provided. The method is applied to the RAID subsystem of a storage device; the storage device is preconfigured with at least one RAID array, and the RAID array includes several member disks. The method includes:
issuing received IO read/write commands to each member disk in the RAID array;
computing the average response time of each member disk based on the response times of its IO read/write commands within a preset statistical period, and searching the non-hot-spare, non-faulty member disks of the RAID array for member disks whose average response time reaches an abnormal response time threshold;
among the member disks found, marking the top N member disks with the largest average response times as faulty member disks, and notifying the RAID array to rebuild, where N is not greater than the number of member disks the RAID array supports rebuilding simultaneously.
According to a second aspect of the application, a device for triggering a RAID array rebuild is provided. The device is applied to the RAID subsystem of a storage device; the storage device is preconfigured with at least one RAID array, and the RAID array includes several member disks. The device includes:
an issuing unit, configured to issue received IO read/write commands to each member disk in the RAID array;
a statistics unit, configured to compute the average response time of each member disk based on the response times of its IO read/write commands within a preset statistical period;
a searching unit, configured to search the non-hot-spare, non-faulty member disks of the RAID array for member disks whose average response time reaches an abnormal response time threshold;
a marking unit, configured to mark, among the member disks found, the top N member disks with the largest average response times as faulty member disks and to notify the RAID array to rebuild, where N is not greater than the number of member disks the RAID array supports rebuilding simultaneously.
The application proposes a method for triggering a RAID array rebuild. The RAID subsystem issues received IO read/write commands to each member disk in the RAID array and, based on the response times of the IO read/write commands returned by each member disk within a preset statistical period, computes each member disk's average response time. Among the non-hot-spare, non-faulty member disks of the RAID array, the RAID subsystem finds those whose average response time reaches the abnormal response time threshold, marks the top N of them with the largest average response times as faulty member disks, and notifies the RAID array to which these N member disks belong to rebuild.
Because the RAID subsystem, without affecting the RAID array's data flow, uses each member disk's average response time to mark as faulty the top N member disks (by average response time) among those reaching the abnormal response time threshold, and thereby triggers a rebuild of the RAID array to which they belong, a rebuild can be triggered based on the response time of member disks' IO read/write commands.
Brief description of the drawings
Fig. 1 is a flowchart of a method for triggering a RAID array rebuild according to an exemplary embodiment of the application;
Fig. 2 is a hardware structure diagram of the equipment in which a device for triggering a RAID array rebuild resides, according to an exemplary embodiment of the application;
Fig. 3 is a block diagram of a device for triggering a RAID array rebuild according to an exemplary embodiment of the application.
Detailed description
Exemplary embodiments are described in detail here, with examples illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the application; rather, they are merely examples of devices and methods consistent with some aspects of the application as detailed in the appended claims.
The terms used in the application are for describing particular embodiments only and are not intended to limit the application. The singular forms "a", "said", and "the" used in the application and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the application to describe various information, such information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the application, first information may also be called second information, and similarly second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
A RAID array is a technology that combines multiple independent disks (physical disks) in different ways into a disk group (logical disk), providing higher storage performance and data reliability than a single disk.
In the field of computer communications, RAID technology is commonly used to provide redundancy protection for data on disks. When data is written, it is split across multiple member disks according to the RAID algorithm. Depending on the RAID level, the array can tolerate one or more disks failing or going offline. When a disk IO error or an offline disk is detected, a dedicated hot-spare disk or a global hot-spare disk can be used for the rebuild, restoring the data redundancy of the RAID array.
In related methods for triggering a RAID array rebuild, when the RAID subsystem receives an IO read/write error returned by a member disk and determines that the error is unrecoverable, it marks the member disk as faulty and triggers a rebuild of the RAID array to which the member disk belongs. In addition, when the RAID subsystem receives a notification that a member disk has gone offline, it also triggers a rebuild of the RAID array to which the offline member disk belongs.
During a rebuild, a hot-spare disk can be used to replace the faulty or offline disk: the RAID subsystem calculates the data of the corresponding stripes on the hot-spare disk according to the RAID algorithm, restoring the redundancy of the RAID array to which the failed disk belongs.
Because a disk is a device that combines mechanical and electronic parts and is affected by factors such as component aging and the environment, in practice disk IO may return no errors while its response time becomes slow. When an upper-layer application reads from or writes to the RAID array containing such a disk, IO on the slow disk returns later than on the other disks, and the application's performance fluctuates or its IO times out. Concretely, when a LUN (Logical Unit Number) created on the RAID array is assigned to a front-end application server for continuous reads and writes, the LUN's performance may fluctuate heavily, or IO may even time out and interrupt service. Yet when developers investigated such large LUN performance fluctuations, they found that the RAID array state was normal, the disk states were normal, and no member disk had returned an IO read/write error. Further investigation showed that although the member disks of the RAID array had the same interface and the same rotational speed, the IO read/write responses returned by a few member disks took considerably longer than those of the other member disks in the array. After the member disks with long IO response times were pulled out and replaced with hot-spare disks, the performance of the RAID array and of the LUN returned to normal.
In summary, long IO response times on a member disk can severely degrade the performance of the RAID array to which it belongs and of the LUNs created on that array, causing performance fluctuations and, in extreme cases, IO timeouts that interrupt service. However, existing methods for triggering a RAID array rebuild consider only disk IO errors and disks going offline, and do not consider the case where a disk's response time slows down after aging and causes service interruptions.
The application proposes a method for triggering a RAID array rebuild. The RAID subsystem issues received IO read/write commands to each member disk in the RAID array and, based on the response times of the IO read/write commands returned by each member disk within a preset statistical period, computes each member disk's average response time. Among the non-hot-spare, non-faulty member disks of the RAID array, the RAID subsystem finds those whose average response time reaches the abnormal response time threshold, marks the top N of them with the largest average response times as faulty member disks, and notifies the RAID array to which these N member disks belong to rebuild.
Because the RAID subsystem, without affecting the RAID array's data flow, uses each member disk's average response time to mark as faulty the top N member disks (by average response time) among those reaching the abnormal response time threshold, and thereby triggers a rebuild of the RAID array to which they belong, a rebuild can be triggered based on the response time of member disks' IO read/write commands.
Referring to Fig. 1, Fig. 1 is a flowchart of a method for triggering a RAID array rebuild according to an exemplary embodiment of the application. The method is applied to the RAID subsystem of a storage device; the storage device further includes a disk subsystem and several RAID arrays, each of which includes several member disks. The method includes the following steps:
Step 101: issue received IO read/write commands to each member disk in the RAID array;
Step 102: based on the response times of each member disk's IO read/write commands within a preset statistical period, compute the average response time of each member disk;
Step 103: search the non-hot-spare, non-faulty member disks of the RAID array for member disks whose average response time reaches the abnormal response time threshold;
Step 104: among the member disks found, mark the top N member disks with the largest average response times as faulty member disks and notify the RAID array to rebuild, where N is not greater than the number of member disks the RAID array supports rebuilding simultaneously.
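Steps 102 to 104 can be sketched for a single statistical period as follows. This is a minimal Python illustration under stated assumptions, not the applicant's implementation: the function name `disks_to_mark_failed`, the per-disk lists of response times in milliseconds, and the fixed `threshold_ms` argument are all hypothetical, and Step 101 (splitting and issuing commands) is assumed to have already produced the measured response times.

```python
from typing import Dict, List, Set


def disks_to_mark_failed(
    response_times: Dict[str, List[float]],  # disk id -> response times (ms) in one period
    excluded: Set[str],                      # hot-spare disks and already-faulty member disks
    threshold_ms: float,                     # abnormal response time threshold (assumed fixed here)
    max_rebuild: int,                        # N: member disks the array can rebuild at once
) -> List[str]:
    # Step 102: average response time per disk; a disk with no completed commands averages 0.
    averages = {
        disk: (sum(times) / len(times) if times else 0.0)
        for disk, times in response_times.items()
    }
    # Step 103: eligible disks whose average reaches (>=) the threshold.
    abnormal = [
        disk for disk, avg in averages.items()
        if disk not in excluded and avg >= threshold_ms
    ]
    # Step 104: keep the N slowest; the caller would mark them faulty and notify the array.
    abnormal.sort(key=lambda disk: averages[disk], reverse=True)
    return abnormal[:max_rebuild]
```

For example, with averages of 5.5 ms, 45 ms, 32 ms and 0 ms on four disks, a 20 ms threshold and N = 1, only the 45 ms disk is returned; if that disk were already excluded as faulty, the 32 ms disk would be returned instead.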
The RAID subsystem manages each RAID array in the storage device. For example, its functions can include splitting received IO read/write commands according to the RAID algorithm and passing the split commands down to the disk subsystem. Its functions can also include, when a RAID array is rebuilt, calculating the data of the corresponding stripes on the hot-spare disk based on the array's algorithm to restore the RAID array's data redundancy. The RAID subsystem has many other functions as well; they are not specifically limited here.
The disk subsystem mainly manages all physical disks in the storage device. It acts as a "lower-layer" system that serves "upper-layer" systems such as the RAID subsystem. For example, it can scan the physical disks in the storage device and notify the RAID subsystem of the status of each scanned physical disk. In practice, the disk subsystem has other functions as well; they are not specifically limited here.
A RAID array is a disk group formed from multiple physical disks in the storage device combined according to a RAID algorithm. The physical disks in a RAID array are also called member disks. The number of member disks a RAID array supports rebuilding simultaneously varies with its level and implementation. For example, traditional RAID5 supports rebuilding one member disk at a time, while some vendors' implementations support rebuilding several simultaneously.
The response time of a member disk is the time from when the RAID subsystem issues an IO read/write command to the member disk until the member disk returns the response to that command.
The average response time of a member disk is, within the preset statistical period, the accumulated response time of the IO read/write commands the member disk has completed divided by the number of IO read/write commands it has completed.
The abnormal response time threshold is used to judge whether a member disk's response time is abnormal: when a member disk's average response time is greater than or equal to the threshold, the member disk is considered abnormal.
When setting the abnormal response time threshold, developers face a trade-off. If the threshold is too large, member disks with long IO response times may go undetected, so the performance of the RAID array and of the LUNs created on it cannot recover properly. If the threshold is too small, many member disks may be flagged as abnormal, including disks whose average response time is normal, causing frequent RAID array rebuilds that degrade storage device performance. In practice, developers can therefore set the threshold according to actual conditions. For example, the threshold can be set to the smallest average response time among the member disks of the RAID array multiplied by an abnormal response time weight, or developers can set a threshold value directly. These settings are only examples and are not limiting.
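Written as a formula, with $t_{i,k}$ denoting the response time of the $k$-th IO read/write command completed by member disk $i$ within the period and $c_i$ the number of commands it completed (notation introduced here for illustration only), the definition reads as follows; the zero case follows the rule the text gives for a disk that completed no commands.

```latex
\bar{t}_i =
\begin{cases}
  \dfrac{1}{c_i}\displaystyle\sum_{k=1}^{c_i} t_{i,k}, & c_i > 0,\\[6pt]
  0, & c_i = 0.
\end{cases}
```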
In the embodiments of the application, the RAID subsystem no longer detects abnormal member disks based only on IO read/write errors returned by member disks or on member disks going offline. It also uses each member disk's average response time: among the member disks whose average response time reaches the abnormal response time threshold, it marks the top N with the largest average response times as faulty member disks and triggers a rebuild of the RAID array to which they belong. A rebuild can thus be triggered based on the response time of member disks' IO read/write commands.
Taking one RAID array in a storage device as an example, the method for triggering a RAID array rebuild proposed by the application is described in detail below.
In implementation, within the preset statistical period, the RAID subsystem receives IO read/write commands, splits them according to the RAID algorithm of the RAID array, and issues the split commands to each member disk through the disk subsystem.
After completing the IO read/write commands allocated to it, a member disk returns the responses to those commands to the RAID subsystem.
On receiving a response returned by a member disk, the RAID subsystem calculates the response time of that IO read/write command, adds it to the member disk's accumulated response time, and increments the member disk's count of completed IO read/write commands.
In this way, the RAID subsystem tracks, for each member disk of the RAID array, the number of completed IO read/write commands and their accumulated response time.
Note that the response time of a member disk's IO read/write commands can be measured by setting a response time timer for each IO read/write command, or by other methods commonly used in the art; the method of calculating the response time is not specifically limited here.
At the end of the preset statistical period, the RAID subsystem calculates the average response time of each member disk of the RAID array.
In implementation, the RAID subsystem obtains, for each member disk of the RAID array, the accumulated response time of completed IO read/write commands and the number of completed commands, and divides the former by the latter to obtain each member disk's average response time. If the number of completed IO read/write commands of a member disk is zero, its average response time is treated as zero.
After calculating the average response time of each member disk of the RAID array, the RAID subsystem clears each member disk's accumulated response time and completed command count for this period, so that both parameters are counted afresh in the next statistical period.
In the embodiments of the application, after calculating the average response time of each member disk, the RAID subsystem searches the RAID array for member disks whose average response time reaches (is greater than or equal to) the abnormal response time threshold.
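The per-command bookkeeping described above (remembering when a command is issued, computing its response time when the response arrives, and accumulating both the total response time and the completed-command count per member disk) can be sketched as below. The class and method names, the integer command identifiers, and the caller-supplied timestamps in milliseconds are hypothetical; the text leaves the timing mechanism open.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class DiskIoStats:
    """Per-member-disk counters for the current statistical period (hypothetical layout)."""
    total_response_time: float = 0.0  # accumulated response time of completed commands, ms
    completed: int = 0                # number of completed IO read/write commands


class ResponseTimeCollector:
    def __init__(self) -> None:
        self.stats: Dict[str, DiskIoStats] = {}
        # command id -> (member disk id, issue timestamp in ms)
        self._issued_at: Dict[int, Tuple[str, float]] = {}

    def on_issue(self, cmd_id: int, disk_id: str, now_ms: float) -> None:
        # Remember when the split command was sent to the member disk.
        self._issued_at[cmd_id] = (disk_id, now_ms)

    def on_response(self, cmd_id: int, now_ms: float) -> None:
        # Response time = response arrival - issue time; add it to the disk's totals.
        disk_id, issued_ms = self._issued_at.pop(cmd_id)
        stats = self.stats.setdefault(disk_id, DiskIoStats())
        stats.total_response_time += now_ms - issued_ms
        stats.completed += 1
```

For instance, two commands issued to disk `d0` at 10 ms and 12 ms that complete at 14 ms and 20 ms leave `d0` with a total of 12 ms over two completed commands.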
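A minimal sketch of the end-of-period step, assuming the per-disk totals and counts are held in two plain dictionaries (a hypothetical layout): compute each average, treat a disk with no completed commands as averaging zero, then clear both counters so the next period starts from scratch.

```python
from typing import Dict


def close_period(
    totals: Dict[str, float],  # disk id -> accumulated response time (ms) in this period
    counts: Dict[str, int],    # disk id -> completed IO read/write commands in this period
) -> Dict[str, float]:
    """Return each disk's average response time and reset the counters in place."""
    disks = set(totals) | set(counts)
    averages: Dict[str, float] = {}
    for disk in disks:
        count = counts.get(disk, 0)
        # A disk that completed no commands gets an average of zero.
        averages[disk] = totals.get(disk, 0.0) / count if count else 0.0
    # Clear both parameters so the next statistical period is counted afresh.
    for disk in disks:
        totals[disk] = 0.0
        counts[disk] = 0
    return averages
```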
In implementation, the RAID subsystem finds, among the non-hot-spare, non-faulty member disks of the RAID array, the smallest non-zero average response time, and multiplies it by a preset abnormal response time weight to obtain the abnormal response time threshold.
The RAID subsystem then searches the non-hot-spare, non-faulty member disks of the RAID array for member disks whose average response time reaches (is greater than or equal to) this threshold.
The abnormal response time weight is used to judge whether a disk's average response time is abnormal. It is usually set by the user according to actual conditions and is typically expressed as a percentage, such as 200%; its value is not specifically limited here.
After finding, among the non-hot-spare, non-faulty member disks of the RAID array, the member disks whose average response time reaches (is greater than or equal to) the threshold, the RAID subsystem marks the top N of them with the largest average response times as faulty member disks, where N is an integer greater than 0 and not greater than (less than or equal to) the number of member disks the RAID array supports rebuilding simultaneously.
Note that because storage vendors provide different RAID levels and implementations in their storage devices, developers can configure the value of N. For example, in traditional RAID, when the RAID array is RAID5, RAID5 supports rebuilding one member disk at a time, so N can be 1 and the RAID subsystem marks the member disk with the largest average response time as faulty. When the RAID array is RAID6, RAID6 supports rebuilding two member disks simultaneously, so N can be 2 or 1, and the RAID subsystem marks the two member disks with the largest average response times, or only the single largest, as faulty. These values of N are only examples and are not limiting.
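The threshold rule can be sketched as follows, with the weight written as a factor (200% becomes 2.0). The function name and the choice to return `None` when no eligible disk has a non-zero average are assumptions; the text does not say what happens in that case.

```python
from typing import Dict, Optional, Set


def abnormal_threshold(
    averages: Dict[str, float],  # disk id -> average response time for the period
    excluded: Set[str],          # hot-spare disks and member disks already marked faulty
    weight: float = 2.0,         # abnormal response time weight, e.g. 200% -> 2.0
) -> Optional[float]:
    """Threshold = smallest non-zero average among eligible disks, times the weight."""
    candidates = [
        avg for disk, avg in averages.items()
        if disk not in excluded and avg > 0
    ]
    if not candidates:
        return None  # assumption: no eligible disk completed IO, so no threshold this period
    return min(candidates) * weight
```

Anchoring the threshold to the fastest healthy member makes it relative to the array's own workload, so a uniformly busy array does not flag every disk at once.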
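The constraint on N can be captured with a small lookup. This sketch only encodes the two traditional levels mentioned above; the table and function names are hypothetical, and vendor implementations that rebuild more disks simultaneously would need their own limits.

```python
from typing import Dict, Optional

# Simultaneous-rebuild limits for the traditional levels named in the text (hypothetical table).
MAX_SIMULTANEOUS_REBUILDS: Dict[str, int] = {"RAID5": 1, "RAID6": 2}


def choose_n(raid_level: str, requested: Optional[int] = None) -> int:
    """Return N, the number of slowest disks that may be marked faulty in one period."""
    limit = MAX_SIMULTANEOUS_REBUILDS[raid_level]
    n = limit if requested is None else requested
    # N must be a positive integer no greater than the supported simultaneous rebuilds.
    if not 1 <= n <= limit:
        raise ValueError(f"N={n} is outside 1..{limit} for {raid_level}")
    return n
```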
In addition, in order to improve the accuracy that RAID sub-system detects failure member's disk, it is to avoid by provisional exception response Member's disk label be failure member's disk, RAID sub-system can not mark immediately member's disk be failure member's magnetic Disk, but record judged result of the member's disk within several continuous cycles, if member's disk it is continuous several In cycle when being look for the average response time and reached average response in member's disk of exception response time threshold Between maximum top n member's disk, then be failure member's disk by member's disk label, and trigger belonging to member's disk RAID array is rebuild.
In a kind of optional implementation, for the accuracy and practicality of increase detection failure member's disk, above-mentioned company Continue several measurement periods, can be some measurement periods of " relatively continuous ".
In mark, above-mentioned RAID sub-system can record the above-mentioned average response time found respectively and reach exception response The durations number of the maximum top n member's disk of average response time in member's disk of time threshold;If at several After measurement period, the durations number of any member disk in the top n disk reaches default durations threshold value, then will Member's disk label is failure member's disk.
In record, RAID sub-system can be reached for the average response time that finds exception response time threshold into Each member's disk in member's disk in the maximum top n member's disk of average response time, terminates in next measurement period When, if member's disk is look for the average response time and reached in member's disk of exception response time threshold again The maximum top n member's disk of average response time, then increase the durations number of member's disk and record;If the member Disk is not look for the average response time and reached average response time in member's disk of exception response time threshold Maximum top n member's disk, then reduced the durations number of member's disk and record, if the lasting week of member's disk Issue is reduced to zero, and the durations of member's disk are not re-recorded;Wherein, the initial value of the durations number of member's disk is Zero.
For example, when a member disk in the RAID array is found among the top N disks (the disks with the largest average response time among those reaching the exception response time threshold) for the first time, its persistence count can be set to 1. At the end of each subsequent statistics period, the count is incremented by 1 if the disk is again among the top N and decremented by 1 if it is not. When the count falls to zero, the disk's persistence count is no longer recorded.
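The "relatively continuous" bookkeeping can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation. Disks are identified by hypothetical string IDs, and each period's top-N result is passed in as a set. The function names `update_relative` and `disks_to_fail` and the `persist_threshold` parameter were invented for this example.

```python
from typing import Dict, List, Set


def update_relative(counts: Dict[str, int], flagged: Set[str]) -> None:
    """Update persistence counts at the end of one statistics period
    ("relatively continuous" policy; hypothetical helper).

    counts:  table of tracked disks; a disk absent from it has count 0.
    flagged: disks in this period's top N above the threshold.
    """
    for disk in flagged:
        counts[disk] = counts.get(disk, 0) + 1  # found again (or first time): +1
    for disk in list(counts):  # iterate over a copy so entries can be deleted
        if disk not in flagged:
            counts[disk] -= 1  # missed this period: -1 rather than a full reset
            if counts[disk] == 0:
                del counts[disk]  # decayed to zero: stop recording this disk


def disks_to_fail(counts: Dict[str, int], persist_threshold: int) -> List[str]:
    """Disks whose persistence count has reached the preset threshold."""
    return sorted(d for d, c in counts.items() if c >= persist_threshold)


if __name__ == "__main__":
    counts: Dict[str, int] = {}
    # One set of flagged disks per statistics period (illustrative data).
    history = [{"d1"}, {"d1", "d2"}, set(), {"d1"}, {"d1"}]
    for period, flagged in enumerate(history, start=1):
        update_relative(counts, flagged)
        print(period, counts, disks_to_fail(counts, persist_threshold=3))
```

With a threshold of 3, `d1` survives the single missed period in period 3 because its count drops from 2 to 1 rather than to 0, and it is reported as failed in period 5. `d2` is flagged only once and decays out of the table.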
In another optional implementation, the RAID subsystem can instead mark failed member disks based on several "absolutely continuous" statistics periods, in which any miss restarts the count.
In this implementation as well, the RAID subsystem records a persistence count for each of the top N member disks with the largest average response time among the member disks found to reach the exception response time threshold. If, after several statistics periods, the persistence count of any of these top N disks reaches the preset persistence threshold, that member disk is marked as a failed member disk.
For recording, the RAID subsystem handles each member disk among those top N disks as follows at the end of the next statistics period. If the disk is again among the top N member disks with the largest average response time among those reaching the exception response time threshold, its persistence count is increased and recorded. If it is not, its persistence count is reset to zero. Every member disk's persistence count starts at zero.
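The "absolutely continuous" variant differs only in how a miss is handled. The sketch below is a minimal illustration, assuming string disk IDs and a per-period set of flagged disks. The names `update_absolute`, `disks_to_fail`, and `persist_threshold` are hypothetical.

```python
from typing import Dict, List, Set


def update_absolute(counts: Dict[str, int], flagged: Set[str]) -> None:
    """Update persistence counts at the end of one statistics period
    ("absolutely continuous" policy; hypothetical helper).
    A single miss resets a disk's count to zero."""
    for disk in list(counts):  # iterate over a copy so entries can be deleted
        if disk not in flagged:
            del counts[disk]  # missed this period: the streak restarts at zero
    for disk in flagged:
        counts[disk] = counts.get(disk, 0) + 1  # consecutive hit: +1


def disks_to_fail(counts: Dict[str, int], persist_threshold: int) -> List[str]:
    """Disks whose persistence count has reached the preset threshold."""
    return sorted(d for d, c in counts.items() if c >= persist_threshold)


if __name__ == "__main__":
    counts: Dict[str, int] = {}
    # One set of flagged disks per statistics period (illustrative data).
    history = [{"d1"}, {"d1"}, set(), {"d1"}, {"d1"}, {"d1"}]
    for period, flagged in enumerate(history, start=1):
        update_absolute(counts, flagged)
        print(period, counts, disks_to_fail(counts, persist_threshold=3))
```

Under this policy the empty period 3 wipes out `d1`'s streak. With a threshold of 3, `d1` is therefore reported only in period 6, after three uninterrupted hits. This is stricter about persistence than the relatively continuous policy, at the cost of slower detection when a failing disk is intermittently slow.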
In this embodiment, a member disk marked as failed in the current statistics period may already be rebuilding for another reason, such as a disk media error. The RAID array can therefore check whether the marked failed member disk is already being rebuilt. The RAID array system can also check whether the marked disk meets the rebuild requirements of the RAID array it belongs to. One example requirement is that the number of member disks being rebuilt does not exceed the number of member disks the RAID array supports rebuilding simultaneously. Another is that the hot spare disk to be used for the rebuild has already been prepared. In practice, developers can set the rebuild requirements according to actual conditions; the requirements above are illustrative only and are not limiting.
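The pre-rebuild checks described above can be sketched as a filter over the disks marked in the current period. This is a hedged illustration, not a vendor API. The `ArrayRebuildState` record, its fields, and the function name `disks_to_rebuild` are hypothetical. The sketch also assumes that each new rebuild needs both a free concurrent-rebuild slot and one prepared hot spare.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ArrayRebuildState:
    """Hypothetical snapshot of one RAID array's rebuild resources."""
    max_concurrent_rebuilds: int  # member disks the array can rebuild at once
    ready_spares: int  # hot spare disks already prepared for a rebuild
    rebuilding: Set[str] = field(default_factory=set)  # disks already rebuilding


def disks_to_rebuild(marked: List[str], state: ArrayRebuildState) -> List[str]:
    """Filter the disks marked failed this period down to those that may
    start rebuilding now, under the two example requirements."""
    free_slots = state.max_concurrent_rebuilds - len(state.rebuilding)
    budget = min(free_slots, state.ready_spares)
    accepted: List[str] = []
    for disk in marked:
        if disk in state.rebuilding or disk in accepted:
            continue  # already rebuilding for another reason (e.g. a media error)
        if len(accepted) >= budget:
            break  # no free rebuild slot or no prepared hot spare left
        accepted.append(disk)
    return accepted
```

For example, if the array supports two concurrent rebuilds and `d3` is already rebuilding because of a media error, then marking `["d3", "d1", "d2"]` yields only `["d1"]`.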
Rebuilds of other RAID arrays in the storage device are triggered in the same way as described above, so the details are not repeated here.
Note that when a failed member disk is rebuilt, each vendor can complete the rebuild using its own rebuild implementation. For example, the failed disk can be kicked out of the array either when the rebuild starts or after it completes. The implementation of the RAID array rebuild is not specifically limited here.
This application proposes a method for triggering a RAID array rebuild. The RAID subsystem issues received IO read/write commands to each member disk in the RAID array. It then computes the average response time of each member disk from the response times of the IO read/write commands that disk returns within a preset statistics period. Among the non-hot-spare, non-failed member disks of the RAID array, the RAID subsystem finds the member disks whose average response time reaches the exception response time threshold. It marks the top N of these, by largest average response time, as failed member disks and notifies the RAID array to which these N member disks belong to rebuild.
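The per-period lookup-and-mark step summarized above can be sketched as follows. This is a minimal illustration, not the applicant's code. The function name `top_n_slow_disks`, the dictionary input, the millisecond unit, and the tie-break by disk ID are all assumptions. "Reaches the threshold" is read here as greater than or equal to.

```python
from typing import Dict, List, Set


def top_n_slow_disks(avg_ms: Dict[str, float], excluded: Set[str],
                     threshold_ms: float, n: int) -> List[str]:
    """One statistics period of the lookup-and-mark step (hypothetical helper).

    avg_ms:       average response time per member disk (assumed unit: ms)
    excluded:     hot spare disks and disks already marked failed
    threshold_ms: exception response time threshold
    n:            should not exceed the array's concurrent-rebuild limit
    """
    over = [(t, d) for d, t in avg_ms.items()
            if d not in excluded and t >= threshold_ms]  # "reaches" read as >=
    over.sort(key=lambda pair: (-pair[0], pair[1]))  # slowest first, ties by ID
    return [d for _, d in over[:n]]
```

Capping the result at `n` keeps the number of simultaneous rebuilds within what the array supports. Excluding hot spares and already-failed disks keeps their idle or error-driven response times from being ranked.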
The RAID subsystem works from each member disk's average response time and does not affect the RAID array's data flow. It marks as failed the top N member disks with the largest average response time among those reaching the exception response time threshold, which triggers a rebuild of the RAID array those N member disks belong to. A rebuild can therefore be triggered based on the response times of a member disk's IO read/write commands.
Furthermore, the RAID subsystem triggers a rebuild of a member disk's RAID array only when it detects that disk reaching the exception response time threshold in each of several consecutive statistics periods. This effectively improves the accuracy of failed-member-disk detection and avoids marking a member disk as failed because of a transient abnormal response.
Corresponding to the foregoing embodiments of the method for triggering a RAID array rebuild, this application also provides embodiments of a device for triggering a RAID array rebuild.
The device embodiments can be applied to a storage device and can be implemented in software, in hardware, or in a combination of the two. In a software implementation, the device is a logical entity formed when the processor of the storage device reads the corresponding computer program instructions from non-volatile memory into memory and executes them. In terms of hardware, Fig. 2 is a hardware structure diagram of the storage device on which the device for triggering a RAID array rebuild resides. Besides the processor, memory, network interface, and non-volatile memory shown in Fig. 2, the storage device may include other hardware according to its actual functions; this is not described further.
Refer to Fig. 3, which is a block diagram of a device for triggering a RAID array rebuild according to an exemplary embodiment of this application. The device is applied to the RAID subsystem of a storage device; the storage device has at least one preconfigured RAID array, and the RAID array includes several member disks. The device includes:
Issuing unit 310, configured to issue received IO read/write commands to each member disk in the RAID array;
Statistics unit 320, configured to compute the average response time of each member disk in the RAID array based on the response times of that disk's IO read/write commands within a preset statistics period;
Lookup unit 330, configured to find, among the non-hot-spare, non-failed member disks of the RAID array, the member disks whose average response time reaches the exception response time threshold;
Marking unit 340, configured to mark as failed member disks the top N member disks with the largest average response time among the member disks found to reach the exception response time threshold, and to notify the RAID array to rebuild, where N is not greater than the number of member disks the RAID array supports rebuilding simultaneously.
In one optional implementation, the exception response time threshold is the product of two values: the smallest non-zero average response time among the non-hot-spare, non-failed member disks of the RAID array, and a preset exception response time weight;
the lookup unit 330 is specifically configured to find the smallest non-zero average response time among the non-hot-spare, non-failed member disks of the RAID array. It then finds, among those member disks, the ones whose average response time reaches the product of that minimum and the preset exception response time weight.
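The threshold rule above treats the fastest responsive healthy disk as a baseline. A disk is flagged only if it is at least `weight` times slower than that baseline. Zero averages are skipped because a disk that completed no IO in the period would otherwise drive the threshold to zero. The sketch below is illustrative only: the source does not give a weight value, and the helper names `exception_threshold` and `disks_reaching` are hypothetical.

```python
from typing import Dict, Optional, Set


def exception_threshold(avg_ms: Dict[str, float], excluded: Set[str],
                        weight: float) -> Optional[float]:
    """Smallest non-zero average response time among the non-hot-spare,
    non-failed member disks, multiplied by the preset weight."""
    nonzero = [t for d, t in avg_ms.items() if d not in excluded and t > 0]
    if not nonzero:
        return None  # no eligible disk completed any IO: no baseline this period
    return min(nonzero) * weight


def disks_reaching(avg_ms: Dict[str, float], excluded: Set[str],
                   weight: float) -> Set[str]:
    """Eligible member disks whose average reaches the threshold."""
    threshold = exception_threshold(avg_ms, excluded, weight)
    if threshold is None:
        return set()
    return {d for d, t in avg_ms.items() if d not in excluded and t >= threshold}
```

With averages of 0, 4, 5, and 30 ms on four members (a hot spare excluded) and an arbitrary weight of 3, the baseline is 4 ms and the threshold is 12 ms, so only the 30 ms disk is flagged.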
In another optional implementation, the statistics unit 320 is specifically configured to perform three steps. It accumulates the response times of each member disk's IO read/write commands within the preset statistics period. It counts the number of IO read/write commands each member disk completed within that period. Finally, it divides each member disk's cumulative response time by its completed command count to obtain that disk's average response time.
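The averaging described above (cumulative response time divided by completed command count) can be sketched as below. The input format, a stream of `(disk_id, response_time)` pairs for commands completed in one period, and the function name `average_response_times` are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def average_response_times(
        completions: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average response time per member disk over one statistics period.

    completions: (disk_id, response_time) for each IO read/write command
                 completed during the period (hypothetical input format).
    """
    total: Dict[str, float] = defaultdict(float)  # cumulative response time
    done: Dict[str, int] = defaultdict(int)  # completed command count
    for disk, rt in completions:
        total[disk] += rt
        done[disk] += 1
    # A disk with no completions is absent from the result; callers can treat it as 0.
    return {disk: total[disk] / done[disk] for disk in total}
```

Keeping a running sum and count per disk means the subsystem does not have to store individual response times, and both counters can be reset at the start of each statistics period.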
In another optional implementation, the marking unit 340 is specifically configured to record a persistence count for each of the top N member disks with the largest average response time among the member disks found to reach the exception response time threshold. If, after several statistics periods, the persistence count of any of these top N disks reaches the preset persistence threshold, the marking unit marks that member disk as a failed member disk.
In another optional implementation, the marking unit 340 is further configured to handle each member disk among those top N member disks as follows at the end of the next statistics period. If the disk is again among the top N member disks with the largest average response time among those reaching the exception response time threshold, the marking unit increases and records its persistence count; otherwise, it decreases and records that count. The initial value of each member disk's persistence count is zero.
For how the functions and effects of each unit in the above device are implemented, refer to the corresponding steps of the above method; details are not repeated here.
Because the device embodiments correspond essentially to the method embodiments, refer to the description of the method embodiments for related details. The device embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units. They may be located in one place or distributed across multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the objectives of this application's solution. A person of ordinary skill in the art can understand and implement this without creative effort.
The foregoing are merely preferred embodiments of this application and are not intended to limit it. Any modification, equivalent substitution, improvement, or the like made within the spirit and principles of this application shall fall within its scope of protection.

Claims (10)

1. A method for triggering a RAID array rebuild, characterized in that the method is applied to a RAID subsystem of a storage device; the storage device has at least one preconfigured RAID array, and the RAID array includes several member disks; the method includes:
issuing received IO read/write commands to each member disk in the RAID array;
computing the average response time of each member disk based on the response times of the IO read/write commands of each member disk in the RAID array within a preset statistics period;
finding, among the non-hot-spare, non-failed member disks of the RAID array, the member disks whose average response time reaches an exception response time threshold;
marking as failed member disks the top N member disks with the largest average response time among the member disks found to reach the exception response time threshold, and notifying the RAID array to rebuild, wherein N is not greater than the number of member disks the RAID array supports rebuilding simultaneously.
2. The method according to claim 1, characterized in that the exception response time threshold is the product of the smallest non-zero average response time among the non-hot-spare, non-failed member disks of the RAID array and a preset exception response time weight;
finding, among the non-hot-spare, non-failed member disks of the RAID array, the member disks whose average response time reaches the exception response time threshold includes:
finding the smallest non-zero average response time among the non-hot-spare, non-failed member disks of the RAID array; and finding, among the non-hot-spare, non-failed member disks of the RAID array, the member disks whose average response time reaches the product of that smallest non-zero average response time and the preset exception response time weight.
3. The method according to claim 1, characterized in that computing the average response time of each member disk based on the response times of the IO read/write commands of each member disk in the RAID array within the preset statistics period includes:
accumulating the response times of each member disk's IO read/write commands within the preset statistics period;
counting the number of IO read/write commands each member disk completed within the preset statistics period;
dividing each member disk's cumulative response time by its counted number of completed IO read/write commands to obtain the average response time of each member disk.
4. The method according to claim 1, characterized in that marking as failed member disks the top N member disks with the largest average response time among the member disks found to reach the exception response time threshold includes:
separately recording a persistence count for each of the top N member disks with the largest average response time among the member disks found to reach the exception response time threshold;
if, after several statistics periods, the persistence count of any member disk among the top N disks reaches a preset persistence threshold, marking that member disk as a failed member disk.
5. The method according to claim 4, characterized in that separately recording the persistence count for each of the top N member disks with the largest average response time among the member disks found to reach the exception response time threshold includes:
for each member disk among those top N member disks, at the end of the next statistics period: if the member disk is again among the top N member disks with the largest average response time among the member disks whose average response time reaches the exception response time threshold, increasing and recording its persistence count; if it is not, decreasing and recording its persistence count; wherein the initial value of each member disk's persistence count is zero.
6. A device for triggering a RAID array rebuild, characterized in that the device is applied to a RAID subsystem of a storage device; the storage device has at least one preconfigured RAID array, and the RAID array includes several member disks; the device includes:
an issuing unit, configured to issue received IO read/write commands to each member disk in the RAID array;
a statistics unit, configured to compute the average response time of each member disk based on the response times of the IO read/write commands of each member disk in the RAID array within a preset statistics period;
a lookup unit, configured to find, among the non-hot-spare, non-failed member disks of the RAID array, the member disks whose average response time reaches an exception response time threshold;
a marking unit, configured to mark as failed member disks the top N member disks with the largest average response time among the member disks found to reach the exception response time threshold, and to notify the RAID array to rebuild, wherein N is not greater than the number of member disks the RAID array supports rebuilding simultaneously.
7. The device according to claim 6, characterized in that the exception response time threshold is the product of the smallest non-zero average response time among the non-hot-spare, non-failed member disks of the RAID array and a preset exception response time weight;
the lookup unit is specifically configured to find the smallest non-zero average response time among the non-hot-spare, non-failed member disks of the RAID array, and to find, among the non-hot-spare, non-failed member disks of the RAID array, the member disks whose average response time reaches the product of that smallest non-zero average response time and the preset exception response time weight.
8. The device according to claim 6, characterized in that the statistics unit is specifically configured to accumulate the response times of each member disk's IO read/write commands within the preset statistics period; to count the number of IO read/write commands each member disk completed within the preset statistics period; and to divide each member disk's cumulative response time by its counted number of completed IO read/write commands to obtain the average response time of each member disk.
9. The device according to claim 6, characterized in that the marking unit is specifically configured to separately record a persistence count for each of the top N member disks with the largest average response time among the member disks found to reach the exception response time threshold; and, if after several statistics periods the persistence count of any member disk among the top N disks reaches a preset persistence threshold, to mark that member disk as a failed member disk.
10. The device according to claim 9, characterized in that the marking unit is further configured, for each member disk among the top N member disks with the largest average response time among the member disks found to reach the exception response time threshold, at the end of the next statistics period: to increase and record the member disk's persistence count if it is again among the top N member disks with the largest average response time among the member disks whose average response time reaches the exception response time threshold; and otherwise to decrease and record its persistence count; wherein the initial value of each member disk's persistence count is zero.
CN201710125115.7A 2017-03-03 2017-03-03 Method and device for triggering RAID array rebuild Pending CN106990918A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710125115.7A CN106990918A (en) 2017-03-03 2017-03-03 Method and device for triggering RAID array rebuild

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710125115.7A CN106990918A (en) 2017-03-03 2017-03-03 Method and device for triggering RAID array rebuild

Publications (1)

Publication Number Publication Date
CN106990918A (en) 2017-07-28

Family

ID=59413096

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710125115.7A Pending CN106990918A (en) 2017-03-03 2017-03-03 Method and device for triggering RAID array rebuild

Country Status (1)

Country Link
CN (1) CN106990918A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5657468A (en) * 1995-08-17 1997-08-12 Ambex Technologies, Inc. Method and apparatus for improving performance in a reduntant array of independent disks
CN101329641A (en) * 2008-06-11 2008-12-24 华中科技大学 Method for rebuilding data of magnetic disk array
CN102147708A (en) * 2010-02-10 2011-08-10 成都市华为赛门铁克科技有限公司 Method and device for detecting discs
CN102981778A (en) * 2012-11-15 2013-03-20 浙江宇视科技有限公司 Redundant array of independent disks (RAID) array reconstruction method and device thereof
CN105353991A (en) * 2015-12-04 2016-02-24 浪潮(北京)电子信息产业有限公司 Disk array reconstruction optimization method and device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107678694A (en) * 2017-10-17 2018-02-09 深圳大普微电子科技有限公司 RAID stripe reconstruction method and solid-state disk
CN107678694B (en) * 2017-10-17 2019-02-05 深圳大普微电子科技有限公司 RAID stripe reconstruction method and solid-state disk
CN108334280A (en) * 2017-12-28 2018-07-27 创新科存储技术(深圳)有限公司 RAID5 disk group fast reconstruction method and device
CN108334280B (en) * 2017-12-28 2021-01-08 深圳创新科技术有限公司 RAID5 disk group fast reconstruction method and device
WO2022057374A1 (en) * 2020-09-18 2022-03-24 苏州浪潮智能科技有限公司 Method and apparatus for improving raid data backup efficiency
CN116700633A (en) * 2023-08-08 2023-09-05 成都领目科技有限公司 IO delay monitoring method, device and medium for RAID array hard disk
CN116700633B (en) * 2023-08-08 2023-11-03 成都领目科技有限公司 IO delay monitoring method, device and medium for RAID array hard disk

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20170728