CN1306381C

CN1306381C - Read-write method for disc array data and parallel read-write method

Info

Publication number: CN1306381C
Application number: CNB2004100585825A
Authority: CN
Inventors: 张巍; 唐小松; 黄玉环; 张国彬; 张粤; 任雷鸣; 陈绍元
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2004-08-18
Filing date: 2004-08-18
Publication date: 2007-03-21
Anticipated expiration: 2024-08-18
Also published as: CN1737745A

Abstract

The present invention relates to a data reading method for disk arrays, a data writing method for disk arrays, a data reading method that operates during a data write, and a data writing method that operates during a normal data read. In the invention, both reads and writes access the disks in units of stripe units, and the calculations required by the read/write rules are likewise performed in units of stripe units. In addition, the data reading method that operates during a data write allows a normal read or a degraded read to run concurrently with a large write, small write, reconstruct write, or degraded write, and the data writing method that operates during a normal data read allows a large write, small write, reconstruct write, or degraded write to run concurrently with a normal read. The invention increases read/write speed and achieves read/write concurrency as far as possible.

Description

A read/write method for disk array data and a parallel read/write method

Technical field

The present invention relates to data read/write technology for disk arrays (RAID), and in particular to a read/write method for disk array data and a parallel read/write method.

Background art

Disk array (RAID, Redundant Array of Independent Disks) technology is a mature technology used in data backup storage. RAID5 (RAID Level 5) is one standard disk array level. Fig. 1 shows a RAID5 made up of 4 disks. The 4 disks are logically treated as one disk and divided into stripes (Stripe), stripes 0-3 in this example. For each stripe, 3 of the disks store data and the remaining disk stores the parity data of that stripe. The portion of a stripe that resides on one disk is called a stripe unit (SU, Stripe Unit); a stripe unit may also be called a block or segment (Block/Segment). The size of a stripe unit depends on the system: some systems treat 1 KB as the most efficient, others use 4 KB, 6 KB, or even 4 MB or 8 MB. Because a disk is read and written in units of sectors (Sector, 512 bytes), the stripe unit size should be a multiple of 512 bytes. In the RAID5 of Fig. 1, stripe 0 comprises SU0-SU2 and the parity unit P, and each stripe unit holds the data of 10 disk sectors. For convenience of description, the data sectors of stripe 0, and the data they store, are numbered 0-29 in this example; the sectors that hold parity data, and the parity data they store, are numbered P0-P9 and are together called the parity data unit.

Again taking Fig. 1 as an example, the process of writing data to sectors 9 and 10 when disk 0 has failed is given below. Because disk 0 has failed, the write request can be regarded as split into the following two different "sub-write" requests:

1) The sub-write for new data 9 comprises: reading sector 19 and sector 29 and storing them in the cache; because the disk to be written has failed, generating the new parity data P9 according to the degraded write (Degraded Write) algorithm, P9 = old data 19 XOR old data 29 XOR new data 9, where XOR denotes the exclusive-OR operation, and storing it in the cache; and writing the new parity data P9 to sector P9;

2) The sub-write for new data 10 comprises: reading sector 10 and sector P0 and storing them in the cache; generating the new parity data P0 according to the small write (Small Write) algorithm, also called the read-modify-write (Read-Modify-Write) algorithm, new P0 = old data 10 XOR new data 10 XOR old P0, and storing it in the cache; and writing new data 10 to sector 10 and the new parity data P0 to sector P0.
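As an illustration only, the two parity updates above can be checked in Python. This is a sketch under stated assumptions: sectors are modelled as 4-byte values rather than 512 bytes, and the helper `xor_bytes` and all sample contents are hypothetical.

```python
def xor_bytes(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, value in enumerate(block):
            out[i] ^= value
    return bytes(out)

# Hypothetical sector contents (4 bytes each instead of 512).
old_9, old_19, old_29 = b"\x09" * 4, b"\x19" * 4, b"\x29" * 4
old_0, old_10, old_20 = b"\x00" * 4, b"\x10" * 4, b"\x20" * 4
old_p0 = xor_bytes(old_0, old_10, old_20)   # parity of row 0 before the write

new_9, new_10 = b"\xa9" * 4, b"\xaa" * 4

# Sub-write 1, degraded write: sector 9 lives on the failed disk, so the
# new parity is built from the surviving data and the new data.
new_p9 = xor_bytes(old_19, old_29, new_9)

# Sub-write 2, small write (read-modify-write).
new_p0 = xor_bytes(old_10, new_10, old_p0)

# Both parities now protect the new contents of their rows.
assert xor_bytes(new_p9, old_19, old_29) == new_9
assert new_p0 == xor_bytes(old_0, new_10, old_20)
```

The first assertion shows that a later degraded read of sector 9 would return the new data even though it was never written to the failed disk.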

Analysis of the above process shows that, because reads and writes are carried out in units of sectors, the procedure is rather complicated: this example alone involves four disk reads, two calculations, and three disk writes. Moreover, because access is sector by sector, non-contiguous sectors cannot be fetched together. Even a normal read (Fault-Free Read) within a single SU, for example of sectors 11 and 13, must be split into two read actions because the sectors are not contiguous. The disk is therefore accessed frequently, which reduces read/write speed.

The degraded write and small write algorithms mentioned above are RAID5 write techniques. The write techniques also include the large write (Large Write) and the reconstruct write (Reconstruct Write); the small write, large write, and reconstruct write are normal writes (Fault-Free Write). The normal read (Fault-Free Read) mentioned above and the degraded read (Degraded Read) mentioned below are RAID5 read techniques. These read/write algorithms are illustrated in the read/write rule diagram of Fig. 2, in which D denotes data and P denotes parity data. RAID read/write techniques are also covered in the RAID literature, for example "RAIDFrameBook" published by Carnegie Mellon University (CMU) in August 1996, or "RAIDFrame: A Rapid Prototyping Tool for RAID System" published by the same university in August 1996, both of which describe the RAID structure and the read/write techniques involved in detail.

It is also easy to see that a read/write conflict may arise when a read command and a write command target the same sector. A conflict may arise, however, even when they do not target the same sector. For example, when a RAID5 is in the degraded (Degraded) state, that is, when one disk in the RAID group has failed, data on the failed disk is read by reading the data and parity data of the other disks and computing the required data with the degraded read algorithm. A concurrent write command can therefore conflict with the read even if it does not target a sector of the failed disk, making the data read inconsistent. This situation is analyzed in detail below.

Again taking the RAID5 of Fig. 1 as an example, suppose that disk 0 has failed, that there is currently a command to write sectors 11-18 of stripe 0, and that at the same time there is a command to read the data of sectors 0-10 of stripe 0.

Writing data to sectors 11-18 comprises the following steps:

Step 1: According to the RAID5 small write (Small Write) algorithm, first read the old data 11-18_old of sectors 11-18 on SU1, read the corresponding old parity data P1-P8_old from disk 3, and store the data read in the cache (Cache).

Step 2: From the old data read from sectors 11-18 of disk 1, the corresponding old parity data, and the new data in the Cache that is to be written to sectors 11-18, generate the new parity data (P1-P8_new) according to the small write algorithm: (P1-P8_new) = (11-18_old) XOR (11-18_new) XOR (P1-P8_old). Because steps 1 and 2 perform no "write" action on the disks, they are together called the pre-write-back or pre-read (Preread) step of the write action.

Step 3: Write the new data 11-18_new and the new parity data P1-P8_new back to the corresponding disk sectors. The process ends.

Reading the data of sectors 0-10 comprises the following steps:

Step 1: Because disk 0 has failed, the degraded read algorithm is used: read the data of sectors 10-19 on disk 1, the data of sectors 20-29 on disk 2, and the parity data P0-P9, and store the data read in the Cache.

Step 2: Apply the degraded read algorithm to the data read in step 1: (0-9) = (10-19) XOR (20-29) XOR (P0-P9). This yields the data of sectors 0-9 of disk 0, which is kept in the Cache.

Step 3: Provide the computed data 0-9 together with data 10 buffered in the Cache to the user as the requested data 0-10. The process ends.

Analyzing the read and write steps above: the read must fetch sectors 10-19 of SU1, sectors 20-29 of SU2, and P0-P9 of P. Because these are contiguous sectors on different disks, the read is in effect split into 3 sub-read commands. The write must write sectors 11-18 of SU1 and P1-P8 of P, which can likewise be regarded as two sub-write commands. When the read and the write occur at the same time, the order in which the sub-commands reach each disk cannot be guaranteed, so errors may occur. For example, if the read has completed the sub-read command for 10-19 but has not yet executed the sub-read command for P0-P9, and the write completes its sub-write command for P1-P8 at that moment, the read will fetch the updated parity data. The degraded read then combines old data read from disk 1 with updated parity data read from disk 3, so the XOR produces wrong values for data 0-9.
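The interleaving described above can be reproduced with a small Python sketch. It is an assumption-laden model, not an implementation: each sector holds one small integer, the sample values are invented, and the helper `xor3` is hypothetical.

```python
def xor3(a, b, c):
    """Sector-wise XOR of three equally long units."""
    return [x ^ y ^ z for x, y, z in zip(a, b, c)]

su0 = list(range(10))            # sectors 0-9, lost with disk 0
su1_old = list(range(10, 20))    # sectors 10-19 on disk 1
su2 = list(range(20, 30))        # sectors 20-29 on disk 2
p_old = xor3(su0, su1_old, su2)  # P0-P9 on disk 3

# The small write replaces sectors 11-18 and therefore updates P1-P8.
su1_new = su1_old[:1] + [v + 100 for v in su1_old[1:9]] + su1_old[9:]
p_new = xor3(su0, su1_new, su2)

# Interleaving: the sub-read of 10-19 completes before the write-back,
# the sub-read of P0-P9 completes after it.
seen_su1, seen_p = su1_old, p_new
rebuilt = xor3(seen_su1, su2, seen_p)

wrong = [i for i in range(10) if rebuilt[i] != su0[i]]
print(wrong)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Sectors 1-8 of the rebuilt unit are corrupted, exactly the rows whose parity was updated between the two sub-reads, while sectors 0 and 9 are still correct.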

Therefore, to prevent read and write operations from running in parallel, parallel reads and writes are currently locked, and the unit of locking is the stripe. "Parallel" here means that another operation appears between the start and the completion of one operation, for example a read appears before a write has completed, or a write appears before a read has completed. Read/write locking means that a read takes a lock that forbids parallel writes to the stripe, and a write takes a lock that forbids parallel reads of the stripe.

Locking in units of stripes avoids not only read/write conflicts on the same sector but also the conflicts described above that do not involve the same sector. Moreover, with stripe-level locking, whether the write command is a large write, small write, or degraded write, and whether the read command is a normal read or a degraded read, no write command and read command can be in progress at the same time. Read/write conflicts are avoided and the correctness of the data read and written is guaranteed.

Although the current locking method guarantees the correctness of the data read and written, it locks in every case, so reads and writes of a stripe cannot proceed in parallel, which hurts read/write performance. After a write/read command has been issued, a user must wait until the current operation completes and releases the lock before reading/writing data in that stripe. Locking therefore has a considerable impact on performance when reads and writes occur together, and reduces read/write speed.

Summary of the invention

In view of this, one main object of the present invention is to provide a method for reading disk array data that reduces the number of read instructions a read command issues to the disks and so improves the data read rate.

Another main object of the present invention is to provide a method for writing disk array data that reduces the number of pre-read and write-back instructions a write command issues to the disks and so improves the data write rate.

A further object of the present invention is to provide a method for reading disk array data during a data write that reduces the cases in which a write locks out reads, achieves parallel reads and writes as far as possible, and improves data read/write speed.

A further object of the present invention is to provide a method for writing disk array data during a normal data read that reduces the cases in which a normal read locks out writes, achieves parallel reads and writes as far as possible, and improves data read/write speed.

The invention provides a data reading method for a disk array, comprising the following steps:

A. According to the current read rule, determine the sectors that hold the data to be read;

B. Determine the stripe units or parity data units to which those sectors belong, read each determined stripe unit and parity data unit from disk, and store them in the cache;

C. Judge whether the current read is a normal read; if so, execute step D; otherwise, according to the current read rule, use the cached stripe units and parity data unit to compute the stripe unit to which the requested data belongs, and store it in the cache;

D. Read the required data out of the stripe units in the cache and provide it to the user.

The present invention also provides a data writing method for a disk array, comprising the following steps:

A. According to the current write rule, determine the sectors that hold the data to be pre-read;

B. Determine the stripe units or parity data units to which the sectors to be pre-read belong and, in units of stripe units, read each determined stripe unit and parity data unit from disk and store them in the cache;

C. Process the data to be written together with the corresponding stripe units in the cache to generate stripe units containing the data to be written, and store them in the cache as the stripe units to be written back;

D. According to the current write rule, use the cached stripe units that were read out, the generated stripe units to be written back, and the parity data unit to compute the parity data unit to be written back, and store it in the cache;

E. In units of stripe units, write the stripe units to be written back and the parity data unit to be written back to the corresponding disks.

In this write method, determining in step A, according to the current write rule, the sectors that hold the data to be pre-read comprises:

when the write is judged to be a small write, the determined sectors comprise the sectors corresponding to the data to be written and the sectors holding the corresponding parity data;

when the write is judged to be a large write, the determined sectors comprise the sectors corresponding to the data to be written;

when the write is judged to be a reconstruct write, the determined sectors comprise the sectors that the reconstruct write rule requires to be pre-read and the sectors corresponding to the data to be written;

when the write is judged to be a degraded write, the determined sectors comprise the sectors that the degraded write rule requires to be pre-read and the sectors holding the parity data.

The invention correspondingly also provides a data reading method that operates during a data write, wherein the data write comprises a pre-read, in which the stripe units holding the data to be written are read into the cache and stripe units containing the data to be written are generated by computation, and a write-back, in which the generated stripe units and parity data unit are written to disk. The data reading method comprises the following steps:

A. According to the current read rule, judge whether the cache stores all the stripe units to which the data to be read belongs; if so, execute step C; otherwise, execute step B;

B. Determine the stripe units to which the data to be read belongs that are not stored in the cache, read the determined stripe units from disk, and store them in the cache;

C. Take the required data out of the stripe units in the cache and provide it to the user.

The write is a large write, small write, reconstruct write, or degraded write.

In this method of reading during a data write, the stripe unit of step C is either a stripe unit pre-read by the write process or a stripe unit to be written back that was computed after the pre-read.

The invention correspondingly also provides a data writing method that operates during a normal data read, wherein the normal data read comprises reading the stripe units holding the required data into the cache, and taking or computing the required data from the cached stripe units and providing it to the user. The data writing method comprises the following steps:

A. Judge whether the cache stores all the stripe units and/or parity data units, determined according to the current write rule, to which the data to be pre-read belongs; if so, execute step C; otherwise, execute step B;

B. Determine the stripe units and/or parity data units that are not stored in the cache, read the determined stripe units and/or parity data units from disk, and store them in the cache;

C. Process the data to be written together with the corresponding stripe units in the cache to generate stripe units containing the data to be written, and store them in the cache as the stripe units to be written back;

then, according to the write rule in use, use the cached stripe units and parity data unit that were read out and the generated stripe units to be written back to compute the parity data unit to be written back;

D. Write the stripe units to be written back and the parity data unit to be written back to the corresponding disks.

In this method of writing during a data read, the stripe units and/or parity data units determined in step A, according to the current write rule, for the data to be pre-read comprise:

when the write is judged to be a small write: the stripe unit to which the data to be written belongs and the parity data unit;

when the write is judged to be a large write: the stripe unit to which the data to be written belongs;

when the write is judged to be a reconstruct write: the stripe units to which the data that the reconstruct write rule requires to be pre-read belongs and the stripe unit to which the data to be written belongs;

when the write is judged to be a degraded write: the stripe units to which the data that the degraded write rule requires to be pre-read belongs and the parity data unit.
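Steps A and B of this method amount to a lookup of the units a write rule needs, followed by a filter against the cache. The Python sketch below illustrates this under assumptions: the table uses the unit names D0, D1, D2, and P that the description gives for the Fig. 2 layout, and the names `PREREAD_UNITS` and `units_to_fetch` are hypothetical.

```python
# Units pre-read per write rule for the Fig. 2 layout
# (data units D0-D2, parity unit P), as listed in the description.
PREREAD_UNITS = {
    "small": ("D0", "P"),
    "large": ("D0", "D1", "D2"),
    "reconstruct": ("D0", "D1", "D2"),
    "degraded": ("D1", "D2", "P"),
}

def units_to_fetch(write_rule, cached):
    """Step A/B: return the pre-read units that are not yet in the cache."""
    return [unit for unit in PREREAD_UNITS[write_rule] if unit not in cached]

# A normal read has already cached D0, so a small write only fetches P.
print(units_to_fetch("small", {"D0"}))  # ['P']
```

An empty result corresponds to the branch of step A that goes straight to step C.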

In this method of writing during a data read, step D further comprises: when a stripe unit to be written back is written to the corresponding disk and that stripe unit is exactly the one the normal read operation is currently reading, the copy of the stripe unit that was cached before the write-back was issued is used as the stripe unit currently read by the normal read operation. That cached stripe unit is either the stripe unit pre-read by the write process or the stripe unit to be written back that was computed after the pre-read.

As the methods above show, the invention reads and writes whole stripe units rather than individual sectors, so reads or writes that target several non-contiguous sectors of one stripe unit can be completed in a single access, reducing the number of disk accesses. In addition, calculation is also performed in units of stripe units, so the operations on non-contiguous sectors of one stripe unit reduce to a single calculation, which lowers the number of operations and improves read/write speed.
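The saving in disk accesses can be illustrated by counting requests. The Python sketch below is a simplified model under assumptions: one request is counted per run of contiguous sectors for sector-granular access and one per stripe unit touched for unit-granular access, with 10 sectors per unit as in the Fig. 1 example; the function names are hypothetical.

```python
SECTORS_PER_UNIT = 10  # as in the Fig. 1 example

def sector_requests(sectors):
    """Sector-granular access: one request per run of contiguous sectors."""
    runs = 0
    previous = None
    for sector in sorted(set(sectors)):
        if previous is None or sector != previous + 1:
            runs += 1
        previous = sector
    return runs

def unit_requests(sectors):
    """Unit-granular access: one request per stripe unit touched."""
    return len({sector // SECTORS_PER_UNIT for sector in sectors})

# Reading sectors 11 and 13 of SU1.
print(sector_requests([11, 13]), unit_requests([11, 13]))  # 2 1
```

The model ignores the extra data transferred when a whole unit is read, which is the price paid for the lower request count.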

Furthermore, whereas the prior RAID art locks in every case when reads and writes occur together, with the data reading method during a data write and the data writing method during a normal data read provided by the invention, a stripe lock that forbids the write-back is needed only when a write-back occurs during a degraded read; in all other cases reads and writes can proceed in parallel without locking. That is, with the invention, when data is read and written in parallel, no lock is needed when a normal read or degraded read occurs during a data write, or when a data write occurs during a normal data read. Writes here include large writes, small writes, degraded writes, and reconstruct writes.

Because locking is avoided as far as possible, a read/write request is affected as little as possible by concurrent write/read commands, which speeds up data access when reads and writes occur together and improves read/write performance.

Brief description of the drawings

Fig. 1 is a schematic diagram of a RAID5.

Fig. 2 is a schematic diagram of the read/write rule algorithms in RAID5.

Fig. 3 is the data read flowchart of the present invention.

Fig. 4 is the data write flowchart of the present invention.

Fig. 5 is the flowchart of reading data during a data write according to the present invention.

Fig. 6 is the flowchart of writing data during a data read according to the present invention.

Detailed description of the embodiments

The invention performs read/write operations in units of stripe units rather than sectors; that is, when a read or write targets a given sector, the whole stripe unit containing that sector is read or written. A read/write method based on stripe units simplifies the read/write process and improves read/write speed. The invention is described in more detail below with reference to specific embodiments and the drawings.

The data reading method provided by the invention is applicable to the normal read and degraded read of RAID5. Referring to the flowchart of Fig. 3, it comprises the following steps:

Step 301: Determine the sectors to be read according to the current read rule, that is, according to whether the read is a normal read or a degraded read. In Fig. 2, for a normal read the determined sectors lie in D0; for a degraded read the determined sectors lie in D1, D2, and P. This step is the same as in the prior art.

Step 302: Determine the stripe units to which the data to be read belongs and, in units of stripe units, read each determined stripe unit and/or parity data unit and store it in the cache. In Fig. 2, for a normal read only the stripe unit D0 to which the data belongs is read into the cache; for a degraded read, stripe units D1 and D2 and parity data unit P are read into the cache.

Step 303: Compute the stripe unit that holds the data to be read, according to the read rule. For a normal read this step is in effect empty and is skipped, and step 304 is executed directly. For a degraded read, the stripe units and parity data unit that were read out are used, according to the degraded read rule, to compute the stripe unit that holds the data to be read, which is stored in the cache. The calculation rule is the same as shown in Fig. 2, except that here the calculation is performed in units of stripe units.

Step 304: Read the required sector data out of the stripe units in the cache and provide it to the user. The stripe units in the cache referred to here include those cached in steps 302 and 303.
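Steps 301 to 304 can be sketched in Python as follows. This is a minimal model under assumptions rather than the claimed implementation: it covers only stripe 0 of Fig. 1, assumes the parity unit sits at disk index 3, stores one integer per sector, and the function name `read_sectors` is hypothetical.

```python
SECTORS_PER_UNIT = 10
DISKS = 4  # three data units plus the parity unit at index 3 (assumed)

def read_sectors(disks, failed, wanted):
    """disks: list of 4 units (lists of ints); failed: index of the failed
    disk or None; wanted: sector numbers in the range 0-29."""
    cache = {}
    # Steps 301-302: find the units holding the wanted sectors, read them whole.
    needed = {sector // SECTORS_PER_UNIT for sector in wanted}
    for unit in needed:
        if unit != failed:
            cache[unit] = list(disks[unit])
    # Step 303: a degraded read rebuilds the missing unit from all the others.
    if failed in needed:
        rebuilt = [0] * SECTORS_PER_UNIT
        for unit in range(DISKS):
            if unit == failed:
                continue
            if unit not in cache:
                cache[unit] = list(disks[unit])
            rebuilt = [r ^ x for r, x in zip(rebuilt, cache[unit])]
        cache[failed] = rebuilt
    # Step 304: pick the wanted sectors out of the cached units.
    return [cache[s // SECTORS_PER_UNIT][s % SECTORS_PER_UNIT] for s in wanted]

d0, d1, d2 = list(range(10)), list(range(10, 20)), list(range(20, 30))
parity = [a ^ b ^ c for a, b, c in zip(d0, d1, d2)]
# Disk 0 has failed; sectors 0-10 are still readable.
assert read_sectors([None, d1, d2, parity], 0, list(range(11))) == list(range(11))
```

For a normal read the second block is skipped entirely, which matches the remark that step 303 is empty in that case.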

The RAID data writing method provided by the invention is applicable to the large write, small write, degraded write, and reconstruct write. Referring to the flowchart of Fig. 4, it comprises the following steps:

Step 401: Determine the sectors to be pre-read (Preread) according to the current write rule, that is, according to whether the write is a large write, small write, degraded write, or reconstruct write. For a small write, the determined sectors are the sectors holding the data to be written back and the sectors of the corresponding parity data; this is the same as in the prior art. For a large write and a reconstruct write, the sectors determined by the invention include not only the sectors that must be pre-read but also the sectors holding the data to be written back. For a degraded write, the sectors determined by the invention include the sectors holding the data to be written back and the sectors holding the parity data.

In Fig. 2, for a small write the determined sectors lie in D0 and P; for a large write, in D0, D1, and D2; for a reconstruct write, in D0, D1, and D2; for a degraded write, in D1, D2, and P.

Step 402: Determine the stripe units or parity data units to which the sectors to be pre-read belong and, in units of stripe units, read each determined stripe unit and parity data unit and store it in the cache.

Step 403: Process the data to be written together with the stripe units in the cache to generate stripe units containing the data to be written, and store them in the cache as the stripe units to be written back. Because this processing can be seen as filling the data to be written out to a full stripe unit, the invention calls it the fill process. It is precisely because of this fill process that, in steps 401-402 of the invention and as shown in Fig. 2, the D0, D1, and D2 stripe units are also read into the cache for a large write; the D0 and D1 stripe units are also read into the cache for a reconstruct write; and for a degraded write the parity data unit P is read as well, so that stripe unit D0 can be computed with the degraded read algorithm, which is equivalent to reading D0.

Step 404: According to the current write rule, use the cached stripe units that were read out, the filled stripe units to be written back, and/or the parity data unit to compute the parity data unit to be written back, and store it in the cache. The calculation rule is the same as shown in Fig. 2, except that here the calculation is performed in units of stripe units. For the small write of Fig. 2, for example, the parity data unit P to be written back is computed from the stripe unit D0 and parity data unit P read into the cache in step 402 and the filled stripe unit D0 to be written back.

Step 405: In units of stripe units, write the stripe units and parity data unit to be written back to the corresponding disks.
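For the small write case, steps 401 to 405 can be sketched in Python as follows. The sketch makes several assumptions: it handles a write confined to one stripe unit of stripe 0, assumes the parity unit is at disk index 3, stores one integer per sector, and uses the hypothetical names `fill` and `small_write`.

```python
SECTORS_PER_UNIT = 10

def fill(old_unit, first_sector, new_data):
    """Step 403: overlay the new sectors on the pre-read unit (the fill process)."""
    offset = first_sector % SECTORS_PER_UNIT
    assert offset + len(new_data) <= SECTORS_PER_UNIT, "write must stay in one unit"
    filled = list(old_unit)
    filled[offset:offset + len(new_data)] = new_data
    return filled

def small_write(disks, unit, first_sector, new_data, parity=3):
    # Steps 401-402: pre-read the whole target unit and the parity unit.
    old_unit, old_parity = list(disks[unit]), list(disks[parity])
    # Step 403: build the unit to be written back.
    new_unit = fill(old_unit, first_sector, new_data)
    # Step 404: the small write rule, applied to whole units.
    new_parity = [o ^ n ^ p for o, n, p in zip(old_unit, new_unit, old_parity)]
    # Step 405: write back whole units.
    disks[unit], disks[parity] = new_unit, new_parity

d0, d1, d2 = list(range(10)), list(range(10, 20)), list(range(20, 30))
disks = [d0, d1, d2, [a ^ b ^ c for a, b, c in zip(d0, d1, d2)]]
small_write(disks, 1, 11, list(range(111, 119)))   # write sectors 11-18
assert disks[3] == [a ^ b ^ c for a, b, c in zip(d0, disks[1], d2)]
```

Because unchanged sectors contribute zero to the XOR of the old and new unit, applying the rule to the whole unit gives the same parity as applying it sector by sector.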

The advantage of using the stripe unit as the read/write unit is analyzed below with a specific example. The example from the background art, writing data 9 and 10 when disk 0 has failed, is again used to illustrate reading and writing according to the invention. To allow comparison with the process described in the background art, the steps below are not written to mirror the write flow of Fig. 4 one to one, although the processing still follows the steps shown in Fig. 4. The process comprises the following steps:

Step 1: Determine the stripe units that hold data 9 and 10; read stripe units 10-19 and 20-29 and parity data unit P0-P9; generate the old data 0-9 by XOR and store it in the cache.

Step 2: Using the stripe unit data in the cache and the data to be written, fill out the stripe units SU0 and SU1 to be written back in the Cache. In this example, in the filled SU0, 0-8 are old data and 9 is the new data to be written; in the filled SU1, 10 is the new data to be written and 11-19 are old data.

Step 3: In large write mode, generate the new parity data unit P by XOR of the filled stripe units SU0 and SU1 in the Cache and SU2; then write the filled stripe unit SU1 and the newly generated parity data unit P back to disk.

As the analysis shows, this write process comprises three disk reads, two XOR operations, and two disk writes. Compared with the approach in the background art, it saves one disk read and one disk write and simplifies the processing flow, which improves write performance.
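The three steps of this example can be traced in Python. The sketch is illustrative only and rests on assumptions: each sector holds one integer equal to its sector number, the new values 909 and 910 are invented, and the helper `xor` is hypothetical.

```python
def xor(*units):
    """Sector-wise XOR of equally long units."""
    out = [0] * len(units[0])
    for unit in units:
        out = [a ^ b for a, b in zip(out, unit)]
    return out

# Hypothetical old contents: sector value equals sector number.
su0_old, su1_old, su2 = list(range(10)), list(range(10, 20)), list(range(20, 30))
p_old = xor(su0_old, su1_old, su2)
new_9, new_10 = 909, 910

# Step 1: three unit reads (SU1, SU2, P) and one XOR rebuild the old SU0.
reads = [su1_old, su2, p_old]
su0_rebuilt = xor(*reads)

# Step 2: fill SU0 and SU1 with the new data.
su0_new = su0_rebuilt[:9] + [new_9]
su1_new = [new_10] + su1_old[1:]

# Step 3: large write rule, then two unit writes (SU1 and P).
p_new = xor(su0_new, su1_new, su2)
writes = [su1_new, p_new]

# A later degraded read of SU0 returns the new sector 9.
assert xor(su1_new, su2, p_new)[9] == new_9
print(len(reads), len(writes))  # 3 2
```

SU0 is never written because disk 0 has failed; its new content is carried entirely by the parity unit.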

This is especially true for the read/write of non-adjacent sectors of one SU mentioned in the background art. Because the sectors are not contiguous, the background art performs two separate reads/writes. With the invention, reads target the stripe unit, and writes, thanks to the fill technique, also target the stripe unit, so every operation works on contiguous sectors and the read/write can be treated as a single access. This greatly reduces the number of reads and writes, simplifies the processing steps, and improves disk read/write performance.

Because reading and writing in units of stripe units reduces the complexity of reads and writes, the invention makes parallel reads and writes possible. This is analyzed below.

First, the case in which a RAID5 normal read and a write occur in parallel is analyzed. As shown in Fig. 1, suppose sectors 8, 11-18, P1 are to be written and sectors 10, 19, 20-29 are to be read. When the read and the write occur together, the following situations may arise:

Situation 1: The pre-read starts during a normal read, or a normal read is issued during the pre-read.

The pre-read steps are:

1. According to the data to be written, determine and read the stripe units SU0 and SU1 and the parity data unit P that hold the data, and store them in the Cache;

Process the stripe units in the cache with the fill technique to generate the filled stripe units to be written back, and store them in the Cache. In this example, the cached 0-7 and 9 fill out the data 8 to be written to form the SU0 to be written back, and the cached 10 and 19 fill out the data 11-18 to be written to form the SU1 to be written back;

2. According to the small write algorithm, XOR the filled SU0, the filled SU1, and the old parity data unit P to generate the new parity data unit P, and store it in the Cache.

Clearly, because the pre-read issues only read requests, it cannot conflict with the normal read.

Situation 2: before normal read was finished, write-back started, and just occurs write-back in read procedure:

In this example, there is not common factor in the sector of read and write-back, therefore in fact, though the stripe unit that can will also not read in the write-back process upgrades, causing some stripe units of reading is old stripe units, some are new stripe units, but do not change owing to finally offer the sector data of user's data correspondence, therefore, write-back impacts also the influence in the process just, and does not influence for the final result who reads to the user, therefore, when there is not the situation of common factor in the sector as this example read and write-back, can not lock to writing, allow this to write and carry out.

However, if the read sectors and the write-back sectors do intersect, for example if the data to be written back also includes sector 10, then when SU1 is read and a write-back to this unit is found to exist, the pre-read must already have completed, and the SU1 that was generated before the write-back started and is held in the cache can be used in place of the SU1 to be read. The data read and provided to the user is then the latest data, in which only sector 10, the sector to be written back, has been updated. No read-write conflict arises in this case, so the write need not be locked and is allowed to proceed. Alternatively, the SU1 that was pre-read before the write-back started can be used in place of the SU1 to be read normally, in which case the data finally provided to the user is the data from before the write operation started.
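The two choices described above can be illustrated with a small sketch. It assumes, purely for illustration, that the write operation keeps both the pre-read copy and the filled-up copy of SU1 in its cache; the dictionary layout and the function `read_overlapping` are hypothetical and not part of the described method.

```python
SECTORS_PER_SU = 10  # assumption: 10 sectors per stripe unit, 1 byte per sector

# Hypothetical cache entry kept by the write operation for SU1.
write_cache = {
    "SU1": {
        "preread": bytes(range(10, 20)),                    # data before the write
        "writeback": bytes([0xEE]) + bytes(range(11, 20)),  # sector 10 updated
    }
}


def read_overlapping(su_name, sectors, prefer_latest=True):
    """Serve a normal read from the copy cached by the write, not from disk."""
    entry = write_cache[su_name]
    su = entry["writeback"] if prefer_latest else entry["preread"]
    return bytes(su[s % SECTORS_PER_SU] for s in sectors)


latest = read_overlapping("SU1", [10, 19])
before_write = read_overlapping("SU1", [10, 19], prefer_latest=False)
```

Either copy is self-consistent, so the read never observes a half-written stripe unit; the only difference is whether the user sees the data from before or after the write.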

Situation 3: a normal read is issued before the write-back completes, that is, a read occurs during the write:

Since the write-back has begun, the pre-read has completed, and according to the pre-read steps the data 10～19 is already in the cache and can be read directly from the cache. The data 20～29 is not in the cache, so a request is issued to read SU2 from the disk; once it has been read into the cache, the data is extracted and sent to the user. No read-write conflict arises in this case, so the write need not be locked and is allowed to proceed. When the SU1 used is the one calculated after the pre-read, the data finally read and provided to the user is the latest data.

Next, the case of a RAID 5 degraded read occurring in parallel with a write is analyzed.

Still taking Fig. 1 as an example, suppose that disk 0 has failed, that data 0～10 is to be read, and that data 11～18 is to be written. When reading and writing occur at the same time, the following situations may arise:

Situation 1: the pre-read starts during a degraded read, or a degraded read is issued during the pre-read:

The degraded read steps are:

1. According to the data to be read, determine and read the stripe units SU1 and SU2 and the check data unit P, and store them in the cache;

According to the degraded-read algorithm, generate SU0 by XOR, store it in the cache, and provide data 0～10 to the user from the cache.
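The degraded read steps above amount to rebuilding the stripe unit of the failed disk from the surviving units. A minimal sketch follows, assuming 10 one-byte sectors per stripe unit and three data stripe units per stripe; the names `xor_units` and `read_sectors` and the sample byte values are illustrative only.

```python
SECTORS_PER_SU = 10  # assumption: 10 sectors per stripe unit, 1 byte per sector


def xor_units(*units):
    """XOR equally sized stripe units byte by byte."""
    out = bytearray(len(units[0]))
    for unit in units:
        for i, b in enumerate(unit):
            out[i] ^= b
    return bytes(out)


# Contents of the stripe; SU0 lives on failed disk 0 and is never read directly.
su0 = bytes(range(100, 110))
su1 = bytes(range(110, 120))
su2 = bytes(range(120, 130))
p = xor_units(su0, su1, su2)

# Step 1: read the surviving units SU1, SU2 and P into the cache.
cache = {"SU1": su1, "SU2": su2, "P": p}

# Degraded-read algorithm: rebuild SU0 by XOR and keep it in the cache.
cache["SU0"] = xor_units(cache["SU1"], cache["SU2"], cache["P"])


def read_sectors(cache, first, last):
    """Extract sectors first..last (inclusive) from cached stripe units."""
    out = bytearray()
    for s in range(first, last + 1):
        out.append(cache["SU%d" % (s // SECTORS_PER_SU)][s % SECTORS_PER_SU])
    return bytes(out)


data = read_sectors(cache, 0, 10)  # sectors 0~9 from rebuilt SU0, sector 10 from SU1
```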

The pre-read steps are:

1. According to the data to be written, determine and read the stripe unit SU1 where the data resides; read the check data unit P; store both in the cache;

Process the stripe unit in the cache with the fill-up technique to generate the filled-up stripe unit to be written back. In this example, the data 11～18 to be written is filled up with the cached 10 and 19 to generate the SU1 to be written back, which is stored in the cache;

2. According to the small-write algorithm, generate the new check data unit P from the new SU1 and the old P, and store it in the cache.

It can be understood that the degraded read and the pre-read are both read commands, so no conflict arises and no lock is needed. That is, as long as the write-back does not begin before the degraded read finishes, the miss read and the pre-read cannot conflict.

Situation 2: the write-back starts before the degraded read finishes, that is, a write-back occurs during the degraded read:

Because the write-back modifies P1 while the degraded read still needs to read the old data of P1, the degraded read and the write-back conflict, and a "stripe lock" must be added to guarantee data consistency. That is, when the degraded read begins, it places a "stripe lock" on the stripe to prohibit write-back; the write-back command is suspended, which guarantees data consistency.
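The effect of the stripe lock can be sketched with ordinary thread primitives. This is an illustration only: the class `StripeLockTable`, the per-stripe `threading.Lock`, and the two worker functions are assumptions, and the events merely make the ordering of the example deterministic.

```python
import threading


class StripeLockTable:
    """Hypothetical table holding one lock per stripe."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()

    def lock_for(self, stripe_id):
        with self._guard:
            return self._locks.setdefault(stripe_id, threading.Lock())


table = StripeLockTable()
events = []


def degraded_read(stripe_id, started, may_finish):
    # The degraded read takes the stripe lock when it begins.
    with table.lock_for(stripe_id):
        started.set()
        may_finish.wait()  # stands in for reading SU1, SU2, P and the XOR
        events.append("degraded read done")


def write_back(stripe_id):
    # The write-back is suspended here until the degraded read releases the lock.
    with table.lock_for(stripe_id):
        events.append("write-back done")


started, may_finish = threading.Event(), threading.Event()
reader = threading.Thread(target=degraded_read, args=(1, started, may_finish))
reader.start()
started.wait()   # the degraded read now holds the stripe lock
writer = threading.Thread(target=write_back, args=(1,))
writer.start()   # the write-back starts during the degraded read
may_finish.set()
reader.join()
writer.join()
```

Only the degraded read and the write-back take the lock in this sketch; normal reads and pre-reads would bypass it, in line with the analysis of the other situations.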

Situation 3: a degraded read is issued after the write-back has begun, that is, a degraded read occurs during the write-back:

According to the write-back algorithm, the pre-read step has been completed by the time the write-back begins, so SU1 and the check data unit P are necessarily stored in the cache. The degraded read therefore reads the new SU1 and the new check data unit P from the cache, obtains SU2 by reading the disk, and generates SU0 by XOR. It can be understood that the degraded read may equally read the old SU1 and the old check data unit P from the cache, obtain SU2 by reading the disk, and generate SU0 by XOR in the same way. The generated data 0～9, together with data 10, is then provided to the user from the cache. As can be seen, the degraded read can be carried out without locking in this case.

Situation 3 in this example takes a small write as the example; a degraded write may also occur together with a degraded read. In the write method of the present invention corresponding to Fig. 4, the pre-read of a degraded write necessarily generates the stripe unit of the failed disk before the fill-up operation. Therefore, when the write-back of the degraded write starts, all the stripe units and the check unit of the stripe necessarily exist in the cache, so if any read occurs, the data to be read can be found in the cache, and reading during the write can be achieved in this case.

From the above analysis, the read-write method based on stripe units does not need to lock in every situation, as the existing RAIDFrame does; a lock is needed only when a write-back occurs during a degraded read. In all other situations there is no mutual exclusion between reads and writes and no lock is needed, so reads and writes can proceed in parallel. This shortens the waiting delay, increases data read-write speed, and improves performance.
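The outcome of this analysis can be condensed into a single predicate. The function name, the string labels and the `read_started_first` flag below are hypothetical; the sketch only restates the situations discussed above.

```python
def needs_stripe_lock(read_kind, write_phase, read_started_first):
    """Return True only when a write-back starts during a degraded read.

    read_kind:          "normal" or "degraded"
    write_phase:        "pre_read" or "write_back"
    read_started_first: True if the read was already in progress when this
                        write phase started, False if the read was issued later
    """
    return (
        read_kind == "degraded"
        and write_phase == "write_back"
        and read_started_first
    )


# Enumerate every combination to see which ones require the lock.
locked_cases = [
    (r, w, f)
    for r in ("normal", "degraded")
    for w in ("pre_read", "write_back")
    for f in (True, False)
    if needs_stripe_lock(r, w, f)
]
```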

Based on the above analysis, and on the basis of the RAID write method and read method of the present invention, the present invention further provides a method of reading during a write and a method of writing during a read, so that concurrent RAID reads and writes avoid locking as far as possible, as follows:

The data read method during data writing provided by the present invention applies where the read is a normal read (Fault-Free Read) or a degraded read (Degraded Read) and the write is any kind of write. The write process comprises a pre-read and a write-back, and the read during the write process comprises the following steps:

Step 501: According to the read rule, first determine whether all the stripe units containing the data to be read are stored in the cache. If so, operate on the cached stripe units and execute step 503; otherwise, execute step 502. Taking Fig. 2 as an example, for a normal read, determine whether stripe unit D0 exists in the cache; for a degraded read, first determine whether stripe unit D0 exists, and if it does not, further determine whether D1, D2 and P exist.

Step 502: Determine the stripe units containing data to be read that are not held in the cache, and read those stripe units into the cache by miss read. A miss read here means reading data directly from the disk; correspondingly, a hit read means reading data directly from the cache.

Step 503: According to the current read rule, take out or calculate the needed data from the cached stripe units and provide it to the user. As shown in Figure 4, for a normal read, the desired data is read directly from the cached stripe unit D0 and provided to the user; for a degraded read, D1, D2 and P in the cache are XORed to obtain stripe unit D0, and the desired data is then read from the cached stripe unit D0 and provided to the user.
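Steps 501 to 503 can be sketched as a single function. This is a simplified illustration: the cache and the disks are modelled as dictionaries keyed by unit name, and the function `read_during_write` with its parameters is hypothetical.

```python
def xor_units(*units):
    """XOR equally sized stripe units byte by byte."""
    out = bytearray(len(units[0]))
    for unit in units:
        for i, b in enumerate(unit):
            out[i] ^= b
    return bytes(out)


def read_during_write(cache, disk, wanted, peers, failed=None):
    """Read stripe unit `wanted`; `peers` are the other units of the stripe plus P."""
    # Step 501: decide which stripe units are needed but not yet cached.
    if wanted in cache:
        missing = []
    elif wanted == failed:
        missing = [u for u in peers if u not in cache]
    else:
        missing = [wanted]
    # Step 502: miss read, i.e. fetch the missing units from disk into the cache.
    for unit in missing:
        cache[unit] = disk[unit]
    # Step 503: take out the unit, or calculate it by XOR for a degraded read.
    if wanted not in cache:
        cache[wanted] = xor_units(*(cache[u] for u in peers))
    return cache[wanted]


d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\x0f\xf0"
disk = {"D0": d0, "D1": d1, "D2": d2, "P": xor_units(d0, d1, d2)}

# Normal read while a write has already cached a new D0: the cached copy is used.
cache = {"D0": b"\xaa\x02"}
normal = read_during_write(cache, disk, "D0", peers=("D1", "D2", "P"))

# Degraded read: the disk holding D0 has failed; D1 is already cached.
degraded_disk = {k: v for k, v in disk.items() if k != "D0"}
degraded_cache = {"D1": d1}
rebuilt = read_during_write(
    degraded_cache, degraded_disk, "D0", peers=("D1", "D2", "P"), failed="D0"
)
```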

The present invention also provides a data write method during normal read (Fault-Free Read), applicable to all kinds of writes. The normal read comprises reading the stripe units containing the desired data into the cache and taking the desired data out of the cached stripe units to provide to the user. The write during the normal read comprises the following steps:

Step 601: In the pre-read (Preread) process of the write rule, determine whether the stripe units containing all the data to be pre-read exist in the cache. If so, execute step 603; otherwise, execute step 602. The write rules here include the rules for large writes, small writes and degraded writes. For determining which data is to be pre-read, see step 401.

Step 602: Determine the stripe units containing data to be pre-read that are not held in the cache; then read the determined stripe units by miss read and store them in the cache;

Steps 603-604: Process the data to be written back together with the stripe units in the cache to generate filled-up stripe units containing the data to be written back, and store them in the cache as the stripe units to be written back. Then, according to the write rule adopted, perform the computation in units of stripe units to calculate the check data unit to be written back. The rule for calculating the check unit may refer to Figure 2, except that the present invention calculates in units of stripe units.

Step 605: Write the stripe units to be written back and the calculated check data unit back to the corresponding disks. If, during the write-back, a stripe unit being written back is the stripe unit currently being read by the normal read, the write-back continues, and the normal read uses the stripe unit cached by this write operation, or the stripe unit calculated after fill-up, as the stripe unit to be read. That is, when data read from the disk is stored in the cache, any part that intersects with data cached by the write operation must defer to the data already cached during the write operation.
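Steps 601 to 605 can be sketched for the small-write case as follows. The functions `prepare_write`, `write_back` and `normal_read` are hypothetical, the cache and disks are modelled as dictionaries, and the parity update uses the standard identity new P = old P XOR old data XOR new data, which is an assumption about the write rule. For brevity the sketch keeps only the filled-up copy in the cache, not the pre-read copy.

```python
def xor_units(*units):
    """XOR equally sized stripe units byte by byte."""
    out = bytearray(len(units[0]))
    for unit in units:
        for i, b in enumerate(unit):
            out[i] ^= b
    return bytes(out)


def prepare_write(cache, disk, su_name, offset, new_bytes, p_name="P"):
    # Steps 601-602: pre-read whatever is not cached yet (miss read).
    for unit in (su_name, p_name):
        if unit not in cache:
            cache[unit] = disk[unit]
    old_su, old_p = cache[su_name], cache[p_name]
    # Step 603: fill up the new data with the cached old data.
    new_su = old_su[:offset] + new_bytes + old_su[offset + len(new_bytes):]
    # Step 604: compute the check data unit on whole stripe units.
    new_p = xor_units(old_p, old_su, new_su)
    cache[su_name], cache[p_name] = new_su, new_p
    return [su_name, p_name]


def write_back(cache, disk, units):
    # Step 605: write the prepared units to the corresponding disks.
    for unit in units:
        disk[unit] = cache[unit]


def normal_read(cache, disk, su_name):
    # Data already cached by the write takes precedence over data on disk.
    if su_name not in cache:
        cache[su_name] = disk[su_name]
    return cache[su_name]


d0, d1, d2 = b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\x0f\x0e\x0d\x0c"
disk = {"D0": d0, "D1": d1, "D2": d2, "P": xor_units(d0, d1, d2)}
cache = {}

dirty = prepare_write(cache, disk, "D1", 1, b"\x99\x98")
seen_by_reader = normal_read(cache, disk, "D1")  # read arrives before write-back ends
write_back(cache, disk, dirty)
```

The read is served without waiting for the write-back, and the stripe on disk is consistent once the write-back completes.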

As can be seen from the above, a stripe lock is needed to prohibit writing only when a write-back occurs during a degraded read; reads and writes in all other situations can follow the steps above without locking.

Because RAID 0 (RAID LEVEL 0) and RAID 1 (RAID LEVEL 1) have no check unit, they involve only direct reads from and writes to disk and no check computation (for details, see the two CMU publications mentioned in the background art). Reading and writing in units of stripe units as in the present invention is therefore also applicable to RAID 0 and RAID 1. Their read and write methods are similar to those of the present invention, the difference being that the steps of computing according to the rules are omitted. With the read and write methods of the present invention, concurrent reads and writes on RAID 0 and RAID 1 require no locking at all.

The above is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, improvement or the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A data read method for a disk array (RAID), comprising the following steps:

D. reading out the needed data from the stripe units in the cache.

2. A data write method for a disk array (RAID), comprising the following steps:

B. determining the stripe units or check data units to which the sectors to be pre-read belong, reading each determined stripe unit and check data unit from the disk, and storing them in the cache;

D. according to the current write rule, calculating the check data unit to be written back from the stripe units and check data unit read into the cache and the generated stripe units to be written back, and storing it in the cache;

E. writing the stripe units to be written back and the check data unit to be written back to the corresponding disks.

3. The method according to claim 2, wherein determining, according to the current write rule, the sectors containing the data to be pre-read as described in step A comprises:

4. A data read method during data writing for a disk array (RAID), wherein the data writing comprises a pre-read, in which the stripe units containing the data to be written are read into the cache and the stripe units containing the data to be written and the check data unit are generated by computation, and a write-back, in which the generated stripe units and check data unit are written to disk; the data read method comprises the following steps:

A. according to the current read rule, determining whether all the stripe units to which the data to be read belongs are stored in the cache; if so, executing step C; otherwise, executing step B;

D. taking out the needed data from the stripe units in the cache.

5. The method according to claim 4, wherein the write is a large write, a small write, a reconstruction write or a degraded write.

6. The method according to claim 4, wherein the stripe unit in step D is a stripe unit pre-read by the write process, or a stripe unit to be written back that is calculated after the pre-read.

7. A data write method during normal data reading for a disk array (RAID), wherein the normal data reading comprises reading the stripe units containing the desired data into the cache and taking out or calculating the desired data from the cached stripe units to provide to the user; the data write method comprises the following steps:

8. The method according to claim 7, wherein determining, according to the current write rule, all the stripe units and/or check data units to which the data to be pre-read belongs as described in step A comprises:

9. The method according to claim 7, wherein step D further comprises:

when a stripe unit to be written back is being written back to the corresponding disk and that stripe unit is the stripe unit currently being read by the normal data read operation, using the stripe unit cached before the write-back operation was issued as the stripe unit currently read by the normal data read operation.

10. The method according to claim 9, wherein the stripe unit is a stripe unit pre-read by the write process, or a stripe unit to be written back that is calculated after the pre-read.