WO2019090756A1 - RAID-based globally resource-shared data storage system - Google Patents

RAID-based globally resource-shared data storage system

Info

Publication number
WO2019090756A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
disk
strip
block
disks
Prior art date
Application number
PCT/CN2017/110662
Other languages
English (en)
French (fr)
Inventor
张广艳
郑纬民
Original Assignee
清华大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 清华大学
Priority to PCT/CN2017/110662 priority Critical patent/WO2019090756A1/zh
Priority to CN201780091514.4A priority patent/CN111095217B/zh
Publication of WO2019090756A1 publication Critical patent/WO2019090756A1/zh
Priority to US16/856,133 priority patent/US10997025B2/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1076Parity data used in redundant arrays of independent storages, e.g. in RAID systems
    • G06F11/1092Rebuilding, e.g. when physically replacing a failing disk
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1076Parity data used in redundant arrays of independent storages, e.g. in RAID systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/08Error detection or correction by redundancy in data representation, e.g. by using checking codes
    • G06F11/10Adding special bits or symbols to the coded information, e.g. parity check, casting out 9's or 11's
    • G06F11/1076Parity data used in redundant arrays of independent storages, e.g. in RAID systems
    • G06F11/1096Parity calculation or recalculation after configuration or reconfiguration of the system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/20Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements
    • G06F11/2053Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements where persistent mass storage functionality or persistent mass storage control functionality is redundant
    • G06F11/2089Redundant storage control functionality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2211/00Indexing scheme relating to details of data-processing equipment not covered by groups G06F3/00 - G06F13/00
    • G06F2211/10Indexing scheme relating to G06F11/10
    • G06F2211/1002Indexing scheme relating to G06F11/1076
    • G06F2211/104Metadata, i.e. metadata associated with RAID systems with parity

Definitions

  • the present invention relates to storage array technology, and more particularly to a fast recovery storage array data organization.
  • RAID Redundant Array of Inexpensive/Independent Disks
  • Storage arrays use striping technology to organize multiple disks together to provide greater storage capacity and processing power.
  • RAID5 and RAID6 parity-based RAID levels
  • disks are divided into disk groups with different RAID levels and different numbers of disks, and the disk groups form storage islands isolated from one another.
  • when a disk fails, all surviving disks in the failed disk group participate in data reconstruction: the entire contents of each surviving disk must be read, and the reconstructed data must be written to a single hot-spare disk.
  • this isolated mode prevents the remaining idle disk groups from sharing the data reconstruction load.
  • in recent years, as disk capacity keeps growing, large-capacity disk arrays face rebuild times of tens of hours or even days.
  • during this long repair process, the upper-layer services face a high risk of data loss and prolonged I/O performance degradation. It can be seen that the data organization of current storage arrays suffers from slow data reconstruction.
  • the present invention has been made in view of the above problems.
  • a RAID-based data storage system with global resource sharing (hereinafter sometimes referred to as RAID+) is provided, including a first number of disks, and a RAID mechanism is used to store data on the disks.
  • blocks on different disks form a stripe, and at least one of the blocks of the stripe stores parity information, wherein the width of the stripe is smaller than the first number, and the data layout of the data storage system satisfies the following characteristics:
  • any two physical blocks within a stripe are distributed on different disks;
  • each disk holds the same number of data blocks and the same number of parity blocks; the other data of the stripes associated with the data on any one disk is evenly distributed across all remaining disks.
  • the first number is a prime power
  • the first number is represented by a number n
  • n is greater than or equal to 4
  • the width of the stripe is represented by a number k
  • for n, (n-1) orthogonal Latin squares can be obtained
  • the data layout of the data storage system is generated as follows: take k of the (n-1) orthogonal Latin squares, ignore the row in which all k orthogonal Latin squares have identical element values, then traverse all remaining positions of the orthogonal Latin squares in row-major order and combine the element values at the same (row, column) position into a mapping group, each mapping group corresponding to one stripe.
  • each element value in the mapping group indicates the number of the disk on which the corresponding block of that stripe is placed.
  • k orthogonal Latin squares are generated according to the following theorem, and the first row of each orthogonal Latin square is ignored. The first orthogonal Latin square is represented by L0, and the element in row i, column j of the m-th orthogonal Latin square is denoted Lm-1[i,j]; then the mapping group (L0[i,j], L1[i,j], ..., Lm-1[i,j], ..., Lk-1[i,j]) indicates the numbers of the disks on which the blocks of the ((i-1)*n+j)-th stripe are placed, where the 1st block is placed on disk L0[i,j], the m-th block is placed on disk Lm-1[i,j], and the k-th block is placed on disk Lk-1[i,j].
  • the data of each of these disks is placed in units of blocks.
  • the reconstructed data is stored in the free space reserved on all other disks, wherein the number of the disk to which the reconstructed data is written is determined as follows: an orthogonal Latin square other than the k orthogonal Latin squares is selected from the (n-1) orthogonal Latin squares, referred to as the (k+1)-th orthogonal Latin square; for the failed disk, for each associated faulty stripe, the position on the orthogonal Latin squares corresponding to that faulty stripe is determined, and the element value at that position in the (k+1)-th orthogonal Latin square is obtained; this element value indicates the number of the disk to which the reconstructed data block is allocated, and the reconstructed data block is stored in the free space of the disk indicated by that number.
  • when a further disk fails, an orthogonal Latin square other than the (k+1) orthogonal Latin squares already used is selected from the (n-1) orthogonal Latin squares, referred to as the (k+2)-th orthogonal Latin square.
  • its element value at the corresponding position indicates the number of the disk to which the reconstructed data block is allocated, and the reconstructed data block is stored in the free space of the disk indicated by that number.
  • the data storage system stores data with different storage templates, wherein for first data to be stored in a first template manner, a first corresponding space on the first number of disks
  • is allocated to the first data; for the first corresponding space on the first number of disks, the mapping relationship between the data stripes of the first template and the first corresponding space is established in the orthogonal-Latin-square-based manner described above.
  • for the second corresponding space, the mapping relationship between the data stripes of the second template and the second corresponding space is established in the orthogonal-Latin-square-based manner described above.
  • the different templates differ in at least one of RAID level, stripe width, physical block size, and inter-stripe addressing policy.
  • the corresponding space is called a logical volume
  • each logical volume uses the same type of data template as the granularity of the storage space allocation, and the physical location of each data template in the logical volume is tracked by the index technology, and the metadata is maintained.
  • a specific physical access location is located by locating a data template, querying an index table, locating a stripe, locating an internal location of the stripe, and calculating a global location.
  • the mapping relationship between stripes and disks is determined in the orthogonal-Latin-square-based manner described above, wherein the parity block in a stripe is the last block of the stripe, and the stripes are sorted so that the number of the disk holding the last block of one stripe is the number of the disk holding the first data block of the next stripe.
  • alternatively, the parity block in a stripe is the last block of the stripe, and the stripes are sorted so that the number of the disk holding the last block of one stripe is smaller, by a certain amount, than the number of the disk holding the first data block of the next stripe; this amount is the row number of the position, in the Latin square used, of the mapping group corresponding to that stripe.
  • the foregoing embodiment of the present invention provides a normal data layout, and the normal data layout is used to realize a balanced distribution of short data strips in a larger disk pool;
  • the above embodiments of the present invention provide for designing a degraded data layout, using the degraded data layout to guide the redistribution of reconstructed data after a disk failure.
  • the above embodiments of the present invention provide a method for sustained self-healing recovery.
  • the storage array implements self-healing recovery under a series of disk failures.
  • the normal data layout designed by the embodiments of the present invention removes the restriction in conventional data storage systems that the number of disks equal the stripe width, breaks the resource isolation between disk groups, and achieves a completely balanced reconstruction read load when a disk fails.
  • the designed degraded data layout distributes the lost data evenly over all remaining disks without manual replacement of the failed disk, achieving a completely balanced reconstruction write load. With both the read load and the write load balanced, the rebuild speed of the storage array is improved; and under a series of disk failures, the storage array achieves continuous self-healing recovery without manual intervention.
  • Figure 1 shows an exemplary complete orthogonal Latin square group diagram
  • Figure 2 shows 42 legal values that satisfy the prime power limit when the number of disks is between 4 and 128;
  • Figures 3(a)-(c) illustrate schematic process diagrams of an exemplary construction method for the normal data layout of a data storage array based on orthogonal Latin squares, in accordance with an embodiment of the present invention;
  • FIGS. 4(a)-(c) are diagrams showing an exemplary configuration of a degraded data layout provided by an embodiment of the present invention.
  • "the same" in "the data blocks distributed on the respective disks are the same, and the distributed parity blocks are also the same" is to be understood as identical or substantially the same, that is, when the amount of data is sufficiently large;
  • the present invention guarantees this by means of the orthogonal Latin square data layout mechanism, but does not preclude the case where, in practical applications with a relatively small amount of data, the data blocks distributed on the disks are substantially but not exactly the same.
  • likewise, "the other data of the stripes associated with the data on any one disk is evenly distributed across all remaining disks" refers to the case when the amount of data is large. Obviously, in extreme cases, for example when there are only one or two stripes, the other data of all stripes associated with the data on disk A cannot be evenly distributed over disks B, C, D, E, and so on.
  • Stripe width or stripe size refers to the number of disks spanned by a stripe.
  • the embodiments of the present invention construct the data layout of a data storage system (hereinafter sometimes referred to as RAID+) by means of the mathematical properties of orthogonal Latin squares from combinatorics; the elegant properties of orthogonal Latin squares enable them to support the balanced data distribution desired by the present invention.
  • RAID+ data storage system
  • Definition 1: in an n×n matrix, if there are only n distinct element values, and each element value appears exactly once in every row and every column of the matrix, then the matrix is called a Latin square of order n.
  • Definition 2: let L1 and L2 be two Latin squares of the same order n. When the two are superimposed, the elements at the same position combine to form an ordered pair. If each of the n² ordered pairs appears exactly once, then L1 and L2 are said to be mutually orthogonal.
  • Definition 3: in a set consisting of several Latin squares of the same order, if any two Latin squares in the set are orthogonal, the set is called a set of mutually orthogonal Latin squares (MOLS).
  • MOLS orthogonal Latin square group
  • Theorem 1: for a set of mutually orthogonal Latin squares of order n, the number of Latin squares in the set cannot exceed n-1, and this upper limit can be reached when n is a prime power.
  • Theorem 2: when n is a prime power, for the i-th Latin square fi (i ∈ [1, n-1]) in the complete orthogonal Latin square group, its element in row x, column y is fi[x,y] = i·x + y; the operators "·" and "+" here represent multiplication and addition in the finite field, respectively.
  • Theorem 3: if two Latin squares of order n are mutually orthogonal, then at the n positions where any number d (d ∈ [0, n-1]) appears in one Latin square, the other Latin square has exactly n different elements.
  • when the order n is a prime power, n-1 different orthogonal Latin squares can be constructed. For example, as shown in FIG. 1, when the order n is 5, 4 different orthogonal Latin squares can be generated.
  • the storage array technology according to the embodiments of the present invention satisfies three characteristics by means of the mathematical properties of orthogonal Latin squares.
  • first, any two physical blocks within a stripe are distributed on different disks, so that the layout is fault-tolerant;
  • second, the same number of data blocks and parity blocks is distributed on each disk, so that the load on every disk in the layout is completely balanced;
  • third, the data related to the data on any one disk is evenly distributed over all the remaining disks, thus achieving a complete balance of the read load during data reconstruction.
  • the first number is a prime power, represented by a number n, n is greater than or equal to 4, and the width of the stripe is represented by a number k, wherein for n, (n-1) orthogonal Latin squares can be obtained, and the data layout of the data storage system is generated as described above.
  • the limitation that the number of disks must be a prime power may sound very harsh, but the actual results indicate that there are a large number of valid values to choose from, and the gaps between them are small enough to meet the needs of users.
  • when the number of disks is between 4 and 128, the following 42 legal n values can be obtained: 4, 5, 7, 8, 9, 11, 13, 16, 17, 19, 23, 25, 27, 29, 31, 32, 37, 41, 43, 47, 49, 53, 59, 61, 64, 67, 71, 73, 79, 81, 83, 89, 97, 101, 103, 107, 109, 113, 121, 125, 127 and 128.
  • Figure 2 shows these 42 legal values, and the y-axis represents the interval between adjacent legal values. It can be seen from the figure that there are only two points with a gap of 8, and the gaps at most points lie in [2, 4]. In other words, for an illegal disk-count configuration, users only need to add or remove a small number of disks to meet the RAID+ layout requirements. Since the envisioned disk pool accommodates tens to hundreds of disks, there is a large number of legal n values to choose from.
  • k orthogonal Latin squares are generated according to the above Theorem 2, and the first row of each orthogonal Latin square is ignored. The first orthogonal Latin square is represented by L0, and the element in row i, column j of the m-th orthogonal Latin square is denoted Lm-1[i,j]; then the mapping group (L0[i,j], L1[i,j], ..., Lm-1[i,j], ..., Lk-1[i,j]) indicates the numbers of the disks on which the blocks of the ((i-1)*n+j)-th stripe are placed, where the 1st block is placed on disk L0[i,j], the m-th block is placed on disk Lm-1[i,j], and the k-th block is placed on disk Lk-1[i,j]; the data of each disk is placed in units of blocks.
  • FIGS. 3(a)-(c) illustrate schematic process diagrams of an exemplary construction method for the normal data layout of a data storage array based on orthogonal Latin squares, in accordance with an embodiment of the present invention.
  • the construction of the data layout for a stripe size k (3 in the example of FIGS. 3(a)-(c)): first generate k orthogonal Latin squares of order n, as shown in FIG. 3(a), and remove the first row, which is identical in all k squares.
  • the mapping tuples of the stripes are then formed as shown in FIG. 3(b): the mapping tuple of the first stripe a is the combination of the numbers in the first row, first column of the three orthogonal Latin squares,
  • namely a: (1, 2, 3), because the number of the orthogonal Latin square L0 in the first row, first column is 1, the number of the orthogonal Latin square L1 in the first row, first column is 2, and the number of the orthogonal Latin square L2 in the first row, first column is 3; the mapping tuple of the next, second stripe b is the combination of the numbers of the three orthogonal Latin squares in the first row, second column,
  • and so on, until the combination of the values of the three Latin squares in the fourth row, fifth column gives t: (3, 2, 1).
  • the numbers in a stripe tuple in FIG. 3(b) represent the numbers of the disks on which the physical blocks of the stripe are placed; for example, a: (1, 2, 3) means that the three physical blocks of the first physical stripe (numbered a) are placed on disks 1, 2 and 3,
  • and b: (2, 3, 4) means that the three physical blocks of the second physical stripe (numbered b) are placed on disks 2, 3 and 4, and so on.
  • the generated normal data layout satisfies three characteristics: first, any two physical blocks within a stripe are distributed on different disks, so that the layout is fault-tolerant; second, the same number of data blocks and parity blocks is distributed on each disk, so that the load on every disk in the layout is completely balanced; third, the data associated with the data on any one disk is evenly distributed across all the remaining disks, achieving a complete balance of the read load during data reconstruction.
  • the reconstructed data is evenly written into the remaining surviving disks.
  • the reconstructed data is stored in the free space reserved by all other disks.
  • the number of the disk on which the reconstructed data is written can be determined as follows:
  • the reconstructed data block can be stored in the free space of the disk indicated by the number.
  • an orthogonal Latin square of (n-1) orthogonal Latin squares other than the (k+1) orthogonal Latin squares is selected.
  • the reconstructed data block can be stored in the free space of the disk indicated by the number.
  • processing may be performed to determine the stripes associated with any of the p failed disks; for each such stripe, the number of its data blocks located on the p failed disks is determined; stripes with more data blocks located on the p failed disks are assigned a higher recovery priority; and the stripes with higher recovery priority are recovered first.
  • the following example shows how the degraded data layout is constructed when a disk fails.
  • the degraded data layout stores the reconstructed data by preserving free space in advance on each disk, and guides the redistribution of the reconstructed data by means of orthogonal Latin squares.
  • n of the matrix is a prime power
  • n-1 different orthogonal Latin squares can be constructed.
  • the normal data layout is constructed using k orthogonal Latin squares, and n-k-1 orthogonal Latin squares remain unused.
  • when constructing the degraded data layout, a new orthogonal Latin square is first selected from the n-k-1 remaining orthogonal Latin squares; then the number at the same position of the new orthogonal Latin square as the faulty stripe is selected as the disk number to which the stripe's lost data is redistributed; finally, the lost data of the faulty stripe is reconstructed into the free area on that disk.
  • the disk data array shown in Figures 3(a)-(c) with 5 disks is used, together with orthogonal Latin squares of order 5, namely
  • the orthogonal Latin squares L0, L1 and L2 of the complete group of order-5 orthogonal Latin squares.
  • in order to repair the 12 faulty stripes, the data storage system according to an embodiment of the present invention not only needs to know the physical locations of the other intact data of these stripes, but also needs to find free areas in the disk space to store the reconstructed data.
  • the middle column of Figure 4(b) gives the complete stripe mapping tuple information
  • the mapping tuple of stripe c is (3, 4, 0), which indicates that stripe c has lost its parity block (the last digit in the tuple is the disk of the parity block),
  • and the remaining two data blocks are stored on disk D3 and disk D4.
  • the mapping tuple of stripe d is (4, 0, 1); it has lost one data block, the remaining data block is distributed on disk D4, and the parity block is distributed on disk D1.
  • the mapping tuple (α, β, γ) of a faulty stripe in the figure also carries a new physical location θ, which
  • denotes the new disk number to which the physical block lost inside the stripe is remapped. Still taking stripe c as an example, the mapped new disk number is 1, indicating that after the lost data is reconstructed, it is stored on disk D1. The number 1 is not equal to any of (3, 4, 0) in the original mapping relationship, so the physical blocks in the same stripe continue to be distributed on different disks.
  • the letter A, together with braces, in Figure 4(c) shows the physical blocks newly added on each disk. The 12 missing physical blocks form exactly a 4×3 degraded data layout.
  • Figure 4(a) shows the method of generating the new disk number. Below the original k orthogonal Latin squares stacked together, a new orthogonal Latin square L3 is introduced. If the position of the faulty stripe in the table is (x, y), the number at the same position (x, y) of the orthogonal Latin square L3 is taken as the remapped disk number. In the degraded data layout, it can be seen that the write load for data reconstruction is spread across all surviving disks and the layout is still fault-tolerant. If the read load on each disk is counted carefully, each surviving disk in the data template reads 6 physical blocks, that is, the read load is also balanced.
  • the generated degraded data layout has the following characteristics: first, any two physical blocks in a faulty stripe are still placed on different disks, so that the degraded data layout retains fault tolerance; second, the number of new physical blocks added to each disk is equal, which achieves a complete balance of the data reconstruction write load. Since the data reconstruction read load is fully balanced in the normal data layout, with both the reconstruction read and write loads balanced, all disks participate in reading and writing in parallel, which improves the rebuild speed of the storage array.
  • the construction process of the degraded data layout is repeated with a new orthogonal Latin square to achieve continuous fast recovery.
  • for the t-th disk failure, an orthogonal Latin square is selected from the n-k-t remaining orthogonal Latin squares to guide the construction of the degraded data layout, and the degraded data layout over n-t disks is obtained.
  • after n-k-1 disk failures, the number of disks is reduced to k+1;
  • if a further failure then occurs, the stripe width k is equal to the number of disks, and only a single disk choice exists for the reconstructed data of each faulty stripe, yielding
  • the degraded data layout over k disks. Throughout this process, because the degraded data layout has always retained fault tolerance, the storage array can achieve continuous self-healing recovery without the need to manually replace failed disks.
  • the data storage system stores data with different storage templates, wherein for first data to be stored in a first template manner, a first corresponding space on the first number of disks is obtained and
  • allocated to the first data; for the first corresponding space on the first number of disks, the mapping relationship between the data stripes of the first template and the first corresponding space is established according to the normal data layout manner borrowing orthogonal Latin squares described above; and for second data to be stored in a second template manner, a second corresponding space on the first number of disks is allocated to the second data;
  • for the second corresponding space on the first number of disks, the mapping relationship between the data stripes of the second template and the second corresponding space is established according to the normal data layout manner borrowing orthogonal Latin squares described above.
  • different templates differ in at least one of RAID level, stripe width, physical block size, and inter-stripe addressing policy.
  • a major advantage of the data storage system RAID+ of the embodiment of the present invention is that multiple virtual RAID arrays (ie, RAID volumes) can be provided inside the same disk pool, and each user volume serves different users or loads. Under multiple users, assign each user a different logical volume. Each logical volume uses the same type of data template as the granularity of storage space allocation.
  • RAID+ can track the physical location of each data template in a logical volume through indexing techniques. Since each data template is relatively large, including n ⁇ (n-1) ⁇ k physical blocks, and the metadata information only needs to record the physical location of the data template, it is not necessary to separately map each physical block inside the template, so RAID+ only The mapping between user volumes and physical space can be achieved by additionally maintaining a small amount of metadata overhead. By caching metadata in memory, RAID+ enables fast query of the physical location of the template, reducing processing delays during address translation.
  • locating the data template: from the logical position x of the user request, by combining x with the user data space size lt inside a single data template, the accessed template number #t and the offset δt inside the template are calculated;
  • for a data storage system that lays out data using the properties of orthogonal Latin squares, for data to be stored, when it is desired to store it in a sequential-read-friendly manner, the data layout based on the orthogonal Latin square properties is used,
  • where the number of the disk holding the last block of one stripe is the number of the disk on which the first data block in the next stripe is located.
  • when it is desired to store data in a sequential-write-friendly manner, the parity block in a stripe is the last block of the stripe, and the stripes are sorted so that the number of the disk holding the last block of one stripe is smaller, by a certain amount, than the number of the disk on which the first data block in the next stripe is located;
  • this amount is the row number of the position of the mapping group corresponding to the stripe in the Latin square used.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A RAID-based data storage system with global resource sharing, comprising a first number of disks, where data is stored on the disks using a RAID mechanism; blocks on different disks form stripes, and at least one of the blocks of a stripe stores parity information. The stripe width is smaller than the first number, and the data layout of the data storage system satisfies the following characteristics: any two physical blocks within a stripe are distributed on different disks; each disk holds the same number of data blocks and the same number of parity blocks; and the other data of the stripes associated with the data on any one disk is evenly distributed across all remaining disks. Both the normal data layout and the degraded data layout can be concretely realized by means of orthogonal Latin squares. The system removes the restriction in conventional data storage systems that the number of disks equal the stripe width, breaks the resource isolation between disk groups, and achieves a completely balanced reconstruction read load when a disk fails.

Description

RAID-based globally resource-shared data storage system
Technical Field
The present invention relates to storage array technology, and in particular to a fast-recovery storage array data organization.
Background Art
After three decades of development, the Redundant Array of Inexpensive/Independent Disks (RAID) has become today's mainstream highly reliable and highly available persistent storage technology, and is widely used in enterprise servers, modern data centers, cloud platforms and supercomputers.
Storage arrays use striping to organize multiple disks together, providing larger storage capacity and higher processing capability. At present, most organizations use parity-based RAID levels (RAID5 and RAID6), relying on the redundant data inside each stripe to achieve fault-tolerant protection of the data.
In the traditional data organization of storage arrays, disks are divided into disk groups with different RAID levels and different numbers of disks, and the disk groups form storage islands isolated from one another. When a disk fails, all surviving disks in the failed disk group participate in data reconstruction: the entire contents of each surviving disk must be read, and the reconstructed data must be written to a single hot-spare disk. This isolated mode prevents the remaining idle disk groups from sharing the reconstruction load. In recent years, as disk capacity has kept growing, large-capacity disk arrays face rebuild times of tens of hours or even days. During this long repair process, upper-layer services face a high risk of data loss and prolonged I/O performance degradation. The data organization of current storage arrays therefore suffers from slow data reconstruction.
Summary of the Invention
In view of the above problems, the present invention has been made.
According to one aspect of the present invention, a RAID-based data storage system with global resource sharing (hereinafter sometimes referred to as RAID+) is provided, comprising a first number of disks, where data is stored on the disks using a RAID mechanism; blocks on different disks form stripes, and at least one of the blocks of a stripe stores parity information; the stripe width is smaller than the first number, and the data layout of the data storage system satisfies the following characteristics: any two physical blocks within a stripe are distributed on different disks; each disk holds the same number of data blocks and the same number of parity blocks; the other data of the stripes associated with the data on any one disk is evenly distributed across all remaining disks.
According to an embodiment of the present invention, the first number is a prime power, denoted n, n is greater than or equal to 4, and the stripe width is denoted k. For n, (n-1) orthogonal Latin squares can be obtained, and the data layout of the data storage system is generated as follows: take k of the (n-1) orthogonal Latin squares, ignore the row in which all k squares have identical element values, then traverse all remaining positions in row-major order and combine the element values at the same (row, column) position into a mapping group; each mapping group corresponds to one stripe, and each element value in the mapping group indicates the number of the disk on which the corresponding block of that stripe is placed.
According to an embodiment of the present invention, k orthogonal Latin squares are generated according to the theorem below, the first row of each orthogonal Latin square is ignored, and the first orthogonal Latin square is denoted L0. Let the element in row i, column j of the m-th orthogonal Latin square be Lm-1[i,j]; then the mapping group (L0[i,j], L1[i,j], ..., Lm-1[i,j], ..., Lk-1[i,j]) indicates the numbers of the disks on which the blocks of the ((i-1)*n+j)-th stripe are placed, where the 1st block is placed on disk L0[i,j], the m-th block on disk Lm-1[i,j], and the k-th block on disk Lk-1[i,j],
where the data of each disk is placed in units of blocks.
Theorem: when n is a prime power, for the i-th Latin square fi (i ∈ [1, n-1]) in the complete orthogonal Latin square group, its element in row x, column y is fi[x,y] = i·x + y. The operators "·" and "+" here represent multiplication and addition in the finite field, respectively.
According to an embodiment of the present invention, when a disk fails, for each faulty stripe associated with the failed disk, the data used to compute the reconstruction is read concurrently from the other disks associated with that faulty stripe, and the reconstructed data is stored in the free space reserved on all other disks. The number of the disk to which the reconstructed data is written is determined as follows: one of the (n-1) orthogonal Latin squares other than the k orthogonal Latin squares already used is selected, called the (k+1)-th orthogonal Latin square; for the failed disk and for each associated faulty stripe, the position on the orthogonal Latin squares corresponding to that faulty stripe is determined, and the element value at that position in the (k+1)-th orthogonal Latin square is obtained; this element value indicates the number of the disk to which the reconstructed data block is allocated, and the reconstructed data block is stored in the free space of the disk indicated by that number.
According to an embodiment of the present invention, when a further disk fails, one of the (n-1) orthogonal Latin squares other than the (k+1) orthogonal Latin squares already used is selected, called the (k+2)-th orthogonal Latin square; for the failed disk and for each associated faulty stripe, the position on the orthogonal Latin squares corresponding to that faulty stripe is determined, and the element value at that position in the (k+2)-th orthogonal Latin square is obtained; this element value indicates the number of the disk to which the reconstructed data block is allocated, and the reconstructed data block is stored in the free space of the disk indicated by that number.
According to an embodiment of the present invention, when p disks fail simultaneously, the stripes associated with any of the p failed disks are determined; for each stripe associated with any of the p failed disks, the number of its data blocks located on the p failed disks is determined; stripes with more data blocks located on the p failed disks are assigned a higher recovery priority; and the stripes with higher recovery priority are recovered first.
According to an embodiment of the present invention, the data storage system stores data with different storage templates. For incoming first data to be stored in a first template manner, a first corresponding space on the first number of disks is allocated to the first data; for this first corresponding space, the mapping relationship between the data stripes of the first template and the first corresponding space is established in the orthogonal-Latin-square-based manner described above. For incoming second data to be stored in a second template manner, a second corresponding space on the first number of disks is allocated to the second data; for this second corresponding space, the mapping relationship between the data stripes of the second template and the second corresponding space is established in the orthogonal-Latin-square-based manner described above.
According to an embodiment of the present invention, different templates differ in at least one of RAID level, stripe width, physical block size, and inter-stripe addressing policy.
According to an embodiment of the present invention, the corresponding spaces are called logical volumes. Each logical volume uses the same type of data template as the granularity of storage space allocation; the physical location of each data template in a logical volume is tracked by indexing, and metadata is maintained to realize the mapping between user volumes and physical space, with the metadata cached in memory at run time.
According to an embodiment of the present invention, when a user request arrives, the specific physical access location is determined by locating the data template, querying the index table, locating the stripe, locating the position inside the stripe, and computing the global location.
According to an embodiment of the present invention, for data to be stored, when storage in a sequential-read-friendly manner is desired, the mapping relationship between stripes and disks is determined in the orthogonal-Latin-square-based manner described above, where the parity block of a stripe is the last block of the stripe, and the stripes are ordered such that the number of the disk holding the last block of one stripe is the number of the disk holding the first data block of the next stripe.
According to an embodiment of the present invention, for data to be stored, when storage in a sequential-write-friendly manner is desired, the parity block of a stripe is the last block of the stripe, and the stripes are ordered such that the number of the disk holding the last block of one stripe is smaller than the number of the disk holding the first data block of the next stripe by an amount equal to the row number of the position, in the Latin square used, of the mapping group corresponding to that stripe.
The above embodiments of the present invention provide a normal data layout, which is used to achieve a balanced distribution of short data stripes over a larger disk pool;
the above embodiments of the present invention provide a degraded data layout, which is used to guide the redistribution of reconstructed data after a disk failure;
the above embodiments of the present invention provide a method of continuous self-healing recovery, with which the storage array achieves self-healing recovery under a series of disk failures.
The above technical solutions have one or more of the following advantages or beneficial effects:
the normal data layout designed in the embodiments of the present invention removes the restriction in conventional data storage systems that the number of disks equal the stripe width, breaks the resource isolation between disk groups, and achieves a completely balanced reconstruction read load when a disk fails. The designed degraded data layout distributes the lost data evenly over all remaining disks without manually replacing the failed disk, achieving a completely balanced reconstruction write load. With both read and write loads balanced, the rebuild speed of the storage array is improved, and under a series of disk failures the storage array achieves continuous self-healing recovery without manual intervention.
Brief Description of the Drawings
Figure 1 shows an exemplary complete orthogonal Latin square group;
Figure 2 shows the 42 legal values satisfying the prime-power constraint when the number of disks is between 4 and 128;
Figures 3(a)-(c) illustrate an exemplary construction process of the normal data layout of a data storage array based on orthogonal Latin squares according to an embodiment of the present invention;
Figures 4(a)-(c) illustrate an exemplary construction of the degraded data layout provided by an embodiment of the present invention;
Detailed Description
To make the technical problems to be solved, the technical solutions and the advantages of the present invention clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments.
It should be noted that "the same" in "each disk holds the same number of data blocks and the same number of parity blocks" should be understood as exactly or substantially the same; that is, when the amount of data is sufficiently large, the orthogonal-Latin-square data layout mechanism of the present invention guarantees this, but it is not excluded that in practical applications, when the amount of data is relatively small, the data blocks distributed on the disks are substantially but not exactly the same. Similarly, "the other data of the stripes associated with the data on any one disk is evenly distributed across all remaining disks" also refers to the case of a large amount of data; obviously, in extreme cases, for example when there are only one or two stripes, it is impossible for the other data of all stripes associated with the data on disk A to be evenly distributed over disks B, C, D, E, and so on.
Stripe width or stripe size refers to the number of disks spanned by a stripe.
The embodiments of the present invention construct the data layout of the data storage system (hereinafter sometimes referred to as RAID+) by means of the mathematical properties of orthogonal Latin squares from combinatorics; the elegant properties of orthogonal Latin squares enable them to support the balanced data distribution desired by the present invention.
The definitions and theorems concerning orthogonal Latin squares are summarized below; for details, see the literature on orthogonal Latin squares, for example Reference 1: Bose R C, Shrikhande S S. On the construction of sets of mutually orthogonal latin squares and the falsity of a conjecture of Euler. Transactions of the American Mathematical Society, 1960, 95(2):191–209.
Definition 1: In an n×n matrix, if there are only n distinct element values and each element value appears exactly once in every row and every column of the matrix, the matrix is called a Latin square of order n.
Definition 2: Let L1 and L2 be two Latin squares of the same order n. When the two are superimposed, the elements at the same position combine to form an ordered pair. If each of the n² ordered pairs appears exactly once, L1 and L2 are said to be mutually orthogonal.
Definition 3: In a set consisting of several Latin squares of the same order, if any two Latin squares in the set are orthogonal, the set is called a set of mutually orthogonal Latin squares (MOLS).
Theorem 1: For a set of mutually orthogonal Latin squares of order n, the number of Latin squares in the set cannot exceed n-1, and this upper bound is attained when n is a prime power.
Theorem 2: When n is a prime power, for the i-th Latin square fi (i ∈ [1, n-1]) in the complete orthogonal Latin square group, its element in row x, column y is fi[x,y] = i·x + y; see Reference 1 above for details. The operators "·" and "+" here represent multiplication and addition in the finite field, respectively.
Theorem 3: If two Latin squares of order n are mutually orthogonal, then at the n positions where any number d (d ∈ [0, n-1]) appears in one Latin square, the other Latin square has exactly n distinct elements.
As stated above, theory shows that when the order n of the matrix is a prime power, n-1 distinct orthogonal Latin squares can be constructed. For example, as shown in Figure 1, when the order n is 5, 4 distinct orthogonal Latin squares can be generated.
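As an editorial illustration of Theorem 2 (not part of the original description), the following minimal Python sketch generates mutually orthogonal Latin squares for the special case where n is prime, so that finite-field arithmetic reduces to arithmetic modulo n; for a true prime power n = p^m with m > 1, the products and sums would have to be carried out in GF(p^m) instead.

```python
def mols(n, count):
    """Generate `count` mutually orthogonal Latin squares of prime order n.

    Square i (i = 1 .. n-1) has element f_i[x][y] = (i*x + y) mod n, which is
    Theorem 2 restricted to prime n (mod-n arithmetic stands in for GF(n)).
    """
    assert 2 <= n and 1 <= count <= n - 1
    return [[[(i * x + y) % n for y in range(n)] for x in range(n)]
            for i in range(1, count + 1)]

if __name__ == "__main__":
    L0, L1, L2, L3 = mols(5, 4)          # a complete group of order-5 squares
    # Orthogonality of L0 and L1: all 25 ordered pairs occur exactly once.
    pairs = {(L0[x][y], L1[x][y]) for x in range(5) for y in range(5)}
    assert len(pairs) == 25
    # Row 0 is identical in every square; it is the row the layout ignores.
    assert L0[0] == L1[0] == L2[0] == L3[0]
```

For n = 5 this yields a complete group of four order-5 squares of the kind shown in Figure 1, with the row on which all squares agree being the one that the layout construction below ignores.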
By means of the mathematical properties of orthogonal Latin squares, the storage array technology according to the embodiments of the present invention produces a normal data layout satisfying three characteristics: first, any two physical blocks within a stripe are distributed on different disks, so the layout is fault-tolerant; second, each disk holds the same number of data blocks and parity blocks, so the load on every disk in the layout is completely balanced; third, the data related to the data on any one disk is evenly distributed over all remaining disks, so the read load during data reconstruction is completely balanced.
According to an embodiment of the present invention, the first number is a prime power, denoted n, n is greater than or equal to 4, and the stripe width is denoted k; for n, (n-1) orthogonal Latin squares can be obtained, and the data layout of the data storage system is generated as follows:
take k of the (n-1) orthogonal Latin squares, ignore the row in which all k squares have identical element values, then traverse all remaining positions in row-major order and combine the element values at the same (row, column) position into a mapping group; each mapping group corresponds to one stripe, and each element value in the mapping group indicates the number of the disk on which the corresponding block of that stripe is placed.
It should be noted that the restriction that the number of disks must be a prime power may sound harsh, but practical results show that there are plenty of valid values to choose from, and the gaps between them are small enough to meet users' needs. For example, when the number of disks is between 4 and 128, the following 42 legal values of n are available: 4, 5, 7, 8, 9, 11, 13, 16, 17, 19, 23, 25, 27, 29, 31, 32, 37, 41, 43, 47, 49, 53, 59, 61, 64, 67, 71, 73, 79, 81, 83, 89, 97, 101, 103, 107, 109, 113, 121, 125, 127 and 128. Figure 2 shows these 42 legal values, with the y-axis representing the interval between adjacent legal values. As can be seen from the figure, there are only two points with a gap of 8, and the gaps at the vast majority of points lie in [2, 4]. In other words, for an illegal disk-count configuration, the user only needs to add or remove a small number of disks to satisfy the RAID+ layout requirements. Since the envisioned disk pool holds tens to hundreds of disks, there is a large number of legal n values to choose from.
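A quick way to check the claim about legal disk counts — this snippet is my own verification aid, not text from the patent — is to enumerate the prime powers in the stated range:

```python
def is_prime_power(n):
    """True if n = p^m for some prime p and integer m >= 1."""
    if n < 2:
        return False
    for p in range(2, int(n ** 0.5) + 1):
        if n % p == 0:
            while n % p == 0:
                n //= p
            return n == 1        # n was a power of its smallest prime factor p
    return True                  # n itself is prime

legal = [n for n in range(4, 129) if is_prime_power(n)]
assert len(legal) == 42 and legal[0] == 4 and legal[-1] == 128
```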
Specifically, in one example, k orthogonal Latin squares are generated according to Theorem 2 above, the first row of each orthogonal Latin square is ignored, and the first orthogonal Latin square is denoted L0. Let the element in row i, column j of the m-th orthogonal Latin square be Lm-1[i,j]; then the mapping group (L0[i,j], L1[i,j], ..., Lm-1[i,j], ..., Lk-1[i,j]) indicates the numbers of the disks on which the blocks of the ((i-1)*n+j)-th stripe are placed, where the 1st block is placed on disk L0[i,j], the m-th block on disk Lm-1[i,j], and the k-th block on disk Lk-1[i,j]; the data of each disk is placed in units of blocks.
Figures 3(a)-(c) illustrate an exemplary construction process of the normal data layout of a data storage array based on orthogonal Latin squares according to an embodiment of the present invention. For a number of disks n (5 in Figures 3(a)-(c)) and a stripe size k (3 in the example of Figures 3(a)-(c)), the data layout is constructed as follows. First, k orthogonal Latin squares of order n are generated, as shown in Figure 3(a), and the first row, which is identical in all k orthogonal Latin squares, is removed; in what follows, rows and columns are numbered with respect to the matrix after this first row has been removed, so that, for example, the first row of the matrix L0 referred to below is {1, 2, 3, 4, 0}. Then, in row-major order, all remaining positions of the orthogonal Latin squares are traversed, and the numbers at the same position (i.e., the position indicated by the same row and column numbers within each orthogonal Latin square) of the k orthogonal Latin squares are combined to form the mapping tuples of the stripes shown in Figure 3(b). For example, the mapping tuple of the first stripe a is the combination of the numbers in the first row, first column of the three orthogonal Latin squares, namely a: (1, 2, 3), because the number in the first row, first column of the orthogonal Latin square L0 is 1, the number in the first row, first column of L1 is 2, and the number in the first row, first column of L2 is 3. The mapping tuple of the next, second stripe b is the combination of the values in the first row, second column of the three orthogonal Latin squares, namely b: (2, 3, 4), and so on; the mapping tuple of the 20th stripe t (4*5 = 20, the number of elements in four rows and five columns) is the combination of the values in the 4th row, 5th column of the three orthogonal Latin squares, namely t: (3, 2, 1).
The numbers in a stripe tuple in Figure 3(b) denote the disk numbers on which the physical blocks of the stripe are placed. For example, a: (1, 2, 3) means that the 3 physical blocks of the first physical stripe (numbered a) are placed on disks 1, 2 and 3; b: (2, 3, 4) means that the 3 physical blocks of the second physical stripe (numbered b) are placed on disks 2, 3 and 4; and so on, until finally t: (3, 2, 1) means that the three physical blocks of the twentieth physical stripe (numbered t) are placed on disks 3, 2 and 1 respectively. Finally, the physical blocks are placed on the corresponding disks in stripe order, yielding the normal data layout shown in Figure 3(c). For the division of a stripe into data blocks and a parity block, it suffices to use a fixed column of the mapping tuple to guide the placement of the parity block.
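The construction just described can be summarized in a short sketch — my own illustration under the same conventions as Figures 3(a)-(c): n = 5 disks, stripe width k = 3, squares taken from Theorem 2 for prime n, with the row on which all squares agree (row 0 under that construction) skipped.

```python
from collections import Counter

def normal_layout(n, k):
    """Return the list of mapping tuples: stripe index -> disks of its blocks."""
    squares = [[[(i * x + y) % n for y in range(n)] for x in range(n)]
               for i in range(1, k + 1)]            # L0 .. L(k-1), prime n
    return [tuple(sq[x][y] for sq in squares)       # row-major traversal,
            for x in range(1, n) for y in range(n)] # skipping the common row 0

if __name__ == "__main__":
    layout = normal_layout(5, 3)
    assert layout[0] == (1, 2, 3)      # stripe a in Fig. 3(b)
    assert layout[1] == (2, 3, 4)      # stripe b
    assert layout[-1] == (3, 2, 1)     # stripe t
    # Every disk holds the same number of blocks (12 here), i.e. the load is
    # balanced, and no stripe places two of its blocks on the same disk.
    counts = Counter(d for s in layout for d in s)
    assert set(counts.values()) == {12}
    assert all(len(set(s)) == len(s) for s in layout)
```

Designating a fixed element of each tuple (for example the last one) as the parity disk then fixes the data/parity split per stripe, as the description notes.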
By means of the mathematical properties of orthogonal Latin squares, the normal data layout generated according to the embodiments of the present invention satisfies three characteristics: first, any two physical blocks within a stripe are distributed on different disks, so the layout is fault-tolerant; second, each disk holds the same number of data blocks and parity blocks, so the load on every disk in the layout is completely balanced; third, the data related to the data on any one disk is evenly distributed over all remaining disks, so the read load during data reconstruction is completely balanced.
The normal data layout of the data storage array according to the embodiments of the present invention has been described above, in which distributed free blocks replace a centralized hot-spare disk. The solution according to embodiments of the present invention for the case where a disk failure occurs in the storage array is described below.
According to embodiments of the present invention, when a disk in the storage array fails, the reconstructed data is written evenly into the remaining surviving disks.
According to an embodiment of the present invention, when a disk fails, for each faulty stripe associated with the failed disk, the data used to compute the reconstruction is read concurrently from the other disks associated with that faulty stripe, and the reconstructed data is stored in the free space reserved on all other disks.
Specifically, the number of the disk to which the reconstructed data is written can be determined as follows:
select one of the (n-1) orthogonal Latin squares other than the k orthogonal Latin squares already used, called the (k+1)-th orthogonal Latin square;
for the failed disk and for each associated faulty stripe, determine the position on the orthogonal Latin squares corresponding to that faulty stripe, and obtain the element value at that position in the (k+1)-th orthogonal Latin square; this element value indicates the number of the disk to which the reconstructed data block is allocated;
in this way, the reconstructed data block can be stored in the free space of the disk indicated by that number.
According to an embodiment of the present invention, when a further disk fails, one of the (n-1) orthogonal Latin squares other than the (k+1) orthogonal Latin squares already used is selected, called the (k+2)-th orthogonal Latin square; for the failed disk and for each associated faulty stripe, the position on the orthogonal Latin squares corresponding to that faulty stripe is determined, and the element value at that position in the (k+2)-th orthogonal Latin square is obtained; this element value indicates the number of the disk to which the reconstructed data block is allocated, and the reconstructed data block can thus be stored in the free space of the disk indicated by that number.
According to another embodiment of the present invention, when p disks fail simultaneously, the following processing can be performed: determine the stripes associated with any of the p failed disks; for each such stripe, determine the number of its data blocks located on the p failed disks; assign a higher recovery priority to stripes with more data blocks located on the p failed disks; and recover the stripes with higher recovery priority first.
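As a brief sketch of the prioritization just described (the function and variable names are mine, for illustration only): stripes that lost more blocks to the p simultaneously failed disks are queued for rebuild first, since they are closest to exceeding the code's fault tolerance.

```python
def recovery_order(layout, failed_disks):
    """Order faulty stripe indices by how many blocks they lost (most first)."""
    failed = set(failed_disks)
    lost = {s: sum(d in failed for d in disks)
            for s, disks in enumerate(layout)
            if any(d in failed for d in disks)}
    return sorted(lost, key=lost.get, reverse=True)
```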
An example is given below of how the degraded data layout is constructed when a disk fails.
In the examples of the present invention, the degraded data layout stores the reconstructed data in free space reserved in advance on each disk, and guides the redistribution of the reconstructed data by means of orthogonal Latin squares. When the order n of the matrix is a prime power, n-1 distinct orthogonal Latin squares can be constructed; the construction of the normal data layout uses k of them, leaving n-k-1 orthogonal Latin squares unused. When constructing the degraded data layout, a new orthogonal Latin square is first chosen arbitrarily from these n-k-1 orthogonal Latin squares; then the number at the same position of the new orthogonal Latin square as the faulty stripe is selected as the disk number to which the stripe's lost data is redistributed; finally, the lost data of the faulty stripe is reconstructed into the free area on that disk.
When the normal data layout was introduced above, the example used was the disk data array shown in Figures 3(a)-(c), with 5 disks and orthogonal Latin squares of order 5, using the orthogonal Latin squares L0, L1 and L2 of the complete group of order-5 orthogonal Latin squares. Continuing with this example, the construction of the degraded data layout when one disk fails is described below with reference to Figures 4(a)-(c). The construction of the degraded data layout makes use of the last remaining orthogonal Latin square, L3.
Without loss of generality, assume here that disk D0 fails.
As shown in Figures 4(a)-(c), in the 5×12 normal data layout, each disk holds 12 physical blocks, so the failure of disk D0 causes the loss of the data in 12 physical blocks. The right column of Figure 4(c) clearly shows the distribution of stripes over the disks: the first physical block on disk D0 belongs to stripe c, which means that the set of redundant information inside stripe c (2 data blocks + 1 parity block) is no longer complete. Proceeding downwards in the same way, the set of faulty stripes {c, d, e, ..., s} is obtained. The set contains exactly 12 distinct stripes, because the normal data layout guarantees that no two physical blocks of the same stripe are ever allocated to the same disk.
To repair the 12 faulty stripes, the data storage system according to the embodiments of the present invention not only needs to know the physical locations of the remaining intact data of these stripes, but also needs to find new free areas in the disk space to store the reconstructed data.
The middle column of Figure 4(b) gives the complete stripe mapping tuple information. The mapping tuple of stripe c is (3, 4, 0); this mapping relationship shows that stripe c has lost its parity block (the last number in the tuple is the disk of the parity block), and that the remaining two data blocks are stored on disks D3 and D4. By the same reasoning, the mapping tuple of stripe d is (4, 0, 1): it has lost one data block, the remaining data block is distributed on disk D4, and the parity block is distributed on disk D1. In addition to the original physical mapping, the mapping tuple (α, β, γ) of a faulty stripe in the figure also carries a new physical location θ. θ denotes the new disk number to which the physical block lost inside the stripe is remapped. Still taking stripe c as an example, the new mapped disk number is 1, indicating that after the lost data has been reconstructed it is stored on disk D1. The number 1 differs from all of (3, 4, 0) in the original mapping relationship, so the physical blocks within the same stripe remain distributed on different disks. The letter A, together with braces, in Figure 4(c) shows the physical blocks newly added on each disk. The 12 lost physical blocks form exactly a 4×3 degraded data layout.
Figure 4(a) shows the method of generating the new disk number. Below the original k orthogonal Latin squares stacked together, a new orthogonal Latin square L3 is introduced. If the position of a faulty stripe in the table is (x, y), the number at the same position (x, y) of the orthogonal Latin square L3 is taken as the remapped disk number. In the degraded data layout, it can be seen that the write load of data reconstruction is spread evenly across all surviving disks and the layout is still fault-tolerant. A careful count of the read load on each disk shows that each surviving disk in the data template reads 6 physical blocks, i.e., the read load is also balanced. After data reconstruction is complete, the generated degraded data layout has the following characteristics: first, any two physical blocks of a faulty stripe are still placed on different disks, so the degraded data layout retains fault tolerance; second, the number of new physical blocks added on each disk is equal, achieving a completely balanced reconstruction write load. Since the normal data layout already achieves a completely balanced reconstruction read load, with both the reconstruction read and write loads balanced, all disks participate in reads and writes in parallel, which improves the rebuild speed of the storage array.
After the rebuild for a single-disk failure is complete, if another disk failure occurs, the construction process of the degraded data layout is simply repeated with a new orthogonal Latin square, achieving continuous fast recovery. Specifically, for the t-th disk failure, an orthogonal Latin square is chosen from the n-k-t remaining orthogonal Latin squares to guide the construction of the degraded data layout, yielding the degraded data layout over n-t disks. After n-k-1 disk failures have occurred, the number of disks shrinks to k+1; if another disk failure then occurs, the stripe width k equals the number of disks and there is only one possible disk choice for the reconstructed data of each faulty stripe, which yields the degraded data layout over k disks. Throughout this process, because the degraded data layout always retains fault tolerance, the storage array can achieve continuous self-healing recovery without manually replacing failed disks.
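The remapping rule of Figures 4(a)-(c) can be sketched as follows — again an editorial illustration for the prime case n = 5, where I take the spare square to be f4 from Theorem 2, playing the role of L3 in the figure.

```python
def degraded_remap(n, layout, failed_disk, spare_index):
    """Map each faulty stripe to the disk that receives its rebuilt block.

    `layout` is the mapping-tuple list of the normal layout; the spare square
    is f_(spare_index)[x][y] = (spare_index*x + y) mod n from Theorem 2.
    """
    spare = [[(spare_index * x + y) % n for y in range(n)] for x in range(n)]
    return {s: spare[s // n + 1][s % n]          # stripe s sits at (row, col)
            for s, disks in enumerate(layout)
            if failed_disk in disks}

if __name__ == "__main__":
    # Normal layout of Fig. 3 rebuilt inline: n = 5 disks, k = 3 (f1, f2, f3).
    layout = [tuple((i * x + y) % 5 for i in (1, 2, 3))
              for x in range(1, 5) for y in range(5)]
    remap = degraded_remap(5, layout, failed_disk=0, spare_index=4)
    assert len(remap) == 12            # 12 faulty stripes, as in Fig. 4
    assert remap[2] == 1               # stripe c: (3, 4, 0) rebuilt on D1
    # A rebuilt block never lands on the failed disk or on a disk the stripe
    # already uses, so stripes stay spread over distinct disks.
    assert all(d != 0 and d not in layout[s] for s, d in remap.items())
```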
According to the data storage system of one embodiment of the present invention, the data storage system stores data with different storage templates. For incoming first data to be stored in a first template manner, a first corresponding space on the first number of disks is allocated to the first data; for this first corresponding space, the mapping relationship between the data stripes of the first template and the first corresponding space is established according to the normal data layout scheme borrowing orthogonal Latin squares described above. For incoming second data to be stored in a second template manner, a second corresponding space on the first number of disks is allocated to the second data; for this second corresponding space, the mapping relationship between the data stripes of the second template and the second corresponding space is established according to the normal data layout scheme borrowing orthogonal Latin squares described above.
In one example, different templates differ in at least one of RAID level, stripe width, physical block size, and inter-stripe addressing policy.
A major advantage of the data storage system RAID+ of the embodiments of the present invention is that multiple virtual RAID arrays (i.e., RAID volumes) can be provided within the same disk pool, with each user volume serving a different user or workload. With multiple users, each user is assigned a different logical volume, and each logical volume uses the same type of data template as the granularity of storage space allocation.
As for the configuration of data templates, different logical volumes are independent of each other, and the system administrator can configure different stripe widths, physical block sizes, RAID levels and inter-stripe addressing policies for different volumes.
RAID+ can track the physical location of each data template in a logical volume through indexing. Since each data template is relatively large, containing n×(n-1)×k physical blocks, and the metadata only needs to record the physical location of each data template rather than mapping every physical block inside a template individually, RAID+ can realize the mapping between user volumes and physical space with only a small amount of additional metadata overhead. By caching the metadata in memory, RAID+ can look up the physical location of a template quickly, reducing the processing delay during address translation.
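To make the metadata argument concrete, here is a back-of-the-envelope calculation; the 64 KiB block size and the one-parity-block-per-stripe template are my assumptions for illustration, not figures from the patent.

```python
n, k, block = 5, 3, 64 * 1024                      # assumed 64 KiB physical blocks
blocks_per_template = n * (n - 1) * k              # 60 physical blocks per template
data_blocks = n * (n - 1) * (k - 1)                # one parity block per stripe
user_bytes = data_blocks * block                   # 2.5 MiB of user data per template
index_entries_per_tib = (1 << 40) // user_bytes    # ~420k entries to map 1 TiB
print(blocks_per_template, user_bytes, index_entries_per_tib)
```

One index entry per template, rather than per block, is what keeps the mapping metadata small enough to cache in memory.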
When a user request arrives, in order to locate the specific physical access location, RAID+ completes the whole addressing process in the following 5 steps (Algorithm 1):
1. Locate the data template: from the logical position x of the user request, combining x with the user data space size lt inside a single data template, compute the accessed template number #t and the offset δt inside the template;
2. Query the index table: from the index table of the user volume, call the lookup function to obtain the physical offset offt of template #t in the disk space;
3. Locate the stripe: inside the template, from the offset δt inside the template and the data space size ls inside a stripe, compute the stripe number #s inside the template where the request falls, and the offset δs inside the stripe; then, by querying the mapping between stripes and disks, obtain the set disks of disk numbers on which the physical blocks of stripe #s reside;
4. Locate the position inside the stripe: from the offset δs inside the stripe and the access order of logical blocks under the intra-stripe layout, determine the physical block number id accessed inside the stripe and the disk offset off of the logical block inside the template;
5. Compute the global location: from the physical block number id and the stripe-to-disk mapping disks, obtain the disk number disks[id] that the request accesses; then, adding the template offset offt and the logical block offset off gives the global disk offset.
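The five steps can be sketched schematically as below; every helper name here (template_index, stripe_to_disks, intra_stripe_block, intra_template_offset) is a placeholder of mine standing in for the corresponding lookup, not an interface defined by the patent.

```python
def translate(x, lt, ls, template_index, stripe_to_disks,
              intra_stripe_block, intra_template_offset):
    t_no, dt = divmod(x, lt)                    # 1. locate the data template
    off_t = template_index[t_no]                # 2. query the index table
    s_no, ds = divmod(dt, ls)                   # 3. locate the stripe
    disks = stripe_to_disks(s_no)               #    stripe -> disk-number list
    blk = intra_stripe_block(ds)                # 4. block index inside the stripe
    off = intra_template_offset(s_no, blk)      #    disk offset inside the template
    return disks[blk], off_t + off              # 5. global (disk, offset)
```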
According to one embodiment of the present invention, for a data storage system that lays out data using the properties of orthogonal Latin squares, for data to be stored, when it is desired to store it in a sequential-read-friendly manner, the mapping relationship between stripes and disks is determined by the orthogonal-Latin-square-based layout described above, where the parity block of a stripe is the last block of the stripe, and the stripes are ordered such that the number of the disk holding the last block of one stripe is the number of the disk holding the first data block of the next stripe.
According to one embodiment of the present invention, for a data storage system that lays out data using the properties of orthogonal Latin squares, for data to be stored, when it is desired to store it in a sequential-write-friendly manner, the parity block of a stripe is the last block of the stripe, and the stripes are ordered such that the number of the disk holding the last block of one stripe is smaller than the number of the disk holding the first data block of the next stripe by an amount equal to the row number of the position, in the Latin square used, of the mapping group corresponding to that stripe.
Embodiments of the present invention have been described above. The above description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (12)

  1. A RAID-based data storage system with global resource sharing, comprising a first number of disks, where data is stored on the disks using a RAID mechanism; blocks on different disks form stripes, and at least one of the blocks of a stripe stores parity information; wherein the stripe width is smaller than the first number, and the data layout of the data storage system satisfies the following characteristics:
    any two physical blocks within a stripe are distributed on different disks;
    each disk holds the same number of data blocks and the same number of parity blocks;
    the other data of the stripes associated with the data on any one disk is evenly distributed across all remaining disks.
  2. The data storage system according to claim 1, wherein the first number is a prime power, denoted n, n is greater than or equal to 4, and the stripe width is denoted k; for n, (n-1) orthogonal Latin squares can be obtained, and the data layout of the data storage system is generated as follows:
    take k of the (n-1) orthogonal Latin squares, ignore the row in which all k squares have identical element values, then traverse all remaining positions in row-major order and combine the element values at the same (row, column) position into a mapping group; each mapping group corresponds to one stripe, and each element value in the mapping group indicates the number of the disk on which the corresponding block of that stripe is placed.
  3. The data storage system according to claim 2, wherein
    the k orthogonal Latin squares are generated according to the theorem below, the first row of each orthogonal Latin square is ignored, and the first orthogonal Latin square is denoted L0; let the element in row i, column j of the m-th orthogonal Latin square be Lm-1[i,j]; then the mapping group (L0[i,j], L1[i,j], ..., Lm-1[i,j], ..., Lk-1[i,j]) indicates the numbers of the disks on which the blocks of the ((i-1)*n+j)-th stripe are placed, where the 1st block is placed on disk L0[i,j], the m-th block on disk Lm-1[i,j], and the k-th block on disk Lk-1[i,j],
    where the data of each disk is placed in units of blocks.
    Theorem: when n is a prime power, for the i-th Latin square fi (i ∈ [1, n-1]) in the complete orthogonal Latin square group, its element in row x, column y is fi[x,y] = i·x + y; the operators "·" and "+" here represent multiplication and addition in the finite field, respectively.
  4. The data storage system according to claim 2, wherein when a disk fails, for each faulty stripe associated with the failed disk, the data used to compute the reconstruction is read concurrently from the other disks associated with that faulty stripe, and the reconstructed data is stored in the free space reserved on all other disks,
    wherein the number of the disk to which the reconstructed data is written is determined as follows:
    one of the (n-1) orthogonal Latin squares other than the k orthogonal Latin squares already used is selected, called the (k+1)-th orthogonal Latin square;
    for the failed disk and for each associated faulty stripe, the position on the orthogonal Latin squares corresponding to that faulty stripe is determined, and the element value at that position in the (k+1)-th orthogonal Latin square is obtained, the element value indicating the number of the disk to which the reconstructed data block is allocated;
    the reconstructed data block is stored in the free space of the disk indicated by that number.
  5. The data storage system according to claim 4, wherein when a further disk fails, one of the (n-1) orthogonal Latin squares other than the (k+1) orthogonal Latin squares already used is selected, called the (k+2)-th orthogonal Latin square;
    for the failed disk and for each associated faulty stripe, the position on the orthogonal Latin squares corresponding to that faulty stripe is determined, and the element value at that position in the (k+2)-th orthogonal Latin square is obtained, the element value indicating the number of the disk to which the reconstructed data block is allocated;
    the reconstructed data block is stored in the free space of the disk indicated by that number.
  6. The data storage system according to claim 2, wherein when p disks fail simultaneously, the stripes associated with any of the p failed disks are determined;
    for each stripe associated with any of the p failed disks, the number of its data blocks located on the p failed disks is determined; and
    stripes with more data blocks located on the p failed disks are assigned a higher recovery priority;
    the stripes with higher recovery priority are recovered first.
  7. The data storage system according to claim 2, wherein the data storage system stores data with different storage templates,
    wherein for incoming first data to be stored in a first template manner, a first corresponding space on the first number of disks is allocated to the first data; for the first corresponding space, the mapping relationship between the data stripes of the first template and the first corresponding space is established in the manner of claim 2;
    wherein for incoming second data to be stored in a second template manner, a second corresponding space on the first number of disks is allocated to the second data; for the second corresponding space, the mapping relationship between the data stripes of the second template and the second corresponding space is established in the manner of claim 2.
  8. The data storage system according to claim 7, wherein different templates differ in at least one of RAID level, stripe width, physical block size, and inter-stripe addressing policy.
  9. The data storage system according to claim 7, wherein the corresponding spaces are called logical volumes, and each logical volume uses the same type of data template as the granularity of storage space allocation,
    the physical location of each data template in a logical volume is tracked by indexing,
    metadata is maintained to realize the mapping between user volumes and physical space,
    and the metadata is cached in memory at run time.
  10. The data storage system according to claim 9, wherein when a user request arrives, the specific physical access location is determined by locating the data template, querying the index table, locating the stripe, locating the position inside the stripe, and computing the global location.
  11. The data storage system according to claim 2, wherein for data to be stored, when storage in a sequential-read-friendly manner is desired,
    the mapping relationship between stripes and disks is determined in the manner of claim 2, wherein the parity block of a stripe is the last block of the stripe,
    and the stripes are ordered such that the number of the disk holding the last block of one stripe is the number of the disk holding the first data block of the next stripe.
  12. The data storage system according to claim 2, wherein for data to be stored, when storage in a sequential-write-friendly manner is desired,
    the parity block of a stripe is the last block of the stripe,
    the stripes are ordered such that the number of the disk holding the last block of one stripe is smaller, by a certain amount, than the number of the disk holding the first data block of the next stripe,
    said amount being the row number of the position, in the Latin square used, of the mapping group corresponding to that stripe.
PCT/CN2017/110662 2017-11-13 2017-11-13 RAID-based globally resource-shared data storage system WO2019090756A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/CN2017/110662 WO2019090756A1 (zh) 2017-11-13 2017-11-13 RAID-based globally resource-shared data storage system
CN201780091514.4A CN111095217B (zh) 2017-11-13 2017-11-13 RAID-based globally resource-shared data storage system
US16/856,133 US10997025B2 (en) 2017-11-13 2020-04-23 RAID-based globally resource-shared data storage system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/110662 WO2019090756A1 (zh) 2017-11-13 2017-11-13 RAID-based globally resource-shared data storage system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/856,133 Continuation US10997025B2 (en) 2017-11-13 2020-04-23 RAID-based globally resource-shared data storage system

Publications (1)

Publication Number Publication Date
WO2019090756A1 true WO2019090756A1 (zh) 2019-05-16

Family

ID=66438698

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/110662 WO2019090756A1 (zh) 2017-11-13 2017-11-13 RAID-based globally resource-shared data storage system

Country Status (3)

Country Link
US (1) US10997025B2 (zh)
CN (1) CN111095217B (zh)
WO (1) WO2019090756A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110297601A (zh) * 2019-06-06 2019-10-01 清华大学 固态硬盘阵列构建方法、电子设备及存储介质
CN113835637A (zh) * 2020-03-19 2021-12-24 北京奥星贝斯科技有限公司 一种数据的写入方法、装置以及设备

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111124249B (zh) 2018-10-30 2023-09-22 伊姆西Ip控股有限责任公司 用于数据处理的方法、设备和计算机程序产品
US11209990B2 (en) * 2019-03-15 2021-12-28 Super Micro Computer, Inc. Apparatus and method of allocating data segments in storage regions of group of storage units
CN114415982B (zh) * 2022-03-30 2022-06-07 苏州浪潮智能科技有限公司 一种数据存储方法、装置、设备及可读存储介质
CN115712390B (zh) * 2022-11-14 2023-05-09 安超云软件有限公司 可用数据条带分片数确定方法及系统
CN116149576B (zh) * 2023-04-20 2023-07-25 北京大学 面向服务器无感知计算的磁盘冗余阵列重建方法及系统

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5488701A (en) * 1994-11-17 1996-01-30 International Business Machines Corporation In log sparing for log structured arrays
CN101482802A (zh) * 2009-02-18 2009-07-15 杭州华三通信技术有限公司 独立磁盘冗余阵列5扩展方法及装置
CN104765660A (zh) * 2015-04-24 2015-07-08 中国人民解放军国防科学技术大学 一种基于ssd的raid6系统的单盘快速恢复方法器
CN104965768A (zh) * 2014-01-23 2015-10-07 Dssd股份有限公司 用于存储系统中的服务感知数据放置的方法和系统

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8261016B1 (en) * 2009-04-24 2012-09-04 Netapp, Inc. Method and system for balancing reconstruction load in a storage array using a scalable parity declustered layout
US8453036B1 (en) * 2010-02-01 2013-05-28 Network Appliance, Inc. System and method for dynamically resizing a parity declustered group
CN101976176B (zh) * 2010-08-19 2011-12-14 北京同有飞骥科技股份有限公司 一种水平型分组并行分布校验的磁盘阵列的构建方法
JP6369226B2 (ja) * 2014-08-28 2018-08-08 富士通株式会社 情報処理装置、情報処理システム、情報処理システムの制御方法および情報処理装置の制御プログラム
CN107250975B (zh) * 2014-12-09 2020-07-10 清华大学 数据存储系统和数据存储方法
CN105930097B (zh) * 2015-05-20 2019-01-29 德州学院 一种消除局部并行中小写操作的分布校验式磁盘阵列

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5488701A (en) * 1994-11-17 1996-01-30 International Business Machines Corporation In log sparing for log structured arrays
CN101482802A (zh) * 2009-02-18 2009-07-15 杭州华三通信技术有限公司 独立磁盘冗余阵列5扩展方法及装置
CN104965768A (zh) * 2014-01-23 2015-10-07 Dssd股份有限公司 用于存储系统中的服务感知数据放置的方法和系统
CN104765660A (zh) * 2015-04-24 2015-07-08 中国人民解放军国防科学技术大学 一种基于ssd的raid6系统的单盘快速恢复方法器

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110297601A (zh) * 2019-06-06 2019-10-01 清华大学 固态硬盘阵列构建方法、电子设备及存储介质
CN110297601B (zh) * 2019-06-06 2020-06-23 清华大学 固态硬盘阵列构建方法、电子设备及存储介质
CN113835637A (zh) * 2020-03-19 2021-12-24 北京奥星贝斯科技有限公司 一种数据的写入方法、装置以及设备

Also Published As

Publication number Publication date
US10997025B2 (en) 2021-05-04
US20200250036A1 (en) 2020-08-06
CN111095217A (zh) 2020-05-01
CN111095217B (zh) 2024-02-06

Similar Documents

Publication Publication Date Title
WO2019090756A1 (zh) 资源全局共享的基于raid机制的数据存储系统
AU2017203459B2 (en) Efficient data reads from distributed storage systems
US9823980B2 (en) Prioritizing data reconstruction in distributed storage systems
US10146447B1 (en) Mapped RAID (redundant array of independent disks) in a data storage system with drive extents allocated to individual RAID extents from individual sub-groups of storage made up of ranges of logical block addresses defined across a group of hard disk drives
US9311194B1 (en) Efficient resource utilization in data centers
US11327668B1 (en) Predictable member assignment for expanding flexible raid system
JP6663482B2 (ja) 計算機システム、物理記憶デバイスの制御方法、および記録媒体
US10310752B1 (en) Extent selection with mapped raid
US10521145B1 (en) Method, apparatus and computer program product for managing data storage
CN104714758B (zh) 一种基于校验raid加入镜像结构的阵列构建方法及读写系统
CN110096218A (zh) 响应于向使用映射raid技术的数据存储系统添加存储驱动器,减少驱动器盘区分配变化
JP2016038767A (ja) ストレージ制御装置、ストレージ制御プログラム、及びストレージ制御方法
US11687272B2 (en) Method and system for dynamic topology-aware space allocation in a distributed system
US11868637B2 (en) Flexible raid sparing using disk splits
US11256428B2 (en) Scaling raid-based storage by redistributing splits
Li et al. Exploiting decoding computational locality to improve the I/O performance of an XOR-coded storage cluster under concurrent failures
TWI794541B (zh) 儲存空間之自動組態之設備及方法
US11630596B1 (en) Sharing spare capacity of disks with multiple sizes to parallelize RAID rebuild
US20230385167A1 (en) Balanced data mirroring distribution for parallel access
Thomasian Mirrored and hybrid disk arrays and their reliability
Han Studies of disk arrays tolerating two disk failures and a proposal for a heterogeneous disk array

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17931397

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17931397

Country of ref document: EP

Kind code of ref document: A1