US12008271B1 - Adaptive raid width and distribution for flexible storage - Google Patents

Adaptive raid width and distribution for flexible storage

Info

Publication number
US12008271B1
Authority
US
United States
Prior art keywords
distribution pattern
protection groups
storage
distribution
linear
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
US18/302,837
Inventor
Kuolin Hua
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dell Products LP
Original Assignee
Dell Products LP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2023-04-19
Publication date: 2024-06-11
Application filed by Dell Products LP
Priority to US18/302,837
Assigned to DELL PRODUCTS L.P.: ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: HUA, KUOLIN
Application granted
Publication of US12008271B1
Active legal status (current)
Anticipated expiration legal status

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601: Interfaces specially adapted for storage systems
    • G06F 3/0602: Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/0604: Improving or facilitating administration, e.g. storage management
    • G06F 3/0614: Improving the reliability of storage systems
    • G06F 3/0619: Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • G06F 3/0628: Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0629: Configuration or reconfiguration of storage systems
    • G06F 3/0631: Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • G06F 3/0638: Organizing or formatting or addressing of data
    • G06F 3/064: Management of blocks
    • G06F 3/0668: Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/067: Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Storage Device Security (AREA)

Abstract

A software-defined, server-based storage system is configured to support single node granular scaling and adaptive RAID width capabilities. The storage system includes multiple homogeneous storage nodes, each including a server and local storage. Aggregate storage is organized into same-size cells. RAID group members are distributed in cells across storage nodes in a recursive fractal pattern. The storage system is scaled by metamorphosing between recursive fractal distribution of the RAID groups and linear distribution of the RAID groups and splitting matrices of cells. When a sufficient number of new storage nodes have been added, new larger width RAID groups will be formed.

Description

TECHNICAL FIELD
The subject matter of this disclosure is generally related to data storage systems.
BACKGROUND
Scalable data storage systems for organizations are designed to have low initial cost and be reconfigurable to grow to meet future needs. For example, homogeneous software-defined, server-based storage area network (SAN) storage nodes can be used as modular storage system building blocks. Storage system resiliency is based on redundant array of independent disks (RAID) protection groups having members distributed across different storage nodes. However, the width of the implemented RAID level can limit the granularity at which storage capacity can be scaled. Moreover, reconfiguring the storage system to a different RAID level may be difficult, which is problematic because a modular storage system may initially support only a smaller RAID width but may eventually include enough storage nodes to be configured for a more efficient, larger RAID width.
SUMMARY
The examples described herein are not intended to be limiting. All examples, aspects, and features mentioned in this document can be combined in any technically possible way.
In accordance with some aspects, an apparatus comprises a plurality of homogeneous storage nodes that are interconnected via a network, each storage node comprising a server and local storage, each server comprising a multi-core processor, memory, and a storage controller configured to: subdivide storage capacity of the local storage into indexed same-size cells; create and distribute members of protection groups across the storage nodes in the cells in a recursive fractal distribution pattern; and metamorphose distribution of sets of the members of the protection groups from the recursive fractal distribution pattern to a linear distribution pattern in response to addition of new storage nodes, thereby enabling scaling in increments of single storage nodes.
In accordance with some aspects, a method comprises subdividing storage capacity of local storage of a plurality of homogeneous storage nodes that are interconnected via a network into indexed same-size cells; creating and distributing members of protection groups across the storage nodes in the cells in a recursive fractal distribution pattern; and metamorphosing distribution of sets of the members of the protection groups from the recursive fractal distribution pattern to a linear distribution pattern in response to addition of new storage nodes, thereby enabling scaling in increments of single storage nodes.
In accordance with some aspects, a non-transitory computer-readable storage medium stores instructions that when executed by a computer cause the computer to perform a method comprising subdividing storage capacity of local storage of a plurality of homogeneous storage nodes that are interconnected via a network into indexed same-size cells; creating and distributing members of protection groups across the storage nodes in the cells in a recursive fractal distribution pattern; and metamorphosing distribution of sets of the members of the protection groups from the recursive fractal distribution pattern to a linear distribution pattern in response to addition of new storage nodes, thereby enabling scaling in increments of single storage nodes.
BRIEF DESCRIPTION OF THE FIGURES
FIG. 1 illustrates a software-defined, server-based storage area network (SAN) with single node granular scaling and adaptive RAID width capabilities.
FIG. 2 illustrates matrices that represent organization of aggregate storage in RAID groups characterized by recursive fractal patterns.
FIGS. 3A and 3B illustrate scaling by metamorphosing from recursive fractal distribution to linear distribution.
FIGS. 4A and 4B illustrate scaling by metamorphosing from linear distribution to recursive fractal distribution.
FIGS. 5A, 5B, 6A, 6B, 7A, 7B, and 8 illustrate RAID width adaptation.
FIG. 9 illustrates a method for granular scaling and RAID width adaptation.
DETAILED DESCRIPTION
Some aspects, features and implementations described herein may include machines such as computers, electronic components, optical components, and processes such as computer-implemented steps. It will be apparent to those of ordinary skill in the art that the computer-implemented steps may be stored as computer-executable instructions on a non-transitory computer-readable medium. Furthermore, it will be understood by those of ordinary skill in the art that the computer-executable instructions may be executed on a variety of tangible processor devices. For ease of exposition, not every step, device or component that may be part of a computer or data storage system is described herein. Those of ordinary skill in the art will recognize such steps, devices, and components in view of the teachings of the present disclosure and the knowledge generally available to those of ordinary skill in the art. The corresponding machines and processes are therefore enabled and within the scope of the disclosure.
The terminology used in this disclosure is intended to be interpreted broadly within the limits of subject matter eligibility. The terms “logical” and “virtual” are used to refer to features that are abstractions of other features, e.g., and without limitation abstractions of tangible features. The term “physical” is used to refer to tangible features. For example, multiple virtual computing devices could operate simultaneously on one physical computing device. The term “logic” is used to refer to special purpose physical circuit elements and software instructions that are stored on a non-transitory computer-readable medium and implemented by multi-purpose tangible processors. The terms “disk” and “drive” are used interchangeably and are not intended to be limited to a particular type of non-volatile data storage media.
FIG. 1 illustrates a software-defined, server-based storage area network (SAN) with single node granular scaling and adaptive RAID width capabilities. The software-defined SAN includes multiple homogeneous storage nodes 104-1, 104-2, 104-3, 104-4 that are interconnected via an Internet Protocol (IP) network 112. Each storage node includes a server 100 and local storage 106. Each server 100 includes a multi-core CPU 102, a storage controller 108, and a memory bank 110. The CPU 102 includes L1 onboard cache. Each memory bank 110 includes L2/L3 cache and main memory implemented with one or both of volatile memory components such as Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM) and non-volatile memory (NVM) such as storage class memory (SCM). The local storage 106 includes one or more non-volatile disk drives. For example, the local storage may include solid-state drives (SSDs) based on electrically erasable programmable read-only memory (EEPROM) technology such as NAND and NOR flash memory, and hard disk drives (HDDs) with spinning disk magnetic storage media. Each node's local storage has the same type and quantity of storage devices with matching storage capacity, so every storage node has the same storage capacity.
The storage controllers 108 manage organization of the directly-attached local storage 106. The local storage is subdivided into equal size units of storage capacity referred to as cells. The cells are indexed, e.g., 1, 2, 3, . . . , and the cell indices are used uniformly for all storage nodes. Aggregate storage capacity of the SAN is organized into RAID protection groups with members located in cells that are distributed across different storage nodes. The number of cells may be an integer multiple of the target RAID width. Storage system efficiency and data availability are determined by the number of data and parity members per RAID protection group, which is referred to as RAID width. A large RAID width corresponds to lower parity overhead cost and greater storage efficiency. However, the storage system may initially be implemented with too few storage nodes for a large RAID width because storage nodes in a modular storage system are purchased and scaled based on evolving organizational needs. As will be explained in greater detail below, the aggregate storage capacity of the disclosed storage system can advantageously be scaled in increments of single storage nodes regardless of the implemented RAID level. Moreover, the RAID width and distribution can be automatically adapted to the number of storage nodes in the storage system so that, when scaling provides a sufficient number of storage nodes, new RAID groups with larger RAID widths are formed to improve storage efficiency or data availability.
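By way of illustration only, the sketch below shows one way the cell bookkeeping described above could look. The node capacity, the multiple of the target RAID width, and the function name are illustrative assumptions, not values taken from the disclosure.

```python
def subdivide_into_cells(node_capacity_gib, target_raid_width, multiple=2):
    """Split one node's capacity into same-size, uniformly indexed cells.

    The cell count is an integer multiple of the target RAID width, as noted
    above; the capacity figure and the multiple of 2 are illustrative.
    """
    cell_count = target_raid_width * multiple
    cell_size_gib = node_capacity_gib / cell_count
    return list(range(1, cell_count + 1)), cell_size_gib

indices, size = subdivide_into_cells(node_capacity_gib=1024, target_raid_width=4)
print(indices, size)  # cells 1..8, 128.0 GiB each, same indices on every node
```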
As shown in FIG. 2 , recursive fractal storage configurations supporting RAID widths 2, 4, and 8 for RAID-1 (1+1), RAID-5 (3+1), and RAID-6 (6+2), for example, may be viewed as 2×2, 4×4, and 8×8 matrices, respectively, where each row corresponds to a storage node and each column corresponds to a cell index. The reference number within each position of a matrix represents the RAID group of the member that occupies that position. Every RAID group comprises members distributed across different rows because the members are distributed across different storage nodes to protect against storage node failure. For example, RAID-6 (6+2) group 0 has members on a diagonal of the 8×8 matrix. RAID protection group distribution in the illustrated matrices, and thus aggregate storage, is characterized by recursive fractal patterns. For example, the 4×4 matrix is a 2×2 matrix of 4-cell patterns, where each pattern is a 2×2 submatrix. Similarly, the 8×8 matrix is patterned as four 4×4 and/or eight 2×2 submatrices. Larger (by powers of 2) matrices can be recursively patterned accordingly. As will be explained below, recursive fractal distribution of protection group members enables granular scaling and RAID width adaptation. For example, a storage system could initially be configured according to the 2×2 matrix and evolve into the distribution of the 4×4 matrix and then the 8×8 matrix.
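The figures are not reproduced here, but a minimal sketch of a layout with the properties described for FIG. 2 follows: rows are storage nodes, columns are cell indices, group 0 lies on the main diagonal, and each power-of-2 matrix is built recursively from the smaller one. The node-index XOR cell-index rule is an assumption chosen because it has exactly these properties; the arrangement in the actual figures may differ.

```python
def fractal_layout(width):
    """Recursive fractal member layout for a power-of-2 RAID width.

    layout[node][cell] is the protection group that owns that cell.  With the
    i XOR j rule, group 0 lies on the main diagonal, no group has two members
    on the same node, and the top-left quadrant of the 2N x 2N matrix is the
    N x N matrix, giving the self-similar structure described for FIG. 2.
    """
    assert width >= 2 and (width & (width - 1)) == 0, "width must be a power of 2"
    return [[node ^ cell for cell in range(width)] for node in range(width)]

for row in fractal_layout(4):
    print(row)
# [0, 1, 2, 3]
# [1, 0, 3, 2]
# [2, 3, 0, 1]
# [3, 2, 1, 0]
```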
FIGS. 3A and 3B illustrate scaling by metamorphosing from recursive fractal distribution to linear distribution in the context of RAID-5 (3+1) groups (a, b, c, d) starting with a 4×4 matrix. When a new storage node 5 is added to the 4-node storage system, members of RAID groups (a, b, c, d) from four same-index cells in the first column are groupwise-rotationally relocated from the original 4×4 matrix into the cells of the new row corresponding to storage node 5. The relocation frees space for creation of a new RAID-5 (3+1) group (A) that is linearly distributed in the vacated same-index cells. As additional storage nodes are added, members from sequentially adjacent same-index cell columns of the original 4-node storage system with recursive fractal distributed groups (a, b, c, d) are groupwise-rotationally relocated into the new rows corresponding to the new storage nodes to free space for new RAID-5 (3+1) groups (B, C, D) that are linearly distributed in the vacated same-index cells. After adding four new storage nodes (5-8), all the protection groups are linearly distributed, and storage capacity can be split into two 4-node linearly distributed subsets: nodes 1-4 and nodes 5-8.
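A sketch of this growth step follows, reusing the hypothetical i XOR j layout above. The group-wise rotation is modeled as placing the relocated member of the k-th original group into the k-th cell of the new row; the exact placement used in FIGS. 3A and 3B is not stated here, so that detail is an assumption.

```python
def grow_fractal_to_linear(layout, new_group_labels):
    """Add one node per new group label, metamorphosing fractal -> linear.

    layout[node][cell] holds a group label.  For each new node, the members in
    the next same-index cell column are relocated to the new row (the member
    of the k-th original group goes to cell k), and the vacated column is
    given to a new, column-wise linear group.
    """
    width = len(layout[0])
    originals = sorted({g for row in layout for g in row})
    for col, new_group in zip(range(width), new_group_labels):
        new_row = [None] * width
        for node in range(width):
            member = layout[node][col]
            new_row[originals.index(member)] = member  # group-wise placement
            layout[node][col] = new_group              # vacated cell -> new group
        layout.append(new_row)
    return layout

layout = [list("abcd"), list("badc"), list("cdab"), list("dcba")]  # 4x4 fractal
grow_fractal_to_linear(layout, ["A", "B", "C", "D"])
# Nodes 1-4 now hold A-D column-wise and nodes 5-8 hold a-d column-wise,
# so the system can split into two linearly distributed 4-node subsets.
```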
FIGS. 4A and 4B illustrate scaling by metamorphosing from linear distribution to recursive fractal distribution. The 4×4 recursive fractal matrix described above is superimposed on one of the 4×4 linearly distributed subsets (nodes 1-4) that was split-off as described above. New RAID groups are distributed according to the distribution pattern of the matching size recursive fractal matrix. For example, when a new row 5 is added, the RAID group members on the diagonal cells superimposed with number 0 are relocated to the new row. A new RAID group (a) is created using the vacated cells. The procedure is iterated with numbers 1, 2, and 3, in sequence. After adding three more rows (6-8) in this manner, three more new RAID groups (b, c, d) have been created using cells superimposed with numbers 1, 2, and 3, respectively. Members of each original RAID group are relocated to cells of the new storage nodes at the same-column indices. The linear distribution pattern thus metamorphoses into two subsets: a first subset characterized by a recursive fractal pattern and a second subset characterized by a column-wise linear pattern.
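The reverse step can be sketched the same way, again assuming the i XOR j matrix as the superimposed recursive fractal pattern; only the mechanics described above (relocate the cells carrying the next superimposed number to the new row at the same column indices, then reuse the vacated cells for a new group) are intended to be illustrated.

```python
def grow_linear_to_fractal(layout, new_group_labels):
    """Add one node per new group label, metamorphosing linear -> fractal.

    A W x W recursive fractal matrix (here i XOR j) is superimposed on the
    W x W linear subset.  For each new node, members in cells carrying the
    next superimposed number move to the new row at the same cell indices,
    and the vacated cells form a new, fractally distributed group.
    """
    width = len(layout[0])
    for number, new_group in zip(range(width), new_group_labels):
        new_row = [None] * width
        for node in range(width):
            for cell in range(width):
                if node ^ cell == number:               # superimposed number
                    new_row[cell] = layout[node][cell]  # same-column relocation
                    layout[node][cell] = new_group      # vacated cell -> new group
        layout.append(new_row)
    return layout

layout = [list("ABCD") for _ in range(4)]  # column-wise linear: cell k holds group k
grow_linear_to_fractal(layout, ["a", "b", "c", "d"])
# Nodes 1-4 now hold a-d in the recursive fractal pattern;
# nodes 5-8 hold the original groups A-D column-wise.
```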
FIGS. 5A, 5B, 6A, 6B, 7A, 7B, and 8 illustrate RAID width adaptation. The addition of storage nodes enables implementation of larger RAID widths. RAID distribution can be automatically metamorphosed such that new RAID groups are formed with larger RAID width. For example, a system of RAID-5 (3+1) may evolve into RAID-6 (6+2) as the storage system grows from 4 nodes to more than 8 nodes. The metamorphosis methods adapt to “power of 2” RAID widths with alternating patterns of recursive fractal distribution and linear distribution.
Referring to FIG. 5A, if the storage system is initially configured with 4 storage nodes with 8 cells per node, RAID groups characterized by width 4 can be distributed by repeating the 4×4 recursive fractal matrix pattern in adjacent 4-cell groups. Each 4-cell group is separately metamorphosed as described above. New RAID groups (A, E) are formed with column-wise linear distribution as the system grows with the addition of storage node 5. All RAID groups will be linearly aligned (per column) as the system grows to 8 storage nodes by metamorphosing from recursive fractal to linear distribution as shown in FIG. 5B.
Referring to FIGS. 6A and 6B, as the storage system grows beyond 8 storage nodes, new RAID groups are formed by metamorphosing from linear distribution according to the superimposed 8×8 recursive fractal matrix. After relocating RAID group members from the diagonal cells (marked 0) to the new node 9, the vacated cells are allocated to a new RAID group (a). As more storage nodes are added to the system, original RAID group members from cells marked in subsequent sequential numbers of the 8×8 recursive fractal matrix are relocated to the new storage nodes. The vacated cells are allocated to new RAID groups (b, c, d, e, f, g, h).
Referring to FIG. 7A, the storage system is split into two subsets after the addition of eight new storage nodes. The first subset of nodes 9-16 has RAID width 4 groups in a linear distribution pattern. There are two RAID width 4 groups per column. Referring to FIG. 7B, same-cell-index pairs of smaller RAID width 4 groups are combined into single larger RAID width 8 groups. Parity is recomputed, e.g., from RAID-5 (3+1) to RAID-6 (6+2). The subset of storage nodes 1-8 includes RAID width 8 groups in a recursive fractal distribution pattern. This subset will support subsequent metamorphosis growth and split cycles. For example, RAID members in the first column will be relocated to a new storage node 17, making room for a new RAID group (i) to be formed. RAID members in subsequent columns may be relocated to successive new storage nodes for new RAID groups to be formed.
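A minimal sketch of the regrouping step follows. The block contents are invented for illustration, only the XOR parity (P) of the new RAID-6 (6+2) group is computed, and the second RAID-6 parity (Q) as well as all placement bookkeeping are elided.

```python
from functools import reduce

def combine_into_width8(group_x_data, group_y_data):
    """Combine two same-cell-index RAID-5 (3+1) groups into one RAID-6 (6+2).

    Each argument holds the three data blocks of one width-4 group.  The six
    data members stay where they are and parity is recomputed for the wider
    group; only the XOR parity (P) is shown here.
    """
    data = list(group_x_data) + list(group_y_data)  # 6 data members stay in place
    xor = lambda a, b: bytes(x ^ y for x, y in zip(a, b))
    p = reduce(xor, data)
    return data, p  # members of the new 6+2 group (Q omitted)

data, p = combine_into_width8([b"\x01\x02", b"\x03\x04", b"\x05\x06"],
                              [b"\x10\x20", b"\x30\x40", b"\x50\x60"])
print(p.hex())  # XOR of all six data blocks
```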
Referring to FIG. 8 , after adding eight new storage nodes, the original subset will again metamorphose and split into two subsets, with a linear distribution of all RAID groups. The system may continue to grow in metamorphosis cycles, with alternating patterns of recursive fractal and linear distribution.
If the storage space of each storage node is subdivided into more cells, the system may support larger (power of 2) RAID widths, using the recursive fractals and metamorphosis as described above. However, storage efficiency is subject to diminishing returns: it is 75% with RAID-5 (3+1) or RAID-6 (6+2), 87.5% with RAID-5 (7+1) or RAID-6 (14+2), 93.75% with RAID-5 (15+1) or RAID-6 (30+2), and so on. A larger RAID width may also increase the complexity of RAID recovery in case of a storage node failure. Therefore, a target maximum RAID width may be selected so that automated RAID width adaptation does not increase RAID width indefinitely.
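The diminishing returns fall directly out of the parity-overhead arithmetic, as this small sketch shows; the widths mirror the figures quoted above.

```python
def efficiency(data_members, parity_members):
    """Fraction of a protection group's capacity that holds data."""
    return data_members / (data_members + parity_members)

for label, d, p in [("RAID-5 (3+1)", 3, 1), ("RAID-6 (6+2)", 6, 2),
                    ("RAID-5 (7+1)", 7, 1), ("RAID-6 (14+2)", 14, 2),
                    ("RAID-5 (15+1)", 15, 1), ("RAID-6 (30+2)", 30, 2)]:
    print(f"{label}: {efficiency(d, p):.2%}")
# 75.00%, 75.00%, 87.50%, 87.50%, 93.75%, 93.75%: each doubling of the group
# width halves the remaining parity overhead, a smaller and smaller gain.
```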
FIG. 9 illustrates a method for granular scaling and RAID width adaptation. Step 900 is subdividing the storage capacity of a group of homogeneous storage nodes into indexed same-size cells. The storage nodes may be parts of a software-defined, server-based SAN, but that should not be viewed as a limitation because other types of storage systems and nodes could be used. Step 902 is creating and distributing RAID groups across the nodes in a recursive fractal distribution pattern. The RAID level may be selected based on the number of storage nodes, e.g., with the RAID width being selected as a function of the number of storage nodes. No more than one member of any RAID group is located on the same storage node. Step 904 is adding new storage nodes by metamorphosing from the recursive fractal distribution pattern to a linear distribution pattern. The nodes may be added individually. For example, same-cell-index members may be groupwise-rotated to cells on a new storage node, iteratively for each new storage node with sequentially indexed cells until there are enough nodes to increase the RAID width. After there are enough nodes to increase the RAID width as determined in step 906, step 908 determines whether the target RAID width has been reached. If the target RAID width has not been reached, then step 910 is adding new storage nodes and adding new larger width RAID groups in the recursive fractal distribution pattern by metamorphosing from the linear distribution pattern. The step may be performed automatically, and the new RAID level may be selected based on the number of storage nodes in the system. Parity is recomputed. This is iterated until there are enough nodes for a split as determined in step 912. When there are enough storage nodes for a split, a split is implemented in step 914 and the flow returns to step 904. The split divides the matrix of cells into two subsets: a first subset that is distributed in a linear pattern and a second subset that is distributed in a recursive fractal pattern. If the target RAID width has been reached as determined in step 908, then flow exits from the existing loops of steps and new RAID groups can still be added by metamorphosing between recursive fractal and linear distribution patterns in step 916.
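A compact restatement of the FIG. 9 flow as a control loop is given below. The system object and every method called on it are hypothetical stand-ins for the operations described above rather than an API defined by the disclosure.

```python
def adapt_raid_width(system, target_width):
    """Granular scaling and RAID width adaptation, mirroring FIG. 9."""
    system.subdivide_into_indexed_cells()                  # step 900
    system.create_groups_in_fractal_pattern()              # step 902
    while True:
        # Step 904: add nodes one at a time, metamorphosing fractal -> linear.
        while not system.enough_nodes_for_wider_raid():    # step 906
            system.add_node_fractal_to_linear()
        if system.current_raid_width() >= target_width:    # step 908
            break
        # Step 910: add nodes and wider groups, metamorphosing linear -> fractal;
        # parity is recomputed for the wider groups.
        while not system.enough_nodes_for_split():         # step 912
            system.add_node_linear_to_fractal_wider()
        system.split_into_linear_and_fractal_subsets()     # step 914, then back to 904
    # Step 916: target width reached; keep growing by alternating between
    # recursive fractal and linear distribution without widening further.
    system.continue_alternating_metamorphosis()
```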
A number of features, aspects, embodiments, and implementations have been described. Nevertheless, it will be understood that a wide variety of modifications and combinations may be made without departing from the scope of the inventive concepts described herein. Accordingly, those modifications and combinations are within the scope of the following claims.

Claims (20)

What is claimed is:
1. An apparatus comprising:
a plurality of homogeneous storage nodes that are interconnected via a network, each storage node comprising a server and local storage, each server comprising a multi-core processor, memory, and a storage controller configured to:
subdivide storage capacity of the local storage into indexed same-size cells;
create and distribute members of protection groups across the storage nodes in the cells in a recursive fractal distribution pattern; and
metamorphose distribution of sets of the members of the protection groups from the recursive fractal distribution pattern to a linear distribution pattern in response to addition of new storage nodes, thereby enabling scaling in increments of single storage nodes.
2. The apparatus of claim 1 further comprising the storage controller being configured to metamorphose distribution of the protection groups from the recursive fractal distribution pattern to the linear distribution pattern until all protection groups are linearly distributed.
3. The apparatus of claim 2 further comprising the storage controller being configured to metamorphose distribution of the protection groups from the recursive fractal distribution pattern to the linear distribution pattern via group-wise rotation of same-cell-index protection group members to cells on a new storage node, iteratively for each new storage node with sequentially indexed cells.
4. The apparatus of claim 3 further comprising the storage controller being configured to split the protection groups into two subsets and metamorphose distribution of the protection groups of one of the subsets from the linear distribution pattern to the recursive fractal distribution pattern in response to addition of new storage nodes.
5. The apparatus of claim 4 further comprising the storage controller being configured to metamorphose distribution of the protection groups of one of the subsets from the linear distribution pattern to the recursive fractal distribution pattern by selecting protection group members corresponding to superimposition of a recursive fractal matrix on the linear distribution pattern.
6. The apparatus of claim 5 further comprising the storage controller being configured to split the protection groups into two subsets: a first subset that is distributed in a linear pattern and a second subset that is distributed in a recursive fractal pattern.
7. The apparatus of claim 6 further comprising the storage controller being configured to combine pairs of protection groups with the same cell index into single wider protection groups responsive to addition of a sufficient number of storage nodes.
8. A method comprising:
subdividing storage capacity of local storage of a plurality of homogeneous storage nodes that are interconnected via a network into indexed same-size cells;
creating and distributing members of protection groups across the storage nodes in the cells in a recursive fractal distribution pattern; and
metamorphosing distribution of sets of the members of the protection groups from the recursive fractal distribution pattern to a linear distribution pattern in response to addition of new storage nodes, thereby enabling scaling in increments of single storage nodes.
9. The method of claim 8 further comprising metamorphosing distribution of the protection groups from the recursive fractal distribution pattern to the linear distribution pattern until all protection groups are linearly distributed.
10. The method of claim 9 further comprising metamorphosing distribution of the protection groups from the recursive fractal distribution pattern to the linear distribution pattern by group-wise rotating same-cell-index protection group members to cells on a new storage node, iteratively for each new storage node with sequentially indexed cells.
11. The method of claim 10 further comprising splitting the protection groups into two subsets and metamorphosing distribution of the protection groups of one of the subsets from the linear distribution pattern to the recursive fractal distribution pattern in response to addition of new storage nodes.
12. The method of claim 11 further comprising metamorphosing distribution of the protection groups of one of the subsets from the linear distribution pattern to the recursive fractal distribution pattern by selecting protection group members corresponding to superimposition of a recursive fractal matrix on the linear distribution pattern.
13. The method of claim 12 further comprising splitting the protection groups into two subsets: a first subset that is distributed in a linear pattern and a second subset that is distributed in a recursive fractal pattern.
14. The method of claim 13 further comprising combining pairs of protection groups with the same cell index into single wider protection groups responsive to addition of a sufficient number of storage nodes.
15. A non-transitory computer-readable storage medium storing instructions that are executed by a computer to perform a method comprising:
subdividing storage capacity of local storage of a plurality of homogeneous storage nodes that are interconnected via a network into indexed same-size cells;
creating and distributing members of protection groups across the storage nodes in the cells in a recursive fractal distribution pattern; and
metamorphosing distribution of sets of the members of the protection groups from the recursive fractal distribution pattern to a linear distribution pattern in response to addition of new storage nodes, thereby enabling scaling in increments of single storage nodes.
16. The non-transitory computer-readable storage medium of claim 15 in which the method further comprises metamorphosing distribution of the protection groups from the recursive fractal distribution pattern to the linear distribution pattern until all protection groups are linearly distributed.
17. The non-transitory computer-readable storage medium of claim 16 in which the method further comprises metamorphosing distribution of the protection groups from the recursive fractal distribution pattern to the linear distribution pattern by group-wise rotating same-cell-index protection group members to cells on a new storage node, iteratively for each new storage node with sequentially indexed cells.
18. The non-transitory computer-readable storage medium of claim 17 in which the method further comprises splitting the protection groups into two subsets and metamorphosing distribution of the protection groups of one of the subsets from the linear distribution pattern to the recursive fractal distribution pattern in response to addition of new storage nodes.
19. The non-transitory computer-readable storage medium of claim 18 in which the method further comprises metamorphosing distribution of the protection groups of one of the subsets from the linear distribution pattern to the recursive fractal distribution pattern by selecting protection group members corresponding to superimposition of a recursive fractal matrix on the linear distribution pattern.
20. The non-transitory computer-readable storage medium of claim 19 in which the method further comprises splitting the protection groups into a first subset that is distributed in a linear pattern and a second subset that is distributed in a recursive fractal pattern and combining pairs of protection groups with the same cell index into single wider protection groups responsive to addition of a sufficient number of storage nodes.
US18/302,837 2023-04-19 2023-04-19 Adaptive raid width and distribution for flexible storage Active US12008271B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/302,837 2023-04-19 2023-04-19 Adaptive raid width and distribution for flexible storage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US18/302,837 2023-04-19 2023-04-19 Adaptive raid width and distribution for flexible storage

Publications (1)

Publication Number Publication Date
US12008271B1 (en) 2024-06-11

Family

ID=91382629

Family Applications (1)

Application Number Title Priority Date Filing Date
US18/302,837 Adaptive raid width and distribution for flexible storage 2023-04-19 2023-04-19

Country Status (1)

Country Link
US (1) US12008271B1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020078276A1 (en) * 2000-12-20 2002-06-20 Ming-Li Hung RAID controller with IDE interfaces
US20050063217A1 (en) * 2003-09-24 2005-03-24 Nec Corporation Disk array device, method of extending storage capacity and computer program
US20080276057A1 (en) * 2007-05-01 2008-11-06 International Business Machines Corporation Data storage array scaling method and system with minimal data movement
US20200401340A1 (en) * 2017-06-19 2020-12-24 Hitachi, Ltd. Distributed storage system
US20220391359A1 (en) * 2021-06-07 2022-12-08 Netapp, Inc. Distributed File System that Provides Scalability and Resiliency

Similar Documents

Publication Publication Date Title
US11449226B2 (en) Reorganizing disks and raid members to split a disk array during capacity expansion
US10558383B2 (en) Storage system
CN108052655B (en) Data writing and reading method
US11144396B1 (en) Raid reliability with a provisional spare disk
US9436394B2 (en) RAID random distribution scheme
US7093069B2 (en) Integration of a RAID controller with a disk drive module
US10817376B2 (en) RAID with heterogeneous combinations of segments
US20160048342A1 (en) Reducing read/write overhead in a storage array
US11340789B2 (en) Predictive redistribution of capacity in a flexible RAID system
US20180246793A1 (en) Data stripping, allocation and reconstruction
CN107025066A (en) The method and apparatus that data storage is write in the storage medium based on flash memory
US11314608B1 (en) Creating and distributing spare capacity of a disk array
US11983414B2 (en) Successive raid distribution for single disk expansion with efficient and balanced spare capacity
US11474901B2 (en) Reliable RAID system with embedded spare capacity and flexible growth
CN119336536A (en) A data reconstruction method, device, storage medium and program product
US11327666B2 (en) RAID member distribution for granular disk array growth
US10146619B2 (en) Assigning redundancy in encoding data onto crossbar memory arrays
US11507287B1 (en) Adding single disks to an array by relocating raid members
US11256428B2 (en) Scaling raid-based storage by redistributing splits
WO2018235132A1 (en) DISTRIBUTED STORAGE SYSTEM
US20210389896A1 (en) Flexible raid sparing using disk splits
US12008271B1 (en) Adaptive raid width and distribution for flexible storage
Li et al. Relieving both storage and recovery burdens in big data clusters with R-STAIR codes
US11403022B2 (en) Growing and splitting a disk array by moving RAID group members
US20200336157A1 (en) Systematic and xor-based coding technique for distributed storage systems

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STCF Information on status: patent grant

Free format text: PATENTED CASE