CN117795468A

CN117795468A - Heterogeneity supportive resiliency groups

Info

Publication number: CN117795468A
Application number: CN202280054641.8A
Authority: CN
Inventors: 罗伯特·李; B·费金; Y·高; 罗纳德·卡尔
Original assignee: Pure Storage Inc
Current assignee: Pure Storage Inc
Priority date: 2021-07-19
Filing date: 2022-05-25
Publication date: 2024-03-29
Also published as: EP4359900A1; WO2023003627A1

Abstract

A method of operating a storage system, and a related storage system, are provided. The storage system establishes resiliency groups, each resiliency group having a defined level of resource redundancy for the storage system. The resiliency groups include at least one computing resource resiliency group and at least one storage resource resiliency group. The storage system supports configurations that have multiples of each of the resiliency groups. Blades of the storage system perform distributed data and metadata storage across modular storage devices according to the resiliency groups.

Description

Heterogeneity supportive resiliency groups

Cross-reference to related applications

U.S. patent application Ser. No. 17/379,762, filed on July 19, 2021, is hereby incorporated by reference herein for all purposes.

Background

Erasure coding in storage systems such as storage arrays and storage clusters is typically set up with N+2 redundancy to survive the failure of two blades or other storage devices, such as storage units or drives (e.g., solid state drives, hard disk drives, optical disk drives, etc.), or with another specified level of redundancy. As blades or other devices are added to a storage system, the probability of failure of three or more blades or other storage devices or memories and computing devices (and thus of data loss) grows exponentially. This trend gets worse with multi-chassis storage clusters, which may have 150 blades in 10 chassis, for example. Because of this trend, loss survivability and data recovery do not scale well with storage system expansion. Accordingly, there is a need in the art for a solution that overcomes the drawbacks described above.
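The scaling concern above can be illustrated numerically. The following is a minimal sketch, not part of the described embodiments, that assumes each blade fails independently with the same probability p during some observation window (both the independence assumption and the value of p are hypothetical). It computes the probability that three or more of n blades fail, which is the event that N+2 redundancy cannot survive.

```python
from math import comb


def prob_at_least_k_failures(n: int, p: float, k: int = 3) -> float:
    """Probability that k or more of n blades fail.

    Assumes independent failures with identical probability p per blade
    (an illustrative assumption, not a statement about real hardware).
    """
    # Sum the binomial probabilities of 0 .. k-1 failures, then take the complement.
    p_fewer = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k))
    return 1.0 - p_fewer


if __name__ == "__main__":
    p = 0.01  # hypothetical per-blade failure probability
    for n in (15, 30, 150):
        print(f"{n:4d} blades: P(>=3 failures) = {prob_at_least_k_failures(n, p):.6f}")
```

Under these assumptions the probability rises steeply as n grows from one chassis to a multi-chassis cluster, which is the trend that bounding the number of blades per group is meant to contain.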

Drawings

The described embodiments and their advantages are best understood by referring to the following description taken in conjunction with the accompanying drawings. These drawings in no way limit any changes in form and detail that may be made to the described embodiments by those skilled in the art without departing from the spirit and scope of the described embodiments.

The present disclosure is illustrated by way of example, and not by way of limitation, and may be more fully understood with reference to the following detailed description when considered in connection with the figures described below.

FIG. 1A illustrates a first example system for data storage according to some embodiments.

FIG. 1B illustrates a second example system for data storage according to some embodiments.

FIG. 1C illustrates a third example system for data storage according to some embodiments.

FIG. 1D illustrates a fourth example system for data storage according to some embodiments.

FIG. 2A is a perspective view of a storage cluster having multiple storage nodes and internal storage coupled to each storage node to provide network attached storage, according to some embodiments.

FIG. 2B is a block diagram showing an interconnect switch coupling multiple storage nodes, according to some embodiments.

FIG. 2C is a multi-level block diagram showing the contents of a storage node and the contents of one of the non-volatile solid state storage units, according to some embodiments.

FIG. 2D shows a storage server environment that uses embodiments of the storage nodes and storage units of FIGS. 1-3, according to some embodiments.

FIG. 2E is a block diagram of blade hardware showing a control plane, compute and storage planes, and authorities interacting with underlying physical resources, according to some embodiments.

FIG. 2F depicts elasticity software layers in blades of a storage cluster, according to some embodiments.

FIG. 2G depicts authorities and storage resources in blades of a storage cluster, according to some embodiments.

FIG. 3A illustrates a diagram of a storage system coupled for data communication with a cloud service provider, according to some embodiments of the present disclosure.

FIG. 3B illustrates a diagram of a storage system according to some embodiments of the present disclosure.

FIG. 3C illustrates an example of a cloud-based storage system according to some embodiments of the present disclosure.

FIG. 3D illustrates an exemplary computing device that may be specifically configured to perform one or more of the processes described herein.

FIG. 3E illustrates an example of a storage system cluster for providing storage services.

FIG. 4 depicts resiliency groups in a storage system that support recovery of data in the event that up to a specified number of blades of a resiliency group are lost.

FIG. 5 depicts a scenario of a geometry change of a storage system, in which a blade is added to a storage cluster, resulting in a change from a previous version of resiliency groups to a current version of resiliency groups.

FIG. 6 is a system and action diagram showing authorities in distributed computing resources of a storage system, and switches in communication with chassis and blades, to form resiliency groups.

FIG. 7 depicts garbage collection that respects resiliency groups to reclaim memory (storage memory) and relocate data.

FIG. 8 is a system and action diagram showing details of garbage collection that coordinates recovery of memory with scanning and relocation of data.

FIG. 9 depicts quorum groups for boot-up processes using resiliency groups.

FIG. 10 depicts a witness group and authority election using resiliency groups.

FIG. 11 is a system and action diagram showing details of authority election and assignment, organized by a witness group and involving a majority vote across blades of a storage system.

FIG. 12 is a flow chart of a method of operating a storage system having resiliency groups.

FIG. 13 is a flow chart of a method of operating a storage system with garbage collection in resiliency groups.

FIG. 14 is an illustration showing an exemplary computing device in which embodiments described herein may be implemented.

FIG. 15 depicts the formation of resiliency groups from blades having different amounts of memory.

FIG. 16 depicts a conservative estimate of the amount of memory space available in a resiliency group.

FIG. 17 depicts a garbage collection module having various options related to the resiliency groups depicted in FIGS. 15 and 16.

FIG. 18 is a flow chart of a method of operating a storage system to form resiliency groups.

FIG. 19 depicts a storage system having one or more resiliency groups of computing resources defined within a computing region and one or more resiliency groups of storage resources defined within a storage region, according to an embodiment.

FIG. 20 depicts an embodiment of a storage system that forms data stripes and writes the data stripes using resources in resiliency groups.

FIG. 21 is a flowchart of a method that may be practiced by a storage system using resources in a resiliency group, according to an embodiment.

Detailed Description

Various embodiments of the storage systems described herein form resiliency groups, each resiliency group having a specified subset of resources of the total number of blades in a storage system, such as a storage array or storage cluster. In some embodiments (see FIGS. 1A-3E and 4-18), an entire blade may be a member of a resiliency group, and in some embodiments (see FIGS. 1A-3E and 19-21), subsets of the resources of blades and storage devices may be members of a resiliency group. One proposed maximum number of blades for a resiliency group is 29 blades, or one less than the number of blades required to fill two chassis, although other numbers and arrangements of blades or other storage devices or memories and computing devices may be used. If another blade is added and any one resiliency group would then have more than the specified maximum number of blades, embodiments of the storage cluster re-form the resiliency groups. The resiliency groups are also re-formed when blades are removed or moved to different slots, or when a chassis with one or more blades is added to or removed from the storage system (i.e., upon a change in cluster geometry that meets specified criteria). In some embodiments, flash writes (for segment formation) and NVRAM (non-volatile random access memory) insertion (i.e., writes to NVRAM) should not cross the boundary of a write group, and each write group is selected within a resiliency group. In some embodiments, NVRAM insertion attempts to select blades in the same chassis, to avoid overloading a top of rack (TOR) switch and the resulting write amplification. This organization of write groups and resiliency groups results in the probability of failure of three or more blades in a write group or resiliency group being a stable, non-deteriorating value, even as more blades are added to the storage cluster.

When the cluster geometry changes, authorities (in some embodiments of the storage cluster) list segments that are split across two or more of the new resiliency groups, and garbage collection remaps the segments to keep each segment in one of the resiliency groups. For boot-up processes, in some embodiments, quorum groups are formed within resiliency groups and are identical to the resiliency groups in steady state. In some embodiments, processes are defined for authorities to flush NVRAM and to remap segments, transactions, and commit records when resiliency groups change. In some embodiments, processes are further defined for authority election and renewal of authority leases, using a witness group chosen as the first resiliency group of the cluster. In various embodiments, there are multiple possibilities for how resiliency groups are formed and how garbage collection is performed when blades have different amounts of memory. Various storage systems are described below with reference to FIGS. 1A-3E, embodiments of which may operate as storage systems according to the resiliency groups described with reference to FIGS. 4-13 and 15-18.
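As a rough illustration of the size rule described above, the following sketch re-forms resiliency (elastic) groups whenever the blade list changes, so that no group exceeds the proposed maximum of 29 blades. The function name, the use of blade identifiers as strings, and the policy of splitting blades as evenly as possible are assumptions made for this example; the description does not prescribe how membership is chosen.

```python
from math import ceil

MAX_GROUP_SIZE = 29  # proposed maximum from the description; other values may be used


def form_resiliency_groups(blades: list[str], max_size: int = MAX_GROUP_SIZE) -> list[list[str]]:
    """Partition blades into the fewest groups such that none exceeds max_size.

    The even-split policy is a hypothetical choice for illustration only.
    """
    if not blades:
        return []
    n_groups = ceil(len(blades) / max_size)
    base, extra = divmod(len(blades), n_groups)
    groups, start = [], 0
    for g in range(n_groups):
        size = base + (1 if g < extra else 0)  # the first `extra` groups take one more blade
        groups.append(blades[start:start + size])
        start += size
    return groups


if __name__ == "__main__":
    cluster = [f"blade-{i}" for i in range(29)]
    print([len(g) for g in form_resiliency_groups(cluster)])
    cluster.append("blade-29")  # adding a 30th blade forces the groups to be re-formed
    print([len(g) for g in form_resiliency_groups(cluster)])
```

With 29 blades this yields a single group; adding a thirtieth blade yields two groups of 15, so the number of blades exposed to any single multi-blade failure stays bounded as the cluster grows.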

FIG. 1A illustrates an example system for data storage according to some embodiments. For purposes of illustration and not limitation, system 100 (also referred to herein as a "storage system") includes many elements. It may be noted that system 100 may include the same, more, or fewer elements configured in the same or different ways in other embodiments. The system 100 includes a number of computing devices 164A-B. The computing device (also referred to herein as a "client device") may be embodied as, for example, a server, workstation, personal computer, notebook computer, or the like in a data center. The computing devices 164A-B may be coupled for data communication with one or more storage arrays 102A-B through a storage area network ('SAN') 158 or a local area network ('LAN') 160.

SAN 158 may be implemented with a variety of data communication fabrics, devices, and protocols. For example, the fabrics for SAN 158 may include Fibre Channel, Ethernet, InfiniBand, Serial Attached Small Computer System Interface ('SAS'), or the like. The data communication protocols used with SAN 158 may include Advanced Technology Attachment ('ATA'), Fibre Channel Protocol, Small Computer System Interface ('SCSI'), Internet Small Computer System Interface ('iSCSI'), HyperSCSI, Non-Volatile Memory Express ('NVMe') over Fabrics, or the like. It is noted that SAN 158 is provided for purposes of illustration and not limitation. Other data communication couplings may be implemented between the computing devices 164A-B and the storage arrays 102A-B.

LAN 160 may also be implemented with a variety of fabrics, devices, and protocols. For example, the fabrics for LAN 160 may include Ethernet (802.3), wireless (802.11), or the like. The data communication protocols used in LAN 160 may include Transmission Control Protocol ('TCP'), User Datagram Protocol ('UDP'), Internet Protocol ('IP'), HyperText Transfer Protocol ('HTTP'), Wireless Access Protocol ('WAP'), Handheld Device Transport Protocol ('HDTP'), Session Initiation Protocol ('SIP'), Real Time Protocol ('RTP'), or the like.

The storage arrays 102A-B may provide persistent data storage for the computing devices 164A-B. In an implementation, the storage array 102A may be housed in a chassis (not shown) and the storage array 102B may be housed in another chassis (not shown). The storage arrays 102A and 102B may include one or more storage array controllers 110A-D (also referred to herein as "controllers"). The storage array controllers 110A-D may be embodied as modules of automated computing machinery comprising computer hardware, computer software, or a combination of computer hardware and software. In some implementations, the storage array controllers 110A-D may be configured to perform various storage tasks. Storage tasks may include writing data received from computing devices 164A-B to storage arrays 102A-B, erasing data from storage arrays 102A-B, retrieving data from storage arrays 102A-B and providing the data to computing devices 164A-B, monitoring and reporting disk utilization and performance, performing redundancy operations such as redundant array of independent drives ('RAID') or RAID-like data redundancy operations, compressing data, encrypting data, and so forth.

The storage array controllers 110A-D may be implemented in various ways, including as a field programmable gate array ('FPGA'), a programmable logic chip ('PLC'), an application specific integrated circuit ('ASIC'), a system on a chip ('SOC'), or any computing device that includes discrete components, such as a processing device, a central processing unit, computer memory, or various adapters. The storage array controllers 110A-D may include, for example, data communications adapters configured to support communications via the SAN 158 or LAN 160. In some implementations, the storage array controllers 110A-D may be independently coupled to the LAN 160. In an implementation, the storage array controllers 110A-D may include I/O controllers or the like that couple the storage array controllers 110A-D for data communications, through a midplane (not shown), to persistent storage resources 170A-B (also referred to herein as "storage resources"). Persistent storage resources 170A-B generally include any number of storage drives 171A-F (also referred to herein as "storage devices") and any number of non-volatile random access memory ('NVRAM') devices (not shown).

In some implementations, the NVRAM devices of persistent storage resources 170A-B may be configured to receive, from the storage array controllers 110A-D, data to be stored in the storage drives 171A-F. In some examples, the data may originate from computing devices 164A-B. In some examples, writing data to the NVRAM device may be performed faster than writing data directly to the storage drives 171A-F. In an implementation, the storage array controllers 110A-D may be configured to utilize the NVRAM devices as quickly accessible buffers for data destined to be written to the storage drives 171A-F. The latency of write requests that use NVRAM devices as buffers may be improved relative to systems in which the storage array controllers 110A-D write data directly to the storage drives 171A-F. In some implementations, the NVRAM device may be implemented with computer memory in the form of high-bandwidth, low-latency RAM. The NVRAM device is referred to as "non-volatile" because it may receive or include a unique power source that maintains the state of the RAM after main power to the NVRAM device is lost. Such a power source may be a battery, one or more capacitors, or the like. In response to a power loss, the NVRAM device may be configured to write the contents of the RAM to persistent storage, such as the storage drives 171A-F.
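The buffering behavior described above can be sketched as follows. This is a simplified model under stated assumptions, not the controller's actual logic: the class name, the in-memory list standing in for NVRAM, and the dictionary standing in for the storage drives are all hypothetical.

```python
class NvramBufferedStore:
    """Toy model: writes are acknowledged once staged in NVRAM, then flushed to drives."""

    def __init__(self):
        self.nvram = []   # staged (address, data) pairs; stands in for fast NVRAM
        self.drives = {}  # address -> data; stands in for storage drives 171A-F

    def write(self, address, data):
        # Stage the write in NVRAM and acknowledge without waiting for the drives.
        self.nvram.append((address, data))
        return "ack"

    def read(self, address):
        # The newest staged copy wins over whatever is already on the drives.
        for staged_address, data in reversed(self.nvram):
            if staged_address == address:
                return data
        return self.drives.get(address)

    def flush(self):
        # Destage buffered writes to persistent storage in arrival order.
        for address, data in self.nvram:
            self.drives[address] = data
        self.nvram.clear()

    def on_power_loss(self):
        # With backup power (battery or capacitors), RAM contents are written out.
        self.flush()


if __name__ == "__main__":
    store = NvramBufferedStore()
    store.write(0x10, b"payload")
    print(store.read(0x10), store.drives)  # data is readable before it reaches the drives
    store.on_power_loss()
    print(store.drives)
```

The design point the sketch captures is that write latency is bounded by the NVRAM staging step, while durability on power loss depends on the backup power source lasting long enough to complete the flush.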

In an implementation, storage drives 171A-F may refer to any device configured to record data persistently, where "persistently" or "persistent" refers to the device's ability to maintain the recorded data after a power loss. In some implementations, the storage drives 171A-F may correspond to non-disk storage media. For example, storage drives 171A-F may be one or more solid state drives ('SSDs'), flash-based storage devices, any type of solid state non-volatile memory, or any other type of non-mechanical storage device. In other implementations, storage drives 171A-F may include mechanical or spinning hard disks, such as hard disk drives ('HDD').

In some implementations, the storage array controllers 110A-D may be configured to offload device management responsibilities from the storage drives 171A-F in the storage arrays 102A-B. For example, the storage array controllers 110A-D may manage control information that describes the state of one or more memory blocks in the storage drives 171A-F. The control information may indicate, for example, that a particular memory block has failed and should no longer be written to, that a particular memory block contains boot code for the storage array controllers 110A-D, the number of program-erase ('P/E') cycles that have been performed on a particular memory block, the age of data stored in a particular memory block, the type of data stored in a particular memory block, and so forth. In some implementations, the control information may be stored as metadata with the associated memory block. In other implementations, the control information for the storage drives 171A-F may be stored in one or more particular memory blocks of the storage drives 171A-F that are selected by the storage array controllers 110A-D. The selected memory blocks may be marked with an identifier indicating that they contain control information. The identifier may be utilized by the storage array controllers 110A-D, in conjunction with the storage drives 171A-F, to quickly identify the memory blocks that contain control information. For example, the storage array controllers 110A-D may issue a command to locate memory blocks that contain control information. It may be noted that the control information may be so large that portions of it are stored in multiple locations, that the control information may be stored in multiple locations for redundancy purposes, for example, or that the control information may otherwise be distributed across multiple memory blocks in the storage drives 171A-F.

In an implementation, the storage array controllers 110A-D may offload device management responsibilities from the storage drives 171A-F of the storage arrays 102A-B by retrieving, from the storage drives 171A-F, control information describing the state of one or more memory blocks in the storage drives 171A-F. Retrieving the control information from the storage drives 171A-F may be carried out, for example, by the storage array controllers 110A-D querying the storage drives 171A-F for the location of the control information for a particular storage drive 171A-F. The storage drives 171A-F may be configured to execute instructions that enable the storage drives 171A-F to identify the location of the control information. The instructions may be executed by a controller (not shown) associated with or otherwise located on the storage drives 171A-F, and may cause the storage drives 171A-F to scan a portion of each memory block to identify the memory blocks that store the control information for the storage drives 171A-F. The storage drives 171A-F may respond by sending a response message to the storage array controllers 110A-D that includes the location of the control information for the storage drives 171A-F. In response to receiving the response message, the storage array controllers 110A-D may issue a request to read the data stored at the address associated with the location of the control information for the storage drives 171A-F.
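The query, scan, and read exchange described above can be sketched as follows. The marker value, class names, and method names are hypothetical, and each memory block is modeled as a byte string whose leading bytes carry the identifier; a real drive would define its own on-media format and command set.

```python
from dataclasses import dataclass

CONTROL_INFO_MARKER = b"CTRL"  # hypothetical identifier marking control-information blocks


@dataclass
class StorageDrive:
    blocks: list  # each entry is the bytes content of one memory block

    def locate_control_info(self) -> list:
        """Scan the leading portion of each block and report which blocks carry the marker."""
        width = len(CONTROL_INFO_MARKER)
        return [i for i, blk in enumerate(self.blocks) if blk[:width] == CONTROL_INFO_MARKER]

    def read_block(self, index: int) -> bytes:
        return self.blocks[index]


class StorageArrayController:
    def fetch_control_info(self, drive: StorageDrive) -> list:
        # 1. Query the drive for the location(s) of its control information.
        locations = drive.locate_control_info()
        # 2. Read each reported location and strip the marker.
        return [drive.read_block(i)[len(CONTROL_INFO_MARKER):] for i in locations]


if __name__ == "__main__":
    drive = StorageDrive(blocks=[b"user-data", b"CTRLpe_cycles=12", b"user-data"])
    print(StorageArrayController().fetch_control_info(drive))
```

Scanning only a fixed-width prefix of each block keeps the drive-side work small, which matches the description of scanning "a portion of each memory block".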

In other implementations, the storage array controllers 110A-D may further offload device management responsibilities from the storage drives 171A-F by performing, in response to receiving the control information, a storage drive management operation. A storage drive management operation may include, for example, an operation that is typically performed by the storage drives 171A-F (e.g., by a controller (not shown) associated with a particular storage drive 171A-F). A storage drive management operation may include, for example, ensuring that data is not written to failed memory blocks within the storage drives 171A-F, ensuring that data is written to memory blocks within the storage drives 171A-F in such a way that adequate wear leveling is achieved, and so forth.
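One way a controller might act on the control information is shown below: it skips failed memory blocks and prefers the healthy block with the fewest program-erase cycles. The record layout and the least-worn-first policy are assumptions for illustration; the description only requires that failed blocks are avoided and that adequate wear leveling is achieved.

```python
from dataclasses import dataclass


@dataclass
class BlockInfo:
    """Hypothetical per-block control information held by the controller."""
    block_id: int
    pe_cycles: int       # program-erase cycles performed so far
    failed: bool = False


def pick_block_for_write(blocks: list) -> BlockInfo:
    """Choose a write target: never a failed block, and the least-worn healthy one."""
    healthy = [b for b in blocks if not b.failed]
    if not healthy:
        raise RuntimeError("no healthy memory blocks available")
    return min(healthy, key=lambda b: b.pe_cycles)


if __name__ == "__main__":
    blocks = [BlockInfo(0, pe_cycles=5), BlockInfo(1, pe_cycles=1, failed=True), BlockInfo(2, pe_cycles=3)]
    target = pick_block_for_write(blocks)
    target.pe_cycles += 1  # account for the cycle consumed by this write
    print(target)
```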

In an implementation, the storage arrays 102A-B may implement two or more storage array controllers 110A-D. For example, the storage array 102A may include a storage array controller 110A and a storage array controller 110B. At a given instance, a single storage array controller 110A-D (e.g., storage array controller 110A) of storage system 100 may be designated as having primary status (also referred to herein as a "primary controller"), and other storage array controllers 110A-D (e.g., storage array controller 110B) may be designated as having secondary status (also referred to herein as a "secondary controller"). The primary controller may have particular rights, such as permission to alter data in persistent storage resources 170A-B (e.g., to write data to persistent storage resources 170A-B). At least some of the rights of the primary controller may supersede the rights of the secondary controller. For example, when the primary controller has the right to alter data in persistent storage resources 170A-B, the secondary controller may not have permission to alter data in persistent storage resources 170A-B. The status of the storage array controllers 110A-D may change. For example, storage array controller 110A may be designated as having secondary status and storage array controller 110B may be designated as having primary status.
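The primary and secondary designations above amount to a permission check plus a status change. The sketch below models this under assumptions: the enum, class, and function names are hypothetical, and the persistent storage resources are represented by a plain dictionary.

```python
from enum import Enum


class Status(Enum):
    PRIMARY = "primary"
    SECONDARY = "secondary"


class Controller:
    def __init__(self, name: str, status: Status):
        self.name = name
        self.status = status

    def write(self, resources: dict, key, value) -> None:
        # Only the controller currently holding primary status may alter stored data.
        if self.status is not Status.PRIMARY:
            raise PermissionError(f"{self.name} has secondary status and may not alter data")
        resources[key] = value


def swap_status(a: Controller, b: Controller) -> None:
    """Exchange designations, e.g. when the primary controller is taken out of service."""
    a.status, b.status = b.status, a.status


if __name__ == "__main__":
    resources = {}
    c110a = Controller("110A", Status.PRIMARY)
    c110b = Controller("110B", Status.SECONDARY)
    c110a.write(resources, "vol0", b"data")
    swap_status(c110a, c110b)  # 110B becomes primary, 110A becomes secondary
    c110b.write(resources, "vol0", b"newer data")
    print(resources)
```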

In some implementations, a primary controller, such as storage array controller 110A, may serve as the primary controller for one or more storage arrays 102A-B, and a secondary controller, such as storage array controller 110B, may serve as the secondary controller for the one or more storage arrays 102A-B. For example, storage array controller 110A may be the primary controller for storage arrays 102A and 102B, and storage array controller 110B may be the secondary controller for storage arrays 102A and 102B. In some implementations, the storage array controllers 110C and 110D (also referred to as "storage processing modules") may have neither primary nor secondary status. The storage array controllers 110C and 110D, implemented as storage processing modules, may serve as a communication interface between the primary and secondary controllers (e.g., storage array controllers 110A and 110B, respectively) and the storage array 102B. For example, the storage array controller 110A of the storage array 102A may send a write request to the storage array 102B via the SAN 158. The write request may be received by both storage array controllers 110C and 110D of storage array 102B. The storage array controllers 110C and 110D facilitate the communication, for example by sending the write request to the appropriate storage drives 171A-F. It may be noted that in some implementations, storage processing modules may be used to increase the number of storage drives controlled by the primary and secondary controllers.

In an implementation, the storage array controllers 110A-D are communicatively coupled, via a midplane (not shown), to one or more storage drives 171A-F and one or more NVRAM devices (not shown) that are included as part of the storage arrays 102A-B. The storage array controllers 110A-D may be coupled to the midplane via one or more data communication links, and the midplane may be coupled to the storage drives 171A-F and the NVRAM devices via one or more data communication links. The data communication links described herein are collectively illustrated by data communication links 108A-D and may include, for example, a Peripheral Component Interconnect Express ('PCIe') bus.

FIG. 1B illustrates an example system for data storage according to some embodiments. The storage array controller 101 illustrated in FIG. 1B may be similar to the storage array controllers 110A-D described with respect to FIG. 1A. In one example, storage array controller 101 may be similar to storage array controller 110A or storage array controller 110B. For purposes of illustration and not limitation, the storage array controller 101 includes numerous elements. It may be noted that the storage array controller 101 may include the same, more, or fewer elements configured in the same or different ways in other embodiments. It may be noted that elements of FIG. 1A may be included below to help illustrate features of the storage array controller 101.

The storage array controller 101 may include one or more processing devices 104 and random access memory ('RAM') 111. The processing device 104 (or controller 101) represents one or more general-purpose processing devices, such as a microprocessor, central processing unit, or the like. More specifically, the processing device 104 (or controller 101) may be a complex instruction set computing ('CISC') microprocessor, a reduced instruction set computing ('RISC') microprocessor, a very long instruction word ('VLIW') microprocessor, a processor implementing other instruction sets, or a processor implementing a combination of instruction sets. The processing device 104 (or controller 101) may also be one or more special-purpose processing devices, such as an ASIC, an FPGA, a digital signal processor ('DSP'), a network processor, or the like.

The processing device 104 may be connected to the RAM 111 via a data communication link 106, which may be embodied as a high-speed memory bus, such as a Double Data Rate 4 ('DDR4') bus. Stored in RAM 111 is an operating system 112. In some implementations, instructions 113 are stored in RAM 111. The instructions 113 may include computer program instructions for performing operations in a direct-mapped flash storage system. In one embodiment, a direct-mapped flash storage system is one that addresses data blocks within flash drives directly and without address translation performed by the storage controllers of the flash drives.

In an implementation, the storage array controller 101 includes one or more host bus adapters 103A-C coupled to the processing device 104 via data communication links 105A-C. In implementations, the host bus adapters 103A-C can be computer hardware that connects a host system (e.g., a storage array controller) to other networks and storage arrays. In some examples, host bus adapters 103A-C may be fibre channel adapters enabling storage array controller 101 to connect to a SAN, ethernet adapters enabling storage array controller 101 to connect to a LAN, or the like. The host bus adapters 103A-C may be coupled to the processing device 104 via data communication links 105A-C, such as, for example, a PCIe bus.

In an implementation, the storage array controller 101 may include a host bus adapter 114 coupled to the expander 115. Expander 115 may be used to attach host systems to a larger number of storage drives. Expander 115 may be, for example, a SAS expander for enabling host bus adapter 114 to be attached to a storage drive in embodiments where host bus adapter 114 is embodied as a SAS controller.

In an embodiment, the storage array controller 101 may include a switch 116 coupled to the processing device 104 via a data communication link 109. Switch 116 may be a computer hardware device that may create multiple endpoints from a single endpoint, thereby enabling multiple devices to share a single endpoint. For example, switch 116 may be a PCIe switch coupled to a PCIe bus (e.g., data communication link 109) and presenting a plurality of PCIe connection points to the midplane.

In an embodiment, the storage array controller 101 includes a data communication link 107 for coupling the storage array controller 101 to other storage array controllers. In some examples, data communication link 107 may be a QuickPath Interconnect ('QPI') interconnect.

A conventional storage system using a conventional flash drive may implement a process across flash drives that are part of the conventional storage system. For example, higher level processes of a storage system may initiate and control processes across flash drives. However, the flash drive of a conventional storage system may contain its own storage controller that also performs the process. Thus, for a conventional storage system, both higher level processes (e.g., initiated by the storage system) and lower level processes (e.g., initiated by the storage controller of the storage system) may be performed.

To address various drawbacks of conventional storage systems, operations may be performed by higher level processes rather than by lower level processes. For example, a flash storage system may include flash drives that do not include storage controllers that provide the process. Thus, the operating system of the flash storage system itself may initiate and control the process. This may be achieved by a direct-mapped flash storage system that addresses data blocks within the flash drives directly and without address translation performed by the storage controllers of the flash drives.

In an implementation, the storage drives 171A-F may be one or more partitioned storage devices. In some implementations, one or more of the partitioned storage devices may be a shingled HDD. In an implementation, the one or more storage devices may be flash-based SSDs. In a partitioned storage device, the partition namespaces on the partitioned storage device are addressable by groups of blocks that are grouped and aligned by natural size, thereby forming a number of addressable areas. In implementations utilizing SSDs, the natural size may be based on the erase block size of the SSD. In some embodiments, the region of the partitioned storage device may be defined during initialization of the partitioned storage device. In an embodiment, the region may be dynamically defined as data is written to the partitioned storage device.

In some implementations, regions may be heterogeneous, with some regions each being a page group and other regions being multiple page groups. In implementations, some regions may correspond to erase blocks and other regions may correspond to multiple erase blocks. In an implementation, for heterogeneous mixes of programming patterns, manufacturers, product types, and/or product generations of memory devices that apply to heterogeneous assemblies, upgrades, distributed storage, etc., a region may be any combination of different numbers of pages in a page group and/or erase block. In some embodiments, a region may be defined as having a usage characteristic, such as a property that supports data having a particular kind of lifetime (e.g., very short lifetime or very long lifetime). These properties may be used by the partitioned storage device to determine how the region will be managed over its expected lifetime.

It should be appreciated that the region is a virtual construct. Any particular region may not have a fixed location at the memory device. The region may not have any location at the memory device prior to allocation. The region may correspond to a number representing a chunk of virtual allocatable space, which in various implementations is the size of an erase block or other block size. When the system allocates or opens an area, the area is allocated to flash or other solid state memory, and when the system writes to the area, pages are written to mapped flash or other solid state memory of the partitioned storage device. When the system shuts down an area, the associated erase block or other sized block is completed. At some point in the future, the system may delete an area, which will free up allocated space for that area. During its lifetime, the region may be moved around to different locations of the partitioned storage device, for example, when the partitioned storage device is undergoing internal maintenance.

In embodiments, the regions of the partitioned storage device may be in different states. A region may be in an empty state, in which no data has been stored in the region. An empty region may be opened explicitly, or implicitly by writing data to the region. This is the initial state of the regions on a new partitioned storage device, but it may also be the result of a region reset. In some embodiments, an empty region may have a designated location within the flash memory of the partitioned storage device. In an embodiment, the location of an empty region may be selected the first time the region is opened or written to (or later, if writes are buffered in memory). A region may be in an open state, either implicitly or explicitly, and a region in the open state may be written with write or append commands to store data. In an embodiment, a copy command that copies data from a different region may also be used to write to a region that is in an open state. In some embodiments, a partitioned storage device may have a limit on the number of open regions at a particular time.

A region in the closed state is a region that has been partially written but has entered the closed state after an explicit close operation was issued. A region in the closed state may be left available for future writes, while some of the runtime overhead consumed by keeping the region in the open state is reduced. In embodiments, a partitioned storage device may have a limit on the number of closed regions at a particular time. A region in the complete state is a region that stores data and can no longer be written to. A region may enter the complete state either after writes have filled the entire region or as a result of a region completion operation. Before a completion operation, a region may or may not have been completely written. After a completion operation, however, the region may not be opened for further writing without first performing a region reset operation.
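
The region states described above can be sketched as a small state machine. This is a minimal illustration under stated assumptions, not the device's actual behavior: capacity is counted in pages, a closed region must be reopened explicitly before further writes, and the class and method names (`Region`, `append`, `complete`, `reset`) are hypothetical.

```python
from enum import Enum


class RegionState(Enum):
    EMPTY = "empty"
    OPEN = "open"
    CLOSED = "closed"
    COMPLETE = "complete"


class Region:
    """Toy model of one region; capacity is counted in pages (assumption)."""

    def __init__(self, capacity_pages: int):
        self.capacity_pages = capacity_pages
        self.pages = []
        self.state = RegionState.EMPTY

    def open(self):
        # Explicit open: allowed from the empty or the closed state.
        if self.state not in (RegionState.EMPTY, RegionState.CLOSED):
            raise RuntimeError(f"cannot open a region in state {self.state.value}")
        self.state = RegionState.OPEN

    def append(self, page):
        # Writing to an empty region opens it implicitly.
        if self.state is RegionState.EMPTY:
            self.state = RegionState.OPEN
        if self.state is not RegionState.OPEN:
            raise RuntimeError(f"cannot write a region in state {self.state.value}")
        self.pages.append(page)
        if len(self.pages) == self.capacity_pages:
            # Filling the entire region moves it to the complete state.
            self.state = RegionState.COMPLETE

    def close(self):
        # Explicit close of an open region.
        if self.state is not RegionState.OPEN:
            raise RuntimeError("only an open region can be closed")
        self.state = RegionState.CLOSED

    def complete(self):
        # Completion operation: no further writes until a reset.
        self.state = RegionState.COMPLETE

    def reset(self):
        # Reset discards the contents and returns the region to empty.
        self.pages.clear()
        self.state = RegionState.EMPTY
```

A limit on the number of open or closed regions, as mentioned above, could be enforced by whatever object owns the collection of `Region` instances; that bookkeeping is omitted here.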

The mapping from a region to an erase block (or to a shingled track in an HDD) may be arbitrary, dynamic, and hidden from view. Opening a region may be an operation that allows a new region to be dynamically mapped to the underlying storage of the partitioned storage device and then allows data to be written by appending writes to the region until the region reaches capacity. A region may be completed at any point, after which no further data can be written into the region. When the data stored in a region is no longer needed, the region may be reset, which effectively deletes the contents of the region from the partitioned storage device so that the physical storage held by the region becomes available for subsequent storage of data. Once a region has been written to and completed, the partitioned storage device ensures that data stored in the region is not lost until the region is reset. Between writing data to a region and resetting the region, the region may be moved between shingled tracks or erase blocks as part of maintenance operations within the partitioned storage device, for example, by copying data to keep the data refreshed or to handle memory cell aging in an SSD.

In embodiments utilizing an HDD, a reset of a region may allow the shingled tracks to be assigned to a new region that may be opened at some point in the future. In implementations utilizing SSDs, a reset of a region may result in the associated physical erase block of the region being erased and subsequently reused for storage of data. In some embodiments, the partitioned storage device may have a limit on the number of open regions at a point in time to reduce the amount of overhead dedicated to keeping regions open.

The operating system of the flash memory system may identify and maintain a list of allocation units across multiple flash drives of the flash memory system. The allocation unit may be an entire erase block or a plurality of erase blocks. The operating system may maintain a map or address range that directly maps addresses to erase blocks of a flash drive of the flash memory system.

An erase block that is mapped directly to a flash drive may be used to rewrite data and erase data. For example, an operation may be performed on one or more allocation units that include first data and second data, where the first data is to be retained and the second data is no longer used by the flash memory system. The operating system may initiate a process of writing first data to a new location within other allocation units and erasing second data and marking allocation units as available for subsequent data. Thus, the process may be performed by only the higher level operating system of the flash memory system without requiring additional lower level processes to be performed by the controller of the flash memory drive.
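
The process described above, in which retained data is rewritten elsewhere and the original allocation unit is erased and marked available, can be sketched as follows. This is a minimal model under assumptions: allocation units are represented as dictionaries, and the class `FlashSystem` and its methods are hypothetical names that do not appear in the disclosure.

```python
class FlashSystem:
    """Toy model of operating-system-level management of allocation units."""

    def __init__(self, unit_count: int):
        # Each allocation unit maps a data key to its stored bytes.
        self.units = {i: {} for i in range(unit_count)}
        self.free = set(range(unit_count))

    def write(self, unit: int, key: str, data: bytes) -> None:
        self.free.discard(unit)
        self.units[unit][key] = data

    def collect(self, victim: int, target: int, live_keys: set) -> None:
        """Copy data that is still needed to `target`, then erase `victim`."""
        if victim == target:
            raise ValueError("victim and target must differ")
        for key, data in self.units[victim].items():
            if key in live_keys:
                self.write(target, key, data)
        self.units[victim] = {}   # models erasing the allocation unit
        self.free.add(victim)     # the unit is available for subsequent data
```

For example, if unit 0 holds a retained entry and a stale entry, `collect(0, 1, {"first"})` moves only the retained entry to unit 1 and returns unit 0 to the free set.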

Advantages of the process being performed only by the operating system of the flash memory system include increasing the reliability of the flash drive of the flash memory system because no unnecessary or redundant write operations are performed during the process. One possible novel feature herein is the concept of starting and controlling processes on the operating system of a flash memory system. Additionally, the process may be controlled by the operating system across multiple flash drives. This is in contrast to the process being performed by the memory controller of the flash drive.

The storage system may consist of two storage array controllers sharing a set of drives for failover purposes, or it may consist of a single storage array controller providing storage services utilizing multiple drives, or it may consist of a distributed network of storage array controllers, each having a number of drives or a number of flash memory devices, wherein the storage array controllers in the network cooperate to provide complete storage services and cooperate in various aspects of storage services, including storage allocation and garbage collection.

FIG. 1C illustrates a third example system 117 for data storage according to some embodiments. For purposes of illustration and not limitation, system 117 (also referred to herein as a "storage system") includes many elements. It may be noted that system 117 may include the same, more, or fewer elements configured in the same or different ways in other embodiments.

In one embodiment, the system 117 includes a dual peripheral component interconnect ('PCI') flash memory device 118 with individually addressable fast write storage. The system 117 may include a memory device controller 119. In one embodiment, the memory device controllers 119A-D may be a CPU, ASIC, FPGA, or any other circuitry that may implement the control structures required in accordance with the present disclosure. In one embodiment, the system 117 includes flash memory devices (e.g., including flash memory devices 120a-n) that are operatively coupled to various channels of the memory device controller 119. The flash memory devices 120a-n may be presented to the controllers 119A-D as an addressable collection of flash pages, erase blocks, and/or control elements sufficient to allow the memory device controllers 119A-D to program and retrieve various aspects of the flash memory. In one embodiment, the memory device controllers 119A-D may perform operations on the flash memory devices 120a-n, including storing and retrieving the data content of pages, arranging and erasing any blocks, tracking statistics related to the use and reuse of flash pages, erase blocks, and cells, tracking and predicting error codes and faults within the flash memory, controlling voltage levels associated with programming and retrieving the contents of flash memory cells, and the like.

In one embodiment, the system 117 may include RAM 121 to store individually addressable fast write data. In one embodiment, RAM 121 may be one or more separate discrete devices. In another embodiment, RAM 121 may be integrated into the memory device controllers 119A-D or into multiple memory device controllers. RAM 121 may also be used for other purposes, such as serving as temporary program memory for a processing device (e.g., a CPU) in the memory device controller 119.

In one embodiment, the system 117 may include an energy storage device 122, such as a rechargeable battery or capacitor. The energy storage device 122 may store energy sufficient to power the memory device controller 119, an amount of RAM (e.g., RAM 121), and an amount of flash memory (e.g., flash memories 120 a-120 n) to have sufficient time to write the contents of the RAM to the flash memory. In one embodiment, if the storage device controller detects a loss of external power, the storage device controllers 119A-D may write the contents of the RAM to the flash memory.

In one embodiment, the system 117 includes two data communication links 123a, 123b. In one embodiment, the data communication links 123a, 123b may be PCI interfaces. In another embodiment, the data communication links 123a, 123b may be based on other communication standards (e.g., HyperTransport, InfiniBand, etc.). The data communication links 123a, 123b may be based on the non-volatile memory express ('NVMe') or NVMe over fabrics ('NVMf') specifications, which allow external connection to the storage device controllers 119A-D from other components in the storage system 117. It should be noted that, for convenience, a data communication link is interchangeably referred to herein as a PCI bus.

The system 117 may also include an external power source (not shown), which may be provided over one or both data communication links 123a, 123b, or may be provided separately. Alternative embodiments include a separate flash memory (not shown) dedicated to storing the contents of RAM 121. The memory device controllers 119A-D may present logical devices on the PCI bus, which may include an addressable fast write logical device, or distinct portions of the logical address space of the memory device 118, which may be presented as PCI memory or as persistent storage. In one embodiment, operations to store into the device are directed into RAM 121. In the event of a power failure, the storage device controllers 119A-D may write the stored content associated with the addressable fast write logical storage to flash memory (e.g., flash memories 120a-n) for long-term persistent storage.

In one embodiment, the logical device may include some presentation of some or all of the contents of the flash memory devices 120a-n, where the presentation allows a storage system (e.g., storage system 117) including storage device 118 to directly address flash pages and directly reprogram erase blocks from storage system components external to the storage device over the PCI bus. The presentation may also allow one or more external components to control and retrieve other aspects of the flash memory, including some or all of the following: tracking statistics related to the use and reuse of flash pages, erase blocks, and cells across all flash memory devices; tracking and predicting error codes and faults within and across the flash memory devices; controlling voltage levels associated with programming and retrieving the contents of flash memory cells; etc.

In one embodiment, the energy storage device 122 may be sufficient to ensure that ongoing operation of the flash memory devices 120a through 120n is completed. The energy storage device 122 may power the memory device controllers 119A-D and associated flash memory devices (e.g., 120 a-n) for the operations and for storing fast write RAM to flash memory. The energy storage device 122 may be used to store accumulated statistics and other parameters maintained and tracked by the flash memory devices 120a through n and/or the memory device controller 119. Individual capacitors or energy storage devices (e.g., smaller capacitors near or embedded within the flash memory device itself) may be used for some or all of the operations described herein.

Various schemes may be used to track and optimize the lifetime of the energy storage component, such as adjusting voltage levels over time, partially discharging the energy storage device 122 to measure corresponding discharge characteristics, and so forth. If the available energy decreases over time, the effective available capacity of the addressable fast write storage device may be reduced to ensure that it can be safely written to based on the currently available stored energy.
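
The reduction of effective fast write capacity as available energy declines can be illustrated with a simple budget calculation. This is only a sketch under assumptions: the disclosure does not give a formula, and the linear energy-per-byte model, the fixed reserve, the integer microjoule units, and all names below are hypothetical.

```python
def usable_fast_write_bytes(available_uj: int, reserve_uj: int,
                            uj_per_byte: int, max_bytes: int) -> int:
    """Bytes of fast write storage that can be flushed safely on power loss.

    Assumptions: energy is measured in integer microjoules, flushing costs
    a constant `uj_per_byte`, and `reserve_uj` is held back to keep the
    controller running during the flush.
    """
    if uj_per_byte <= 0:
        raise ValueError("uj_per_byte must be positive")
    budget = available_uj - reserve_uj
    if budget <= 0:
        return 0
    return min(max_bytes, budget // uj_per_byte)
```

As the measured `available_uj` falls over the life of the energy storage device, the returned capacity shrinks, which matches the behavior described above in spirit.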

FIG. 1D illustrates a third example storage system 124 for data storage according to some embodiments. In one embodiment, the storage system 124 includes storage controllers 125a, 125b. In one embodiment, the memory controllers 125a, 125b are operatively coupled to a dual PCI memory device. The storage controllers 125a, 125b are operably coupled (e.g., via a storage network 130) to some number of host computers 127 a-n.

In one embodiment, two storage controllers (e.g., 125a and 125b) provide storage services, such as a SCSI block storage array, a file server, an object server, a database or data analytics service, and the like. The storage controllers 125a, 125b may provide services to host computers 127a-n external to the storage system 124 through some number of network interfaces (e.g., 126a-d). The storage controllers 125a, 125b may provide integrated services or applications entirely within the storage system 124, forming a converged storage and compute system. The storage controllers 125a, 125b may utilize fast write memory within or across the storage devices 119a-d to record ongoing operations, to ensure that operations are not lost upon a power failure, storage controller removal, storage controller or storage system shutdown, or some failure of one or more software or hardware components within the storage system 124.

In one embodiment, the storage controllers 125a, 125b operate as PCI masters for one or the other of the PCI buses 128a, 128b. In another embodiment, the buses 128a and 128b may be based on other communication standards (e.g., HyperTransport, InfiniBand, etc.). Other storage system embodiments may operate the storage controllers 125a, 125b as multi-masters for both PCI buses 128a, 128b. Alternatively, a PCI/NVMe/NVMf switching infrastructure or fabric may connect multiple storage controllers. Some storage system embodiments may allow storage devices to communicate directly with each other, rather than only with a storage controller. In one embodiment, the memory device controller 119a may operate under direction from the storage controller 125a to synthesize and transfer data to be stored into the flash memory devices from data already stored in RAM (e.g., RAM 121 of FIG. 1C). For example, a recalculated version of the RAM content may be transferred after the storage controller has determined that an operation has been fully committed across the storage system, or when the fast write memory on the device has reached a certain used capacity, or after a certain amount of time, to ensure improved safety of the data or to release addressable fast write capacity for reuse. Such a mechanism may be used, for example, to avoid a second transfer from the storage controllers 125a, 125b over a bus (e.g., 128a, 128b). In one embodiment, the recalculation may include compressing the data, appending an index or other metadata, combining multiple data segments together, performing erasure code calculations, and so forth.

In one embodiment, under direction from the memory controllers 125a, 125b, the memory device controllers 119a, 119b are operable to calculate data from data stored in RAM (e.g., RAM 121 of fig. 1C) and transfer the data to other memory devices without involving the memory controllers 125a, 125b. This operation may be used to mirror data stored in one storage controller 125a to another storage controller 125b, or it may be used to offload compression, data aggregation, and/or erasure coding calculations and transfers to a storage device to reduce the load on the storage controllers or storage controller interfaces 129a, 129b to the PCI buses 128a, 128 b.

The storage device controllers 119A-D may include mechanisms for implementing high availability primitives for use by other portions of the storage system external to the dual-PCI storage device 118. For example, a reservation or exclusion primitive may be provided such that in a storage system having two storage controllers providing highly available storage services, one storage controller may prevent the other storage controller from accessing or continuing to access the storage device. This may be used, for example, if one controller detects that the other controller is not functioning properly or the interconnect between two storage controllers itself may not function properly.
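
A reservation or exclusion primitive of the kind described above can be sketched as follows. This is a minimal model under assumptions: the device tracks a single reservation holder, a new reservation pre-empts the previous one, and the names `StorageDevice`, `reserve`, and `write` are hypothetical.

```python
class StorageDevice:
    """Toy storage device that honors an exclusive reservation."""

    def __init__(self):
        self.reserved_by = None   # controller currently holding the reservation
        self.data = {}

    def reserve(self, controller_id: str) -> None:
        # Assumption: a new reservation pre-empts any previous holder, so a
        # healthy controller can fence off a misbehaving peer.
        self.reserved_by = controller_id

    def write(self, controller_id: str, key: str, value) -> None:
        if self.reserved_by is not None and self.reserved_by != controller_id:
            raise PermissionError(f"device is reserved by {self.reserved_by}")
        self.data[key] = value
```

In this sketch, once controller "B" reserves the device, writes from controller "A" are rejected until the reservation changes, which prevents "A" from continuing to access the device.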

In one embodiment, a storage system for use with dual PCI direct mapped storage devices with individually addressable fast write storage includes a system for managing erase blocks or groups of erase blocks as allocation units for storing data on behalf of a storage service, or for storing metadata (e.g., indexes, logs, etc.) associated with the storage service, or for the proper management of the storage system itself. Flash pages, which may be several kilobytes in size, may be written when data arrives or when the storage system is to hold data for a long period of time (e.g., exceeding a defined time threshold). To commit data faster, or to reduce the number of writes to the flash memory devices, the storage controller may first write the data to the individually addressable fast write storage on one or more storage devices.

In one embodiment, the memory controllers 125a, 125b may initiate the use of erase blocks within and across the memory devices (e.g., 118) according to the age and expected remaining life of the memory devices or based on other statistics. The memory controllers 125a, 125b may initiate garbage collection and data migration among the memory devices based on pages that are no longer needed, as well as manage flash memory pages and erase block life, and manage overall system performance.

In one embodiment, storage system 124 may utilize a mirroring and/or erasure coding scheme as part of storing data into addressable fast write storage and/or as part of writing data into allocation units associated with erase blocks. Erasure codes may be used across storage devices, as well as within erase blocks or allocation units, or within and across flash memory devices on a single storage device, to provide redundancy against single or multiple storage device failures, or to protect against internal corruption of flash pages caused by flash operations or flash cell degradation. Various levels of mirroring and erasure coding can be used to recover from multiple types of faults occurring alone or in combination.
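
To make the idea of redundancy across devices concrete, the following sketch uses single XOR parity. This is a deliberately simple stand-in: the disclosure does not specify a particular code, XOR parity tolerates the loss of only one shard per stripe (unlike the N+2 schemes mentioned in the background), and the function names are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_shards):
    """Return the stripe: the data shards followed by one parity shard."""
    return list(data_shards) + [xor_blocks(data_shards)]


def recover(stripe, missing_index):
    """Rebuild the shard at `missing_index` from all surviving shards."""
    survivors = [s for i, s in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)
```

Each shard of the stripe would be placed on a different storage device, so that the contents of any one failed device can be rebuilt from the others.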

The embodiments depicted with reference to fig. 2A-G illustrate a storage cluster storing user data (e.g., user data originating from one or more users or client systems or other sources external to the storage cluster). Storage clusters distribute user data across storage nodes housed within a chassis or across multiple chassis using erasure coding and redundant copies of metadata. Erasure coding refers to a method of data protection or reconstruction in which data is stored across a set of different locations (e.g., disks, storage nodes, or geographic locations). Flash memory is one type of solid state memory that may be integrated with embodiments, but embodiments may be extended to other types of solid state memory or other storage media, including non-solid state memory. Control of storage locations and workloads is distributed across storage locations in clustered peer-to-peer systems. Tasks such as mediating communications between the various storage nodes, detecting when a storage node becomes unavailable, and balancing I/O (input and output) across the various storage nodes are all handled on a distributed basis. In some embodiments, data is arranged or distributed across multiple storage nodes in the form of data segments or stripes that support data recovery. Independent of the input and output modes, ownership of the data may be reassigned within the cluster. Such an architecture, described in more detail below, allows storage nodes in the cluster to fail while the system remains operational because data can be reconstructed from other storage nodes and thus remain available for input and output operations. In various embodiments, a storage node may be referred to as a cluster node, a blade, or a server.

The storage cluster may be housed within a chassis (i.e., an enclosure that houses one or more storage nodes). Included within the chassis are a mechanism for providing power to each storage node (e.g., a power distribution bus) and a communication mechanism that enables communication between the storage nodes (e.g., a communication bus). According to some embodiments, the storage cluster may operate as a stand-alone system in one location. In one embodiment, the chassis contains at least two instances of both the power distribution bus and the communication bus, which can be independently enabled or disabled. The internal communication bus may be an Ethernet bus; however, other technologies (e.g., PCIe, InfiniBand, and others) are equally applicable. The chassis provides ports for an external communication bus to enable communication between multiple chassis and with client systems, either directly or through a switch. External communication may use technologies such as Ethernet, InfiniBand, Fibre Channel, etc. In some embodiments, the external communication bus uses different communication bus technologies for inter-chassis and client communication. If a switch is deployed within a chassis or between chassis, the switch may act as a translator between multiple protocols or technologies. When multiple chassis are connected to define a storage cluster, the storage cluster may be accessed by a client using a proprietary or standard interface (e.g., network file system ('NFS'), common internet file system ('CIFS'), small computer system interface ('SCSI'), or hypertext transfer protocol ('HTTP')). Translation from the client protocol may occur at the switch, at the chassis external communication bus, or within each storage node. In some embodiments, multiple chassis may be coupled or connected to each other through an aggregator switch. A portion and/or all of the coupled or connected chassis may be designated as a storage cluster. 
As discussed above, each chassis may have multiple blades, each with a media access control ('MAC') address, but in some embodiments the storage cluster appears to the external network as having a single cluster IP address and a single MAC address.

Each storage node may be one or more storage servers, and each storage server is connected to one or more non-volatile solid state storage units, which may be referred to as storage units or storage devices. One embodiment includes a single storage server in each storage node and between one and eight non-volatile solid state storage units, although this one example is not meant to be limiting. The storage server may include a processor, DRAM, and interfaces for the internal communication bus and for power distribution from each of the power buses. In some embodiments, the interfaces and the storage units share a communication bus, such as PCI Express, within the storage node. A non-volatile solid state storage unit may directly access the internal communication bus interface through the storage node communication bus, or may request the storage node to access the bus interface. The non-volatile solid state storage unit contains an embedded CPU, a solid state storage controller, and a quantity of solid state mass storage, such as between 2 and 32 terabytes ('TB') in some embodiments. An embedded volatile storage medium, such as DRAM, and an energy reserve device are included in the non-volatile solid state storage unit. In some embodiments, the energy reserve device is a capacitor, supercapacitor, or battery that enables the transfer of a subset of the DRAM content to a stable storage medium in the event of a power loss. In some embodiments, the non-volatile solid state storage unit is constructed with storage class memory, such as phase change memory or magnetoresistive random access memory ('MRAM'), that replaces DRAM and allows a reduced power hold-up apparatus.

One of the many features of storage nodes and non-volatile solid state storage devices is the ability to actively reconstruct data in a storage cluster. The storage nodes and non-volatile solid state storage devices may determine when a storage node or non-volatile solid state storage device in a storage cluster is unreachable, regardless of whether an attempt is made to read data related to the storage node or non-volatile solid state storage device. The storage nodes and non-volatile solid state storage devices then cooperate to recover and reconstruct data in at least a portion of the new locations. This constitutes an active rebuild in that the system does not need to wait until a read access initiated from a client system employing the storage cluster requires the data to be rebuilt. These and further details of the memory and its operation are discussed below.

FIG. 2A is a perspective view of a storage cluster 161 according to some embodiments, the storage cluster 161 having a plurality of storage nodes 150 and internal solid state memory coupled to each storage node to provide network attached storage or a storage area network. A network attached storage, storage area network, or storage cluster, or other storage, may include one or more storage clusters 161, each storage cluster 161 having one or more storage nodes 150 in a flexible and reconfigurable arrangement of both the physical components and the amount of storage memory provided thereby. The storage cluster 161 is designed to fit in a rack, and one or more racks may be set up and populated as needed for the storage memory. The storage cluster 161 has a chassis 138 with a plurality of slots 142. It should be appreciated that the chassis 138 may be referred to as a housing, enclosure, or rack unit. In one embodiment, the chassis 138 has 14 slots 142, although other numbers of slots are readily devised. For example, some embodiments have four slots, eight slots, sixteen slots, thirty-two slots, or another suitable number of slots. In some embodiments, each slot 142 may house one storage node 150. The chassis 138 includes a bezel 148 that may be used to mount the chassis 138 to a rack. A fan 144 provides air circulation for cooling the storage nodes 150 and their components, although other cooling components may be used, or embodiments without cooling components may be devised. The switch fabric 146 couples the storage nodes 150 within the chassis 138 together and to a network for communication with the memory. In the embodiment depicted herein, for illustration purposes, the slots 142 to the left of the switch fabric 146 and fan 144 are shown occupied by storage nodes 150, while the slots 142 to the right of the switch fabric 146 and fan 144 are empty and available for insertion of storage nodes 150. 
This configuration is one example, and one or more storage nodes 150 may occupy the slots 142 in various other arrangements. In some embodiments, the storage node arrangements need not be sequential or adjacent. The storage nodes 150 are hot-pluggable, meaning that a storage node 150 may be inserted into a slot 142 in the chassis 138 or removed from a slot 142 without stopping or shutting down the system. After a storage node 150 is inserted into or removed from a slot 142, the system automatically reconfigures to recognize and accommodate the change. In some embodiments, reconfiguring includes restoring redundancy and/or rebalancing data or load.

Each storage node 150 may have multiple components. In the embodiment shown herein, the storage node 150 includes a printed circuit board 159 populated by a CPU 156 (i.e., a processor), a memory 154 coupled to the CPU 156, and a non-volatile solid state storage 152 coupled to the CPU 156, although other installations and/or components may be used in other embodiments. The memory 154 has instructions executed by the CPU 156 and/or data operated on by the CPU 156. As further explained below, the non-volatile solid-state storage 152 includes flash memory, or in other embodiments, other types of solid-state memory.

Referring to FIG. 2A, storage cluster 161 is scalable, meaning that storage capacity with non-uniform storage sizes is easily added, as described above. In some embodiments, one or more storage nodes 150 may be inserted into or removed from each chassis, and the storage cluster self-configures. Plug-in storage nodes 150, whether installed in a chassis as delivered or added later, may be of different sizes. For example, in one embodiment, a storage node 150 may have any multiple of 4TB, such as 8TB, 12TB, 16TB, 32TB, and so on. In other embodiments, a storage node 150 may have any multiple of other storage amounts or capacities. The storage capacity of each storage node 150 is broadcast and influences decisions about how to stripe the data. For maximum storage efficiency, embodiments may self-configure as wide as possible in a stripe, subject to a predetermined requirement of continued operation with the loss of up to one, or up to two, non-volatile solid state storage 152 units or storage nodes 150 within the chassis.
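
The trade-off between stripe width and tolerated losses can be illustrated with a short calculation. This sketch assumes one shard per storage unit and ignores differences in unit capacity, which the actual striping decision takes into account; the function name and parameters are hypothetical.

```python
def widest_stripe(unit_count: int, tolerated_losses: int):
    """Return (data shards, redundancy shards, storage efficiency).

    Assumption: the stripe spans every available unit with one shard per
    unit, reserving `tolerated_losses` shards for redundancy.
    """
    if tolerated_losses not in (1, 2):
        raise ValueError("this sketch only models loss of one or two units")
    data_shards = unit_count - tolerated_losses
    if data_shards < 1:
        raise ValueError("not enough units for the requested redundancy")
    return data_shards, tolerated_losses, data_shards / unit_count
```

Under these assumptions, ten units with tolerance for two losses yield eight data shards and two redundancy shards, so 80 percent of raw capacity holds data; a narrower stripe with the same redundancy would be less efficient.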

Fig. 2B is a block diagram showing a communication interconnect 173 and a power distribution bus 172 coupling a plurality of storage nodes 150. Referring back to fig. 2A, in some embodiments, the communication interconnect 173 may be included in the switch fabric 146 or implemented with the switch fabric 146. In some embodiments, where multiple storage clusters 161 occupy racks, communication interconnect 173 may be included in or implemented with a top-of-rack switch. As illustrated in FIG. 2B, storage clusters 161 are enclosed within a single chassis 138. External port 176 is coupled to storage node 150 through communication interconnect 173, while external port 174 is coupled directly to the storage node. An external power port 178 is coupled to the power distribution bus 172. Storage nodes 150 may include different amounts and different capacities of non-volatile solid-state storage 152 as described with reference to fig. 2A. Additionally, as illustrated in fig. 2B, one or more storage nodes 150 may be compute-only storage nodes. The authority 168 is implemented on the non-volatile solid state storage 152, for example, as a list or other data structure stored in memory. In some embodiments, the authority is stored within the non-volatile solid state storage 152 and is supported by software executing on a controller or other processor of the non-volatile solid state storage 152. In other embodiments, the authority 168 is implemented on the storage node 150, for example as a list or other data structure stored in the memory 154, and is supported by software executing on the CPU 156 of the storage node 150. In some embodiments, the authority 168 controls how and where data is stored in the non-volatile solid state storage 152. This control helps determine which type of erasure coding scheme is applied to the data and which storage nodes 150 have which portions of the data. Each authority 168 may be assigned to a non-volatile solid state storage 152. 
In various embodiments, each authority may control a series of inode numbers, segment numbers, or other data identifiers assigned to data by the file system, by the storage node 150, or by the non-volatile solid state storage 152.

In some embodiments, each piece of data and each piece of metadata has redundancy in the system. In addition, each piece of data and each piece of metadata has an owner, which may be referred to as an authority. If the authority is unreachable, for example due to the failure of a storage node, there is a plan of succession for how to find the data or the metadata. In various embodiments, there are redundant copies of the authorities 168. In some embodiments, an authority 168 is associated with the storage node 150 and the non-volatile solid state storage 152. Each authority 168, covering a range of data segment numbers or other identifiers of data, may be assigned to a particular non-volatile solid state storage 152. In some embodiments, the authorities 168 for all of these ranges are distributed over the non-volatile solid state storage 152 of the storage cluster. Each storage node 150 has a network port that provides access to the non-volatile solid state storage 152 of that storage node 150. In some embodiments, data may be stored in segments associated with segment numbers, and a segment number is an indirection for the configuration of a RAID (redundant array of independent disks) stripe. Thus, the assignment and use of authorities 168 establishes an indirection to data. According to some embodiments, indirection may be referred to as the ability to reference data indirectly (in this case via an authority 168). A segment identifies a set of non-volatile solid state storage 152 and a local identifier into the set of non-volatile solid state storage 152 that may contain data. In some embodiments, the local identifier is an offset into the device and may be reused sequentially by multiple segments. In other embodiments, the local identifier is unique to a particular segment and is never reused. 
The offset in the non-volatile solid state storage 152 is applied to locate data (in the form of RAID stripes) for writing to the non-volatile solid state storage 152 or reading from the non-volatile solid state storage 152. Data is striped across multiple units of non-volatile solid state storage 152, which may include or be different from non-volatile solid state storage 152 having authorities 168 for particular segments of data.

If the location of a particular piece of data changes, such as during a data move or a data reconstruction, the authority 168 for that piece of data should be consulted at the non-volatile solid state storage 152 or storage node 150 having that authority 168. To locate a particular piece of data, embodiments calculate a hash value for a data segment or apply an inode number or a data segment number. The output of this operation points to the non-volatile solid state storage 152 having the authority 168 for that particular piece of data. In some embodiments, this operation has two phases. The first phase maps an entity identifier (ID), e.g., a segment number, inode number, or directory number, to an authority identifier. This mapping may include a calculation such as a hash or a bit mask. The second phase maps the authority identifier to a particular non-volatile solid state storage 152, which may be done through an explicit mapping. The operation is repeatable, so that when the calculation is performed, its result repeatably and reliably points to the particular non-volatile solid state storage 152 having that authority 168. The operation may take the set of reachable storage nodes as input. If the set of reachable non-volatile solid state storage units changes, the optimal set changes. In some embodiments, the persisted value is the current assignment (which is always true) and the calculated value is the target assignment that the cluster will attempt to reconfigure towards. Such a calculation may be used to determine the optimal non-volatile solid state storage 152 for an authority in the presence of a set of non-volatile solid state storage 152 that are reachable and constitute the same cluster. The calculation also determines an ordered set of peer non-volatile solid state storage 152 that also record the mapping of authorities to non-volatile solid state storage, so that the authority can be determined even if the assigned non-volatile solid state storage is unreachable. In some embodiments, a duplicate or substitute authority 168 may be consulted if a particular authority 168 is unavailable.
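The two-phase lookup described above can be sketched as follows. This is a minimal illustration only, not the claimed implementation: the function names, the choice of SHA-256, the constant NUM_AUTHORITIES, and the shape of the explicit map (an ordered candidate list per authority) are all assumptions made for the example.

```python
import hashlib

# Assumption for illustration: a fixed, power-of-two number of authorities,
# so that a bit mask can reduce the hash to an authority identifier.
NUM_AUTHORITIES = 128


def entity_to_authority(entity_id: str) -> int:
    """Phase 1: map an entity ID (segment, inode, or directory number)
    to an authority identifier using a hash followed by a bit mask."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") & (NUM_AUTHORITIES - 1)


def authority_to_storage(
    authority_id: int,
    explicit_map: dict[int, list[str]],
    reachable: set[str],
) -> str:
    """Phase 2: map the authority identifier to a storage unit through an
    explicit mapping. Each entry is an ordered list (assigned unit first,
    then peers), so a lookup still succeeds if the assigned unit is down."""
    for unit in explicit_map[authority_id]:
        if unit in reachable:
            return unit
    raise LookupError(f"no reachable storage unit for authority {authority_id}")
```

Because both phases are pure functions of their inputs, any node that holds the same map and sees the same set of reachable units arrives at the same answer, which is the repeatability property the paragraph relies on.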

Referring to fig. 2A and 2B, two of the many tasks of the CPU 156 on a storage node 150 are to break up write data and to reassemble read data. When the system has determined that data is to be written, the authority 168 for that data is located as described above. Once the segment ID for the data has been determined, the write request is forwarded to the non-volatile solid state storage 152 currently determined to be the host of the authority 168 determined from the segment. The host CPU 156 of the storage node 150 on which that non-volatile solid state storage 152 and the corresponding authority 168 reside then breaks up, or shards, the data and transmits it to various non-volatile solid state storage 152. The transmitted data is written as a data stripe in accordance with an erasure coding scheme. In some embodiments, the data is requested to be pulled, and in other embodiments, the data is pushed. In reverse, when data is read, the authority 168 for the segment ID containing the data is located as described above. The host CPU 156 of the storage node 150 on which the non-volatile solid state storage 152 and the corresponding authority 168 reside requests the data from the non-volatile solid state storage and corresponding storage nodes pointed to by the authority. In some embodiments, the data is read from flash storage as a data stripe. The host CPU 156 of the storage node 150 then reassembles the read data, correcting any errors according to the appropriate erasure coding scheme, and forwards the reassembled data to the network. In other embodiments, some or all of these tasks may be handled in the non-volatile solid state storage 152. In some embodiments, the segment host requests that the data be sent to storage node 150 by requesting pages from storage and then sending the data to the storage node making the original request.

In an embodiment, an authority 168 operates to determine how operations will proceed against particular logical elements. Each of the logical elements may be operated on through a particular authority across a plurality of storage controllers of the storage system. The authority 168 may communicate with the plurality of storage controllers so that the plurality of storage controllers collectively perform operations against those particular logical elements.

In an embodiment, the logical element may be, for example, a file, a directory, an object bucket, an individual object, a delineated part of a file or object, another form of key-value pair database, or a table. In embodiments, performing an operation may involve, for example, ensuring consistency, structural integrity, and/or recoverability with respect to other operations on the same logical element, reading the metadata and data associated with that logical element, determining which data should be written durably into the storage system to persist any changes made by the operation, or determining where the metadata and data are to be stored across the modular storage devices attached to the multiple storage controllers in the storage system.

In some embodiments, the operations are token-based transactions to efficiently communicate within the distributed system. Each transaction may be accompanied by or associated with a token that gives permission to execute the transaction. In some embodiments, the authority 168 is able to maintain the pre-transaction state of the system until the operation is complete. Token-based communication may be accomplished without a global lock across the system, and may also be capable of restarting operation in the event of an interrupt or other failure.

In some systems, for example UNIX-style file systems, data is handled with an index node, or inode, which specifies a data structure that represents an object in a file system. The object may be, for example, a file or a directory. Metadata may accompany the object as attributes such as permission data and a creation timestamp, among other attributes. A segment number may be assigned to all or a portion of such an object in a file system. In other systems, data segments are handled with segment numbers assigned elsewhere. For purposes of discussion, the unit of distribution is an entity, and an entity may be a file, a directory, or a segment. That is, entities are units of data or metadata stored by a storage system. Entities are grouped into sets called authorities. Each authority has an authority owner, which is a storage node that has the exclusive right to update the entities in the authority. In other words, a storage node contains the authority, and the authority in turn contains entities.

According to some embodiments, a segment is a logical container of data. A segment is an address space between the media address space and the physical flash locations; that is, data segment numbers are in this address space. Segments may also contain metadata that enables data redundancy to be restored (rewritten to different flash locations or devices) without the involvement of higher-level software. In one embodiment, the internal format of a segment contains client data and media mappings that determine the position of that data. Where applicable, each data segment is protected, e.g., from memory and other failures, by breaking the segment into a number of data and parity shards. The data and parity shards are distributed, i.e., striped, across the non-volatile solid state storage 152 coupled to the host CPUs 156 (see fig. 2E and 2G) in accordance with an erasure coding scheme. In some embodiments, the term segment refers to the container and its place in the address space of segments. According to some embodiments, the term stripe refers to the same set of shards as a segment and includes how the shards are distributed along with redundancy or parity information.
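As a rough illustration of breaking a segment into data and parity pieces, the sketch below uses single XOR parity. This is a deliberately simplified stand-in: the description refers to erasure coding schemes such as Reed-Solomon, which tolerate more than one loss, whereas XOR parity tolerates exactly one. The function names and the zero-padding rule are assumptions made for the example.

```python
from functools import reduce
from typing import Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_shards(segment: bytes, k: int) -> list[bytes]:
    """Split a segment into k equal-size data shards plus one parity shard.
    The segment is zero-padded so that it divides evenly (an assumption)."""
    shard_len = -(-len(segment) // k)  # ceiling division
    padded = segment.ljust(shard_len * k, b"\0")
    data = [padded[i * shard_len:(i + 1) * shard_len] for i in range(k)]
    parity = reduce(xor_bytes, data)
    return data + [parity]


def recover(shards: list[Optional[bytes]]) -> list[bytes]:
    """Rebuild at most one missing shard (marked None) by XOR-ing the rest."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) > 1:
        raise ValueError("single XOR parity can rebuild only one lost shard")
    result = list(shards)
    if missing:
        present = [s for s in shards if s is not None]
        result[missing[0]] = reduce(xor_bytes, present)
    return result
```

In a stripe, each returned shard would be written to a different storage unit, so that the loss of any one unit leaves enough shards to rebuild the segment.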

A series of address space translations takes place across the entire storage system. At the top is a directory entry (file name) that links to an inode. The inode points into the media address space, where data is logically stored. Media addresses may be mapped through a series of indirect media to spread the load of large files or to implement data services such as deduplication or snapshots. Segment addresses are then translated into physical flash locations. According to some embodiments, physical flash locations have an address range bounded by the amount of flash in the system. Media addresses and segment addresses are logical containers, and in some embodiments use identifiers of 128 bits or larger so as to be practically infinite, with the likelihood of reuse calculated to be longer than the expected life of the system. In some embodiments, addresses from logical containers are allocated in a hierarchical fashion. Initially, each non-volatile solid state storage 152 unit may be assigned a range of the address space. Within this assigned range, the non-volatile solid state storage 152 is able to allocate addresses without synchronization with other non-volatile solid state storage 152.
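The hierarchical allocation described above, in which each storage unit receives a range and then allocates from it locally, might look like the following sketch. The class name, the 128-bit total width, and the split of 32 high bits for the unit index and 96 low bits for local allocation are assumptions chosen only to make the example concrete.

```python
ADDRESS_BITS = 128  # identifier width mentioned in the description
LOCAL_BITS = 96     # assumption: low bits allocated locally by each unit


class UnitAddressAllocator:
    """Hands out addresses from a range assigned to one storage unit.
    No coordination with other units is needed because ranges are disjoint."""

    def __init__(self, unit_index: int) -> None:
        if not 0 <= unit_index < (1 << (ADDRESS_BITS - LOCAL_BITS)):
            raise ValueError("unit index does not fit in the high bits")
        self.base = unit_index << LOCAL_BITS
        self.limit = (unit_index + 1) << LOCAL_BITS
        self.next_address = self.base

    def allocate(self) -> int:
        """Return the next unused address in this unit's range."""
        if self.next_address >= self.limit:
            raise RuntimeError("address range exhausted")
        address = self.next_address
        self.next_address += 1
        return address
```

Two allocators created with different unit indices can never return the same address, which is what lets each unit allocate without synchronizing with the others.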

Data and metadata are stored by a set of underlying storage layouts that are optimized for varying workload patterns and storage devices. These layouts incorporate multiple redundancy schemes, compression formats, and index algorithms. Some of these layouts store information about authorities and authority masters, while others store file metadata and file data. The redundancy schemes include error correction codes that tolerate corrupted bits within a single storage device (e.g., a NAND flash chip), erasure codes that tolerate the failure of multiple storage nodes, and replication schemes that tolerate data center or regional failures. In some embodiments, a low density parity check ('LDPC') code is used within a single storage unit. In some embodiments, Reed-Solomon encoding is used within a storage cluster, and mirroring is used within a storage grid. Metadata may be stored using an ordered log structured index (e.g., a log structured merge tree), and large data may not be stored in a log structured layout.

To maintain consistency across multiple copies of an entity, the storage nodes agree implicitly, through calculation, on two things: (1) the authority that contains the entity, and (2) the storage node that contains the authority. The assignment of entities to authorities may be done by pseudo-randomly assigning entities to authorities, by splitting entities into ranges based on an externally produced key, or by placing a single entity into each authority. Examples of pseudo-random schemes are linear hashing and the Replication Under Scalable Hashing ('RUSH') family of hashes, including Controlled Replication Under Scalable Hashing ('CRUSH'). In some embodiments, pseudo-random assignment is used only for assigning authorities to nodes, because the set of nodes can change. The set of authorities cannot change, so any subjective function may be applied in these embodiments. Some placement schemes automatically place authorities on storage nodes, while other placement schemes rely on an explicit mapping of authorities to storage nodes. In some embodiments, a pseudo-random scheme is used to map from each authority to a set of candidate authority owners. A pseudo-random data distribution function related to CRUSH may assign authorities to storage nodes and create a list of where the authorities are assigned. Each storage node has a copy of the pseudo-random data distribution function and can arrive at the same calculation for distributing, and later finding or locating, an authority. In some embodiments, each of the pseudo-random schemes requires the reachable set of storage nodes as input in order to conclude the same target nodes. Once an entity has been placed in an authority, the entity may be stored on physical devices so that no expected failure will lead to unexpected data loss. In some embodiments, rebalancing algorithms attempt to store the copies of all entities within an authority in the same layout and on the same set of machines.
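To illustrate how every node can derive the same candidate owners from the same set of reachable nodes, the sketch below uses rendezvous (highest-random-weight) hashing. This is a substitute technique chosen for brevity, not RUSH or CRUSH, and the function name, the SHA-256 scoring, and the default of three candidates are assumptions made for the example.

```python
import hashlib


def candidate_owners(
    authority_id: int, reachable_nodes: set[str], count: int = 3
) -> list[str]:
    """Return an ordered list of candidate authority owners.

    Every node scores each (authority, node) pair with the same hash and
    ranks the reachable nodes by score, so all nodes that see the same
    reachable set compute the same list without exchanging messages."""

    def score(node: str) -> int:
        digest = hashlib.sha256(f"{authority_id}:{node}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    # The node name breaks ties so the ordering is fully deterministic.
    ranked = sorted(reachable_nodes, key=lambda n: (score(n), n), reverse=True)
    return ranked[:count]
```

A useful property of this stand-in is that removing one node only promotes the nodes ranked below it; the relative order of the surviving candidates does not change, so little authority movement is triggered when the reachable set shrinks.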

Examples of expected failures include device failures, stolen machines, data center fires, and regional disasters, such as nuclear or geological events. Different failures lead to different levels of acceptable data loss. In some embodiments, a stolen storage node affects neither the security nor the reliability of the system, while, depending on the system configuration, a regional event could lead to no loss of data, a few seconds or minutes of lost updates, or even complete data loss.

In an embodiment, the placement of data for storage redundancy is independent of the placement of authorities for data consistency. In some embodiments, storage nodes that contain authorities do not contain any persistent storage. Instead, those storage nodes are connected to non-volatile solid state storage units that do not contain authorities. The communications interconnect between storage nodes and non-volatile solid state storage units consists of multiple communication technologies and has non-uniform performance and fault tolerance characteristics. In some embodiments, as mentioned above, non-volatile solid state storage units are connected to storage nodes via PCI express, storage nodes are connected together within a single chassis using an Ethernet backplane, and chassis are connected together to form a storage cluster. In some embodiments, storage clusters are connected to clients using Ethernet or Fibre Channel. If multiple storage clusters are configured into a storage grid, the storage clusters are connected using the Internet or other long-distance networking links (e.g., a "metro scale" link or a private link that does not traverse the Internet).

Authority owners have the exclusive right to modify entities, to migrate entities from one non-volatile solid state storage unit to another, and to add and remove copies of entities. This allows the redundancy of the underlying data to be maintained. When an authority owner fails, is about to be decommissioned, or is overloaded, the authority is transferred to a new storage node. Transient failures make it non-trivial to ensure that all non-faulty machines agree on the new authority location. The ambiguity that arises from transient failures may be resolved automatically by a consensus protocol such as Paxos or by hot-cold failover schemes, or via manual intervention by a remote system administrator or by a local hardware administrator (e.g., by physically removing the failed machine from the cluster, or by pressing a button on the failed machine). In some embodiments, a consensus protocol is used and failover is automatic. According to some embodiments, if too many failures or replication events occur in too short a time period, the system goes into a self-preservation mode and halts replication and data movement activities until an administrator intervenes.

As authorities are transferred between storage nodes and authority owners update the entities in their authorities, the system transfers messages between the storage nodes and the non-volatile solid state storage units. With regard to persistent messages, messages that have different purposes are of different types. Depending on the type of message, the system maintains different ordering and durability guarantees. As the persistent messages are processed, they are temporarily stored in multiple durable and non-durable storage hardware technologies. In some embodiments, messages are stored in RAM, NVRAM, and on NAND flash devices, and a variety of protocols are used in order to make efficient use of each storage medium. Latency-sensitive client requests may be persisted in replicated NVRAM and then later in NAND, while background rebalancing operations are persisted directly to NAND.

Persistent messages are persistently stored before being transmitted. This allows the system to continue to serve client requests despite failures and component replacement. Although many hardware components contain unique identifiers that are visible to system administrators, manufacturers, the hardware supply chain, and the ongoing monitoring and quality control infrastructure, applications running on top of the infrastructure use virtualized addresses. These virtualized addresses do not change over the lifetime of the storage system, regardless of component failures and replacements. This allows each component of the storage system to be replaced over time without reconfiguration or disruption of client request processing, i.e., the system supports non-disruptive upgrades.

In some embodiments, virtualized addresses are stored with sufficient redundancy. The continuous monitoring system correlates hardware and software status with hardware identifiers. This allows for detection and prediction of faults due to faulty components and manufacturing details. In some embodiments, the monitoring system also enables proactive diversion of authorities and entities away from affected devices before failure occurs by removing components from the critical path.

Fig. 2C is a multi-level block diagram showing the contents of a storage node 150 and the contents of a non-volatile solid state storage 152 of the storage node 150. In some embodiments, data is communicated to and from the storage node 150 by a network interface controller ('NIC') 202. Each storage node 150 has a CPU 156 and one or more non-volatile solid state storage 152, as discussed above. Moving down one level in fig. 2C, each non-volatile solid state storage 152 has relatively fast non-volatile solid state memory, such as non-volatile random access memory ('NVRAM') 204, and flash memory 206. In some embodiments, NVRAM 204 may be a component that does not require program/erase cycles (DRAM, MRAM, PCM), and may be a memory that supports being written more often than it is read. Moving down another level in fig. 2C, the NVRAM 204 is implemented in one embodiment as high-speed volatile memory, such as dynamic random access memory (DRAM) 216, backed up by an energy reserve 218. The energy reserve 218 provides sufficient electrical power to keep the DRAM 216 powered long enough for its contents to be transferred to the flash memory 206 in the event of a power failure. In some embodiments, the energy reserve 218 is a capacitor, supercapacitor, battery, or other device that supplies enough energy to enable the transfer of the contents of the DRAM 216 to a stable storage medium in the case of power loss. The flash memory 206 is implemented as multiple flash dies 222, which may be referred to as packages of flash dies 222 or an array of flash dies 222. It should be appreciated that the flash dies 222 may be packaged in any number of ways: a single die per package, multiple dies per package (i.e., multi-chip packages), in hybrid packages, as dies on a printed circuit board or other substrate, as encapsulated dies, etc. In the embodiment shown, the non-volatile solid state storage 152 has a controller 212 or other processor, and an input output (I/O) port 210 coupled to the controller 212. The I/O port 210 is coupled to the CPU 156 and/or the network interface controller 202 of the storage node 150. A flash input output (I/O) port 220 is coupled to the flash dies 222, and a direct memory access unit (DMA) 214 is coupled to the controller 212, the DRAM 216, and the flash dies 222. In the embodiment shown, the I/O port 210, controller 212, DMA unit 214, and flash I/O port 220 are implemented on a programmable logic device ('PLD') 208, e.g., an FPGA. In this embodiment, each flash die 222 has pages, organized as 16 kB (kilobyte) pages 224, and registers 226 through which data can be written to or read from the flash die 222. In other embodiments, other types of solid state memory are used in place of, or in addition to, the flash memory illustrated within flash die 222.

In various embodiments disclosed herein, a storage cluster 161 may be contrasted with storage arrays in general. The storage nodes 150 are part of a collection that creates the storage cluster 161. Each storage node 150 owns a slice of data and the computing required to provide that data. Multiple storage nodes 150 cooperate to store and retrieve the data. Storage memory or storage devices, as generally used in storage arrays, are less involved with processing and manipulating the data. Storage memory or storage devices in a storage array receive commands to read, write, or erase data. They are not aware of the larger system in which they are embedded, or of what the data means. Storage memory or storage devices in storage arrays may include various types of storage memory, such as RAM, solid state drives, hard disk drives, etc. The non-volatile solid state storage 152 units described herein have multiple interfaces that are active simultaneously and serve multiple purposes. In some embodiments, some of the functionality of a storage node 150 is shifted into a storage unit 152, transforming the storage unit 152 into a combination of storage unit 152 and storage node 150. Placing computing (relative to storage data) into the storage unit 152 places this computing closer to the data itself. The various system embodiments have a hierarchy of storage node layers with different capabilities. By contrast, in a storage array, a controller owns and knows everything about all of the data that it manages in a rack or in storage devices. In a storage cluster 161, as described herein, multiple controllers in multiple non-volatile solid state storage 152 units and/or storage nodes 150 cooperate in various ways (e.g., for erasure coding, data sharding, metadata communication and redundancy, storage capacity expansion or contraction, data recovery, and so on).

Fig. 2D shows a storage server environment that uses embodiments of the storage nodes 150 and storage 152 units of fig. 2A-C. In this version, each non-volatile solid state storage 152 unit has a processor such as the controller 212 (see fig. 2C), an FPGA, flash memory 206, and NVRAM 204 (which is supercapacitor-backed DRAM 216, see fig. 2B and 2C) on a PCIe (peripheral component interconnect express) board in a chassis 138 (see fig. 2A). The non-volatile solid state storage 152 unit may be implemented as a single board containing storage, and may be the largest tolerable failure domain inside the chassis. In some embodiments, up to two non-volatile solid state storage 152 units may fail and the device will continue with no data loss.

In some embodiments, the physical storage is divided into named regions based on application usage. The NVRAM 204 is a contiguous block of reserved memory in the DRAM 216 of the non-volatile solid state storage 152, and is backed by NAND flash. The NVRAM 204 is logically divided into multiple memory regions that are written as spools (e.g., spool_regions). Space within the NVRAM 204 spools is managed by each authority 168 independently. Each device provides an amount of storage space to each authority 168, and that authority 168 further manages lifetimes and allocations within that space. Examples of a spool include distributed transactions or concepts. When the primary power to a non-volatile solid state storage 152 unit fails, onboard supercapacitors provide a short duration of power hold-up. During this hold-up interval, the contents of the NVRAM 204 are flushed to flash memory 206. On the next power-on, the contents of the NVRAM 204 are recovered from the flash memory 206.

As with the storage unit controllers, the responsibilities of the logical "controller" are distributed across each blade that contains an authority 168. This distribution of logic control is shown in fig. 2D as host controller 242, middle tier controller 244, and storage unit controller 246. The management of the control plane and the storage plane are handled independently, but the components may be physically co-located on the same blade. Each authority 168 effectively acts as an independent controller. Each authority 168 provides its own data and metadata structure, its own background work process (background worker), and maintains its own lifecycle.

FIG. 2E is a blade 252 hardware block diagram, showing a control plane 254, compute and storage planes 256, 258, and authorities 168 interacting with underlying physical resources, using embodiments of the storage nodes 150 and storage units 152 of FIGS. 2A-C in the storage server environment of FIG. 2D. The control plane 254 is partitioned into a number of authorities 168, which can use the compute resources in the compute plane 256 to run on any of the blades 252. The storage plane 258 is partitioned into a set of devices, each of which provides access to flash memory 206 and NVRAM 204 resources. In one embodiment, the compute plane 256 may perform the operations of a storage array controller, as described herein, on one or more devices of the storage plane 258 (e.g., a storage array).

In the compute and storage planes 256, 258 of FIG. 2E, the authorities 168 interact with the underlying physical resources (i.e., devices). From the point of view of an authority 168, its resources are striped over all of the physical devices. From the point of view of a device, it provides resources to all authorities 168, irrespective of where the authorities happen to run. Each authority 168 has allocated, or has been allocated, one or more partitions 260 of storage memory in the storage units 152, e.g., partitions 260 in flash memory 206 and NVRAM 204. Each authority 168 uses the partitions 260 allocated to it for writing or reading user data. Authorities may be associated with differing amounts of the physical storage of the system. For example, one authority 168 could have a larger number of partitions 260, or larger sized partitions 260, in one or more storage units 152 than one or more other authorities 168.

FIG. 2F depicts elasticity software layers in the blades 252 of a storage cluster, according to some embodiments. In the elasticity structure, the elasticity software is symmetric, i.e., the compute module 270 of each blade runs the three identical layers of processes depicted in FIG. 2F. Storage managers 274 execute read and write requests from other blades 252 for data and metadata stored in the NVRAM 204 and flash memory 206 of the local storage unit 152. Authorities 168 fulfill client requests by issuing the necessary reads and writes to the blades 252 on whose storage units 152 the corresponding data or metadata resides. Endpoints 272 parse client connection requests received from the switch fabric 146 supervisory software, relay the client connection requests to the authorities 168 responsible for fulfillment, and relay the responses of the authorities 168 to clients. The symmetric three-layer structure enables the high degree of concurrency of the storage system. In these embodiments, elasticity scales out efficiently and reliably. In addition, elasticity implements a scale-out technique that balances work evenly across all resources regardless of client access pattern, and maximizes concurrency by eliminating much of the need for the inter-blade coordination that typically occurs with conventional distributed locking.

Still referring to FIG. 2F, the authorities 168 running in the compute modules 270 of a blade 252 perform the internal operations required to fulfill client requests. One feature of elasticity is that the authorities 168 are stateless, i.e., they cache active data and metadata in the DRAM of their own blades 252 for fast access, but they store every update in their NVRAM 204 partitions on three separate blades 252 until the update has been written to flash memory 206. In some embodiments, all storage system writes to NVRAM 204 are made in triplicate, to partitions on three separate blades 252. With triple-mirrored NVRAM 204 and persistent storage protected by parity and Reed-Solomon RAID checksums, the storage system can survive the concurrent failure of two blades 252 with no loss of data, metadata, or access to either.
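The triple-mirrored NVRAM behavior described above can be modeled with a small in-memory sketch. All class and method names here are hypothetical, the "flash" list is only a stand-in for an authority's flash partition, and real failure detection, ordering, and durability guarantees are omitted.

```python
class Blade:
    """Toy model of a blade holding per-authority NVRAM partitions."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.alive = True
        # authority ID -> updates staged in that authority's NVRAM partition
        self.nvram: dict[int, list[bytes]] = {}


class Authority:
    MIRROR_COUNT = 3  # updates are staged on three separate blades

    def __init__(self, authority_id: int, mirrors: list[Blade]) -> None:
        if len(mirrors) != self.MIRROR_COUNT:
            raise ValueError("exactly three mirror blades are required")
        self.authority_id = authority_id
        self.mirrors = mirrors
        self.flash: list[bytes] = []  # stand-in for the flash partition

    def write(self, update: bytes) -> None:
        """Stage the update in this authority's partition on every mirror."""
        for blade in self.mirrors:
            blade.nvram.setdefault(self.authority_id, []).append(update)

    def flush_to_flash(self) -> None:
        """Move staged updates to flash using any surviving mirror copy."""
        survivors = [b for b in self.mirrors if b.alive]
        if not survivors:
            raise RuntimeError("all three NVRAM mirrors are unavailable")
        self.flash.extend(survivors[0].nvram.get(self.authority_id, []))
        for blade in survivors:
            blade.nvram.pop(self.authority_id, None)
```

With three copies staged, any two blades can fail after a write and the remaining copy is still enough to complete the flush, which mirrors the two-blade failure tolerance stated in the text.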

Because the authorities 168 are stateless, they can migrate between blades 252. Each authority 168 has a unique identifier. The NVRAM 204 and flash 206 partitions are associated with the identifiers of the authorities 168, not with the blades 252 on which they are running. Thus, when an authority 168 migrates, it continues to manage the same storage partitions from its new location. When a new blade 252 is installed in an embodiment of the storage cluster, the system automatically rebalances load by: partitioning the storage of the new blade 252 for use by the system's authorities 168, migrating selected authorities 168 to the new blade 252, and starting endpoints 272 on the new blade 252 and including them in the client connection distribution algorithm of the switch fabric 146.

From their new locations, migrated authorities 168 persist the contents of their NVRAM 204 partitions on flash memory 206, process read and write requests from other authorities 168, and fulfill the client requests that endpoints 272 direct to them. Similarly, if a blade 252 fails or is removed, the system redistributes its authorities 168 among the remaining blades 252 of the system. The redistributed authorities 168 continue to perform their original functions from their new locations.
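The migration of selected authorities to a newly installed blade can be sketched as a simple rebalancing step. The function name and the "move from the most loaded blade until the new blade holds a fair share" policy are assumptions made for illustration; the description does not specify how authorities are selected. Because partitions are keyed by authority identifier rather than by blade, only the authority-to-blade placement changes.

```python
def rebalance_onto_new_blade(
    placement: dict[int, str], new_blade: str
) -> dict[int, str]:
    """Return a new authority -> blade placement that includes new_blade.

    Authorities are moved one at a time from the currently most loaded
    blade until the new blade holds its fair share (rounded down)."""
    result = dict(placement)
    blades = set(result.values()) | {new_blade}
    fair_share = len(result) // len(blades)

    # Group authority IDs by the blade that currently runs them.
    load: dict[str, list[int]] = {blade: [] for blade in blades}
    for authority_id, blade in sorted(result.items()):
        load[blade].append(authority_id)

    while len(load[new_blade]) < fair_share:
        donor = max(
            (b for b in blades if b != new_blade),
            key=lambda b: (len(load[b]), b),
        )
        moved = load[donor].pop()
        load[new_blade].append(moved)
        result[moved] = new_blade
    return result
```

The same shape of function, run in the other direction, could redistribute the authorities of a failed or removed blade among the remaining blades.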

FIG. 2G depicts authorities 168 and storage resources in the blades 252 of a storage cluster, according to some embodiments. Each authority 168 is responsible for a partition of the flash memory 206 and NVRAM 204 on each blade 252. The authority 168 manages the content and integrity of its partitions independently of other authorities 168. The authorities 168 compress incoming data and preserve it temporarily in their NVRAM 204 partitions, and then consolidate, RAID-protect, and persist the data in segments of the storage in their flash 206 partitions. As the authorities 168 write data to flash memory 206, the storage managers 274 perform the necessary flash translation to optimize write performance and maximize media longevity. In the background, the authorities 168 "garbage collect," or reclaim the space occupied by data that clients have made obsolete by overwriting it. It should be appreciated that because the partitions of the authorities 168 are disjoint, no distributed locking is needed to execute client reads and writes or to perform background functions.

Embodiments described herein may utilize various software, communication, and/or network protocols. In addition, the configuration of hardware and/or software may be adjusted to accommodate various protocols. For example, embodiments may utilize an Active Directory (Active Directory), which is in WINDOWS ^TM Database-based systems in an environment that provide authentication, cataloging, policies, and other services. In these embodiments, LDAP (lightweight directory Access protocol) is one example application protocol for querying and modifying items in a directory service provider, such as an active directory. In some embodimentsIn an example, a network lock manager ('NLM') is used as a facility to work in conjunction with a network file system ('NFS') to provide system V style advisory files and record locks on a network. The server message block ('SMB') protocol (one of which versions is also referred to as the common internet file system ('CIFS') may be integrated with the storage system discussed herein. SMP operates as an application layer network protocol that is commonly used to provide shared access to files, printers, and serial ports, and a wide variety of communications between nodes on the network. SMB also provides an authenticated inter-process communication mechanism. AMAZON ^TM S3 (simple storage service) is a web service provided by Amazon web service, and the system described herein can interface with Amazon S3 through web service interfaces (REST (representational state transfer), SOAP (simple object access protocol), and BitTorrent). The RESTful API (application programming interface) breaks down transactions to create a series of small modules. Each module handles a particular underlying portion of the transaction. The controls or permissions provided by these embodiments, particularly for object data, may include the utilization of an access control list ('ACL'). 
An ACL is a list of permissions attached to an object, and the ACL specifies which users or system processes are granted access to the object, as well as which operations are allowed on a given object. The system may utilize internet protocol version 6 ('IPv6') as well as IPv4 as communication protocols that provide an identification and location system for computers on networks and route traffic across the internet. Packet routing between networked systems may include equal-cost multi-path routing ('ECMP'), a routing strategy in which next-hop packet forwarding to a single destination can occur over multiple "best paths" that tie for top place in routing metric calculations. Multi-path routing can be used in conjunction with most routing protocols because it is a per-hop decision limited to a single router. The software may support multi-tenancy, an architecture in which a single instance of a software application serves multiple customers. Each customer may be referred to as a tenant. In some embodiments, tenants may be given the ability to customize some parts of the application, but not the application's code. Embodiments may maintain audit logs. An audit log is a document that records an event in a computing system. In addition to documenting which resources were accessed, audit log entries typically include destination and source addresses, a timestamp, and user login information for compliance with various regulations. Embodiments may support various key management policies, such as encryption key rotation. In addition, the system may support dynamic root passwords or some variation of dynamically changing passwords.
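The equal-cost multi-path strategy mentioned above can be illustrated with a short sketch. It keeps only the next hops whose metric ties for the lowest value and then picks one of them per flow. Hashing the flow identifier to make that pick is a common practice assumed here, not something the disclosure specifies, and the addresses and metrics are made-up sample values.

```python
import hashlib


def equal_cost_next_hops(routes):
    """Return the next hops whose metric ties for the lowest value."""
    best = min(metric for _, metric in routes)
    return sorted(hop for hop, metric in routes if metric == best)


def pick_next_hop(routes, flow):
    """Choose one of the tied best paths by hashing the flow identifier."""
    hops = equal_cost_next_hops(routes)
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return hops[int.from_bytes(digest[:4], "big") % len(hops)]


# Sample routing table: (next hop, metric). Two hops tie at metric 10.
routes = [("10.0.0.1", 10), ("10.0.0.2", 10), ("10.0.0.3", 20)]
flow = ("192.168.1.5", "172.16.0.9", 443)  # hypothetical (src, dst, port)
chosen = pick_next_hop(routes, flow)
```

The decision uses only the local routing table and the packet's flow identifier, which is why it can be made per hop at a single router.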

FIG. 3A illustrates a diagram of a storage system 306 that is coupled for data communication with a cloud service provider 302, according to some embodiments of the present disclosure. Although described in less detail, the storage system 306 depicted in FIG. 3A may be similar to the storage systems described above with reference to FIGS. 1A-1D and 2A-2G. In some embodiments, the storage system 306 depicted in FIG. 3A may be embodied as a storage system including unbalanced active/active controllers, a storage system including balanced active/active controllers, a storage system including active/active controllers (where less than all of the resources of each controller are utilized such that each controller has reserved resources available to support failover), a storage system including fully active/active controllers, a storage system including data set isolation controllers, a storage system including a dual-layer architecture with front-end controllers and back-end integrated storage controllers, a storage system including a scale-out cluster of dual-controller arrays, and combinations of these embodiments.

In the example depicted in FIG. 3A, storage system 306 is coupled to cloud service provider 302 via data communication link 304. The data communication link 304 may be embodied as a dedicated data communication link, a data communication path provided through one or more data communication networks such as a wide area network ('WAN') or LAN, or some other mechanism capable of conveying digital information between the storage system 306 and the cloud service provider 302. Such a data communication link 304 may be entirely wired, entirely wireless, or some aggregation of wired and wireless data communication paths. In such examples, digital information may be exchanged between storage system 306 and cloud service provider 302 via data communication link 304 using one or more data communication protocols. For example, digital information may be exchanged between storage system 306 and cloud service provider 302 via data communication link 304 using handheld device transport protocol ('HDTP'), hypertext transfer protocol ('HTTP'), internet protocol ('IP'), real-time transport protocol ('RTP'), transmission control protocol ('TCP'), user datagram protocol ('UDP'), wireless application protocol ('WAP'), or another protocol.

The cloud service provider 302 depicted in fig. 3A may be embodied as a system and computing environment that provides a large number of services to users of the cloud service provider 302, for example, by sharing computing resources via the data communication link 304. Cloud service provider 302 may provide on-demand access to a shared pool of configurable computing resources, such as computer networks, servers, storage, applications, and services. The shared pool of configurable resources may be quickly provisioned and released to users of cloud service provider 302 with minimal administrative effort. In general, the user of cloud service provider 302 is unaware of the exact computing resources utilized by cloud service provider 302 to provide the service. Although in many cases such cloud service provider 302 may be accessible via the internet, readers of skill in the art will recognize that any system that abstracts the use of shared resources to provide services to users over any data communication link may be considered a cloud service provider 302.

In the example depicted in fig. 3A, cloud service provider 302 may be configured to provide various services to storage system 306 and users of storage system 306 by implementing various service models. For example, cloud service provider 302 may be configured to provide services by implementing an infrastructure as a service ('IaaS') service model, by implementing a platform as a service ('PaaS') service model, by implementing a software as a service ('SaaS') service model, by implementing an authentication as a service ('AaaS') service model, by implementing a storage as a service model in which cloud service provider 302 provides access to its storage infrastructure for use by storage system 306 and users of storage system 306, and so forth. Readers will appreciate that cloud service provider 302 may be configured to provide additional services to storage system 306 and users of storage system 306 by implementing additional service models, as the service models described above are included for explanation purposes only and in no way represent limitations on the services that may be provided by cloud service provider 302 or limitations on the service models that may be implemented by cloud service provider 302.

In the example depicted in fig. 3A, cloud service provider 302 may be embodied as, for example, a private cloud, a public cloud, or a combination of private and public clouds. In embodiments where cloud service provider 302 is embodied as a private cloud, cloud service provider 302 may be dedicated to providing services to a single organization, rather than providing services to multiple organizations. In embodiments where cloud service provider 302 is embodied as a public cloud, cloud service provider 302 may provide services to multiple organizations. In other alternative embodiments, cloud service provider 302 may be embodied as a hybrid of private and public cloud services with hybrid cloud deployment.

Although not explicitly depicted in FIG. 3A, readers will appreciate that a large number of additional hardware components and additional software components may be necessary to facilitate the delivery of cloud services to storage system 306 and users of storage system 306. For example, the storage system 306 may be coupled to (or even include) a cloud storage gateway. Such a cloud storage gateway may be embodied as, for example, a hardware-based or software-based appliance that is located on premises with the storage system 306. Such a cloud storage gateway may operate as a bridge between local applications executing on storage system 306 and remote cloud-based storage utilized by storage system 306. By using a cloud storage gateway, an organization may move primary iSCSI or NAS to cloud service provider 302, thereby enabling the organization to save space on its on-premises storage systems. Such a cloud storage gateway may be configured to emulate a disk array, a block-based device, a file server, or another storage system that can translate SCSI commands, file server commands, or other appropriate commands into REST-space protocols that facilitate communication with cloud service provider 302.

In order to enable storage system 306 and users of storage system 306 to utilize services provided by cloud service provider 302, a cloud migration process may occur during which data, applications, or other elements from an organization's local system (or even from another cloud environment) are moved to cloud service provider 302. To successfully migrate data, applications, or other elements to the environment of cloud service provider 302, middleware, such as a cloud migration tool, may be utilized to bridge the gap between the environment of cloud service provider 302 and the environment of the organization. Such cloud migration tools may also be configured to address the potentially high network costs and long transfer times associated with migrating large amounts of data to cloud service provider 302, as well as to address security concerns associated with transferring sensitive data to cloud service provider 302 over a data communications network. To further enable storage system 306 and users of storage system 306 to utilize services provided by cloud service provider 302, a cloud orchestrator may also be used to arrange and coordinate automated tasks in pursuit of creating a unified process or workflow. Such a cloud orchestrator may perform tasks such as configuring various components (whether those components are cloud components or on-premises components) and managing the interconnections between these components. The cloud orchestrator may simplify inter-component communication and connections to ensure that links are correctly configured and maintained.

In the example depicted in FIG. 3A, and as briefly described above, cloud service provider 302 may be configured to provide services to storage system 306 and users of storage system 306 by using a SaaS service model, eliminating the need to install and run applications on local computers, which may simplify maintenance and support of the applications. Such applications may take many forms according to various embodiments of the present disclosure. For example, cloud service provider 302 may be configured to provide access to data analysis applications to storage system 306 and users of storage system 306. Such a data analysis application may be configured to, for example, receive large amounts of telemetry data phoned home by the storage system 306. Such telemetry data may describe various operational characteristics of the storage system 306 and may be analyzed for various purposes including, for example, determining the health of the storage system 306, identifying workloads executing on the storage system 306, predicting when the storage system 306 will run out of various resources, and recommending configuration changes, hardware or software upgrades, workload migrations, or other actions that may improve the operation of the storage system 306.

Cloud service provider 302 may also be configured to provide access to virtualized computing environments to storage system 306 and users of storage system 306. Such virtualized computing environment may be embodied as, for example, a virtual machine or other virtualized computer hardware platform, virtual storage, virtualized computer network resources, and the like. Examples of such virtualized environments may include virtual machines created to simulate an actual computer, virtualized desktop environments that separate logical desktops from physical machines, virtualized file systems that allow uniform access to different types of specific file systems, and many others.

Although the example depicted in FIG. 3A illustrates the storage system 306 being coupled for data communication with the cloud service provider 302, in other embodiments the storage system 306 may be part of a hybrid cloud deployment, where private cloud elements (e.g., private cloud services, on-premises infrastructure, etc.) and public cloud elements (e.g., public cloud services, infrastructure, etc., that may be provided by one or more cloud service providers) are combined to form a single solution, with orchestration among the various platforms. Such a hybrid cloud deployment may utilize hybrid cloud management software such as, for example, Azure™ Arc from Microsoft™, which centralizes the management of the hybrid cloud deployment to any infrastructure and enables the deployment of services anywhere. In such examples, the hybrid cloud management software may be configured to create, update, and delete resources (both physical and virtual) that form the hybrid cloud deployment, to allocate compute and storage to particular workloads, to monitor workloads and resources for performance, policy compliance, updates and patches, and security status, or to perform various other tasks.

The reader will appreciate that by pairing the storage system described herein with one or more cloud service providers, various products (offerings) may be implemented. For example, a disaster recovery as a service ('DRaaS') may be provided in which cloud resources are utilized to protect applications and data from disruption caused by a disaster, including in embodiments in which a storage system may serve as a primary data repository. In such an embodiment, a full system backup may be performed, which allows for business continuity in the event of a system failure. In such embodiments, cloud data backup techniques (either by themselves or as part of a larger DRaaS solution) may also be integrated into an overall solution that includes the storage system and cloud service provider described herein.

The storage systems described herein and cloud service providers may be used to provide a wide range of security features. For example, the storage system may encrypt data at rest (and data may be sent to and from the storage system in encrypted form), and may utilize key management as a service ('KMaaS') to manage encryption keys, keys for locking and unlocking storage devices, and so forth. Also, a cloud data security gateway or similar mechanism may be utilized to ensure that data stored within the storage system does not improperly end up being stored in the cloud as part of a cloud data backup operation. Furthermore, micro-segmentation or identity-based segmentation may be used in a data center containing storage systems or within a cloud service provider to create secure zones in the data center and cloud deployment that enable workloads to be isolated from each other.

For further explanation, fig. 3B sets forth a diagram of a storage system 306 according to some embodiments of the present disclosure. Although described in less detail, the storage system 306 depicted in fig. 3B may be similar to the storage system described above with reference to fig. 1A-1D and 2A-2G, as the storage system may include many of the components described above.

The storage system 306 depicted in FIG. 3B may include a large number of storage resources 308, which may be embodied in a variety of forms. For example, the storage resources 308 may include nano-RAM or another form of non-volatile random access memory that utilizes carbon nanotubes deposited on a substrate, 3D crosspoint non-volatile memory, flash memory (including single-level cell ('SLC') NAND flash, multi-level cell ('MLC') NAND flash, triple-level cell ('TLC') NAND flash, and quad-level cell ('QLC') NAND flash), or others. Likewise, the storage resources 308 may include non-volatile magnetoresistive random access memory ('MRAM'), including spin transfer torque ('STT') MRAM. The example storage resources 308 may alternatively include non-volatile phase-change memory ('PCM'), quantum memory that allows for the storage and retrieval of photonic quantum information, resistive random access memory ('ReRAM'), storage class memory ('SCM'), or other forms of storage resources, including any combination of the resources described herein. Readers will appreciate that other forms of computer memory and storage devices may be utilized by the storage systems described above, including DRAM, SRAM, EEPROM, universal memory, and many others. The storage resources 308 depicted in FIG. 3B may be embodied in a variety of form factors including, but not limited to, dual in-line memory modules ('DIMMs'), non-volatile dual in-line memory modules ('NVDIMMs'), M.2, U.2, and others.

The storage resources 308 depicted in FIG. 3B may include various forms of SCM. SCM may effectively treat fast, non-volatile memory (e.g., NAND flash) as an extension of DRAM such that an entire dataset may be treated as an in-memory dataset that resides entirely in DRAM. SCM may include non-volatile media such as, for example, NAND flash. Such NAND flash may be accessed utilizing NVMe, which can use the PCIe bus as its transport, providing relatively low access latencies compared to older protocols. In fact, the network protocols used for SSDs in all-flash arrays can include NVMe using Ethernet (RoCE, NVMe TCP), Fibre Channel (NVMe FC), InfiniBand (iWARP), and others that make it possible to treat fast, non-volatile memory as an extension of DRAM. In view of the fact that DRAM is often byte-addressable while fast, non-volatile memory such as NAND flash is block-addressable, a controller software/hardware stack may be needed to convert the block data to the bytes that are stored in the media. Examples of media and software that may be used as SCM can include, for example, 3D XPoint, Intel Memory Drive Technology, Samsung's Z-SSD, and others.

The storage resources 308 depicted in FIG. 3B may also include racetrack memory (also referred to as domain-wall memory). Such racetrack memory may be embodied as a form of non-volatile, solid-state memory that relies on the intrinsic strength and orientation of the magnetic field created by an electron as it spins, in addition to its electronic charge, in solid-state devices. By using a spin-coherent electric current to move magnetic domains along a nanoscopic permalloy wire, the domains may pass by magnetic read/write heads positioned near the wire as current is passed through the wire, and these heads alter the domains to record patterns of bits. To create a racetrack memory device, many such wires and read/write elements may be packaged together.

The example storage system 306 depicted in fig. 3B may implement various storage architectures. For example, a storage system according to some embodiments of the present disclosure may utilize block storage, where data is stored in blocks, and each block essentially serves as an individual hard disk drive. A storage system according to some embodiments of the present disclosure may utilize object storage, where data is managed as objects. Each object may include the data itself, variable amounts of metadata, and a globally unique identifier, where object storage may be implemented at multiple levels (e.g., device level, system level, interface level). Storage systems according to some embodiments of the present disclosure utilize file storage, wherein data is stored in a hierarchical structure. Such data may be stored in files and folders and presented in the same format to both the system storing it and the system retrieving it.

The example storage system 306 depicted in FIG. 3B may be embodied as a storage system in which additional storage resources can be added through the use of a scale-up model, through the use of a scale-out model, or through some combination thereof. In a scale-up model, additional storage may be added by adding additional storage devices. In a scale-out model, however, additional storage nodes may be added to a cluster of storage nodes, where such storage nodes can include additional processing resources, additional networking resources, and so on.

The example storage system 306 depicted in FIG. 3B may utilize the storage resources described above in a variety of different ways. For example, some portion of the storage resources may be utilized to serve as a write cache, storage resources within the storage system may be utilized as a read cache, or tiering may be achieved within the storage system by placing data within the storage system in accordance with one or more tiering policies.
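The policy-driven data placement mentioned above (commonly called tiering) can be sketched with a simple threshold rule. The tier names, the access-rate metric, and the thresholds below are hypothetical choices for illustration; the disclosure does not specify any particular policy.

```python
def choose_tier(reads_per_day, hot_threshold=100, warm_threshold=10):
    """Map an access rate to a tier name under a simple threshold policy."""
    if reads_per_day >= hot_threshold:
        return "fast"       # e.g., storage class memory
    if reads_per_day >= warm_threshold:
        return "standard"   # e.g., flash
    return "capacity"       # e.g., denser, slower media


# Hypothetical datasets and their observed access rates.
observed = {"db-index": 500, "logs": 25, "archive": 1}
placement = {name: choose_tier(rate) for name, rate in observed.items()}
```

A real policy would likely weigh more signals (data age, capacity pressure, cost), but the shape is the same: a function from observed characteristics to a placement decision.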

The storage system 306 depicted in FIG. 3B also includes communication resources 310 that may be used to facilitate data communication between components within the storage system 306, as well as between the storage system 306 and computing devices external to the storage system 306, including embodiments in which those resources are separated by a relatively vast expanse. The communication resources 310 may be configured to utilize a variety of different protocols and data communication fabrics to facilitate data communication between components within the storage system and computing devices external to the storage system. For example, the communication resources 310 may include fibre channel ('FC') technologies, such as FC fabrics and FC protocols that can transport SCSI commands over FC networks, FC over ethernet ('FCoE') technologies through which FC frames are encapsulated and transmitted over Ethernet networks, InfiniBand ('IB') technologies in which a switched fabric topology is utilized to facilitate transmissions between channel adapters, NVM Express ('NVMe') technologies and NVMe over fabrics ('NVMeoF') technologies through which non-volatile storage media attached via a PCI Express ('PCIe') bus may be accessed, and others. In fact, the storage systems described above may, directly or indirectly, make use of neutrino communication technologies and devices through which information (including binary information) is transmitted using a beam of neutrinos.

The communication resources 310 may also include mechanisms for accessing the storage resources 308 within the storage system 306 using serial attached SCSI ('SAS'), serial ATA ('SATA') bus interfaces for connecting the storage resources 308 within the storage system 306 to host bus adapters within the storage system 306, Internet Small Computer Systems Interface ('iSCSI') technologies for providing block-level access to the storage resources 308 within the storage system 306, and other communication resources that may be used to facilitate data communication between components within the storage system 306 and between the storage system 306 and computing devices external to the storage system 306.

The storage system 306 depicted in FIG. 3B also includes processing resources 312 that may be used to execute computer program instructions and perform other computational tasks within the storage system 306. The processing resources 312 may include one or more ASICs that are customized for some particular purpose, as well as one or more CPUs. The processing resources 312 may also include one or more DSPs, one or more FPGAs, one or more systems on a chip ('SoCs'), or other forms of processing resources 312. The storage system 306 may utilize the processing resources 312 to perform a variety of tasks including, but not limited to, supporting the execution of the software resources 314 that will be described in greater detail below.

The storage system 306 depicted in fig. 3B also includes software resources 314, which software resources 314 may perform a number of tasks when executed by the processing resources 312 within the storage system 306. The software resources 314 may include, for example, one or more modules of computer program instructions that, when executed by the processing resources 312 within the storage system 306, may be used to carry out various data protection techniques. Such data protection techniques may be implemented, for example, by system software executing on computer hardware within a storage system, by a cloud service provider, or in other ways. Such data protection techniques may include data archiving, data backup, data replication, data snapshot, data and database cloning, and other data protection techniques.

The software resources 314 may also include software that may be used to implement software-defined storage ('SDS'). In such examples, the software resources 314 may include one or more modules of computer program instructions that, when executed, may be used for policy-based provisioning and management of data storage that is independent of the underlying hardware. Such software resources 314 may be used to implement storage virtualization to separate the storage hardware from the software that manages the storage hardware.

The software resources 314 may also include software that may be used to facilitate and optimize I/O operations directed to the storage system 306. For example, the software resources 314 may include software modules that perform various data reduction techniques, such as, for example, data compression, data deduplication, and others. The software resources 314 may include software modules that intelligently group I/O operations together to facilitate better use of the underlying storage resources 308, software modules that perform data migration operations to migrate from within the storage system, and software modules that perform other functions. Such software resources 314 may be embodied as one or more software containers or in many other ways.
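As a rough illustration of the data reduction techniques named above, the following sketch combines deduplication and compression: blocks are keyed by a content fingerprint so identical data is stored once, and each unique block is compressed before it is kept. The class name and the choice of SHA-256 and zlib are assumptions made for this example, not details from the disclosure.

```python
import hashlib
import zlib


class ReducingStore:
    """Hypothetical block store that deduplicates, then compresses."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> compressed bytes

    def put(self, data: bytes) -> str:
        # Identical content yields the same fingerprint, so it is stored once.
        fingerprint = hashlib.sha256(data).hexdigest()
        if fingerprint not in self.blocks:
            self.blocks[fingerprint] = zlib.compress(data)
        return fingerprint

    def get(self, fingerprint: str) -> bytes:
        return zlib.decompress(self.blocks[fingerprint])


store = ReducingStore()
first = store.put(b"A" * 4096)
second = store.put(b"A" * 4096)  # duplicate write, no new block stored
third = store.put(b"B" * 4096)
```

In this sketch the two identical writes share one stored block, and each stored block occupies far less space than the 4096 bytes written because the sample data is highly compressible.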

For further explanation, FIG. 3C sets forth an example of a cloud-based storage system 318 according to some embodiments of the present disclosure. In the example depicted in FIG. 3C, the cloud-based storage system 318 is created entirely within a cloud computing environment 316 such as, for example, Amazon Web Services ('AWS')™, Microsoft Azure™, Google Cloud Platform™, IBM Cloud™, Oracle Cloud™, and others. The cloud-based storage system 318 may be used to provide services similar to those that may be provided by the storage systems described above.

The cloud-based storage system 318 depicted in FIG. 3C includes two cloud computing instances 320, 322 that are each used to support the execution of a storage controller application 324, 326. The cloud computing instances 320, 322 may be embodied, for example, as instances of cloud computing resources (e.g., virtual machines) that may be provided by the cloud computing environment 316 to support the execution of software applications such as the storage controller applications 324, 326. For example, each of the cloud computing instances 320, 322 may execute on an Azure VM, where each Azure VM may include high-speed temporary storage that may be used as a cache (e.g., as a read cache). In one embodiment, the cloud computing instances 320, 322 may be embodied as Amazon Elastic Compute Cloud ('EC2') instances. In such an example, an Amazon Machine Image ('AMI') that includes the storage controller application 324, 326 may be booted to create and configure a virtual machine that may execute the storage controller application 324, 326.

In the example depicted in FIG. 3C, the storage controller applications 324, 326 may be embodied as modules of computer program instructions that, when executed, perform various storage tasks. For example, the storage controller applications 324, 326 may be embodied as modules of computer program instructions that, when executed, perform the same tasks as the controllers 110A, 110B in FIG. 1A described above, such as writing data to the cloud-based storage system 318, erasing data from the cloud-based storage system 318, retrieving data from the cloud-based storage system 318, monitoring and reporting disk utilization and performance, performing redundancy operations (e.g., RAID or RAID-like data redundancy operations), compressing data, encrypting data, deduplicating data, and so forth. Readers will appreciate that because there are two cloud computing instances 320, 322, each containing a storage controller application 324, 326, in some embodiments one cloud computing instance 320 may operate as the primary controller described above, while the other cloud computing instance 322 may operate as the secondary controller described above. Readers will appreciate that the storage controller applications 324, 326 depicted in FIG. 3C may include the same source code executing within different cloud computing instances 320, 322 (e.g., different EC2 instances).

Readers will appreciate that other embodiments that do not include primary and secondary controllers are within the scope of the present disclosure. For example, each cloud computing instance 320, 322 may operate as a primary controller for some portion of the address space supported by the cloud-based storage system 318, each cloud computing instance 320, 322 may operate as a primary controller where the servicing of I/O operations directed to the cloud-based storage system 318 is divided in some other way, and so on. In fact, in other embodiments where cost savings may be prioritized over performance demands, only a single cloud computing instance may exist that contains the storage controller application.
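One way the address space could be divided between the two cloud computing instances 320, 322 is sketched below. The fixed-size striping, the stripe width, and the function name are purely illustrative assumptions; the disclosure does not say how the portions are assigned.

```python
def primary_for(lba, controllers=("320", "322"), stripe_blocks=1024):
    """Return the controller instance that acts as primary for a block address.

    Assumes fixed-size stripes assigned round-robin to the controllers.
    """
    stripe = lba // stripe_blocks
    return controllers[stripe % len(controllers)]


# Logical block addresses and the instance that would service them.
owners = [primary_for(lba) for lba in (0, 1023, 1024, 2048)]
```

Any deterministic mapping would serve the same purpose, as long as both instances agree on which of them is primary for a given address.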

The cloud-based storage system 318 depicted in FIG. 3C includes cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338. The cloud computing instances 340a, 340b, 340n may be embodied as, for example, instances of cloud computing resources that may be provided by the cloud computing environment 316 to support the execution of software applications. The cloud computing instances 340a, 340b, 340n of FIG. 3C may differ from the cloud computing instances 320, 322 described above in that the cloud computing instances 340a, 340b, 340n of FIG. 3C have local storage 330, 334, 338 resources, whereas the cloud computing instances 320, 322 that support the execution of the storage controller applications 324, 326 need not have local storage resources. The cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 may be embodied as, for example, EC2 M5 instances that include one or more SSDs, EC2 R5 instances that include one or more SSDs, EC2 I3 instances that include one or more SSDs, and so forth. In some embodiments, the local storage 330, 334, 338 must be embodied as solid-state storage (e.g., SSDs) rather than storage that makes use of hard disk drives.

In the example depicted in FIG. 3C, each of the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 may include a software daemon 328, 332, 336 that, when executed by a cloud computing instance 340a, 340b, 340n, can present itself to the storage controller applications 324, 326 as if the cloud computing instance 340a, 340b, 340n were a physical storage device (e.g., one or more SSDs). In such examples, the software daemon 328, 332, 336 may include computer program instructions similar to those that would normally be contained on a storage device, so that the storage controller applications 324, 326 can send and receive the same commands that a storage controller would send to storage devices. In this way, the storage controller applications 324, 326 may include code that is identical (or substantially identical) to the code that would be executed by the controllers in the storage systems described above. In these and similar embodiments, communication between the storage controller applications 324, 326 and the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 may utilize iSCSI, NVMe over TCP, messaging, a custom protocol, or some other mechanism.

In the example depicted in FIG. 3C, each of the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 may also be coupled to block storage 342, 344, 346 that is offered by the cloud computing environment 316, such as, for example, Amazon Elastic Block Store ('EBS') volumes. In such an example, the block storage 342, 344, 346 offered by the cloud computing environment 316 may be utilized in a manner similar to how the NVRAM devices described above are utilized, as the software daemon 328, 332, 336 (or some other module) executing within a particular cloud computing instance 340a, 340b, 340n may, upon receiving a request to write data, initiate a write of the data to its attached EBS volume as well as a write of the data to its local storage 330, 334, 338 resources. In some alternative embodiments, data may be written only to the local storage 330, 334, 338 resources within a particular cloud computing instance 340a, 340b, 340n. In an alternative embodiment, rather than using the block storage 342, 344, 346 offered by the cloud computing environment 316 as NVRAM, actual RAM on each of the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 may be used as NVRAM, thereby decreasing the network utilization costs that would be associated with using an EBS volume as the NVRAM. In yet another embodiment, high-performance block storage resources, such as one or more Azure Ultra Disks, may be utilized as the NVRAM.

The storage controller applications 324, 326 may be used to perform various tasks, such as deduplicating the data contained in the request, compressing the data contained in the request, determining where to write the data contained in the request, and the like, before ultimately sending a request to write a deduplicated, encrypted, or otherwise possibly updated version of the data to one or more of the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338. In some embodiments, either cloud computing instance 320, 322 may receive a request to read data from the cloud-based storage system 318, and may ultimately send a request to read data to one or more of the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338.

When a request to write data is received by a particular cloud computing instance 340a, 340b, 340n with local storage 330, 334, 338, the software daemon 328, 332, 336 may be configured not only to write the data to its own local storage 330, 334, 338 resources and any appropriate block storage 342, 344, 346 resources, but also to write the data to cloud-based object storage 348 that is attached to the particular cloud computing instance 340a, 340b, 340n. The cloud-based object storage 348 attached to the particular cloud computing instance 340a, 340b, 340n may be embodied as, for example, Amazon Simple Storage Service ('S3'). In other embodiments, the cloud computing instances 320, 322 that each include a storage controller application 324, 326 may initiate the storage of the data in the local storage 330, 334, 338 of the cloud computing instances 340a, 340b, 340n and in the cloud-based object storage 348. In other embodiments, rather than storing data using both the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 (also referred to herein as "virtual drives") and the cloud-based object storage 348, a persistent storage layer may be implemented in other ways. For example, one or more Azure Ultra Disks may be used to persistently store data (e.g., after the data has been written to an NVRAM layer).

While the local storage 330, 334, 338 resources and the block storage 342, 344, 346 resources utilized by the cloud computing instances 340a, 340b, 340n may support block-level access, the cloud-based object storage 348 attached to a particular cloud computing instance 340a, 340b, 340n supports only object-based access. Thus, the software daemons 328, 332, 336 may be configured to take blocks of data, package those blocks into objects, and write the objects to the cloud-based object storage 348 attached to the particular cloud computing instance 340a, 340b, 340n.

Consider an example in which data is written in 1 MB blocks to the local storage 330, 334, 338 resources and the block storage 342, 344, 346 resources utilized by the cloud computing instances 340a, 340b, 340n. In such an example, assume that a user of the cloud-based storage system 318 issues a request to write data that, after being compressed and deduplicated by the storage controller applications 324, 326, results in 5 MB of data needing to be written. Writing the data to the local storage 330, 334, 338 resources and the block storage 342, 344, 346 resources utilized by the cloud computing instances 340a, 340b, 340n is relatively straightforward, as five blocks of 1 MB each are written to those resources. In such an example, the software daemons 328, 332, 336 may also be configured to create five objects, each containing a distinct 1 MB chunk of the data. Thus, in some embodiments, each object written to the cloud-based object storage 348 may be identical (or nearly identical) in size. Readers will appreciate that in such examples, metadata associated with the data itself may be included in each object (e.g., the first 1 MB of the object is data and the remaining portion is metadata associated with the data). Readers will also appreciate that the cloud-based object storage 348 may be incorporated into the cloud-based storage system 318 to increase the durability of the cloud-based storage system 318.
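The packaging of blocks into objects described above can be illustrated with a minimal sketch. The function names and the 16-byte metadata trailer (a byte offset and a length) are assumptions made for illustration only; the description states only that metadata may follow the data within each object.

```python
BLOCK_SIZE = 1024 * 1024  # 1 MB, the block size used in the example above


def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Split a payload into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def pack_object(block: bytes, offset: int) -> bytes:
    """Build one object: the data block followed by a metadata trailer.

    The trailer layout (offset and length as 8-byte big-endian integers)
    is hypothetical and chosen only to keep the sketch small.
    """
    metadata = offset.to_bytes(8, "big") + len(block).to_bytes(8, "big")
    return block + metadata


def blocks_to_objects(data: bytes) -> list[bytes]:
    """Package each block of the payload into its own object."""
    blocks = split_into_blocks(data)
    return [pack_object(block, i * BLOCK_SIZE) for i, block in enumerate(blocks)]


payload = bytes(5 * BLOCK_SIZE)      # 5 MB left after compression and deduplication
objects = blocks_to_objects(payload)
print(len(objects))                  # 5
```

Because every block in this example is exactly 1 MB and the trailer has a fixed length, all five objects come out the same size, which matches the observation that objects may be identical (or nearly identical) in size.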

In some embodiments, all data stored by the cloud-based storage system 318 may be stored in both: 1) the cloud-based object storage 348, and 2) at least one of the local storage 330, 334, 338 resources or the block storage 342, 344, 346 resources utilized by the cloud computing instances 340a, 340b, 340n. In such embodiments, the local storage 330, 334, 338 resources and the block storage 342, 344, 346 resources utilized by the cloud computing instances 340a, 340b, 340n may effectively operate as a cache that generally includes all of the data that is also stored in S3, such that all reads of data may be serviced by the cloud computing instances 340a, 340b, 340n without requiring them to access the cloud-based object storage 348. Readers will appreciate, however, that in other embodiments all data stored by the cloud-based storage system 318 may be stored in the cloud-based object storage 348, while less than all of that data is stored in at least one of the local storage 330, 334, 338 resources or the block storage 342, 344, 346 resources utilized by the cloud computing instances 340a, 340b, 340n. In such instances, various policies may be utilized to determine which subset of the data stored by the cloud-based storage system 318 should reside in both: 1) the cloud-based object storage 348, and 2) at least one of the local storage 330, 334, 338 resources or the block storage 342, 344, 346 resources utilized by the cloud computing instances 340a, 340b, 340n.
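The cache-like role of the local and block storage resources can be sketched as a simple read path. The class and attribute names below are hypothetical, a plain dictionary stands in for the cloud-based object storage, and the policy of copying a block into the local tier on a miss is only one of the many policies the description leaves open.

```python
class TieredReader:
    """Serve reads from the local/block tier, falling back to the object store."""

    def __init__(self, object_store: dict[str, bytes]):
        self.object_store = object_store     # holds every block (stand-in for S3)
        self.local: dict[str, bytes] = {}    # local or block storage acting as a cache
        self.object_store_reads = 0          # counts fallbacks to the object store

    def read(self, key: str) -> bytes:
        if key in self.local:                # hit: no object store access is needed
            return self.local[key]
        self.object_store_reads += 1         # miss: fetch from the object store
        data = self.object_store[key]
        self.local[key] = data               # assumed policy: keep a local copy
        return data


object_store = {"blk-0": b"alpha", "blk-1": b"beta"}
reader = TieredReader(object_store)
reader.local["blk-0"] = b"alpha"             # blk-0 already resides locally

first = reader.read("blk-0")                 # served locally
second = reader.read("blk-1")                # fetched from the object store
third = reader.read("blk-1")                 # served locally after the first fetch
```

When the local tier holds all of the data, `object_store_reads` stays at zero, which corresponds to the embodiment in which every read is serviced without touching the object storage.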

One or more modules of computer program instructions executing within the cloud-based storage system 318 (e.g., a monitoring module executing on its own EC2 instance) may be designed to handle the failure of one or more of the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338. In such instances, the monitoring module may handle the failure by creating one or more new cloud computing instances with local storage, retrieving the data that was stored on the failed cloud computing instances 340a, 340b, 340n from the cloud-based object storage 348, and storing the retrieved data in local storage on the newly created cloud computing instances. Readers will appreciate that many variants of this process may be implemented.
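A minimal sketch of the recovery flow described above follows. The `Instance` type, the `placement` map (block key to owning instance name), and the naming of the replacement instance are assumptions introduced for illustration; a dictionary again stands in for the cloud-based object storage.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical stand-in for a cloud computing instance with local storage."""
    name: str
    local_storage: dict[str, bytes] = field(default_factory=dict)
    healthy: bool = True


def recover(failed: Instance, object_store: dict[str, bytes],
            placement: dict[str, str]) -> Instance:
    """Create a replacement instance and refill it from the object store.

    `placement` maps each block key to the name of the instance that held it.
    """
    replacement = Instance(name=failed.name + "-replacement")
    for key, owner in list(placement.items()):
        if owner == failed.name:
            # Retrieve the lost block from the object store and store it locally.
            replacement.local_storage[key] = object_store[key]
            placement[key] = replacement.name
    return replacement


object_store = {"k1": b"a", "k2": b"b", "k3": b"c"}
placement = {"k1": "vd-0", "k2": "vd-1", "k3": "vd-0"}
failed = Instance("vd-0", healthy=False)
replacement = recover(failed, object_store, placement)
```

After the call, the replacement instance holds the two blocks that resided on the failed instance, and the placement map points at the new instance.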

Readers will appreciate that various performance aspects of the cloud-based storage system 318 may be monitored (e.g., by a monitoring module executing in an EC2 instance) such that the cloud-based storage system 318 can be scaled up or scaled out as needed. For example, if the cloud computing instances 320, 322 used to support execution of the storage controller applications 324, 326 are undersized and do not sufficiently service the I/O requests issued by users of the cloud-based storage system 318, the monitoring module may create a new, more powerful cloud computing instance (e.g., a type of cloud computing instance that includes more processing power, more memory, etc.) that includes the storage controller application, such that the new, more powerful cloud computing instance can begin operating as the primary controller. Likewise, if the monitoring module determines that the cloud computing instances 320, 322 used to support execution of the storage controller applications 324, 326 are oversized and that cost savings could be gained by switching to a smaller, less powerful cloud computing instance, the monitoring module may create a new, less powerful (and less expensive) cloud computing instance that includes the storage controller application, such that the new, less powerful cloud computing instance can begin operating as the primary controller.
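The resizing decision made by the monitoring module can be sketched as a threshold rule. The instance size names, the utilization metric, and the 85% and 20% thresholds are all hypothetical; the description does not specify how undersized or oversized controllers are detected.

```python
SIZES = ["small", "medium", "large", "xlarge"]   # hypothetical instance types, weakest first


def next_controller_size(current: str, utilization: float,
                         high: float = 0.85, low: float = 0.20) -> str:
    """Pick the instance type on which the primary controller should run next.

    `utilization` is a fraction between 0 and 1 describing how busy the
    current controller instance is; the thresholds are illustrative only.
    """
    index = SIZES.index(current)
    if utilization > high and index < len(SIZES) - 1:
        return SIZES[index + 1]      # undersized: move to a more powerful instance
    if utilization < low and index > 0:
        return SIZES[index - 1]      # oversized: move to a cheaper instance
    return current                   # adequately sized, or already at a limit


print(next_controller_size("medium", 0.95))   # large
print(next_controller_size("medium", 0.05))   # small
```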

The storage systems described above may implement intelligent data backup techniques through which data stored in the storage system may be copied and stored in a distinct location to avoid data loss in the event of equipment failure or some other form of catastrophe. For example, the storage systems described above may be configured to examine each backup to avoid restoring the storage system to an undesirable state. Consider an example in which malware infects the storage system. In such an example, the storage system may include software resources 314 that can scan each backup to identify backups that were captured before the malware infected the storage system and backups that were captured after the malware infected the storage system. In such instances, the storage system may restore itself from a backup that does not include the malware, or at least refrain from restoring the portions of a backup that contain the malware. The software resources 314 may scan each backup to identify the presence of malware (or a virus, or some other undesirable element), for example, by identifying write operations that were serviced by the storage system and originated from a network subnet suspected of having delivered the malware, by identifying write operations that were serviced by the storage system and originated from a user suspected of having delivered the malware, by identifying write operations that were serviced by the storage system and examining the content of those write operations against fingerprints of the malware, and in many other ways.
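One of the scanning approaches mentioned above, checking the content of write operations against malware fingerprints, can be sketched as follows. The `Backup` type and the use of SHA-256 digests as fingerprints are assumptions made for illustration; the description does not specify a fingerprint format.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Backup:
    """Hypothetical backup record: capture time plus the write payloads it contains."""
    taken_at: int
    writes: tuple[bytes, ...]


def is_clean(backup: Backup, malware_fingerprints: set[str]) -> bool:
    """A backup is clean if none of its writes matches a known fingerprint."""
    return all(hashlib.sha256(w).hexdigest() not in malware_fingerprints
               for w in backup.writes)


def latest_clean_backup(backups: list[Backup],
                        malware_fingerprints: set[str]) -> Optional[Backup]:
    """Return the most recent backup that contains no fingerprinted write."""
    clean = [b for b in backups if is_clean(b, malware_fingerprints)]
    return max(clean, key=lambda b: b.taken_at, default=None)


fingerprints = {hashlib.sha256(b"evil-payload").hexdigest()}
backups = [
    Backup(100, (b"report",)),
    Backup(200, (b"report", b"photo")),
    Backup(300, (b"report", b"evil-payload")),   # captured after the infection
]
chosen = latest_clean_backup(backups, fingerprints)
print(chosen.taken_at)   # 200
```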

Readers will further appreciate that backups (often in the form of one or more snapshots) may also be utilized to perform rapid recovery of the storage system. Consider an example in which the storage system is infected with ransomware that locks users out of the storage system. In such an example, the software resources 314 within the storage system may be configured to detect the presence of the ransomware and may be further configured to restore the storage system, using the retained backups, to a point in time prior to the point in time at which the ransomware infected the storage system. In such an example, the presence of the ransomware may be explicitly detected through the use of software tools utilized by the system, through the use of a key (e.g., a USB drive) that is inserted into the storage system, or in a similar way. Likewise, the presence of the ransomware may be inferred in response to system activity meeting a predetermined fingerprint (e.g., no reads or writes coming into the system for a predetermined period of time).
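The inactivity-based inference and the choice of a restore point can be sketched with two small helpers. The function names, the use of numeric timestamps, and the 60-second window in the usage lines are hypothetical; the description only gives "no reads or writes within a predetermined period" as an example fingerprint.

```python
from typing import Optional


def idle_too_long(io_timestamps: list[float], now: float, window: float) -> bool:
    """Return True when no read or write was serviced in the last `window` seconds."""
    return not any(now - window <= t <= now for t in io_timestamps)


def restore_point(snapshot_times: list[float], infected_at: float) -> Optional[float]:
    """Return the latest snapshot taken strictly before the infection, if any."""
    earlier = [t for t in snapshot_times if t < infected_at]
    return max(earlier, default=None)


print(idle_too_long([10.0, 20.0], now=100.0, window=60.0))   # True
print(restore_point([100.0, 200.0, 300.0], infected_at=250.0))  # 200.0
```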

Readers will appreciate that the various components described above may be grouped into one or more optimized computing packages as converged infrastructures. Such converged infrastructures may include pools of computers, storage, and networking resources that can be shared by multiple applications and managed in a collective manner using policy-driven processes. Such converged infrastructures may be implemented with a converged infrastructure reference architecture, with standalone appliances, with a software-driven hyper-converged approach (e.g., hyper-converged infrastructures), or in other ways.

Readers will appreciate that the storage systems described in this disclosure may be used to support various types of software applications. In fact, the storage systems may be "application aware" in the sense that they may obtain, maintain, or otherwise have access to information describing connected applications (e.g., applications that utilize the storage systems) in order to optimize the operation of the storage system based on intelligence about the applications and their utilization patterns. For example, the storage system may optimize data layouts, optimize caching behavior, optimize quality of service ('QoS') levels, or perform some other optimization designed to improve the storage performance experienced by the application.

As an example of the types of applications that may be supported by the storage systems described herein, the storage system 306 may be used to support artificial intelligence ('AI') applications, database applications, XOps projects (e.g., DevOps projects, DataOps projects, MLOps projects, ModelOps projects, PlatformOps projects), electronic design automation tools, event-driven software applications, high performance computing applications, simulation applications, high-speed data capture and analysis applications, machine learning applications, media production applications, media serving applications, picture archiving and communication system ('PACS') applications, software development applications, virtual reality applications, augmented reality applications, and many other types of applications by providing storage resources to such applications.

In view of the fact that the storage system includes computing resources, storage resources, and various other resources, the storage system may be well suited to support resource-intensive applications such as, for example, AI applications. AI applications may be deployed in a variety of fields, including: predictive maintenance in manufacturing and related fields, healthcare applications (e.g., patient data and risk analysis), retail and marketing deployments (e.g., search advertisements, social media advertisements), supply chain solutions, financial technology solutions (e.g., business analysis and reporting tools), operational deployments (e.g., real-time analysis tools), application performance management tools, IT infrastructure management tools, and many others.

Such AI applications may enable devices to perceive their environment and take actions that maximize their chance of success at some goal. Examples of such AI applications may include IBM Watson™, Microsoft Oxford™, Google DeepMind™, Baidu Minwa™, and others.

The storage system described above may also be well suited to support other types of applications that are resource intensive, such as, for example, machine learning applications. The machine learning application may perform various types of data analysis to automate analytical model construction. Using an algorithm that iteratively learns from the data, the machine learning application may enable the computer to learn without being explicitly programmed. One particular area of machine learning is known as reinforcement learning, which involves taking appropriate action to maximize return in certain situations.

In addition to the resources already described, the storage systems described above may also include graphics processing units ('GPUs'), occasionally referred to as visual processing units ('VPUs'). Such GPUs may be embodied as specialized electronic circuits that rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. Such GPUs may be included within any of the computing devices that are part of the storage systems described above, including as one of many individually scalable components of a storage system, where other examples of individually scalable components of such a storage system may include storage components, memory components, compute components (e.g., CPUs, FPGAs, ASICs), networking components, software components, and others. In addition to GPUs, the storage systems described above may also include neural network processors ('NNPs') for use in various aspects of neural network processing. Such NNPs may be used in place of (or in addition to) GPUs and may also be independently scalable.

As described above, the storage systems described herein may be configured to support artificial intelligence applications, machine learning applications, big data analytics applications, and many other types of applications. The rapid growth in these sorts of applications is being driven by three technologies: deep learning ('DL'), GPU processors, and big data. Deep learning is a computing model that makes use of massively parallel neural networks inspired by the human brain. Instead of relying on software handcrafted by experts, a deep learning model writes its own software by learning from large numbers of examples. GPUs may include thousands of cores that are well suited to run algorithms that loosely represent the parallel nature of the human brain.

Advances in deep neural networks, including the development of multi-layer neural networks, have ignited a new wave of algorithms and tools for data scientists to tap into their data with artificial intelligence ('AI'). With improved algorithms, larger data sets, and various frameworks (including open-source software libraries for machine learning across a range of tasks), data scientists are tackling new use cases such as autonomous driving vehicles, natural language processing and understanding, computer vision, machine reasoning, strong AI, and many others. Applications of such techniques may include: machine and vehicular object detection, identification, and avoidance; visual recognition, classification, and tagging; algorithmic financial trading strategy performance management; simultaneous localization and mapping; predictive maintenance of high-value machinery; prevention of cyber security threats and expertise automation; image recognition and classification; question answering; robotics; text analytics (extraction, classification) and text generation and translation; and many others. Applications of AI techniques have materialized in a wide array of products, including, for example: the speech recognition technology of Amazon Echo, which allows users to talk to their machines; Google Translate™, which allows for machine-based language translation; Spotify's Discover Weekly, which provides recommendations of new songs and artists that a user may like based on the user's usage and traffic analysis; Quill's text generation offering, which takes structured data and turns it into narrative stories; chatbots that provide real-time, contextually specific answers to questions in a dialog format; and many others.

Data is the heart of modern AI and deep learning algorithms. Before training can begin, one problem that must be addressed revolves around collecting the labeled data that is crucial for training an accurate AI model. A full-scale AI deployment may be required to continuously collect, clean, transform, label, and store large amounts of data. Adding additional high-quality data points directly translates to more accurate models and better insights. Data samples may undergo a series of processing steps including, but not limited to: 1) ingesting the data from an external source into the training system and storing the data in raw form, 2) cleaning and transforming the data into a format convenient for training, including linking data samples to the appropriate label, 3) exploring parameters and models, quickly testing with a smaller dataset, and iterating to converge on the most promising models to push into the production cluster, 4) executing training phases that select random batches of input data, including both new and older samples, and feed those batches into production GPU servers for computation to update model parameters, and 5) evaluating, including using a held-back portion of the data that is not used in training in order to evaluate model accuracy on the holdout data. This lifecycle may apply to any type of parallelized machine learning, not just neural networks or deep learning. For example, standard machine learning frameworks may rely on CPUs instead of GPUs, but the data ingest and training workflows may be the same. Readers will appreciate that a single shared storage data hub creates a coordination point throughout the lifecycle without the need for extra data copies among the ingest, preprocessing, and training stages. The ingested data is rarely used for only one purpose, and shared storage gives the flexibility to train multiple different models or apply traditional analytics to the data.
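Steps 4 and 5 of the lifecycle above, drawing random batches for training and holding back data for evaluation, can be sketched with the Python standard library alone. The function names, the 20% holdout fraction, the batch size, and the fixed seeds are assumptions made for illustration and are not taken from the description.

```python
import random
from typing import Iterator, Sequence


def split_holdout(samples: Sequence[int], holdout_fraction: float = 0.2,
                  seed: int = 0) -> tuple[list[int], list[int]]:
    """Shuffle the samples and hold back a fraction that is never used in training."""
    if not 0.0 <= holdout_fraction < 1.0:
        raise ValueError("holdout_fraction must be in [0, 1)")
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_holdout = round(len(shuffled) * holdout_fraction)
    cut = len(shuffled) - n_holdout
    return shuffled[:cut], shuffled[cut:]


def random_batches(train: Sequence[int], batch_size: int,
                   seed: int = 0) -> Iterator[list[int]]:
    """Yield random batches that together cover the training set exactly once."""
    rng = random.Random(seed)
    order = list(train)
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        yield order[start:start + batch_size]


train, holdout = split_holdout(list(range(100)))
batches = list(random_batches(train, batch_size=32))
print(len(train), len(holdout), [len(b) for b in batches])   # 80 20 [32, 32, 16]
```

In a real deployment each batch would be fed to GPU servers and the holdout set would be scored by the trained model; both are outside the scope of this sketch.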

Readers will appreciate that each stage in the AI data pipeline may have varying requirements of the data hub (e.g., the storage system or collection of storage systems). Scale-out storage systems must deliver uncompromising performance for all manner of access types and patterns, from small, metadata-heavy files to large files, from random to sequential access patterns, and from low to high concurrency. The storage systems described above may serve as an ideal AI data hub because the systems may service unstructured workloads. In the first stage, data is ideally ingested and stored on the same data hub that the following stages will use, in order to avoid excess data copying. The next two steps can be done on a standard compute server that optionally includes a GPU, and then in the fourth and last stage, full training production jobs are run on powerful GPU-accelerated servers. Often, a production pipeline exists alongside an experimental pipeline operating on the same dataset. Further, the GPU-accelerated servers can be used independently for different models or joined together to train on one larger model, even spanning multiple systems for distributed training. If the shared storage tier is slow, then data must be copied to local storage for each stage, resulting in wasted time staging data onto different servers. The ideal data hub for the AI training pipeline delivers performance similar to that of data stored locally on the server node, while also having the simplicity and performance to enable all pipeline stages to operate concurrently.

In order for the storage systems described above to serve as a data hub or as part of an AI deployment, in some embodiments the storage systems may be configured to provide DMA between storage devices that are included in the storage systems and one or more GPUs that are used in an AI or big data analytics pipeline. The one or more GPUs may be coupled to the storage system, for example, via NVMe over Fabrics ('NVMe-oF') such that bottlenecks such as the host CPU can be bypassed and the storage system (or one of the components contained therein) can directly access GPU memory. In such an example, the storage systems may leverage API hooks to the GPUs to transfer data directly to the GPUs. For example, the GPUs may be embodied as Nvidia™ GPUs, and the storage systems may support GPUDirect Storage ('GDS') software, or have similar proprietary software, that enables the storage system to transfer data to the GPUs via RDMA or a similar mechanism.

While the preceding paragraphs discuss a deep learning application, the reader will appreciate that the storage system described herein may also be part of a distributed deep learning ('DDL') platform to support execution of DDL algorithms. The storage system described above may also be paired with other technologies such as TensorFlow, an open source software library for data flow programming across a series of tasks, which may be used in machine learning applications such as neural networks to facilitate development of such machine learning models, applications, and the like.

The storage systems described above may also be used in a neuromorphic computing environment. Neuromorphic computing is a form of computing that mimics brain cells. To support neuromorphic computing, an architecture of interconnected "neurons" replaces traditional computing models with low-powered signals that pass directly between neurons for more efficient computation. Neuromorphic computing may make use of very-large-scale integration ('VLSI') systems containing electronic analog circuits to mimic neuro-biological architectures present in the nervous system, as well as analog, digital, and mixed-mode analog/digital VLSI and software systems that implement models of neural systems for perception, motor control, or multisensory integration.

Readers will appreciate that the storage systems described above may be configured to support the storage or use of (among other types of data) blockchains and derivative items such as, for example, open source blockchains and related tools that are part of the IBM™ Hyperledger project, permissioned blockchains in which a certain number of trusted parties are allowed to access the blockchain, blockchain products that enable developers to build their own distributed ledger projects, and others. Blockchains and the storage systems described herein may be leveraged to support on-chain storage of data as well as off-chain storage of data.

Off-chain storage of data can be implemented in a variety of ways and can occur when the data itself is not stored within the blockchain. For example, in one embodiment, a hash function may be utilized and the data itself may be fed into the hash function to generate a hash value. In such an example, the hashes of large pieces of data may be embedded within transactions, instead of the data itself. Readers will appreciate that, in other embodiments, alternatives to blockchains may be used to facilitate the decentralized storage of information. For example, one alternative to a blockchain that may be used is a blockweave. While conventional blockchains store every transaction to achieve validation, a blockweave permits secure decentralization without the usage of the entire chain, thereby enabling low-cost on-chain storage of data. Such blockweaves may utilize a consensus mechanism that is based on proof of access ('PoA') and proof of work ('PoW').
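The hash-based approach to off-chain storage can be sketched as follows. SHA-256 is chosen here only as a familiar hash function (the description does not name one), and the dictionary-based store, the transaction layout, and the function names are hypothetical.

```python
import hashlib

# Stand-in for the storage system that keeps the data outside the blockchain.
off_chain_store: dict[str, bytes] = {}


def store_off_chain(data: bytes) -> dict[str, str]:
    """Keep the data off-chain and return a transaction that embeds only its hash."""
    digest = hashlib.sha256(data).hexdigest()
    off_chain_store[digest] = data
    return {"data_hash": digest}


def verify(transaction: dict[str, str], data: bytes) -> bool:
    """Check that off-chain data still matches the hash recorded in the transaction."""
    return hashlib.sha256(data).hexdigest() == transaction["data_hash"]


document = b"large piece of data" * 1000
tx = store_off_chain(document)
print(verify(tx, off_chain_store[tx["data_hash"]]))   # True
print(verify(tx, document + b"tampered"))             # False
```

The transaction stays small regardless of the size of the data, while any change to the off-chain copy is detectable because its hash no longer matches.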

The storage systems described above may be used, either alone or in combination with other computing devices, to support in-memory computing applications. In-memory computing involves the storage of information in RAM that is distributed across a cluster of computers. Readers will appreciate that the storage systems described above, especially those that are configurable with customizable amounts of processing resources, storage resources, and memory resources (e.g., those systems in which blades contain configurable amounts of each type of resource), may be configured in a way that provides an infrastructure that can support in-memory computing. Likewise, the storage systems described above may include component parts (e.g., NVDIMMs, 3D crosspoint storage that provides fast random access memory that is persistent) that can actually provide an improved in-memory computing environment as compared to in-memory computing environments that rely on RAM distributed across dedicated servers.

In some embodiments, the storage systems described above may be configured to operate as a hybrid in-memory computing environment that includes a universal interface to all storage media (e.g., RAM, flash storage, 3D crosspoint storage). In such embodiments, users may have no knowledge of the details of where their data is stored, but they can still use the same full, unified API to address data. In such embodiments, the storage system may (in the background) move data to the fastest layer available, including intelligently placing the data in dependence upon various characteristics of the data or in dependence upon some other heuristic. In such an example, the storage systems may even make use of existing products such as Apache Ignite and GridGain to move data between the various storage layers, or the storage systems may make use of custom software to move data between the various storage layers. The storage systems described herein may implement various optimizations to improve the performance of in-memory computing such as, for example, having computations occur as close to the data as possible.
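Placing data on the fastest layer available can be sketched with a greedy rule. The tier names, the capacity units, and the policy of picking the first tier with enough free space are assumptions made for illustration; the description leaves the placement heuristic open.

```python
TIERS = ["ram", "3d_crosspoint", "flash"]   # hypothetical tier names, fastest first


def place(size: int, free: dict[str, int]) -> str:
    """Place `size` units of data on the fastest tier that has room for it.

    `free` maps each tier name to its remaining capacity and is updated in place.
    """
    for tier in TIERS:
        if free[tier] >= size:
            free[tier] -= size
            return tier
    raise RuntimeError("no storage tier has enough free capacity")


free = {"ram": 4, "3d_crosspoint": 16, "flash": 1024}
print(place(2, free))     # ram
print(place(8, free))     # 3d_crosspoint
print(place(100, free))   # flash
```

A caller addressing data through a unified API would not see which tier was chosen; the return value is exposed here only to make the placement visible.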

Readers will further appreciate that in some embodiments, the storage systems described above may be paired with other resources to support the applications described above. For example, one infrastructure could include primary compute in the form of servers and workstations that specialize in using general-purpose computing on graphics processing units ('GPGPU') to accelerate deep learning applications and that are interconnected into a computation engine to train parameters for deep neural networks. Each system may have Ethernet external connectivity, InfiniBand external connectivity, some other form of external connectivity, or some combination thereof. In such an example, the GPUs can be grouped for a single large training job or used independently to train multiple models. The infrastructure could also include a storage system such as those described above to provide, for example, a scale-out all-flash file or object store through which data can be accessed via high-performance protocols such as NFS, S3, and so on. The infrastructure can also include, for example, redundant top-of-rack Ethernet switches connected to storage and compute via ports in MLAG port channels for redundancy. The infrastructure could also include additional compute in the form of whitebox servers, optionally with GPUs, for data ingestion, pre-processing, and model debugging. Readers will appreciate that additional infrastructures are also possible.

Readers will appreciate that the storage systems described above, either alone or in coordination with other computing machinery, may be configured to support other AI-related tools. For example, the storage systems may make use of tools like ONNX or other open neural network exchange formats that make it easier to transfer models written in different AI frameworks. Likewise, the storage systems may be configured to support tools like Amazon's Gluon that allow developers to prototype, build, and train deep learning models. In fact, the storage systems described above may be part of a larger platform, such as IBM™ Cloud Private for Data, that includes integrated data science, data engineering, and application building services.

Readers will further appreciate that the storage system described above may also be deployed as an edge solution. Such edge solutions may be in place to optimize the cloud computing system by performing data processing at the network edge near the data source. Edge computing can push applications, data, and computing power (i.e., services) away from the centralized point to the logical extremity of the network. By using an edge solution to a storage system such as described above, computing tasks may be performed using computing resources provided by such a storage system, data may be stored using storage resources of the storage system, and cloud-based services may be accessed using various resources (including network resources) of the storage system. By performing computing tasks on edge solutions, storing data on edge solutions, and typically utilizing edge solutions, consumption of expensive cloud-based resources can be avoided, and in fact, performance improvements can be experienced relative to greater reliance on cloud-based resources.

While many tasks may benefit from the utilization of an edge solution, some particular uses may be especially suited for deployment in such an environment. For example, devices like drones, autonomous cars, robots, and others may require extremely rapid processing, so fast in fact that sending data up to a cloud environment and back to receive data processing support may simply be too slow. As an additional example, some IoT devices such as connected video cameras may not be well suited for the utilization of cloud-based resources, as it may be impractical (not only from a privacy, security, or financial perspective) to send the data to the cloud simply because of the sheer volume of data involved. As such, many tasks that rely on data processing, storage, or communications may be better suited to platforms that include edge solutions such as the storage systems described above.

The storage systems described above may serve, alone or in combination with other computing resources, as a network edge platform that combines compute resources, storage resources, networking resources, cloud technologies, network virtualization technologies, and so on. As part of the network, the edge may take on characteristics similar to other network facilities, from the customer premises and backhaul aggregation facilities to points of presence ('PoPs') and regional data centers. Readers will appreciate that network workloads, such as virtual network functions ('VNFs') and others, will reside on the network edge platform. Enabled by a combination of containers and virtual machines, the network edge platform may rely on controllers and schedulers that are no longer geographically co-located with the data processing resources. The functions, as microservices, may be split into control planes, user and data planes, or even state machines, allowing independent optimization and scaling techniques to be applied. Such user and data planes may be enabled through increased accelerators, both those residing in server platforms, such as FPGAs and smart NICs, and through SDN-enabled merchant silicon and programmable ASICs.

The storage system described above may also be optimized for use in big data analytics, including use as part of a composable data analytics pipeline in which, for example, a containerized analytics architecture makes analytics capabilities more composable. Big data analytics can be generally described as the process of examining large and varied data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful information that can help organizations make more informed business decisions. As part of that process, semi-structured and unstructured data (such as, for example, internet clickstream data, web server logs, social media content, text from customer emails and survey responses, mobile phone call detail records, IoT sensor data, and other data) may be converted into a structured form.

The storage systems described above may also support (including being implemented as a system interface for) applications that perform tasks in response to human speech. For example, the storage system may support the execution of intelligent personal assistant applications such as, for example, Amazon's Alexa™, Apple Siri™, Google Voice™, Samsung Bixby™, Microsoft Cortana™, and others. Although the examples in the previous sentence use voice as input, the storage system described above may also support chatbots, talkbots, artificial conversational entities, or other applications configured to conduct a conversation via auditory or textual methods. Likewise, the storage system may itself execute such an application to enable a user, such as a system administrator, to interact with the storage system via speech. Such applications are typically capable of voice interaction, music playback, making to-do lists, setting alarms, streaming podcasts, playing audiobooks, and providing weather, traffic, and other real-time information (e.g., news), but in embodiments according to the present disclosure they may also be used as interfaces for various system management operations.

The storage system described above may also implement an AI platform for delivering on the vision of self-driving storage. Such an AI platform may be configured to deliver global predictive intelligence by collecting and analyzing large numbers of storage system telemetry data points to enable effortless management, analytics, and support. In fact, such a storage system may be able to predict both capacity and performance, as well as generate intelligent advice on workload deployment, interaction, and optimization. Such an AI platform may be configured to scan all incoming storage system telemetry data against a library of issue fingerprints to predict and resolve incidents in real time, before they affect the customer environment, and to capture hundreds of performance-related variables used to forecast performance load.

The storage system described above may support the serialized or concurrent execution of artificial intelligence applications, machine learning applications, data analytics applications, data transformations, and other tasks that together may form an AI ladder. Such an AI ladder may effectively be formed by combining these elements into a complete data science pipeline, where dependencies exist between the elements of the AI ladder. For example, AI may require that some form of machine learning has taken place, machine learning may require that some form of analytics has taken place, analytics may require that some form of data and information architecting has taken place, and so on. As such, each element may be viewed as a rung of the AI ladder, and the rungs together can form a complete and sophisticated AI solution.
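The dependency chain in the example above can be expressed as a small dependency graph and ordered for execution. The following is a minimal sketch only; the rung names and the `execution_order` helper are hypothetical labels chosen for illustration and are not part of the disclosure.

```python
from graphlib import TopologicalSorter

# Each rung maps to the rungs it depends on, following the example in the text.
# The rung names are hypothetical labels.
LADDER = {
    "ai": {"machine_learning"},
    "machine_learning": {"analytics"},
    "analytics": {"data_structuring"},
    "data_structuring": set(),
}


def execution_order(ladder):
    """Return the rungs in an order that satisfies every dependency."""
    return list(TopologicalSorter(ladder).static_order())
```

Calling `execution_order(LADDER)` yields the rungs from data structuring up to AI, so a scheduler could run each stage once the stage below it has completed.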

The storage system described above may also be used, alone or in combination with other computing environments, to deliver an AI-everywhere experience in which AI permeates wide and expansive aspects of business and life. For example, AI may play an important role in delivering deep learning solutions, deep reinforcement learning solutions, artificial general intelligence solutions, autonomous vehicles, cognitive computing solutions, commercial UAVs or drones, conversational user interfaces, enterprise taxonomies, ontology management solutions, machine learning solutions, smart dust, smart robots, smart workplaces, and many others.

The storage systems described above may also be used, alone or in combination with other computing environments, to deliver a wide range of transparently immersive experiences, including those that use digital twins of various "things" (e.g., people, places, processes, systems, etc.), where technology can introduce transparency between people, businesses, and things. Such transparently immersive experiences may be delivered as augmented reality technologies, connected homes, virtual reality technologies, brain-computer interfaces, human augmentation technologies, nanotube electronics, volumetric displays, 4D printing technologies, or others.

The storage systems described above may also be used, alone or in combination with other computing environments, to support a variety of digital platforms. Such digital platforms may include, for example, 5G wireless systems and platforms, digital twin platforms, edge computing platforms, IoT platforms, quantum computing platforms, serverless PaaS, software-defined security, neuromorphic computing platforms, and the like.

The storage system described above may also be part of a multi-cloud environment, in which multiple cloud computing and storage services are deployed in a single heterogeneous architecture. To facilitate operation of such a multi-cloud environment, DevOps tools may be deployed to enable orchestration across the clouds. Likewise, continuous development and continuous integration tools may be deployed to standardize the processes around continuous integration and delivery, new feature rollout, and provisioning of cloud workloads. By standardizing these processes, a multi-cloud strategy may be implemented that uses the best provider for each workload.

The storage system described above may be used as part of a platform that enables the use of crypto-anchors, which may be used to authenticate the origin and contents of a product to ensure that it matches a blockchain record associated with the product. Similarly, the storage systems described above may implement various encryption techniques and schemes, including lattice cryptography, as part of a suite of tools that protects data stored on the storage systems. Lattice cryptography may involve constructions of cryptographic primitives that involve lattices, either in the construction itself or in the security proof. Unlike public-key schemes such as RSA, Diffie-Hellman, or elliptic curve cryptography, which are vulnerable to attack by a quantum computer, some lattice-based constructions appear to be resistant to attack by both classical and quantum computers.

Quantum computers are devices that perform quantum computation. Quantum computing is computation using quantum mechanical phenomena such as superposition and entanglement. Quantum computers differ from traditional transistor-based computers in that traditional computers require data to be encoded into binary digits (bits), each of which is always in one of two definite states (0 or 1). In contrast, quantum computers use qubits, which can be in a superposition of states. A quantum computer maintains a sequence of qubits, where a single qubit may represent a 1, a 0, or any quantum superposition of those two qubit states. A pair of qubits may be in any quantum superposition of 4 states, and three qubits may be in any superposition of 8 states. In general, a quantum computer with n qubits can be in an arbitrary superposition of up to 2^n different states simultaneously, whereas a traditional computer can only be in one of these states at any one time. The quantum Turing machine is a theoretical model of such a computer.
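The state counts above (2, 4, and 8 states for one, two, and three qubits) follow from the general form of an n-qubit state. In standard notation, which is supplied here for illustration and does not appear in the original text, the state is a normalized combination of all 2^n basis states:

```latex
\[
  |\psi\rangle \;=\; \sum_{x \in \{0,1\}^n} \alpha_x \, |x\rangle ,
  \qquad
  \sum_{x \in \{0,1\}^n} |\alpha_x|^2 \;=\; 1 .
\]
```

For n = 1 this reduces to \(\alpha_0|0\rangle + \alpha_1|1\rangle\), and each additional qubit doubles the number of complex amplitudes \(\alpha_x\).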

The storage system described above can also be paired with an FPGA-accelerated server as part of a larger AI or ML infrastructure. Such FPGA-accelerated servers may reside near the storage systems described above (e.g., in the same data center), or even be incorporated into an appliance that includes one or more storage systems, one or more FPGA-accelerated servers, a network infrastructure that supports communication between the one or more storage systems and the one or more FPGA-accelerated servers, and other hardware and software components. Alternatively, the FPGA-accelerated server may reside within a cloud computing environment that may be used to perform computing-related tasks for AI and ML jobs. Any of the embodiments described above may be used to collectively function as an FPGA-based AI or ML platform. The reader will appreciate that in some embodiments of an FPGA-based AI or ML platform, the FPGAs contained within the FPGA-accelerated server may be reconfigured for different types of ML models (e.g., LSTM, CNN, GRU). The ability to reconfigure the FPGA contained within the FPGA acceleration server can enable acceleration of the ML or AI application based on optimal numerical accuracy and the memory model used. The reader will appreciate that by treating the collection of FPGA-accelerated servers as a pool of FPGAs, any CPU in the data center can use the pool of FPGAs as a shared hardware micro-service, rather than limiting the servers to dedicated accelerators inserted therein.

The FPGA-accelerated server and GPU-accelerated server described above may implement a computational model in which instead of saving a small amount of data in the CPU and running a long instruction stream thereon as in more traditional computational models, a machine learning model and parameters are fixed into high bandwidth on-chip memory, with a large amount of data flowing through the high bandwidth on-chip memory. For such a computational model, an FPGA may even be more efficient than a GPU, as the FPGA may be programmed with only the instructions needed to run such a computational model.

The storage system described above may be configured to provide parallel storage, for example, by using a parallel file system such as BeeGFS. Such a parallel file system may include a distributed metadata architecture. For example, a parallel file system may include multiple metadata servers across which metadata is distributed, as well as components including services for clients and storage servers.

The system described above may support the execution of a wide range of software applications. Such software applications may be deployed in a variety of ways, including container-based deployment models. Containerized applications may be managed using a variety of tools, for example Docker Swarm, Kubernetes, and others. Containerized applications may be used to facilitate a serverless, cloud-native computing deployment and management model for software applications. In support of a serverless, cloud-native computing deployment and management model for software applications, containers may be used as part of an event handling mechanism (e.g., AWS Lambdas) such that various events cause a containerized application to be launched to operate as an event handler.

The system described above may be deployed in a variety of ways, including in ways that support fifth generation ('5G') networks. 5G networks may support substantially faster data communications than previous generations of mobile communication networks and, as a consequence, may lead to the disaggregation of data and computing resources, as modern massive data centers may become less prominent and may be replaced, for example, by more local micro data centers that are close to mobile network towers. The system described above may be included in such local micro data centers and may be part of, or paired with, multi-access edge computing ('MEC') systems. Such MEC systems may enable cloud computing capabilities and an IT service environment at the edge of the cellular network. By running applications and performing related processing tasks closer to the cellular customer, network congestion may be reduced and applications may perform better.

The storage system described above may also be configured to implement NVMe Zoned Namespaces. With NVMe Zoned Namespaces, the logical address space of a namespace is divided into zones. Each zone provides a logical block address range that must be written sequentially and explicitly reset before being rewritten, thereby enabling the creation of namespaces that expose the natural boundaries of the device and offload management of internal mapping tables to the host. To implement NVMe Zoned Namespaces ('ZNS'), ZNS SSDs or some other form of zoned block device that exposes the namespace logical address space using zones may be utilized. With the zones aligned to the internal physical properties of the device, several inefficiencies in data placement can be eliminated. In such embodiments, each zone may be mapped, for example, to a separate application, such that functions such as wear leveling and garbage collection may be performed on a per-zone or per-application basis rather than across the entire device. To support ZNS, the storage controllers described herein may be configured to interact with zoned block devices using, for example, the Linux™ kernel zoned block device interface or other tools.
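The sequential-write and explicit-reset rules described above can be illustrated with a small in-memory model of a single zone (region). This is a sketch under stated assumptions: the `Zone` class and its methods are hypothetical, do not correspond to any NVMe or Linux kernel API, and model only the write-pointer behavior.

```python
from dataclasses import dataclass, field


@dataclass
class Zone:
    """Hypothetical in-memory model of one zone of a zoned namespace."""
    start_lba: int
    capacity: int                      # number of logical blocks in the zone
    write_pointer: int = 0             # offset of the next writable block
    blocks: list = field(default_factory=list)

    def append(self, data: bytes) -> int:
        """Write one block at the write pointer and return its LBA."""
        if self.write_pointer >= self.capacity:
            raise IOError("zone is full; reset it before writing again")
        lba = self.start_lba + self.write_pointer
        self.blocks.append(data)
        self.write_pointer += 1
        return lba

    def write_at(self, lba: int, data: bytes) -> int:
        """Reject any write that does not land exactly on the write pointer."""
        if lba != self.start_lba + self.write_pointer:
            raise IOError("writes within a zone must be sequential")
        return self.append(data)

    def reset(self) -> None:
        """Explicitly reset the zone so it can be rewritten from its start."""
        self.blocks.clear()
        self.write_pointer = 0
```

Because the device-side model only tracks a write pointer per zone, the host is left to decide which data goes into which zone, which is the mapping-table offload the paragraph describes.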

The storage systems described above may also be configured to implement zoned storage in other ways, for example through the use of shingled magnetic recording (SMR) storage devices. In examples where zoned storage is used, device-managed embodiments may be deployed in which the storage device hides this complexity by managing it in firmware, presenting an interface like that of any other storage device. Alternatively, zoned storage may be implemented via a host-managed embodiment that depends on the operating system to know how to handle the drive and to write only sequentially to certain regions of the drive. Zoned storage may similarly be implemented using a host-aware embodiment in which a combination of the drive-managed and host-managed implementations is deployed.

The storage systems described herein may be used to form data lakes. A data lake may operate as the first place that an organization's data flows to, where such data may be in a raw format. Metadata tagging may be implemented to facilitate searches of data elements in the data lake, especially in embodiments where the data lake contains multiple stores of data (e.g., unstructured data, semi-structured data, structured data) in formats that are not easily accessible or readable. From the data lake, data may pass downstream to a data warehouse, where the data may be stored in a more processed, packaged, and consumable format. The storage systems described above may also be used to implement such a data warehouse. In addition, a data mart or data hub may allow for data that is even more easily consumed, where the storage systems described above may also be used to provide the underlying storage resources needed by the data mart or data hub. In embodiments, queries against the data lake may require a schema-on-read approach, in which data is applied to a plan or schema as it is pulled out of a storage location, rather than as it goes into the storage location.
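The idea of applying a schema as data is pulled out of storage, rather than as it is written, can be sketched as follows. This is a minimal illustration only; `RAW_LAKE`, `SCHEMA`, and `read_with_schema` are hypothetical names, and the records are made-up sample data.

```python
import json

# Raw records land in the lake exactly as received; no schema is enforced on write.
RAW_LAKE = [
    '{"user": "alice", "bytes": "1024", "ts": "2024-01-01T00:00:00"}',
    '{"user": "bob", "bytes": "n/a"}',
]

# Hypothetical schema: field name -> (converter, default when missing or invalid).
SCHEMA = {"user": (str, ""), "bytes": (int, 0)}


def read_with_schema(raw_records, schema):
    """Apply the schema while records are being read, not when they are stored."""
    for line in raw_records:
        record = json.loads(line)
        row = {}
        for name, (convert, default) in schema.items():
            try:
                row[name] = convert(record[name])
            except (KeyError, ValueError, TypeError):
                row[name] = default
        yield row
```

Different queries can pass different schemas over the same raw records, which is why fields that the schema does not name (such as `ts` here) can stay in the lake untouched.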

The storage systems described herein may also be configured to implement a recovery point objective ('RPO'), which may be established by a user, by an administrator, as a system default, as part of a storage class or service that the storage system participates in delivering, or in some other way. A "recovery point objective" is a goal for the maximum time difference between the last update to a source data set and the last recoverable replicated data set update that would be correctly recoverable, given a reason to do so, from a continuously or frequently updated copy of the source data set. An update is correctly recoverable if it properly takes into account all updates that were processed on the source data set prior to the last recoverable replicated data set update.

In synchronous replication, the RPO will be zero, meaning that under normal operation all completed updates on the source data set should be present and correctly recoverable on the replicated data set. In best-effort, near-synchronous replication, the RPO may be as low as a few seconds. In snapshot-based replication, the RPO may be roughly calculated as the interval between snapshots plus the time needed to transfer the modifications between a previously transferred snapshot and the most recent snapshot to be replicated.
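The rough calculation for snapshot-based replication can be written out directly. The sketch below assumes a constant link throughput and that the amount of changed data between the two snapshots is known; the function and parameter names are hypothetical.

```python
def snapshot_rpo_seconds(snapshot_interval_s: float,
                         changed_bytes: float,
                         link_bytes_per_s: float) -> float:
    """Rough RPO: time between snapshots plus time to ship the changes."""
    if link_bytes_per_s <= 0:
        raise ValueError("link throughput must be positive")
    transfer_time_s = changed_bytes / link_bytes_per_s
    return snapshot_interval_s + transfer_time_s
```

For example, with snapshots every 300 seconds, 6 GB of changes between snapshots, and a 100 MB/s replication link, the transfer takes 60 seconds and the estimate is 360 seconds.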

If updates accumulate faster than they are replicated, an RPO may be missed. For snapshot-based replication, an RPO may be missed if more data to be replicated accumulates between two snapshots than can be replicated between taking a snapshot and replicating that snapshot's cumulative updates to the copy. Again, in snapshot-based replication, if data to be replicated accumulates at a faster rate than can be transferred in the time between subsequent snapshots, replication may start to fall further behind, which may extend the miss between the intended recovery point objective and the actual recovery point represented by the last correctly replicated update.
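The way replication falls behind when changes accumulate faster than they can be transferred can be shown with a simple backlog model. This is an illustrative sketch that assumes a fixed transfer budget per snapshot interval; `replication_lag` is a hypothetical helper and not part of the disclosure.

```python
def replication_lag(accumulated_per_interval, transferable_per_interval):
    """Return the untransferred backlog after each snapshot interval.

    accumulated_per_interval: bytes of new changes captured by each snapshot.
    transferable_per_interval: bytes the link can ship between two snapshots.
    """
    backlog = 0
    history = []
    for changed in accumulated_per_interval:
        backlog += changed
        backlog = max(0, backlog - transferable_per_interval)
        history.append(backlog)
    return history
```

A backlog that returns to zero means the replica catches up and the RPO can still be met; a backlog that grows from one interval to the next corresponds to the widening miss described above.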

The storage system described above may also be part of a shared-nothing storage cluster. In a shared-nothing storage cluster, each node of the cluster has local storage and communicates with other nodes in the cluster over a network, where the storage used by the cluster is (in general) provided only by the storage connected to each individual node. A set of nodes that synchronously replicate a data set may be one example of a shared-nothing storage cluster, in that each storage system has local storage and communicates with the other storage systems over a network, where those storage systems do not (in general) use storage from elsewhere that they share access to through some kind of interconnect. In contrast, some of the storage systems described above are themselves built as shared storage clusters, because there are drive shelves that are shared by the paired controllers. Other storage systems described above, however, are built as shared-nothing storage clusters, because all storage is local to a particular node (e.g., a blade) and all communication takes place over a network that links the compute nodes together.

In other embodiments, other forms of shared-nothing storage clusters may include the following embodiments: wherein any node in the cluster has a local copy of all the storage that they need, and wherein the data is mirrored to other nodes in the cluster by synchronous replication to ensure that the data is not lost, or because the other nodes are also using the storage. In such an embodiment, if the new cluster node needs some data, the data may be copied from the other nodes that have copies of the data to the new node.

In some embodiments, a shared storage cluster based on mirrored copies may store multiple copies of all storage data of the cluster, with each subset of data being copied to a particular set of nodes and different subsets of data being copied to different sets of nodes. In some variations, an embodiment may store all stored data of a cluster in all nodes, while in other variations, the nodes may be partitioned such that a first group of nodes will all store the same data set and a second, different group of nodes will all store different data sets.
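One way to copy each subset of data to its own set of nodes is to derive the node set deterministically from the subset identifier. The sketch below uses a hash-based starting point and walks the node list; this placement rule, the function name, and the parameters are assumptions made for illustration and are not the placement method of the disclosure.

```python
import hashlib


def replica_nodes(subset_id: str, nodes: list, copies: int) -> list:
    """Pick `copies` distinct nodes for one data subset, deterministically."""
    if not 0 < copies <= len(nodes):
        raise ValueError("copies must be between 1 and the number of nodes")
    digest = hashlib.sha256(subset_id.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    # Walk the node list from the hashed starting point, wrapping around.
    return [nodes[(start + i) % len(nodes)] for i in range(copies)]
```

Setting `copies` equal to the number of nodes reproduces the variation in which every node stores all data, while a smaller value spreads different subsets over different node sets.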

Readers will appreciate that RAFT-based databases (e.g., etcd) may operate like shared-nothing storage clusters, where all RAFT nodes store all data. However, the amount of data stored in the RAFT cluster may be limited so that the additional copies do not consume too much storage. Assuming that the containers do not tend to be too large and their bulk data (data manipulated by the application running in the container) is stored elsewhere, such as in an S3 cluster or an external file server, the container server cluster may also be able to copy all data to all cluster nodes. In such instances, container storage may be provided by the cluster directly through its shared-nothing storage model, with those containers providing images of the execution environment that form part of the application or service.

For further explanation, fig. 3D illustrates an exemplary computing device 350 that can be specifically configured to perform one or more of the processes described herein. As shown in fig. 3D, computing device 350 may include a communication interface 352, a processor 354, a storage device 356, and an input/output ('I/O') module 358 communicatively connected to each other via a communication infrastructure 360. Although the exemplary computing device 350 is shown in fig. 3D, the components illustrated in fig. 3D are not intended to be limiting. Additional or alternative components may be used in other embodiments. The components of the computing device 350 shown in fig. 3D will now be described in more detail.

The communication interface 352 may be configured to communicate with one or more computing devices. Examples of communication interface 352 include, but are not limited to, a wired network interface (e.g., a network interface card), a wireless network interface (e.g., a wireless network interface card), a modem, an audio/video connection, and any other suitable interface.

Processor 354 generally represents any type or form of processing unit capable of processing data and/or interpreting, executing, and/or directing the execution of one or more of the instructions, processes, and/or operations described herein. The processor 354 may perform operations by executing computer-executable instructions 362 (e.g., application programs, software, code, and/or other executable data instances) stored in the storage device 356.

The storage device 356 may include one or more data storage media, devices, or configurations, and may employ any type, form, and combination of data storage media and/or devices. For example, the storage device 356 may include, but is not limited to, any combination of the non-volatile media and/or volatile media described herein. Electronic data, including the data described herein, may be temporarily and/or permanently stored in the storage device 356. For example, data representing computer-executable instructions 362 configured to direct processor 354 to perform any of the operations described herein may be stored within storage device 356. In some examples, the data may be arranged in one or more databases residing within the storage device 356.

The I/O module 358 may include one or more I/O modules configured to receive user input and provide user output. The I/O module 358 may include any hardware, firmware, software, or combination thereof that supports input and output capabilities. For example, the I/O module 358 may include hardware and/or software for capturing user input, including but not limited to a keyboard or keypad, a touch screen component (e.g., a touch screen display), a receiver (e.g., an RF or infrared receiver), a motion sensor, and/or one or more input buttons.

The I/O module 358 may include one or more devices for presenting output to a user, including but not limited to a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., a display driver), one or more audio speakers, and one or more audio drivers. In certain embodiments, the I/O module 358 is configured to provide graphical data to a display for presentation to a user. The graphical data may represent one or more graphical user interfaces and/or any other graphical content that may serve a particular implementation. In some examples, any of the systems, computing devices, and/or other components described herein may be implemented by computing device 350.

For further explanation, FIG. 3E illustrates an example of providing storage services (also referred to herein as 'data services') on a storage system group 376. The storage system group 376 depicted in fig. 3E includes a plurality of storage systems 374a, 374b, 374c, 374d, 374n, each of which may be similar to the storage systems described herein. The storage systems 374a, 374b, 374c, 374d, 374n in the storage system group 376 may be embodied as identical storage systems or as different types of storage systems. For example, two of the storage systems 374a, 374n depicted in fig. 3E are depicted as cloud-based storage systems, as the resources that collectively form each of the storage systems 374a, 374n are provided by different cloud service providers 370, 372. For example, the first cloud service provider 370 may be Amazon AWS™, while the second cloud service provider 372 is Microsoft Azure™. In other embodiments, however, one or more public clouds, private clouds, or a combination thereof may be used to provide the underlying resources that form a particular storage system in the storage system group 376.

According to some embodiments of the present disclosure, the example depicted in fig. 3E includes an edge management service 382 for delivering storage services. The storage services (also referred to herein as 'data services') that are delivered may include, for example, services that provide a certain amount of storage to a consumer, services that provide storage to a consumer in accordance with a predetermined service level agreement, services that provide storage to a consumer in accordance with predetermined regulatory requirements, and many others.

The edge management service 382 depicted in fig. 3E may be embodied as one or more modules of computer program instructions executing, for example, on computer hardware, such as one or more computer processors. Alternatively, the edge management service 382 may be embodied as one or more modules of computer program instructions executing on a virtualized execution environment, such as one or more virtual machines, in one or more containers, or in some other manner. In other embodiments, the edge management service 382 may be embodied as a combination of the above-described embodiments, including embodiments in which one or more modules of computer program instructions contained in the edge management service 382 are distributed across multiple physical or virtual execution environments.

The edge management service 382 may operate as a gateway for providing storage services to storage consumers, where the storage services utilize storage provided by one or more storage systems 374a, 374b, 374c, 374d, 374n. For example, the edge management service 382 may be configured to provide storage services to host devices 378a, 378b, 378c, 378d, 378n that are executing one or more applications that consume the storage services. In such instances, the edge management service 382 may operate as a gateway between the host devices 378a, 378b, 378c, 378d, 378n and the storage systems 374a, 374b, 374c, 374d, 374n, rather than requiring the host devices 378a, 378b, 378c, 378d, 378n to directly access the storage systems 374a, 374b, 374c, 374d, 374n.

The edge management service 382 of FIG. 3E exposes a storage service module 380 to the host devices 378a, 378b, 378c, 378d, 378n of FIG. 3E, although in other embodiments the edge management service 382 may expose the storage service module 380 to other consumers of the various storage services. The various storage services may be presented to consumers via one or more user interfaces, via one or more APIs, or through some other mechanism provided by the storage service module 380. Accordingly, the storage service module 380 depicted in FIG. 3E may be embodied as one or more modules of computer program instructions executing on physical hardware, on a virtualized execution environment, or on a combination thereof, where executing such modules enables consumers of storage services to be offered, to select, and to access the various storage services.

The edge management service 382 of fig. 3E also includes a system management service module 384. The system management service module 384 of fig. 3E includes one or more modules of computer program instructions that, when executed, perform various operations in coordination with the storage systems 374a, 374b, 374c, 374d, 374n to provide storage services to the host devices 378a, 378b, 378c, 378d, 378n. The system management service module 384 may be configured, for example, to perform tasks such as provisioning storage resources from the storage systems 374a, 374b, 374c, 374d, 374n via one or more APIs exposed by the storage systems 374a, 374b, 374c, 374d, 374n, migrating data sets or workloads among the storage systems 374a, 374b, 374c, 374d, 374n via one or more APIs exposed by the storage systems 374a, 374b, 374c, 374d, 374n, setting one or more tunable parameters (i.e., one or more configurable settings) on the storage systems 374a, 374b, 374c, 374d, 374n via one or more APIs exposed by the storage systems 374a, 374b, 374c, 374d, 374n, and so on. For example, many of the services described below relate to embodiments in which the storage systems 374a, 374b, 374c, 374d, 374n are configured to operate in some way. In such examples, the system management service module 384 may be responsible for using the APIs (or some other mechanism) provided by the storage systems 374a, 374b, 374c, 374d, 374n to configure the storage systems 374a, 374b, 374c, 374d, 374n to operate in the manner described below.

In addition to configuring the storage systems 374a, 374b, 374c, 374d, 374n, the edge management service 382 itself may be configured to perform various tasks required to provide the various storage services. Consider an example in which the storage services include a service that, when selected and applied, causes personally identifiable information ('PII') contained in a data set to be obfuscated when the data set is accessed. In such an example, the storage systems 374a, 374b, 374c, 374d, 374n may be configured to obfuscate PII when servicing read requests directed to the data set. Alternatively, the storage systems 374a, 374b, 374c, 374d, 374n may service reads by returning data that includes the PII, but the edge management service 382 itself may obfuscate the PII as the data passes through the edge management service 382 on its way from the storage systems 374a, 374b, 374c, 374d, 374n to the host devices 378a, 378b, 378c, 378d, 378n.
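The second alternative, in which the gateway masks PII as data passes through it on the way to the host, can be sketched as a thin wrapper around a read call. Everything here is hypothetical: the patterns are deliberately simplistic, and `gateway_read` and `storage_read` stand in for whatever read path an actual edge management service would use.

```python
import re

# Deliberately simple, hypothetical patterns; real PII detection is much broader.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),              # US SSN-like numbers
    re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),   # email-like strings
]


def obfuscate_pii(text: str, mask: str = "[REDACTED]") -> str:
    """Replace every match of every PII pattern with the mask."""
    for pattern in PII_PATTERNS:
        text = pattern.sub(mask, text)
    return text


def gateway_read(storage_read, key: str) -> str:
    """Fetch from the backing storage system, masking PII on the way out."""
    return obfuscate_pii(storage_read(key))
```

Placing the masking in the gateway leaves the stored data unchanged, so the same data set can still be served unmasked to consumers for which the service has not been selected.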

The storage systems 374a, 374b, 374c, 374d, 374n depicted in fig. 3E may be embodied as one or more of the storage systems described above with reference to fig. 1A-3D, including variations thereof. In fact, the storage systems 374a, 374b, 374c, 374d, 374n may be used as a pool of storage resources, wherein individual components in the pool have different performance characteristics, different storage characteristics, and the like. For example, one of the storage systems 374a may be a cloud-based storage system, another storage system 374b may be a storage system that provides block storage, another storage system 374c may be a storage system that provides file storage, another storage system 374d may be a relatively high-performance storage system, another storage system 374n may be a relatively low-performance storage system, and so on. In alternative embodiments, only a single storage system may be present.

The storage systems 374a, 374b, 374c, 374d, 374n depicted in fig. 3E may also be organized into different failure domains, such that failure of one storage system 374a should be completely independent of failure of another storage system 374b. For example, each storage system may receive power from a separate power system, each storage system may be coupled for data communication over a separate data communication network, and so on. Furthermore, the storage systems in a first failure domain may be accessed via a first gateway, while the storage systems in a second failure domain may be accessed via a second gateway. For example, the first gateway may be a first instance of the edge management service 382 and the second gateway may be a second instance of the edge management service 382, including embodiments in which each instance is distinct or each instance is part of a distributed edge management service 382.

As an illustrative example of available storage services, storage services associated with different levels of data protection may be presented to a user. For example, a storage service may be presented to the user that, when selected and implemented, ensures that data associated with the user will be protected such that various recovery point objectives ('RPOs') can be guaranteed. A first available storage service may ensure, for example, that some data set associated with the user will be protected so that any data more than 5 seconds old can be recovered in the event of a failure of the primary data store, while a second available storage service may ensure that the data set associated with the user will be protected so that any data more than 5 minutes old can be recovered in the event of a failure of the primary data store.

Additional examples of storage services that may be presented to, selected by, and ultimately applied to a data set associated with a user may include one or more data compliance services. Such data compliance services may be embodied as, for example, services provided to consumers (i.e., users) to ensure that the user's data sets are managed in a manner that meets various regulatory requirements. For example, one or more data compliance services may be provided to the user to ensure that the user's data set is managed in a manner that complies with the General Data Protection Regulation ('GDPR'), one or more data compliance services may be provided to the user to ensure that the user's data set is managed in a manner that complies with the Sarbanes-Oxley Act of 2002 ('SOX'), or one or more data compliance services may be provided to the user to ensure that the user's data set is managed in a manner that complies with some other regulation. In addition, one or more data compliance services may be provided to the user to ensure that the user's data set is managed in a manner consistent with certain non-governmental guidelines (e.g., consistent with best practices for auditing purposes), one or more data compliance services may be provided to the user to ensure that the user's data set is managed in a manner consistent with specific customer or organization requirements, and so forth.

Consider an example in which a particular data compliance service is designed to ensure that a user's data set is managed in a manner that meets the requirements set forth in the GDPR. While a list of all the requirements of the GDPR may be found in the regulation itself, for purposes of illustration, an example requirement set forth in the GDPR is that a pseudonymization process must be applied to stored data in order to transform personal data in such a way that the resulting data cannot be attributed to a specific data subject without the use of additional information. For example, data encryption techniques may be applied to render the original data unintelligible, such that the process cannot be reversed without access to the correct decryption key. Thus, the GDPR may require that the decryption key be kept separately from the pseudonymized data. A particular data compliance service may be provided to ensure compliance with the requirements set forth in this paragraph.

To provide such a particular data compliance service, the data compliance service may be presented to and selected by the user (e.g., via a GUI). In response to receiving a selection of a particular data compliance service, one or more storage service policies may be applied to a data set associated with a user to effectuate the particular data compliance service. For example, a storage service policy may be applied that requires the data set to be encrypted before storage in the storage system, before storage in the cloud environment, or before storage elsewhere. To implement such a policy, not only may a requirement be implemented that the data set be encrypted at the time of storage, but a requirement may be implemented that the data set be encrypted prior to transmission (e.g., sending the data set to another party). In such instances, a storage service policy may also be enforced that requires that any encryption keys used to encrypt the data set not be stored on the same system that stores the data set itself. The reader will appreciate that many other forms of data compliance services may be provided and implemented in accordance with embodiments of the present disclosure.

The storage systems 374a, 374b, 374c, 374d, 374n in the storage system group 376 may be commonly managed, for example, by one or more group management modules. The group management module may be part of or separate from the system management service module 384 depicted in fig. 3E. The group management module may perform tasks such as monitoring the health of each storage system in the group, initiating updates or upgrades on one or more storage systems in the group, migrating workloads for load balancing or other performance purposes, and many other tasks. Accordingly, and for many other reasons, the storage systems 374a, 374b, 374c, 374d, 374n may be coupled to one another via one or more data communication links in order to exchange data between the storage systems 374a, 374b, 374c, 374d, 374n.

The storage systems described herein may support various forms of data replication. For example, two or more storage systems may synchronously replicate a data set between each other. In synchronous replication, distinct copies of a particular data set may be maintained by multiple storage systems, but all accesses (e.g., reads) of the data set should produce consistent results, regardless of which storage system the access is directed to. For example, a read directed to any storage system that synchronously replicates the data set should return the same result. Thus, while updates to the copies of the data set need not occur at exactly the same time, precautions must be taken to ensure consistent access to the data set. For example, if an update (e.g., a write) directed to the data set is received by a first storage system, the update may be acknowledged as complete only if all storage systems that synchronously replicate the data set have applied the update to their copies of the data set. In such instances, synchronous replication may be performed using I/O forwarding (e.g., writes received at a first storage system are forwarded to a second storage system), communication between storage systems (e.g., each storage system indicates that it has completed an update), or otherwise.
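The acknowledgment rule for synchronous replication can be illustrated with a minimal sketch. The `StorageSystem` class, its `reachable` flag, and the `synchronous_write` helper are hypothetical names introduced only for illustration; the sketch models I/O forwarding with in-memory dictionaries and is not the implementation of any embodiment.

```python
from dataclasses import dataclass, field


@dataclass
class StorageSystem:
    """Hypothetical in-memory stand-in for one storage system's copy of a data set."""
    name: str
    data: dict = field(default_factory=dict)
    reachable: bool = True

    def apply(self, key, value):
        # Returns True only if the update was applied to this copy.
        if not self.reachable:
            return False
        self.data[key] = value
        return True


def synchronous_write(receiver, peers, key, value):
    """Forward a write to every peer; acknowledge only if all copies applied it."""
    results = [receiver.apply(key, value)]
    results += [peer.apply(key, value) for peer in peers]  # I/O forwarding
    return all(results)


a, b = StorageSystem("a"), StorageSystem("b")
acked = synchronous_write(a, [b], "block-7", b"payload")
# Once acknowledged, a read from either system returns the same result.
same = a.data["block-7"] == b.data["block-7"]
```

In this sketch a write that cannot reach every replica is simply not acknowledged; a real system would also need a policy for retrying or for removing the unreachable replica, which is outside the scope of the example.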

In other embodiments, the data set may be replicated by using checkpoints. In checkpoint-based replication (also referred to as "near synchronous replication"), a set of updates to a data set (e.g., one or more write operations directed to the data set) may occur between different checkpoints, such that the data set has been updated to a particular checkpoint only if all updates to the data set that precede the particular checkpoint have been completed. Consider an example in which a first storage system stores a live copy of a data set that is being accessed by users of the data set. In such an example, assume that the data set is replicated from the first storage system to a second storage system using checkpoint-based replication. For example, the first storage system may send a first checkpoint (at time t=0) to the second storage system, followed by a first set of updates to the data set, followed by a second checkpoint (at time t=1), followed by a second set of updates to the data set, followed by a third checkpoint (at time t=2). In such an example, if the second storage system has performed all of the updates in the first set of updates, but has not yet performed all of the updates in the second set of updates, then the copy of the data set stored on the second storage system is up to date as of the second checkpoint. Alternatively, if the second storage system has performed all of the updates in both the first set of updates and the second set of updates, then the copy of the data set stored on the second storage system is up to date as of the third checkpoint. Readers will appreciate that various types of checkpoints (e.g., metadata-only checkpoints) may be used, checkpoints may be spaced out based on various factors (e.g., time, number of operations, RPO settings), and so forth.
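The checkpoint example can be traced with a short sketch. The stream encoding (tuples tagged `"checkpoint"` or `"update"`) and the function name `latest_consistent_checkpoint` are assumptions made for illustration; the sketch only shows how the most recent checkpoint a target has reached follows from the number of updates it has applied in order.

```python
def latest_consistent_checkpoint(stream, applied_count):
    """Return the newest checkpoint the target is up to date with.

    stream: ordered items sent by the source, each ("checkpoint", t) or ("update", payload).
    applied_count: how many update items the target has applied, in order.
    """
    latest = None
    updates_seen = 0
    for kind, value in stream:
        if kind == "update":
            updates_seen += 1
        elif updates_seen <= applied_count:
            # Every update sent before this checkpoint has been applied.
            latest = value
        else:
            break
    return latest


stream = [
    ("checkpoint", 0),
    ("update", "w1"), ("update", "w2"),   # first set of updates
    ("checkpoint", 1),
    ("update", "w3"), ("update", "w4"),   # second set of updates
    ("checkpoint", 2),
]
partly_applied = latest_consistent_checkpoint(stream, 3)  # second set incomplete
fully_applied = latest_consistent_checkpoint(stream, 4)   # both sets complete
```

With three of the four updates applied, the target is consistent as of the checkpoint at t=1; once all four are applied, it is consistent as of the checkpoint at t=2.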

In other embodiments, the data set may be replicated by snapshot-based replication (also referred to as "asynchronous replication"). In snapshot-based replication, a snapshot of a data set may be sent from a replication source, such as a first storage system, to a replication target, such as a second storage system. In such embodiments, each snapshot may include the entire data set or a subset of the data set, such as, for example, only the portion of the data set that has changed since the last snapshot was sent from the replication source to the replication target. The reader will appreciate that the snapshots may be sent on demand, based on policies that take into account various factors (e.g., time, number of operations, RPO settings), or in some other manner.

The storage systems described above may be configured, alone or in combination, to function as continuous data protection storage. Continuous data protection storage is a feature of storage systems that records updates to a data set such that consistent images of the previous contents of the data set can be accessed at a low granularity of time (typically on the order of seconds or even less) and traced back for a reasonable period of time (typically hours or days). These allow access to the most recent consistent point in time of the data set, and also allow access to points in time of the data set that may immediately precede an event (e.g., resulting in portions of the data set being damaged or otherwise lost), while maintaining close to the maximum number of updates before the event. Conceptually, they appear as a series of snapshots of a data set taken very frequently and kept for a long period of time, but continuous data protection storage is typically implemented quite differently from snapshots. The storage system implementing the continuous data protection storage of data may further provide a means to access these points in time, to access one or more of these points in time as a snapshot or clone copy, or to restore the data set to one of those recorded points in time.

Over time, to reduce overhead, some points in time held in the continuous data protection store may be merged with other nearby points in time, essentially deleting some of these points in time from the store. This may reduce the capacity required to store updates. A limited number of these points in time may also be converted into longer-lasting snapshots. For example, such a store may keep a low granularity sequence of points in time stretching back a few hours from the present, with some points in time merged or deleted to reduce overhead for up to an additional day. Stretching back further than that, some of these points in time may be converted into snapshots representing consistent point-in-time images only every few hours.

Although some embodiments are described primarily in the context of a storage system, those skilled in the art will recognize that embodiments of the disclosure may also take the form of a computer program product disposed on a computer-readable storage medium for use with any suitable processing system. Such computer-readable storage media may be any storage media for machine-readable information, including magnetic media, optical media, solid-state media, or other suitable media. Examples of such media include magnetic disks in hard disk drives or diskettes, optical disks for optical disk drives, magnetic tape, and other media as will occur to those of skill in the art. Those skilled in the art will immediately recognize that any computer system having suitable programming means will be capable of executing the steps described herein, as embodied in a computer program product. Those skilled in the art will also recognize that, while some embodiments described in this specification are directed to software installed and executed on computer hardware, alternative embodiments implemented as firmware or hardware are well within the scope of the disclosure.

In some examples, a non-transitory computer-readable medium storing computer-readable instructions may be provided in accordance with the principles described herein. The instructions, when executed by a processor of a computing device, may direct the processor and/or the computing device to perform one or more operations, including one or more operations described herein. Such instructions may be stored and/or transmitted using any of a variety of known computer-readable media.

The non-transitory computer-readable media referred to herein may comprise any non-transitory storage medium that participates in providing data (e.g., instructions) that may be read and/or executed by a computing device (e.g., by a processor of a computing device). For example, a non-transitory computer-readable medium may include, but is not limited to, any combination of non-volatile storage media and/or volatile storage media. Exemplary non-volatile storage media include, but are not limited to, read-only memory, flash memory, solid state drives, magnetic storage devices (e.g., hard disk, floppy disk, magnetic tape, etc.), ferroelectric random access memory ("RAM"), and optical discs (e.g., compact disc, digital video disc, Blu-ray disc, etc.). Exemplary volatile storage media include, but are not limited to, RAM (e.g., dynamic RAM).

Turning now to a solution to the reliability problem affecting multi-chassis storage systems and storage systems with a large number of blades, storage systems having elastic groups are described. The blades of the storage system are divided into groups, referred to as elastic groups, and when performing a write the storage system selects target blades belonging to the same group. The embodiments are organized around four mechanisms, presented below.

1) Formation of elastic groups, and how storage system reliability is improved as the storage system expands.

2) Write path, and how data writes scale reliably as the storage system expands: flash writes (segment formation) and NVRAM insertion should not cross the boundaries of a write group.

3) Majority groups, and how startup and upgrade scale reliably as the storage system expands: a majority (quorum) of blades participates in the consensus algorithm. Majority groups are used by the startup process.

4) Authority election, and how the system scales reliably as the storage system expands: a witness algorithm may be performed with a witness group within elastic group boundaries.

FIG. 4 depicts an elastic group 402 in a storage system, supporting data recovery in the event that up to a specified number of blades 252 of the elastic group 402 are lost. In some embodiments, the formation of elastic groups is determined solely by the cluster geometry, including the location of each blade in each chassis. A given cluster geometry has only one current version of the elastic groups. The elastic groups may change if blades are added, removed, or moved to different slots. In some embodiments, no more than one blade may be added to or removed from the cluster geometry at a time. In some embodiments, the elastic groups are computed by an array leader (running on a top of rack (ToR) switch) and pushed as part of a cluster geometry update.

For the write path, in some embodiments, flash segment formation selects target blades from a single elastic group. NVRAM insertion likewise first selects target blades from a single elastic group. In addition, NVRAM insertion attempts to select blades that are also in the same chassis, to avoid overburdening the ToR switch. When the cluster geometry changes, the array leader recalculates the elastic groups. If there is a change, the array leader pushes a new version of the cluster geometry to each blade in the storage system. Each authority rescans the basic storage segments and gathers a list of segments that are split across two or more of the new elastic groups. Garbage collection uses this list and remaps those segments into a single elastic group to achieve super-redundancy, i.e., the state in which every data segment lies within one elastic group and no data segment spans two or more elastic groups.

As illustrated in fig. 4, in some embodiments, one or more write groups 404 are formed within each elastic group 402 for data striping (i.e., writing data into data stripes). For example, each elastic group 402 may include fewer than all blades/storage nodes of a chassis, or all blades and storage nodes of two chassis, or fewer than those, or more than those. It should be appreciated that the number and/or arrangement of blades or other memory-equipped components in the elastic groups 402 may be user-definable, adaptable, or determined based on component characteristics, system architecture, or other criteria.

In one embodiment, write group 404 is a group of blades 252 (or other physical devices having memory) across which data striping is performed. In some embodiments, the write group is a RAID stripe across a group of blades 252. Error coded data written in data stripes in write group 404 may be recovered by error correction in the event of loss of up to a specified number of blades 252. For example, using N+2 redundancy and error coding/error correction, data can be recovered in the event that two blades are lost. Other redundancy levels are readily defined for the storage system. In some embodiments, each write group 404 may extend across a number of blades 252 that is less than or equal to the total number of blades 252 in the elastic group 402 to which the write group 404 belongs, depending on the level of redundancy. For example, an elastic group 402 of ten blades may support write groups 404 of three blades 252 (e.g., double mirrored), seven blades 252 (e.g., N+2 redundancy, N=5), and/or ten blades 252 (e.g., N+2 redundancy, N=8), among other possibilities. Because the write groups 404 are established in this manner, any and all data in the elastic group 402 may be recovered by error correction within the elastic group 402 in the event of loss of up to the same specified number of blades 252 that the error coding and error correction in each write group 404 tolerate. For write groups 404 with N+2 redundancy in an elastic group 402, all data in the elastic group 402 is recoverable even if up to two blades 252 fail, fail to respond, or are otherwise lost.
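The relationship between write groups and their enclosing elastic group can be checked with a small sketch. The function names `is_valid_write_group` and `is_recoverable` and the blade labels are hypothetical; the sketch assumes one shard per blade and counts lost shards only, without performing any actual erasure coding.

```python
def is_valid_write_group(write_group, elastic_group, data_shards, parity_shards):
    """A write group must lie inside one elastic group and hold N + parity shards."""
    members = set(write_group)
    return (
        members <= set(elastic_group)
        and len(members) == len(write_group)  # one shard per blade
        and len(members) == data_shards + parity_shards
    )


def is_recoverable(write_group, lost_blades, parity_shards):
    """A stripe survives as long as it lost no more shards than it has parity."""
    return len(set(write_group) & set(lost_blades)) <= parity_shards


elastic_group = [f"blade-{i}" for i in range(10)]
wide = elastic_group        # N+2 redundancy with N=8, ten blades
narrow = elastic_group[:7]  # N+2 redundancy with N=5, seven blades

two_lost = ["blade-0", "blade-9"]
wide_survives = is_recoverable(wide, two_lost, parity_shards=2)
narrow_survives = is_recoverable(narrow, two_lost, parity_shards=2)
```

Because every write group is a subset of the elastic group, losing any two blades of the elastic group removes at most two shards from any stripe, which is why the whole group tolerates the same number of losses as each write group.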

Some embodiments of the storage system have authorities 168 for ownership and access of data; see, for example, the storage clusters described above with reference to fig. 2B-2G. An authority 168 may be a software or firmware entity or agent, executing on one or more processors, that generates and accesses metadata and has ownership of and access rights to a specified range of user data (e.g., a range of segment numbers or one or more inode numbers). Data owned and accessed by an authority 168 is striped across one or more write groups 404. In versions of the storage system with authorities 168 and elastic groups 402, an authority on one blade 252 in an elastic group 402 may access write groups 404 in the same elastic group 402, as shown by some of the arrows in fig. 4. Also, an authority on a blade 252 in one elastic group 402 may access a write group 404 in another elastic group 402, as shown by other arrows in FIG. 4.

FIG. 5 is a scenario of a geometry change 506 of a storage system, in which a blade 252 is added to storage cluster 161, resulting in a change from a previous version of elastic group 502 to a current version of elastic groups 504. In this example, the storage system enforces a rule that an elastic group 402 cannot be as large as two chassis 138 filled with blades 252. In one embodiment, where the capacity of a chassis 138 is a maximum of 15 blades 252, this rule means that the maximum number of blades 252 in an elastic group 402 is 29 blades in two chassis 138 (i.e., one less than 2×15). Adding a blade 252 to storage cluster 161 such that the total blade count exceeds the maximum for an elastic group 402 results in the storage system reassigning blades to elastic groups 402 that do not violate this rule. Thus, the 29 blades 252 of the previous version of the elastic group 502 are reassigned to the current version of the elastic groups 504, labeled "elastic group 1" and "elastic group 2". Other rules and other scenarios for the elastic groups 402 and geometry changes 506 of the storage system are readily devised in accordance with the teachings herein. For example, a storage cluster with 150 blades may have 10 elastic groups, each with 15 blades. Removing blades 252 from storage cluster 161 may result in blades 252 of more than one elastic group 402 being reassigned to a smaller number of elastic groups 402 or to a single elastic group 402. Adding a chassis with blades 252 to the storage system, or removing one, may result in blades 252 being reassigned to elastic groups 402. By specifying a maximum number and/or relative arrangement of blades 252 for an elastic group 402, and assigning blades 252 to elastic groups 402 accordingly, the storage system can expand or contract its storage capacity while component-loss survivability and data recoverability scale accordingly, without an exponential increase in the probability of data loss.
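The effect of bounding group size on the probability of data loss can be estimated with a back-of-the-envelope sketch. It assumes, purely for illustration, that blades are unavailable independently with the same probability `p`, that stripes use N+2 redundancy, and that without grouping any three lost blades may share a stripe; the value of `p` and the function name `prob_more_than` are not taken from the disclosure.

```python
from math import comb


def prob_more_than(k, n, p):
    """P(more than k of n blades are down), blades failing independently with probability p."""
    at_most_k = sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))
    return 1.0 - at_most_k


p = 0.01  # assumed per-blade unavailability, purely illustrative

# One undivided pool of 150 blades: three or more losses anywhere can hit a stripe.
one_big_group = prob_more_than(2, 150, p)

# A single 15-blade group, and ten such groups treated as independent.
one_small_group = prob_more_than(2, 15, p)
ten_small_groups = 1.0 - (1.0 - one_small_group) ** 10
```

Under these assumptions, the chance that some 15-blade group loses three or more blades is far smaller than the chance that the undivided 150-blade pool does, which is the scaling benefit the grouping is meant to provide.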

FIG. 6 is a system and action diagram showing authorities 168 in distributed computing resources 602 of a storage system and a switch 146 in communication with chassis 138 and blades 252 to form elastic groups 402. Embodiments of switch 146 are described above with reference to fig. 2F and 2G. Embodiments of the distributed computing resources 602 and memory 604 are described above with reference to fig. 2A-2G and 3B. Other embodiments of the switch 146, distributed computing resources 602, and memory 604 are readily devised in accordance with the teachings herein.

Continuing with FIG. 6, switch 146 has an array leader 606, implemented in software, firmware, hardware, or a combination thereof, in communication with chassis 138 and blades 252. Array leader 606 evaluates the number and arrangement of blades 252 in one or more chassis 138, determines membership of the elastic groups 402, and communicates information regarding membership of the elastic groups 402 to chassis 138 and blades 252. In various embodiments, such determination may be rule-based, table-based, derived from a data structure or artificial intelligence, or the like. In other embodiments, the blades 252 may form the elastic groups 402 through a decision process in the distributed computing resources 602.
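One possible rule-based determination is sketched below. The balancing policy (the fewest groups that respect a size cap, with sizes differing by at most one) and the function name `form_elastic_groups` are assumptions for illustration; the disclosure does not prescribe this particular rule, and the input is assumed to be a list of blades already ordered by chassis and slot.

```python
from math import ceil


def form_elastic_groups(blades, max_group_size):
    """Split an ordered list of blades into the fewest balanced groups within the size cap."""
    if not blades:
        return []
    group_count = ceil(len(blades) / max_group_size)
    base, extra = divmod(len(blades), group_count)
    groups, start = [], 0
    for index in range(group_count):
        size = base + (1 if index < extra else 0)  # first `extra` groups get one more blade
        groups.append(blades[start:start + size])
        start += size
    return groups


# 29 blades fit in one group under a cap of 29; adding one blade forces a split.
before = form_elastic_groups(list(range(29)), max_group_size=29)
after = form_elastic_groups(list(range(30)), max_group_size=29)
# A 150-blade cluster with a cap of 15 blades per group.
large = form_elastic_groups(list(range(150)), max_group_size=15)
```

Here `before` is a single group of 29 blades, `after` is two groups of 15, and `large` is ten groups of 15, matching the scenarios described for FIG. 5 under the stated caps.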

FIG. 7 depicts garbage collection with respect to elastic groups 402 reclaiming memory and relocating data. In such a scenario, as in FIG. 5, adding blade 252 to cluster 161 causes the storage system to reassign blade 252 of the previous version of elastic group 502 to the current version of elastic group 504. The data stripe 702 in the previous version of the elastic group 502 is split across the two current versions of the elastic group 504. That is, some of the data in data stripe 702 is physically located in memory on some (e.g., one or more) blades 252 in elastic group 1, and some of the data in data stripe 702 is physically located in memory on some (e.g., one or more) blades 252 in elastic group 2. If the data of the data stripe is not relocated, the data will be susceptible to the loss of two blades in elastic group 1 and the loss of one blade in elastic group 2, or vice versa, while the data in the data stripe in write group 404 in each current elastic group 504 (i.e., elastic group 1 and elastic group 2) will be recoverable (see FIG. 4). To remedy this situation, so that the data conforms to the current version of the elastic groups 504, the data in the data stripe 702 is relocated (i.e., reassigned, transferred) to one, the other or both of the current version of the elastic groups 504, but does not span both elastic groups 402. That is, in some versions, the data of data stripe 702 is transferred to elastic group 1, or the data of data stripe 702 is transferred to elastic group 2. In some versions, a data stripe may be split into two data stripes, with the data of one of the two new data stripes being transferred to elastic group 1 and the data of the other of the two new data stripes being transferred to elastic group 2. It should not be the case that the data stripe 702 or any subdivided new data stripe is partitioned across the current elastic group 504. 
While in other embodiments such relocation may be defined as a separate process, in the embodiment illustrated in FIG. 7, garbage collection performs the relocation of data with assistance from authorities 168. Such a version is described below with reference to fig. 8. Variations of the scenario shown in fig. 7 are readily devised, with other geometry changes and other arrangements of the elastic groups 402.
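Finding the data stripes that need relocation after a geometry change can be sketched as follows. The stripe identifiers, the blade numbering, and the function name `stripes_needing_relocation` are hypothetical; the sketch assumes each stripe is described by the list of blades that hold its shards and that a map from blade to current elastic group is available.

```python
def stripes_needing_relocation(stripes, blade_to_group):
    """Return ids of stripes whose blades fall into two or more current elastic groups."""
    spanning = []
    for stripe_id, blades in stripes.items():
        groups = {blade_to_group[blade] for blade in blades}
        if len(groups) > 1:
            spanning.append(stripe_id)
    return spanning


# Current version: blades 0-14 in elastic group 1, blades 15-29 in elastic group 2.
blade_to_group = {b: 1 if b < 15 else 2 for b in range(30)}
stripes = {
    "stripe-a": [0, 3, 7, 11],     # entirely in elastic group 1
    "stripe-b": [12, 13, 14, 15],  # split across elastic groups 1 and 2
    "stripe-c": [20, 21, 22, 23],  # entirely in elastic group 2
}
to_relocate = stripes_needing_relocation(stripes, blade_to_group)
```

Only `stripe-b` is reported, since it is the only stripe whose shards sit in both current elastic groups.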

FIG. 8 is a system and action diagram showing details of garbage collection 802 coordinating memory recovery with scanning 810 and relocation 822 of data. Standard garbage collection, a known process (e.g., in software, firmware, hardware, or a combination thereof) for moving data and then erasing and reclaiming memory, is modified herein to relocate 822 live data 808 within an elastic group 402 or to another elastic group 402, so that memory holding obsolete data 806 can be erased 816 and reclaimed 818. By relocating data with respect to the elastic groups 402, the storage system improves reliability and data recoverability in the face of failure and expansion scenarios such as those discussed with respect to FIG. 4. By combining garbage collection with relocation of data into the current elastic groups 504, the storage system is more efficient than it would be if relocation were a separate process.

In the embodiment and scenario shown in FIG. 8, garbage collection 802 cooperates with authorities 168 to scan 810, access 812, and recover 814 memory. The authorities 168 provide addressing information about where data is located, and information (e.g., metadata) about the data stripe 804, write group 404, and elastic group 402 with which the data is associated or to which it is mapped, for use by garbage collection 802 in determining whether to move the data and where to move it. Generally, a physical location in memory is in either an erased condition or a written condition, and if in a written condition, then in some embodiments the data is either live data (i.e., valid data that can be read) or obsolete data (i.e., invalid data, because it has been overwritten or deleted). A memory block having a sufficient amount (e.g., above a threshold) of erased or obsolete locations or addresses may be erased and reclaimed after any remaining live data has been relocated, i.e., moved or copied out of the block and marked obsolete in the block. In cooperation with authorities 168, garbage collection 802 may identify candidate regions in memory for reclamation 818 using a top-down process that begins with data segments and works down through the levels of address indirection to the physical addressing of the memory. Alternatively, garbage collection 802 may use a bottom-up process that begins with the physical addressing of memory and works up through the address indirection to data segment numbers and the associated data stripes 804, write groups 404, and elastic groups 402. Other embodiments of mechanisms for identifying regions for reclamation 818 are readily developed in accordance with the teachings herein.

Once the garbage collection 802 of FIG. 8 has identified candidate regions in memory, the garbage collection 802 determines whether each region contains obsolete data 806, live data 808, erased or unused/unwritten addresses, or some combination of these. Regions (e.g., solid state memory blocks) containing only obsolete data 806 may be erased 816 and reclaimed 818 without data movement. A region in which the amount of obsolete data 806 or unused addresses exceeds a threshold, but which still holds some live data 808, is a candidate region for relocating the live data 808. To relocate 822 the live data 808, the garbage collection 802 reads the live data 808, copies it, and writes it to the same current elastic group 504 (if the live data 808 is found there) or from the previous version of the elastic group 502 to the current version of the elastic groups 504 (if the live data 808 is found in the previous version of the elastic group 502). After relocation 822, garbage collection 802 may mark the data in the previous location as obsolete 820, then erase 816 and reclaim 818 the memory occupied by obsolete data 806. The relocated live data 808 is remapped 824, with the garbage collection 802 cooperating with the authorities 168 to write metadata 826 that reassigns the live data 808 to a write group 404 in an elastic group 402, as described above.
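The per-region decision and the choice of destination can be sketched as follows. The page counts, the default threshold of 0.5, the `successor` map from previous-version groups to current groups, and both function names are assumptions made for illustration rather than details of the described garbage collection 802.

```python
def classify_region(live, obsolete, unused, threshold=0.5):
    """Decide what garbage collection does with a region, given page counts."""
    total = live + obsolete + unused
    if total == 0:
        raise ValueError("region has no pages")
    if live == 0:
        return "erase"                # nothing to move; erase and reclaim
    if (obsolete + unused) / total > threshold:
        return "relocate_then_erase"  # move the live data, then erase and reclaim
    return "keep"


def relocation_target(found_in, current_groups, successor):
    """Pick the elastic group that receives relocated live data."""
    if found_in in current_groups:
        return found_in               # rewrite within the same current group
    # Hypothetical map from a previous-version group to a current-version group.
    return successor[found_in]


decision = classify_region(live=2, obsolete=7, unused=1)
target = relocation_target("previous-group", {"group-1", "group-2"},
                           {"previous-group": "group-2"})
```

In the usage lines, a region that is 80% obsolete or unused is selected for relocation, and live data found in a previous-version group is sent to the current group named in the assumed `successor` map.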

Fig. 9 depicts a majority group 904 (i.e., a quorum group) for a startup process 902 using the elastic groups 402. The storage system forms majority groups 904 for the startup process 902 based on the elastic groups 402 such that, even as the storage system expands, sufficient resources are available within each elastic group 402 for successful and reliable operation. When no data stripes 702, 804 span two or more elastic groups 402, the majority group 904 used by the startup process 902 is the same as the elastic group 402. When there are data stripes holding startup information that span two or more elastic groups 402 (e.g., due to a recent cluster geometry change 506), the majority group 904 is a superset of the elastic groups 402, formed by merging the blades 252 of two or more elastic groups 402 into one, and may include all storage nodes 150 or blades 252. The startup process 902 for a majority group 904 coordinates across all blades/storage nodes in the majority group 904 to ensure that all data stripes with startup information (e.g., from the commit logs and the basic storage segments) can be read correctly.

In some embodiments, in steady state, the majority group 904 is the same as the elastic group 402. In any given state, the formation of the majority groups 904 is determined by the current version of the elastic groups 504 and the past history of elastic group 402 formation. To begin the startup process 902, the various services must reach a quorum in each majority group 904 (N-2 blades, where N is the size of the group, not to be confused with the N of N+2 redundancy). When the elastic groups 402 change, the array leader 606 calculates new majority groups 904 from the current elastic groups 402 and the past majority groups 904 (if any). The array leader 606 saves the new majority groups 904 to the distributed store. After the new majority groups 904 are saved, the array leader 606 pushes the new cluster geometry to the blades 252. Each authority 168 flushes NVRAM and remaps segments until all basic storage segments and commit logs reach super-redundancy (i.e., conformance with the elastic groups 402). The array leader 606 continues to poll the NFS status until it reaches super-redundancy. The array leader 606 then retires the past majority groups 904 from the distributed store.
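One way to picture how majority groups relate to elastic groups is the sketch below. It assumes, as a simplification, that the set of elastic groups touched by each not-yet-relocated stripe is known, and merges groups that share such a stripe; the function names, the dictionary layout, and the union-find merging are illustrative assumptions, not the array leader's actual calculation.

```python
def form_majority_groups(elastic_groups, spanning_stripes):
    """Merge elastic groups that still share a data stripe; others stay as they are.

    elastic_groups: dict of group id -> set of blades.
    spanning_stripes: iterable of sets of group ids that one stripe touches.
    """
    parent = {gid: gid for gid in elastic_groups}

    def find(gid):
        while parent[gid] != gid:
            parent[gid] = parent[parent[gid]]  # path halving
            gid = parent[gid]
        return gid

    for touched in spanning_stripes:
        ordered = sorted(touched)
        for other in ordered[1:]:
            parent[find(other)] = find(ordered[0])

    merged = {}
    for gid, blades in elastic_groups.items():
        merged.setdefault(find(gid), set()).update(blades)
    return list(merged.values())


def quorum_size(group):
    """N-2 blades, where N is the size of the majority group."""
    return len(group) - 2


groups = {1: set(range(0, 15)), 2: set(range(15, 30)), 3: set(range(30, 45))}
steady = form_majority_groups(groups, [])           # no stripe spans groups
transition = form_majority_groups(groups, [{1, 2}])  # one stripe spans groups 1 and 2
```

In the steady case each majority group equals an elastic group; in the transitional case groups 1 and 2 merge into one 30-blade majority group with a quorum of 28, while group 3 is unaffected.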

FIG. 10 depicts a witness group 1006 and authority election 1002 using the elastic groups 402. The storage nodes 150 coordinate and select the first elastic group 402 of the storage cluster as the witness group 1006. The witness group 1006 hosts the authority election 1002 and continues to grant authority leases 1004 (a lease being defined as the time span during which an authority 168 is available for user data I/O). In some embodiments, the elastic groups 402 and the witness group 1006 are updated as the cluster geometry changes.

FIG. 11 is a system and action diagram showing details of authority elections 1002 and the casting of majority votes 1102, organized by the witness group 1006 and involving blades 252 across the storage system. At startup, the members of the witness group 1006 receive majority votes 1102 (e.g., N/2+1 votes, where N is the number of blades in the storage cluster, as distinct from the N in N+2 redundancy) from all blades 252 in the cluster geometry. Upon receiving the majority votes 1102, the highest-ranked blade 252 in the elastic group 402 determines the placement of authorities 168 in the blades 252 (from a static dictionary), as illustrated by the arrows in FIG. 11, and sends each blade 252 an endorsement message with the numbers of the authorities to be started. Upon receipt of the authority endorsement message, each blade 252 checks that the message is from the same geometry version and starts the authorities 168 (e.g., executing threads in a multi-threaded processor environment, accessing metadata and data in memory 604). The witness group 1006 continues to renew the authority leases 1004 until the elastic groups 402 change again.
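As a rough illustration of the vote threshold and the endorsement check, the following Python sketch may help. All names (`has_majority`, `build_endorsements`, `authorities_to_start`) and the message layout are hypothetical, and reading N/2+1 as integer division is an assumption.

```python
from typing import Dict, List


def has_majority(votes_received: int, blades_in_cluster: int) -> bool:
    """Majority is N/2+1 votes, N being the number of blades in the cluster.

    Assumption: N/2 is integer division.
    """
    return votes_received >= blades_in_cluster // 2 + 1


def build_endorsements(placement: Dict[int, str],
                       geometry_version: int) -> Dict[str, dict]:
    """Turn a static authority-to-blade dictionary into one message per blade."""
    messages: Dict[str, dict] = {}
    for authority_id, blade in sorted(placement.items()):
        message = messages.setdefault(
            blade, {"geometry_version": geometry_version, "authorities": []})
        message["authorities"].append(authority_id)
    return messages


def authorities_to_start(message: dict, blade_geometry_version: int) -> List[int]:
    """A blade starts authorities only if the message matches its geometry version."""
    if message["geometry_version"] != blade_geometry_version:
        return []
    return list(message["authorities"])
```

In a ten-blade cluster the sketch requires six votes, and a blade holding a different geometry version ignores the endorsement.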

FIG. 12 is a flow chart of a method of operating a storage system having elastic groups. The method may be practiced by the storage systems described herein and variations thereof, more specifically by a processor in a storage system. In act 1202, elastic groups of blades in the storage system are formed. Elastic groups are discussed with reference to FIG. 4. In act 1204, data stripes are written in the elastic groups. In decision act 1206, it is determined whether there is a geometry change in the storage system. If the answer is no, there is no geometry change, and flow returns to act 1204 to continue writing data stripes. If the answer is yes, there is a geometry change in the storage system, and flow proceeds to decision act 1208.

In decision act 1208, it is determined whether the geometry change meets criteria for modifying the elastic groups. For example, these criteria may be rules for elastic groups based on the maximum number and placement of blades relative to the chassis. If the answer is no, the geometry change does not meet the criteria, and flow returns to act 1204 to continue writing data stripes. If the answer is yes, the geometry change meets the criteria, and flow proceeds to act 1210. In act 1210, blades are reassigned from the previous version of the elastic groups to the current version of the elastic groups. In various embodiments, the number of elastic groups and the number and arrangement of blades in each elastic group depend on the situation and the rules. The storage system propagates the assignments to the blades and retains version information.

In act 1212, data is transferred from the previous version of the elastic groups to the current version of the elastic groups. In various embodiments, the data transfer is coordinated by garbage collection with the assistance of authorities, which own and manage access to data in memory (see FIG. 13), or by a process separate from garbage collection. In decision act 1214, it is determined whether all data has been transferred from the previous version of the elastic groups to the current version. If the answer is no, not all data has been transferred to the current version of the elastic groups, and flow returns to act 1212 to continue transferring data. If the answer is yes, all data has been transferred to the current version of the elastic groups, and flow continues to act 1216. In act 1216, the previous version of the elastic groups is retired. No new data is written to any previous version of the elastic groups, and in some embodiments all references to the previous version may be deleted. In some embodiments, once the previous version is retired, no data can be read from it, because all data has been transferred to the current version of the elastic groups.
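Acts 1208 through 1216 can be pictured with the following minimal Python sketch. The class `ElasticGroupVersion`, the function `apply_geometry_change`, and the in-memory list of stripes are illustrative assumptions. In the described system the transfer is carried out by garbage collection or a separate process, not by a simple loop.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Iterable, List


@dataclass
class ElasticGroupVersion:
    """One version of an elastic group (hypothetical model)."""
    version: int
    blades: FrozenSet[str]
    stripes: List[str] = field(default_factory=list)
    retired: bool = False


def apply_geometry_change(previous: ElasticGroupVersion,
                          new_blades: Iterable[str],
                          meets_criteria: bool) -> ElasticGroupVersion:
    """Reassign blades, transfer data, then retire the previous version."""
    if not meets_criteria:
        # Act 1208 answered "no": keep writing to the existing version.
        return previous
    # Act 1210: blades are assigned to a new, current version.
    current = ElasticGroupVersion(previous.version + 1, frozenset(new_blades))
    # Acts 1212 and 1214: transfer until nothing remains in the previous version.
    while previous.stripes:
        current.stripes.append(previous.stripes.pop())
    # Act 1216: the previous version is retired.
    previous.retired = True
    return current
```

The version number retained on each object stands in for the version information that the storage system keeps when it propagates assignments to the blades.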

FIG. 13 is a flow chart of a method of operating a storage system with garbage collection in elastic groups. The method may be practiced by the storage systems described herein and variations thereof, more specifically by a processor in a storage system. In various embodiments, the method may be practiced together with the method of FIG. 12. In decision act 1302, it is determined whether a current version of the elastic groups has been formed. If the answer is no, no new elastic group exists, and flow loops back to decision act 1302 (or, alternatively, branches elsewhere to perform other tasks). If the answer is yes, a current version of the elastic groups exists, for example because a change in the geometry of the storage system has occurred or been detected and the placement of the blades and chassis meets the criteria (see act 1208 of FIG. 12), and flow proceeds to act 1304.

In act 1304, garbage collection coordinates recovery of memory and scanning and relocation of data through attachment mapping with the assistance of authorities in the storage system. The authorities execute in the blades, manage metadata, and own and access specified ranges of user data in the memory of the storage system through addressing schemes, error coding, and error correction. In decision act 1306, it is determined whether live data in the previous version of the elastic groups spans two or more elastic groups of the current version. FIG. 7 shows an example of a data stripe that spans two elastic groups after a geometry change to the storage cluster results in a change to the elastic groups. If the answer is no, no live data in the previous version of the elastic groups spans two or more elastic groups of the current version, and flow proceeds to decision act 1312. If the answer is yes, live data in the previous version of the elastic groups spans two or more elastic groups of the current version, and flow proceeds to act 1308.

In act 1308, the live data is relocated from the previous version of the elastic groups to the current version. Relocation may include reading the live data from the previous version of the elastic group, copying it, writing it to the current version of the elastic group, and then marking the data in the previous version as obsolete so that garbage collection can reclaim the memory locations. In act 1310, garbage collection erases and recovers the memory holding the obsolete data. For example, a solid-state memory block that holds only obsolete data, or obsolete data combined with locations in the erased state, may be erased and then made available for writing as recovered memory.

In decision act 1312, it is determined whether live data in the current version of the elastic groups occupies a location targeted for memory recovery by garbage collection. If the answer is no, no such data exists, and flow branches back to act 1304, where garbage collection continues to coordinate memory recovery with the scanning and relocation of data. If the answer is yes, live data in the current version of the elastic groups occupies a location targeted for memory recovery, and flow proceeds to act 1314. In act 1314, the live data is relocated elsewhere in the current version of the elastic group. As above, relocation includes reading, copying, and writing the live data to a new location in the current version of the elastic group and discarding the live data at the earlier location, but this time the relocation stays within the same elastic group, because the elastic group in which the live data was found is the current version. In act 1316, garbage collection erases and recovers the memory holding the obsolete data.
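The two decisions of FIG. 13 (acts 1306 and 1312) can be summarized in a small Python sketch. The function name `gc_action`, the returned labels, and the modeling of a stripe as the set of blades it occupies are assumptions for illustration only.

```python
from typing import Iterable, List, Set


def gc_action(stripe_blades: Iterable[str],
              current_groups: List[Set[str]],
              in_reclaim_target: bool) -> str:
    """Decide what garbage collection does with one stripe of live data."""
    stripe = set(stripe_blades)
    groups_touched = [g for g in current_groups if stripe & g]
    if len(groups_touched) >= 2:
        # Act 1306 "yes": the stripe spans two or more current elastic
        # groups, so it is rewritten into a current group (act 1308).
        return "relocate_to_current_group"
    if in_reclaim_target:
        # Act 1312 "yes": the data sits where memory is to be recovered,
        # so it moves elsewhere within the same group (act 1314).
        return "relocate_within_group"
    return "leave_in_place"
```

In either relocation case the old copy becomes obsolete, and the memory that held it can then be erased and recovered.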

It should be appreciated that the methods described herein may be performed with a digital processing system, such as a conventional general-purpose computer system. Alternatively, a special-purpose computer designed or programmed to perform only one function may be used. FIG. 14 is an illustration showing an exemplary computing device in which embodiments described herein may be implemented. According to some embodiments, the computing device of FIG. 14 may be used to perform embodiments of the functionality for elastic groups and garbage collection. The computing device includes a central processing unit (CPU) 1401, which is coupled through a bus 1405 to a memory 1403 and a mass storage device 1407. Mass storage device 1407 represents a persistent data storage device, such as a floppy disk drive or a fixed disk drive, which may be local or remote in some embodiments. Memory 1403 may include read-only memory, random access memory, and the like. In some embodiments, applications resident on the computing device may be stored on or accessed via a computer-readable medium, such as memory 1403 or mass storage device 1407. Applications may also take the form of modulated electronic signals accessed via a network modem or other network interface of the computing device. It will be appreciated that in some embodiments, the CPU 1401 may be embodied in a general-purpose processor, a special-purpose processor, or a specially programmed logic device.

The display 1411 communicates with the CPU 1401, memory 1403 and mass storage device 1407 over the bus 1405. The display 1411 is configured to display any visualization tools or reports associated with the systems described herein. An input/output device 1409 is coupled to the bus 1405 to communicate information in command selections to the CPU 1401. It should be appreciated that data to and from external devices may be communicated through the input/output device 1409. The CPU 1401 may be defined to execute the functionality described herein, enabling the functionality described with reference to FIGS. 1A to 13. In some embodiments, code embodying such functionality may be stored in memory 1403 or mass storage device 1407 for execution by a processor such as CPU 1401. The operating system on the computing device may be MS-DOS™, MS-WINDOWS™, OS/2™, UNIX™, LINUX™, or another known operating system. It should be appreciated that the embodiments described herein may also be integrated with virtualized computing systems implemented with physical computing resources.

Fig. 15 depicts the formation of elastic groups 1508, 1510, 1512 from blades 252 having different amounts 1504, 1506 of memory 604. This may occur when there are different types of blades 252 in the storage system, such as when the system is upgraded, a failed blade is physically replaced with a newer blade 252 with more memory 604, a spare blade is brought online, or a new chassis is connected to create or extend a multi-chassis system. If there are enough blades of each type to form multiple elastic groups, the computing resources 1502 of the system may form one elastic group 1508 with one type of blade 252 (e.g., 15 blades each having 8TB of storage capacity) and another elastic group 1510 with another type of blade 252 (e.g., 7 blades each having 52TB of storage capacity). As described above, each elastic group supports data recovery, such as by erasure coding, in the event that two blades 252 in the elastic group fail or, in a variation, some other predetermined number of blades 252 in the elastic group fail.

Alternatively, the computing resources 1502 of the system may form one elastic group 1512 with both types of blades. Longer data segments 1518 are formed using the full amount 1504 of memory 604 of one type of blade (e.g., all 8TB of memory in each blade of a first subset of blades of the storage system) together with the same amount of memory from each blade of a second subset of blades (e.g., 8TB from each blade 252 that has 52TB of memory). The remaining amount 1516 of memory in the second subset of blades is used to form shorter data segments 1520, e.g., the 44TB that remains when the 8TB amount 1504 used in the longer data segments 1518 is subtracted from the blade's total of 52TB. Because the longer data segments 1518 have a higher storage efficiency (i.e., more stored user data relative to redundancy overhead) than the shorter data segments 1520, this formation of the data segments 1518, 1520 has a higher overall storage efficiency than can be achieved with the two elastic groups 1508, 1510. Some embodiments may prefer one arrangement or the other, and some embodiments may be able to migrate data from two elastic groups to one elastic group, or vice versa, or otherwise reorganize blades and elastic groups in the above or other cases. Variations of the scenario depicted in fig. 15 are readily devised, with other amounts of memory, elastic groups, blades, formations and allocations of data segments in elastic groups, and single-chassis or multi-chassis systems. It should be appreciated that the authorities 168 participate in the formation of data segments and in the reading and writing of data in the data segments, as discussed above.
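The efficiency comparison can be checked with simple arithmetic, sketched below in Python using the example figures above (15 blades of 8TB and 7 blades of 52TB). The sketch rests on two simplifying assumptions that the description does not state: every data segment spans all blades that contribute to it, and every segment carries two blades' worth of redundancy (N+2). The helper `usable_tb` is hypothetical.

```python
def usable_tb(blades: int, tb_per_blade: int, redundancy: int = 2) -> int:
    """User-data capacity when segments span `blades` blades and lose
    `redundancy` blades' worth of space to redundancy (assumed N+2)."""
    return (blades - redundancy) * tb_per_blade


raw_tb = 15 * 8 + 7 * 52  # 484 TB of raw memory in total

# Two elastic groups 1508, 1510: segments of width 15 and width 7.
two_groups_tb = usable_tb(15, 8) + usable_tb(7, 52)  # 104 + 260 = 364

# One elastic group 1512: longer segments across all 22 blades (8 TB each),
# shorter segments across the 7 larger blades (remaining 44 TB each).
one_group_tb = usable_tb(22, 8) + usable_tb(7, 44)  # 160 + 220 = 380

print(f"two groups: {two_groups_tb} TB usable of {raw_tb} TB raw")
print(f"one group:  {one_group_tb} TB usable of {raw_tb} TB raw")
```

Under these assumptions the single mixed group stores 380TB of user data against 364TB for the two separate groups, because the 8TB portions of the larger blades take part in wider segments with proportionally less redundancy overhead.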

Fig. 16 depicts a conservative estimate of the amount of memory space available in an elastic group 1602. The computing resources 1502 of the storage system track memory usage in the blades 252 and determine which blade 252 in the elastic group 1602 is the most full or, equivalently, the least empty. Full memory space, i.e., space that has been written and not erased, whether it holds live data or invalid, obsolete data that has not yet been erased, is indicated by cross-hatching in fig. 16. Empty space, i.e., memory space that has been erased and is available for writing, is shown in fig. 16 without cross-hatching. In some embodiments, the amount of empty, available space 1604 in the most full (least empty) blade of the elastic group 1602 is multiplied by the number of blades in the elastic group 1602 to produce a conservative estimate of the memory space available in the elastic group 1602. This represents the smallest possible amount of memory space available for writing, and it should be appreciated that the actual amount of memory space available in the elastic group 1602 may be greater. Nevertheless, such conservative estimates are well suited for use in various embodiments in making decisions for data writing and garbage collection, as shown in fig. 17.
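The estimate described above amounts to one multiplication, shown here as a minimal Python sketch. The function name `conservative_free_space` and the use of a plain list of per-blade free space are illustrative assumptions.

```python
from typing import Sequence


def conservative_free_space(free_per_blade: Sequence[float]) -> float:
    """Free space of the most full blade times the number of blades."""
    if not free_per_blade:
        return 0
    return min(free_per_blade) * len(free_per_blade)
```

For example, for three blades with 3, 5, and 4 units free, the estimate is 3 × 3 = 9 units, although 12 units are actually free across the group.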

Fig. 17 depicts a garbage collection module 1702 having various options related to the elastic groups depicted in figs. 15 and 16. Garbage collection reads live data from the memory 604 and writes live data into the memory 604. It should be appreciated that garbage collection is performed to consolidate blocks of obsolete data for block erasure of solid-state memory and to recover memory so that space is available for writing. In one embodiment, a memory space tracker 1704 tracks the memory space used and the memory space available, and advises a source selector 1706 for the data reads and a destination selector 1708 for the data writes of the garbage collection module 1702. The authorities 168 participate in the data reads, the data writes, and the decisions about which data is live and which is obsolete, and identify candidates for memory erasure. The garbage collection module 1702 may read data from and write data to the same elastic group 1710. Alternatively, the garbage collection module 1702 may read data from one elastic group 1710 and write to a different elastic group 1710. In some embodiments, the elastic group written to may even have a lower storage efficiency than the elastic group read from.

In some embodiments, decisions regarding data writes are handled by a weighted random distribution of the data writes. In some embodiments, the weighting is increased toward elastic groups with a larger conservative estimate of available memory space and decreased toward, but still includes, elastic groups with a smaller conservative estimate of available memory space. With this weighted random distribution of data writes, emptier elastic groups 1710 tend to be written more frequently and thus fill faster than less empty elastic groups 1710, which balances memory usage and available space across the blades 252 of the storage system. In some embodiments, a weighted random distribution of data writes is also applied to data ingestion by the storage system.
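One way to realize such a weighted random distribution is sketched below in Python, using the standard library's `random.choices`. The function name `choose_write_group` and the choice of the available-space estimate itself as the weight are assumptions. The description does not specify how the weights are derived.

```python
import random
from typing import Dict, Optional


def choose_write_group(free_estimates: Dict[str, float],
                       rng: Optional[random.Random] = None) -> str:
    """Pick a destination elastic group, weighted by estimated free space.

    Groups with less free space are chosen less often but are still
    eligible, as long as their estimate is above zero.
    """
    rng = rng or random.Random()
    names = list(free_estimates)
    weights = [free_estimates[name] for name in names]
    if sum(weights) <= 0:
        raise ValueError("no elastic group has available space")
    return rng.choices(names, weights=weights, k=1)[0]
```

Over many writes, a group estimated at 90 units free receives roughly nine times as many writes as a group estimated at 10 units free, so the emptier group fills faster and the groups converge toward balance.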

The following are options for the garbage collection module 1702 in various embodiments. Some versions select from two or more of these options, and others pursue a single option. Garbage collection may be performed to and from the same elastic group. Garbage collection may be performed to and from different elastic groups, i.e., reading from one elastic group and writing to another elastic group. Garbage collection may use a weighted random distribution for data writes, as described above. Garbage collection may be performed in the elastic group with the lowest available space, i.e., reading and writing in that elastic group. Garbage collection may read from the elastic group with the lowest available space and either write to that same elastic group or write using a weighted random distribution weighted toward the elastic group with the largest available space. The estimation of the available space may be performed as described above with reference to fig. 16, using the blade with the lowest amount of memory available for writing to produce a conservative estimate of the memory space available in the elastic group, or may be based on more detailed tracking of the space available in each blade. These estimates of the space available in the elastic groups may be used to determine the elastic group with the lowest available space and the elastic group with the largest available space. These determinations may then be used in weighted random distribution decisions for garbage collection and/or data writes. Garbage collection may read from elastic groups with higher storage efficiency and write to elastic groups with lower storage efficiency. Other options and combinations of the features described above are readily devised in accordance with the teachings herein.

FIG. 18 is a flow chart of a method of operating a storage system to form elastic groups. The method may be performed by the various embodiments of the storage systems described herein and variations thereof, as well as by other storage systems having memory and redundancy. In act 1802, the amount of memory on each blade is determined. This is the total amount of memory, whether erased or not, and not just the amount of memory available for writing. In decision act 1804, it is determined whether the blades have differing amounts of memory. If the answer is yes, the blades have differing amounts of memory, and flow proceeds to act 1806. If the answer is no, the blades do not have differing amounts of memory, i.e., all have the same amount of memory, and flow continues to act 1808.

In act 1806, elastic groups are formed based on different amounts of memory in blades of the storage system. For example, one or more elastic groups may be formed from blades having the same amount of memory. One or more elastic groups may be formed from blades having different amounts of memory, wherein data segments of different lengths are formed within the elastic groups to maximize the use and storage efficiency of the different amounts of memory.

In act 1808, elastic groups are formed based on the same amount of memory in the blades of the storage system. For example, each elastic group may have a threshold number of blades or more. The elastic groups do not overlap, i.e., a blade may belong to only one elastic group. Each elastic group so formed supports data recovery in the event of failure of two blades or, in other embodiments, of another specific number of blades (e.g., one or three). Extensions of the above method for performing garbage collection among elastic groups are readily devised, as described above with reference to fig. 17.
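A simplified reading of acts 1802 through 1808 is sketched below in Python. The function `form_elastic_groups`, the `min_blades` threshold parameter, and the fallback to a single mixed group when some blade type is too scarce are illustrative assumptions. Actual embodiments may apply other rules, such as chassis placement or a maximum group size.

```python
from collections import defaultdict
from typing import Dict, List


def form_elastic_groups(blade_memory_tb: Dict[str, int],
                        min_blades: int) -> List[List[str]]:
    """Group blades by total memory, or fall back to one mixed group."""
    by_amount: Dict[int, List[str]] = defaultdict(list)
    for blade, tb in sorted(blade_memory_tb.items()):
        by_amount[tb].append(blade)  # act 1802: total memory per blade
    if all(len(blades) >= min_blades for blades in by_amount.values()):
        # Enough blades of every type: one non-overlapping group per type.
        return [blades for _, blades in sorted(by_amount.items())]
    # Otherwise a single group holding blades with differing memory amounts,
    # within which data segments of different lengths would be formed.
    return [sorted(blade_memory_tb)]
```

Each blade appears in exactly one returned group, which reflects the rule that elastic groups do not overlap.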

FIG. 19 depicts a storage system having one or more computing resource elastic groups defined within a computing region and one or more storage resource elastic groups defined within a storage region, according to an embodiment. This storage system embodiment and its variations extend the mechanism of elastic groups from groups of blades 252 (see figs. 2E-2G, 4-7, 11, 12, 15, 16) to a finer granularity of membership in elastic groups, and decouple computing resources from storage resources with respect to system and data survivability in the event of a failure. This flexibility in defining elastic groups, and the membership of the various system resources in them, in turn facilitates scalability of the storage system. The computing resources may be expanded independently of the storage resources and vice versa, and system reliability and survivability under fault conditions may be addressed by reconfiguring these finer-grained elastic groups. For example, storage resources may be expanded by adding or upgrading modular storage devices 1922, with the storage resource elastic groups 1912 redefined separately from the computing resource elastic groups 1908. Likewise, computing resources may be expanded by upgrading or adding blades 1904 (while leaving the modular storage devices 1922 intact), with the computing resource elastic groups 1908 redefined independently of the storage resource elastic groups 1912. In other embodiments, expanding both computing resources and storage resources may trigger reconfiguration of both the computing resource elastic groups 1908 and the storage resource elastic groups 1912 to maintain system reliability and survivability at a desired level. Some embodiments perform group-level space and load balancing and may have different elastic groups related to group-level space and load. Some embodiments decouple computing and storage by means of removable drives and define elastic groups related to computing availability and storage/data durability.
In some embodiments, the elastic groups determine whether the cluster is available when some blades or devices are down, and determine whether this affects storage/data durability, given that removable drives allow the data to be rescued.

Continuing with the embodiment in FIG. 19, a storage system 1900 with multiple blades 1904 is illustrated. The storage system may occupy a single chassis or span multiple chassis, and may be heterogeneous or homogeneous in, e.g., the capacity and type of flash memory. For example, each blade 1904 may be hybrid computing and storage, storage-only, or computing-only, where the combination of blades 1904 in a particular storage system 1900 implementation has, in total, the computing resources and storage resources required for distributed data storage and supports heterogeneity or homogeneity of each of the various resources. In this embodiment, each blade may have zero or more modular storage devices 1922; one blade is illustrated as having two modular storage devices 1922, while another blade may have four modular storage devices 1922, and so on. In some embodiments, the modular storage devices 1922 are pluggable and may be hot-pluggable; in other embodiments, they may be fixed in the blade 1904. In various embodiments, the modular storage devices 1922 may be homogeneous or heterogeneous within a given blade 1904 and homogeneous or heterogeneous across the blades 1904 of the storage system 1900. The solid state memory 1930 may be homogeneous or heterogeneous in terms of memory type and/or amount of memory (i.e., storage capacity) within a given modular storage device 1922, between modular storage devices 1922 on a blade 1904, and/or across blades 1904 in the storage system 1900.

The storage system 1900 defines a computing region 1906 spanning the blades 1904, within which one or more computing resource elastic groups 1908 are defined; these are both configurable and reconfigurable. The computing region 1906 generally contains the resources for computing in the storage system 1900. This includes the computation that forms data stripes for writing (see fig. 20) and for reading. Computing resources may also be used for operating systems, applications, and the like. Each computing resource elastic group 1908 includes various computing resources from multiple blades 1904, such as CPUs 1914, processor local memory 1916, and communication modules 1918 for connecting and communicating with the modular storage devices 1922 over the network 1920. In other embodiments, such resources may be subdivided into other elastic groups, as is readily devised in accordance with the teachings herein. For example, the CPUs 1914, processor local memory 1916, communication modules 1918, authorities 168 (see FIGS. 2B, 2E-2G, 4), and/or other types of resources may be included as groups of resources in an elastic group, or subdivided into different regions and different elastic groups. A computing resource elastic group 1908 does not necessarily span all of the blades 1904 in the storage system 1900; rather, one or more computing resource elastic groups may be defined across groups of blades 1904. The storage system 1900 supports various configurations of elastic groups.

The storage system 1900 defines a storage region 1910 spanning the modular storage devices 1922, within which one or more storage resource elastic groups 1912 are defined; these are both configurable and reconfigurable. The storage region 1910 generally includes the resources for data and metadata storage in the storage system. This includes the storage memory in which data stripes (e.g., RAID stripes) are written (see FIG. 20) and from which they are read, such as the solid state memory 1930 in various embodiments. Each storage resource elastic group 1912 includes various storage resources from a plurality of modular storage devices 1922. In some embodiments, the storage region 1910 is subdivided into multiple regions of various granularities, and storage resource elastic groups 1912 may be formed in the storage region 1910 or in a region within the storage region 1910. In the embodiment depicted in fig. 19, there is an NVRAM region 1932 that includes the NVRAM 1928 in the modular storage devices 1922, a memory region 1934 that includes the solid state memory 1930 of the modular storage devices 1922, a storage staging region 1936 that has a subset of the solid state memory 1930 of the modular storage devices 1922, and a long-term storage region 1938 that has another subset of the solid state memory 1930 of the modular storage devices 1922. For other embodiments, various combinations or other granularities for defining regions and elastic groups in the storage region 1910, or in a region within the storage region 1910, are readily devised in accordance with the teachings herein.

For example, in other embodiments, NVRAM 1928 may be located on the blades 1904 instead of, or in addition to, the NVRAM 1928 in the modular storage devices 1922. The controllers 1926 of the modular storage devices 1922 may be in a controller region within the storage region 1910. The communication modules 1924 of the modular storage devices 1922 may be in a communication resource elastic group or a communication resource region within the storage region 1910. The memory in the storage staging region 1936 may be, for example, single-level cell (SLC) flash memory for fast staging when a data stripe is written, after which the data of the data stripe is moved to long-term memory in the long-term storage region 1938, which may be, for example, multi-level cell memory such as quad-level cell (QLC) flash memory, which has a longer write time and a higher data bit density. Such subdivision of the storage region 1910, and the establishment of elastic groups in the subdivisions of the storage region 1910, may be used for background processing or post-processing of stored data, including deduplication, compression, encryption, and the like, with each elastic group having a defined level of redundancy for failure survivability. The flexibility in the granularity and membership of the various resources in the various defined elastic groups in the various regions gives greater system flexibility, particularly for upgrades and expansion, in achieving a desired level of system and data reliability and survivability of resource loss, as further described below with reference to fig. 20.

FIG. 20 depicts an embodiment of a storage system that forms data stripes and writes data stripes using resources in elastic groups. In this example, the storage system defines a plurality of computing resource elastic groups 1908 in the computing region 1906 and a plurality of storage resource elastic groups 1912 in the storage region 1910. The computing resources in each computing resource elastic group 1908 perform action 2002 to form data stripes. The storage resources in each storage resource elastic group 1912 perform action 2004 to write the data stripes. Each data stripe is transferred from the computing region 1906, where it is formed, to the storage region 1910, where it is written. This broad description of storage system operation is readily implemented in a variety of storage systems, with the additional improvement that the storage system defines computing resource elastic groups 1908 and storage resource elastic groups 1912 and supports various configurations of elastic groups, as described herein.

In various operating scenarios the storage system experiences failures, and it is here that the benefits of using various elastic groups in various embodiments become apparent. The failure of one or more components (i.e., system resources) is summarized in the depiction in FIG. 20 as action 2006, in which a survivable number of members of an elastic group is lost. Each elastic group has a defined level of redundancy 2010, such that a specified number of members of the elastic group may fail, be removed, or be destroyed, and the storage system can still perform action 2008 and continue operating, because the data is recoverable and the elastic group is reconfigurable. For example, an elastic group defined as having one redundant member may be designed for N+1 redundancy with N+1 members, and can continue to operate and recover data after losing one member, leaving N members. Similarly, an elastic group defined as having two redundant members may be designed for N+2 redundancy with N+2 members, and can continue to operate and recover data after losing two members to failure, leaving N members. Likewise, an elastic group defined as having three redundant members may be designed for N+3 redundancy with N+3 members, and can continue to operate and recover data after losing three members, leaving N members. In some embodiments, different elastic groups may have different defined levels of redundancy. Some embodiments support N+4, N+5, or even higher redundancy, e.g., for segment encoding. Some embodiments define elastic groups for temporary storage and different elastic groups for long-term storage, and in various embodiments these elastic groups may have the same or different levels of redundancy.
It is important to recognize that the failure of a particular component may affect one elastic group or more than one elastic group, and that this example of system survivability is illustrative and not meant to be limiting.
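The survivability condition can be stated compactly: an elastic group designed for N+R redundancy tolerates the loss of up to R of its members. The following Python sketch models this. The class `ElasticGroup`, its fields, and the function `system_survives` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class ElasticGroup:
    """An elastic group with N+R redundancy, R being `redundancy`."""
    name: str
    members: Set[str]
    redundancy: int

    def survives(self, failed: Iterable[str]) -> bool:
        """True if no more than `redundancy` members of this group are lost."""
        return len(self.members & set(failed)) <= self.redundancy


def system_survives(groups: List[ElasticGroup], failed: Iterable[str]) -> bool:
    """Data stays recoverable only if every elastic group survives."""
    failed = set(failed)
    return all(group.survives(failed) for group in groups)
```

For instance, with an N+1 group for temporary storage and an N+2 group for long-term storage, the loss of one member of the first and two members of the second is survivable, whereas the loss of two members of the N+1 group is not.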

In one illustrative example, a modular storage device 1922 fails while the other modular storage devices 1922 and the blades 1904 remain operational. In this example, the failed modular storage device 1922 is counted as a lost member of its storage resource elastic group 1912, but the data can still be recovered from the remaining resources of that storage resource elastic group 1912, because there is sufficient redundancy of members or resources. Other storage resource elastic groups 1912 having other modular storage devices 1922 as members are not affected by the failure. It should be appreciated that the computing resource elastic groups 1908 are not affected in such a scenario.

In the event of a failure of the entire blade 1904, and depending on the nature of the failure, this may or may not affect the modular storage devices 1922 on the blade 1904 in various scenarios. This, in turn, may affect the computing resource elastic group 1908 having the failed blade, or more specifically the blade's computing resources as a member of the computing resource elastic group 1908, and may or may not affect the storage resource elastic group 1912 having the modular storage 1922 installed on the blade 1904 as a member. In such a scenario, other computing resource resiliency groups 1908 having other blades as members will not be affected.
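The two failure scenarios above can be sketched as a lookup over group membership. The sketch below is a minimal illustration; the group names, resource identifiers, membership tables, and the `devices_lost` flag are all hypothetical, chosen only to show that a failure touches only the elastic groups that count the failed resource as a member.

```python
from typing import Dict, Set

# Hypothetical membership tables: group name -> member resource ids.
compute_groups: Dict[str, Set[str]] = {
    "compute-A": {"blade1", "blade2", "blade3"},
    "compute-B": {"blade4", "blade5", "blade6"},
}
storage_groups: Dict[str, Set[str]] = {
    "storage-A": {"dev1", "dev2", "dev3"},
    "storage-B": {"dev4", "dev5", "dev6"},
}
# Hypothetical placement: modular storage devices installed on a blade.
devices_on_blade: Dict[str, Set[str]] = {"blade1": {"dev1", "dev4"}}


def affected_groups(failed: Set[str]) -> Set[str]:
    """Return the names of groups that have at least one failed member."""
    all_groups = {**compute_groups, **storage_groups}
    return {name for name, members in all_groups.items() if members & failed}


def blade_failure(blade: str, devices_lost: bool) -> Set[str]:
    """Resources lost when a blade fails; its devices may or may not be lost."""
    lost = {blade}
    if devices_lost:
        lost |= devices_on_blade.get(blade, set())
    return lost


device_only = affected_groups({"dev2"})
blade_only = affected_groups(blade_failure("blade1", devices_lost=False))
blade_and_devices = affected_groups(blade_failure("blade1", devices_lost=True))
```

Here a single device failure touches one storage resource group only, while a blade failure touches its computing resource group and, depending on the nature of the failure, the storage resource groups of the devices installed on that blade.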

Various other failures and failure granularities of individual components within a memory device or blade, as well as other scenarios of various granularities and compositions of resources, elastic groups and regions, are readily explored. From the description herein, it is readily understood how the flexibility of the elastic group definition, independently or in conjunction with one another, benefits the scalability of computing and storage resources to maintain a desired level of system and data reliability and recoverability in the event of a failure. For example, storage system 1900 may detect a failure of a single processor, a single NVRAM 1928, a single communication path, a single portion of some type of memory, etc., a failure of a larger component, or even a failure of a survivable number of components in each of a plurality of elastic groups, and then recover the data, followed by reconfiguration of one or more elastic groups with appropriate resource content and granularity. The storage system with the reconfigured elastic groups may then continue to operate at the desired level of system and data reliability and recoverability.

FIG. 21 is a flowchart of a method that may be practiced by a storage system using resources in elastic groups, according to an embodiment. The method may be embodied in a tangible computer-readable storage medium. In various embodiments, the method may be practiced by a processing device of a storage system. In act 2102, the storage system establishes elastic groups, including one or more computing resource elastic groups and one or more storage resource elastic groups. Each elastic group has a defined level of resource redundancy for the storage system. Examples of various types of elastic groups, various levels of resource redundancy for elastic groups, and heterogeneous and homogeneous resource support for elastic groups are described above, and variations thereof are readily devised.

In act 2104, the storage system supports the ability to have configurations with multiples of each type of elastic group. For example, the storage system may be configured with a single one of each of multiple types of elastic groups, with multiples of one type of elastic group and a single one of other types of elastic groups, with multiples of multiple types of elastic groups, and so on. It should be appreciated that at any given point in operation the storage system has a particular configuration of elastic groups, but can be reconfigured to other configurations of elastic groups, e.g., in response to a failure, a change in membership of blades or modular storage devices, or an upgrade, expansion, or extension of the storage system.
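As a minimal sketch of the configurations described for act 2104, a configuration can be represented as a count of elastic groups per type. The type labels `compute`, `storage`, and `nvram`, and the function name `is_valid_configuration`, are hypothetical; the only rule encoded is the one stated in this disclosure, namely that the elastic groups include at least one computing resource elastic group and at least one storage resource elastic group.

```python
from collections import Counter

REQUIRED_TYPES = ("compute", "storage")  # hypothetical type labels


def is_valid_configuration(groups: Counter) -> bool:
    """A configuration needs at least one group of each required type."""
    # Counter returns 0 for a type that is absent from the configuration.
    return all(groups[t] >= 1 for t in REQUIRED_TYPES)


single_of_each = Counter(compute=1, storage=1)
multiples = Counter(compute=2, storage=3, nvram=1)
storage_only = Counter(storage=4)
```

Reconfiguration, in this representation, amounts to replacing one such count with another that still satisfies the same check.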

In act 2106, the storage system performs distributed data and metadata storage across the modular storage devices according to the resiliency groups. Depending on the particular configuration, the system may form a data stripe using the computing resources of the elastic group, or do so in parallel in more than one elastic group of computing resources. The system may write the stripe of data using the storage resources of the elastic group or do so in parallel in more than one storage resource elastic group. In act 2108, a determination is made as to whether one or more members of the elastic group have failed. If the answer is no, then no fault exists and flow branches back to act 2106 to continue to perform distributed data and metadata storage. If the answer is yes, then one or more members of the elastic group have failed and flow proceeds to act 2110.

In act 2110, the storage system recovers the data. Because the elastic group experiencing the failure of one or more members is configured with a defined level of resource redundancy, data can be recovered using the remaining resources of the elastic group (and other system resources as appropriate for the particular embodiment). For example, an elastic group with an N+R configuration can tolerate up to R failures and still recover data. In act 2112, the system determines whether one or more of the failed members have been restored. For example, members of the elastic group may have gone offline or stopped communicating and then come back online, or restarted, and resumed communication. As another example, a blade containing multiple drives may be physically removed and then reinserted into the chassis, resulting in recovered storage system operation. If the failed members are restored, flow branches back to act 2106 to continue performing distributed data and metadata storage. If the answer is no, then one or more of the failed members have not been restored, and flow branches to act 2114.
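The stripe formation of act 2106 and the recovery of act 2110 can be illustrated with a toy N+1 example. The disclosure does not specify the erasure code; the single XOR parity used below is a substituted, well-known technique chosen for brevity, and the function names `form_stripe` and `recover` are hypothetical.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length shards."""
    return bytes(x ^ y for x, y in zip(a, b))


def form_stripe(shards: List[bytes]) -> List[bytes]:
    """Append one XOR parity shard to N equal-length data shards (N+1)."""
    parity = reduce(xor_bytes, shards)
    return shards + [parity]


def recover(stripe: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild the single missing shard (marked None) from the survivors."""
    missing = [i for i, s in enumerate(stripe) if s is None]
    if len(missing) > 1:
        raise ValueError("N+1 parity tolerates only one lost shard")
    if missing:
        survivors = [s for s in stripe if s is not None]
        stripe = list(stripe)
        # XOR of all surviving shards equals the missing shard.
        stripe[missing[0]] = reduce(xor_bytes, survivors)
    return stripe[:-1]  # return the data shards only


stripe = form_stripe([b"ab", b"cd", b"ef"])
damaged = [stripe[0], None, stripe[2], stripe[3]]  # one member lost
restored = recover(damaged)
```

Higher redundancy levels such as N+2 or N+3 would require a code with more parity shards than this single-parity sketch provides.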

In act 2114, the storage system reconfigures one or more elastic groups. Which elastic groups are reconfigured, and the new configuration, will be case and system specific, and the system supports the capability for such configurations before, during, and after reconfiguration. After the elastic groups are reconfigured, flow returns to act 2106 to continue performing distributed data and metadata storage according to the reconfigured elastic groups.
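The branching of the flow described for FIG. 21 can be summarized as a small transition function. This is a sketch of the flow as described in the text, assuming that acts 2102, 2104, and 2106 proceed in sequence; the function names `next_act` and `trace` are hypothetical, and the acts themselves are represented only by their reference numerals.

```python
from typing import List


def next_act(act: int, member_failed: bool = False,
             member_restored: bool = False) -> int:
    """Return the act that follows `act` in the flow as described."""
    if act == 2102:
        return 2104  # groups established, then configurations supported
    if act == 2104:
        return 2106  # begin distributed data and metadata storage
    if act == 2106:
        return 2108  # check whether any member has failed
    if act == 2108:
        return 2110 if member_failed else 2106
    if act == 2110:
        return 2112  # after recovery, check whether members returned
    if act == 2112:
        return 2106 if member_restored else 2114
    if act == 2114:
        return 2106  # resume storage with the reconfigured groups
    raise ValueError(f"unknown act {act}")


def trace(member_restored: bool) -> List[int]:
    """Walk the flow from act 2102 through a single failure and back to 2106."""
    acts, act = [2102], 2102
    failure_pending = True
    while True:
        act = next_act(act, member_failed=failure_pending,
                       member_restored=member_restored)
        acts.append(act)
        if act == 2110:
            failure_pending = False  # the failure has been handled
        if act == 2106 and not failure_pending:
            return acts
```

Calling `trace(False)` follows the path through reconfiguration in act 2114, while `trace(True)` returns to act 2106 directly from act 2112.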

It should be appreciated that an elastic group with an N+R configuration that loses R+1 or more members may not be able to recover data. If the elastic group loses R or fewer members, the system may continue to operate, and data may be reconstructed (e.g., onto other drives) to regain more than N members, so that the group can withstand a further failure that would otherwise drop the write group below N. The reconfigured elastic group supports the desired N+R or other configuration and maintains system viability and data recovery capabilities.
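The rebuild condition above can be expressed as a short calculation. The sketch assumes, for illustration only, that the rebuild target is the full N+R membership; the disclosure says only that more than N members are regained. The function name `members_to_rebuild` is hypothetical.

```python
def members_to_rebuild(n: int, r: int, surviving: int) -> int:
    """Members to reconstruct so the group again has N+R members.

    Raises ValueError when fewer than N members survive, i.e. when R+1 or
    more members were lost and the data may not be recoverable.
    """
    if surviving < n:
        raise ValueError("more than R members lost; data may not be recoverable")
    return max(0, n + r - surviving)


after_one_loss = members_to_rebuild(n=8, r=2, surviving=9)
after_two_losses = members_to_rebuild(n=8, r=2, surviving=8)
```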

The advantages and features of the present disclosure may be further described by the following statements:

1. A method, comprising:

establishing, by a storage system having a plurality of blades, a plurality of elastic groups, each elastic group having a defined resource redundancy level of the storage system, wherein the plurality of elastic groups includes at least one elastic group of computing resources having the plurality of blades and at least one elastic group of storage resources having storage resources of a plurality of modular storage devices;

supporting, by the storage system, a capability of a plurality of configurations, each configuration having a plurality of each of the plurality of elastic groups; and

performing, by the plurality of blades, distributed data and metadata storage across the plurality of modular storage devices according to the established plurality of elastic groups.

2. The method of statement 1, wherein performing the distributed data and metadata storage comprises:

forming at least one data stripe using the computing resources in a computing resource elastic group; and

writing the at least one data stripe using the storage resources in a storage resource elastic group.

3. The method according to statement 1, further comprising:

selecting, for each of the plurality of elastic groups, the defined redundancy level from the group consisting of N+1 redundancy, N+2 redundancy, and N+3 redundancy.

4. The method according to statement 1, further comprising:

selecting different defined redundancy levels among the plurality of elastic groups.

5. The method according to statement 1, further comprising:

supporting heterogeneous computing resources in the at least one elastic group of computing resources.

6. The method according to statement 1, further comprising:

supporting heterogeneous storage resources in the at least one storage resource elastic group.

7. The method according to statement 1, further comprising:

supporting different defined redundancy levels among a first elastic group of storage resources supporting a scratch area and a second elastic group of storage resources supporting a long-term storage area.

8. A tangible, non-transitory, computer-readable medium having instructions thereon, which when executed by a processor, cause the processor to perform a method comprising:

9. The computer-readable medium of statement 8, wherein the method further comprises:

10. The computer-readable medium of statement 8, wherein the method further comprises:

11. The computer-readable medium of statement 8, wherein the method further comprises:

defining different redundancy levels among a first elastic group of storage resources of the temporary storage area and a second elastic group of storage resources of the long-term storage area.

12. A storage system, comprising:

a plurality of modular storage devices;

a plurality of blades configurable to establish a plurality of elastic groups, each elastic group having a defined level of resource redundancy;

the plurality of elastic groups includes at least one computing resource elastic group and at least one storage resource elastic group;

the plurality of blades being configurable to support a plurality of configurations, each configuration having a plurality of each of the plurality of elastic groups; and

the plurality of blades being configured to perform distributed data and metadata storage across the plurality of modular storage devices according to the plurality of elastic groups.

13. The storage system of statement 12, wherein the plurality of blades and the plurality of modular storage devices are to:

form a data stripe using resources in the computing resource elastic group; and

write the data stripe using resources in the storage resource elastic group.

14. The storage system of statement 12, wherein in each of the plurality of elastic groups, the defined level of redundancy is related to a failure viability of the elastic group, and is selectable from the group consisting of N+1 redundancy, N+2 redundancy, and N+3 redundancy.

15. The storage system of statement 12, wherein the plurality of blades support configurations having different defined levels of redundancy among the plurality of resilient groups.

16. The storage system of statement 12, wherein the plurality of blades support heterogeneous computing resources in the at least one resilient group of computing resources.

17. The storage system of statement 12, wherein the plurality of blades support heterogeneous storage resources in the at least one storage resource elastic group.

18. The storage system of statement 12, wherein the plurality of blades support different defined levels of redundancy among a first resilient group of storage resources of the staging area and a second resilient group of storage resources of the long-term storage area.

19. The storage system of statement 12, wherein the plurality of blades support different defined levels of redundancy among a first elastic group that includes an authority in the plurality of blades and a second elastic group that includes other authorities in the plurality of blades.

20. The storage system of statement 12, wherein the plurality of elastic groups further comprises at least one elastic group of non-volatile random access memory (NVRAM).

Detailed illustrative embodiments are disclosed herein. However, specific functional details disclosed herein are merely representative for purposes of describing the embodiments. Embodiments may be embodied in many alternate forms and should not be construed as limited to only the embodiments set forth herein.

It will be understood that, although the terms first, second, etc. may be used herein to describe various steps or calculations, these steps or calculations should not be limited by these terms. These terms are only used to distinguish one step or calculation from another. For example, a first calculation may be referred to as a second calculation, and similarly, a second step may be referred to as a first step, without departing from the scope of the present disclosure. As used herein, the term "and/or" and the "/" symbol include any and all combinations of one or more of the associated listed items.

As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises", "comprising", "includes", and/or "including", when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Thus, the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting.

It should also be noted that, in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may, in fact, be executed substantially concurrently or the figures may sometimes be executed in the reverse order, depending upon the functionality/acts involved.

In view of the above embodiments, it should be appreciated that embodiments may employ various computer-implemented operations involving data stored in computer systems. The operations are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared, and otherwise manipulated. Further, the manipulations performed are often referred to in terms, such as producing, identifying, determining, or comparing. Any of the operations described herein that form part of the embodiments are useful machine operations. Embodiments also relate to a device or apparatus for performing these operations. The apparatus may be specially constructed for the required purposes, or the apparatus may be a general-purpose computer selectively activated or configured by a computer program stored in the computer. In particular, various general-purpose machines may be used with computer programs written in accordance with the teachings herein, or it may be more convenient to construct a more specialized apparatus to perform the required operations.

A module, application, layer, agent, or other method operable entity may be implemented as hardware, firmware, or software executed by a processor, or a combination thereof. It should be appreciated that where software-based embodiments are disclosed herein, the software may be embodied in a physical machine such as a controller. For example, the controller may include a first module and a second module. The controller may be configured to perform various actions such as methods, applications, layers, or agents.

Embodiments may also be embodied as computer readable code on a tangible, non-transitory computer readable medium. The computer readable medium is any data storage device that can store data which can thereafter be read by a computer system. Examples of computer readable media include hard disk drives, network attached storage (NAS), read-only memory, random-access memory, CD-ROMs, CD-Rs, CD-RWs, magnetic tapes, and other optical and non-optical data storage devices. The computer readable medium can also be distributed over network coupled computer systems so that the computer readable code is stored and executed in a distributed fashion. The embodiments described herein may be practiced with various computer system configurations, including hand-held devices, tablet computers, microprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like. The embodiments may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a wired or wireless network.

Although the method operations are described in a particular order, it should be understood that other operations may be performed between the described operations, the described operations may be adjusted so that they occur at slightly different times, or the described operations may be distributed in a system that allows processing operations to occur at various time intervals associated with the processing.

In various embodiments, one or more portions of the methods and mechanisms described herein may form part of a cloud computing environment. In such embodiments, the resources may be provided as services over the Internet according to one or more of a variety of models. These models may include infrastructure as a service (IaaS), platform as a service (PaaS), and software as a service (SaaS). In IaaS, the computer infrastructure is presented as a service. In this case, the computing device is typically owned and operated by a service provider. In the PaaS model, software tools and underlying devices used by developers to develop software solutions can be provided as services and hosted by service providers. SaaS typically contains service provider licensed software as an on-demand service. The service provider may host the software, or may deploy the software to the customer within a given period of time. Various combinations of the above models are possible and contemplated.

Various units, circuits, or other components may be described or claimed as "configured to" or "configurable to" perform one or more tasks. In this context, the phrase "configured to" or "configurable to" is used to connote structure by indicating that the units/circuits/components include structure (e.g., circuitry) that performs one or more tasks during operation. Thus, a given unit/circuit/component may be said to be configured to perform a task, or configurable to perform a task, even when the unit/circuit/component is not currently operational (e.g., not on). Units/circuits/components used with the terms "configured to" or "configurable to" include hardware, e.g., circuits, memory storing program instructions executable to perform operations, etc. Reciting that a unit/circuit/component is "configured to" or "configurable to" perform one or more tasks is expressly intended not to invoke 35 U.S.C. 112, sixth paragraph, for that unit/circuit/component. In addition, "configured to" or "configurable to" may include general-purpose structures (e.g., general-purpose circuitry) that are manipulated by software and/or firmware (e.g., an FPGA or a general-purpose processor executing software) to operate in a manner that is capable of performing the tasks in question. "Configured to" may also include adapting a manufacturing process (e.g., a semiconductor manufacturing facility) to manufacture devices (e.g., integrated circuits) that are adapted to implement or perform one or more tasks. "Configurable to" is expressly intended not to apply to blank media, unprogrammed processors or unprogrammed general-purpose computers, or unprogrammed programmable logic devices, programmable gate arrays, or other unprogrammed devices, unless accompanied by a programmed medium that imparts to the unprogrammed device the ability to be configured to perform the disclosed function.

The foregoing description, for purposes of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and its practical application, to thereby enable others skilled in the art to best utilize the invention and various embodiments with various modifications as are suited to the particular use contemplated. The present embodiments are, therefore, to be considered as illustrative and not restrictive, and the invention is not to be limited to the details given herein, but may be modified within the scope and equivalents of the appended claims.

Claims

1. A method, comprising:

performing, by the plurality of blades, distributed data storage across the plurality of modular storage devices according to the established plurality of elastic groups.

2. The method of claim 1, wherein performing the distributed data storage comprises:

3. The method as recited in claim 1, further comprising:

4. The method as recited in claim 1, further comprising:

5. The method as recited in claim 1, further comprising:

6. The method as recited in claim 1, further comprising:

7. The method as recited in claim 1, further comprising:

supporting different defined redundancy levels among the first elastic group and the second elastic group.

9. The computer-readable medium of claim 8, wherein the method further comprises:

10. The computer-readable medium of claim 8, wherein the method further comprises:

11. The computer-readable medium of claim 8, wherein the method further comprises:

defining different redundancy levels among the first elastic group and the second elastic group.

12. A storage system, comprising:

a plurality of modular storage devices;

the plurality of blades being configured to perform distributed data storage across the plurality of modular storage devices according to the plurality of elastic groups.

13. The storage system of claim 12, wherein the plurality of blades and the plurality of modular storage devices are to:

form a data stripe using the resources in the computing resource elastic group and write the data stripe using the resources in the storage resource elastic group.

14. The storage system of claim 12, wherein in each of the plurality of elastic groups, the defined level of redundancy is related to a failure viability of the elastic group and is selectable from a group comprising N+1 redundancy, N+2 redundancy, and N+3 redundancy.

15. The storage system of claim 12, wherein the plurality of blades support configurations having different defined levels of redundancy among the plurality of resilient groups.

16. The storage system of claim 12, wherein the plurality of blades support heterogeneous computing resources in the at least one resilient group of computing resources.

17. The storage system of claim 12, wherein the plurality of blades support heterogeneous storage resources in the at least one storage resource elastic group.

18. The storage system of claim 12, wherein the plurality of blades support different defined redundancy levels among a first elastic group and a second elastic group.

19. The storage system of claim 12, wherein the plurality of blades support different defined redundancy levels among a first elastic group comprising authorities in the plurality of blades and a second elastic group comprising other authorities in the plurality of blades.

20. The storage system of claim 12, wherein the plurality of elastic groups further comprises at least one elastic group of non-volatile random access memory (NVRAM).