CN117693731A - Improving parallelism in a partitioned drive storage system using allocation shares - Google Patents


Info

Publication number
CN117693731A
Authority
CN
China
Prior art keywords
storage
storage system
data
system process
open
Prior art date
Legal status
Pending
Application number
CN202280048604.6A
Other languages
Chinese (zh)
Inventor
Ronald Karr
T. W. Brennan
Current Assignee
Pure Storage Inc
Original Assignee
Pure Storage Inc
Priority date
Filing date
Publication date
Priority claimed from U.S. application Ser. No. 17/336,999 (published as US 2021/0286517 A1)
Application filed by Pure Storage Inc filed Critical Pure Storage Inc
Publication of CN117693731A


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/0671In-line storage system
    • G06F3/0683Plurality of storage devices
    • G06F3/0688Non-volatile semiconductor memory arrays
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • G06F3/0613Improving I/O performance in relation to throughput
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629Configuration or reconfiguration of storage systems
    • G06F3/0631Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The storage bandwidth of the storage system process is adjusted in response to input output (I/O) write requests to write data to the partitioned storage devices. The storage bandwidth is adjusted by calculating an allocation share of the storage system process requesting writing of the data and opening a new zone for the storage system process when an open zone usage of the storage system process is determined to be below the allocation share of the storage system process.

Description

Improving parallelism in a partitioned drive storage system using allocation shares
Related application
The present application claims the benefit of U.S. patent application Ser. No. 17/336,999, filed June 2, 2021, which is incorporated herein by reference.
Background
Storage systems, such as enterprise storage systems, may include a centralized or non-centralized repository for data that provides common data management, data protection, and data sharing functions, such as through a connection with a computer system.
Drawings
The present disclosure is illustrated by way of example, and not by way of limitation, and may be more fully understood with reference to the following detailed description when considered in conjunction with the accompanying drawings described below.
FIG. 1A illustrates a first example system for data storage according to some embodiments.
FIG. 1B illustrates a second example system for data storage according to some embodiments.
FIG. 1C illustrates a third example system for data storage according to some embodiments.
FIG. 1D illustrates a fourth example system for data storage according to some embodiments.
FIG. 2A is a perspective view of a storage cluster having multiple storage nodes and internal storage coupled to each storage node to provide network attached storage, according to some embodiments.
Fig. 2B is a block diagram showing an interconnect switch coupling multiple storage nodes, according to some embodiments.
FIG. 2C is a multi-level block diagram showing the contents of a storage node and the contents of one of the non-volatile solid state storage units, according to some embodiments.
FIG. 2D shows a storage server environment using embodiments of storage nodes and storage units of some previous figures, according to some embodiments.
FIG. 2E is a blade hardware block diagram showing a control plane, a compute and store plane, and permissions to interact with underlying physical resources, according to some embodiments.
Fig. 2F depicts a resilient software layer in a blade of a storage cluster, according to some embodiments.
FIG. 2G depicts permissions and storage resources in a blade of a storage cluster according to some embodiments.
Fig. 3A sets forth a diagram of a storage system coupled for data communication with a cloud service provider according to some embodiments of the present disclosure.
Fig. 3B sets forth a diagram of a storage system according to some embodiments of the present disclosure.
Fig. 3C sets forth an example of a cloud-based storage system according to some embodiments of the present disclosure.
FIG. 3D illustrates an exemplary computing device that may be specifically configured to perform one or more of the processes described herein.
Fig. 4 is a flow chart illustrating a method for determining whether to adjust the storage bandwidth of a storage system process according to some embodiments.
Fig. 5 is a flow chart illustrating a method for adjusting storage bandwidth of a storage system process according to some embodiments.
Fig. 6 is a flow diagram illustrating a method for determining an allocation share of a storage system process according to some embodiments.
Fig. 7 is a diagram illustrating parameters for determining an allocation share of a storage system process according to some embodiments.
FIG. 8 is an illustration of an example of a storage system utilizing parameters to determine an allocation share of a storage system process in accordance with an embodiment of the present disclosure.
FIG. 9 is an example method of adjusting storage bandwidth of a storage system process to store data at a partitioned storage device in accordance with an embodiment of the present disclosure.
Detailed Description
A system, such as a storage system, may offload device management responsibilities from a storage drive to a host controller. For example, in some systems, firmware, such as a translation layer or flash translation layer, may reside on or be executed by a storage drive at the drive level. The translation layer may maintain a mapping between logical sector addresses and physical locations. Executing the translation layer at the drive level may result in inefficient use of storage resources and contribute to additional problems, such as write amplification.
In an implementation, the storage system may remove the translation layer from the drive level and perform physical flash address handling operations at the host controller level. Performing physical flash address handling operations at the host controller level presents challenges to designers, such as increasing the parallelism of write processes that write data to the flash-based solid-state storage drives of a storage array.
Aspects of the present disclosure address the above-mentioned and other deficiencies by adjusting, by a host controller of a storage system, the storage bandwidth of a storage system process during runtime in response to an input/output (I/O) write request to write data to the storage system. In an embodiment, the host controller may determine an allocation share of the storage system process requesting to write the data. In response to determining that the open zone usage of the storage system process is below the allocation share of the storage system process, the host controller opens a new zone for the storage system process.
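As a rough illustration of this decision, the sketch below compares a process's open-zone usage against the share of the drive's open-zone limit it has been allocated. All names (Process, maybeOpenZone, openZoneLimit) are hypothetical stand-ins for whatever bookkeeping the host controller actually maintains; this is a minimal sketch, not the patented implementation.

```go
// Minimal sketch of the allocation-share check described above.
package main

import "fmt"

// Process models a storage system process that writes to partitioned drives.
type Process struct {
	Name      string
	OpenZones int     // zones currently held open by this process
	Share     float64 // allocation share, as a fraction of the total open zones allowed
}

// maybeOpenZone opens a new zone for p if its open-zone usage is below its
// allocation share of the drive-wide open-zone limit.
func maybeOpenZone(p *Process, openZoneLimit int) bool {
	allowed := int(p.Share * float64(openZoneLimit))
	if p.OpenZones < allowed {
		p.OpenZones++ // stand-in for issuing a zone-open command to the drive
		return true
	}
	return false // process must reuse an already-open zone instead
}

func main() {
	gc := &Process{Name: "garbage-collection", OpenZones: 2, Share: 0.25}
	fmt.Println(maybeOpenZone(gc, 16)) // 2 < 4, so a new zone is opened
	fmt.Println(gc.OpenZones)
}
```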
With reference to the figures, beginning with FIG. 1A, example methods, apparatus, and articles of manufacture to utilize allocation shares to improve parallelism in a partitioned drive storage system are described in accordance with embodiments of the present disclosure. FIG. 1A illustrates an example system for data storage according to some embodiments. For purposes of illustration and not limitation, system 100 (also referred to herein as a "storage system") includes many elements. It may be noted that system 100 may include the same, more, or fewer elements configured in the same or different ways in other embodiments.
The system 100 includes a plurality of computing devices 164A-B. Computing devices (also referred to herein as "client devices") may include, for example, servers in a data center, workstations, personal computers, notebook computers, or the like. Computing devices 164A-B may be coupled for data communication with one or more storage arrays 102A-B through a storage area network ("SAN") 158 or a local area network ("LAN") 160.
SAN 158 may be implemented with a variety of data communication fabrics, devices, and protocols. For example, the fabric for SAN 158 may include Fibre Channel, Ethernet, InfiniBand, serial attached SCSI ("SAS"), or the like. The data communication protocols used with SAN 158 may include advanced technology attachment ("ATA"), Fibre Channel protocol, small computer system interface ("SCSI"), internet small computer system interface ("iSCSI"), HyperSCSI, non-volatile memory express ("NVMe") over fabrics, or the like. It is noted that SAN 158 is provided for purposes of illustration and not limitation. Other data communication couplings may be implemented between computing devices 164A-B and storage arrays 102A-B.
LAN 160 may also be implemented with various structures, devices, and protocols. For example, the architecture for LAN 160 may include ethernet (802.3), wireless (802.11), or the like. The data communication protocols for LAN 160 may include transmission control protocol ("TCP"), user datagram protocol ("UDP"), internet protocol ("IP"), hypertext transfer protocol ("HTTP"), wireless access protocol ("WAP"), hand-held device transfer protocol ("HDTP"), session initiation protocol ("SIP"), real-time protocol ("RTP"), or the like.
The storage arrays 102A-B may provide persistent data storage for the computing devices 164A-B. In an implementation, the storage array 102A may be contained in a chassis (not shown) and the storage array 102B may be contained in another chassis (not shown). The storage arrays 102A and 102B may include one or more storage array controllers 110A-D (also referred to herein as "controllers"). The storage array controllers 110A-D may be embodied as modules of an automated computing machinery comprising computer hardware, computer software, or a combination of computer hardware and software. In some implementations, the storage array controllers 110A-D may be configured to perform various storage tasks. Storage tasks may include writing data received from computing devices 164A-B to storage arrays 102A-B, erasing data from storage arrays 102A-B, retrieving data from storage arrays 102A-B and providing data to computing devices 164A-B, monitoring and reporting disk utilization and performance, performing redundancy operations (e.g., redundant array of independent drives ("RAID") or RAID-like data redundancy operations), compressing data, encrypting data, and so forth.
The storage array controllers 110A-D may be implemented in a variety of ways, including as a field programmable gate array ("FPGA"), a programmable logic chip ("PLC"), an application specific integrated circuit ("ASIC"), a system on a chip ("SOC"), or any computing device including discrete components (e.g., a processing device, a central processing unit, a computer memory, or various adapters). The storage array controllers 110A-D may include, for example, data communications adapters configured to support communications via the SAN 158 or LAN 160. In some implementations, the storage array controllers 110A-D may be independently coupled to the LAN 160. In an implementation, storage array controllers 110A-D may include I/O controllers or the like that couple storage array controllers 110A-D for data communications to persistent storage resources 170A-B (also referred to herein as "storage resources") through a midplane (not shown). Persistent storage resources 170A-B generally include any number of storage drives 171A-F (also referred to herein as "storage devices") and any number of non-volatile random access memory ("NVRAM") devices (not shown).
In some implementations, NVRAM devices of persistent storage resources 170A-B may be configured to receive data from the storage array controllers 110A-D to be stored in the storage drives 171A-F. In some examples, the data may originate from computing devices 164A-B. In some examples, writing data to the NVRAM device may be performed faster than writing data directly to the storage drives 171A-F. In an implementation, the storage array controllers 110A-D may be configured to utilize the NVRAM devices as quickly accessible buffers for data destined to be written to the storage drives 171A-F.
Using the NVRAM devices as buffers may improve the latency of write requests relative to systems in which the storage array controllers 110A-D write data directly to the storage drives 171A-F. In some embodiments, the NVRAM devices may be implemented with computer memory in the form of high-bandwidth, low-latency RAM. The NVRAM device is referred to as "non-volatile" because the NVRAM device may receive or include a dedicated power source that maintains the state of the RAM after loss of main power to the NVRAM device. Such a power source may be a battery, one or more capacitors, or the like. In response to a power loss, the NVRAM device may be configured to write the contents of the RAM to persistent storage, such as the storage drives 171A-F.
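A minimal sketch of this buffered write path, under the assumption of a simple in-memory staging structure (nvramBuffer, onPowerLoss, and persist are all illustrative names, not part of the described system):

```go
// Illustrative sketch (not the patented design) of using a battery-backed RAM
// buffer to acknowledge writes quickly and flush to drives on power loss.
package main

import "fmt"

type nvramBuffer struct {
	staged [][]byte // data acknowledged to clients but not yet on the drives
}

// write stages data in the NVRAM buffer so it can be acknowledged immediately.
func (b *nvramBuffer) write(data []byte) {
	b.staged = append(b.staged, data)
	// acknowledgment can be returned here, before the drive write completes
}

// onPowerLoss drains the buffer to persistent storage using the reserve power.
func (b *nvramBuffer) onPowerLoss(persist func([]byte)) {
	for _, d := range b.staged {
		persist(d)
	}
	b.staged = nil
}

func main() {
	var buf nvramBuffer
	buf.write([]byte("client block A"))
	buf.onPowerLoss(func(d []byte) { fmt.Printf("flushed %d bytes to drive\n", len(d)) })
}
```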
In an implementation, storage drives 171A-F may refer to any device configured to record data persistently, where "persistently" or "persistent" refers to the ability of the device to maintain recorded data after loss of power. In some implementations, the storage drives 171A-F may correspond to non-disk storage media. For example, the storage drives 171A-F may be one or more solid-state drives ("SSDs"), flash memory based storage devices, any type of solid-state non-volatile memory, or any other type of non-mechanical storage device. In other implementations, the storage drives 171A-F may include mechanical or spinning hard disks, such as hard disk drives ("HDDs").
In some implementations, the storage array controllers 110A-D may be configured to offload device management responsibilities from the storage drives 171A-F in the storage arrays 102A-B. For example, the storage array controllers 110A-D may manage control information that may describe the state of one or more memory blocks in the storage drives 171A-F. The control information may indicate, for example, that a particular memory block has failed and should no longer be written to, that a particular memory block contains boot code for the storage array controllers 110A-D, the number of program-erase ("P/E") cycles that have been performed on a particular memory block, the age of data stored in a particular memory block, the type of data stored in a particular memory block, and so forth. In some implementations, the control information may be stored as metadata with an associated memory block. In other implementations, the control information for the storage drives 171A-F may be stored in one or more particular memory blocks of the storage drives 171A-F selected by the storage array controllers 110A-D. The selected memory blocks may be tagged with an identifier indicating that the selected memory block contains control information. The storage array controllers 110A-D may utilize the identifier in conjunction with the storage drives 171A-F to quickly identify the memory blocks that contain control information. For example, the storage array controllers 110A-D may issue a command to locate memory blocks that contain control information. It may be noted that the control information may be so large that portions of the control information may be stored in multiple locations, for example, for purposes of redundancy, or the control information may otherwise be distributed across multiple memory blocks in the storage drives 171A-F.
In an implementation, the storage array controllers 110A-D may offload device management responsibilities from the storage drives 171A-F of the storage arrays 102A-B by retrieving control information from the storage drives 171A-F describing the state of one or more memory blocks in the storage drives 171A-F. Retrieving control information from the storage drives 171A-F may be performed, for example, by the storage array controllers 110A-D querying the storage drives 171A-F for the location of control information for a particular storage drive 171A-F. The storage drives 171A-F may be configured to execute instructions that enable the storage drives 171A-F to identify the location of the control information. The instructions may be executed by a controller (not shown) associated with or otherwise located on storage drives 171A-F and may cause storage drives 171A-F to scan a portion of each memory block to identify the memory block storing the control information of storage drives 171A-F. The storage drives 171A-F may respond by sending a response message to the storage array controller 110A-D that includes the location of the control information of the storage drives 171A-F. In response to receiving the response message, the storage array controllers 110A-D may issue a request to read data stored at addresses associated with the locations of the control information of the storage drives 171A-F.
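The exchange just described — query the drive for the location of its control information, then read at the reported location — might look like the following hedged sketch; the Drive interface and its methods are assumptions for illustration only, not a real drive command set:

```go
// Hedged sketch of the host-side query/response exchange described above.
package main

import "fmt"

// Drive abstracts the small command set the host controller relies on here.
type Drive interface {
	LocateControlInfo() (block int) // drive scans and reports the block holding control info
	ReadBlock(block int) []byte     // host reads at the reported location
}

type fakeDrive struct{ controlBlock int }

func (d fakeDrive) LocateControlInfo() int { return d.controlBlock }
func (d fakeDrive) ReadBlock(block int) []byte {
	return []byte(fmt.Sprintf("control info @ block %d", block))
}

// fetchControlInfo is the host-side half of the exchange: query, then read.
func fetchControlInfo(d Drive) []byte {
	loc := d.LocateControlInfo()
	return d.ReadBlock(loc)
}

func main() {
	fmt.Println(string(fetchControlInfo(fakeDrive{controlBlock: 42})))
}
```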
In other implementations, the storage array controllers 110A-D may further offload device management responsibilities from the storage drives 171A-F by performing, in response to receiving the control information, storage drive management operations. Storage drive management operations may include, for example, operations that are typically performed by the storage drive 171A-F (e.g., by a controller (not shown) associated with a particular storage drive 171A-F). Storage drive management operations may include, for example, ensuring that data is not written to failed memory blocks within the storage drives 171A-F, ensuring that data is written to memory blocks within the storage drives 171A-F in such a way that adequate wear leveling is achieved, and so forth.
In an implementation, the storage arrays 102A-B may implement two or more storage array controllers 110A-D. For example, the storage array 102A may include storage array controller 110A and storage array controller 110B. In a given instance, a single storage array controller 110A-D (e.g., storage array controller 110A) of the storage system 100 may be designated with primary status (also referred to herein as the "primary controller"), and other storage array controllers 110A-D (e.g., storage array controller 110B) may be designated with secondary status (also referred to herein as "secondary controllers"). The primary controller may have particular rights, such as permission to alter data in persistent storage resources 170A-B (e.g., writing data to persistent storage resources 170A-B). At least some of the rights of the primary controller may supersede those of the secondary controller. For example, the secondary controller may not have the right to alter data in persistent storage resources 170A-B when the primary controller has that right. The status of the storage array controllers 110A-D may change. For example, storage array controller 110A may be designated with secondary status, and storage array controller 110B may be designated with primary status.
In some implementations, a primary controller, such as storage array controller 110A, may serve as a primary controller for one or more storage arrays 102A-B, and a secondary controller, such as storage array controller 110B, may serve as a secondary controller for one or more storage arrays 102A-B. For example, storage array controller 110A may be the primary controller of storage arrays 102A and 102B, and storage array controller 110B may be the secondary controller of storage arrays 102A and 102B. In some implementations, the storage array controllers 110C and 110D (also referred to as "storage processing modules") may have neither a primary nor a secondary state. The storage array controllers 110C and 110D implemented as storage processing modules may serve as communication interfaces between the primary and secondary controllers (e.g., storage array controllers 110A and 110B, respectively) and the storage array 102B. For example, the storage array controller 110A of the storage array 102A may send write requests to the storage array 102B via the SAN 158. The write request may be received by both memory array controllers 110C and 110D of memory array 102B. The storage array controllers 110C and 110D facilitate communications, such as sending write requests to the appropriate storage drives 171A-F. It may be noted that in some implementations, the storage processing module may be used to increase the number of storage drives controlled by the primary and secondary controllers.
In an implementation, the storage array controllers 110A-D are communicatively coupled to one or more storage drives 171A-F and one or more NVRAM devices (not shown) that are included as part of the storage arrays 102A-B via a midplane (not shown). The storage array controllers 110A-D may be coupled to the midplane via one or more data communication links, and the midplane may be coupled to the storage drives 171A-F and NVRAM devices via one or more data communication links. The data communication links described herein are collectively illustrated by data communication links 108A-D and may include, for example, a peripheral component interconnect express ("PCIe") bus.
FIG. 1B illustrates an example system for data storage according to some embodiments. The storage array controller 101 illustrated in FIG. 1B may be similar to the storage array controllers 110A-D described with respect to FIG. 1A. In one example, storage array controller 101 may be similar to storage array controller 110A or storage array controller 110B. For purposes of illustration and not limitation, the storage array controller 101 includes numerous elements. It may be noted that the storage array controller 101 may include the same, more, or fewer elements configured in the same or different ways in other embodiments. It may be noted that elements of FIG. 1A may be referenced below to help illustrate features of the storage array controller 101.
The storage array controller 101 may include one or more processing devices 104 and random access memory ("RAM") 111. The processing device 104 (or controller 101) represents one or more general-purpose processing devices, such as a microprocessor, central processing unit, or the like. More particularly, the processing device 104 (or controller 101) may be a complex instruction set computing ("CISC") microprocessor, reduced instruction set computing ("RISC") microprocessor, very long instruction word ("VLIW") microprocessor, or a processor implementing other instruction sets, or processors implementing a combination of instruction sets. The processing device 104 (or controller 101) may also be one or more special-purpose processing devices, such as an ASIC, an FPGA, a digital signal processor ("DSP"), a network processor, or the like.
The processing device 104 may be connected to the RAM 111 via a data communication link 106, which may be embodied as a high-speed memory bus, such as a double data rate 4 ("DDR4") bus. Stored in RAM 111 is an operating system 112. In some implementations, instructions 113 are stored in RAM 111. The instructions 113 may include computer program instructions for performing operations in a direct-mapped flash storage system. In one embodiment, a direct-mapped flash storage system is one that addresses data blocks within flash drives directly and without an address translation performed by the storage controllers of the flash drives.
In an implementation, the storage array controller 101 includes one or more host bus adapters 103A-C coupled to the processing device 104 via data communication links 105A-C. In implementations, the host bus adapters 103A-C can be computer hardware that connects a host system (e.g., a storage array controller) to other networks and storage arrays. In some examples, host bus adapters 103A-C may be fibre channel adapters enabling storage array controller 101 to connect to a SAN, ethernet adapters enabling storage array controller 101 to connect to a LAN, or the like. The host bus adapters 103A-C may be coupled to the processing device 104 via data communication links 105A-C, such as a PCIe bus.
In an implementation, the storage array controller 101 may include a host bus adapter 114 coupled to the expander 115. Expander 115 may be used to attach host systems to a large number of storage drives. Expander 115 may be, for example, a SAS expander for enabling host bus adapter 114 to be attached to a storage drive in embodiments where host bus adapter 114 is embodied as a SAS controller.
In an implementation, the storage array controller 101 may include a switch 116 coupled to the processing device 104 via a data communication link 109. Switch 116 may be a computer hardware device that may create multiple endpoints from a single endpoint, enabling multiple devices to share a single endpoint. For example, switch 116 may be a PCIe switch coupled to a PCIe bus (e.g., data communication link 109) and presenting multiple PCIe connection points to the midplane.
In an embodiment, the storage array controller 101 includes a data communication link 107 for coupling the storage array controller 101 to other storage array controllers. In some examples, data communication link 107 may be a QuickPath Interconnect ("QPI").
A traditional storage system that uses traditional flash drives may implement a process across the flash drives that are part of the traditional storage system. For example, a higher-level process of the storage system may initiate and control a process across the flash drives. However, a flash drive of the traditional storage system may include its own storage controller that also performs the process. Thus, for the traditional storage system, both a higher-level process (e.g., initiated by the storage system) and a lower-level process (e.g., initiated by a storage controller of the storage system) may be performed.
To resolve various deficiencies of a traditional storage system, operations may be performed by higher-level processes and not by lower-level processes. For example, the flash storage system may include flash drives that do not include storage controllers that provide the process. Thus, the operating system of the flash storage system itself may initiate and control the process. This may be accomplished by a direct-mapped flash storage system that addresses data blocks within the flash drives directly and without an address translation performed by the storage controllers of the flash drives.
In an implementation, the storage drives 171A-F may be one or more partitioned storage devices. In some implementations, one or more of the partitioned storage devices may be a shingled HDD. In an implementation, the one or more storage devices may be flash-based SSDs. In a partitioned storage device, the zoned namespace on the partitioned storage device is addressable by groups of blocks that are grouped and aligned by a natural size, forming a number of addressable zones. In implementations utilizing an SSD, the natural size may be based on the erase block size of the SSD. In some implementations, the zones of the partitioned storage device may be defined during initialization of the partitioned storage device. In implementations, a zone, or its mapping to physical storage within the storage device, may be defined dynamically as the zone is reset, opened, closed, finished, first written from an empty state, or as data is written to the partitioned storage device.
In some implementations, the zones may be heterogeneous, with some zones each being a page group and other zones being multiple page groups. In an implementation, some zones may correspond to an erase block and other zones may correspond to multiple erase blocks. In implementations, zones may be any combination of differing numbers of pages in page groups and/or erase blocks, for heterogeneous mixes of programming modes, manufacturers, product types, and/or product generations of storage devices, as applied to heterogeneous assemblies, upgrades, distributed storage, and so on. In some embodiments, a zone may be defined as having a usage characteristic, such as a property of supporting data with a particular kind of longevity (e.g., very short lived or very long lived data). The partitioned storage device may use these attributes to determine how to manage the zone over its expected lifetime.
It should be appreciated that a zone is a virtual construct. Any particular zone may not have a fixed location on the storage device. Until allocated, a zone may not have any location on the storage device. A zone may correspond to a number representing a chunk of virtually allocatable space, which in various implementations is the size of an erase block or another block size. When a system utilizing partitioned drives allocates or opens a zone, or otherwise issues a first write to a zone in an empty state, the zone is allocated to flash memory or other solid-state storage memory, and as the system writes to the zone, pages are written to the mapped flash memory or other solid-state storage memory of the partitioned storage device. When the system finishes the zone, the associated erase block or other sized block is completed. At some point in the future, the system may reset the zone, which frees the zone's allocated space. During its lifetime, a zone may be moved to different locations on the partitioned storage device, for example, as the partitioned storage device performs internal maintenance.
In an embodiment, the zones of the partitioned storage device may be in different states. A zone may be in an empty state, in which no data is stored in the zone. An empty zone may be opened explicitly, or may be opened implicitly by writing data to the zone. This is the initial state of zones on a fresh partitioned storage device, but it may also be the result of a zone reset. In some implementations, an empty zone may have a designated location within the flash memory of the partitioned storage device. In an embodiment, the location of the zone may be chosen when the zone is first opened or first written to (or later, if writes are buffered in memory). A zone may be in an open state, implicitly or explicitly, where a zone in the open state may be written to store data using write or append commands. In an embodiment, a zone in the open state may also be written to using a copy command that copies data previously stored elsewhere on the drive. In some implementations, the partitioned storage device may have a limit on the number of zones that may be open at a particular time.
A zone in the closed state is a zone that has been partially written to, but has entered the closed state after an explicit close operation is issued. A zone in the closed state is available for future writes, but reduces some of the runtime overhead that would otherwise be consumed by keeping the zone in the open state. In an embodiment, the partitioned storage device may have a limit on the number of closed zones at a particular time. A zone in the full state is a zone that is storing data and can no longer be written to. A zone may be in the full state after data has been written to the entirety of the zone, or as a result of a zone finish operation. The zone may or may not have been fully written to prior to the finish operation. However, after the finish operation, the zone is not open for further writes unless a zone reset operation is performed first.
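The empty/open/closed/full lifecycle described in the last two paragraphs can be summarized as a small state machine. The sketch below is only illustrative: the type names, the per-drive open-zone limit, and the transition helpers are assumptions, not the device's actual command set.

```go
// A minimal zone state machine for the empty/open/closed/full lifecycle.
package main

import (
	"errors"
	"fmt"
)

type ZoneState int

const (
	Empty ZoneState = iota
	Open
	Closed
	Full
)

type Zone struct{ State ZoneState }

type ZonedDrive struct {
	Zones         []Zone
	OpenZoneLimit int // partitioned drives typically bound concurrently open zones
}

func (d *ZonedDrive) openCount() int {
	n := 0
	for _, z := range d.Zones {
		if z.State == Open {
			n++
		}
	}
	return n
}

// OpenZone transitions a zone to the open state, respecting the drive-wide limit.
func (d *ZonedDrive) OpenZone(i int) error {
	z := &d.Zones[i]
	if z.State == Full {
		return errors.New("full zone must be reset before it can be reused")
	}
	if z.State != Open && d.openCount() >= d.OpenZoneLimit {
		return errors.New("open-zone limit reached")
	}
	z.State = Open
	return nil
}

func (d *ZonedDrive) CloseZone(i int)  { d.Zones[i].State = Closed }
func (d *ZonedDrive) FinishZone(i int) { d.Zones[i].State = Full }
func (d *ZonedDrive) ResetZone(i int)  { d.Zones[i].State = Empty } // frees the zone's space

func main() {
	d := &ZonedDrive{Zones: make([]Zone, 4), OpenZoneLimit: 2}
	fmt.Println(d.OpenZone(0), d.OpenZone(1), d.OpenZone(2)) // third open hits the limit
	d.FinishZone(0)
	d.ResetZone(0)
	fmt.Println(d.OpenZone(2)) // succeeds now that only one zone remains open
}
```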
The mapping from a zone to an erase block (or to a shingled track in an HDD) may be arbitrary, dynamic, and hidden from view. The process of opening a zone may be an operation that allows a new zone to be dynamically mapped to underlying storage of the partitioned storage device, and then allows data to be written through appending writes into the zone until the zone reaches capacity. The zone can be finished at any point, after which further data may not be written into the zone. When the data stored in a zone is no longer needed, the zone can be reset, which effectively deletes the zone's contents from the partitioned storage device, making the physical storage held by that zone available for subsequent storage of data. Once a zone has been written to and finished, the partitioned storage device ensures that the data stored in the zone is not lost until the zone is reset. In the time between writing the data to the zone and resetting the zone, the zone may be moved around between shingled tracks or erase blocks as part of maintenance operations within the partitioned storage device, such as copying data to keep the data refreshed or handling memory cell aging in an SSD.
In embodiments utilizing an HDD, the resetting of a zone may allow the shingled tracks to be allocated to a new, open zone that may be opened at some point in the future. In implementations utilizing an SSD, the resetting of a zone may cause the associated physical erase block(s) of the zone to be erased and subsequently reused for the storage of data. In some implementations, the partitioned storage device may have a limit on the number of open zones at a point in time to reduce the amount of overhead dedicated to keeping zones open.
An operating system of a flash memory storage system may identify and maintain a list of allocation units across multiple flash memory drives of the flash memory storage system. The allocation unit may be an entire erase block or a plurality of erase blocks. The operating system may maintain a map or address range that directly maps addresses to erase blocks of a flash memory drive of the flash memory storage system.
An erase block that is mapped directly to a flash memory drive may be used to rewrite data and erase data. For example, operations may be performed on one or more allocation units that include first data and second data, where the first data is to be retained and the second data is no longer used by the flash memory storage system. The operating system may initiate a process to write the first data to a new location within the other allocation unit and erase the second data and mark the allocation unit as available for subsequent data. Thus, the process may be performed by only the higher level operating system of the flash memory storage system without requiring additional lower level processes to be performed by the controller of the flash memory drive.
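Under stated assumptions (a hypothetical AllocationUnit type and a compact helper; the disclosure does not prescribe these names), a sketch of that relocation step might look like this:

```go
// Sketch of the rewrite step described above: live data is copied to a fresh
// allocation unit, then the old unit is erased and marked available.
package main

import "fmt"

type AllocationUnit struct {
	Blocks    map[int][]byte // block index -> data
	Available bool
}

// compact copies the still-needed blocks from src into dst, then erases src.
func compact(src, dst *AllocationUnit, live map[int]bool) {
	for idx, data := range src.Blocks {
		if live[idx] {
			dst.Blocks[idx] = data // first data: retained at a new location
		}
	}
	src.Blocks = map[int][]byte{} // second data erased along with the unit
	src.Available = true          // unit may now be reused for subsequent data
}

func main() {
	src := &AllocationUnit{Blocks: map[int][]byte{0: []byte("keep"), 1: []byte("stale")}}
	dst := &AllocationUnit{Blocks: map[int][]byte{}}
	compact(src, dst, map[int]bool{0: true})
	fmt.Println(len(dst.Blocks), src.Available) // 1 true
}
```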
Advantages of the process being performed only by the operating system of the flash storage system include increased reliability of the flash drives of the flash storage system, because unnecessary or redundant write operations are not performed during the process. One point of possible novelty here is the concept of the process being initiated and controlled by the operating system of the flash storage system. In addition, the process can be controlled by the operating system across multiple flash drives. This is in contrast to the process being performed by a storage controller of a flash drive.
A storage system may consist of two storage array controllers that share a set of drives for failover purposes, or it may consist of a single storage array controller that provides a storage service utilizing multiple drives, or it may consist of a distributed network of storage array controllers each with some number of drives or some amount of flash storage, where the storage array controllers in the network collaborate to provide a complete storage service and collaborate on various aspects of the storage service, including storage allocation and garbage collection.
FIG. 1C illustrates a third example system 117 for data storage according to some embodiments. For purposes of illustration and not limitation, system 117 (also referred to herein as a "storage system") includes many elements. It may be noted that system 117 may include the same, more, or fewer elements configured in the same or different ways in other embodiments.
In one embodiment, the system 117 includes a dual peripheral component interconnect ("PCI") flash storage device 118 with separately addressable fast write storage. The system 117 may include a storage device controller 119. In one embodiment, the storage device controllers 119A-D may be a CPU, an ASIC, an FPGA, or any other circuitry that may implement the control structures necessary according to the present disclosure. In one embodiment, the system 117 includes flash memory devices (e.g., including flash memory devices 120a-n) operatively coupled to various channels of the storage device controller 119. The flash memory devices 120a-n may be presented to the controllers 119A-D as an addressable collection of flash pages, erase blocks, and/or control elements sufficient to allow the storage device controllers 119A-D to program and retrieve various aspects of the flash memory. In one embodiment, the storage device controllers 119A-D may perform operations on the flash memory devices 120a-n including storing and retrieving data content of pages, arranging and erasing any blocks, tracking statistics related to the use and reuse of flash memory pages, erase blocks, and cells, tracking and predicting error codes and faults within the flash memory, controlling voltage levels associated with programming and retrieving contents of flash memory cells, and the like.
In one embodiment, the system 117 may include RAM 121 to separately store addressable fast write data. In one embodiment, RAM 121 may be one or more stand-alone discrete devices. In another embodiment, RAM 121 may be integrated into storage device controllers 119A-D or multiple storage device controllers. RAM 121 may also be used for other purposes such as storing temporary program memory for a processing device (e.g., CPU) in device controller 119.
In one embodiment, the system 117 may include an energy storage device 122, such as a rechargeable battery or capacitor. The energy storage device 122 may store energy sufficient to power the storage device controller 119, an amount of RAM (e.g., RAM 121), and an amount of flash memory (e.g., flash memories 120a-120n) for sufficient time to write the contents of the RAM to the flash memory. In one embodiment, the storage device controllers 119A-D may write the contents of the RAM to the flash memory if the storage device controller detects loss of external power.
In one embodiment, the system 117 includes two data communication links 123a, 123b. In one embodiment, the data communication links 123a, 123b may be PCI interfaces. In another embodiment, the data communication links 123a, 123b may be based on other communication standards (e.g., HyperTransport, InfiniBand, etc.). The data communication links 123a, 123b may be based on the non-volatile memory express ("NVMe") or NVMe over fabrics ("NVMf") specifications that allow external connection to the storage device controllers 119A-D from other components in the storage system 117. It should be noted that, for convenience, the data communication links are interchangeably referred to herein as PCI buses.
The system 117 may also include an external power source (not shown), which may be provided over one or both of the data communication links 123a, 123b, or which may be provided separately. An alternative embodiment includes a separate flash memory (not shown) dedicated to storing the contents of RAM 121. The storage device controllers 119A-D may present a logical device over a PCI bus, which may include an addressable fast write logical device, or a distinct part of the logical address space of the storage device 118, which may be presented as PCI memory or as persistent storage. In one embodiment, operations to store into the device are directed into the RAM 121. On power failure, the storage device controllers 119A-D may write stored content associated with the addressable fast write logical storage to flash memory (e.g., flash memory 120a-n) for long-term persistent storage.
In one embodiment, the logic device may include some representations of some or all of the contents of flash memory devices 120 a-n, where the representations allow a storage system including storage device 118 (e.g., storage system 117) to directly address flash memory pages over the PCI bus and to directly reprogram erase blocks from storage system components external to the storage device. The representation may also allow one or more external components to control and retrieve other aspects of the flash memory, including some or all of the following: tracking statistics related to the use and reuse of flash memory pages, erase blocks, and cells across all flash memory devices; tracking and predicting error codes and faults within and across flash memory devices; controlling voltage levels associated with programming and retrieving the contents of flash memory cells; etc.
In one embodiment, the energy storage device 122 may be sufficient to ensure that in-progress operations on the flash memory devices 120a-120n are completed; the energy storage device 122 may power the storage device controllers 119A-D and associated flash memory devices (e.g., 120a-n) for those operations, as well as for the storing of fast write RAM to flash memory. The energy storage device 122 may be used to store accumulated statistics and other parameters kept and tracked by the flash memory devices 120a-n and/or the storage device controller 119. Separate capacitors or energy storage devices (such as smaller capacitors near or embedded within the flash memory devices themselves) may be used for some or all of the operations described herein.
Various schemes may be used to track and optimize the life of the energy storage components, such as adjusting voltage levels over time, partially discharging the energy storage device 122 to measure corresponding discharge characteristics, etc. If the available energy decreases over time, the effective available capacity of the addressable fast write memory may decrease to ensure that it can be safely written based on the currently available stored energy.
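One way to picture that last adjustment is to bound the advertised fast write capacity by how much RAM content the remaining stored energy could flush to flash. The constants and the function below are invented for illustration only, not measured device parameters.

```go
// Rough sketch of deriving a safe fast write capacity from currently available energy.
package main

import "fmt"

// safeFastWriteBytes bounds the addressable fast write capacity by how much
// RAM content could be flushed to flash with the energy currently available.
func safeFastWriteBytes(availableJoules, joulesPerMiBFlushed float64, maxBytes int64) int64 {
	flushable := int64(availableJoules / joulesPerMiBFlushed * 1024 * 1024)
	if flushable < maxBytes {
		return flushable // shrink capacity as the stored energy degrades
	}
	return maxBytes
}

func main() {
	fmt.Println(safeFastWriteBytes(30.0, 0.5, 64*1024*1024)) // capped by energy, not RAM size
}
```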
FIG. 1D illustrates a fourth example system 124 for data storage according to some embodiments. In one embodiment, the system 124 includes storage controllers 125a, 125b. In one embodiment, storage controllers 125a, 125b are operatively coupled to dual PCI storage devices 119a, 119b and 119c, 119d, respectively. The storage controllers 125a, 125b are operatively coupled (e.g., via a storage network 130) to some number of host computers 127a-n.
In one embodiment, two storage controllers (e.g., 125a and 125b) provide storage services, such as a SCSI block storage array, a file server, an object server, a database or data analytics service, etc. The storage controllers 125a, 125b may provide services through some number of network interfaces (e.g., 126a-d) to host computers 127a-n outside of the storage system 124. The storage controllers 125a, 125b may provide integrated services or applications entirely within the storage system 124, forming a converged storage and compute system. The storage controllers 125a, 125b may utilize the fast write memory within or across storage devices 119a-d to record in-progress operations to ensure the operations are not lost on a power failure, storage controller removal, storage controller or storage system shutdown, or some fault of one or more software or hardware components within the storage system 124.
In one embodiment, the controllers 125a, 125b operate as PCI masters to one or the other PCI buses 128a, 128b. In another embodiment, 128a and 128b may be based on other communication standards (e.g., HyperTransport, InfiniBand, etc.). Other storage system embodiments may operate the storage controllers 125a, 125b as multi-masters for both PCI buses 128a, 128b. Alternatively, a PCI/NVMe/NVMf switching infrastructure or fabric may connect multiple storage controllers. Some storage system embodiments may allow storage devices to communicate with each other directly rather than communicating only with storage controllers. In one embodiment, the storage device controller 119a is operable under direction from the storage controller 125a to synthesize and transfer data to be stored into the flash memory devices from data that has already been stored in RAM (e.g., RAM 121 of FIG. 1C). For example, a recalculated version of the RAM content may be transferred after the storage controller has determined that an operation has fully committed across the storage system, or when the fast write memory on the device has reached a certain used capacity, or after a certain amount of time, to ensure improved safety of the data or to release addressable fast write capacity for reuse. This mechanism may be used, for example, to avoid a second transfer over a bus (e.g., 128a, 128b) from the storage controllers 125a, 125b. In one embodiment, a recalculation may include compressing the data, attaching an index or other metadata, combining multiple data segments together, performing erasure code calculations, and so forth.
In one embodiment, under direction from the storage controllers 125a, 125b, the storage device controllers 119a, 119b are operable to calculate data from data stored in RAM (e.g., RAM 121 of fig. 1C) and transfer the data to other storage devices without involving the storage controllers 125a, 125b. This operation may be used to mirror data stored in one controller 125a to another controller 125b, or it may be used to offload compression, data aggregation, and/or erasure coding calculations and transfers to a storage device to reduce the load on the storage controller or storage controller interface 129a, 129b to the PCI bus 128a, 128b.
The storage device controllers 119 a-d may include mechanisms for implementing high availability primitives for use by other portions of the storage system external to the dual-PCI storage device 118. For example, a reservation or exclusion primitive may be provided such that in a storage system having two storage controllers providing high availability storage services, one storage controller may prevent the other storage controller from accessing or continuing to access the storage device. This may be used, for example, in the event that one controller detects that another is not functioning properly or that the interconnection between two storage controllers may itself be functioning improperly.
In one embodiment, a storage system for use with dual PCI direct mapped storage devices with individually addressable fast write storage includes a system that manages erase blocks or groups of erase blocks as allocation units for storing data on behalf of a storage service, or for storing metadata (e.g., indexes, logs, etc.) associated with the storage service, or for proper management of the storage system itself. Flash memory pages, which may be several kilobytes in size, may be written when data arrives or when the storage system holds the data for a longer time interval (e.g., exceeding a defined time threshold). To commit data faster, or to reduce the number of writes to the flash memory device, the memory controller may first write data to an individually addressable fast write memory device on one or more memory devices.
In one embodiment, the storage controllers 125a, 125b may initiate the use of erase blocks within and across storage devices (e.g., 118) in accordance with the age and expected remaining lifespan of the storage devices, or based on other statistics. The storage controllers 125a, 125b may initiate garbage collection and data migration between storage devices in accordance with pages that are no longer needed, as well as manage flash page and erase block lifespans and manage overall system performance.
In one embodiment, the storage system 124 may utilize mirroring and/or erasure coding schemes as part of storing data into addressable fast write storage and/or as part of writing data into allocation units associated with erase blocks. Erasure codes may be used across storage devices, as well as within erase blocks or allocation units, or within and across flash memory devices on a single storage device, to provide redundancy against single or multiple storage device failures or to protect against internal corruption of flash memory pages resulting from flash memory operations or from degradation of flash memory cells. Mirroring and erasure coding at various levels may be used to recover from multiple types of failures that occur separately or in combination.
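For intuition only, here is a minimal single-parity (RAID-5-style XOR) example of the erasure-coding idea. Production systems typically use stronger codes such as Reed-Solomon, so this sketch merely illustrates the recover-a-lost-shard principle rather than the coding scheme the disclosure uses.

```go
// Minimal XOR-parity illustration of erasure coding across a stripe.
package main

import "fmt"

// xorParity returns the byte-wise XOR of equally sized data shards.
func xorParity(shards [][]byte) []byte {
	parity := make([]byte, len(shards[0]))
	for _, s := range shards {
		for i, b := range s {
			parity[i] ^= b
		}
	}
	return parity
}

// recoverShard rebuilds one missing shard from the surviving shards plus parity.
func recoverShard(surviving [][]byte, parity []byte) []byte {
	return xorParity(append(surviving, parity))
}

func main() {
	d0, d1, d2 := []byte("AAAA"), []byte("BBBB"), []byte("CCCC")
	p := xorParity([][]byte{d0, d1, d2})
	fmt.Printf("%s\n", recoverShard([][]byte{d0, d2}, p)) // prints BBBB
}
```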
The embodiments depicted with reference to fig. 2A-G illustrate a storage cluster storing user data, such as user data originating from one or more users or client systems or other sources external to the storage cluster. The storage clusters distribute user data across storage nodes housed within the chassis or across multiple chassis using erasure coding and redundant copies of metadata. Erasure coding refers to a method of data protection or reconstruction in which data is stored across a set of different locations, such as disks, storage nodes, or geographic locations. Flash memory is one type of solid state memory that may be integrated with embodiments, although embodiments may be extended to other types of solid state memory or other storage media, including non-solid state memory. Control of storage locations and workloads is distributed across storage locations in a clustered peer-to-peer system. Tasks such as mediating communications between various storage nodes, detecting when a storage node becomes unavailable, and balancing I/O (input and output) across the various storage nodes are all handled on a distributed basis. In some embodiments, data is arranged or distributed across multiple storage nodes in data segments or stripes that support data recovery. Ownership of data may be reassigned within a cluster independent of input and output modes. This architecture, described in more detail below, allows storage nodes in the cluster to fail while the system remains operational because data can be reconstructed from other storage nodes and thus remain available for input and output operations. In various embodiments, a storage node may be referred to as a cluster node, a blade, or a server.
The storage clusters may be contained within a chassis (i.e., a housing that houses one or more storage nodes). The chassis contains mechanisms (e.g., a power distribution bus) to provide power to each storage node and communications mechanisms (e.g., a communications bus capable of communicating between storage nodes). According to some embodiments, a storage cluster may operate as a stand-alone system in one location. In one embodiment, the chassis contains at least two examples of power distribution and communication buses that can be independently enabled or disabled. The internal communication bus may be an ethernet bus, however, other technologies (e.g., PCIe, infiniband, etc.) are equally applicable. The chassis provides ports for an external communication bus for enabling communication between multiple chassis, either directly or through a switch, and with a client system. External communications may use technologies such as ethernet, infiniband, fibre channel, etc. In some embodiments, the external communication bus uses different communication bus technologies for inter-chassis and client communication. If the switch is deployed within a chassis or between chassis, the switch may act as a translation between multiple protocols or technologies. When multiple chassis are connected to define a storage cluster, a client may access the storage cluster using a proprietary interface or standard interface (e.g., network file system ("NFS"), common internet file system ("CIFS"), small computer system interface ("SCSI"), or hypertext transfer protocol ("HTTP")). Translation of the client protocol may occur at the switch, at the chassis external communication bus, or within each storage node. In some embodiments, multiple chassis may be coupled or connected to each other through an aggregator switch. A portion and/or all of the coupled or connected chassis may be designated as a storage cluster. As discussed above, each chassis may have multiple blades, each with a media access control ("MAC") address, but in some embodiments the storage cluster is presented to the external network as having a single cluster IP address and a single MAC address.
Each storage node may be one or more storage servers, and each storage server is connected to one or more non-volatile solid-state memory units, which may be referred to as storage units or storage devices. One embodiment includes a single storage server and one to eight non-volatile solid-state memory units in each storage node, although this one example is not meant to be limiting. The storage server may include a processor, DRAM, and interfaces for internal communication buses and power distribution for each power bus. In some embodiments, the interface and storage units share a communication bus, such as PCI Express, within the storage node. The non-volatile solid state memory unit may directly access the internal communication bus interface through the storage node communication bus or request the storage node to access the bus interface. The non-volatile solid-state memory unit contains an embedded CPU, a solid-state memory controller, and an amount of solid-state mass storage, such as between 2 and 32 terabytes ("TB") in some embodiments. Embedded volatile storage media, such as DRAM, and energy retention devices are included in non-volatile solid state memory cells. In some embodiments, the energy retaining device is a capacitor, supercapacitor, or battery that is capable of transferring a subset of the DRAM content to a stable storage medium in the event of a power loss. In some embodiments, the non-volatile solid-state memory cells are comprised of storage class memory, such as phase change or magnetoresistive random access memory ("MRAM") that replaces DRAM and enables reduced power retention devices.
One of many features of the storage nodes and non-volatile solid state storage devices is the ability to proactively rebuild data in a storage cluster. The storage nodes and non-volatile solid state storage devices can determine when a storage node or non-volatile solid state storage device in the storage cluster is unreachable, regardless of whether there is an attempt to read data involving that storage node or non-volatile solid state storage device. The storage nodes and non-volatile solid state storage devices then cooperate to recover and rebuild the data in at least partially new locations. This constitutes a proactive rebuild, in that the system rebuilds the data without waiting until the data is needed for a read access initiated from a client system employing the storage cluster. These and further details of the storage memory and its operation are discussed below.
FIG. 2A is a perspective view of a storage cluster 161 having a plurality of storage nodes 150 and internal solid state memory coupled to each storage node to provide network attached storage or a storage area network, according to some embodiments. The network attached storage, storage area network, or storage cluster, or other storage memory, may include one or more storage clusters 161, each storage cluster 161 having one or more storage nodes 150, in a flexible and reconfigurable arrangement of both the physical components and the amount of storage memory provided thereby. Storage cluster 161 is designed to fit in a rack, and one or more racks may be configured and populated as needed for the storage memory. Storage cluster 161 has a chassis 138 with a plurality of slots 142. It should be appreciated that the chassis 138 may be referred to as a housing, enclosure, or rack unit. In one embodiment, the chassis 138 has 14 slots 142, although other numbers of slots are readily contemplated. For example, some embodiments have four slots, eight slots, sixteen slots, thirty-two slots, or another suitable number of slots. In some embodiments, each slot 142 may house one storage node 150. The chassis 138 includes tabs 148 that may be used to mount the chassis 138 on a rack. Fans 144 provide air circulation for cooling of the storage nodes 150 and their components, although other cooling components may be used, or embodiments without cooling components are contemplated. The switch fabric 146 couples the storage nodes 150 within the chassis 138 together and to a network for communication to the memory. In the embodiment depicted herein, the slots 142 to the left of the switch fabric 146 and fans 144 are shown occupied by storage nodes 150, while the slots 142 to the right of the switch fabric 146 and fans 144 are empty and available for insertion of storage nodes 150, for illustrative purposes. This configuration is one example, and one or more storage nodes 150 may occupy the slots 142 in various further arrangements. In some embodiments, the storage node arrangements need not be sequential or contiguous. Storage nodes 150 are hot-swappable, meaning that a storage node 150 may be inserted into a slot 142 in the chassis 138 or removed from a slot 142 without stopping or powering down the system. Upon insertion or removal of a storage node 150 from a slot 142, the system automatically reconfigures in order to recognize and accommodate the change. In some embodiments, reconfiguring includes restoring redundancy and/or rebalancing data or load.
Each storage node 150 may have multiple components. In the embodiment shown here, the storage node 150 includes a printed circuit board 159 populated by a CPU 156 (i.e., a processor), a memory 154 coupled to the CPU 156, and a non-volatile solid state storage 152 coupled to the CPU 156, although other mounts and/or components may be used in further embodiments. The memory 154 has instructions executed by the CPU 156 and/or data operated on by the CPU 156. As explained further below, the non-volatile solid-state storage 152 comprises flash memory, or in further embodiments, other types of solid-state memory.
Referring to FIG. 2A, storage cluster 161 is scalable, meaning that storage capacity with non-uniform storage sizes is readily added, as described above. In some embodiments, one or more storage nodes 150 may be plugged into or removed from each chassis, and the storage cluster self-configures. Plug-in storage nodes 150, whether installed in a chassis as delivered or added later, may be of different sizes. For example, in one embodiment a storage node 150 may have any multiple of 4TB, e.g., 8TB, 12TB, 16TB, 32TB, and so on. In further embodiments, a storage node 150 may have any multiple of other storage amounts or capacities. The storage capacity of each storage node 150 is broadcast and influences decisions of how to stripe the data. For maximum storage efficiency, an embodiment may self-configure as wide as possible in the stripe, subject to a predetermined requirement of continued operation with loss of up to one, or up to two, non-volatile solid state storage units 152 or storage nodes 150 within the chassis.
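As an illustration of the stripe-width decision described above, the following sketch picks the widest stripe that still tolerates the loss of a predetermined number of storage units; the function name, the fixed parity-shard policy, and the capacity list are assumptions for illustration rather than the exact algorithm of the embodiments described herein.

```python
# Illustrative sketch only: choosing an erasure-coding stripe width so the
# cluster keeps operating after losing up to `max_failures` storage units.
# The names and the "one parity shard per tolerated failure" policy are
# assumptions, not the exact self-configuration algorithm of the embodiments.

def choose_stripe_width(unit_capacities_tb, max_failures=2, max_width=None):
    """Return (data_shards, parity_shards) for the widest usable stripe."""
    usable_units = len(unit_capacities_tb)
    width = usable_units if max_width is None else min(usable_units, max_width)
    parity_shards = max_failures              # tolerate that many lost units
    data_shards = width - parity_shards
    if data_shards < 1:
        raise ValueError("not enough storage units for the requested redundancy")
    return data_shards, parity_shards

# A chassis populated with storage nodes of non-uniform size (in TB).
capacities = [8, 8, 12, 16, 32, 4, 4, 8]
data, parity = choose_stripe_width(capacities, max_failures=2)
print(f"stripe: {data} data shards + {parity} parity shards")
```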
Fig. 2B is a block diagram showing a communications interconnect 173 and a power distribution bus 172 coupling multiple storage nodes 150. Referring back to fig. 2A, in some embodiments the communications interconnect 173 may be included in or implemented with the switch fabric 146. In some embodiments, where multiple storage clusters 161 occupy a rack, the communications interconnect 173 may be included in or implemented with a top-of-rack switch. As illustrated in FIG. 2B, storage cluster 161 is enclosed within a single chassis 138. External port 176 is coupled to storage nodes 150 through the communications interconnect 173, while external port 174 is coupled directly to a storage node. External power port 178 is coupled to the power distribution bus 172. Storage nodes 150 may include varying amounts and differing capacities of non-volatile solid state storage 152, as described with reference to fig. 2A. In addition, one or more storage nodes 150 may be compute-only storage nodes, as illustrated in FIG. 2B. Authorities 168 are implemented on the non-volatile solid state storage 152, for example as lists or other data structures stored in memory. In some embodiments, the authorities are stored within the non-volatile solid state storage 152 and are supported by software executing on a controller or other processor of the non-volatile solid state storage 152. In a further embodiment, authorities 168 are implemented on the storage nodes 150, for example as lists or other data structures stored in the memory 154 and supported by software executing on the CPU 156 of the storage node 150. In some embodiments, authorities 168 control how and where data is stored in the non-volatile solid state storage 152. This control assists in determining which type of erasure coding scheme is applied to the data, and which storage nodes 150 have which portions of the data. Each authority 168 may be assigned to a non-volatile solid state storage 152. In various embodiments, each authority may control a range of inode numbers, segment numbers, or other data identifiers that are assigned to data by a file system, by the storage nodes 150, or by the non-volatile solid state storage 152.
In some embodiments, each data segment and each metadata segment has redundancy in the system. In addition, each data segment and each metadata segment has an owner, which may be referred to as an authority. If that authority is unreachable, for example through failure of a storage node, there is a plan of succession for how to find that data or that metadata. In various embodiments, there are redundant copies of authorities 168. In some embodiments, authorities 168 have a relationship to storage nodes 150 and non-volatile solid state storage 152. Each authority 168, covering a range of data segment numbers or other identifiers of the data, may be assigned to a specific non-volatile solid state storage 152. In some embodiments, the authorities 168 for all of such ranges are distributed over the non-volatile solid state storage 152 of a storage cluster. Each storage node 150 has a network port that provides access to the non-volatile solid state storage device(s) 152 of that storage node 150. In some embodiments, data can be stored in a segment, which is associated with a segment number, and that segment number is an indirection for a configuration of a RAID (redundant array of independent disks) stripe. The assignment and use of the authorities 168 thus establishes an indirection to data. Indirection may be referred to as the ability to reference data indirectly, in this case via an authority 168, in accordance with some embodiments. A segment identifies a set of non-volatile solid state storage 152 and a local identifier into the set of non-volatile solid state storage 152 that may contain data. In some embodiments, the local identifier is an offset into the device and may be reused sequentially by multiple segments. In other embodiments, the local identifier is unique to a specific segment and is never reused. The offsets in the non-volatile solid state storage 152 are applied to locating data for writing to or reading from the non-volatile solid state storage 152 (in the form of a RAID stripe). Data is striped across multiple units of non-volatile solid state storage 152, which may include or be different from the non-volatile solid state storage 152 having the authority 168 for a particular data segment.
If the location of a particular segment of data changes, for example during a data move or a data reconstruction, the authority 168 for that data segment should be consulted at the non-volatile solid state storage 152 or storage node 150 having that authority 168. In order to locate a particular piece of data, embodiments calculate a hash value for a data segment or apply an inode number or a data segment number. The output of this operation points to the non-volatile solid state storage 152 having the authority 168 for that particular piece of data. In some embodiments there are two stages to this operation. The first stage maps an entity identifier (ID), e.g., a segment number, inode number, or directory number, to an authority identifier. This mapping may include a calculation such as a hash or a bit mask. The second stage maps the authority identifier to a particular non-volatile solid state storage 152, which may be done through an explicit mapping. The operation is repeatable, so that when the calculation is performed, the result of the calculation repeatably and reliably points to the particular non-volatile solid state storage 152 having that authority 168. The operation may include the set of reachable storage nodes as input. If the set of reachable non-volatile solid state storage units changes, the optimal set changes. In some embodiments, the persisted value is the current assignment (which is always true) and the calculated value is the target assignment the cluster will attempt to reconfigure towards. This calculation may be used to determine the optimal non-volatile solid state storage 152 for an authority in the presence of a set of non-volatile solid state storage 152 that are reachable and constitute the same cluster. The calculation also determines an ordered set of peer non-volatile solid state storage 152 that will also record the mapping of the authority to the non-volatile solid state storage, so that the authority may be determined even if the assigned non-volatile solid state storage is unreachable. A duplicate or substitute authority 168 may be consulted if a specific authority 168 is unavailable, in some embodiments.
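The two-stage operation described above can be illustrated with a minimal sketch. The authority count, the SHA-256 hash used for the first stage, and the explicit ordered candidate list used for the second stage are assumptions for illustration; the embodiments do not prescribe these specifics.

```python
# Minimal sketch of the two-stage lookup: stage one hashes an entity
# identifier (inode or segment number) into an authority identifier; stage two
# uses an explicit map from authority to the storage unit currently hosting
# it. The reachable-unit set is an input so the calculation is repeatable on
# every node. Authority count, hash, and map contents are illustrative.

import hashlib

NUM_AUTHORITIES = 512  # illustrative authority count (power of two)

def entity_to_authority(entity_id: int) -> int:
    """Stage one: entity ID -> authority ID via hash plus bit mask."""
    digest = hashlib.sha256(entity_id.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[:4], "little") & (NUM_AUTHORITIES - 1)

def authority_to_storage(authority_id: int, authority_map: dict, reachable: set):
    """Stage two: explicit mapping, falling back to a substitute when the
    assigned storage unit is unreachable."""
    for unit in authority_map[authority_id]:          # ordered peer list
        if unit in reachable:
            return unit
    raise RuntimeError("no reachable storage unit holds this authority")

# Usage: an explicit map of authority -> ordered candidate storage units.
authority_map = {a: [a % 4, (a + 1) % 4, (a + 2) % 4] for a in range(NUM_AUTHORITIES)}
reachable_units = {0, 1, 3}                           # unit 2 is offline
aid = entity_to_authority(entity_id=123456)
print("authority", aid, "-> storage unit",
      authority_to_storage(aid, authority_map, reachable_units))
```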
Referring to fig. 2A and 2B, two of the many tasks of the CPU 156 on a storage node 150 are to break up write data and to reassemble read data. When the system has determined that data is to be written, the authority 168 for that data is located as described above. When the segment ID for the data has been determined, the request to write is forwarded to the non-volatile solid state storage 152 currently determined to be the host of the authority 168 determined from the segment. The host CPU 156 of the storage node 150, on which the non-volatile solid state storage 152 and corresponding authority 168 reside, then breaks up or shards the data and transmits the data out to the various non-volatile solid state storage 152. The transmitted data is written as a data stripe in accordance with an erasure coding scheme. In some embodiments, data is requested to be pulled, and in other embodiments data is pushed. In reverse, when data is read, the authority 168 for the segment ID containing the data is located as described above. The host CPU 156 of the storage node 150, on which the non-volatile solid state storage 152 and corresponding authority 168 reside, requests the data from the non-volatile solid state storage and the corresponding storage nodes pointed to by the authority. In some embodiments, the data is read from flash storage as a data stripe. The host CPU 156 of the storage node 150 then reassembles the read data, correcting any errors (if present) according to the appropriate erasure coding scheme, and forwards the reassembled data to the network. In further embodiments, some or all of these tasks can be handled in the non-volatile solid state storage 152. In some embodiments, the segment host requests the data be sent to storage node 150 by requesting pages from storage and then sending the data to the storage node making the original request.
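The following sketch illustrates the write-path sharding described above in simplified form: the data is split into equal data shards, and a single XOR parity shard stands in for the (unspecified) erasure coding scheme; all names are illustrative.

```python
# Hedged sketch of the write path: the host breaks the data into equal-sized
# data shards, derives one XOR parity shard (a stand-in for the erasure coding
# scheme, which the embodiments leave open), and would send each shard to a
# different storage unit. Names and shard counts are illustrative.

def shard_for_write(data: bytes, data_shards: int):
    """Split data into equal data shards plus one XOR parity shard."""
    chunk = -(-len(data) // data_shards)              # ceiling division
    shards = [data[i * chunk:(i + 1) * chunk].ljust(chunk, b"\0")
              for i in range(data_shards)]
    parity = bytes(chunk)                             # all-zero parity buffer
    for shard in shards:
        parity = bytes(a ^ b for a, b in zip(parity, shard))
    return shards + [parity]

def reassemble(shards, original_length: int) -> bytes:
    """Reverse of shard_for_write when every data shard is readable."""
    return b"".join(shards[:-1])[:original_length]

payload = b"client data for one segment"
stripe = shard_for_write(payload, data_shards=4)
assert reassemble(stripe, len(payload)) == payload
print(f"{len(stripe) - 1} data shards + 1 parity shard, {len(stripe[0])} bytes each")
```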
In an embodiment, the authority 168 operates to determine how operations will proceed against a particular logical element. Each logical element may be operated on through a particular authority across a plurality of storage controllers of the storage system. The authority 168 may communicate with the plurality of storage controllers so that the plurality of storage controllers collectively perform operations against that particular logical element.
In an embodiment, a logical element may be, for example, a file, a directory, an object bucket, an individual object, a delineated part of a file or object, another form of key-value pair database, or a table. In embodiments, performing an operation may involve, for example, ensuring consistency, structural integrity, and/or recoverability with other operations against the same logical element, reading metadata and data associated with that logical element, determining what data should be written durably into the storage system to persist any changes of the operation, or determining where metadata and data can be stored across modular storage devices attached to a plurality of storage controllers in the storage system.
In some embodiments, the operations are token-based transactions to communicate efficiently within the distributed system. Each transaction may be accompanied by or associated with a token that gives permission to execute the transaction. In some embodiments, the permissions 168 can maintain the pre-transaction state of the system until the operation is complete. Token-based communication may be accomplished without a global lock across the system and may also be capable of restarting operations in the event of an interrupt or other failure.
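A minimal sketch of such a token-based transaction follows, assuming a hypothetical in-memory key-value store: the token grants permission to execute the transaction, and the pre-transaction state is retained until the transaction completes, so that an interrupted transaction can be rolled back without a global lock.

```python
# Illustrative sketch only (not the implementation of the embodiments) of a
# token-based transaction: a token accompanies the operation, the
# pre-transaction state is kept until commit, and an interrupted transaction
# can be aborted and rolled back without any system-wide lock.

import uuid

class TokenTransaction:
    def __init__(self, store: dict):
        self.store = store
        self.token = uuid.uuid4().hex      # permission to execute this transaction
        self.undo = {}                     # pre-transaction state, kept until commit

    def write(self, key, value):
        # Record the prior value once (None stands for "key was absent").
        self.undo.setdefault(key, self.store.get(key))
        self.store[key] = value

    def commit(self):
        self.undo.clear()                  # pre-transaction state no longer needed

    def abort(self):
        for key, old in self.undo.items():
            if old is None:
                self.store.pop(key, None)
            else:
                self.store[key] = old
        self.undo.clear()

store = {"a": 1}
txn = TokenTransaction(store)
txn.write("a", 2)
txn.abort()                                # interrupted: roll back to prior state
assert store == {"a": 1}
```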
In some systems, for example in UNIX-style file systems, data is handled with an index node, or inode, which specifies a data structure that represents an object in the file system. The object could be a file or a directory, for example. Metadata may accompany the object as attributes, such as permission data and a creation timestamp. A segment number may be assigned to all or a portion of such an object in the file system. In other systems, data segments are handled with segment numbers assigned elsewhere. For purposes of discussion, the unit of distribution is an entity, and an entity can be a file, a directory, or a segment. That is, an entity is a unit of data or metadata stored by the storage system. Entities are grouped into sets called authorities. Each authority has an authority owner, which is a storage node that has the exclusive right to update the entities in the authority. In other words, a storage node contains the authority, and the authority, in turn, contains entities.
A segment is a logical container of data, in accordance with some embodiments. A segment is an address space between a medium address space and the physical flash locations, i.e., the data segment number is in this address space. Segments may also contain metadata, which enables data redundancy to be restored (rewritten to different flash locations or devices) without the involvement of higher-level software. In one embodiment, the internal format of a segment contains client data and medium mappings to determine the location of that data. Each data segment is protected, e.g., from memory and other failures, by dividing the segment into a number of data and parity shards. The data and parity shards are distributed, i.e., striped, across the non-volatile solid state storage 152 coupled to the host CPUs 156 in accordance with an erasure coding scheme (see fig. 2E and 2G). Usage of the term segment refers to the container and its place in the address space of segments, in some embodiments. Usage of the term stripe refers to the same set of shards as a segment and includes how the shards are distributed along with redundancy or parity information, according to some embodiments.
A series of address space translations takes place across the entire storage system. At the top are the directory entries (file names), which link to an inode. The inode points into a medium address space, where the data is logically stored. Medium addresses may be mapped through a series of indirect mediums to spread the load of large files or to implement data services such as deduplication or snapshots. Segment addresses are then translated to physical flash memory locations. According to some embodiments, the physical flash memory locations have an address range bounded by the amount of flash memory in the system. Medium addresses and segment addresses are logical containers, and in some embodiments use 128-bit or larger identifiers so as to be practically infinite, with the likelihood of reuse calculated as longer than the expected life of the system. In some embodiments, addresses from logical containers are allocated in a hierarchical fashion. Initially, each non-volatile solid-state storage unit 152 may be assigned a range of address space. Within this assigned range, the non-volatile solid-state storage 152 is able to allocate addresses without synchronizing with other non-volatile solid-state storage 152.
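The chain of translations described above, together with the hierarchical allocation of address ranges, can be sketched as follows; the table contents, the per-unit address range, and the flash location format are illustrative assumptions.

```python
# Compact sketch of the address chain: directory entry -> inode -> medium
# address -> segment -> physical flash location. The tables, the per-unit
# address range, and the flash location tuple are illustrative assumptions.

import itertools

# Hierarchical allocation: each storage unit receives a disjoint range of the
# segment address space and hands out addresses locally, without
# synchronizing with its peers.
RANGE_PER_UNIT = 1 << 32
def make_allocator(unit_index: int):
    return itertools.count(unit_index * RANGE_PER_UNIT)

directory = {"/data/report.txt": 42}              # file name -> inode number
inode_to_medium = {42: 0xABCDE}                   # inode -> medium address
medium_to_segment = {0xABCDE: next(make_allocator(unit_index=3))}
segment_to_flash = {medium_to_segment[0xABCDE]: ("die 7", "page 1024")}

def resolve(path: str):
    inode = directory[path]
    medium = inode_to_medium[inode]
    segment = medium_to_segment[medium]
    return segment_to_flash[segment]

print(resolve("/data/report.txt"))                # -> ('die 7', 'page 1024')
```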
The data and metadata are stored by a set of underlying storage layouts that are optimized for varying workload patterns and storage devices. These layouts incorporate multiple redundancy schemes, compression formats, and indexing algorithms. Some of these layouts store information about authorities and authority masters, while others store file metadata and file data. The redundancy schemes include error correction codes that tolerate corrupted bits within a single storage device (such as a NAND flash memory chip), erasure codes that tolerate the failure of multiple storage nodes, and replication schemes that tolerate data center or regional failures. In some embodiments, low density parity check ("LDPC") codes are used within a single storage unit. In some embodiments, Reed-Solomon encoding is used within a storage cluster, and mirroring is used within a storage grid. Metadata may be stored using an ordered log-structured index (such as a log-structured merge tree), and large data may not be stored in a log-structured layout.
To maintain consistency across multiple copies of an entity, the storage nodes agree implicitly on two things through calculation: (1) the authority that contains the entity, and (2) the storage node that contains the authority. The assignment of entities to authorities can be done by pseudorandomly assigning entities to authorities, by splitting entities into ranges based on an externally produced key, or by placing a single entity into each authority. Examples of pseudorandom schemes are linear hashing and the family of hashes known as replication under scalable hashing ("RUSH"), including controlled replication under scalable hashing ("CRUSH"). In some embodiments, pseudorandom assignment is used only for assigning authorities to nodes, because the set of nodes can change. The set of authorities cannot change, so any subjective function may be applied in these embodiments. Some placement schemes automatically place authorities on storage nodes, while other placement schemes rely on an explicit mapping of authorities to storage nodes. In some embodiments, a pseudorandom scheme is used to map from each authority to a set of candidate authority owners. A pseudorandom data distribution function related to CRUSH may assign authorities to storage nodes and create a list of where the authorities are assigned. Each storage node has a copy of the pseudorandom data distribution function and can arrive at the same calculation for distributing, and later finding or locating, an authority. In some embodiments, each of the pseudorandom schemes requires the set of reachable storage nodes as input in order to arrive at the same target nodes. Once an entity has been placed in an authority, the entity may be stored on physical devices so that no expected failure leads to unexpected data loss. In some embodiments, rebalancing algorithms attempt to store the copies of all entities within an authority in the same layout and on the same set of machines.
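As an illustration of such a pseudorandom placement calculation, the following sketch uses rendezvous (highest-random-weight) hashing as a stand-in for the RUSH/CRUSH family: any node that evaluates the same function over the same set of reachable nodes derives the same ordered list of candidate authority owners. This is an illustrative scheme, not the specific function of the embodiments.

```python
# Hedged stand-in for the RUSH/CRUSH-style placement described above, using
# rendezvous (highest-random-weight) hashing: every storage node that runs
# the same function over the same reachable-node set derives the same ordered
# list of candidate authority owners, so no central directory is needed.
# Node names and the copy count are illustrative.

import hashlib

def candidate_owners(authority_id: int, reachable_nodes: list, copies: int = 3):
    def weight(node: str) -> int:
        h = hashlib.sha256(f"{authority_id}:{node}".encode()).digest()
        return int.from_bytes(h[:8], "big")
    return sorted(reachable_nodes, key=weight, reverse=True)[:copies]

nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
print(candidate_owners(authority_id=17, reachable_nodes=nodes))
# Removing an unreachable node only changes placements that involved it,
# which matches taking the reachable set as input to the calculation.
print(candidate_owners(authority_id=17, reachable_nodes=nodes[:-1]))
```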
Examples of expected failures include device failures, stolen machines, data center fires, and regional disasters such as nuclear or geological events. Different failures lead to different levels of acceptable data loss. In some embodiments, a stolen storage node affects neither the security nor the reliability of the system, while, depending on the system configuration, a regional event could lead to no loss of data, a few seconds or minutes of lost updates, or even complete data loss.
In an embodiment, the placement of data for storing redundancy is independent of the placement of rights for data consistency. In some embodiments, the storage nodes that contain the rights do not contain any persistent storage. Instead, the storage node is connected to a non-volatile solid state storage unit that does not contain rights. The communication interconnections between storage nodes and non-volatile solid state storage units are made up of a variety of communication technologies and have non-uniform performance and fault tolerance characteristics. In some embodiments, as described above, the nonvolatile solid state storage units are connected to storage nodes via PCI express, the storage nodes are connected together within a single chassis using an Ethernet backplane, and the chassis are connected together to form a storage cluster. In some embodiments, the storage clusters are connected to the clients using ethernet or fibre channel. If multiple storage clusters are configured into a storage grid, the multiple storage clusters are connected using the Internet or other long-range network links (e.g., a "metropolitan-scale" link or a dedicated link that does not traverse the Internet).
The authority owner has the exclusive right to modify entities, to migrate entities from one non-volatile solid state storage unit to another, and to add and remove copies of entities. This allows the redundancy of the underlying data to be maintained. When an authority owner fails, is going to be decommissioned, or is overloaded, the authority is transferred to a new storage node. Transient failures make it important to ensure that all non-faulty machines agree upon the new authority location. The ambiguity that arises due to transient failures can be resolved automatically by a consensus protocol (e.g., Paxos or a hot-warm failover scheme), or via manual intervention by a remote system administrator or by a local hardware administrator (such as by physically unplugging the failed machine from the cluster, or pressing a button on the failed machine). In some embodiments, a consensus protocol is used and failover is automatic. According to some embodiments, if too many failures or replication events occur in too short a time period, the system goes into a self-preservation mode and halts replication and data movement activities until an administrator intervenes.
When authorities are transferred between storage nodes and authority owners update entities in their authorities, the system transfers messages between the storage nodes and the non-volatile solid state storage units. With regard to persistent messages, messages that have different purposes are of different types. Depending on the type of message, the system maintains different ordering and durability guarantees. As the persistent messages are being processed, the messages are temporarily stored in multiple durable and non-durable storage hardware technologies. In some embodiments, messages are stored in RAM, NVRAM, and on NAND flash devices, and a variety of protocols are used in order to make efficient use of each storage medium. Latency-sensitive client requests may be persisted in replicated NVRAM and then later in NAND, while background rebalancing operations are persisted directly to NAND.
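The type-dependent durability policy described above might be sketched as follows; the message types, tier names, and policy table are assumptions for illustration.

```python
# Illustrative sketch of storing messages in different hardware tiers by
# message type: latency-sensitive client requests go to replicated NVRAM and
# are later destaged to NAND, while background rebalancing messages go
# straight to NAND. The tier names and policy table are assumptions.

from enum import Enum, auto

class MessageType(Enum):
    CLIENT_REQUEST = auto()
    BACKGROUND_REBALANCE = auto()
    TRANSIENT_STATUS = auto()

DURABILITY_POLICY = {
    MessageType.CLIENT_REQUEST: ["nvram-replica-1", "nvram-replica-2", "nand"],
    MessageType.BACKGROUND_REBALANCE: ["nand"],
    MessageType.TRANSIENT_STATUS: ["ram"],
}

def persist(message: bytes, kind: MessageType, tiers: dict):
    """Append the message to every tier its type requires (stand-in writes)."""
    for tier in DURABILITY_POLICY[kind]:
        tiers.setdefault(tier, []).append(message)
    return DURABILITY_POLICY[kind]

tiers = {}
print(persist(b"write 4 KiB at LBA 100", MessageType.CLIENT_REQUEST, tiers))
```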
Persistent messages are persistently stored prior to being transmitted. This allows the system to continue to serve client requests despite failures and component replacement. Although many hardware components contain unique identifiers that are visible to system administrators, manufacturers, hardware supply chains, and an ongoing monitoring quality control infrastructure, applications running on top of the infrastructure address virtualized addresses. These virtualized addresses do not change over the lifetime of the storage system, regardless of whether a component fails or is replaced. This allows each component of the storage system to be replaced over time without reconfiguration or disruption of client request processing, i.e., the system supports non-disruptive upgrades.
In some embodiments, virtualized addresses are stored with sufficient redundancy. The continuous monitoring system correlates hardware and software status with a hardware identifier. This allows for detection and prediction of faults due to faulty components and manufacturing details. In some embodiments, by removing components from the critical path, the monitoring system is also able to actively transfer rights and entities from the affected device before the failure occurs.
FIG. 2C is a multi-level block diagram showing the contents of storage node 150 and the contents of non-volatile solid-state storage 152 of storage node 150. In some embodiments, data is transferred to storage node 150 and from storage node 150 by network interface controller ("NIC") 202. As discussed above, each storage node 150 has a CPU 156 and one or more non-volatile solid state storage devices 152. Moving one stage down in fig. 2C, each non-volatile solid-state storage 152 has relatively fast non-volatile solid-state memory, such as non-volatile random access memory ("NVRAM") 204 and flash memory 206. In some embodiments, NVRAM 204 may be a component (DRAM, MRAM, PCM) that does not require a program/erase cycle, and may be memory that may support being written to more frequently than memory is read. Moving another stage down in fig. 2C, NVRAM 204 is implemented in one embodiment as high-speed volatile memory, such as Dynamic Random Access Memory (DRAM) 216 supported by energy retainer 218. The energy retainer 218 provides sufficient power to keep the DRAM 216 powered for a sufficient time to transfer content to the flash memory 206 in the event of a power failure. In some embodiments, the energy retainer 218 is a capacitor, super capacitor, battery, or other device that supplies a suitable supply of energy sufficient to be able to transfer the contents of the DRAM 216 to a stable storage medium in the event of a power loss. The flash memory 206 is implemented as a plurality of flash memory dies 222, which may be referred to as a package of flash memory dies 222 or an array of flash memory dies 222. It should be appreciated that flash memory die 222 may be packaged in any of a variety of ways, a single die per package, multiple dies per package (i.e., multi-chip package), hybrid packages as die on a printed circuit board or other substrate, as encapsulated die, etc. In the illustrated embodiment, the non-volatile solid-state storage 152 has a controller 212 or other processor, and an input output (I/O) port 210 coupled to the controller 212. The I/O port 210 is coupled to the CPU 156 and/or the network interface controller 202 of the flash memory storage node 150. A flash memory input output (I/O) port 220 is coupled to a flash memory die 222, and a direct memory access unit (DMA) 214 is coupled to the controller 212, the DRAM 216, and the flash memory die 222. In the embodiment shown, I/O ports 210, controller 212, DMA unit 214, and flash memory I/O ports 220 are implemented on a programmable logic device ("PLD") 208, such as an FPGA. In this embodiment, each flash memory die 222 has pages organized as 16kB (kilobyte) pages 224, and registers 226 through which data can be written to the flash memory die 222 or read from the flash memory die 222. In further embodiments, other types of solid state memory are used instead of or in addition to the flash memory illustrated within flash memory die 222.
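The energy-reserve behavior of the NVRAM 204 described above can be sketched as follows, with assumed names throughout: writes land in DRAM while external power is present, the DRAM contents are copied to flash on power loss, and they are restored from flash at the next power-on.

```python
# Minimal sketch (assumed names and structures throughout) of the
# energy-retention behavior described for NVRAM 204: normal writes land in
# DRAM; on power loss the energy reserve powers a copy of the DRAM contents
# to flash; at the next power-on the contents are restored from flash.

class NvramEmulation:
    def __init__(self):
        self.dram = {}          # fast, volatile contents
        self.flash = {}         # stable storage used across power loss

    def write(self, offset: int, data: bytes):
        self.dram[offset] = data

    def on_power_loss(self):
        # Runs on the energy reserve (capacitor/battery) before DRAM fades.
        self.flash = dict(self.dram)
        self.dram.clear()

    def on_power_restore(self):
        self.dram = dict(self.flash)

nv = NvramEmulation()
nv.write(0, b"pending update")
nv.on_power_loss()
nv.on_power_restore()
assert nv.dram[0] == b"pending update"
```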
In general, storage cluster 161 may be contrasted with storage arrays in the various embodiments disclosed herein. The storage nodes 150 are part of a collection that creates the storage cluster 161. Each storage node 150 owns a slice of the data and the computation required to provide that data. Multiple storage nodes 150 cooperate to store and retrieve the data. Storage memory or storage devices, as typically used in storage arrays, are less involved in processing and manipulating the data. A storage memory or storage device in a storage array receives commands to read, write, or erase data. The storage memory or storage devices in a storage array are not aware of the larger system in which they are embedded, or of what the data means. Storage memory or storage devices in a storage array may include various types of storage memory, such as RAM, solid state drives, hard disk drives, and so on. The storage units 152 described herein have multiple interfaces that are active simultaneously and serve multiple purposes. In some embodiments, some of the functionality of a storage node 150 is shifted into a storage unit 152, transforming the storage unit 152 into a combination of storage unit 152 and storage node 150. Placing computation (relative to the stored data) into the storage unit 152 places this computation closer to the data itself. The various system embodiments have a hierarchy of storage node layers with different capabilities. In contrast, in a storage array a controller owns and knows everything about all of the data that the controller manages in a shelf or storage devices. In storage cluster 161, as described herein, multiple controllers in multiple storage units 152 and/or storage nodes 150 cooperate in various ways (e.g., for erasure coding, data sharding, metadata communication and redundancy, storage capacity expansion or contraction, data recovery, and so on).
Fig. 2D shows a storage server environment that uses embodiments of storage nodes 150 and storage units 152 of fig. 2A-C. In this version, each storage unit 152 has a processor (e.g., controller 212 (see fig. 2C)), FPGA, flash memory 206, and NVRAM 204 (which are supercapacitor-backed DRAM 216, see fig. 2B and 2C) on a PCIe (peripheral component interconnect express) board in chassis 138 (see fig. 2A). Storage unit 152 may be implemented as a single board containing storage devices and may be the largest fault tolerant domain within the chassis. In some embodiments, up to two storage units 152 may fail and the device will continue to operate without losing data.
In some embodiments, the physical storage is divided into named regions based on application usage. The NVRAM 204 is a contiguous block of reserved memory in the storage unit 152 DRAM 216 and is backed by NAND flash. The NVRAM 204 is logically divided into multiple memory regions that are written as spools (e.g., spool_regions). Space within the NVRAM 204 spools is managed by each authority 168 independently. Each device provides an amount of storage space to each authority 168. That authority 168 further manages lifetimes and allocations within the space. Examples of spools include distributed transactions or notions. When the primary power to a storage unit 152 fails, onboard supercapacitors provide a short duration of power hold-up. During this hold-up interval, the contents of the NVRAM 204 are flushed to the flash memory 206. At the next power-on, the contents of the NVRAM 204 are recovered from the flash memory 206.
As for the storage unit controller, the responsibility of the logical "controller" is distributed across each blade that contains authorities 168. This distribution of logical control is shown in fig. 2D as a host controller 242, an intermediate level controller 244, and storage unit controllers 246. Management of the control plane and the storage plane is handled independently, although parts may be physically co-located on the same blade. Each authority 168 effectively serves as an independent controller. Each authority 168 provides its own data and metadata structures, its own background workers, and maintains its own lifecycle.
FIG. 2E is a block diagram of blade 252 hardware showing a control plane 254, compute and store planes 256, 258, and permissions 168 for interacting with underlying physical resources using embodiments of storage nodes 150 and storage units 152 of FIGS. 2A-C in the storage server environment of FIG. 2D. The control plane 254 is partitioned into a plurality of permissions 168 that may run on any blade 252 using computing resources in the computing plane 256. The memory plane 258 is partitioned into a set of devices, each providing access to the flash memory 206 and NVRAM 204 resources. In one embodiment, the computing plane 256 may perform operations of a storage array controller on one or more devices (e.g., storage arrays) of the storage plane 258, as described herein.
In the compute and storage planes 256, 258 of fig. 2E, the authorities 168 interact with the underlying physical resources (i.e., devices). From the point of view of an authority 168, its resources are striped over all of the physical devices. From the point of view of a device, it supplies resources to all authorities 168, irrespective of where the authorities happen to run. Each authority 168 has allocated or has been allocated one or more partitions 260 of storage memory in the storage units 152, e.g., partitions 260 in flash memory 206 and NVRAM 204. Each authority 168 uses those allocated partitions 260 that belong to it for writing or reading user data. Authorities can be associated with differing amounts of physical storage of the system. For example, one authority 168 could have a larger number of partitions 260 or larger-sized partitions 260 in one or more storage units 152 than one or more other authorities 168.
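The uneven association of partitions 260 with authorities 168 described above might be sketched as a proportional allocation by shares; the share weights and the rounding rule below are illustrative assumptions.

```python
# Hedged sketch of uneven partition assignment: each authority receives a
# number of partitions proportional to an assumed "share" weight, so one
# authority can own more physical storage than another. The weights and the
# remainder-distribution rule are illustrative, not prescribed by the text.

def assign_partitions(total_partitions: int, authority_shares: dict) -> dict:
    total_share = sum(authority_shares.values())
    counts = {a: (s * total_partitions) // total_share
              for a, s in authority_shares.items()}
    # Hand out any remainder one partition at a time, largest share first.
    leftover = total_partitions - sum(counts.values())
    for a in sorted(authority_shares, key=authority_shares.get, reverse=True)[:leftover]:
        counts[a] += 1
    return counts

print(assign_partitions(64, {"authority-1": 3, "authority-2": 1, "authority-3": 1}))
# -> {'authority-1': 39, 'authority-2': 13, 'authority-3': 12}
```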
FIG. 2F depicts elasticity software layers in blades 252 of a storage cluster, according to some embodiments. In the elasticity structure, the elasticity software is symmetric, i.e., the compute module 270 of each blade runs the three identical layers of processes depicted in FIG. 2F. Storage managers 274 execute read and write requests from the other blades 252 for data and metadata stored in the local storage unit 152 NVRAM 204 and flash memory 206. Authorities 168 fulfill client requests by issuing the necessary reads and writes to the blades 252 on whose storage units 152 the corresponding data or metadata resides. Endpoints 272 parse client connection requests received from the switch fabric 146 supervisory software, relay the client connection requests to the authorities 168 responsible for fulfillment, and relay the responses of the authorities 168 to the clients. The symmetric three-layer structure enables a high degree of concurrency in the storage system. In these embodiments, elasticity scales out efficiently and reliably. In addition, elasticity implements a unique scale-out technique that balances work evenly across all resources regardless of client access pattern, and maximizes concurrency by eliminating much of the inter-blade coordination that typically occurs with conventional distributed locking.
Still referring to fig. 2F, the authorities 168 running in the compute modules 270 of a blade 252 perform the internal operations required to fulfill client requests. One feature of elasticity is that the authorities 168 are stateless, i.e., they cache active data and metadata in their own blade 252 DRAM for fast access, but each authority stores every update in its NVRAM 204 partitions on three separate blades 252 until the update has been written to flash memory 206. In some embodiments, all storage system writes to NVRAM 204 are made in triplicate to partitions on three separate blades 252. With triple-mirrored NVRAM 204 and persistent storage protected by parity and Reed-Solomon RAID checksums, the storage system can survive the concurrent failure of two blades 252 without losing data, metadata, or access to either.
Because the authority 168 is stateless, it may migrate between blades 252. Each authority 168 has a unique identifier. NVRAM 204 and flash memory 206 partitions are associated with identifiers of permissions 168, rather than blades 252 in which they operate. Thus, as the authority 168 migrates, the authority 168 continues to manage the same memory partition from its new location. When a new blade 252 is installed in an embodiment of a storage cluster, the system automatically rebalances the load by: the storage of the new blade 252 is partitioned for use by the system's permissions 168, the selected permissions 168 are migrated to the new blade 252, the endpoint 272 on the new blade 252 is started, and included in the client connection distribution algorithm of the switch fabric 146.
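The rebalancing step described above might be sketched as follows, with assumed data structures: when a new blade is added, authorities are migrated from more heavily loaded blades until the per-blade counts even out, and each authority keeps its identifier so that its partitions follow it to the new blade.

```python
# Illustrative sketch (assumed structures) of rebalancing when a blade is
# installed: selected authorities are migrated to the new blade until the
# per-blade authority counts are roughly even. An authority keeps its
# identifier, so its NVRAM/flash partitions remain associated with it.

def rebalance(blade_to_authorities: dict, new_blade: str) -> dict:
    blade_to_authorities.setdefault(new_blade, [])
    total = sum(len(auths) for auths in blade_to_authorities.values())
    target = -(-total // len(blade_to_authorities))      # ceiling of the average
    for blade, auths in blade_to_authorities.items():
        while blade != new_blade and len(auths) > target and \
                len(blade_to_authorities[new_blade]) < target:
            blade_to_authorities[new_blade].append(auths.pop())
    return blade_to_authorities

cluster = {"blade-1": [1, 2, 3, 4], "blade-2": [5, 6, 7, 8]}
print(rebalance(cluster, "blade-3"))
# -> roughly three authorities per blade after migration
```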
From its new location, the migrated authority 168 saves its contents of the NVRAM 204 partition on the flash memory 206, processes read and write requests from other authorities 168, and satisfies client requests directed to it by endpoint 272. Similarly, if a blade 252 fails or is removed, the system redistributes its rights 168 among the remaining blades 252 of the system. The redistributed rights 168 continue to perform their original functions from their new locations.
FIG. 2G depicts authorities 168 and storage resources in blades 252 of a storage cluster, according to some embodiments. Each authority 168 is exclusively responsible for a partition of the flash memory 206 and NVRAM 204 on each blade 252. The authority 168 manages the content and integrity of its partitions independently of the other authorities 168. The authority 168 compresses incoming data and temporarily preserves it in its NVRAM 204 partitions, and then consolidates, RAID-protects, and persists the data in segments of storage in its flash memory 206 partitions. As the authority 168 writes data to flash memory 206, the storage manager 274 performs the necessary flash translation to optimize write performance and maximize media longevity. In the background, the authority 168 "garbage collects," or reclaims space occupied by data that clients have made obsolete by overwriting the data. It should be appreciated that since the partitions of the authorities 168 are disjoint, no distributed locking is required to execute client reads and writes or to perform background functions.
The embodiments described herein may utilize various software, communication, and/or networking protocols. In addition, the configuration of the hardware and/or software may be adjusted to accommodate various protocols. For example, the embodiments may utilize Active Directory, a database-based system that provides authentication, directory, policy, and other services in a WINDOWS™ environment. In these embodiments, LDAP (lightweight directory access protocol) is one example application protocol for querying and modifying items in a directory service provider such as Active Directory. In some embodiments, a network lock manager ("NLM") is used as a facility that works in cooperation with the network file system ("NFS") to provide System V style advisory file and record locking over a network. The server message block ("SMB") protocol, one version of which is also known as the common internet file system ("CIFS"), may be integrated with the storage systems discussed herein. SMB operates as an application-layer network protocol and is commonly used to provide shared access to files, printers, and serial ports, as well as miscellaneous communications between nodes on a network. SMB also provides an authenticated inter-process communication mechanism. AMAZON™ S3 (simple storage service) is a web service offered by Amazon Web Services, and the systems described herein may interface with Amazon S3 through web services interfaces (REST (representational state transfer), SOAP (simple object access protocol), and BitTorrent). A RESTful API (application programming interface) breaks down a transaction to create a series of small modules. Each module addresses a particular underlying part of the transaction. The control or permissions provided with these embodiments, especially for object data, may include the use of access control lists ("ACLs"). An ACL is a list of permissions attached to an object, and the ACL specifies which users or system processes are granted access to the object, as well as which operations are allowed on the given object. The systems may utilize Internet Protocol version 6 ("IPv6"), as well as IPv4, for the communication protocol that provides an identification and location system for computers on the network and routes traffic across the internet. The routing of packets between networked systems may include equal-cost multi-path routing ("ECMP"), a routing strategy in which next-hop packet forwarding to a single destination can occur over multiple "best paths" that tie for top place in routing metric calculations. Multi-path routing can be used in conjunction with most routing protocols because it is a per-hop decision limited to a single router. The software may support multi-tenancy, an architecture in which a single instance of a software application serves multiple customers. Each customer may be referred to as a tenant. In some embodiments, tenants may be given the ability to customize some parts of the application, but may not customize the application's code. The embodiments may maintain audit logs. An audit log is a document that records events in a computing system. In addition to documenting which resources were accessed, audit log entries typically include destination and source addresses, a timestamp, and user login information for compliance with various regulations. The embodiments may support various key management policies, such as encryption key rotation. In addition, the system may support dynamic root passwords or some variation of dynamically changing passwords.
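As an illustration of the ACL concept mentioned above, the following sketch attaches a list of permissions to an object and checks whether a given user or system process may perform a given operation; the structure and operation names are assumptions for illustration.

```python
# Minimal sketch of an access control list ("ACL") attached to an object: the
# list names which users or system processes may perform which operations.
# The object path, principals, and operation names are illustrative only.

OBJECT_ACL = {
    "bucket/reports/q3.pdf": [
        {"principal": "alice", "operations": {"read", "write"}},
        {"principal": "audit-service", "operations": {"read"}},
    ],
}

def is_allowed(obj: str, principal: str, operation: str) -> bool:
    """Return True if any ACL entry grants the principal this operation."""
    return any(entry["principal"] == principal and operation in entry["operations"]
               for entry in OBJECT_ACL.get(obj, []))

print(is_allowed("bucket/reports/q3.pdf", "audit-service", "read"))    # True
print(is_allowed("bucket/reports/q3.pdf", "audit-service", "write"))   # False
```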
Fig. 3A sets forth a diagram of a storage system 306 coupled for data communication with a cloud service provider 302 according to some embodiments of the present disclosure. Although described in less detail, the storage system 306 described in fig. 3A may be similar to the storage system described above with reference to fig. 1A-1D and 2A-2G. In some embodiments, the storage system 306 depicted in fig. 3A may be embodied as a storage system including unbalanced active/active controllers, a storage system including balanced active/active controllers, a storage system including active/active controllers in which less than all of the resources of each controller are utilized such that each controller has spare resources available to support failover, a storage system including fully active/active controllers, a storage system including dataset isolation controllers, a storage system including a dual-layer architecture with front-end controllers and back-end integrated storage controllers, a storage system including a laterally-extending cluster of dual-controller arrays, and combinations of such embodiments.
In the example depicted in fig. 3A, storage system 306 is coupled to cloud service provider 302 via data communication link 304. The data communication link 304 may be embodied as a dedicated data communication link, a data communication path provided through one or more data communication networks using, for example, a wide area network ("WAN") or LAN, or some other mechanism capable of transmitting digital information between the storage system 306 and the cloud service provider 302. This data communication link 304 may be entirely wired, entirely wireless, or some aggregation of wired and wireless data communication paths. In this example, digital information may be exchanged between storage system 306 and cloud service provider 302 via data communication link 304 using one or more data communication protocols. For example, digital information may be exchanged between storage system 306 and cloud service provider 302 via data communication link 304 using a handheld device transport protocol ("HDTP"), hypertext transport protocol ("HTTP"), internet protocol ("IP"), real-time transport protocol ("RTP"), transmission control protocol ("TCP"), user datagram protocol ("UDP"), wireless application protocol ("WAP"), or other protocol.
The cloud service provider 302 depicted in fig. 3A may be embodied, for example, as a system and computing environment that provides a large number of services to users of the cloud service provider 302 through shared computing resources via data communication links 304. Cloud service provider 302 may provide on-demand access to a shared pool of configurable computing resources, such as computer networks, servers, storage devices, applications, and services. The shared pool of configurable resources may be quickly provisioned and released to users of cloud service provider 302 with minimal management effort. Typically, the user of cloud service provider 302 is unaware of the exact computing resources that cloud service provider 302 uses to provide the service. Although in many cases this cloud service provider 302 may be accessible via the internet, readers of skill in the art will recognize that any system that abstracts the use of shared resources to provide services to users over any data communications link may be considered a cloud service provider 302.
In the example depicted in fig. 3A, cloud service provider 302 may be configured to provide various services to storage system 306 and users of storage system 306 through implementation of various service models. For example, cloud service provider 302 may be configured to provide services by implementing an infrastructure as a service ("IaaS") service model, by implementing a platform as a service ("PaaS") service model, by implementing a software as a service ("SaaS") service model, by implementing an authentication as a service ("AaaS") service model, by implementing a storage as a service model in which cloud service provider 302 provides access to its storage infrastructure for use by storage system 306 and users of storage system 306, and so forth. Readers will appreciate that cloud service provider 302 may be configured to provide additional services to storage system 306 and users of storage system 306 by implementing additional service models, as the above-described service models are included for illustrative purposes only and are in no way representative of restrictions on services that cloud service provider 302 may provide or restrictions on service models that cloud service provider 302 may implement.
In the example depicted in fig. 3A, cloud service provider 302 may be embodied, for example, as a private cloud, a public cloud, or a combination of the two. In embodiments where cloud service provider 302 is embodied as a private cloud, cloud service provider 302 may be dedicated to providing services to a single organization rather than to multiple organizations. In embodiments where cloud service provider 302 is embodied as a public cloud, cloud service provider 302 may provide services to multiple organizations. In still alternative embodiments, cloud service provider 302 may be embodied as a mix of private and public cloud services with a hybrid cloud deployment.
Although not explicitly depicted in fig. 3A, readers will appreciate that a large number of additional hardware components and additional software components may be necessary to facilitate the delivery of cloud services to storage system 306 and users of storage system 306. For example, storage system 306 may be coupled to (or even include) a cloud storage gateway. Such a cloud storage gateway may be embodied, for example, as a hardware-based or software-based appliance that is located on premises with the storage system 306. Such a cloud storage gateway may operate as a bridge between local applications executing on storage system 306 and remote, cloud-based storage utilized by storage system 306. Through the use of a cloud storage gateway, an organization may move primary iSCSI or NAS to cloud service provider 302, enabling the organization to save space on its internal storage systems. Such a cloud storage gateway may be configured to emulate a disk array, a block-based device, a file server, or other storage system that can translate SCSI commands, file server commands, or other suitable commands into REST-space protocols that facilitate communication with cloud service provider 302.
In order to enable storage system 306 and users of storage system 306 to utilize services provided by cloud service provider 302, a cloud migration process may occur during which data, applications, or other elements from an organization's local system (or even from another cloud environment) are moved to cloud service provider 302. To successfully migrate data, applications, or other elements to the environment of cloud service provider 302, middleware, such as a cloud migration tool, may be utilized to bridge the gap between the environment of cloud service provider 302 and the environment of the organization. Such cloud migration tools may also be configured to address potentially high network costs and long transfer times associated with migrating large amounts of data to cloud service provider 302, as well as to address security issues associated with migrating sensitive data to cloud service provider 302 over a data communication network. To further enable storage system 306 and users of storage system 306 to utilize services provided by cloud service provider 302, cloud orchestrators may also be used to arrange and coordinate automation tasks in pursuit of creating integrated processes or workflows. Such a cloud orchestrator may perform tasks such as configuring various components (whether they are cloud components or internal components) and managing interconnections between such components. The cloud orchestrator may simplify communication and connections between components to ensure proper configuration and maintain links.
In the example depicted in fig. 3A, and as briefly described above, cloud service provider 302 may be configured to eliminate the need to install and run applications on local computers by providing services to storage system 306 and users of storage system 306 using the SaaS service model, which may simplify the maintenance and support of applications. Such applications may take a variety of forms according to various embodiments of the present disclosure. For example, cloud service provider 302 may be configured to provide access to data analysis applications to storage system 306 and users of storage system 306. Such data analysis applications may be configured to receive, for example, a large amount of telemetry data sent home by the storage system 306. Such telemetry data may describe various operational characteristics of the storage system 306 and may be analyzed for various purposes including, for example, determining health of the storage system 306, identifying workloads executing on the storage system 306, predicting when the storage system 306 will consume various resources, recommending configuration changes, hardware or software upgrades, workflow migration, or other actions that may improve operation of the storage system 306.
Cloud service provider 302 may also be configured to provide access to virtualized computing environments to storage system 306 and users of storage system 306. Such virtualized computing environment may be embodied as, for example, a virtual machine or other virtualized computer hardware platform, virtual storage, virtualized computer network resources, and the like. Examples of such virtualized environments may include virtual machines created to simulate actual computers, virtualized desktop environments that separate logical desktops from physical machines, virtualized file systems that allow uniform access to different types of specific file systems, and so forth.
Although the example depicted in fig. 3A illustrates storage system 306 being coupled for data communication with cloud service provider 302, in other embodiments storage system 306 may be part of a hybrid cloud deployment in which private cloud elements (e.g., private cloud services, on-premises infrastructure, etc.) and public cloud elements (e.g., public cloud services, infrastructure, etc., that may be provided by one or more cloud service providers) are combined to form a single solution, with orchestration among the various platforms. Such a hybrid cloud deployment may utilize hybrid cloud management software, such as Azure™ Arc from Microsoft™, which centralizes the management of the hybrid cloud deployment across any infrastructure and enables the deployment of services anywhere. In this example, the hybrid cloud management software may be configured to create, update, and delete resources (both physical and virtual) that form the hybrid cloud deployment, to allocate compute and storage to specific workloads, to monitor workloads and resources for performance, policy compliance, updates and patches, and security status, or to perform various other tasks.
Readers will appreciate that by pairing the storage system described herein with one or more cloud service providers, various products may be enabled. For example, disaster recovery as a service ("DRaaS") may be provided in which cloud resources are utilized to protect applications and data from disruption caused by a disaster, including in embodiments where a storage system may be used as a primary data storage device. In such embodiments, an overall system backup may be performed, which allows for maintenance of business continuity in the event of a system failure. In such embodiments, cloud data backup techniques (either by themselves or as part of a larger DRaaS solution) may also be integrated into an overall solution that includes the storage system and cloud service provider described herein.
The storage systems described herein and cloud service providers may be used to provide a wide range of security features. For example, the storage system may encrypt static data (and data may be sent to or from the encrypted storage system) and may utilize a key management as a service ("KMaaS") to manage encryption keys, keys for locking and unlocking storage devices, and so forth. Also, a cloud data security gateway or similar mechanism may be utilized to ensure that data stored in the storage system is not incorrectly stored in the cloud as part of a cloud data backup operation. Furthermore, micro-segmentation or identity-based segmentation may be utilized within a data center or cloud service provider containing the storage system to create secure zones in the data center and cloud deployment to enable isolation between workloads.
For further explanation, fig. 3B sets forth a diagram of a storage system 306 according to some embodiments of the present disclosure. Although described in less detail, the storage system 306 depicted in fig. 3B may be similar to the storage system described above with reference to fig. 1A-1D and 2A-2G, as the storage system may include many of the above-described components.
The storage system 306 depicted in fig. 3B may include a large number of storage resources 308, which may be embodied in a variety of forms. For example, the storage resources 308 may include nano-RAM or another form of non-volatile random access memory that utilizes carbon nanotubes deposited on a substrate, 3D cross-point non-volatile memory, flash memory including single-level cell ("SLC") NAND flash memory, multi-level cell ("MLC") NAND flash memory, triple-level cell ("TLC") NAND flash memory, quad-level cell ("QLC") NAND flash memory, and so on. Likewise, the storage resources 308 may include non-volatile magnetoresistive random access memory ("MRAM"), including spin transfer torque ("STT") MRAM. The example storage resources 308 may alternatively include non-volatile phase change memory ("PCM"), quantum memory that allows for the storage and retrieval of photonic quantum information, resistive random access memory ("ReRAM"), storage class memory ("SCM"), or other forms of storage resources, including any combination of the resources described herein. Readers will appreciate that the storage systems described above may utilize other forms of computer memory and storage devices, including DRAM, SRAM, EEPROM, universal memory, and many others. The storage resources 308 depicted in fig. 3B may be embodied in a variety of form factors, including but not limited to dual in-line memory modules ("DIMMs"), non-volatile dual in-line memory modules ("NVDIMMs"), M.2, U.2, and others.
The storage resources 308 depicted in fig. 3B may include various forms of SCM. SCM may effectively treat fast, non-volatile memory (e.g., NAND flash memory) as an extension of DRAM, such that an entire data set may be treated as an in-memory data set that resides entirely in DRAM. SCM may include non-volatile media such as NAND flash memory. Such NAND flash memory may be accessed using NVMe, which may use the PCIe bus as its transport, providing relatively low access latencies compared to older protocols. In fact, the network protocols used for SSDs in all-flash arrays may include NVMe using Ethernet (ROCE, NVMe TCP), Fibre Channel (NVMe FC), InfiniBand (iWARP), and others that make it possible to treat fast, non-volatile memory as an extension of DRAM. In view of the fact that DRAM is typically byte-addressable while fast, non-volatile memory such as NAND flash memory is block-addressable, a controller software/hardware stack may be needed to convert the block data into the bytes that are stored in the media. Examples of media and software that may be used as SCM include, for example, 3D XPoint, Intel Memory Drive Technology, Samsung Z-SSD, and others.
The storage resources 308 depicted in fig. 3B may also include racetrack memory (also referred to as domain-wall memory). Such racetrack memory may be embodied as a form of non-volatile, solid-state memory that relies on the intrinsic strength and orientation of the magnetic field created by an electron as it spins, in addition to its electronic charge, in solid-state devices. Through the use of a spin-coherent electric current to move magnetic domains along a nanoscopic permalloy wire, the domains may pass by magnetic read/write heads positioned near the wire as current is passed through the wire, which alter the domains to record patterns of bits. In order to create a racetrack memory device, many such wires and read/write elements may be packaged together.
The example storage system 306 depicted in fig. 3B may implement a variety of storage architectures. For example, storage systems in accordance with some embodiments of the present disclosure may utilize block storage, where data is stored in blocks and each block essentially acts as an individual hard drive. Storage systems in accordance with some embodiments of the present disclosure may utilize object storage, where data is managed as objects. Each object may include the data itself, a variable amount of metadata, and a globally unique identifier, and object storage may be implemented at multiple levels (e.g., the device level, the system level, the interface level). Storage systems in accordance with some embodiments of the present disclosure may utilize file storage, in which data is stored in a hierarchical structure. Such data may be saved in files and folders and presented in the same format both to the system storing it and to the system retrieving it.
The example storage system 306 depicted in fig. 3B may be embodied as a storage system in which additional storage resources can be added through the use of a scale-up model, through the use of a scale-out model, or through some combination thereof. In a scale-up model, additional storage may be added by adding additional storage devices. In a scale-out model, however, additional storage nodes may be added to a cluster of storage nodes, where such storage nodes may include additional processing resources, additional networking resources, and so on.
The example storage system 306 depicted in FIG. 3B may utilize the storage resources described above in a variety of different ways. For example, some portion of the storage resources may be used as a write cache, where data is initially written to the storage resources with relatively fast write latency, relatively high write bandwidth, or similar characteristics. In this example, data written to a storage resource used as a write cache may later be written to other storage resources characterized by slower write latency, lower write bandwidth, or similar characteristics than the storage resource used as a write cache. In a similar manner, storage resources within a storage system may be used as a read cache, where the read cache is populated according to a set of predetermined rules or heuristics. In other embodiments, the hierarchy may be implemented within the storage system by placing data within the storage system according to one or more policies such that, for example, frequently accessed data is stored in faster storage tiers and infrequently accessed data is stored in slower storage tiers.
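As a purely illustrative aid (not part of the depicted embodiments), the following Python sketch shows one way such a write cache might stage data in a fast tier and later destage it to a slower tier; the class name, the dictionary-backed tiers, and the flush threshold are all hypothetical stand-ins.

    from collections import OrderedDict

    class WriteCache:
        """Stage writes in a fast tier, then flush them to a slower tier."""

        def __init__(self, fast_tier, slow_tier, flush_threshold_bytes=1 << 20):
            self.fast_tier = fast_tier            # dict-like: address -> bytes
            self.slow_tier = slow_tier            # dict-like: address -> bytes
            self.flush_threshold = flush_threshold_bytes
            self.pending = OrderedDict()          # addresses staged but not yet flushed

        def write(self, address, data):
            # Acknowledge the write as soon as it lands in the fast tier.
            self.fast_tier[address] = data
            self.pending[address] = len(data)
            if sum(self.pending.values()) >= self.flush_threshold:
                self.flush()

        def flush(self):
            # Destage in arrival order to the slower, higher-capacity tier.
            for address in list(self.pending):
                self.slow_tier[address] = self.fast_tier[address]
                del self.pending[address]

    fast, slow = {}, {}
    cache = WriteCache(fast, slow, flush_threshold_bytes=4096)
    cache.write(0, b"x" * 4096)      # reaching the threshold triggers a flush
    print(0 in slow)                  # True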
The storage system 306 depicted in fig. 3B also includes communication resources 310 that may be useful in facilitating data communication between components within the storage system 306, as well as data communication between the storage system 306 and computing devices that are outside of the storage system 306, including embodiments in which those resources are separated by a relatively vast expanse. The communication resources 310 may be configured to utilize a variety of different protocols and data communication fabrics to facilitate data communication between components within the storage system as well as computing devices that are outside of the storage system. For example, the communication resources 310 may include Fibre Channel ("FC") technologies such as FC fabrics and FC protocols that can transport SCSI commands over FC networks, FC over Ethernet ("FCoE") technologies through which FC frames are encapsulated and transmitted over Ethernet networks, InfiniBand ("IB") technologies in which a switched fabric topology is utilized to facilitate transmissions between channel adapters, NVM Express ("NVMe") technologies and NVMe over Fabrics ("NVMe-oF") technologies through which non-volatile storage media attached via a PCI Express ("PCIe") bus may be accessed, and others. In fact, the storage systems described above may, directly or indirectly, make use of neutrino communication technologies and devices through which information (including binary information) is transmitted using a beam of neutrinos.
The communication resources 310 may also include mechanisms for accessing the storage resources 308 within the storage system 306 utilizing serial attached SCSI ("SAS"), serial ATA ("SATA") bus interfaces for connecting the storage resources 308 within the storage system 306 to host bus adapters within the storage system 306, Internet Small Computer Systems Interface ("iSCSI") technologies to provide block-level access to the storage resources 308 within the storage system 306, and other communication resources that may be useful in facilitating data communication between components within the storage system 306, as well as data communication between the storage system 306 and computing devices that are outside of the storage system 306.
The storage system 306 depicted in fig. 3B also includes processing resources 312 that may be useful in executing computer program instructions and performing other computational tasks within the storage system 306. The processing resources 312 may include one or more ASICs that are customized for some particular purpose, as well as one or more CPUs. The processing resources 312 may also include one or more DSPs, one or more FPGAs, one or more systems on a chip ("SoCs"), or other forms of processing resources 312. The storage system 306 may utilize the processing resources 312 to perform a variety of tasks, including but not limited to supporting the execution of software resources 314 that will be described in greater detail below.
The storage system 306 depicted in fig. 3B also includes software resources 314 that, when executed by processing resources 312 within the storage system 306, may perform a vast array of tasks. The software resources 314 may include, for example, one or more modules of computer program instructions that, when executed by processing resources 312 within the storage system 306, are useful in carrying out various data protection techniques to preserve the integrity of data that is stored within the storage system. Readers will appreciate that such data protection techniques may be carried out, for example, by system software executing on computer hardware within the storage system, by a cloud service provider, or in other ways. Such data protection techniques can include, for example: data archiving techniques that cause data that is no longer actively used to be moved to a separate storage device or separate storage system for long-term retention; data backup techniques through which data stored in the storage system may be copied and stored in a distinct location to avoid data loss in the event of equipment failure or some other form of catastrophe with the storage system; data replication techniques through which data stored in the storage system is replicated to another storage system such that the data may be accessible via multiple storage systems; data snapshotting techniques through which the state of data within the storage system is captured at various points in time; data and database cloning techniques through which duplicate copies of data and databases may be created; and other data protection techniques.
The software resource 314 may also include software for implementing a software defined storage ("SDS"). In this example, the software resources 314 may include one or more modules of computer program instructions that, when executed, are useful in policy-based data store provisioning and management independent of the underlying hardware. Such software resources 314 may be used to implement storage virtualization to separate storage hardware from software that manages the storage hardware.
The software resources 314 may also include software that is useful in facilitating and optimizing I/O operations that are directed to the storage resources 308 in the storage system 306. For example, the software resources 314 may include software modules that perform various data reduction techniques such as data compression, data deduplication, and others. The software resources 314 may include software modules that intelligently group I/O operations together to facilitate better usage of the underlying storage resources 308, software modules that perform data migration operations to migrate data from within the storage system, as well as software modules that perform other functions. Such software resources 314 may be embodied as one or more software containers or in many other ways.
For further explanation, fig. 3C sets forth an example of a cloud-based storage system 318 in accordance with some embodiments of the present disclosure. In the example depicted in fig. 3C, the cloud-based storage system 318 is created entirely within a cloud computing environment 316 such as, for example, Amazon Web Services ("AWS"), Microsoft Azure, Google Cloud Platform, IBM Cloud, Oracle Cloud, and others. The cloud-based storage system 318 may be used to provide services similar to the services that may be provided by the storage systems described above. For example, the cloud-based storage system 318 may be used to provide block storage services to users of the cloud-based storage system 318, the cloud-based storage system 318 may be used to provide storage services to users of the cloud-based storage system 318 through the use of solid-state storage, and so on.
The cloud-based storage system 318 depicted in fig. 3C includes two cloud computing instances 320, 322 that are each used to support the execution of a storage controller application 324, 326. The cloud computing instances 320, 322 may be embodied, for example, as instances of cloud computing resources (e.g., virtual machines) that may be provided by the cloud computing environment 316 to support the execution of software applications such as the storage controller applications 324, 326. In one embodiment, the cloud computing instances 320, 322 may be embodied as Amazon Elastic Compute Cloud ("EC2") instances. In such an example, an Amazon Machine Image ("AMI") that includes the storage controller application 324, 326 may be booted to create and configure a virtual machine that may execute the storage controller application 324, 326.
In the example method depicted in fig. 3C, the storage controller applications 324, 326 may be embodied as modules of computer program instructions that, when executed, perform various storage tasks. For example, the storage controller applications 324, 326 may be embodied as modules of computer program instructions that, when executed, perform the same tasks as the controllers 110A, 110B in fig. 1A described above, such as writing data received from a user of the cloud-based storage system 318 to the cloud-based storage system 318, erasing data from the cloud-based storage system 318, retrieving data from the cloud-based storage system 318 and providing this data to a user of the cloud-based storage system 318, monitoring and reporting disk utilization and performance, performing redundancy operations (e.g., RAID or RAID-like data redundancy operations), compressing data, encrypting data, removing duplicate data, and so forth. Readers will appreciate that because there are two cloud computing instances 320, 322, each containing a storage controller application 324, 326, in some embodiments one cloud computing instance 320 may operate as a primary controller as described above, while another cloud computing instance 322 may operate as a secondary controller as described above. The reader will appreciate that the storage controller applications 324, 326 depicted in fig. 3C may include the same source code executing in the different cloud computing instances 320, 322.
Consider an example in which the cloud computing environment 316 is embodied as AWS and the cloud computing instances are embodied as EC2 instances. In such an example, the cloud computing instance 320 that operates as the primary controller may be deployed on one of the instance types that has a relatively large amount of memory and processing power, while the cloud computing instance 322 that operates as the secondary controller may be deployed on one of the instance types that has a relatively small amount of memory and processing power. In such an example, upon the occurrence of a failover event in which the roles of primary and secondary are switched, a double failover may actually be carried out such that: 1) a first failover event occurs in which the cloud computing instance 322 that formerly operated as the secondary controller begins to operate as the primary controller, and 2) a third cloud computing instance (not shown), of an instance type that has a relatively large amount of memory and processing power, is spun up with a copy of the storage controller application, where the third cloud computing instance begins operating as the primary controller while the cloud computing instance 322 that originally operated as the secondary controller begins operating as the secondary controller again. In such an example, the cloud computing instance 320 that formerly operated as the primary controller may be terminated. Readers will appreciate that in alternative embodiments, the cloud computing instance 320 that is operating as the secondary controller after the failover event may continue to operate as the secondary controller, and the cloud computing instance 322 that operated as the primary controller after the occurrence of the failover event may be terminated once the primary role has been assumed by the third cloud computing instance (not shown).
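The double failover described above can be summarized, purely for illustration, with the following Python sketch; the helper functions launch_instance(), promote(), and demote() are hypothetical stand-ins for the cloud provider and orchestration operations rather than any actual API.

    def launch_instance(instance_type):
        # Hypothetical: would start a new cloud computing instance running
        # the storage controller application and return a handle to it.
        return {"type": instance_type, "role": "none"}

    def promote(instance):
        instance["role"] = "primary"

    def demote(instance):
        instance["role"] = "secondary"

    def double_failover(secondary_small):
        # 1) First failover: the small secondary takes over as primary.
        promote(secondary_small)
        # 2) Spin up a replacement large instance, hand it the primary role,
        #    and return the small instance to its secondary role.
        replacement = launch_instance("large-memory-and-cpu")
        promote(replacement)
        demote(secondary_small)
        # The failed original primary would then be terminated.
        return replacement, secondary_small

    new_primary, secondary = double_failover({"type": "small", "role": "secondary"})
    print(new_primary["role"], secondary["role"])   # primary secondary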
Readers will appreciate that while the embodiments described above relate to embodiments in which one cloud computing instance 320 operates as the primary controller and the second cloud computing instance 322 operates as the secondary controller, other embodiments are within the scope of the present disclosure. For example, each cloud computing instance 320, 322 may operate as the primary controller for some portion of the address space supported by the cloud-based storage system 318, each cloud computing instance 320, 322 may operate as the primary controller with the servicing of I/O operations directed to the cloud-based storage system 318 divided in some other way, and so on. In fact, in other embodiments in which cost savings may be prioritized over performance demands, only a single cloud computing instance that contains the storage controller application may exist.
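For illustration only, the following Python sketch shows one simple way the address space might be divided so that each controller instance services I/O for the portion it owns; the controller names and the even split at the midpoint of the address space are assumptions rather than details taken from the embodiments above.

    CONTROLLERS = ["controller-320", "controller-322"]

    def owning_controller(volume_address, address_space_size=2**48):
        """Route an address to a controller by halving the address space."""
        midpoint = address_space_size // 2
        return CONTROLLERS[0] if volume_address < midpoint else CONTROLLERS[1]

    print(owning_controller(10))             # controller-320
    print(owning_controller(2**47 + 10))     # controller-322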
The cloud-based storage system 318 depicted in fig. 3C includes cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338. The cloud computing instances 340a, 340b, 340n depicted in fig. 3C may be embodied, for example, as instances of cloud computing resources that may be provided by the cloud computing environment 316 to support the execution of software applications. The cloud computing instances 340a, 340b, 340n of fig. 3C may differ from the cloud computing instances 320, 322 described above in that the cloud computing instances 340a, 340b, 340n of fig. 3C have local storage 330, 334, 338 resources, whereas the cloud computing instances 320, 322 that support the execution of the storage controller applications 324, 326 need not have local storage resources. The cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 may be embodied, for example, as EC2 M5 instances that include one or more SSDs, as EC2 R5 instances that include one or more SSDs, as EC2 I3 instances that include one or more SSDs, and so on. In some embodiments, the local storage 330, 334, 338 must be embodied as solid-state storage (e.g., SSDs) rather than storage that makes use of hard disk drives.
In the example depicted in fig. 3C, each of the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 may include a software daemon 328, 332, 336 that, when executed by a cloud computing instance 340a, 340b, 340n, can present itself to the storage controller applications 324, 326 as if the cloud computing instance 340a, 340b, 340n were a physical storage device (e.g., one or more SSDs). In such an example, the software daemon 328, 332, 336 may include computer program instructions similar to those that would normally be contained on a storage device, such that the storage controller applications 324, 326 can send and receive the same commands that a storage controller would send to storage devices. In such a way, the storage controller applications 324, 326 may include code that is identical to (or substantially identical to) the code that would be executed by the controllers in the storage systems described above. In these and similar embodiments, communications between the storage controller applications 324, 326 and the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 may utilize iSCSI, NVMe over TCP, messaging, a custom protocol, or some other mechanism.
In the example depicted in fig. 3C, each of the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 may also be coupled to block storage 342, 344, 346 that is offered by the cloud computing environment 316. The block storage 342, 344, 346 that is offered by the cloud computing environment 316 may be embodied, for example, as Amazon Elastic Block Store ("EBS") volumes. For example, a first EBS volume may be coupled to the first cloud computing instance 340a, a second EBS volume may be coupled to the second cloud computing instance 340b, and a third EBS volume may be coupled to the third cloud computing instance 340n. In such an example, the block storage 342, 344, 346 that is offered by the cloud computing environment 316 may be utilized in a manner that is similar to how the NVRAM devices described above are utilized, as the software daemon 328, 332, 336 (or some other module) that is executing within a particular cloud computing instance 340a, 340b, 340n may, upon receiving a request to write data, initiate a write of the data to its attached EBS volume as well as a write of the data to its local storage 330, 334, 338 resources. In some alternative embodiments, data may only be written to the local storage 330, 334, 338 resources within a particular cloud computing instance 340a, 340b, 340n. In an alternative embodiment, rather than using the block storage 342, 344, 346 that is offered by the cloud computing environment 316 as NVRAM, actual RAM on each of the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 may be used as NVRAM, thereby decreasing the network utilization costs that would be associated with using an EBS volume as NVRAM.
In the example depicted in fig. 3C, the cloud computing instances 320, 322 that each support the execution of a storage controller application 324, 326 may utilize the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 to service I/O operations that are directed to the cloud-based storage system 318. Consider an example in which a first cloud computing instance 320 that is executing the storage controller application 324 is operating as the primary controller. In such an example, the first cloud computing instance 320 that is executing the storage controller application 324 may receive (directly or indirectly via the secondary controller) requests to write data to the cloud-based storage system 318 from users of the cloud-based storage system 318. In such an example, the first cloud computing instance 320 that is executing the storage controller application 324 may perform various tasks such as, for example, deduplicating the data contained in the request, compressing the data contained in the request, determining where to write the data contained in the request, and so on, before ultimately sending a request to write a deduplicated, encrypted, or otherwise possibly updated version of the data to one or more of the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338. In some embodiments, the cloud computing instances 320, 322 may receive requests to read data from the cloud-based storage system 318 and may ultimately send requests to read data to one or more of the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338.
Readers will appreciate that when a particular cloud computing instance 340a, 340b, 340n with local storage 330, 334, 338 receives a request to write data, the software daemon 328, 332, 336 or some other module of computer program instructions that is executing on the particular cloud computing instance 340a, 340b, 340n may be configured to not only write the data to its own local storage 330, 334, 338 resources and any appropriate block storage 342, 344, 346 that is offered by the cloud computing environment 316, but may also be configured to write the data to cloud-based object storage 348 that is attached to the particular cloud computing instance 340a, 340b, 340n. The cloud-based object storage 348 that is attached to the particular cloud computing instance 340a, 340b, 340n may be embodied, for example, as Amazon Simple Storage Service ("S3") storage that is accessible by the particular cloud computing instance 340a, 340b, 340n. In other embodiments, the cloud computing instances 320, 322 that each include the storage controller application 324, 326 may initiate the storage of the data in the local storage 330, 334, 338 of the cloud computing instances 340a, 340b, 340n and the cloud-based object storage 348.
Readers will appreciate that, as described above, the cloud-based storage system 318 may be used to provide block storage services to users of the cloud-based storage system 318. While the local storage 330, 334, 338 resources and the block storage 342, 344, 346 resources that are utilized by the cloud computing instances 340a, 340b, 340n may support block-level access, the cloud-based object storage 348 that is attached to a particular cloud computing instance 340a, 340b, 340n supports only object-based access. In order to address this, the software daemon 328, 332, 336 or some other module of computer program instructions that is executing on the particular cloud computing instance 340a, 340b, 340n may be configured to take blocks of data, package those blocks into objects, and write the objects to the cloud-based object storage 348 that is attached to the particular cloud computing instance 340a, 340b, 340n.
Consider an example in which data is written to the local storage 330, 334, 338 resources and the block storage 342, 344, 346 resources that are utilized by the cloud computing instances 340a, 340b, 340n in 1 MB blocks. In such an example, assume that a user of the cloud-based storage system 318 issues a request to write data that, after being compressed and deduplicated by the storage controller applications 324, 326, results in the need to write 5 MB of data. In such an example, writing the data to the local storage 330, 334, 338 resources and the block storage 342, 344, 346 resources that are utilized by the cloud computing instances 340a, 340b, 340n is relatively straightforward, as five blocks that are 1 MB in size are written to those local storage 330, 334, 338 resources and block storage 342, 344, 346 resources. In such an example, the software daemon 328, 332, 336 or some other module of computer program instructions that is executing on the particular cloud computing instance 340a, 340b, 340n may be configured to: 1) create a first object that includes the first 1 MB of the data and write the first object to the cloud-based object storage 348, 2) create a second object that includes the second 1 MB of the data and write the second object to the cloud-based object storage 348, 3) create a third object that includes the third 1 MB of the data and write the third object to the cloud-based object storage 348, and so on. As such, in some embodiments, each object that is written to the cloud-based object storage 348 may be identical (or nearly identical) in size. Readers will appreciate that in such an example, metadata that is associated with the data itself may be included in each object (e.g., the first 1 MB of the object is data and the remaining portion is metadata associated with the data).
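A minimal sketch of the packaging step described above follows, with the cloud-based object storage modeled as an in-memory dictionary; the key naming scheme and the JSON metadata layout are hypothetical.

    import json

    CHUNK_SIZE = 1 << 20   # 1 MB objects, matching the example above

    def write_as_objects(object_store, volume_id, offset, data):
        """Split `data` into 1 MB objects and write each one to the object store."""
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            metadata = json.dumps({"volume": volume_id,
                                   "offset": offset + i,
                                   "length": len(chunk)}).encode()
            key = f"{volume_id}/{offset + i}"
            # Each object carries the data followed by metadata describing it.
            object_store[key] = chunk + metadata

    store = {}
    write_as_objects(store, "vol-1", 0, b"x" * (5 * CHUNK_SIZE))
    print(len(store))   # 5 objects of roughly 1 MB each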
Readers will appreciate that the cloud-based object storage 348 may be incorporated into the cloud-based storage system 318 to increase the durability of the cloud-based storage system 318. Continuing with the example described above in which the cloud computing instances 340a, 340b, 340n are EC2 instances, readers will understand that EC2 instances are only guaranteed to have a monthly uptime of 99.9%, and data stored in the local instance store only persists during the lifetime of the EC2 instance. As such, relying on the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 as the only source of persistent data storage in the cloud-based storage system 318 may result in a relatively unreliable storage system. Likewise, EBS volumes are designed for 99.999% availability. As such, even relying on EBS as the persistent data store in the cloud-based storage system 318 may result in a storage system that is not sufficiently durable. Amazon S3, however, is designed to provide 99.999999999% durability, meaning that a cloud-based storage system 318 that can incorporate S3 into its pool of storage is substantially more durable than various other options.
Readers will appreciate that while a cloud-based storage system 318 that can incorporate S3 into its pool of storage is substantially more durable than various other options, utilizing S3 as the primary pool of storage may result in a storage system that has relatively slow response times and relatively long I/O latencies. As such, the cloud-based storage system 318 depicted in fig. 3C not only stores data in S3 but also stores data in the local storage 330, 334, 338 resources and the block storage 342, 344, 346 resources that are utilized by the cloud computing instances 340a, 340b, 340n, such that read operations can be serviced from the local storage 330, 334, 338 resources and the block storage 342, 344, 346 resources that are utilized by the cloud computing instances 340a, 340b, 340n, thereby reducing read latency when users of the cloud-based storage system 318 attempt to read data from the cloud-based storage system 318.
In some embodiments, all data that is stored by the cloud-based storage system 318 may be stored in both: 1) the cloud-based object storage 348, and 2) at least one of the local storage 330, 334, 338 resources or the block storage 342, 344, 346 resources that are utilized by the cloud computing instances 340a, 340b, 340n. In such embodiments, the local storage 330, 334, 338 resources and the block storage 342, 344, 346 resources that are utilized by the cloud computing instances 340a, 340b, 340n may effectively operate as cache that generally includes all data that is also stored in S3, such that all reads of data may be serviced by the cloud computing instances 340a, 340b, 340n without requiring the cloud computing instances 340a, 340b, 340n to access the cloud-based object storage 348. Readers will appreciate, however, that in other embodiments all data that is stored by the cloud-based storage system 318 may be stored in the cloud-based object storage 348, while less than all data that is stored by the cloud-based storage system 318 may be stored in at least one of the local storage 330, 334, 338 resources or the block storage 342, 344, 346 resources that are utilized by the cloud computing instances 340a, 340b, 340n. In such an example, various policies may be utilized to determine which subset of the data that is stored by the cloud-based storage system 318 should reside in both: 1) the cloud-based object storage 348, and 2) at least one of the local storage 330, 334, 338 resources or the block storage 342, 344, 346 resources that are utilized by the cloud computing instances 340a, 340b, 340n.
As described above, when the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 are embodied as EC2 instances, the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 are only guaranteed to have a monthly uptime of 99.9%, and data stored in the local instance store only persists during the lifetime of each cloud computing instance 340a, 340b, 340n with local storage 330, 334, 338. As such, one or more modules of computer program instructions that are executing within the cloud-based storage system 318 (e.g., a monitoring module that is executing on its own EC2 instance) may be designed to handle the failure of one or more of the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338. In such an example, the monitoring module may handle the failure of one or more of the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 by creating one or more new cloud computing instances with local storage, retrieving data that was stored on the failed cloud computing instances 340a, 340b, 340n from the cloud-based object storage 348, and storing the data retrieved from the cloud-based object storage 348 in the local storage on the newly created cloud computing instances. Readers will appreciate that many variants of this process may be implemented.
Consider an example in which all of the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 fail. In such an example, the monitoring module may create new cloud computing instances with local storage, where high-bandwidth instance types are selected that allow for the maximum data transfer rates between the newly created high-bandwidth cloud computing instances with local storage and the cloud-based object storage 348. Readers will appreciate that instance types that allow for the maximum data transfer rates between the new cloud computing instances and the cloud-based object storage 348 are selected so that the new high-bandwidth cloud computing instances can be rehydrated with data from the cloud-based object storage 348 as quickly as possible. Once the new high-bandwidth cloud computing instances are rehydrated with data from the cloud-based object storage 348, cheaper lower-bandwidth cloud computing instances may be created, data may be migrated to the cheaper lower-bandwidth cloud computing instances, and the high-bandwidth cloud computing instances may be terminated.
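The recovery sequence described above might be orchestrated roughly as in the following Python sketch; launch_instance(), copy_all_objects(), migrate(), and terminate_instance() are hypothetical stand-ins for provider and data-path operations and are modeled here with in-memory dictionaries.

    def launch_instance(instance_type):
        return {"type": instance_type, "data": {}}

    def copy_all_objects(object_store, instance):
        instance["data"].update(object_store)

    def migrate(source, destination):
        destination["data"].update(source["data"])

    def terminate_instance(instance):
        instance["data"].clear()

    def recover_from_object_store(object_store):
        # Use a high-bandwidth instance type so rehydration from the object
        # store completes as quickly as possible.
        fast = launch_instance("high-bandwidth")
        copy_all_objects(object_store, fast)
        # Once rehydrated, move the data to a cheaper, lower-bandwidth instance
        # and terminate the expensive one.
        cheap = launch_instance("low-bandwidth")
        migrate(fast, cheap)
        terminate_instance(fast)
        return cheap

    replacement = recover_from_object_store({"vol-1/0": b"data"})
    print(len(replacement["data"]))   # 1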
Readers will appreciate that in some embodiments, the number of new cloud computing instances that are created may substantially exceed the number of cloud computing instances that are needed to locally store all of the data stored by the cloud-based storage system 318. The number of new cloud computing instances that are created may substantially exceed the number of cloud computing instances that are needed to locally store all of that data in order to more rapidly pull data from the cloud-based object storage 348 into the new cloud computing instances, as each new cloud computing instance can (in parallel) retrieve some portion of the data stored by the cloud-based storage system 318. In such embodiments, once the data stored by the cloud-based storage system 318 has been pulled into the newly created cloud computing instances, the data may be consolidated within a subset of the newly created cloud computing instances, and the excess newly created cloud computing instances may be terminated.
Consider an example in which 1,000 cloud computing instances are needed in order to locally store all of the valid data that users of the cloud-based storage system 318 have written to the cloud-based storage system 318. In such an example, assume that all 1,000 cloud computing instances fail. In such an example, the monitoring module may cause 100,000 cloud computing instances to be created, where each cloud computing instance is responsible for retrieving from the cloud-based object storage 348 a distinct 1/100,000th chunk of the valid data that users of the cloud-based storage system 318 have written to the cloud-based storage system 318, and for locally storing the distinct chunk of the data set that it retrieved. In such an example, because each of the 100,000 cloud computing instances can retrieve data from the cloud-based object storage 348 in parallel, the caching layer may be restored 100 times faster as compared to an embodiment in which the monitoring module creates only 1,000 replacement cloud computing instances. In such an example, over time the data that is stored locally in the 100,000 cloud computing instances may be consolidated into 1,000 cloud computing instances, and the remaining 99,000 cloud computing instances may be terminated.
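For illustration, the sharding of object keys across many recovery instances might look like the following Python sketch; the shard count is reduced so that the example runs quickly, and the key naming is hypothetical. Because each shard can be pulled from the object storage in parallel, the wall-clock recovery time shrinks roughly in proportion to the number of recovery instances.

    def shard_keys(keys, shard_count):
        """Deal object keys round-robin into `shard_count` recovery shards."""
        shards = [[] for _ in range(shard_count)]
        for index, key in enumerate(sorted(keys)):
            shards[index % shard_count].append(key)
        return shards

    keys = [f"vol-1/{i}" for i in range(1000)]
    shards = shard_keys(keys, 100)            # stand-in for 100,000 instances
    print(len(shards), len(shards[0]))        # 100 shards of 10 keys each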
Readers will appreciate that various performance aspects of the cloud-based storage system 318 may be monitored (e.g., by a monitoring module that is executing on an EC2 instance) such that the cloud-based storage system 318 can be scaled up or scaled out as needed. Consider an example in which the monitoring module monitors the performance of the cloud-based storage system 318 via communications with one or more of the cloud computing instances 320, 322 that are each used to support the execution of a storage controller application 324, 326, via monitoring communications between the cloud computing instances 320, 322, 340a, 340b, 340n, via monitoring communications between the cloud computing instances 320, 322, 340a, 340b, 340n and the cloud-based object storage 348, or in some other way. In such an example, assume that the monitoring module determines that the cloud computing instances 320, 322 that are used to support the execution of the storage controller applications 324, 326 are undersized and are not sufficiently servicing the I/O requests that are issued by users of the cloud-based storage system 318. In such an example, the monitoring module may create a new, more powerful cloud computing instance (e.g., a cloud computing instance of a type that includes more processing power, more memory, etc.) that includes the storage controller application, such that the new, more powerful cloud computing instance can begin operating as the primary controller. Likewise, if the monitoring module determines that the cloud computing instances 320, 322 that are used to support the execution of the storage controller applications 324, 326 are oversized and that cost savings could be gained by switching to smaller, less powerful cloud computing instances, the monitoring module may create new, less powerful (and less expensive) cloud computing instances that include the storage controller application, such that the new, less powerful cloud computing instance can begin operating as the primary controller.
As an additional example of dynamically resizing the cloud-based storage system 318, consider an example in which the monitoring module determines that the utilization of the local storage that is collectively provided by the cloud computing instances 340a, 340b, 340n has reached a predetermined utilization threshold (e.g., 95%). In such an example, the monitoring module may create additional cloud computing instances with local storage to expand the pool of local storage that is offered by the cloud computing instances. Alternatively, the monitoring module may create one or more new cloud computing instances that have larger amounts of local storage than the already existing cloud computing instances 340a, 340b, 340n, such that data stored in the already existing cloud computing instances 340a, 340b, 340n can be migrated to the one or more new cloud computing instances and the already existing cloud computing instances 340a, 340b, 340n can be terminated, thereby expanding the pool of local storage that is offered by the cloud computing instances. Likewise, if the pool of local storage that is offered by the cloud computing instances is unnecessarily large, data can be consolidated and some cloud computing instances can be terminated.
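A sketch of such a utilization rule follows; the 95% threshold matches the example above, while the low-water mark and the returned action strings are hypothetical.

    def scaling_decision(used_bytes, provisioned_bytes,
                         high_water=0.95, low_water=0.50):
        utilization = used_bytes / provisioned_bytes
        if utilization >= high_water:
            return "add local-storage instances (or migrate to larger ones)"
        if utilization <= low_water:
            return "consolidate data and terminate surplus instances"
        return "no change"

    print(scaling_decision(96, 100))   # add local-storage instances ...
    print(scaling_decision(40, 100))   # consolidate ...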
Readers will appreciate that the cloud-based storage system 318 may be sized up and down automatically by a monitoring module applying a predetermined set of rules that may be relatively simple or relatively complicated. In fact, the monitoring module may not only take into account the current state of the cloud-based storage system 318, but the monitoring module may also apply predictive policies that are based on, for example, observed behavior (e.g., every night from 10 PM until 6 AM usage of the storage system is relatively light), predetermined fingerprints (e.g., every time a virtual desktop infrastructure adds 100 virtual desktops, the number of IOPS directed to the storage system increases by X), and so on. In such an example, the dynamic scaling of the cloud-based storage system 318 may be based on current performance metrics, predicted workloads, and many other factors, including combinations thereof.
Readers will further appreciate that because the cloud-based storage system 318 may be dynamically scaled, the cloud-based storage system 318 may even operate in a way that is more dynamic. Consider the example of garbage collection. In a traditional storage system, the amount of storage is fixed. As such, at some point the storage system may be forced to perform garbage collection, as the amount of available storage has become so constrained that the storage system is on the verge of running out of storage. In contrast, the cloud-based storage system 318 described herein can always 'add' additional storage (e.g., by adding more cloud computing instances with local storage). Because the cloud-based storage system 318 described herein can always 'add' additional storage, the cloud-based storage system 318 can make more intelligent decisions regarding when to perform garbage collection. For example, the cloud-based storage system 318 may implement a policy that garbage collection is performed only when the number of IOPS being serviced by the cloud-based storage system 318 falls below a certain level. In some embodiments, other system-level functions (e.g., deduplication, compression) may also be turned off and on in response to system load, given that the size of the cloud-based storage system 318 is not constrained in the same manner as traditional storage systems are constrained.
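Purely as an illustration of such a policy, the following Python sketch gates garbage collection on the IOPS currently being serviced; the threshold values are hypothetical.

    GC_IOPS_THRESHOLD = 10_000

    def should_run_garbage_collection(current_iops, free_capacity_fraction):
        # Unlike a fixed-capacity system, garbage collection can be deferred
        # while the system is busy, because more capacity can always be added.
        return current_iops < GC_IOPS_THRESHOLD or free_capacity_fraction < 0.05

    print(should_run_garbage_collection(2_000, 0.30))    # True: system is idle
    print(should_run_garbage_collection(50_000, 0.30))   # False: defer GC
    print(should_run_garbage_collection(50_000, 0.02))   # True: capacity pressure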
Readers will appreciate that embodiments of the present disclosure resolve an issue with block-storage services offered by some cloud computing environments, as some cloud computing environments allow only one cloud computing instance to connect to a block-storage volume at a single time. For example, in Amazon AWS only a single EC2 instance may be connected to an EBS volume. Through the use of EC2 instances with local storage, embodiments of the present disclosure can offer multi-connect capabilities in which multiple EC2 instances can connect to another EC2 instance with local storage (a 'drive instance'). In such embodiments, the drive instance may include software executing within the drive instance that allows the drive instance to support I/O directed to a particular volume from each connected EC2 instance. As such, some embodiments of the present disclosure may be embodied as multi-connect block storage services that may not include all of the components depicted in fig. 3C.
In some embodiments, especially in embodiments in which the cloud-based object storage 348 resources are embodied as Amazon S3, the cloud-based storage system 318 may include one or more modules (e.g., a module of computer program instructions executing on an EC2 instance) that are configured to ensure that when the local storage of a particular cloud computing instance is rehydrated with data from S3, the appropriate data is actually in S3. This issue arises largely because S3 implements an eventual-consistency model, in which, when overwriting an existing object, reads of the object will eventually (but not necessarily immediately) become consistent and will eventually (but not necessarily immediately) return the overwritten version of the object. In order to address this issue, in some embodiments of the present disclosure, objects in S3 are never overwritten. Instead, a traditional 'overwrite' results in the creation of a new object (that includes the updated version of the data) and the eventual deletion of the old object (that includes the previous version of the data).
In some embodiments of the present disclosure, as part of an attempt to never (or almost never) overwrite an object, when data is written to S3 the resultant object may be tagged with a sequence number. In some embodiments, these sequence numbers may be persisted elsewhere (e.g., in a database), such that at any point in time the sequence number associated with the most up-to-date version of some piece of data can be known. In such a way, a determination can be made as to whether S3 has the most recent version of some piece of data by merely reading the sequence number associated with an object, and without actually reading the data from S3. The ability to make this determination may be particularly important when a cloud computing instance with local storage crashes, as it would be undesirable to rehydrate the local storage of a replacement cloud computing instance with out-of-date data. In fact, because the cloud-based storage system 318 does not need to access the data to verify its validity, the data can stay encrypted and access charges can be avoided.
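A minimal sketch of the sequence-number scheme follows, with both the object store and the sequence-number database modeled as in-memory dictionaries; the key format is hypothetical.

    object_store = {}        # object key -> data (objects are never overwritten)
    sequence_db = {}         # logical segment id -> latest sequence number

    def write_segment(segment_id, data):
        sequence = sequence_db.get(segment_id, 0) + 1
        # A "rewrite" creates a new object keyed by sequence number; the old
        # object can be deleted later rather than being overwritten in place.
        object_store[f"{segment_id}#{sequence}"] = data
        sequence_db[segment_id] = sequence

    def is_latest_in_object_store(segment_id, observed_sequence):
        # Freshness can be decided from sequence numbers alone, without reading
        # (or decrypting) the data itself.
        return observed_sequence == sequence_db.get(segment_id)

    write_segment("seg-7", b"v1")
    write_segment("seg-7", b"v2")
    print(is_latest_in_object_store("seg-7", 1))   # False: stale version
    print(is_latest_in_object_store("seg-7", 2))   # True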
The storage systems described above may carry out intelligent data backup techniques through which data stored in the storage system may be copied and stored in a distinct location to avoid data loss in the event of equipment failure or some other form of catastrophe. For example, the storage systems described above may be configured to examine each backup to avoid restoring the storage system to an undesirable state. Consider an example in which malware infects the storage system. In such an example, the storage system may include software resources 314 that can scan each backup to identify backups that were captured before the malware infected the storage system and those backups that were captured after the malware infected the storage system. In such an example, the storage system may restore itself from a backup that does not include the malware, or at least not restore the portions of a backup that contained the malware. In such an example, the storage system may include software resources 314 that can scan each backup to identify the presence of malware (or a virus, or some other undesirable element), for example, by identifying write operations that were serviced by the storage system and originated from a network subnet that is suspected to have delivered the malware, by identifying write operations that were serviced by the storage system and originated from a user that is suspected to have delivered the malware, by identifying write operations that were serviced by the storage system and examining the content of the write operation against fingerprints of the malware, and in many other ways.
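As an illustration only, the following Python sketch selects the most recent backup captured before any write matched a known malware fingerprint; representing fingerprints as byte patterns is a simplification of the checks described above.

    MALWARE_FINGERPRINTS = [b"EVIL_MARKER"]

    def first_infected_backup(backups):
        """`backups` is a list of (timestamp, payload) pairs in time order."""
        for timestamp, payload in backups:
            if any(sig in payload for sig in MALWARE_FINGERPRINTS):
                return timestamp
        return None

    def clean_restore_point(backups):
        infected_at = first_infected_backup(backups)
        candidates = [t for t, _ in backups if infected_at is None or t < infected_at]
        return max(candidates) if candidates else None

    backups = [(1, b"good data"), (2, b"still good"), (3, b"EVIL_MARKER payload")]
    print(clean_restore_point(backups))   # 2: last backup captured before infection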
Readers will further appreciate that backups (often in the form of one or more snapshots) may also be utilized to perform rapid recovery of the storage system. Consider an example in which the storage system is infected with ransomware that locks users out of the storage system. In such an example, software resources 314 within the storage system may be configured to detect the presence of ransomware and may be further configured to restore the storage system to a point in time prior to the point in time at which the ransomware infected the storage system, using the retained backups. In such an example, the presence of ransomware may be explicitly detected through the use of software tools utilized by the system, through the use of a key (e.g., a USB drive) that is inserted into the storage system, or in a similar way. Likewise, the presence of ransomware may be inferred in response to system activity meeting a predetermined fingerprint such as, for example, no reads or writes coming into the system for a predetermined period of time.
Readers will appreciate that the various components described above may be grouped into one or more optimized computing packages as converged infrastructures. Such converged infrastructures may include pools of computer, storage, and networking resources that can be shared by multiple applications and managed in a collective manner using policy-driven processes. Such converged infrastructures may be implemented with a converged infrastructure reference architecture, with standalone appliances, with a software-driven hyper-converged approach (e.g., hyper-converged infrastructures), or in other ways.
Readers will appreciate that the above described storage systems may be used to support various types of software applications. For example, the storage system 306 may be used to support artificial intelligence ("AI") applications, database applications, devOps projects, electronic design automation tools, event driven software applications, high performance computing applications, simulation applications, high speed data capture and analysis applications, machine learning applications, media production applications, media services applications, picture archiving and communication systems ("PACS") applications, software development applications, virtual reality applications, augmented reality applications, and many other types of applications by providing storage resources to these applications.
The storage system described above is operable to support a variety of applications. In view of the fact that the storage system includes computing resources, storage resources, and a wide variety of other resources, the storage system may be well suited to support resource-intensive applications, such as AI applications. AI applications may be deployed in a variety of fields, including: predictive maintenance of manufacturing and related fields, healthcare applications (e.g., patient data and risk analysis), retail and marketing deployments (e.g., search advertisements, social media advertisements), supply chain solutions, financial technology solutions (e.g., business analysis and reporting tools), operational deployments (e.g., real-time analysis tools), application performance management tools, IT infrastructure management tools, and many others.
Such AI applications may enable devices to perceive their environment and take actions that maximize their chance of success at some goal. Examples of such AI applications can include IBM Watson, Microsoft Oxford, Google DeepMind, Baidu Minwa, and others. The storage systems described above may also be well suited to support other types of resource-intensive applications such as machine learning applications. Machine learning applications may perform various types of data analysis to automate analytical model building. Using algorithms that iteratively learn from data, machine learning applications can enable computers to learn without being explicitly programmed. One particular area of machine learning is referred to as reinforcement learning, which involves taking suitable actions to maximize reward in a particular situation. Reinforcement learning may be employed to find the best possible behavior or path that a particular software application or machine should take in a specific situation. Reinforcement learning differs from other areas of machine learning (e.g., supervised learning, unsupervised learning) in that correct input/output pairs need not be presented for reinforcement learning, and sub-optimal actions need not be explicitly corrected.
In addition to the resources already described, the above-described storage system may also contain a graphics processing unit ("GPU"), sometimes referred to as a visual processing unit ("VPU"). Such GPUs may be embodied as specialized electronic circuits that quickly manipulate and alter memory to speed up the creation of images in a frame buffer for output to a display device. Such GPUs may be included within any computing device that is part of the storage system described above, including as one of many individual scalable components of the storage system, wherein other examples of individual scalable components of such storage system may include storage components, memory components, computing components (e.g., CPU, FPGA, ASIC), network components, software components, and others. In addition to GPUs, the storage system described above may also include neural network processors ("NNPs") for respective aspects of neural network processing. Such NNPs may be used in place of (or in addition to) GPUs, and may also be independently scalable.
As described above, the storage systems described herein may be configured to support artificial intelligence applications, machine learning applications, big data analytics applications, and many other types of applications. The rapid growth in these sorts of applications is being driven by three technologies: deep learning (DL), GPU processors, and big data. Deep learning is a computing model that makes use of massively parallel neural networks inspired by the human brain. Instead of experts handcrafting software, a deep learning model writes its own software by learning from lots of examples. Such GPUs may include thousands of cores that are well suited to running algorithms that loosely represent the parallel nature of the human brain.
Advances in deep neural networks, including the development of multi-layer neural networks, have ignited a new wave of algorithms and tools for data scientists to tap into their data with artificial intelligence (AI). With improved algorithms, larger data sets, and various frameworks (including open-source software libraries for machine learning across a range of tasks), data scientists are tackling new use cases such as autonomous driving vehicles, natural language processing and understanding, computer vision, machine reasoning, strong AI, and many others. Applications of such techniques may include: machine and vehicular object detection, identification, and avoidance; visual recognition, classification, and tagging; algorithmic financial trading strategy performance management; simultaneous localization and mapping; predictive maintenance of high-value machinery; prevention of cyber security threats and automation of expertise; image recognition and classification; question answering; robotics; text analytics (extraction, classification) and text generation and translation; and many others. Applications of AI techniques have materialized in a wide array of products including, for example, Amazon Echo's speech recognition technology that allows users to talk to their machines, Google Translate™ which allows for machine-based language translation, Spotify's Discover Weekly that provides recommendations on new songs and artists that a user may like based on the user's usage and traffic analysis, Quill's text generation offering that takes structured data and turns it into narrative stories, chatbots that provide real-time, contextually specific answers to questions in a dialog format, and many others.
Data is the heart of modern AI and deep learning algorithms. Before training can begin, one problem that must be addressed revolves around collecting the labeled data that is crucial for training an accurate AI model. A full-scale AI deployment may be required to continuously collect, clean, transform, label, and store large amounts of data. Adding additional high-quality data points directly translates into more accurate models and better insights. Data samples may undergo a series of processing steps including, but not limited to: 1) ingesting the data from an external source into the training system and storing the data in raw form, 2) cleaning and transforming the data in a format convenient for training, including linking data samples to the appropriate label, 3) exploring parameters and models, quickly testing with a smaller dataset, and iterating to converge on the most promising models to push into the production cluster, 4) executing training phases to select random batches of input data, including both new and older samples, and feeding those into production GPU servers for computation to update model parameters, and 5) evaluating, including using a holdback portion of the data not used in training in order to evaluate model accuracy on the holdout data. This lifecycle may apply to any type of parallelized machine learning, not just neural networks or deep learning. For example, standard machine learning frameworks may rely on CPUs instead of GPUs, but the data ingest and training workflows may be the same. Readers will appreciate that a single shared storage data hub creates a coordination point throughout the lifecycle, without the need for extra data copies among the ingest, preprocessing, and training stages. Rarely is the ingested data used for only one purpose, and shared storage gives the flexibility to train multiple different models or to apply traditional analytics to the data.
Readers will appreciate that each stage in the AI data pipeline may have varying requirements of the data hub (e.g., the storage system or collection of storage systems). Scale-out storage systems must deliver uncompromising performance for all manner of access types and patterns, from small, metadata-heavy files to large files, from random access patterns to sequential access patterns, and from low to high concurrency. The storage systems described above may serve as an ideal AI data hub, as the systems may service unstructured workloads. In the first stage, data is ideally ingested and stored on to the same data hub that the following stages will use, in order to avoid excess data copying. The next two steps can be done on a standard compute server that optionally includes a GPU, and then in the fourth and last stage, full training production jobs are run on powerful GPU-accelerated servers. Often, there is a production pipeline alongside an experimental pipeline operating on the same dataset. Further, the GPU-accelerated servers can be used independently for different models or joined together to train on one larger model, even spanning multiple systems for distributed training. If the shared storage tier is slow, then data must be copied to local storage for each phase, resulting in wasted time staging data onto different servers. The ideal data hub for the AI training pipeline delivers performance similar to data stored locally on the server node while also having the simplicity and performance to enable all pipeline stages to operate concurrently.
In order for the storage systems described above to serve as a data hub or as part of an AI deployment, in some embodiments the storage systems may be configured to provide DMA between storage devices that are included in the storage systems and one or more GPUs that are used in an AI or big data analytics pipeline. The one or more GPUs may be coupled to the storage system, for example, via NVMe over Fabrics ("NVMe-oF"), such that bottlenecks such as the host CPU can be bypassed and the storage system (or one of the components contained therein) can directly access GPU memory. In such an example, the storage systems may leverage API hooks to the GPUs to transfer data directly to the GPUs. For example, the GPUs may be embodied as Nvidia™ GPUs, and the storage systems may support GPUDirect Storage ("GDS") software or have similar proprietary software that enables the storage system to transfer data to the GPUs via RDMA or a similar mechanism. Readers will appreciate that in embodiments in which the storage systems are embodied as cloud-based storage systems as described below, virtual drives or other components within such a cloud-based storage system may also be configured in a similar manner.
While the preceding paragraphs discuss deep learning applications, readers will appreciate that the storage systems described herein may also be part of a distributed deep learning ("DDL") platform to support the execution of DDL algorithms. The storage systems described above may also be paired with other technologies, such as TensorFlow, an open source software library for dataflow programming across a range of tasks that may be used for machine learning applications such as neural networks, to facilitate the development of such machine learning models, applications, and so on.
The storage systems described above may also be used in a neuromorphic computing environment. Neuromorphic computing is a form of computing that mimics brain cells. To support neuromorphic computing, an architecture of interconnected "neurons" replaces the traditional computing model, with low-power signals passing directly between neurons for more efficient computation. Neuromorphic computing may make use of very-large-scale integration (VLSI) systems containing electronic analog circuits that mimic neuro-biological architectures present in the nervous system, as well as analog, digital, mixed-mode analog/digital VLSI, and software systems that implement models of neural systems for perception, motor control, or multisensory integration.
Readers will appreciate that the storage systems described above may be configured to support the storage or use of (among other types of data) blockchains. In addition to supporting the storage and use of blockchain technologies, the storage systems described above may also support the storage and use of derivative items such as open source blockchains and related tools that are part of the IBM™ Hyperledger project, permissioned blockchains in which a certain number of trusted parties are allowed to access the blockchain, blockchain products that enable developers to build their own distributed ledger projects, and others. Blockchains and the storage systems described herein may be leveraged to support on-chain storage of data as well as off-chain storage of data.
Off-chain storage of data can be implemented in a variety of ways and can occur when the data itself is not stored within the blockchain. For example, in one embodiment, a hash function may be utilized and the data itself may be fed into the hash function to generate a hash value. In such an example, the hashes of large pieces of data may be embedded within transactions, instead of the data itself. Readers will appreciate that, in other embodiments, alternatives to blockchains may be used to facilitate the decentralized storage of information. For example, one alternative to a blockchain that may be used is a blockweave. While conventional blockchains store every transaction to achieve validation, a blockweave permits secure decentralization without the usage of the entire chain, thereby enabling low-cost on-chain storage of data. Such blockweaves may utilize a consensus mechanism that is based on proof of access (PoA) and proof of work (PoW).
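By way of a hedged illustration only, and not as a description of any particular blockchain implementation, the following Python sketch shows the hashing approach described above: only the hash of a large piece of data is embedded in a transaction, while the data itself remains off-chain. The function and variable names are illustrative assumptions.

```python
import hashlib

def anchor_off_chain(data: bytes, off_chain_store: dict) -> dict:
    """Keep the bulky data off-chain; embed only its hash in the transaction."""
    digest = hashlib.sha256(data).hexdigest()
    off_chain_store[digest] = data           # the data itself is stored off-chain
    return {"payload_hash": digest}          # only the hash value goes on-chain

def verify_off_chain(transaction: dict, off_chain_store: dict) -> bool:
    """Re-hash the off-chain data and compare it with the on-chain hash value."""
    data = off_chain_store.get(transaction["payload_hash"])
    return data is not None and \
        hashlib.sha256(data).hexdigest() == transaction["payload_hash"]

# Example: the hash of a large piece of data, rather than the data itself,
# is embedded within the transaction.
store = {}
tx = anchor_off_chain(b"a large piece of data ...", store)
assert verify_off_chain(tx, store)
```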
The storage systems described above may be used, alone or in combination with other computing devices, to support in-memory computing applications. In-memory computing involves storing information in RAM that is distributed across a cluster of computers. Readers will appreciate that the storage systems described above, especially those that are configurable with customizable amounts of processing resources, storage resources, and memory resources (e.g., those systems in which blades contain configurable amounts of each type of resource), may be configured in a way that provides an infrastructure that can support in-memory computing. Likewise, the storage systems described above may include component parts (e.g., NVDIMMs, 3D crosspoint storage that provides fast random access memory that is persistent) that can actually provide an improved in-memory computing environment as compared to in-memory computing environments that rely on RAM distributed across dedicated servers.
In some embodiments, the storage systems described above may be configured to operate as a hybrid in-memory computing environment that includes a universal interface to all storage media (e.g., RAM, flash storage, 3D crosspoint storage). In such embodiments, users may have no knowledge regarding the details of where their data is stored, but they can still address the data using the same full, unified API. In such embodiments, the storage system may (in the background) move data to the fastest layer available, including intelligently placing the data in dependence upon various characteristics of the data or upon some other heuristic. In such an example, the storage system may even make use of existing products such as Apache Ignite and GridGain to move data between the various storage layers, or the storage system may make use of custom software to move data between the various storage layers. The storage systems described herein may implement various optimizations to improve the performance of in-memory computing, such as having computations occur as close to the data as possible.
Readers will further appreciate that, in some embodiments, the storage systems described above may be paired with other resources to support the applications described above. For example, one infrastructure could include primary compute in the form of servers and workstations that specialize in using general-purpose computing on graphics processing units ("GPGPU") to accelerate deep learning applications and that are interconnected into a computation engine to train parameters for deep neural networks. Each system may have Ethernet external connectivity, InfiniBand external connectivity, some other form of external connectivity, or some combination thereof. In such an example, the GPUs can be grouped for a single large training job or used independently to train multiple models. The infrastructure could also include a storage system such as those described above to provide, for example, a scale-out all-flash file or object store through which data can be accessed via high-performance protocols such as NFS, S3, and so on. The infrastructure can also include, for example, redundant top-of-rack Ethernet switches connected to storage and compute via ports in MLAG port channels for redundancy. The infrastructure could also include additional compute in the form of whitebox servers, optionally with GPUs, for data ingestion, pre-processing, and model debugging. Readers will appreciate that additional infrastructures are also possible.
Readers will appreciate that the storage systems described above, either alone or in coordination with other computing machinery, may be configured to support other AI-related tools. For example, the storage systems may make use of tools like ONNX or other open neural network exchange formats that make it easier to transfer models written in different AI frameworks. Likewise, the storage systems may be configured to support tools like Amazon's Gluon that allow developers to prototype, build, and train deep learning models. In fact, the storage systems described above may be part of a larger platform, such as IBM™ Cloud Private for Data, that includes integrated data science, data engineering, and application building services.
Readers will further appreciate that the storage system described above may also be deployed as an edge solution. This edge solution may optimize a cloud computing system by performing data processing at the network edge near the data source. Edge computing can push applications, data, and computing power (i.e., services) from a central point to the logical extremes of the network. By using an edge solution such as the storage system described above, computing tasks can be performed using computing resources provided by such storage systems, data can be stored using storage resources of the storage system, and cloud-based services can be accessed using various resources (including network resources) of the storage system. By performing computing tasks on edge solutions, storing data on edge solutions, and typically utilizing edge solutions, consumption of expensive cloud-based resources can be avoided, and in fact, performance improvements can be experienced relative to greater reliance on cloud-based resources.
While many tasks may benefit from the utilization of an edge solution, some particular uses may be especially suited for deployment in such an environment. For example, devices such as drones, autonomous vehicles, robots, and others may require extremely rapid processing, so fast, in fact, that sending data up to a cloud environment and back to receive data processing support may simply be too slow. As an additional example, some IoT devices, such as connected video cameras, may not be well-suited for the utilization of cloud-based resources, as it may be impractical (not only from a privacy, security, or financial perspective) to send the data to the cloud simply because of the sheer volume of data involved. As such, many tasks that really involve data processing, storage, or communications may be better suited for platforms that include edge solutions such as the storage systems described above.
The storage systems described above may, alone or in combination with other computing resources, serve as a network edge platform that combines compute resources, storage resources, networking resources, cloud technologies, network virtualization technologies, and so on. As part of the network, the edge may take on characteristics similar to other network facilities, from the customer premises and backhaul aggregation facilities to points of presence (PoPs) and regional data centers. Readers will appreciate that network workloads, such as virtual network functions (VNFs) and others, will reside on the network edge platform. Enabled by a combination of containers and virtual machines, the network edge platform may rely on controllers and schedulers that are no longer geographically co-located with the data processing resources. As microservices, the functions may be split into control planes, user and data planes, or even state machines, allowing independent optimization and scaling techniques to be applied. Such user and data planes may be enabled through added accelerators, both those residing in server platforms (e.g., FPGAs and SmartNICs) and through SDN-enabled merchant silicon and programmable ASICs.
The storage systems described above may also be optimized for use in big data analytics. Big data analytics may be generally described as the process of examining large and varied data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful information that can help organizations make more informed business decisions. As part of that process, semi-structured and unstructured data (e.g., internet clickstream data, web server logs, social media content, text from customer emails and survey responses, mobile phone call detail records, IoT sensor data, and other data) may be converted to a structured form.
The storage systems described above may also support (including implementing as a system interface) applications that perform tasks in response to human speech. For example, the storage systems may support the execution of intelligent personal assistant applications such as Amazon's Alexa, Apple's Siri, Google Voice, Samsung's Bixby, Microsoft's Cortana, and others. While the examples described in the previous sentence make use of voice as input, the storage systems described above may also support chatbots, talkbots, chatterbots, artificial conversational entities, or other applications that are configured to conduct a conversation via auditory or textual methods. Likewise, the storage system may actually execute such an application to enable a user, such as a system administrator, to interact with the storage system via speech. Such applications are generally capable of voice interaction, music playback, making to-do lists, setting alarms, streaming podcasts, playing audiobooks, and providing weather, traffic, and other real-time information such as news, although in embodiments in accordance with the present disclosure such applications may be utilized as interfaces to various system management operations.
The storage systems described above may also implement an AI platform for delivering on the vision of self-driving storage. Such an AI platform may be configured to deliver global predictive intelligence by collecting and analyzing large numbers of storage system telemetry data points to enable effortless management, analytics, and support. In fact, such storage systems may be capable of predicting both capacity and performance, as well as generating intelligent advice on workload deployment, interaction, and optimization. Such an AI platform may be configured to scan all incoming storage system telemetry data against a library of issue fingerprints to predict and resolve incidents in real time, before they impact customer environments, and to capture hundreds of variables related to performance that are used to forecast performance load.
The storage systems described above may support artificial intelligence applications, machine learning applications, data analytics applications, data transformations, and others, which may together form an AI ladder. Such an AI ladder may effectively be formed by combining such elements to form a complete data science pipeline, where dependencies exist between elements of the AI ladder. For example, AI may require that some form of machine learning has taken place, machine learning may require that some form of analytics has taken place, analytics may require that some form of data and information architecting has taken place, and so on. As such, each element may be viewed as a rung in an AI ladder that can collectively form a complete and sophisticated AI solution.
The storage systems described above may also, either alone or in combination with other computing environments, be used to deliver an AI everywhere experience, where AI permeates wide and expansive aspects of business and life. For example, AI may play an important role in the delivery of deep learning solutions, deep reinforcement learning solutions, artificial general intelligence solutions, autonomous vehicles, cognitive computing solutions, commercial UAVs or drones, conversational user interfaces, enterprise taxonomies, ontology management solutions, machine learning solutions, smart dust, smart robots, smart workplaces, and so on.
The storage systems described above may also, either alone or in combination with other computing environments, be used to deliver a wide range of transparently immersive experiences (including those that use digital twins of various "things" such as people, places, processes, systems, and so on), where technology can introduce transparency between people, businesses, and things. Such transparently immersive experiences may be delivered as augmented reality technologies, connected homes, virtual reality technologies, brain-computer interfaces, human augmentation technologies, nanotube electronics, volumetric displays, 4D printing technologies, or others.
The storage system described above may also be used to support a wide variety of digital platforms, alone or in combination with other computing environments. Such digital platforms may include, for example, 5G wireless systems and platforms, digital twin platforms, edge computing platforms, ioT platforms, quantum computing platforms, serverless PaaS, software defined security, neuromorphic computing platforms, and the like.
The storage systems described above may also be part of a multi-cloud environment in which multiple cloud computing and storage services are deployed in a single heterogeneous architecture. In order to facilitate the operation of such a multi-cloud environment, DevOps tools may be deployed to enable orchestration across clouds. Likewise, continuous development and continuous integration tools may be deployed to standardize processes around continuous integration and delivery, new feature rollout, and provisioning of cloud workloads. By standardizing these processes, a multi-cloud strategy may be implemented that enables the utilization of the best provider for each workload.
The storage systems described above may be used as part of a platform to enable the use of crypto-anchors that may be used to authenticate a product's origin and contents to ensure that it matches a blockchain record associated with the product. Similarly, as part of a suite of tools to secure data stored on the storage system, the storage systems described above may implement various encryption technologies and schemes, including lattice cryptography. Lattice cryptography can involve constructions of cryptographic primitives that involve lattices, either in the construction itself or in the security proof. Unlike public-key schemes such as RSA, Diffie-Hellman, or elliptic-curve cryptosystems, which are vulnerable to attack by a quantum computer, some lattice-based constructions appear to be resistant to attack by both classical and quantum computers.
A quantum computer is a device that performs quantum computing. Quantum computing is computing that makes use of quantum-mechanical phenomena, such as superposition and entanglement. Quantum computers differ from traditional, transistor-based computers in that such traditional computers require data to be encoded into binary digits (bits), each of which is always in one of two definite states (0 or 1). In contrast to traditional computers, quantum computers use quantum bits, which can be in superpositions of states. A quantum computer maintains a sequence of qubits, where a single qubit can represent a one, a zero, or any quantum superposition of those two qubit states. A pair of qubits can be in any quantum superposition of 4 states, and three qubits in any superposition of 8 states. A quantum computer with n qubits can generally be in an arbitrary superposition of up to 2^n different states simultaneously, whereas a traditional computer can only be in one of these states at any one time. A quantum Turing machine is a theoretical model of such a computer.
The storage systems described above may also be paired with FPGA-accelerated servers as part of a larger AI or ML infrastructure. Such FPGA-accelerated servers may reside near the storage systems described above (e.g., in the same data center) or may even be incorporated into an appliance that includes one or more storage systems, one or more FPGA-accelerated servers, networking infrastructure that supports communications between the one or more storage systems and the one or more FPGA-accelerated servers, as well as other hardware and software components. Alternatively, the FPGA-accelerated servers may reside within a cloud computing environment that may be used to perform compute-related tasks for AI and ML jobs. Any of the embodiments described above may be used together as an FPGA-based AI or ML platform. Readers will appreciate that, in some embodiments of an FPGA-based AI or ML platform, the FPGAs that are contained within the FPGA-accelerated servers may be reconfigured for different types of ML models (e.g., LSTMs, CNNs, GRUs). The ability to reconfigure the FPGAs contained within the FPGA-accelerated servers may enable the acceleration of an ML or AI application based on the most optimal numerical precision and memory model being used. Readers will appreciate that by treating the collection of FPGA-accelerated servers as a pool of FPGAs, any CPU in the data center may utilize the pool of FPGAs as a shared hardware microservice, rather than limiting the use of dedicated accelerators to the servers into which they are plugged.
The FPGA-accelerated server and the GPU-accelerated server described above may implement a computational model in which instead of storing small amounts of data in the CPU and running long instruction streams thereon as occurs in more traditional computational models, machine learning models and parameters are fixed into high bandwidth on-chip memory, with large amounts of data flowing through the high bandwidth on-chip memory. For this computational model, the FPGA may even be more efficient than the GPU, as the FPGA may be programmed with only the instructions needed to run such computational model.
The storage system described above may be configured to provide parallel storage, for example, by using a parallel file system such as BeeGFS. Such parallel file systems may include a distributed metadata architecture. For example, a parallel file system may include multiple metadata servers across which metadata is distributed, as well as components including services for clients and storage servers.
The systems described above may support the execution of a wide array of software applications. Such software applications can be deployed in a variety of ways, including container-based deployment models. Containerized applications may be managed using a variety of tools. For example, containerized applications may be managed using Docker Swarm, Kubernetes, and others. Containerized applications may be used to facilitate a serverless, cloud-native computing deployment and management model for software applications. In support of a serverless, cloud-native computing deployment and management model for software applications, containers may be used as part of an event-handling mechanism (e.g., AWS Lambda) such that various events cause a containerized application to be spun up to operate as an event handler.
The systems described above may be deployed in a variety of ways, including being deployed in ways that support fifth generation ("5G") networks. 5G networks may support substantially faster data communications than previous generations of mobile communications networks and, as a consequence, may lead to the disaggregation of data and computing resources, as modern massive data centers may become less prominent and may be replaced, for example, by more local, micro data centers that are close to the mobile-network towers. The systems described above may be included in such local, micro data centers and may be part of, or paired with, multi-access edge computing ("MEC") systems. Such MEC systems may enable cloud computing capabilities and an IT service environment at the edge of the cellular network. By running applications and performing related processing tasks closer to the cellular customer, network congestion may be reduced and applications may perform better.
The storage systems described above may also be configured to implement NVMe Zoned Namespaces ("ZNS"). Through the use of NVMe Zoned Namespaces, the logical address space of a namespace is divided into zones. Each zone provides a logical block address range that must be written sequentially and explicitly reset before rewriting, thereby enabling the creation of namespaces that expose the natural boundaries of the device and offload management of the internal mapping tables to the host. To implement ZNS, a ZNS SSD or some other form of zoned block device may be utilized that exposes a namespace logical address space using zones. With the zones aligned to the internal physical properties of the device, several inefficiencies in the placement of data can be eliminated. In such embodiments, each zone may be mapped, for example, to a separate application, such that functions like wear leveling and garbage collection can be performed on a per-zone or per-application basis rather than across the entire device. In order to support ZNS, the storage controllers described herein may be configured to interact with the zoned block device through the use of, for example, the Linux™ kernel zoned block device interface or other tools.
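As a hedged illustration only (a minimal, self-contained sketch under assumed zone sizes and names, and not the Linux™ kernel zoned block device interface or any vendor API), the following Python fragment models the sequential-write and explicit-reset rules that a zone imposes:

```python
class Zone:
    """Toy model of one zone in a zoned namespace: writes must be sequential
    at the write pointer, and the zone must be reset before rewriting."""

    def __init__(self, start_lba: int, num_blocks: int):
        self.start_lba = start_lba
        self.num_blocks = num_blocks
        self.write_pointer = start_lba       # next LBA that may be written
        self.state = "empty"                 # empty -> open (implicitly) -> full

    def append(self, blocks: int) -> int:
        """Sequential write at the write pointer; returns the LBA written."""
        end = self.start_lba + self.num_blocks
        if self.write_pointer + blocks > end:
            raise IOError("zone boundary reached: reset required before rewriting")
        lba = self.write_pointer
        self.write_pointer += blocks
        self.state = "full" if self.write_pointer == end else "open"
        return lba

    def reset(self) -> None:
        """Explicit reset returns the zone to the empty state."""
        self.write_pointer = self.start_lba
        self.state = "empty"

# Example: two sequential writes fill the zone; a reset is required to rewrite it.
zone = Zone(start_lba=0, num_blocks=8)
first = zone.append(4)      # lands at LBA 0
second = zone.append(4)     # lands at LBA 4; the zone is now full
zone.reset()                # the zone may now be written again from the start
```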
The storage systems described above may also be configured to implement partitioned storage in other ways, such as through the use of shingled magnetic recording (SMR) storage devices. Where partitioned storage is used, device-managed embodiments may be deployed, in which the storage device hides this complexity by managing it in firmware and presenting an interface like any other storage device. Alternatively, partitioned storage may be implemented via a host-managed embodiment, which relies on the operating system to know how to handle the drive and to write sequentially only to certain regions of the drive. Partitioned storage may similarly be implemented using a host-aware embodiment, in which a combination of the drive-managed and host-managed implementations is deployed.
For further explanation, fig. 3D illustrates an exemplary computing device 350 that may be specifically configured to perform one or more of the processes described herein. As shown in fig. 3D, computing device 350 may include a communication interface 352, a processor 354, a storage 356, and an input/output ("I/O") module 358 communicatively connected to each other via a communication infrastructure 360. Although the exemplary computing device 350 is shown in fig. 3D, the components illustrated in fig. 3D are not intended to be limiting. In other embodiments, additional or alternative components may be used. The components of the computing device 350 shown in fig. 3D will now be described in more detail.
The communication interface 352 may be configured to communicate with one or more computing devices. Examples of communication interface 352 include, but are not limited to, a wired network interface (e.g., a network interface card), a wireless network interface (e.g., a wireless network interface card), a modem, an audio/video connection, and any other suitable interface.
Processor 354 generally represents any type or form of processing unit capable of processing data and/or interpreting, executing, and/or directing execution of one or more instructions, processes, and/or operations described herein. The processor 354 may perform operations by executing computer-executable instructions 362 (e.g., application programs, software, code, and/or other executable data examples) stored in the storage 356.
The storage 356 may include one or more data storage media, devices, or configurations, and may take any type, form, and combination of data storage media and/or devices. For example, storage 356 may include, but is not limited to, any combination of the non-volatile media and/or volatile media described herein. Electronic data, including the data described herein, may be temporarily and/or permanently stored in the storage 356. For example, data representing computer-executable instructions 362 configured to direct processor 354 to perform any of the operations described herein may be stored within storage 356. In some examples, the data may be arranged in one or more databases residing within the storage 356.
The I/O modules 358 may include one or more I/O modules configured to receive user input and provide user output. The I/O module 358 may include any hardware, firmware, software, or combination thereof that supports input and output capabilities. For example, the I/O module 358 may include hardware and/or software for capturing user input, including but not limited to a keyboard or keypad, a touch screen component (e.g., a touch screen display), a receiver (e.g., an RF or infrared receiver), a motion sensor, and/or one or more input buttons.
The I/O module 358 may include one or more devices for presenting output to a user, including but not limited to a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., a display driver), one or more audio speakers, and one or more audio drivers. In certain embodiments, the I/O module 358 is configured to provide graphical data to a display for presentation to a user. The graphical data may represent one or more graphical user interfaces and/or any other graphical content that may serve a particular implementation. In some examples, any of the systems, computing devices, and/or other components described herein may be implemented by computing device 350.
The storage systems described above may be configured, alone or in combination, to serve as a continuous data protection store. A continuous data protection store is a feature of a storage system that records updates to a data set such that consistent images of prior contents of the data set can be accessed with a low time granularity (typically on the order of seconds, or even less) extending back for a reasonable period of time (often hours or days). These allow access to very recent consistent points in time for the data set, and also allow access to points in time that may have just preceded an event that, for example, caused parts of the data set to be corrupted or otherwise lost, while retaining close to the maximum number of updates that preceded that event. Conceptually, this is like a sequence of snapshots of a data set taken very frequently and kept for a long period of time, although continuous data protection stores are often implemented quite differently from snapshots. A storage system implementing a continuous data protection store may further provide a means of accessing these points in time, accessing one or more of these points in time as snapshots or as cloned copies, or reverting the data set back to one of those recorded points in time.
Over time, to reduce overhead, some points in time held in a continuous data protection store can be merged with other nearby points in time, essentially deleting some of those points in time from the store. This can reduce the capacity needed to store updates. It may also be possible to convert a limited number of these points in time into snapshots of longer duration. For example, such a store might keep a low-granularity sequence of points in time stretching back a few hours from the present, with some points in time merged or deleted to reduce overhead for up to an additional day. Stretching back further into the past, some of these points in time may be converted into snapshots representing consistent point-in-time images from only every few hours.
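A hedged Python sketch of the thinning just described is shown below. The retention windows (full granularity for a few hours, then roughly one retained point per few-hour interval) are illustrative assumptions rather than parameters of the disclosure.

```python
def thin_recovery_points(points: list[float], now: float,
                         fine_window: float = 6 * 3600,
                         coarse_interval: float = 4 * 3600) -> list[float]:
    """Keep every recorded point in time inside the fine-grained window; beyond
    it, keep roughly one point per coarse interval (the rest are merged away)."""
    kept, last_bucket = [], None
    for t in sorted(points):
        age = now - t
        if age <= fine_window:
            kept.append(t)                   # recent: keep full granularity
        else:
            bucket = int(age // coarse_interval)
            if bucket != last_bucket:        # older: one point per interval
                kept.append(t)
                last_bucket = bucket
    return kept

# Example: points recorded every 10 minutes over two days are thinned so that
# only the most recent few hours retain their full granularity.
now = 2 * 24 * 3600.0
recorded = [600.0 * i for i in range(int(now // 600))]
thinned = thin_recovery_points(recorded, now)
```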
Although some embodiments are described primarily in the context of a storage system, those skilled in the art will recognize that embodiments of the disclosure may also take the form of a computer program product disposed on a computer-readable storage medium for use with any suitable processing system. Such computer-readable storage media may be any storage media for machine-readable information, including magnetic, optical, solid-state, or other suitable media. Examples of such media include magnetic disks in hard or floppy disk drives, optical disks for optical drives, magnetic tape, and other media as would occur to one of skill in the art. Those skilled in the art will immediately recognize that any computer system having suitable programming means will be capable of executing the steps described herein, as embodied in a computer program product. Those skilled in the art will also recognize that, while some embodiments described in this specification are directed to software installed and executed on computer hardware, alternative embodiments implemented as firmware or hardware are well within the scope of the disclosure.
In some examples, a non-transitory computer-readable medium storing computer-readable instructions may be provided in accordance with the principles described herein. The instructions, when executed by a processor of a computing device, may direct the processor and/or the computing device to perform one or more operations, including one or more operations described herein. Such instructions may be stored and/or transmitted using any of a variety of known computer-readable media.
The non-transitory computer-readable media referred to herein may comprise any non-transitory storage medium that participates in providing data (e.g., instructions) that may be read and/or executed by a computing device (e.g., by a processor of a computing device). For example, a non-transitory computer-readable medium may include, but is not limited to, any combination of non-volatile storage media and/or volatile storage media. Exemplary non-volatile storage media include, but are not limited to, read-only memory, flash memory, solid state drives, magnetic storage devices (e.g., hard disks, floppy disks, tape, etc.), ferroelectric random access memory ("RAM"), and optical disks (e.g., optical disks, digital video disks, blu-ray disks, etc.). Exemplary volatile storage media include, but are not limited to, RAM (e.g., dynamic RAM).
The advantages and features of the present disclosure may be further described by the following statements:
1. A method comprising: adjusting, by a processing device of a storage controller, a storage bandwidth of a storage system process in response to an input-output (I/O) write request to write data to a partitioned storage device by: calculating an allocation share of the storage system process requesting to write the data; and, upon determining that the open region usage of the storage system process is below the allocated share of the storage system process, opening a new region for the storage system process.
2. The method of statement 1, wherein the allocation share of the storage system process is calculated using a target ratio of open areas assigned to the storage system process, a target ratio of open areas assigned to other storage system processes having open areas, and a target number of open areas of the storage system.
3. The method of any of statements 1-2, wherein determining the allocation share of the storage system process further comprises calculating the allocation share of the storage system process using (i) a ratio between the target ratio of open areas assigned to the storage system process and the total of the target ratios of open areas assigned to a plurality of storage system processes having open areas, and (ii) a target number of open areas.
4. The method of any of statements 1-3, wherein adjusting the storage bandwidth of the storage system process further comprises: upon determining that the open area usage of the storage system process is not below the allocated share of the storage system process, identifying a pool of open areas that are not used by other storage system processes by determining differences between the other allocated shares of the other storage system processes and their open area usage.
5. The method of any of statements 1-4, wherein adjusting the storage bandwidth of the storage system process further comprises: upon determining that the open area usage of the storage system is below a target number of open areas of the storage system, a new area is opened for the storage system process.
6. The method of any of statements 1-5, wherein determining the allocated share of the storage system process is responsive to determining that the open area usage of the storage system is not below the target number of open areas of the storage system.
7. The method of any one of statements 1-6, further comprising: upon determining that the storage system process does not have an open region, a new region is opened for the storage system process and the storage bandwidth of the storage system process is adjusted to include the new region.
One or more embodiments may be described herein by means of method steps illustrating the performance of specified functions and relationships thereof. The boundaries and sequences of these functional building blocks and method steps have been arbitrarily defined for the convenience of the description. Alternate boundaries and sequences may be defined so long as the specified functions and relationships are appropriately performed. Accordingly, any such alternate boundaries or sequences are within the scope and spirit of the claims. Furthermore, the boundaries of these functional building blocks have been arbitrarily defined for the convenience of the description. Alternate boundaries may be defined so long as certain important functions are properly performed. Similarly, flow chart blocks may also be arbitrarily defined herein to illustrate certain important functionalities.
Within the scope of use, the flow chart block boundaries and sequences may be otherwise defined and still perform some significant functionality. Accordingly, such alternatives of functional building blocks and flow charts and sequences are defined within the scope and spirit of the claims. One of ordinary skill in the art will also recognize that the functional building blocks and other illustrative blocks, modules, and components herein may be implemented as illustrated or by discrete components, application specific integrated circuits, processors executing appropriate software, and the like, or any combination thereof.
Although specific combinations of various functions and features of one or more embodiments are expressly described herein, other combinations of these features and functions are likewise possible. The present disclosure is not limited by the particular examples disclosed herein and expressly incorporates these other combinations.
FIG. 4 is a flow chart illustrating a method for determining whether to adjust the storage bandwidth of a storage system process according to some embodiments. The method 400 may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, programmable logic, microcode), software (e.g., instructions run on a processing device to perform hardware simulation), or a combination thereof. In some implementations, processing logic of a storage controller of one of the storage systems of FIGS. 1A-3D may perform some or all of the operations described herein.
The method 400 begins at block 405, where processing logic receives an input-output (I/O) write request to write data to the storage system from a storage system process (e.g., storage system process 715A of FIG. 7). In an embodiment, the I/O write request may be an I/O command received by processing logic and sent by a storage system process. In an embodiment, a storage system process (also referred to herein as a "client process") may refer to a particular writer or client (e.g., an application or a sub-application, such as a plug-in) that performs operations in the storage system. In an embodiment, the storage system processes may include background processes or front-end processes performed by the storage system. For example, among other storage system processes, the background storage system processes may include a garbage collection (GC) process, a flush process, a replication process, a deduplication process, or a pyramid process (e.g., for metadata of a log-structured database). The front-end processes may include storing files or data on behalf of client devices.
At block 410, processing logic determines whether the storage system process has an open segment. In an embodiment, once a segment is associated with a particular storage system process, the segment remains associated with the particular storage system process after the segment is closed. It will be appreciated that segments may be re-associated with other storage system processes from time to time. For example, data from a particular segment associated with a particular storage system process may be erased and the segment reopened for a different storage system process (or the same storage system process). In an embodiment, segments associated with a particular storage system process are populated with data from the particular storage system process but not with data from other storage system processes.
At block 415, in response to determining that the storage system process does not have an open segment, processing logic opens a new segment for the storage system process. It may be noted that in embodiments, a storage system process without at least one open segment will not be "starved" and will be assigned one open segment.
In an alternative embodiment, at block 420, in response to determining that the storage system process does have an open segment, processing logic may determine whether the storage system process has reached an open segment limit for the storage system process. In embodiments, the open segment limit (also referred to herein as a "maximum span limit") may be the maximum number of open segments that may be opened on behalf of a particular storage system process. In an embodiment, the open segment limit may be set by an administrator. Processing logic may compare the number of open segments of a particular process (e.g., open segment usage 710 of storage system process 715 of FIG. 7) with the open segment limit of the storage system process to make the determination. In response to determining that the storage system process has met the associated open segment limit, processing logic may move to block 425 and write the data to an existing open segment associated with the storage system process. In response to determining that the storage system process has not met the associated open segment limit, processing logic may move to block 430 and adjust the storage bandwidth of the storage system process (e.g., adjust the number of open segments of the storage system process). In other implementations, processing logic may move directly from block 410 to block 430.
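The decision flow of blocks 405 through 430 may be summarized with the following hedged Python sketch. The process object, its fields, and the callables passed in are hypothetical stand-ins, not the storage controller's actual interfaces.

```python
from types import SimpleNamespace

def handle_write_request(proc, adjust_storage_bandwidth, write_to_open_segment):
    """Sketch of FIG. 4: a process with no open segment is always granted one;
    otherwise the open segment limit is consulted before adjusting bandwidth."""
    if proc.open_segment_usage == 0:                        # blocks 410 and 415
        proc.open_segment_usage += 1                        # open a new segment
        return "opened new segment"
    if proc.open_segment_usage >= proc.open_segment_limit:  # block 420
        write_to_open_segment(proc)                         # block 425
        return "wrote to existing open segment"
    return adjust_storage_bandwidth(proc)                   # block 430

# Example: a process already at its limit writes into an existing open segment.
proc = SimpleNamespace(open_segment_usage=3, open_segment_limit=3)
result = handle_write_request(proc, lambda p: "adjusted bandwidth", lambda p: None)
assert result == "wrote to existing open segment"
```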
FIG. 5 is a flow chart illustrating a method for adjusting the storage bandwidth of a storage system process according to some embodiments. The method 500 may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, programmable logic, microcode), software (e.g., instructions run on a processing device to perform hardware simulation), or a combination thereof. In some implementations, processing logic of a storage controller of one of the storage systems of FIGS. 1A-3D may perform some or all of the operations described herein.
The method 500 begins at block 505, where processing logic determines whether the open segment usage of the storage system (e.g., open segment usage 710 of FIG. 7) is below a target number of open segments of the storage system (also referred to herein as the "target parallelism", e.g., target parallelism 725 of FIG. 7). Open segment usage may refer to the number of open segments that the storage system, or a particular storage system process, has actively open in any given instance. The open segment usage of the storage system may refer to the total open segments of all storage system processes (e.g., a predetermined group of processes) active in the storage system. It may be noted that a storage system process may be idle and have no open segments. An idle storage system process is not counted toward the open segment usage (or contributes 0 to the value). The target parallelism (or target number of open segments of the storage system) may refer to a predetermined soft target amount of open segments allocated at any given time in the storage system. In one example, the target parallelism may be the number of dies per storage drive multiplied by the number of write groups controlled by a particular host controller (e.g., storage array controllers 110A and 110B). It may be noted that the actual open segment usage of the storage system may be the same as, higher than, or lower than the target parallelism. In one example, to determine whether the open segment usage of the storage system is below the target number of open segments of the storage system, the storage system may subtract the open segment usage from the target parallelism. A remainder greater than 1 indicates that the open segment usage of the storage system is below the target number of open segments of the storage system. A remainder equal to or less than 1 (e.g., oversubscription) indicates that the open segment usage of the storage system is not below the target number of open segments of the storage system.
At block 510, in response to determining that the open segment usage of the storage system is below the target number of open segments of the storage system, processing logic opens a new segment for the storage system process. In response to determining that the open segment usage of the storage system is not below the target number of open segments of the storage system (e.g., full or oversubscribed), processing logic moves to block 515 and determines an allocation share (e.g., allocation share 720, also referred to as a "fair share") of the storage system process requesting the write of the data. An allocation share may refer to a distinct target number of open segments for a given storage system process in a given instance, where the allocation share is tunable at runtime. The operation of block 515 is further described with respect to FIG. 7.
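The system-wide check of block 505 may be sketched as follows. This is a hedged Python sketch that follows the remainder convention stated above; the function and variable names are illustrative.

```python
def storage_bandwidth_check(open_segment_usage: int, target_parallelism: int) -> str:
    """Block 505: subtract the system-wide open segment usage from the target
    parallelism. Per the convention above, a remainder greater than 1 permits a
    new segment to be opened outright (block 510); otherwise the per-process
    allocation share must be computed and consulted (block 515)."""
    remainder = target_parallelism - open_segment_usage
    if remainder > 1:
        return "open new segment"            # block 510
    return "determine allocation share"      # block 515

# Example: a system with 95 open segments against a target parallelism of 100
# may open a new segment immediately.
assert storage_bandwidth_check(95, 100) == "open new segment"
assert storage_bandwidth_check(100, 100) == "determine allocation share"
```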
FIG. 6 is a flow diagram illustrating a method for determining an allocation share of a storage system process according to some embodiments. Method 600 may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, programmable logic, microcode), software (e.g., instructions run on a processing device to perform hardware simulation), or a combination thereof. In some implementations, processing logic of a storage controller of one of the storage systems of FIGS. 1A-3D may perform some or all of the operations described herein.
Method 600 begins at block 605, where processing logic determines a ratio between the target ratio (e.g., a quota, such as quota 730A of FIG. 7) of open segments assigned to the storage system process and the total of the target ratios (e.g., quotas 730 of FIG. 7) of open segments assigned to the plurality of storage system processes having open segments. The quota (or target ratio of open segments) may refer to a value indicating the target share of open segments for a particular storage system process. In some embodiments, the quota may use the target parallelism as a scaling factor.
For example, FIG. 7 illustrates quotas 730 for three different storage system processes 715 with open segments. It may be noted that idle storage system processes are not shown in FIG. 7 because an idle storage system process has no open segments allocated. Quota 730A for storage system process 715A is 7, quota 730B for storage system process 715B is 2, and quota 730C for storage system process 715C is 1. The quotas of the storage system processes 715 may be assigned by an administrator. For example, the ratio between quota 730A assigned to storage system process 715A and the total of the quotas 730 assigned to the storage system processes 715 with open segments may be calculated by dividing quota 730A by the sum of quotas 730A through 730C (e.g., ratio = 7/(7+2+1) = 0.7).
At block 610, processing logic determines the target number of open segments of the storage system. For example, in FIG. 7, the target number of open segments of the storage system is 100 (e.g., target parallelism 725). At block 615, processing logic calculates the allocation share of the storage system process using the ratio and the target number of open segments. For example, in FIG. 7, the allocation share 720A of storage system process 715A is the ratio calculated above (0.7) multiplied by the target parallelism 725 of 100 (0.7 × 100 = 70). The allocation share 720A of storage system process 715A is therefore 70 open segments. It may be noted that, in another example, parameters such as the open segment usage 710 or the number of non-idle storage system processes may change, which may result in an adjustment of the allocation share 720 for any given storage system process. It may be noted that the allocation shares 720 for the other storage system processes 715 may be determined in a similar manner as described above.
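A minimal Python sketch of the calculation in blocks 605 through 615 is shown below, using the illustrative FIG. 7 values (quotas of 7, 2, and 1 and a target parallelism of 100); the process labels are taken from the figure and the code is illustrative only.

```python
def allocation_share(quota: int, quotas_with_open_segments: list[int],
                     target_parallelism: int) -> float:
    """Allocation share = (process quota / sum of quotas of processes with open
    segments) * target number of open segments (blocks 605-615)."""
    ratio = quota / sum(quotas_with_open_segments)
    return ratio * target_parallelism

# FIG. 7 example: quotas 730A-C of 7, 2 and 1 and a target parallelism 725 of 100.
quotas = {"715A": 7, "715B": 2, "715C": 1}
shares = {name: round(allocation_share(q, list(quotas.values()), 100))
          for name, q in quotas.items()}
assert shares == {"715A": 70, "715B": 20, "715C": 10}
```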
Returning to the description of FIG. 5, at block 520, processing logic determines whether the open segment usage of the storage system process is below the allocation share of the storage system process.
In response to determining that the open segment usage of the storage system process is below the allocated share of the storage system process, processing logic moves to block 525 and opens a new segment for the storage system process. For example, in FIG. 7, storage system process 715A has an allocation share 720A of 70. The open segment usage 710 of storage system process 715A is 65 open segments, which is 5 open segments below allocation share 720A. If storage system process 715A is the storage system process that sent the I/O write request, processing logic will grant an open segment (e.g., up to 5 additional open segments) because the open segment usage 710 of storage system process 715A is below the allocation share 720A of storage system process 715A.
In response to determining that the open segment usage of the storage system process is not below the allocated share of the storage system process, processing logic moves to block 530 and determines the other allocation shares of the other storage system processes having open segments. For example, in FIG. 7, if storage system process 715C is the storage system process that has sent the I/O write request to write data to the storage system, processing logic will determine that the open segment usage 710 of storage system process 715C is 20 open segments, which is above its allocation share 720C (e.g., 10 open segments). Processing logic may determine that the allocation shares 720A and 720B of the other storage system processes 715A and 715B are 70 and 20 open segments, respectively. Processing logic may determine the allocation shares of the other storage system processes in a similar manner as described above. It may be noted that, for clarity in the remaining description of FIG. 5, storage system process 715C is the storage system process that sent the I/O write request, and storage system processes 715A and 715B are the other storage system processes, unless otherwise described.
At block 535, processing logic determines the open segment usage of the other storage system processes, such as storage system processes 715A and 715B (e.g., 65 and 15 open segments, respectively). At block 540, processing logic identifies a pool of segments that are not used by the other storage system processes (e.g., storage system processes 715A and 715B) by determining the differences between the other allocation shares (e.g., allocation shares 720A and 720B) of the other storage system processes 715A and 715B and their open segment usage 710. For example, the other storage system processes 715A and 715B each have a difference of 5 unused open segments between their allocation shares 720A and 720B and their open segment usage 710 (e.g., 70 − 65 = 5 and 20 − 15 = 5). The unused open segments of storage system processes 715A and 715B may be added to the open segment pool.
At block 545, processing logic distributes new segments from the segment pool to the storage system process. For example, if a new storage system process (not shown) requests an additional open segment (e.g., having at least 1 open segment prior to the request), the allocation shares 720 may be recalculated to account for the new storage system process. If the new storage system process is under its recalculated allocation share, the new storage system process may receive some or all of the new open segments from the open segment pool. In other implementations, the open segment pool may be divided between oversubscribed storage system processes (e.g., those above the calculated allocation share for the particular storage system process). In some implementations, the open segment pool may be divided equally between the oversubscribed storage system processes. In other embodiments, the open segment pool may be divided between the oversubscribed storage system processes according to the ratio of the quotas 730 of those storage system processes. For example, if a pool of 10 open segments is split between the oversubscribed storage system process 715C, with a quota 730C of 1, and a new storage system process (not shown) with a quota of 4, then storage system process 715C may obtain one fifth of the open segment pool (e.g., 1/5 of 10 = 2 open segments) and the new storage system process may obtain four fifths of the open segment pool (e.g., 4/5 of 10 = 8 open segments). Storage system process 715C may thus obtain its allocation share 720C of 10 open segments plus another 2 open segments from the open segment pool, for a total of 12 open segments. It may be noted that the 20 open segments already assigned to storage system process 715C are not removed from storage system process 715C, but in an embodiment storage system process 715C may not obtain a new open segment unless the storage system experiences a change in an operating parameter, such as a change in the open segment pool or a change in the allocation share 720C.
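A hedged Python sketch of the pooling and quota-weighted split described above is shown below, reproducing the example of a pool of 10 open segments split 1:4 into grants of 2 and 8; the names and the use of integer division are illustrative assumptions.

```python
def unused_segment_pool(processes: dict) -> int:
    """Blocks 530-540: sum, over processes below their allocation share, of the
    difference between the allocation share and the open segment usage."""
    return sum(max(p["share"] - p["usage"], 0) for p in processes.values())

def split_pool_by_quota(pool: int, oversubscribed_quotas: dict) -> dict:
    """Block 545: divide the pool between oversubscribed processes in the
    ratio of their quotas (integer division used here for simplicity)."""
    total = sum(oversubscribed_quotas.values())
    return {name: pool * quota // total
            for name, quota in oversubscribed_quotas.items()}

# FIG. 7 example: processes 715A and 715B each leave 5 unused open segments,
# giving a pool of 10; splitting that pool between quota 1 (process 715C) and
# quota 4 (a new process) yields 2 and 8 open segments, respectively.
pool = unused_segment_pool({"715A": {"share": 70, "usage": 65},
                            "715B": {"share": 20, "usage": 15}})
grants = split_pool_by_quota(pool, {"715C": 1, "new_process": 4})
assert pool == 10 and grants == {"715C": 2, "new_process": 8}
```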
In some embodiments, aspects of the present disclosure may be applied to storage systems that utilize partitioned storage. Thus, the embodiments previously described in FIGS. 4 through 7 may be applied to the zones of a partitioned storage device, in addition to or in place of segments. As previously described with respect to FIG. 1A, in a partitioned storage device, the zoned namespace on the partitioned storage device may be addressed by groups of blocks that are grouped and aligned by a natural size, forming a plurality of addressable regions. In implementations utilizing SSDs, the natural size may be based on the erase block size of the SSD. The regions of the partitioned storage device may be in different states. A region may be in an empty state, in which no data is stored in the region. An empty region may be opened explicitly, or may be opened implicitly by writing data to the region. This is the initial state for regions on a fresh partitioned storage device, but it may also be the result of a region reset. In some implementations, an empty region may have a designated location within the flash memory of the partitioned storage device. In an embodiment, the location of an empty region may be chosen the first time the region is opened or written to (or later, if writes are buffered into memory). A region may be in an open state, either implicitly or explicitly, and a region in the open state may be written to store data using write or append commands.
Processing logic of a storage controller of a storage system may adjust a storage bandwidth of a storage system process requesting writing data to the storage system by calculating an allocation share of the storage system process. Processing logic may use the calculated allocation shares to determine whether to open additional regions of one or more partitioned storage devices to facilitate executing storage system processes in parallel with other storage system processes, as will be described in more detail below.
FIG. 8 is an illustration of an example of a storage system 800 that utilizes parameters to determine an allocation share of a storage system process in accordance with embodiments of the present disclosure. As previously described with respect to FIG. 6, processing logic of a storage system controller of storage system 800 determines a ratio between the target ratio (e.g., a quota, such as quota 830A of FIG. 8) of open regions assigned to the storage system process and the total of the target ratios (e.g., quotas 830 of FIG. 8) of open regions assigned to the storage system processes having open regions. The quota (or target ratio of open regions) may refer to a value indicating the target share of open regions for a particular storage system process. In some embodiments, the quota may use the target parallelism 825 as a scaling factor.
In an embodiment, a storage system process may be an aspect of the processing logic of a processing device of storage system 800, where that aspect of the logic obtains open regions of storage system 800 and eventually finalizes the regions it has obtained. Storage system 800 may contain multiple storage system processes competing with one another for open regions. These multiple storage system processes may run within storage system 800 in various ways or contexts and may be aspects of the processing logic of one or more processing devices of storage system 800. These aspects may not be simple operating system processes, but may instead be tasks, subsystems, work queues, or other data or computing structures and relationships that operate within or across various operating system processes or threads of the storage system, where those tasks, subsystems, work queues, or other data or computing structures and relationships implement aspects of storage system processing logic that obtain, and eventually finalize, regions as part of implementing that particular logic.
For example, FIG. 8 illustrates quotas 830 for three different storage system processes 815 with open regions. It may be noted that some storage system processes of storage system 800 may not be shown in FIG. 8 if those storage system processes do not have allocated open regions. For example, processing logic of the storage system may determine that a particular storage system process has no additional work to do and may free some or all of the regions previously allocated to that particular storage system process. Quota 830A for storage system process 815A is 7, quota 830B for storage system process 815B is 2, and quota 830C for storage system process 815C is 1. In an embodiment, the quotas of the storage system processes 815 may be assigned by an administrator. For example, the ratio between quota 830A assigned to storage system process 815A and the total of the quotas 830 assigned to the storage system processes 815 with open regions may be calculated by dividing quota 830A by the sum of quotas 830A through 830C (e.g., ratio = 7/(7+2+1) = 0.7).
Processing logic of storage system 800 may utilize the target parallelism 825 of storage system 800 to determine allocation shares 820A-C of storage system processes 815A-C. For example, in FIG. 8, the allocation share 820A of storage system process 815A is the target parallelism 825 (e.g., 100) multiplied by the ratio calculated above (e.g., 0.7 × 100 = 70). The allocation share 820A of storage system process 815A is therefore 70 open regions. It may be noted that in another example, a parameter such as the open region usage 810 or the number of storage system processes having allocated regions may change, which may result in an adjustment of the allocation share 820 for any given storage system process. It may be noted that the allocation shares 820 of the other storage system processes 815 may be determined in a similar manner as described above.
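As a concrete illustration of the calculation above, the following Python sketch computes allocation shares from quotas and a target parallelism. The numbers mirror the FIG. 8 example (quotas of 7, 2, and 1 and a target parallelism of 100); the function name and data layout are assumptions made for illustration, not the disclosed implementation.

```python
def allocation_shares(quotas: dict[str, int], target_parallelism: int) -> dict[str, float]:
    """Allocation share = target parallelism * (process quota / sum of quotas
    of the processes that currently have open regions)."""
    total_quota = sum(quotas.values())
    return {proc: target_parallelism * q / total_quota for proc, q in quotas.items()}

quotas = {"815A": 7, "815B": 2, "815C": 1}          # quotas 830A-C from FIG. 8
shares = allocation_shares(quotas, target_parallelism=100)
print(shares)  # {'815A': 70.0, '815B': 20.0, '815C': 10.0}
```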
As previously described with respect to FIG. 5, processing logic determines whether the open region usage 810 of a storage system process 815 is below the allocation share 820 of that storage system process. If processing logic determines that the open region usage 810 of storage system process 815 is below the allocation share 820 of storage system process 815, processing logic opens a new region for storage system process 815. For example, in FIG. 8, storage system process 815A has an allocation share 820A of 70. The open region usage 810A of storage system process 815A is 65 open regions, which is 5 open regions below the allocation share 820A. If storage system process 815A is the storage system process that sent the I/O write request, processing logic will grant open regions (e.g., up to 5 additional open regions) because the open region usage 810A of storage system process 815A is below the allocation share 820A of storage system process 815A.
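A minimal sketch of the check described in this paragraph, reusing the shares computed above, might look as follows; the helper name and variables are illustrative assumptions only.

```python
def may_open_new_region(usage: int, share: float) -> bool:
    """Grant a new open region only while usage is below the allocation share."""
    return usage < share

# FIG. 8 example: process 815A uses 65 open regions against a share of 70,
# so a write request from 815A may be granted up to 5 additional open regions.
usage_815a, share_815a = 65, 70.0
print(may_open_new_region(usage_815a, share_815a))  # True
print(int(share_815a - usage_815a))                 # 5
```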
In response to determining that the open region usage of the storage system process is not below the allocation share of the storage system process, processing logic determines the allocation shares of the other storage system processes having open regions, as previously described with respect to FIG. 5. For example, in FIG. 8, if storage system process 815C is the storage system process that has sent an I/O write request to write data to the storage system, processing logic will determine that the open region usage 810C of storage system process 815C is 20 open regions, which is higher than allocation share 820C (e.g., 10 open regions). Processing logic may determine that the allocation shares 820A and 820B of the other storage system processes 815A and 815B are 70 and 20 open regions, respectively. Processing logic may determine the allocation shares of other storage system processes in a similar manner as described above. It may be noted that, for clarity in the remaining description of FIG. 8, storage system process 815C is the storage system process that sent the I/O write request, and storage system processes 815A and 815B are the other storage system processes, unless otherwise described. It may also be noted that the values used in FIG. 8 are shown for illustration purposes only, and that embodiments of the present disclosure may utilize any number of storage system processes, open regions, allocation shares, quotas, and so forth.
Having determined the open region usage 810A-C of storage system processes 815A-C, processing logic identifies a pool of open regions that are not used by the other storage system processes (e.g., storage system processes 815A and 815B) by determining the differences between the allocation shares of the other storage system processes (e.g., allocation shares 820A and 820B) and their open region usage 810. For example, the other storage system processes 815A and 815B, with open region usages 810A and 810B of 65 and 15 open regions, respectively, each have a difference of 5 unused open regions between their allocation shares 820A and 820B and their open region usage. The unused open regions of storage system processes 815A and 815B may be added to the open region pool.
Processing logic may distribute new regions from the pool of regions to storage system processes. For example, if a new storage system process (not shown) requests an additional open region (e.g., and has at least 1 open region prior to the request), the allocation shares 820 may be recalculated to account for the new storage system process. If the new storage system process is below its recalculated allocation share, the new storage system process may receive some or all of its new open regions from the open region pool. In other implementations, the open region pool may be divided among oversubscribed storage system processes (e.g., processes whose open region usage is higher than their calculated allocation share). In some implementations, the open region pool may be divided equally among the oversubscribed storage system processes. In other embodiments, the open region pool may be divided among the oversubscribed storage system processes according to the ratio of their quotas 830. For example, if oversubscribed storage system process 815C, with a quota 830C of 1, splits a pool of 10 open regions with a new storage system process (not shown) having a quota of 4, then storage system process 815C may obtain one fifth of the open region pool (e.g., 1/5 × 10 = 2 open regions) and the new storage system process may obtain four fifths of the open region pool (e.g., 4/5 × 10 = 8 open regions). Storage system process 815C may thus obtain its allocation share 820C of 10 open regions plus another 2 open regions from the open region pool, for a total of 12 open regions. It may be noted that the 20 open regions already allocated to storage system process 815C are not removed from storage system process 815C; rather, in an embodiment, storage system process 815C may not obtain a new open region unless storage system 800 experiences a change in an operating parameter, such as a change in the open region pool or a change in allocation share 820C.
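The pool computation and the quota-weighted split described in the two paragraphs above could be sketched as follows; the helper names and dictionaries are illustrative assumptions rather than the disclosed implementation, with values taken from the FIG. 8 example.

```python
def unused_region_pool(shares: dict[str, float], usage: dict[str, int]) -> int:
    """Sum of (allocation share - open region usage) over under-subscribed processes."""
    return int(sum(max(shares[p] - usage[p], 0) for p in shares))

def split_pool_by_quota(pool: int, quotas: dict[str, int]) -> dict[str, int]:
    """Divide a pool of open regions among oversubscribed processes in quota ratio."""
    total = sum(quotas.values())
    return {p: pool * q // total for p, q in quotas.items()}

shares = {"815A": 70.0, "815B": 20.0}
usage = {"815A": 65, "815B": 15}
pool = unused_region_pool(shares, usage)            # 5 + 5 = 10 unused open regions
grants = split_pool_by_quota(pool, {"815C": 1, "new": 4})
print(pool, grants)                                 # 10 {'815C': 2, 'new': 8}
```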
FIG. 9 is an example method 900 of adjusting storage bandwidth of a storage system process to store data at a partitioned storage device in accordance with an embodiment of the present disclosure. In general, method 900 may be performed by processing logic that may comprise hardware (e.g., a processing device, circuitry, dedicated logic, programmable logic, microcode, hardware of the device, integrated circuits, etc.), software (e.g., instructions run or executed on a processing device), or a combination thereof. In some embodiments, processing logic of a processing device of a storage controller, as depicted in FIGS. 1A-3D, may perform method 900.
The method 900 may begin at block 910, where processing logic determines a ratio between a target ratio of open regions assigned to a storage system process and the total number of target ratios of open regions assigned to a plurality of storage system processes having open regions, as previously described with respect to FIG. 8.
At block 920, processing logic determines a target number of open regions of the storage system. In an embodiment, the target number of open regions may correspond to a target parallelism value, as previously described with respect to FIG. 8.
At block 930, processing logic receives an input/output (I/O) write request to write data to the storage system.
At block 940, processing logic calculates an allocation share of a storage system process associated with the I/O write request using the ratio and the target number of open regions.
At block 950, after determining that the open region usage of the storage system process is below the allocated share of the storage system process, processing logic opens a new region for the storage system process.
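Putting blocks 910 through 950 together, a hedged end-to-end sketch of method 900 might look like the following Python; the class and attribute names are assumptions made for illustration, and real storage controller logic would interact with the partitioned storage devices rather than plain dictionaries.

```python
from dataclasses import dataclass, field

@dataclass
class AllocationState:
    quotas: dict[str, int]            # target ratios (quotas) per storage system process
    target_open_regions: int          # target parallelism of the storage system
    open_usage: dict[str, int] = field(default_factory=dict)

    def allocation_share(self, proc: str) -> float:
        # Blocks 910-920: ratio of the process quota to the total quota of
        # processes with open regions, scaled by the target number of open regions.
        total = sum(q for p, q in self.quotas.items()
                    if self.open_usage.get(p, 0) > 0 or p == proc)
        return self.target_open_regions * self.quotas[proc] / total

    def handle_write(self, proc: str) -> bool:
        # Blocks 930-950: on an I/O write request, open a new region for the
        # process only if its open region usage is below its allocation share.
        share = self.allocation_share(proc)
        if self.open_usage.get(proc, 0) < share:
            self.open_usage[proc] = self.open_usage.get(proc, 0) + 1
            return True
        return False

state = AllocationState(quotas={"815A": 7, "815B": 2, "815C": 1},
                        target_open_regions=100,
                        open_usage={"815A": 65, "815B": 15, "815C": 20})
print(state.handle_write("815A"))  # True: 65 < 70
print(state.handle_write("815C"))  # False: 20 >= 10
```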

Claims (20)

1. A storage system, comprising:
a plurality of partitioned storage devices; and
a storage controller operably coupled to the plurality of partitioned storage devices, the storage controller comprising a processing device for:
adjusting a storage bandwidth of a storage system process in response to an input-output (I/O) write request to write data to one or more of the plurality of partitioned storage devices by:
calculating an allocation share of the storage system process requesting writing of the data; and
upon determining that open region usage of the storage system process is below the allocation share of the storage system process, opening a new region for the storage system process.
2. The storage system of claim 1, wherein the allocation share of the storage system process is calculated using a target ratio of open areas assigned to the storage system process, a target ratio of open areas assigned to other storage system processes having open areas, and a target number of open areas of the storage system.
3. The storage system of claim 1, wherein to determine the allocation share of the storage system process, the processing device is further to:
calculate the allocation share of the storage system process using a ratio between a target ratio of open areas assigned to the storage system process and a total number of target ratios of open areas assigned to a plurality of storage system processes having open areas, and a target number of open areas of the storage system.
4. The storage system of claim 1, wherein to adjust the storage bandwidth of the storage system process, the processing device is further to:
upon determining that the open area usage of the storage system process is not below the allocation share of the storage system process, identify a pool of open areas not used by other storage system processes by determining differences between the allocation shares of the other storage system processes and their open area usage.
5. The storage system of claim 1, wherein to adjust the storage bandwidth of the storage system process, the processing device is further to:
upon determining that the open area usage of the storage system is below a target number of open areas of the storage system, open a new area for the storage system process.
6. The storage system of claim 5, wherein determining the allocation share of the storage system process is responsive to determining that the open area usage of the storage system is not below the target number of open areas of the storage system.
7. The storage system of claim 1, wherein the processing device is further to:
upon determining that the storage system process does not have an open region, open a new region for the storage system process and adjust the storage bandwidth of the storage system process to include the new region.
8. A method, comprising:
adjusting, by a processing device of a storage controller, a storage bandwidth of a storage system process in response to an input-output (I/O) write request to write data to a partitioned storage device by:
calculating an allocation share of the storage system process requesting writing of the data; and
upon determining that open region usage of the storage system process is below the allocation share of the storage system process, opening a new region for the storage system process.
9. The method of claim 8, wherein the allocation share of the storage system process is calculated using a target ratio of open areas assigned to the storage system process, a target ratio of open areas assigned to other storage system processes having open areas, and a target number of open areas of the storage system.
10. The method of claim 8, wherein determining the allocation share of the storage system process further comprises:
calculating the allocation share of the storage system process using a ratio between a target ratio of open areas assigned to the storage system process and a total number of target ratios of open areas assigned to a plurality of storage system processes having open areas, and a target number of open areas of the storage system.
11. The method of claim 8, wherein adjusting the storage bandwidth of the storage system process further comprises:
upon determining that the open area usage of the storage system process is not below the allocation share of the storage system process, identifying a pool of open areas not used by other storage system processes by determining differences between the allocation shares of the other storage system processes and their open area usage.
12. The method of claim 8, wherein adjusting the storage bandwidth of the storage system process further comprises:
upon determining that the open area usage of the storage system is below a target number of open areas of the storage system, opening a new area for the storage system process.
13. The method of claim 12, wherein determining the allocation share of the storage system process is responsive to determining that the open area usage of the storage system is not below the target number of open areas of the storage system.
14. The method as recited in claim 8, further comprising:
upon determining that the storage system process does not have an open region, opening a new region for the storage system process and adjusting the storage bandwidth of the storage system process to include the new region.
15. A non-transitory computer-readable storage medium comprising instructions that, when executed by a processing device of a storage controller, cause the processing device to:
adjust, by the processing device, a storage bandwidth of a storage system process in response to an input-output (I/O) write request to write data to a partitioned storage device by:
calculating an allocation share of the storage system process requesting writing of the data; and
upon determining that open region usage of the storage system process is below the allocation share of the storage system process, opening a new region for the storage system process.
16. The non-transitory computer-readable storage medium of claim 15, wherein the allocation share of the storage system process is calculated using a target ratio of open areas assigned to the storage system process, a target ratio of open areas assigned to other storage system processes having open areas, and a target number of open areas of the storage system.
17. The non-transitory computer-readable storage medium of claim 15, wherein to determine the allocation share of the storage system process, the processing device is further to:
calculate the allocation share of the storage system process using a ratio between a target ratio of open areas assigned to the storage system process and a total number of target ratios of open areas assigned to a plurality of storage system processes having open areas, and a target number of open areas of the storage system.
18. The non-transitory computer-readable storage medium of claim 15, wherein to adjust the storage bandwidth of the storage system process, the processing device is further to:
upon determining that the open area usage of the storage system process is not below the allocation share of the storage system process, identify a pool of open areas not used by other storage system processes by determining differences between the allocation shares of the other storage system processes and their open area usage.
19. The non-transitory computer-readable storage medium of claim 15, wherein to adjust the storage bandwidth of the storage system process, the processing device is further to:
upon determining that the open area usage of the storage system is below a target number of open areas of the storage system, open a new area for the storage system process.
20. The non-transitory computer-readable storage medium of claim 19, wherein determining the allocation share of the storage system process is responsive to determining that the open area usage of the storage system is not below the target number of open areas of the storage system.
CN202280048604.6A 2021-06-02 2022-05-11 Improving parallelism in a partitioned drive storage system using allocation shares Pending CN117693731A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US17/336,999 US20210286517A1 (en) 2016-10-04 2021-06-02 Utilizing Allocation Shares To Improve Parallelism In A Zoned Drive Storage System
US17/336,999 2021-06-02
PCT/US2022/028812 WO2022256154A1 (en) 2021-06-02 2022-05-11 Utilizing allocation shares to improve parallelism in a zoned drive storage system

Publications (1)

Publication Number Publication Date
CN117693731A true CN117693731A (en) 2024-03-12

Family

ID=81927696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280048604.6A Pending CN117693731A (en) 2021-06-02 2022-05-11 Improving parallelism in a partitioned drive storage system using allocation shares

Country Status (2)

Country Link
CN (1) CN117693731A (en)
WO (1) WO2022256154A1 (en)


Also Published As

Publication number Publication date
WO2022256154A1 (en) 2022-12-08

Similar Documents

Publication Publication Date Title
US20210173588A1 (en) Optimizing storage device access based on latency
US11861188B2 (en) System having modular accelerators
CN111868676A (en) Servicing I/O operations in a cloud-based storage system
US11614880B2 (en) Storage system with selectable write paths
US20210286517A1 (en) Utilizing Allocation Shares To Improve Parallelism In A Zoned Drive Storage System
US11789626B2 (en) Optimizing block allocation in a data storage system
US11861170B2 (en) Sizing resources for a replication target
US11874733B2 (en) Recovering a container storage system
US11797212B2 (en) Data migration for zoned drives
CN116601596A (en) Selecting segments for garbage collection using data similarity
US20230077836A1 (en) Storage-Aware Management for Serverless Functions
CN117377941A (en) Generating a dataset using an approximate baseline
US20240012752A1 (en) Guaranteeing Physical Deletion of Data in a Storage System
US20220357857A1 (en) Resiliency scheme to enhance storage performance
US20230353495A1 (en) Distributed Service Throttling in a Container System
US20230136839A1 (en) Storage Path Routing in a Container System
US20220091743A1 (en) Bucket versioning snapshots
CN113302584B (en) Storage management for cloud-based storage systems
US11762764B1 (en) Writing data in a storage system that includes a first type of storage device and a second type of storage device
US20240143177A1 (en) Dynamic Adjustment of Input/Output (I/O) Stack of a Distributed Storage System
US20240069781A1 (en) Optimizing Data Deletion Settings in a Storage System
US20240069729A1 (en) Optimizing Data Deletion in a Storage System
US20240143338A1 (en) Prioritized Deployment of Nodes in a Distributed Storage System
US20240146804A1 (en) Dynamic Determination of Locks for a Prioritized Deployment of Nodes in a Distributed Storage System
US20230342267A1 (en) Cluster-wide Snapshotting of a Container System Cluster

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination