CN117616378A - Efficient writing of data in a partitioned drive storage system - Google Patents

Efficient writing of data in a partitioned drive storage system

Info

Publication number
CN117616378A
CN117616378A (Application CN202280049124.1A)
Authority
CN
China
Prior art keywords
storage
data
storage system
written
memory
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202280049124.1A
Other languages
Chinese (zh)
Inventor
Ronald Karr
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Pure Storage Inc
Original Assignee
Pure Storage Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US17/356,870 external-priority patent/US11550481B2/en
Application filed by Pure Storage Inc filed Critical Pure Storage Inc
Publication of CN117616378A publication Critical patent/CN117616378A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/0608 Saving storage space on storage systems
    • G06F 3/061 Improving I/O performance
    • G06F 3/0613 Improving I/O performance in relation to throughput
    • G06F 3/0614 Improving the reliability of storage systems
    • G06F 3/0619 Improving the reliability of storage systems in relation to data integrity, e.g. data losses, bit errors
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0638 Organizing or formatting or addressing of data
    • G06F 3/064 Management of blocks
    • G06F 3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F 3/0658 Controller construction arrangements
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G06F 3/0671 In-line storage system
    • G06F 3/0683 Plurality of storage devices
    • G06F 3/0688 Non-volatile semiconductor memory arrays

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

A list of available zones across respective SSD storage portions of a plurality of partitioned storage devices of the storage system is maintained. Data is received from a plurality of sources, wherein the data is associated with processing a data set that includes a plurality of volumes and associated metadata. A number of shards of the data is determined such that each shard can be written in parallel with the remaining shards. The shards are respectively mapped to a subset of the available zones. The shards are written in parallel to the subset of the available zones.

Description

Efficient writing of data in a partitioned drive storage system
Related application
This application claims priority from U.S. Patent Application No. 17/356,870, filed June 24, 2021, which is hereby incorporated by reference in its entirety.
Background
Storage systems, such as enterprise storage systems, may include centralized or decentralized data repositories that provide common data management, data protection, and data sharing functions, such as through connections with computer systems.
Drawings
The present disclosure is illustrated by way of example, and not by way of limitation, and may be more fully understood with reference to the following detailed description when considered in connection with the accompanying drawings as described below.
FIG. 1A illustrates a first example system for data storage according to some embodiments.
FIG. 1B illustrates a second example system for data storage according to some embodiments.
FIG. 1C illustrates a third example system for data storage according to some embodiments.
FIG. 1D illustrates a fourth example system for data storage according to some embodiments.
FIG. 2A is a perspective view of a storage cluster having multiple storage nodes and internal storage coupled to each storage node to provide network-attached storage, according to some embodiments.
Fig. 2B is a block diagram showing an interconnect switch coupling multiple storage nodes, according to some embodiments.
FIG. 2C is a multi-level block diagram showing the contents of a storage node and the contents of one of the non-volatile solid state storage units, according to some embodiments.
FIG. 2D shows a storage server environment using embodiments of storage nodes and storage units of some previous figures, according to some embodiments.
FIG. 2E is a blade hardware block diagram showing a control plane, a compute and store plane, and an authority to interact with underlying physical resources, according to some embodiments.
FIG. 2F depicts a resilient software layer in a blade of a storage cluster, according to some embodiments.
FIG. 2G depicts authoritative and storage resources in blades of a storage cluster, in accordance with some embodiments.
Fig. 3A illustrates a diagram of a storage system coupled in data communication with a cloud service provider, according to some embodiments of the present disclosure.
Fig. 3B illustrates a diagram of a storage system according to some embodiments of the present disclosure.
Fig. 3C illustrates an example of a cloud-based storage system according to some embodiments of the present disclosure.
FIG. 3D illustrates an exemplary computing device that may be specifically configured to perform one or more of the processes described herein.
FIG. 3E illustrates an example of a storage system cluster for providing storage services according to an embodiment of the disclosure.
FIG. 4 illustrates an example system for block merging according to some embodiments.
FIG. 5 illustrates an example system for block merging according to some embodiments.
Fig. 6 illustrates a flow diagram for block merging, according to some embodiments.
FIG. 7 illustrates an example system for parallel zone writing, according to some embodiments.
FIG. 8 illustrates an example system for writing to bands of a partitioned storage device in parallel, according to some implementations.
FIG. 9 is an example method to efficiently write data in a partitioned drive storage system in accordance with an embodiment of the present disclosure.
Detailed Description
In flash memory systems, it is desirable to be able to directly access block addresses without relying on address translation by a controller associated with the flash drive. The ability to directly access block addresses allows higher efficiency and better control of data reads, writes, and erases. In addition, conventional storage systems write data to blocks, erase portions of the data, and attempt to write new data into the holes created by the erasures. This is inefficient in terms of both time and storage space, because the new data may not fit entirely into the space created by the erasure of the old data. The present disclosure describes an efficient method of block merging for writes in a direct-mapped flash memory system.
To address the above-described deficiencies, in one embodiment, a controller associated with a plurality of flash memory devices in a flash memory system maintains a list of available allocation units across the plurality of flash devices of the flash memory system. The flash devices map erase blocks as directly addressable storage. The flash memory system may classify erase blocks as available, in use, or unavailable. In one implementation, at least a portion of an erase block may be designated as an allocation unit. Data to be stored in the flash memory system may be received from a number of different sources. The data may be associated with processing a data set, and the data set may include a plurality of file systems and associated metadata. Processing logic of the storage system may determine a number of subsets of the data (to be written into a cblock) such that each subset can be written in parallel with the remaining subsets. Processing logic may then map each of the subsets to an available allocation unit and write the subsets in parallel.
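For illustration only, the following minimal Python sketch shows one way the mapping and parallel write described above could be expressed: each subset (shard) of incoming data is paired with an available allocation unit and the writes are issued concurrently. The AllocationUnit structure, the write_to_unit helper, and the thread-pool approach are assumptions invented for the example, not details taken from the disclosure.

    import concurrent.futures
    from dataclasses import dataclass

    @dataclass
    class AllocationUnit:
        # A directly addressable region backed by one or more erase blocks
        # (hypothetical structure; the disclosure does not fix its layout).
        device_id: int
        unit_id: int
        state: str = "available"   # "available", "in_use", or "unavailable"
        data: bytes = b""

    def write_to_unit(unit: AllocationUnit, shard: bytes) -> AllocationUnit:
        # Stand-in for issuing a write to a directly mapped flash device.
        unit.data = shard
        unit.state = "in_use"
        return unit

    def write_shards_in_parallel(shards, available_units):
        # Map each shard to its own allocation unit so that every shard can be
        # written concurrently with the remaining shards.
        if len(shards) > len(available_units):
            raise RuntimeError("not enough available allocation units")
        mapping = list(zip(shards, available_units))
        with concurrent.futures.ThreadPoolExecutor() as pool:
            results = list(pool.map(lambda p: write_to_unit(p[1], p[0]), mapping))
        return results

    if __name__ == "__main__":
        units = [AllocationUnit(device_id=d, unit_id=u) for d in range(2) for u in range(4)]
        shards = [b"shard-%d" % i for i in range(4)]
        for unit in write_shards_in_parallel(shards, units):
            print(unit.device_id, unit.unit_id, unit.state)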
By offloading operations from the controller associated with each individual drive to a controller that manages many drives, the flash memory system can use drives with less firmware overhead in their drive-specific controllers. In addition, having a controller that manages many drives allows for better drive management operations, including wear leveling. Implementing the block merge operation also improves time and storage space efficiency.
Exemplary methods, apparatus, and products for efficiently repositioning data using different programming modes according to embodiments of the present disclosure are described with reference to the accompanying drawings, beginning with FIG. 1A. FIG. 1A illustrates an example system for data storage according to some embodiments. For purposes of illustration and not limitation, system 100 (also referred to herein as a "storage system") includes numerous elements. It may be noted that in other embodiments, the system 100 may include the same, more, or fewer elements configured in the same or different ways.
The system 100 includes a number of computing devices 164A-164B. For example, a computing device (also referred to herein as a "client device") may be embodied as a server, workstation, personal computer, notebook computer, or the like in a data center. The computing devices 164A-164B may be coupled in data communication with one or more storage arrays 102A-102B through a storage area network ('SAN') 158 or a local area network ('LAN') 160.
SAN 158 may be implemented using a variety of data communication fabrics, devices, and protocols. For example, the fabrics for SAN 158 may include Fibre Channel, Ethernet, InfiniBand, serial attached small computer system interface ('SAS'), and so forth. Data communication protocols for use with SAN 158 may include advanced technology attachment ('ATA'), Fibre Channel protocol, small computer system interface ('SCSI'), internet small computer system interface ('iSCSI'), HyperSCSI, non-volatile memory express ('NVMe') over fabrics, and so forth. It is noted that SAN 158 is provided for purposes of illustration and not limitation. Other data communication couplings may be implemented between the computing devices 164A-164B and the storage arrays 102A-102B.
LAN 160 may also be implemented using a variety of fabrics, devices, and protocols. For example, the fabrics for LAN 160 may include Ethernet (802.3), wireless (802.11), and so on. The data communication protocols for use in LAN 160 may include transmission control protocol ('TCP'), user datagram protocol ('UDP'), internet protocol ('IP'), hypertext transfer protocol ('HTTP'), wireless access protocol ('WAP'), handheld device transport protocol ('HDTP'), session initiation protocol ('SIP'), real-time protocol ('RTP'), and the like.
In an implementation, the storage arrays 102A-102B may provide persistent data storage for the computing devices 164A-164B. The storage array 102A may be contained in a chassis (not shown) and the storage array 102B may be contained in another chassis (not shown). The storage arrays 102A and 102B may include one or more storage array controllers 110A-110D (also referred to herein as "controllers"). The storage array controllers 110A-110D may be embodied as modules of an automated computing machine including computer hardware, computer software, or a combination of computer hardware and software. In some implementations, the storage array controllers 110A-110D may be configured to perform various storage tasks. Storage tasks may include writing data received from the computing devices 164A-164B to the storage arrays 102A-102B, erasing data from the storage arrays 102A-102B, retrieving data from the storage arrays 102A-102B and providing data to the computing devices 164A-164B, monitoring and reporting disk utilization and performance, performing redundant operations such as redundant array of independent drives ('RAID') or RAID-like data redundancy operations, compressing data, encrypting data, and so forth.
The storage array controllers 110A-110D may be implemented in numerous ways, including as a field programmable gate array ('FPGA'), a programmable logic chip ('PLC'), an application specific integrated circuit ('ASIC'), a system-on-chip ('SOC'), or any computing device that includes discrete components such as a processing device, a central processing unit, a computer memory, or various adapters. For example, the storage array controllers 110A-110D may include a data communications adapter configured to support communications via the SAN 158 or the LAN 160. In some implementations, the storage array controllers 110A-110D may be independently coupled to the LAN 160. In an implementation, the storage array controllers 110A-110D may include I/O controllers or the like that couple the storage array controllers 110A-110D for data communications, through a midplane (not shown), to persistent storage resources 170A-170B (also referred to herein as "storage resources"). The persistent storage resources 170A-170B generally include any number of storage drives 171A-171F (also referred to herein as "storage") and any number of non-volatile random access memory ('NVRAM') devices (not shown).
In some implementations, NVRAM devices of persistent storage resources 170A-170B may be configured to receive data from storage array controllers 110A-110D to be stored in storage drives 171A-171F. In some examples, the data may originate from computing devices 164A-164B. In some examples, writing data to the NVRAM device may be performed faster than writing data directly to the storage drives 171A-171F. In an implementation, the storage array controllers 110A-110D may be configured to utilize NVRAM devices as fast accessible buffers for data intended to be written to the storage drives 171A-171F. Latency for write requests using NVRAM devices as buffers may be improved relative to systems in which storage array controllers 110A-110D write data directly to storage drives 171A-171F. In some implementations, the NVRAM device may be implemented with computer memory in the form of high-bandwidth, low-latency RAM. NVRAM devices are referred to as "non-volatile" because the NVRAM device may receive or contain a unique power source that maintains the state of the RAM after the main power of the NVRAM device is lost. Such a power source may be a battery, one or more capacitors, or the like. In response to a power loss, the NVRAM device may be configured to write the contents of RAM to persistent storage, such as storage drives 171A-171F.
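The buffering role described above can be pictured with a small sketch: a write is acknowledged once it lands in the fast NVRAM stage, and the staged contents are flushed to the slower storage drives later (or on power loss). This is illustrative only; the toy staging list and toy drives are assumptions, not the controller's actual firmware logic.

    class NvramBuffer:
        """Illustrative fast-write buffer: acknowledge once data is staged,
        flush to the slower storage drives later (or on power loss)."""

        def __init__(self, drives):
            self.staged = []          # stands in for battery-backed RAM
            self.drives = drives      # stands in for storage drives 171A-171F

        def write(self, data: bytes) -> str:
            self.staged.append(data)  # fast path: land the write in NVRAM
            return "ACK"              # acknowledge before touching the drives

        def flush(self):
            # Background or power-loss path: persist staged data to drives.
            while self.staged:
                data = self.staged.pop(0)
                self.drives[hash(data) % len(self.drives)].append(data)

    drives = [[], []]                 # two toy drives
    nvram = NvramBuffer(drives)
    print(nvram.write(b"hello"))      # -> ACK (low latency)
    nvram.flush()                     # later, the contents reach the drives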
In an implementation, storage drives 171A-171F may refer to any device configured to record data persistently, where "persistently" or "persistence" refers to the ability of a device to maintain recorded data after loss of power. In some implementations, the storage drives 171A-171F may correspond to non-disk storage media. For example, the storage drives 171A-171F may be one or more solid-state drives ('SSDs'), flash memory-based storage devices, any type of solid-state non-volatile memory, or any other type of non-mechanical storage device. In other implementations, the storage drives 171A-171F may include mechanical or rotating hard disks, such as hard disk drives ('HDDs').
In some implementations, the storage array controllers 110A-110D may be configured to offload device management responsibilities from the storage drives 171A-171F in the storage arrays 102A-102B. For example, the storage array controllers 110A-110D may manage control information that describes the state of one or more memory blocks in the storage drives 171A-171F. The control information may indicate, for example, that a particular memory block has failed and should no longer be written to, that a particular memory block contains boot code for a storage array controller 110A-110D, the number of program-erase ('P/E') cycles that have been performed on a particular memory block, the age of data stored in a particular memory block, the type of data stored in a particular memory block, and so forth. In some implementations, the control information may be stored as metadata with an associated memory block. In other implementations, the control information for the storage drives 171A-171F may be stored in one or more particular memory blocks of the storage drives 171A-171F that are selected by the storage array controllers 110A-110D. The selected memory blocks may be marked with an identifier indicating that the selected memory block contains control information. The identifier may be used by the storage array controllers 110A-110D, in conjunction with the storage drives 171A-171F, to quickly identify the memory blocks that contain control information. For example, the storage array controllers 110A-110D may issue a command to locate memory blocks that contain control information. It may be noted that the control information may be so large that portions of the control information may be stored in multiple locations, for example for purposes of redundancy, or that the control information may otherwise be distributed across multiple memory blocks in the storage drives 171A-171F.
In an implementation, the storage array controllers 110A-110D may offload device management responsibilities from the storage drives 171A-171F of the storage arrays 102A-102B by retrieving, from the storage drives 171A-171F, control information describing the state of one or more memory blocks in the storage drives 171A-171F. Retrieving the control information may be carried out, for example, by the storage array controllers 110A-110D querying the storage drives 171A-171F for the location of control information for a particular storage drive 171A-171F. The storage drives 171A-171F may be configured to execute instructions that enable the storage drives 171A-171F to identify the location of the control information. The instructions may be executed by a controller (not shown) associated with or otherwise located on the storage drives 171A-171F and may cause the storage drives 171A-171F to scan a portion of each memory block to identify the memory blocks that store control information for the storage drives 171A-171F. The storage drives 171A-171F may respond by sending a response message to the storage array controllers 110A-110D that includes the location of the control information for the storage drives 171A-171F. In response to receiving the response message, the storage array controllers 110A-110D may issue a request to read data stored at the address associated with the location of the control information for the storage drives 171A-171F.
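As a rough illustration of the query-and-respond exchange above, the sketch below has a drive scan its blocks for a marker that tags control-information blocks and report the locations back to the controller, which then reads at those locations. The marker value, block layout, and function names are invented for the example and are not specified by the disclosure.

    CONTROL_INFO_MARKER = b"CTRL"   # hypothetical identifier at the head of a block

    def find_control_blocks(blocks):
        """Drive-side: scan a portion (here, the first bytes) of each memory
        block and return the indices of blocks holding control information."""
        return [i for i, blk in enumerate(blocks) if blk.startswith(CONTROL_INFO_MARKER)]

    def read_control_info(blocks, locations):
        """Controller-side: read the data stored at the reported locations."""
        return [blocks[i][len(CONTROL_INFO_MARKER):] for i in locations]

    blocks = [b"user data", CONTROL_INFO_MARKER + b"{p/e_cycles: 17}", b"more data"]
    locations = find_control_blocks(blocks)       # the drive responds with [1]
    print(read_control_info(blocks, locations))   # the controller reads the control info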
In other implementations, the storage array controllers 110A-110D may further offload device management responsibilities from the storage drives 171A-171F by performing storage drive management operations in response to receiving the control information. Storage drive management operations may include, for example, operations typically performed by a storage drive 171A-171F, such as by a controller (not shown) associated with a particular storage drive 171A-171F. For example, storage drive management operations may include ensuring that data is not written to failed memory blocks within the storage drives 171A-171F, ensuring that data is written to memory blocks within the storage drives 171A-171F in a manner that achieves adequate wear leveling, and so forth.
In an implementation, the storage arrays 102A-102B may implement two or more storage array controllers 110A-110D. For example, the storage array 102A may include a storage array controller 110A and a storage array controller 110B. At a given instant, a single storage array controller 110A-110D of the storage system 100 (e.g., storage array controller 110A) may be designated as having primary status (also referred to herein as a "primary controller"), and the other storage array controllers 110A-110D (e.g., storage array controller 110B) may be designated as having secondary status (also referred to herein as a "secondary controller"). The primary controller may have certain rights, such as permission to alter data in the persistent storage resources 170A-170B (e.g., to write data to the persistent storage resources 170A-170B). At least some of the rights of the primary controller may supersede the rights of the secondary controller. For example, while the primary controller holds the right, the secondary controller may not have permission to alter data in the persistent storage resources 170A-170B. The status of the storage array controllers 110A-110D may change. For example, the storage array controller 110A may be designated as having secondary status and the storage array controller 110B may be designated as having primary status.
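A toy model of the primary/secondary arrangement might look like the following: only the controller currently holding primary status may alter persistent data, and the statuses can be swapped. The status handling and failover call are illustrative assumptions, not the controllers' actual interface.

    class ArrayController:
        def __init__(self, name, status):
            self.name = name
            self.status = status      # "primary" or "secondary"

        def write(self, storage, data):
            # Only the controller currently holding primary status may alter
            # data in the persistent storage resources.
            if self.status != "primary":
                raise PermissionError(f"{self.name} is secondary; write refused")
            storage.append(data)

    def fail_over(primary, secondary):
        # Statuses can change, e.g. when the primary becomes unavailable.
        primary.status, secondary.status = "secondary", "primary"

    storage = []
    a = ArrayController("110A", "primary")
    b = ArrayController("110B", "secondary")
    a.write(storage, b"block-0")
    fail_over(a, b)
    b.write(storage, b"block-1")      # 110B now holds primary status
    print(storage)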
In some implementations, a primary controller, such as storage array controller 110A, may be used as the primary controller for one or more storage arrays 102A-102B, and a secondary controller, such as storage array controller 110B, may be used as the secondary controller for one or more storage arrays 102A-102B. For example, storage array controller 110A may be a primary controller of storage arrays 102A and 102B, and storage array controller 110B may be a secondary controller of storage arrays 102A and 102B. In some implementations, the storage array controllers 110C and 110D (also referred to as "storage processing modules") may have neither a primary nor a secondary state. The storage array controllers 110C and 110D implemented as storage processing modules may serve as communication interfaces between the primary and secondary controllers (e.g., storage array controllers 110A and 110B, respectively) and the storage array 102B. For example, the storage array controller 110A of the storage array 102A may send a write request to the storage array 102B via the SAN 158. The write request may be received by both memory array controllers 110C and 110D of memory array 102B. The storage array controllers 110C and 110D facilitate communications, such as sending write requests to the appropriate storage drives 171A-171F. It may be noted that in some implementations, the storage processing module may be used to increase the number of storage drives controlled by the primary and secondary controllers.
In an implementation, the storage array controllers 110A-110D are communicatively coupled to one or more storage drives 171A-171F via a midplane (not shown) and one or more NVRAM devices (not shown) included as part of the storage arrays 102A-102B. The storage array controllers 110A-110D may be coupled to the midplane via one or more data communication links and the midplane may be coupled to the storage drives 171A-171F and NVRAM devices via one or more data communication links. The data communication links described herein are collectively illustrated by data communication links 108A-108D and may include, for example, a peripheral component interconnect express ('PCIe') bus.
FIG. 1B illustrates an example system for data storage according to some embodiments. The storage array controller 101 illustrated in FIG. 1B may be similar to the storage array controllers 110A-110D described with reference to FIG. 1A. In one example, the storage array controller 101 may be similar to the storage array controller 110A or the storage array controller 110B. For purposes of illustration and not limitation, the storage array controller 101 includes numerous elements. It may be noted that in other embodiments, the storage array controller 101 may include the same, more, or fewer elements configured in the same or different ways. It may be noted that elements of FIG. 1A may be referenced below to help illustrate features of the storage array controller 101.
The storage array controller 101 may include one or more processing devices 104 and random access memory ('RAM') 111. The processing device 104 (or the controller 101) represents one or more general-purpose processing devices, such as a microprocessor, a central processing unit, or the like. More particularly, the processing device 104 (or the controller 101) may be a complex instruction set computing ('CISC') microprocessor, a reduced instruction set computing ('RISC') microprocessor, a very long instruction word ('VLIW') microprocessor, a processor implementing other instruction sets, or a processor implementing a combination of instruction sets. The processing device 104 (or the controller 101) may also be one or more special-purpose processing devices, such as an ASIC, an FPGA, a digital signal processor ('DSP'), a network processor, or the like.
The processing device 104 may be connected to the RAM 111 via a data communication link 106, which may be embodied as a high-speed memory bus, such as a double data rate 4 ('DDR 4') bus. An operating system 112 is stored in RAM 111. In some implementations, the instructions 113 are stored in the RAM 111. The instructions 113 may include computer program instructions for performing operations in a direct-mapped flash memory system. In one embodiment, a direct mapped flash memory system is a system that addresses data blocks within a flash drive directly without address translation performed by the memory controller of the flash drive.
In an implementation, the storage array controller 101 includes one or more host bus adapters 103A-103C coupled to the processing device 104 via data communication links 105A-105C. In implementations, the host bus adapters 103A-103C can be computer hardware connecting a host system (e.g., a storage array controller) to other networks and storage arrays. In some examples, host bus adapters 103A-103C may be fibre channel adapters enabling storage array controller 101 to connect to a SAN, ethernet adapters enabling storage array controller 101 to connect to a LAN, or the like. Host bus adapters 103A-103C may be coupled to processing device 104 via data communication links 105A-105C (e.g., PCIe bus).
In an implementation, the storage array controller 101 may include a host bus adapter 114 coupled to the expander 115. Expander 115 may be used to attach a host system to a larger number of storage drives. In embodiments in which host bus adapter 114 is embodied as a SAS controller, expander 115 may be, for example, a SAS expander for enabling host bus adapter 114 to be attached to a storage drive.
In an implementation, the storage array controller 101 may include a switch 116 coupled to the processing device 104 via a data communication link 109. Switch 116 may be a computer hardware device that may create multiple endpoints from a single endpoint, thereby enabling multiple devices to share a single endpoint. For example, switch 116 may be a PCIe switch coupled to a PCIe bus (e.g., data communication link 109) and presenting a plurality of PCIe connection points to the midplane.
In an embodiment, the storage array controller 101 includes a data communication link 107 for coupling the storage array controller 101 to other storage array controllers. In some examples, data communication link 107 may be a Quick Path Interconnect (QPI) interconnect.
A conventional storage system using a conventional flash drive may implement a process that spans the flash drive as part of the conventional storage system. For example, higher level processes of a storage system may initiate and control processes across a flash drive. However, the flash drives of conventional storage systems may include their own storage controllers that also perform the process. Thus, for a conventional storage system, both higher-level processes (e.g., initiated by the storage system) and lower-level processes (e.g., initiated by the storage controller of the storage system) may be performed.
To address various drawbacks of conventional memory systems, operations may be performed by higher-level processes rather than by lower-level processes. For example, a flash memory system may include a flash drive that does not include a memory controller that provides the process. Thus, the operating system of the flash memory system itself may initiate and control the process. This may be achieved by a direct mapped flash memory system that directly addresses data blocks within the flash drive without address translation performed by the memory controller of the flash drive.
In implementations, the storage drives 171A-171F may be one or more partitioned storage devices. In some implementations, the one or more partitioned storage devices may be shingled HDDs. In an implementation, the one or more storage devices may be flash-based SSDs. In a partitioned storage device, the partitioned namespaces on the partitioned storage device are addressable by groups of blocks that are grouped and aligned by natural size, forming a number of addressable zones. In implementations utilizing SSDs, the natural size may be based on the erase block size of the SSD. In some implementations, a zone of a partitioned storage device may be defined during initialization of the partitioned storage device. In an embodiment, the zones may be dynamically defined when data is written to the partitioned storage.
In some implementations, the zones may be heterogeneous, with some zones each being a page group and other zones being multiple page groups. In implementations, some zones may correspond to an erase block and other zones may correspond to multiple erase blocks. In implementations, zones may be any combination of differing numbers of page groups and/or erase blocks, for heterogeneous mixes of programming modes, manufacturers, product types, and/or product generations of storage devices, as applied to heterogeneous assemblies, upgrades, distributed storage, and the like. In some embodiments, a zone may be defined as having usage characteristics as attributes, such as supporting data with a particular kind of lifetime (e.g., very short lifetime or very long lifetime). These attributes may be used by the partitioned storage device to determine how a zone will be managed over its expected lifetime.
It should be appreciated that the zones are virtual constructs. Any particular zone may not have a fixed location on the storage device. A zone may not have any location on the storage device until it is allocated. In various implementations, a zone may correspond to a number representing a chunk of virtually allocatable space that is the size of an erase block or other block size. When the system allocates or opens a zone, the zone is allocated to flash or other solid-state storage memory, and as the system writes to the zone, pages are written to the mapped flash or other solid-state storage memory of the partitioned storage device. When the system closes the zone, the associated erase block or other sized block is completed. At some point in the future, the system may delete the zone, which frees up the zone's allocated space. During its lifetime, a zone may be moved to different locations of the partitioned storage device, for example as the partitioned storage device undergoes internal maintenance.
In an implementation, the zones of a partitioned storage device may be in different states. A zone may be in an empty state, in which no data has been stored at the zone. An empty zone may be opened explicitly, or may be opened implicitly by writing data to the zone. This is the initial state of zones on a fresh partitioned storage device, but it may also be the result of a zone reset. In some implementations, an empty zone may have a designated location within the flash memory of the partitioned storage device. In an embodiment, the location of an empty zone may be chosen only when the zone is first opened or first written to (or later, if the write is buffered in memory). A zone may be implicitly or explicitly in an open state, and a zone in the open state may be written to with write or append commands to store data. In an embodiment, a zone in the open state may also be written to using a copy command that copies data from a different zone. In some implementations, a partitioned storage device may have a limit on the number of zones that can be open at a particular time.
A zone in the closed state is a zone that has been partially written to but has entered the closed state after an explicit close operation was issued. A zone in the closed state remains available for future writing, but some of the runtime overhead consumed by keeping the zone in the open state is reduced. In an embodiment, a partitioned storage device may have a limit on the number of closed zones at a particular time. A zone in the full state is a zone that stores data and can no longer be written to. A zone may enter the full state after a write operation has filled the entire zone, or as a result of a zone finish operation. The zone may or may not have been fully written to before the finish operation. However, after the finish operation, the zone cannot be opened for further writing without first performing a zone reset operation.
The mapping from zones to erase blocks (or to shingled tracks in an HDD) may be arbitrary, dynamic, and hidden from view. The process of opening a zone is an operation that allows a new zone to be dynamically mapped to underlying storage of the partitioned storage device, and then allows data to be written through appending writes into the zone until the zone reaches capacity. A zone can be finished at any point, after which no further data may be written into the zone. When the data stored at a zone is no longer needed, the zone can be reset, which effectively deletes the zone's contents from the partitioned storage device and makes the physical storage held by that zone available for subsequent data storage. Once a zone has been written to and finished, the partitioned storage device ensures that the data stored at the zone is not lost until the zone is reset. In the time between writing data to a zone and resetting the zone, the zone may be moved around between shingled tracks or erase blocks as part of maintenance operations within the partitioned storage device, such as copying data to keep it refreshed or to handle memory cell aging in an SSD.
In embodiments utilizing HDDs, resetting a zone may allow its shingled tracks to be allocated to a new open zone that may be opened at some point in the future. In implementations utilizing SSDs, resetting a zone may cause the associated physical erase blocks of the zone to be erased and subsequently reused for the storage of data. In some embodiments, a partitioned storage device may have a limit on the number of open zones at a point in time to reduce the amount of overhead dedicated to keeping zones open.
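The zone lifecycle sketched in the preceding paragraphs (empty, open, closed, full, then reset) can be summarized with a small state machine. The sketch below is a simplified model under stated assumptions (a fixed zone capacity and a cap on simultaneously open zones); it is not the command set of any particular partitioned storage device.

    from enum import Enum

    class ZoneState(Enum):
        EMPTY = "empty"
        OPEN = "open"
        CLOSED = "closed"
        FULL = "full"

    class Zone:
        CAPACITY = 4                       # assumed capacity, in blocks

        def __init__(self):
            self.state = ZoneState.EMPTY
            self.blocks = []

        def write(self, block, open_zone_count, open_limit=2):
            # Writing to an empty or closed zone implicitly opens it, subject
            # to the device's limit on simultaneously open zones.
            if self.state in (ZoneState.EMPTY, ZoneState.CLOSED):
                if open_zone_count >= open_limit:
                    raise RuntimeError("too many open zones")
                self.state = ZoneState.OPEN
            if self.state is not ZoneState.OPEN:
                raise RuntimeError("zone is not writable")
            self.blocks.append(block)      # append-only writes
            if len(self.blocks) >= self.CAPACITY:
                self.state = ZoneState.FULL

        def close(self):
            self.state = ZoneState.CLOSED  # reduces open-zone overhead

        def finish(self):
            self.state = ZoneState.FULL    # no more writes until a reset

        def reset(self):
            self.blocks.clear()            # contents are discarded
            self.state = ZoneState.EMPTY

    z = Zone()
    z.write(b"a", open_zone_count=0)
    z.close()
    z.write(b"b", open_zone_count=1)
    z.finish()
    z.reset()
    print(z.state)    # ZoneState.EMPTY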
An operating system of a flash memory system may identify and maintain a list of allocation units across multiple flash drives of the flash memory system. The allocation unit may be an entire erase block or a plurality of erase blocks. The operating system may maintain a map or address range that maps addresses directly to erase blocks of the flash drive of the flash memory system.
Erase blocks that are directly mapped to a flash drive may be used to rewrite data and to erase data. For example, an operation may be performed on one or more allocation units that include first data and second data, where the first data is to be retained and the second data is no longer used by the flash memory system. The operating system may initiate a process of writing the first data to a new location within another allocation unit, erasing the second data, and marking the allocation unit as available for subsequent data. Thus, the process may be performed solely by the higher-level operating system of the flash memory system, without an additional lower-level process being performed by a controller of the flash drive.
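A compact sketch of that higher-level process follows: live ("first") data is copied into a fresh allocation unit, the old unit is erased, and it is returned to the available list. The names and dictionary-based structures here are assumptions for illustration only.

    def relocate_and_reclaim(unit, live_keys, available_units):
        """Operating-system-level sketch: keep the first (live) data, drop the
        second (unused) data, and reclaim the allocation unit."""
        if not available_units:
            raise RuntimeError("no available allocation unit to relocate into")
        target = available_units.pop()
        # Write the data that is to be retained to a new location.
        target["data"] = {k: v for k, v in unit["data"].items() if k in live_keys}
        target["state"] = "in_use"
        # Erase the old unit and mark it as available for subsequent data.
        unit["data"] = {}
        unit["state"] = "available"
        available_units.append(unit)
        return target

    old_unit = {"data": {"k1": b"keep", "k2": b"stale"}, "state": "in_use"}
    free_units = [{"data": {}, "state": "available"}]
    new_unit = relocate_and_reclaim(old_unit, live_keys={"k1"}, available_units=free_units)
    print(new_unit["data"], old_unit["state"])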
Advantages of processes performed only by the operating system of the flash memory system include increased reliability of the flash drives of the flash memory system, because unnecessary or redundant write operations are not performed during the process. One point of novelty here is the concept of initiating and controlling the process at the operating system of the flash memory system. In addition, the process may be controlled by the operating system across multiple flash drives. This is in contrast to the process being performed by a storage controller of a flash drive.
The storage system may consist of two storage array controllers that share a set of drives for failover purposes, or it may consist of a single storage array controller that provides storage services that utilize multiple drives, or it may consist of a distributed network of storage array controllers, each with a certain number of drives or a certain number of flash memory devices, where the storage array controllers in the network cooperate to provide a complete storage service and various aspects of the storage service (including storage allocation and garbage collection).
FIG. 1C illustrates a third example system 117 for data storage according to some embodiments. For purposes of illustration and not limitation, system 117 (also referred to herein as a "storage system") includes numerous elements. It may be noted that in other embodiments, the system 117 may include the same, more, or fewer elements configured in the same or different ways.
In one embodiment, the system 117 includes a dual peripheral component interconnect ('PCI') flash storage device 118 with separately addressable fast write storage. The system 117 may include a storage device controller 119. In one embodiment, the storage device controllers 119A-119D may be a CPU, an ASIC, an FPGA, or any other circuit that can implement the control structures required in accordance with the present disclosure. In one embodiment, the system 117 includes flash memory devices (e.g., flash memory devices 120a-120n) operatively coupled to respective channels of the storage device controller 119. The flash memory devices 120a-120n may be presented to the controllers 119A-119D as an addressable collection of flash pages, erase blocks, and/or control elements sufficient to allow the storage device controllers 119A-119D to program and retrieve various aspects of the flash. In one embodiment, the storage device controllers 119A-119D may perform operations on the flash memory devices 120a-120n, including storing and retrieving the data content of pages, arranging and erasing blocks, tracking statistics related to the use and reuse of flash memory pages, erase blocks, and cells, tracking and predicting error codes and faults within the flash memory, controlling voltage levels associated with programming and retrieving the contents of flash cells, and the like.
In one embodiment, the system 117 may include a RAM 121 to store individually addressable fast write data. In one embodiment, RAM 121 may be one or more separate discrete devices. In another embodiment, RAM 121 may be integrated into storage device controllers 119A-119D or multiple storage device controllers. RAM 121 may also be used for other purposes such as storing temporary program memory for a processing device (e.g., CPU) in device controller 119.
In one embodiment, the system 117 may include an energy storage device 122, such as a rechargeable battery or capacitor. The energy storage device 122 may store energy sufficient to power the storage device controller 119, an amount of RAM (e.g., RAM 121), and an amount of flash memory (e.g., flash memory 120 a-120 n) for a sufficient time to write the contents of the RAM to the flash memory. In one embodiment, if the storage device controllers 119A-119D detect an external power loss, the storage device controllers may write the contents of RAM to flash memory.
In one embodiment, the system 117 includes two data communication links 123a, 123b. In one embodiment, the data communication links 123a, 123b may be PCI interfaces. In another embodiment, the data communication links 123a, 123b may be based on other communication standards (e.g., HyperTransport, InfiniBand, etc.). The data communication links 123a, 123b may be based on the non-volatile memory express ('NVMe') or NVMe over fabrics ('NVMf') specifications, which allow external connection from other components in the storage system 117 to the storage device controllers 119A-119D. It should be noted that, for convenience, the data communication links may be interchangeably referred to herein as PCI buses.
The system 117 may also include an external power source (not shown), which may be provided over one or both data communication links 123a, 123b, or which may be provided separately. An alternative embodiment includes a separate flash memory (not shown) dedicated to storing the contents of RAM 121. The storage device controllers 119A-119D may present a logical device over the PCI bus, which may include an addressable fast-write logical device, or a distinct portion of the logical address space of the storage device 118, which may be presented as PCI memory or as a persistent storage device. In one embodiment, operations to store into the device are directed into RAM 121. On power failure, the storage device controllers 119A-119D may write the stored content associated with the addressable fast-write logical storage to flash memory (e.g., flash memory 120a-120n) for long-term persistent storage.
In one embodiment, the logic device may include some rendering of some or all of the contents of flash memory devices 120 a-120 n, where the rendering allows a storage system (e.g., storage system 117) including storage device 118 to directly address flash memory pages and directly reprogram erase blocks from storage system components external to the storage device over the PCI bus. The presentation may also allow one or more external components to control and retrieve other aspects of the flash memory, including some or all of: tracking statistics related to the use and reuse of flash memory pages, erase blocks and cells across all flash memory devices; tracking and predicting error codes and faults within and across the flash memory device; controlling a voltage level associated with programming and retrieving the contents of the flash cell; etc.
In one embodiment, the energy storage device 122 may be sufficient to ensure that ongoing operation of the flash memory devices 120 a-120 n is completed. The energy storage device 122 may power the storage device controllers 119A-119D and associated flash memory devices (e.g., 120 a-120 n) to do those operations, as well as store fast write RAM to flash memory. The energy storage device 122 may be used to store accumulated statistics and other parameters maintained and tracked by the flash memory devices 120 a-120 n and/or the storage device controller 119. Individual capacitors or energy storage devices, such as smaller capacitors near or embedded in the flash memory device itself, may be used for some or all of the operations described herein.
Various schemes may be used to track and optimize the lifetime of the energy storage components, such as adjusting voltage levels over time, partially discharging the energy storage device 122 to measure corresponding discharge characteristics, and so forth. If the available energy decreases over time, the effective available capacity of the addressable fast write storage device may decrease to ensure that it can be safely written to based on the currently available storage energy.
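As a numeric illustration of the last point, the sketch below derates the advertised fast-write capacity in proportion to the energy the device currently measures in its capacitor or battery. The energy-per-megabyte figure and the derating rule are invented assumptions, not values taken from the disclosure.

    def safe_fast_write_capacity(measured_energy_joules: float,
                                 joules_per_megabyte: float = 0.05,
                                 reserve_fraction: float = 0.2) -> float:
        """Return the number of megabytes of fast-write content that could be
        flushed to flash on power loss, keeping a safety reserve of energy.
        The constants are hypothetical and for illustration only."""
        usable = measured_energy_joules * (1.0 - reserve_fraction)
        return max(usable / joules_per_megabyte, 0.0)

    # As the energy storage device ages, the effective capacity shrinks.
    for energy in (10.0, 6.0, 2.0):
        print(energy, "J ->", safe_fast_write_capacity(energy), "MB")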
FIG. 1D illustrates a fourth example storage system 124 for data storage according to some embodiments. In one embodiment, the storage system 124 includes storage controllers 125a, 125b. In one embodiment, the storage controllers 125a, 125b are operatively coupled to dual-PCI storage devices. The storage controllers 125a, 125b are operatively coupled (e.g., via a storage network 130) to a number of host computers 127a-127n.
In one embodiment, two storage controllers (e.g., 125a and 125b) provide storage services, such as SCSI block storage arrays, file servers, object servers, databases or data analytics services, and the like. The storage controllers 125a, 125b may provide services to host computers 127a-127n outside of the storage system 124 through a number of network interfaces (e.g., 126a-126d). The storage controllers 125a, 125b may provide integrated services or applications entirely within the storage system 124, forming a converged storage and compute system. The storage controllers 125a, 125b may utilize the fast write memory within or across the storage devices 119a-119d to journal in-progress operations, to ensure that the operations are not lost on a power failure, storage controller removal, storage controller or storage system shutdown, or some fault of one or more software or hardware components within the storage system 124.
In one embodiment, the storage controllers 125a, 125b operate as PCI masters for one or the other PCI bus 128a, 128b. In another embodiment, 128a and 128b may be based on other communication standards (e.g., HyperTransport, InfiniBand, etc.). Other storage system embodiments may operate the storage controllers 125a, 125b as multi-masters for both PCI buses 128a, 128b. Alternatively, a PCI/NVMe/NVMf switching infrastructure or fabric may connect multiple storage controllers. Some storage system embodiments may allow storage devices to communicate with each other directly, rather than communicating only with a storage controller. In one embodiment, the storage device controller 119a may operate under direction from the storage controller 125a to synthesize and transfer data to be stored into the flash memory devices from data that has been stored in RAM (e.g., RAM 121 of FIG. 1C). For example, a recalculated version of the RAM content may be transferred after the storage controller has determined that an operation has fully committed across the storage system, or when the flash memory on the device has reached a certain used capacity, or after a certain amount of time, to ensure improved safety of the data or to release addressable fast-write capacity for reuse. This mechanism may be used, for example, to avoid a second transfer over a bus (e.g., 128a, 128b) from the storage controllers 125a, 125b. In one embodiment, a recalculation may include compressing the data, attaching indexing or other metadata, combining multiple data segments together, performing erasure code calculations, and so forth.
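One way to picture the "recalculated version of the RAM content" mentioned above is the sketch below, which combines several staged segments, compresses them, and prepends a small index before the result would be handed to flash. The framing format is an assumption made up for the example, not a format defined by the disclosure.

    import json
    import zlib

    def recalculate_for_flash(segments):
        """Combine multiple data segments, compress them, and attach an index
        describing where each segment starts in the combined payload."""
        combined = b"".join(segments)
        offsets, pos = [], 0
        for seg in segments:
            offsets.append(pos)
            pos += len(seg)
        index = json.dumps({"count": len(segments), "offsets": offsets}).encode()
        payload = zlib.compress(combined)
        # Hypothetical framing: 4-byte index length, then the index, then the payload.
        return len(index).to_bytes(4, "big") + index + payload

    staged = [b"journal entry 1", b"journal entry 2", b"user block 7" * 10]
    blob = recalculate_for_flash(staged)
    print(len(b"".join(staged)), "bytes staged ->", len(blob), "bytes to flash")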
In one embodiment, the storage device controllers 119a, 119b are operable, under direction from the storage controllers 125a, 125b, to calculate data stored in RAM (e.g., RAM 121 of fig. 1C) and transfer the data to other storage devices without the involvement of the storage controllers 125a, 125 b. This operation may be used to mirror data stored in one storage controller 125a to another storage controller 125b, or the operation may be used to offload compression, data aggregation, and/or erasure coding calculations and transfers to the storage devices to reduce the load of the storage controllers or storage controller interfaces 129a, 129b to the PCI buses 128a, 128 b.
The storage device controllers 119A-119D may include mechanisms for implementing high availability primitives for use by other portions of the storage system external to the dual-PCI storage device 118. For example, a reservation or exclusion primitive may be provided such that in a storage system having two storage controllers providing highly available storage services, one storage controller may prevent another storage controller from accessing or continuing to access the storage device. This approach may be used, for example, in situations where one controller detects that another controller is not functioning properly or where the interconnect between two storage controllers may itself be not functioning properly.
In one embodiment, a storage system for use with dual PCI direct mapped storage with individually addressable fast write storage includes several systems that manage erase blocks or groups of erase blocks as allocation units for storing data on behalf of a storage service or for storing metadata (e.g., indexes, logs, etc.) associated with the storage service or for proper management of the storage system itself. Flash pages, which may be several kilobytes in size, may be written when data arrives, or when the storage system is to hold data for a longer time interval (e.g., exceeding a defined time threshold). To commit data faster or to reduce the number of writes to the flash memory device, the memory controller may first write the data to an individually addressable fast write memory device on one or more memory devices.
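The write path described in this paragraph (commit small writes to fast-write storage first, program full flash pages later or after a hold time) can be sketched as follows. The page size, time threshold, and class names are illustrative assumptions only.

    import time

    FLASH_PAGE_BYTES = 16 * 1024        # assumed flash page size
    HOLD_TIME_SECONDS = 5.0             # assumed time threshold

    class WritePath:
        def __init__(self):
            self.fast_write = bytearray()   # stands in for addressable fast-write storage
            self.flash_pages = []           # stands in for programmed flash pages
            self.first_write_at = None

        def write(self, data: bytes):
            # Commit quickly by landing the data in fast-write storage first.
            if self.first_write_at is None:
                self.first_write_at = time.monotonic()
            self.fast_write.extend(data)
            self.maybe_program_pages()

        def maybe_program_pages(self):
            aged = (self.first_write_at is not None and
                    time.monotonic() - self.first_write_at > HOLD_TIME_SECONDS)
            # Program a flash page once a full page has accumulated, or once
            # the buffered data has been held longer than the threshold.
            while len(self.fast_write) >= FLASH_PAGE_BYTES or (aged and self.fast_write):
                page = bytes(self.fast_write[:FLASH_PAGE_BYTES])
                del self.fast_write[:FLASH_PAGE_BYTES]
                self.flash_pages.append(page)
                aged = False
            if not self.fast_write:
                self.first_write_at = None

    wp = WritePath()
    wp.write(b"x" * 20000)              # crosses one page boundary immediately
    print(len(wp.flash_pages), "page(s) programmed,", len(wp.fast_write), "bytes still buffered")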
In one embodiment, the storage controllers 125a, 125b may initiate the use of erase blocks within and across storage devices (e.g., 118) according to the age and expected remaining lifespan of the storage devices, or based on other statistics. The storage controllers 125a, 125b may initiate garbage collection and data migration between storage devices in accordance with pages that are no longer needed, as well as manage flash page and erase block lifespans and manage overall system performance.
In one embodiment, the storage system 124 may utilize mirroring and/or erasure coding schemes as part of storing data into addressable fast write memory and/or as part of writing data into allocation units associated with erase blocks. Erasure codes may be used across storage devices, as well as within erase blocks or allocation units, or within and across flash memory devices on a single storage device, to provide redundancy against single or multiple storage device failures or to protect against internal corruption of flash memory pages resulting from flash memory operations or degradation of flash memory cells. Mirroring and erasure coding at various levels may be used to recover from multiple types of failures occurring separately or in combination.
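The simplest instance of the erasure-coding idea above is single parity: store N data shards plus one XOR parity shard on different devices, and rebuild any one lost shard from the survivors. The sketch below shows that minimal case; real systems typically use stronger codes (e.g., Reed-Solomon), which are not shown here.

    def xor_parity(shards):
        """Compute a parity shard as the byte-wise XOR of equal-length shards."""
        parity = bytearray(len(shards[0]))
        for shard in shards:
            for i, b in enumerate(shard):
                parity[i] ^= b
        return bytes(parity)

    def reconstruct(surviving_shards, parity):
        """Rebuild the single missing data shard from the survivors and parity."""
        return xor_parity(list(surviving_shards) + [parity])

    data_shards = [b"AAAA", b"BBBB", b"CCCC"]        # stored on different devices
    parity = xor_parity(data_shards)                 # stored on yet another device
    lost = data_shards[1]                            # simulate a device failure
    rebuilt = reconstruct([data_shards[0], data_shards[2]], parity)
    assert rebuilt == lost
    print("reconstructed:", rebuilt)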
The embodiments depicted with reference to fig. 2A-G illustrate a storage cluster storing user data, such as user data originating from one or more users or client systems or other sources external to the storage cluster. Storage clusters use erasure coding and redundant copies of metadata to distribute user data across storage nodes housed within a chassis or across multiple chassis. Erasure coding refers to a data protection or reconstruction method in which data is stored across a set of different locations, such as disks, storage nodes, or geographic locations. Flash memory is one type of solid state memory that may be integrated with embodiments, although embodiments may be extended to other types of solid state memory or other storage media including non-solid state memory. Control of storage locations and workloads is distributed across storage locations in a clustered peer-to-peer system. Tasks such as mediating communications between the various storage nodes, detecting when a storage node becomes unavailable, and balancing I/O (input and output) across the various storage nodes are all handled on a distributed basis. In some embodiments, data is laid out or distributed across multiple storage nodes in a data segment or stripe that supports data recovery. Ownership of data can be reassigned within a cluster independent of input and output types. This architecture, described in more detail below, allows storage nodes in the cluster to fail while the system remains operational because data can be reconstructed from other storage nodes and thus remain available for input and output operations. In various embodiments, a storage node may be referred to as a cluster node, a blade, or a server.
The storage clusters may be contained within a chassis (i.e., a housing that houses one or more storage nodes). Included within the chassis are mechanisms to power each storage node (e.g., a power distribution bus) and communication mechanisms to enable communication between the storage nodes (e.g., a communication bus). According to some embodiments, the storage clusters may operate as stand-alone systems in one location. In one embodiment, the chassis houses at least two instances of both the power distribution bus and the communication bus, which may be independently enabled or disabled. The internal communication bus may be an ethernet bus, however, other technologies such as PCIe, infiniband, and others are equally applicable. The chassis provides ports for an external communication bus for enabling communication between multiple chassis, either directly or through a switch, and with a client system. External communications may use technologies such as ethernet, infiniband, fibre channel, etc. In some embodiments, the external communication bus uses different communication bus technologies for inter-chassis and client communication. If the switch is deployed within a chassis or between chassis, the switch may act as a translation between multiple protocols or technologies. When multiple chassis are connected to define a storage cluster, the storage cluster may be accessed by a client using a proprietary interface or standard interface, such as network file system ('NFS'), common internet file system ('CIFS'), small computer system interface ('SCSI'), or hypertext transfer protocol ('HTTP'). The conversion from the client protocol may occur at the switch, at the chassis external communication bus, or within each storage node. In some embodiments, multiple chassis may be coupled or connected to each other through an aggregator switch. Part and/or all of the coupled or connected chassis may be designated as storage clusters. As discussed above, each chassis may have multiple blades, each blade having a media access control ('MAC') address, but in some embodiments, the storage clusters appear to the external network as having a single cluster IP address and a single MAC address.
Each storage node may be one or more storage servers, and each storage server is connected to one or more non-volatile solid-state memory units, which may be referred to as storage units or storage devices. One embodiment includes a single storage server in each storage node and one to eight non-volatile solid-state memory units, although this one example is not meant to be limiting. The storage server may include a processor, DRAM, and interfaces for the internal communication buses and for power distribution for each of the power buses. In some embodiments, the interfaces and storage units share a communication bus, such as PCI Express, within the storage node. The non-volatile solid-state memory unit may directly access the internal communication bus interface through the storage node communication bus, or may request that the storage node access the bus interface. The non-volatile solid-state memory unit contains an embedded CPU, a solid-state storage controller, and a quantity of solid-state mass storage, such as between 2 terabytes ('TB') and 32 TB in some embodiments. An embedded volatile storage medium, such as DRAM, and an energy reserve device are included in the non-volatile solid-state memory unit. In some embodiments, the energy reserve device is a capacitor, supercapacitor, or battery that enables a subset of the DRAM contents to be transferred to a stable storage medium in the event of a power loss. In some embodiments, the non-volatile solid-state memory unit is constructed with storage class memory, such as phase-change or magnetoresistive random access memory ('MRAM'), which substitutes for DRAM and enables a reduced-power hold-up device.
One of the many features of the storage nodes and non-volatile solid state storage devices is the ability to proactively rebuild data in a storage cluster. The storage nodes and non-volatile solid state storage devices can determine when a storage node or non-volatile solid state storage device in the storage cluster is unreachable, independent of whether there is an attempt to read data involving that storage node or non-volatile solid state storage device. The storage nodes and non-volatile solid state storage devices then cooperate to recover and rebuild the data in at least partially new locations. This constitutes a proactive rebuild, in that the system rebuilds the data without waiting until the data is needed for a read access initiated from a client system employing the storage cluster. These and further details of the storage memory and its operation are discussed below.
FIG. 2A is a perspective view of a storage cluster 161 having a plurality of storage nodes 150 and internal solid state memory coupled to each storage node to provide a network attached storage or storage area network, according to some embodiments. The network-attached storage, storage area network, or storage cluster, or other storage memory, may include one or more storage clusters 161, each having one or more storage nodes 150, in a flexible and reconfigurable arrangement of both physical components and the amount of storage memory provided thereby. The storage cluster 161 is designed to fit in a rack, and one or more racks may be arranged and populated as needed for the storage memory. The storage cluster 161 has a chassis 138 with a plurality of slots 142. It should be appreciated that the chassis 138 may be referred to as a housing, shell, or rack unit. In one embodiment, the chassis 138 has fourteen slots 142, although other numbers of slots are readily contemplated. For example, some embodiments have four slots, eight slots, sixteen slots, thirty-two slots, or other suitable numbers of slots. In some embodiments, each slot 142 may house one storage node 150. The chassis 138 includes tabs 148 that may be used to mount the chassis 138 to a rack. The fans 144 provide air circulation for cooling the storage nodes 150 and their components, although other cooling components may be used, and embodiments without cooling components are contemplated. The switch mesh fabric 146 couples the storage nodes 150 within the chassis 138 together and to a network for communication with the storage memory. In the embodiment depicted herein, the slots 142 to the left of the switch mesh fabric 146 and fans 144 are shown occupied by storage nodes 150 for illustrative purposes, while the slots 142 to the right of the switch mesh fabric 146 and fans 144 are empty and available for insertion of storage nodes 150. This configuration is one example, and one or more storage nodes 150 may occupy the slots 142 in various other arrangements. In some embodiments, the storage node arrangement need not be sequential or contiguous. Storage nodes 150 are hot pluggable, meaning that a storage node 150 may be inserted into a slot 142 in the chassis 138 or removed from a slot 142 without stopping or shutting down the system. When a storage node 150 is inserted into or removed from a slot 142, the system automatically reconfigures in order to recognize and accommodate the change. In some embodiments, reconfiguring includes restoring redundancy and/or rebalancing data or load.
Each storage node 150 may have multiple components. In the embodiment shown here, the storage node 150 includes a printed circuit board 159 populated by a CPU 156 (i.e., a processor), memory 154 coupled to the CPU 156, and a non-volatile solid state storage 152 coupled to the CPU 156, although other installations and/or components may be used in other embodiments. The memory 154 has instructions executed by the CPU 156 and/or data operated on by the CPU 156. As explained further below, the non-volatile solid-state storage 152 includes flash or, in other embodiments, other types of solid-state memory.
Referring to fig. 2A, the storage cluster 161 is scalable, meaning that storage capacity with non-uniform storage sizes is readily added, as described above. In some embodiments, one or more storage nodes 150 may be inserted into or removed from each chassis, and the storage cluster self-configures. The plug-in storage nodes 150, whether installed in the chassis as delivered or added later, may be of different sizes. For example, in one embodiment, a storage node 150 may have any multiple of 4TB, such as 8TB, 12TB, 16TB, 32TB, and so on. In other embodiments, a storage node 150 may have any multiple of other storage amounts or capacities. The storage capacity of each storage node 150 is broadcast and influences decisions of how to stripe the data. To maximize storage efficiency, embodiments may self-configure as widely as possible in a stripe, subject to a predetermined requirement of continued operation with loss of up to one, or up to two, non-volatile solid state storage 152 units or storage nodes 150 within the chassis.
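A hedged sketch of this self-configuration step follows: given the number of reachable storage units and a required tolerance of one or two unit losses, pick the widest stripe geometry that still meets the requirement. The function name, the loss-tolerance parameter, and the width cap are illustrative assumptions, not values taken from the patent.

def choose_stripe_geometry(available_units: int,
                           tolerated_losses: int = 2,
                           max_width: int = 16) -> tuple[int, int]:
    """Return (data_shards, parity_shards): as wide a stripe as possible
    while still surviving the loss of `tolerated_losses` units."""
    width = min(available_units, max_width)
    if width <= tolerated_losses:
        raise ValueError("not enough storage units to meet the redundancy requirement")
    return width - tolerated_losses, tolerated_losses

# Example: 11 reachable units, tolerate two losses -> a 9 + 2 stripe.
# data_shards, parity_shards = choose_stripe_geometry(11)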
Fig. 2B is a block diagram showing a communication interconnect 173 and a power distribution bus 172 coupling a plurality of storage nodes 150. Referring back to fig. 2A, in some embodiments, the communication interconnect 173 may be included in or implemented with the switch mesh fabric 146. In some embodiments, where multiple storage clusters 161 occupy racks, the communication interconnect 173 may be included in or implemented with a top-of-rack switch. As illustrated in fig. 2B, the storage clusters 161 are enclosed within a single chassis 138. External port 176 is coupled to storage node 150 through communication interconnect 173, while external port 174 is coupled directly to the storage node. An external power port 178 is coupled to the power distribution bus 172. As described with reference to fig. 2A, the storage node 150 may include different amounts and different capacities of the nonvolatile solid-state storage 152. Additionally, the one or more storage nodes 150 may be compute-only storage nodes as illustrated in fig. 2B. The authority 168 is implemented on the non-volatile solid state storage 152, for example, as a list or other data structure stored in memory. In some embodiments, the authority is stored within the non-volatile solid state storage 152 and is supported by software executing on a controller or other processor of the non-volatile solid state storage 152. In another embodiment, authority 168 is implemented on storage node 150, for example, as a list or other data structure stored in memory 154 and supported by software executing on CPU 156 of storage node 150. In some embodiments, authority 168 controls how and where data is stored in non-volatile solid state storage 152. This control helps determine what type of erasure coding scheme is applied to the data, and which storage nodes 150 have which portions of the data. Each authority 168 may be assigned to a non-volatile solid state storage 152. In various embodiments, each authority may control the range of inode numbers, segment numbers, or other data identifiers assigned to data by the file system, storage node 150, or non-volatile solid state storage 152.
In some embodiments, every data segment and every metadata segment has redundancy in the system. In addition, every piece of data and every piece of metadata has an owner, which may be referred to as an authority. If that authority is unreachable (e.g., due to failure of a storage node), there is a plan of succession for how to find the data or the metadata. In various embodiments, there are redundant copies of authorities 168. In some embodiments, authorities 168 have a relationship to storage nodes 150 and non-volatile solid state storage 152. Each authority 168, covering a range of data segment numbers or other identifiers of the data, may be assigned to a specific non-volatile solid state storage 152. In some embodiments, the authorities 168 for all of such ranges are distributed over the non-volatile solid state storage 152 of a storage cluster. Each storage node 150 has a network port that provides access to the non-volatile solid state storage 152 of that storage node 150. In some embodiments, data may be stored in segments associated with segment numbers, and those segment numbers are indirections to a configuration of a RAID (redundant array of independent disks) stripe. The assignment and use of the authorities 168 thus establishes an indirection to data. According to some embodiments, indirection may be referred to as the ability to reference data indirectly, in this case via an authority 168. A segment identifies a set of non-volatile solid state storage 152 units and a local identifier into the set of non-volatile solid state storage 152 that may contain data. In some embodiments, the local identifier is an offset into the device and may be reused sequentially by multiple segments. In other embodiments, the local identifier is unique to a specific segment and is never reused. The offsets in the non-volatile solid state storage 152 are applied to locating data for writing to or reading from the non-volatile solid state storage 152 (in the form of a RAID stripe). Data is striped across multiple units of non-volatile solid state storage 152, which may include or be different from the non-volatile solid state storage 152 having the authority 168 for a particular data segment.
If the location of a particular data segment changes (e.g., during a data move or a data rebuild), the authority 168 for that data segment should be consulted, at the non-volatile solid state storage 152 or the storage node 150 having that authority 168. In order to locate a particular data segment, embodiments calculate a hash value for the data segment or apply an inode number or a data segment number. The output of this operation points to the non-volatile solid state storage 152 having the authority 168 for that particular piece of data. In some embodiments this operation has two stages. The first stage maps an entity identifier (ID), such as a segment number, an inode number, or a directory number, to an authority identifier. This mapping may include a calculation such as a hash or a bitmask. The second stage is mapping the authority identifier to a particular non-volatile solid state storage 152, which may be done through an explicit mapping. The operation is repeatable, so that when the calculation is performed, the result of the calculation repeatedly and reliably points to the particular non-volatile solid state storage 152 having that authority 168. The operation may include the set of reachable storage nodes as input. If the set of reachable non-volatile solid state storage units changes, the optimal set changes. In some embodiments, the persisted value is the current assignment (which is always true) and the calculated value is the target assignment that the cluster will attempt to reconfigure towards. This calculation may be used to determine the optimal non-volatile solid state storage 152 for an authority in the presence of a set of non-volatile solid state storage 152 that are reachable and constitute the same cluster. The calculation also determines an ordered set of peer non-volatile solid state storage 152 that will also record the authority-to-non-volatile-solid-state-storage mapping, so that the authority may be determined even if the assigned non-volatile solid state storage is unreachable. In some embodiments, a duplicate or substitute authority 168 may be consulted if a specific authority 168 is unavailable.
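The two-stage lookup described above can be illustrated with the short Python sketch below: a hash plus a bitmask maps an entity identifier to an authority identifier, and an explicit map then resolves the authority to the storage unit currently hosting it. The hash function, the authority count, and the map structure are illustrative assumptions rather than the patented calculation.

import hashlib

NUM_AUTHORITIES = 512  # illustrative; a power of two so a bitmask can be applied

def entity_to_authority(entity_id: int) -> int:
    """First stage: hash an entity identifier (segment, inode, or directory
    number) and apply a bitmask to obtain an authority identifier."""
    digest = hashlib.blake2b(entity_id.to_bytes(16, "little"), digest_size=8).digest()
    return int.from_bytes(digest, "little") & (NUM_AUTHORITIES - 1)

def authority_to_storage(authority_id: int, explicit_map: dict) -> str:
    """Second stage: an explicit mapping from authority identifier to the
    non-volatile solid state storage unit currently assigned to host it."""
    return explicit_map[authority_id]

# Example: route an update for inode 4711 to the unit owning its authority.
# owner_unit = authority_to_storage(entity_to_authority(4711), current_assignment)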
Referring to fig. 2A and 2B, two of the many tasks of the CPU 156 on a storage node 150 are to break up write data and to reassemble read data. When the system has determined that data is to be written, the authority 168 for that data is located as described above. When the segment ID for the data has been determined, the write request is forwarded to the non-volatile solid state storage 152 currently determined to be the host of the authority 168 determined from the segment. The host CPU 156 of the storage node 150 on which the non-volatile solid state storage 152 and corresponding authority 168 reside then breaks up or shards the data and transmits the data to the various non-volatile solid state storage 152 units. The transmitted data is written as a data stripe in accordance with an erasure coding scheme. In some embodiments, data is requested to be pulled, and in other embodiments, data is pushed. Conversely, when data is read, the authority 168 for the segment ID containing the data is located as described above. The host CPU 156 of the storage node 150 on which the non-volatile solid state storage 152 and corresponding authority 168 reside requests the data from the non-volatile solid state storage and the corresponding storage nodes pointed to by the authority. In some embodiments, the data is read from flash storage as a data stripe. The host CPU 156 of the storage node 150 then reassembles the read data, correcting any errors (if present) according to the appropriate erasure coding scheme, and forwards the reassembled data to the network. In other embodiments, some or all of these tasks may be handled in the non-volatile solid state storage 152. In some embodiments, the segment host requests the data be sent to storage node 150 by requesting pages from storage and then sending the data to the storage node making the original request.
In an embodiment, authority 168 operates to determine how operations will proceed with respect to a particular logical element. Each of the logical elements may be operated on by a particular authority across multiple storage controllers of the storage system. The authority 168 may communicate with multiple storage controllers such that the multiple storage controllers collectively perform operations for those particular logical elements.
In an embodiment, the logical elements may be, for example, files, directories, object buckets, individual objects, delineated parts of files or objects, other forms of key-value pair databases, or tables. In an embodiment, performing an operation may involve, for example, ensuring consistency, structural integrity, and/or recoverability with other operations against the same logical element, reading metadata and data associated with that logical element, determining what data should be written durably into the storage system to persist any changes resulting from the operation, or determining where the metadata and data may be stored across modular storage devices attached to multiple storage controllers in the storage system.
In some embodiments, the operations are token-based transactions that communicate efficiently within a distributed system. Each transaction may be accompanied by or associated with a token that gives permission to execute the transaction. In some embodiments, the authority 168 is able to maintain the pre-transaction state of the system until the operation is complete. Token-based communication may be accomplished without a global lock across the system, and also enables operations to resume in the event of a disruption or other failure.
In some systems, for example in UNIX-style file systems, data is handled with an index node, or inode, which specifies a data structure that represents an object in the file system. For example, the object may be a file or a directory. Metadata may accompany the object as attributes such as permission data and a creation timestamp, among other attributes. A segment number may be assigned to all or a portion of such an object in the file system. In other systems, data segments are handled with segment numbers assigned elsewhere. For purposes of this discussion, the unit of distribution is an entity, and an entity may be a file, a directory, or a segment. That is, entities are units of data or metadata stored by a storage system. Entities are grouped into sets called authorities. Each authority has an authority owner, which is a storage node that has the exclusive right to update the entities in the authority. In other words, a storage node contains the authority, and the authority, in turn, contains entities.
According to some embodiments, a segment is a logical container of data. A segment is an address space between the medium address space and the physical flash locations, i.e., data segment numbers are in this address space. Segments may also contain metadata, which enables data redundancy to be restored (rewritten to different flash locations or devices) without the involvement of higher-level software. In one embodiment, the internal format of a segment contains client data and medium mappings to determine the location of that data. Each data segment is protected, e.g., from memory and other failures, by being divided into data and parity shards, where applicable. The data and parity shards are distributed, i.e., striped, across the non-volatile solid state storage 152 coupled to the host CPUs 156 (see fig. 2E and 2G), in accordance with an erasure coding scheme. In some embodiments, use of the term segment refers to the container and its place in the address space of segments. According to some embodiments, use of the term stripe refers to the same set of shards as a segment, and includes how the shards are distributed along with redundancy or parity information.
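The sketch below shows the shard-and-stripe idea in its simplest form: a segment is split into equal data shards plus a single XOR parity shard, and any one missing shard can be rebuilt from the others. A production system would typically use a multi-parity erasure code such as Reed-Solomon; single-parity XOR is used here only as an illustration.

def shard_segment(segment: bytes, data_shards: int) -> list:
    """Split a segment into `data_shards` equal shards plus one XOR parity shard."""
    shard_len = -(-len(segment) // data_shards)              # ceiling division
    padded = segment.ljust(shard_len * data_shards, b"\0")
    shards = [padded[i * shard_len:(i + 1) * shard_len] for i in range(data_shards)]
    parity = bytearray(shard_len)
    for shard in shards:
        for i, byte in enumerate(shard):
            parity[i] ^= byte
    return shards + [bytes(parity)]

def rebuild_missing(shards: list) -> list:
    """Rebuild exactly one missing shard (marked None) by XORing the rest."""
    missing = [i for i, s in enumerate(shards) if s is None]
    if len(missing) != 1:
        raise ValueError("single XOR parity can rebuild exactly one missing shard")
    length = len(next(s for s in shards if s is not None))
    rebuilt = bytearray(length)
    for s in shards:
        if s is not None:
            for i, byte in enumerate(s):
                rebuilt[i] ^= byte
    shards[missing[0]] = bytes(rebuilt)
    return shards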
A series of address space transformations takes place across the entire storage system. At the top are directory entries (file names) that link to an inode. The inode points into the medium address space, where data is logically stored. Medium addresses may be mapped through a series of indirections to spread the load of large files, or to implement data services such as deduplication or snapshots. Segment addresses are then translated into physical flash locations. According to some embodiments, a physical flash location has an address range bounded by the amount of flash in the system. Medium addresses and segment addresses are logical containers, and in some embodiments use 128-bit or larger identifiers so as to be practically infinite, with a likelihood of reuse calculated as longer than the expected life of the system. In some embodiments, addresses from logical containers are allocated in a hierarchical fashion. Initially, each non-volatile solid state storage 152 unit may be assigned a range of address space. Within this assigned range, the non-volatile solid state storage 152 is able to allocate addresses without synchronization with other non-volatile solid state storage 152.
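This chain of translations can be pictured as a sequence of lookups, sketched below in Python. The table names and the FlashLocation fields are hypothetical; a real system would interpose further indirections (e.g., for deduplication or snapshots) between the medium and segment address spaces.

from dataclasses import dataclass, field

@dataclass(frozen=True)
class FlashLocation:
    storage_unit: str
    erase_block: int
    page: int

@dataclass
class AddressTranslator:
    """File name -> inode -> medium address -> segment address -> flash location."""
    directory: dict = field(default_factory=dict)          # file name -> inode number
    inode_to_medium: dict = field(default_factory=dict)    # inode -> medium address
    medium_to_segment: dict = field(default_factory=dict)  # medium address -> segment address
    segment_to_flash: dict = field(default_factory=dict)   # segment address -> FlashLocation

    def resolve(self, file_name: str) -> FlashLocation:
        inode = self.directory[file_name]
        medium = self.inode_to_medium[inode]
        segment = self.medium_to_segment[medium]
        return self.segment_to_flash[segment]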
Data and metadata are stored by a set of underlying storage layouts that are optimized for varying workload patterns and storage devices. These layouts incorporate multiple redundancy schemes, compression formats, and indexing algorithms. Some of these layouts store information about authorities and authority masters, while others store file metadata and file data. The redundancy schemes include error correction codes that tolerate corrupted bits within a single storage device (e.g., a NAND flash chip), erasure codes that tolerate the failure of multiple storage nodes, and replication schemes that tolerate data center or regional failures. In some embodiments, low density parity check ('LDPC') codes are used within a single storage unit. In some embodiments, Reed-Solomon encoding is used within a storage cluster, and mirroring is used within a storage grid. Metadata may be stored using an ordered log-structured index (such as a log-structured merge tree), and large data may not be stored in a log-structured layout.
In order to maintain consistency across multiple copies of an entity, the storage nodes agree implicitly on two things through calculation: (1) the authority that contains the entity, and (2) the storage node that contains the authority. The assignment of entities to authorities may be accomplished by pseudo-randomly assigning entities to authorities, by splitting entities into ranges based on an externally produced key, or by placing a single entity into each authority. Examples of pseudo-random schemes are linear hashing and the Replication Under Scalable Hashing ('RUSH') family of hashes, including Controlled Replication Under Scalable Hashing ('CRUSH'). In some embodiments, pseudo-random assignment is used only for assigning authorities to nodes, because the set of nodes can change. The set of authorities cannot change, so any subjective function may be applied in these embodiments. Some placement schemes automatically place authorities on storage nodes, while other placement schemes rely on an explicit mapping of authorities to storage nodes. In some embodiments, a pseudo-random scheme is utilized to map from each authority to a set of candidate authority owners. A pseudo-random data distribution function related to CRUSH may assign authorities to storage nodes and create a list of where the authorities are assigned. Each storage node has a copy of the pseudo-random data distribution function, and can arrive at the same calculation for distributing and later finding or locating an authority. In some embodiments, each of the pseudo-random schemes requires the set of reachable storage nodes as input in order to conclude the same target nodes. Once an entity has been placed in an authority, the entity may be stored on physical devices so that no expected failure will lead to unexpected data loss. In some embodiments, the rebalancing algorithm attempts to store copies of all entities within an authority in the same layout and on the same set of machines.
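A simplified stand-in for such a CRUSH-style placement calculation is sketched below: every storage node ranks the candidate owners for an authority deterministically, takes the set of reachable nodes as input, and therefore arrives at the same owner without coordination. The ranking-by-hash approach is a generic rendezvous-hash illustration, not the specific function used by the described system.

import hashlib

def candidate_owners(authority_id: int, all_nodes: list) -> list:
    """Deterministically rank every storage node as a candidate owner for
    the given authority; identical inputs yield an identical ordering on
    every node that runs the calculation."""
    def weight(node: str) -> int:
        h = hashlib.blake2b(f"{authority_id}|{node}".encode(), digest_size=8).digest()
        return int.from_bytes(h, "big")
    return sorted(all_nodes, key=weight, reverse=True)

def locate_authority(authority_id: int, all_nodes: list, reachable: set) -> str:
    """Walk the candidate list and pick the first reachable node, so the
    owner can still be found when some nodes are down."""
    for node in candidate_owners(authority_id, all_nodes):
        if node in reachable:
            return node
    raise RuntimeError("no reachable storage node can own this authority")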
Examples of expected failures include device failures, stolen machines, data center fires, and regional disasters (e.g., nuclear or geological events). Different failures lead to different levels of acceptable data loss. In some embodiments, a stolen storage node affects neither the security nor the reliability of the system, while, depending on the system configuration, a regional event could lead to no loss of data, a few seconds or minutes of lost updates, or even complete data loss.
In an embodiment, the data placement for storing redundancy is independent of the authority placement for data consistency. In some embodiments, the authoritative storage nodes do not contain any persistent storage. Instead, the storage node is connected to a non-volatile solid state storage unit that does not contain an authority. The communication interconnect between the storage nodes and the non-volatile solid state storage units is comprised of a variety of communication technologies and has inconsistent performance and fault tolerance characteristics. In some embodiments, as described above, the nonvolatile solid state storage units are connected via PCI express to storage nodes, which are connected together within a single chassis using an Ethernet backplane, and the chassis are connected together to form a storage cluster. In some embodiments, the storage clusters are connected to the clients using an ethernet network or fibre channel. If multiple storage clusters are configured into a storage grid, the multiple storage clusters are connected using the Internet or other long-range network link (e.g., a "metro scale" link or a dedicated link that does not traverse the Internet).
The authority owner has the exclusive right to modify entities, to migrate entities from one non-volatile solid state storage unit to another, and to add and remove copies of entities. This allows redundancy of the underlying data to be maintained. When an authority owner fails, is decommissioned, or is overloaded, the authority is transferred to a new storage node. Transient failures make it important to ensure that all non-faulty machines agree on the new authority location. The ambiguity that arises due to transient failures may be resolved automatically by a consensus protocol such as Paxos, by hot-warm failover schemes, or via manual intervention by a remote system administrator or by a local hardware administrator (e.g., by physically removing the failed machine from the cluster, or by pressing a button on the failed machine). In some embodiments, a consensus protocol is used and failover is automatic. According to some embodiments, if too many failures or replication events occur in too short a time period, the system goes into a self-preservation mode and halts replication and data movement activities until an administrator intervenes.
As authorities are transferred between storage nodes and as authority owners update entities in their authorities, the system transfers messages between the storage nodes and the non-volatile solid state storage units. With regard to persistent messages, messages that have different purposes are of different types. Depending on the type of message, the system maintains different ordering and durability guarantees. As persistent messages are being processed, the messages are temporarily stored in multiple durable and non-durable storage hardware technologies. In some embodiments, messages are stored in RAM, NVRAM, and on NAND flash devices, and a variety of protocols are used in order to make efficient use of each storage medium. Latency-sensitive client requests may be persisted in replicated NVRAM, and then later in NAND, while background rebalancing operations are persisted directly to NAND.
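The routing of persistent messages to different media can be sketched as below, where latency-sensitive messages are replicated into NVRAM before acknowledgement and background traffic is logged straight to NAND. The message classes and the replica/log handles are hypothetical interfaces used only for illustration.

from enum import Enum, auto

class MessageClass(Enum):
    LATENCY_SENSITIVE = auto()   # e.g., client update acknowledgements
    BACKGROUND = auto()          # e.g., rebalancing traffic

def persist_message(message: bytes, message_class: MessageClass,
                    nvram_replicas: list, nand_log) -> None:
    """Store a persistent message on the medium that matches its ordering
    and durability needs before it is transmitted."""
    if message_class is MessageClass.LATENCY_SENSITIVE:
        for replica in nvram_replicas:       # replicate to NVRAM first ...
            replica.append(message)          # ... destaging to NAND happens later
    else:
        nand_log.append(message)             # background messages go straight to NAND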
Persistent messages are durably stored prior to being transmitted. This allows the system to continue to serve client requests despite failures and component replacement. While many hardware components contain unique identifiers that are visible to system administrators, the manufacturer, the hardware supply chain, and ongoing monitoring quality control infrastructure, applications running on top of the infrastructure address virtualized addresses. These virtualized addresses do not change over the lifetime of the storage system, regardless of component failures and replacements. This allows each component of the storage system to be replaced over time without reconfiguration or disruption of client request processing, i.e., the system supports non-disruptive upgrades.
In some embodiments, the virtualized addresses are stored with sufficient redundancy. The continuous monitoring system correlates hardware and software status with hardware identifiers. This allows for detection and prediction of faults caused by faulty components and manufacturing details. In some embodiments, the monitoring system also enables proactive diversion of authorities and entities away from affected devices before failure occurs by removing components from the critical path.
FIG. 2C is a multi-level block diagram showing the contents of storage node 150 and the contents of non-volatile solid state storage 152 of storage node 150. In some embodiments, data is transferred to and from storage node 150 through a network interface controller ('NIC') 202. As discussed above, each storage node 150 has a CPU 156 and one or more non-volatile solid state storage devices 152. Moving one level down in fig. 2C, each non-volatile solid-state storage 152 has relatively fast non-volatile solid-state memory, such as non-volatile random access memory ('NVRAM') 204 and flash memory 206. In some embodiments, NVRAM 204 may be a component (DRAM, MRAM, PCM) that does not require a program/erase cycle, and may be memory capable of supporting being written to more frequently than memory is read. Moving down still another level in fig. 2C, NVRAM 204 is implemented in one embodiment as high-speed volatile memory, such as Dynamic Random Access Memory (DRAM) 216, supported by energy reserves 218. The energy reserve 218 provides sufficient power to keep the DRAM 216 powered for a sufficient time to transfer content to the flash memory 206 in the event of a power failure. In some embodiments, the energy reserve 218 is a capacitor, super capacitor, battery, or other device that supplies an appropriate supply of energy sufficient to enable the transfer of the contents of the DRAM 216 to a stable storage medium in the event of a power loss. The flash memory 206 is implemented as a plurality of flash dies 222, which may be referred to as a package of flash dies 222 or an array of flash dies 222. It should be appreciated that the flash die 222 may be packaged in a number of ways, with a single die per package, multiple dies per package (i.e., multi-chip packages), in a hybrid package, as bare dies on a printed circuit board or other substrate, as encapsulated dies, etc. In the illustrated embodiment, the non-volatile solid-state storage 152 has a controller 212 or other processor, and an input output (I/O) port 210 coupled to the controller 212. The I/O port 210 is coupled to the CPU 156 and/or the network interface controller 202 of the flash storage node 150. A flash input output (I/O) port 220 is coupled to a flash die 222, and a direct memory access unit (DMA) 214 is coupled to the controller 212, DRAM 216, and flash die 222. In the embodiment shown, I/O port 210, controller 212, DMA unit 214, and flash I/O port 220 are implemented on a programmable logic device ('PLD') 208, such as an FPGA. In this embodiment, each flash die 222 has pages organized as 16kB (kilobyte) pages 224 and registers 226 through which data may be written to or read from the flash die 222. In other embodiments, other types of solid state memory are used in place of or in addition to the flash memory illustrated within flash die 222.
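One way to picture the NVRAM 204 arrangement is the sketch below: a DRAM buffer is treated as non-volatile only while the energy reserve is healthy, and on power loss its contents are streamed to flash during the holdup interval. The dram, flash, and energy_ok hooks are hypothetical stand-ins for the hardware interfaces.

class EnergyBackedNvram:
    """DRAM treated as non-volatile because an energy reserve guarantees
    enough time to destage its contents to flash on power loss."""

    def __init__(self, dram, flash, energy_ok):
        self.dram = dram                 # hypothetical DRAM buffer handle
        self.flash = flash               # hypothetical flash handle
        self.energy_ok = energy_ok       # callable reporting reserve health

    def write(self, offset: int, data: bytes) -> None:
        if not self.energy_ok():
            # Without a healthy reserve, DRAM contents cannot be treated as durable.
            raise RuntimeError("energy reserve degraded; fast write refused")
        self.dram.write(offset, data)

    def on_power_loss(self) -> None:
        # Called from the power-fail path: transfer DRAM contents to flash
        # while the capacitor, supercapacitor, or battery keeps the unit alive.
        self.flash.program(self.dram.dump())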
In various embodiments as disclosed herein, storage clusters 161 may be contrasted with a general storage array. Storage node 150 is part of a collection that creates storage clusters 161. Each storage node 150 has a slice of data and the calculations required to provide the data. The plurality of storage nodes 150 cooperate to store and retrieve data. Memory storage or storage devices typically used in storage arrays are less involved in the processing and manipulation of data. A memory or storage device in a memory array receives a command to read, write, or erase data. The storage memory or storage devices in a storage array are unaware of the larger system in which they are embedded, nor the meaning of the data. The storage memory or storage devices in the storage array may include various types of storage memory, such as RAM, solid state drives, hard drives, and the like. The non-volatile solid-state storage 152 unit described herein has multiple interfaces that are active simultaneously and serve multiple purposes. In some embodiments, some of the functionality of storage node 150 is transferred into storage unit 152, transforming storage unit 152 into a combination of storage unit 152 and storage node 150. Placing the calculation (relative to storing the data) into the storage unit 152 places this calculation closer to the data itself. Various system embodiments have a hierarchy of storage node layers with different capabilities. In contrast, in a storage array, a controller owns and knows everything about all the data that the controller manages in a shelf or storage device. As described herein, in the storage cluster 161, multiple nonvolatile solid state storage 152 units and/or multiple controllers in the storage node 150 cooperate in various ways (e.g., for erasure coding, data slicing, metadata communication and redundancy, storage capacity expansion or contraction, data recovery, etc.).
Fig. 2D shows a storage server environment using an embodiment of the storage node 150 and storage 152 units of fig. 2A-C. In this version, each nonvolatile solid state storage 152 unit has a processor, such as a controller 212 (see fig. 2C), FPGA, flash memory 206, and NVRAM 204 (which is supercapacitor-backed DRAM 216, see fig. 2B and 2C), on a PCIe (peripheral component interconnect express) board in chassis 138 (see fig. 2A). The non-volatile solid state storage 152 unit may be implemented as a single board containing storage devices and may be the largest tolerable fault domain inside the chassis. In some embodiments, up to two non-volatile solid state storage 152 units may fail and the device will continue without losing data.
In some embodiments, the physical storage is divided into named regions based on application usage. The NVRAM 204 is a contiguous block of memory reserved in the non-volatile solid state storage 152 DRAM 216, and is backed by NAND flash. The NVRAM 204 is logically divided into multiple memory regions (e.g., spool_regions) that are written as spools. Space within the NVRAM 204 spools is managed independently by each authority 168. Each device provides an amount of storage space to each authority 168. The authority 168 further manages lifespans and allocations within that space. Examples of spooling include distributed transactions or concepts. When the primary power of a non-volatile solid state storage 152 unit fails, the on-board supercapacitor provides a short duration of power holdup. During this holdup interval, the contents of the NVRAM 204 are flushed to the flash memory 206. On the next power-on, the contents of the NVRAM 204 are recovered from the flash memory 206.
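The per-authority spool arrangement can be sketched as below: a reserved NVRAM region is carved into fixed-size spools, and each authority appends to and drains its own spool independently of the others. The even split and the method names are illustrative assumptions.

class NvramSpools:
    """Carve a reserved NVRAM region into per-authority spools."""

    def __init__(self, total_bytes: int, authority_ids: list):
        self.spool_size = total_bytes // max(len(authority_ids), 1)
        self.spools = {aid: bytearray() for aid in authority_ids}

    def append(self, authority_id: int, record: bytes) -> None:
        spool = self.spools[authority_id]
        if len(spool) + len(record) > self.spool_size:
            raise MemoryError("spool full; the authority must destage to flash first")
        spool.extend(record)

    def drain(self, authority_id: int) -> bytes:
        # Used on normal destage to flash, or during the holdup interval.
        contents = bytes(self.spools[authority_id])
        self.spools[authority_id] = bytearray()
        return contents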
As for the storage unit controllers, the responsibility of the logical "controller" is distributed across each of the blades that contain the authority 168. This distribution of logic control is shown in fig. 2D as host controller 242, intermediate level controller 244, and storage unit controller 246. Although the components may be physically co-located on the same blade, the management of the control plane and the storage plane is handled independently. Each authority 168 effectively acts as a stand-alone controller. Each authority 168 provides its own data and metadata structure, its own background workers, and maintains its own lifecycle.
FIG. 2E is a hardware block diagram of a blade 252 showing a control plane 254, a computation plane 256, and a storage plane 258, and an authority 168 that interacts with underlying physical resources in the storage server environment of FIG. 2D using the embodiments of storage nodes 150 and storage units 152 of FIGS. 2A-C. The control plane 254 is partitioned into a number of authorities 168 that can use computing resources in the computing plane 256 to run on any of the blades 252. The storage plane 258 is partitioned into a set of devices, each of which provides access to flash 206 resources and NVRAM 204 resources. In one embodiment, the computing plane 256 may perform operations of a storage array controller on one or more devices of the storage plane 258 (e.g., a storage array), as described herein.
In the computation plane 256 and the storage plane 258 of FIG. 2E, the authorities 168 interact with the underlying physical resources (i.e., devices). From the point of view of an authority 168, its resources are striped over all of the physical devices. From the point of view of a device, it supplies resources to all authorities 168, irrespective of where the authorities happen to run. Each authority 168 has allocated, or has been allocated, one or more partitions 260 of storage memory in the storage units 152, e.g., partitions 260 in flash memory 206 and NVRAM 204. Each authority 168 uses those allocated partitions 260 that belong to it for writing or reading user data. Authorities can be associated with differing amounts of physical storage of the system. For example, one authority 168 could have a larger number of partitions 260 or larger-sized partitions 260 in one or more storage units 152 than one or more other authorities 168.
FIG. 2F depicts the elasticity software layers in the blades 252 of a storage cluster, according to some embodiments. In the elasticity structure, the elasticity software is symmetric, i.e., the computing module 270 of each blade runs the same three process layers depicted in FIG. 2F. The storage managers 274 execute read and write requests from other blades 252 for data and metadata stored in the local storage unit 152 NVRAM 204 and flash 206. The authorities 168 fulfill client requests by issuing the necessary reads and writes to the blades 252 on whose storage units 152 the corresponding data or metadata resides. The endpoints 272 parse client connection requests received from the switch mesh fabric 146 supervisory software, relay the client connection requests to the authorities 168 responsible for fulfillment, and relay the responses of the authorities 168 to the clients. The symmetric three-layer structure enables the storage system's high degree of concurrency. In these embodiments, elasticity scales out efficiently and reliably. In addition, elasticity implements a unique scale-out technique that balances work evenly across all resources, regardless of client access pattern, and maximizes concurrency by eliminating much of the need for inter-blade coordination that typically occurs with conventional distributed locking.
Still referring to FIG. 2F, the authorities 168 running in the computing modules 270 of the blades 252 perform the internal operations required to fulfill client requests. One feature of elasticity is that the authorities 168 are stateless, i.e., they cache active data and metadata in their own blade 252 DRAM for fast access, but each authority stores every update in its NVRAM 204 partitions on three separate blades 252 until the update has been written to the flash 206. In some embodiments, all storage system writes to NVRAM 204 are made in triplicate to partitions on three separate blades 252. With triple-mirrored NVRAM 204 and persistent storage protected by parity and Reed-Solomon RAID checksums, the storage system can survive the concurrent failure of two blades 252 with no loss of data, metadata, or access to either.
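A minimal sketch of the triple-mirrored NVRAM write, assuming hypothetical per-blade NVRAM partition handles: the update is appended to partitions on three distinct blades before it is acknowledged, so two concurrent blade failures cannot lose it.

def mirrored_nvram_write(update: bytes, blade_partitions: list, copies: int = 3) -> None:
    """Append one update to NVRAM partitions on `copies` distinct blades
    before acknowledging it to the client."""
    if len(blade_partitions) < copies:
        raise ValueError("not enough blades to satisfy the mirroring requirement")
    for partition in blade_partitions[:copies]:   # a real system picks three distinct blades
        partition.nvram_append(update)            # must complete on all copies
    # Only after all copies are durable is the client acknowledged.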
Because authorities 168 are stateless, they can migrate between blades 252. Each authority 168 has a unique identifier. The NVRAM 204 and flash 206 partitions are associated with an authority's 168 identifier, not with the blade 252 on which the authority is running. Thus, when an authority 168 migrates, the authority 168 continues to manage the same storage partitions from its new location. When a new blade 252 is installed in an embodiment of the storage cluster, the system automatically rebalances load by: partitioning the new blade's 252 storage for use by the system's authorities 168, migrating selected authorities 168 to the new blade 252, and starting endpoints 272 on the new blade 252 and including them in the switch mesh fabric's 146 client connection distribution algorithm.
From its new location, the migrating authority 168 persists the contents of its NVRAM 204 partition on the flash 206, processes read and write requests from other authorities 168, and fulfills client requests directed to it by endpoint 272. Similarly, if a blade 252 fails or is removed, the system redistributes its authority 168 among the remaining blades 252 of the system. The redistributed authority 168 continues to perform its original function from its new location.
FIG. 2G depicts authorities 168 and storage resources in the blades 252 of a storage cluster, according to some embodiments. Each authority 168 is exclusively responsible for its flash 206 and NVRAM 204 partitions on each blade 252. The authority 168 manages the content and integrity of its partitions independently of other authorities 168. The authority 168 compresses incoming data and holds it temporarily in its NVRAM 204 partitions, and then consolidates, RAID-protects, and persists the data in storage segments in its flash 206 partitions. As the authority 168 writes data to the flash 206, the storage managers 274 perform the necessary flash translation to optimize write performance and maximize media longevity. In the background, the authorities 168 "garbage collect," or reclaim space occupied by data that clients have made obsolete by overwriting the data. It should be appreciated that because the authorities' 168 partitions are disjoint, there is no need for distributed locking to execute client reads and writes or to perform background functions.
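The front-to-back flow for one authority, as described above, can be sketched as follows: compress the incoming data, hold it in the authority's NVRAM partition, and once enough has accumulated, consolidate it, erasure-code it, and persist it into the authority's flash partition. The partition handles and the stripe_fn callback are hypothetical, and zlib stands in for whichever compression the system applies.

import zlib

def authority_write_path(data: bytes, nvram_partition, flash_partition, stripe_fn) -> None:
    """Compress, stage in NVRAM, then consolidate, RAID-protect, and persist to flash."""
    compressed = zlib.compress(data)
    nvram_partition.append(compressed)            # fast, durable staging
    if nvram_partition.should_destage():          # e.g., enough data has accumulated
        batch = nvram_partition.drain()
        shards = stripe_fn(batch)                 # data plus parity shards
        flash_partition.write_stripe(shards)      # persist into flash storage segments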
The embodiments described herein may utilize various software, communication, and/or networking protocols. In addition, the configuration of the hardware and/or software may be adjusted to accommodate various protocols. For example, embodiments may utilize Active Directory, a database-based system that provides authentication, directory, policy, and other services in a WINDOWS environment. In these embodiments, LDAP (lightweight directory access protocol) is one example application protocol for querying and modifying items in directory service providers such as Active Directory. In some embodiments, a network lock manager ('NLM') is utilized as a facility that works in cooperation with the network file system ('NFS') to provide System V style advisory file and record locking over a network. The server message block ('SMB') protocol, one version of which is also known as the common internet file system ('CIFS'), may be integrated with the storage systems discussed herein. SMB operates as an application-layer network protocol typically used for providing shared access to files, printers, and serial ports, as well as miscellaneous communications between nodes on a network. SMB also provides an authenticated inter-process communication mechanism. Amazon S3 (simple storage service) is a web service offered by Amazon Web Services, and the systems described herein may interface with Amazon S3 through web service interfaces (REST (representational state transfer), SOAP (simple object access protocol), and BitTorrent). A RESTful API (application programming interface) breaks down a transaction to create a series of small modules. Each module addresses a particular underlying part of the transaction. The control or permissions provided with these embodiments, especially for object data, may include the utilization of access control lists ('ACLs'). An ACL is a list of permissions attached to an object, and it specifies which users or system processes are granted access to the object, as well as which operations are allowed on a given object. The systems may utilize Internet Protocol version 6 ('IPv6'), as well as IPv4, as communication protocols that provide an identification and location system for computers on networks and route traffic across the internet. Routing of packets between networked systems may include equal-cost multi-path routing ('ECMP'), a routing strategy in which next-hop packet forwarding to a single destination can occur over multiple 'best paths' that tie for top place in routing metric calculations. Multi-path routing can be used in conjunction with most routing protocols because it is a per-hop decision limited to a single router. The software may support multi-tenancy, an architecture in which a single instance of a software application serves multiple customers. Each customer may be referred to as a tenant. In some embodiments, tenants may be given the ability to customize some parts of the application, but may not customize the application's code. The embodiments may maintain audit logs. An audit log is a document that records events in a computing system. In addition to documenting which resources were accessed, audit log entries typically include destination and source addresses, a timestamp, and user login information for compliance with various regulations. The embodiments may support various key management policies, such as encryption key rotation.
In addition, the system may support some variations of dynamic root passwords or dynamically changing passwords.
Fig. 3A sets forth a diagram of a storage system 306 coupled in data communication with a cloud service provider 302 according to some embodiments of the present disclosure. Although not depicted in detail, the storage system 306 depicted in fig. 3A may be similar to the storage systems described above with reference to fig. 1A-1D and 2A-2G. In some embodiments, the storage system 306 depicted in fig. 3A may be embodied as a storage system that includes unbalanced active/active controllers, as a storage system that includes balanced active/active controllers, as a storage system that includes active/active controllers in which fewer than all of each controller's resources are utilized so that each controller has reserve resources available to support failover, as a storage system that includes fully active/active controllers, as a storage system that includes dataset-segregated controllers, as a storage system that includes a dual-tier architecture with front-end controllers and back-end integrated storage controllers, as a storage system that includes scale-out clusters of dual-controller arrays, as well as combinations of such embodiments.
In the example depicted in fig. 3A, storage system 306 is coupled to cloud service provider 302 via data communication link 304. The data communication link 304 may be embodied as a dedicated data communication link, as a data communication path provided through the use of one or more data communication networks, such as a wide area network ('WAN') or LAN, or as some other mechanism capable of conveying digital information between the storage system 306 and the cloud service provider 302. This data communication link 304 may be entirely wired, entirely wireless, or some aggregation of wired and wireless data communication paths. In this example, digital information may be exchanged between storage system 306 and cloud service provider 302 via data communication link 304 using one or more data communication protocols. For example, digital information may be exchanged between storage system 306 and cloud service provider 302 via data communication link 304 using the following protocols: handheld device transport protocol ('HDTP'), hypertext transport protocol ('HTTP'), internet protocol ('IP'), real-time transport protocol ('RTP'), transmission control protocol ('TCP'), user datagram protocol ('UDP'), wireless application protocol ('WAP'), or other protocol.
The cloud service provider 302 depicted in fig. 3A may be embodied as a system and computing environment that provides a wide variety of services to users of the cloud service provider 302, such as by sharing computing resources via the data communication link 304. Cloud service provider 302 may provide on-demand access to a shared pool of configurable computing resources, such as computer networks, servers, storage devices, applications, services, and the like. The shared pool of configurable resources may be quickly provisioned and distributed to users of cloud service provider 302 with minimal management effort. In general, the user of cloud service provider 302 is unaware of the exact computing resources that cloud service provider 302 utilizes to provide the service. While in many cases this cloud service provider 302 may be accessed via the internet, readers of skill in the art will recognize that any system that abstracts the use of shared resources to provide services to users over any data communication link may be considered a cloud service provider 302.
In the example depicted in fig. 3A, cloud service provider 302 may be configured to provide a variety of services to storage system 306 and users of storage system 306 by implementing various service models. For example, cloud service provider 302 may be configured to provide services by implementing an infrastructure as a service ('IaaS') service model, by implementing a platform as a service ('PaaS') service model, by implementing a software as a service ('SaaS') service model, by implementing an authentication as a service ('AaaS') service model, by implementing a storage as a service model in which cloud service provider 302 provides access to its storage infrastructure for use by storage system 306 and users of storage system 306, and so on. Readers will appreciate that cloud service provider 302 may be configured to provide additional services to storage system 306 and users of storage system 306 by implementing additional service models, as the service models described above are included for purposes of explanation only and are in no way representative of limitations on services that cloud service provider 302 may provide or regarding service models that cloud service provider 302 may implement.
In the example depicted in fig. 3A, cloud service provider 302 may be embodied as, for example, a private cloud, as a public cloud, or as a combination of a private cloud and a public cloud. In embodiments in which cloud service provider 302 is embodied as a private cloud, cloud service provider 302 may be dedicated to providing services to a single organization rather than multiple organizations. In embodiments in which cloud service provider 302 is embodied as a public cloud, cloud service provider 302 may provide services to multiple organizations. In still other alternative embodiments, cloud service provider 302 may embody a hybrid cloud deployment as a hybrid of private cloud and public cloud services.
Although not explicitly depicted in fig. 3A, the reader will appreciate that a large number of additional hardware components and additional software components may be necessary to facilitate the delivery of cloud services to storage system 306 and users of storage system 306. For example, the storage system 306 may be coupled to (or even include) a cloud storage gateway. Such a cloud storage gateway may be embodied, for example, as a hardware-based or software-based appliance that is located on-premises with the storage system 306. Such a cloud storage gateway may operate as a bridge between local applications that are executing on the storage system 306 and remote, cloud-based storage that is utilized by the storage system 306. Through the use of a cloud storage gateway, an organization may move primary iSCSI or NAS to cloud service provider 302, thereby enabling the organization to save space on its on-premises storage system. Such a cloud storage gateway may be configured to emulate a disk array, a block-based device, a file server, or another storage system that can translate SCSI commands, file server commands, or other appropriate commands into REST-space protocols that facilitate communication with cloud service provider 302.
In order to enable storage system 306 and users of storage system 306 to make use of the services provided by cloud service provider 302, a cloud migration process may take place during which data, applications, or other elements are moved from an organization's local systems (or even from another cloud environment) to cloud service provider 302. In order to successfully migrate data, applications, or other elements to the environment of cloud service provider 302, middleware such as a cloud migration tool may be utilized to bridge gaps between the environment of cloud service provider 302 and the environment of the organization. Such cloud migration tools may also be configured to address the potentially high network costs and long transfer times associated with migrating large volumes of data to cloud service provider 302, as well as to address security concerns associated with sending sensitive data to cloud service provider 302 over a data communications network. In order to further enable storage system 306 and users of storage system 306 to make use of the services provided by cloud service provider 302, a cloud orchestrator may also be used to arrange and coordinate automated tasks in pursuit of creating a consolidated process or workflow. Such a cloud orchestrator may perform tasks such as configuring various components, whether those components are cloud components or on-premises components, as well as managing the interconnections between such components. The cloud orchestrator can simplify inter-component communication and connections to ensure that links are correctly configured and maintained.
In the example depicted in fig. 3A, and as briefly described above, cloud service provider 302 may be configured to provide services to storage system 306 and users of storage system 306 through the use of a SaaS service model, eliminating the need to install and run the application on local computers, which may simplify maintenance and support of the application. Such applications may take many forms in accordance with various embodiments of the present disclosure. For example, cloud service provider 302 may be configured to provide storage system 306 and users of storage system 306 with access to data analytics applications. Such data analytics applications may, for example, be configured to receive vast amounts of telemetry data that is phoned home by storage system 306. Such telemetry data may describe various operating characteristics of storage system 306 and may be analyzed for a wide variety of purposes, including, for example, determining the health of storage system 306, identifying the workloads that are executing on storage system 306, predicting when storage system 306 will run out of various resources, recommending configuration changes, hardware or software upgrades, workflow migrations, or other actions that may improve the operation of storage system 306.
Cloud service provider 302 may also be configured to provide storage system 306 and users of storage system 306 with access to virtualized computing environments. Such virtualized computing environments may be embodied as, for example, virtual machines or other virtualized computer hardware platforms, virtual storage, virtualized computer network resources, and the like. Examples of such virtualized environments may include virtual machines created to emulate an actual computer, virtualized desktop environments that separate logical desktops from physical machines, virtualized file systems that allow uniform access to different types of concrete file systems, and many other virtualized environments.
Although the example depicted in fig. 3A illustrates the storage system 306 coupled for data communication with the cloud service provider 302, in other embodiments the storage system 306 may be part of a hybrid cloud deployment in which private cloud elements (e.g., private cloud services, on-premises infrastructure, etc.) are combined with public cloud elements (e.g., public cloud services and infrastructure that may be provided by one or more cloud service providers) to form a single solution with orchestration between the various platforms. Such a hybrid cloud deployment may utilize hybrid cloud management software, such as Azure™ Arc from Microsoft™, which centralizes the management of the hybrid cloud deployment across any infrastructure and enables the deployment of services anywhere. In this example, the hybrid cloud management software may be configured to create, update, and delete resources (both physical and virtual) that form the hybrid cloud deployment, to allocate compute and storage to particular workloads, to monitor workloads and resources for performance, policy compliance, updates and patching, and security status, and to perform a variety of other tasks.
Readers will appreciate that by pairing the storage systems described herein with one or more cloud service providers, various offerings may be enabled. For example, disaster recovery as a service ('DRaaS') may be provided, in which cloud resources are utilized to protect applications and data from disruption caused by disasters, including in embodiments in which the storage system may serve as the primary data store. In such embodiments, an overall system backup may be taken that allows for business continuity in the event of a system failure. In such embodiments, cloud data backup techniques (by themselves or as part of a larger DRaaS solution) may also be integrated into an overall solution that includes the storage systems and cloud service providers described herein.
The storage systems and cloud service providers described herein may be used to provide a wide variety of security features. For example, the storage system may encrypt data at rest (and data may be sent to and from the storage system in encrypted form) and may utilize key management as a service ('KMaaS') to manage encryption keys, keys for locking and unlocking storage devices, and so forth. Likewise, a cloud data security gateway or similar mechanism may be utilized to ensure that data stored within the storage system does not improperly end up being stored in the cloud as part of a cloud data backup operation. Furthermore, microsegmentation or identity-based segmentation may be utilized in a data center that contains the storage system, or within the cloud service provider, to create secure zones in data centers and cloud deployments that enable workloads to be isolated from one another.
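As a purely illustrative sketch of combining encryption of data at rest with keys obtained from a KMaaS offering, the following Python example simulates the key service with an in-memory class and uses the third-party 'cryptography' package (installable via pip) for the cipher; both choices are assumptions made for the example rather than a description of any particular system.

from cryptography.fernet import Fernet


class KeyService:
    """Stand-in for a key-management-as-a-service ('KMaaS') endpoint."""

    def __init__(self):
        self._keys = {}

    def get_key(self, key_id):
        # In a real deployment the key would be fetched over an authenticated
        # channel; here it is simply created on first use and cached.
        if key_id not in self._keys:
            self._keys[key_id] = Fernet.generate_key()
        return self._keys[key_id]


kms = KeyService()
cipher = Fernet(kms.get_key("volume-17"))

plaintext = b"block of user data"
stored = cipher.encrypt(plaintext)   # what lands on the storage device
recovered = cipher.decrypt(stored)   # what is returned to the host
assert recovered == plaintext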
For further explanation, fig. 3B sets forth a diagram of a storage system 306 according to some embodiments of the present disclosure. Although not depicted in detail, the storage system 306 depicted in fig. 3B may be similar to the storage system described above with reference to fig. 1A-1D and 2A-2G, as the storage system may include many of the components described above.
The storage system 306 depicted in fig. 3B may include a large number of storage resources 308, which may be embodied in many forms. For example, the storage resources 308 may include nano-RAM or another form of non-volatile random access memory that utilizes carbon nanotubes deposited on a substrate, 3D cross-point non-volatile memory, or flash memory, including single-level cell ('SLC') NAND flash, multi-level cell ('MLC') NAND flash, triple-level cell ('TLC') NAND flash, quad-level cell ('QLC') NAND flash, and others. Likewise, the storage resources 308 may include non-volatile magnetoresistive random access memory ('MRAM'), including spin transfer torque ('STT') MRAM. The example storage resources 308 may alternatively include non-volatile phase change memory ('PCM'), quantum memory that allows for the storage and retrieval of photonic quantum information, resistive random access memory ('ReRAM'), storage class memory ('SCM'), or other forms of storage resources, including any combination of the resources described herein. Readers will appreciate that the storage systems described above may utilize other forms of computer memory and storage devices, including DRAM, SRAM, EEPROM, universal memory, and many others. The storage resources 308 depicted in fig. 3B may be embodied in a variety of form factors, including, but not limited to, dual in-line memory modules ('DIMMs'), non-volatile dual in-line memory modules ('NVDIMMs'), M.2, U.2, and others.
The storage resources 308 depicted in fig. 3B may include various forms of SCM. SCM may effectively treat fast non-volatile memory (e.g., NAND flash) as an extension of DRAM such that an entire data set may be treated as an in-memory data set residing entirely in DRAM. SCM may include non-volatile media such as NAND flash. Such NAND flash may be accessed utilizing NVMe, which may use the PCIe bus as its transport, providing relatively low access latency compared to older protocols. In fact, network protocols for SSDs in all-flash arrays may include NVMe using Ethernet (ROCE, NVMe TCP), Fibre Channel (NVMe FC), InfiniBand (iWARP), and other protocols that make it possible to treat fast non-volatile memory as an extension of DRAM. In view of the fact that DRAM is typically byte-addressable while fast non-volatile memory such as NAND flash is block-addressable, a controller software/hardware stack may be needed to convert the block data into the bytes that are stored in the media. Examples of media and software that may be used as SCM include, for example, 3D XPoint, Intel Memory Drive Technology, Samsung's Z-SSD, and others.
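One simple way to picture the byte/block mismatch noted above is the following Python sketch, in which a small byte-addressed write is translated by a toy controller stack into a read-modify-write of block-addressed media; the block size and the in-memory 'media' are illustrative assumptions.

BLOCK_SIZE = 4096
media = {}  # block number -> bytes, standing in for block-addressable NAND flash


def read_block(block_no):
    return bytearray(media.get(block_no, bytes(BLOCK_SIZE)))


def write_block(block_no, data):
    assert len(data) == BLOCK_SIZE
    media[block_no] = bytes(data)


def byte_write(offset, payload: bytes):
    """Service a byte-addressed write on top of block-addressed media."""
    while payload:
        block_no, start = divmod(offset, BLOCK_SIZE)
        span = min(len(payload), BLOCK_SIZE - start)
        block = read_block(block_no)                 # read
        block[start:start + span] = payload[:span]   # modify
        write_block(block_no, block)                 # write
        offset += span
        payload = payload[span:]


byte_write(4090, b"hello world")  # straddles two 4 KiB blocks
print(sorted(media.keys()))       # -> [0, 1]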
The storage resources 308 depicted in fig. 3B may also include racetrack memory (also referred to as domain-wall memory). Such racetrack memory may be embodied as a form of non-volatile, solid-state memory that relies on the intrinsic strength and orientation of the magnetic field created by electrons as they spin, in addition to their electron charge, in solid-state devices. By using spin-coherent electric current to move magnetic domains along a nanoscopic permalloy wire, the domains may pass by magnetic read/write heads positioned near the wire as current is passed through the wire, which alter the domains to record patterns of bits. To create a racetrack memory device, many such wires and read/write elements may be packaged together.
The example storage system 306 depicted in fig. 3B may implement a variety of storage architectures. For example, storage systems in accordance with some embodiments of the present disclosure may utilize block storage, where data is stored in blocks and each block essentially acts as an individual hard disk drive. Storage systems in accordance with some embodiments of the present disclosure may utilize object storage, where data is managed as objects. Each object may include the data itself, a variable amount of metadata, and a globally unique identifier, and object storage may be implemented at multiple levels (e.g., the device level, the system level, the interface level). Storage systems in accordance with some embodiments of the present disclosure may utilize file storage, in which data is stored in a hierarchical structure. Such data may be saved in files and folders and presented in the same format both to the system that stores it and to the system that retrieves it.
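The object abstraction described above can be illustrated with a short Python sketch in which each object carries its data, a variable amount of metadata, and a globally unique identifier; the in-memory dictionary stands in for an object store and is an assumption made for the example.

import uuid

object_store = {}


def put_object(data: bytes, **metadata):
    object_id = str(uuid.uuid4())  # globally unique identifier
    object_store[object_id] = {"data": data, "metadata": metadata}
    return object_id


def get_object(object_id):
    return object_store[object_id]


oid = put_object(b"sensor readings", source="array-7", content_type="text/plain")
print(oid, get_object(oid)["metadata"])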
The example storage system 306 depicted in fig. 3B may be embodied as a storage system in which additional storage resources may be added through the use of a longitudinal expansion (scale-up) model, in which additional storage resources may be added through the use of a lateral expansion (scale-out) model, or through some combination thereof. In the scale-up model, additional storage may be added by adding additional storage devices. In the scale-out model, however, additional storage nodes may be added to a cluster of storage nodes, where such storage nodes may include additional processing resources, additional networking resources, and so on.
The example storage system 306 depicted in fig. 3B may utilize the storage resources described above in a number of different ways. For example, some portion of the storage resources may be used to serve as a write cache, storage resources within the storage system may be used as a read cache, or tiering may be implemented within the storage system by placing data within the storage system according to one or more tiering policies.
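As a purely illustrative sketch of such placement decisions, the following Python example encodes one possible tiering policy; the tier names and thresholds are assumptions and not a description of any particular policy engine.

def choose_tier(is_write, access_frequency_per_day):
    if is_write:
        return "nvram-write-cache"        # absorb writes before destaging
    if access_frequency_per_day >= 100:
        return "ssd-read-cache"           # hot data stays on fast media
    if access_frequency_per_day >= 1:
        return "flash-capacity-tier"
    return "object-archive-tier"          # cold data moves to the cheapest tier


print(choose_tier(is_write=True, access_frequency_per_day=0))
print(choose_tier(is_write=False, access_frequency_per_day=250))
print(choose_tier(is_write=False, access_frequency_per_day=0.1))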
The storage system 306 depicted in fig. 3B also includes communication resources 310 that may be used to facilitate data communication between components within the storage system 306, as well as between the storage system 306 and computing devices outside the storage system 306, including embodiments in which those resources are separated by a relatively wide area. The communication resources 310 may be configured to utilize a variety of different protocols and data communication fabrics to facilitate data communication between components within the storage system and computing devices outside the storage system. For example, the communication resources 310 may include: fibre channel ('FC') technologies, such as FC fabrics and FC protocols that can transport SCSI commands over FC networks; FC over Ethernet ('FCoE') technologies, through which FC frames are encapsulated and transmitted over Ethernet networks; InfiniBand ('IB') technologies, in which a switched fabric topology is utilized to facilitate transmissions between channel adapters; NVM Express ('NVMe') technologies and NVMe over fabrics ('NVMe-oF') technologies, through which non-volatile storage media attached via a PCI Express ('PCIe') bus may be accessed; and others. In fact, the storage systems described above may, directly or indirectly, make use of neutrino communication technologies and devices through which information (including binary information) is transmitted using a beam of neutrinos.
The communication resources 310 may also include mechanisms for accessing the storage resources 308 within the storage system 306 utilizing serial attached SCSI ('SAS'), serial ATA ('SATA') bus interfaces for connecting the storage resources 308 within the storage system 306 to host bus adapters within the storage system 306, Internet Small Computer System Interface ('iSCSI') technologies to provide block-level access to the storage resources 308 within the storage system 306, and other communication resources that may be useful in facilitating data communication between components within the storage system 306 and between the storage system 306 and computing devices outside the storage system 306.
The storage system 306 depicted in fig. 3B also includes processing resources 312 that may be used to execute computer program instructions and perform other computational tasks within the storage system 306. The processing resources 312 may include one or more ASICs that are customized for some particular purpose as well as one or more CPUs. The processing resources 312 may also include one or more DSPs, one or more FPGAs, one or more systems on a chip ('SoCs'), or other forms of processing resources 312. The storage system 306 may utilize the processing resources 312 to perform a variety of tasks, including but not limited to supporting the execution of software resources 314 that will be described in greater detail below.
The storage system 306 depicted in fig. 3B also includes software resources 314 that, when executed by processing resources 312 within the storage system 306, may perform a wide variety of tasks. For example, the software resources 314 may include one or more modules of computer program instructions for performing various data protection techniques when executed by the processing resources 312 within the storage system 306. Such data protection techniques may be performed, for example, by system software executing on computer hardware within a storage system, by a cloud service provider, or otherwise. Such data protection techniques may include data archiving techniques, data backup techniques, data replication techniques, data snapshot techniques, data and database cloning techniques, and other data protection techniques.
The software resources 314 may also include software for implementing software-defined storage ('SDS'). In such an example, the software resources 314 may include one or more modules of computer program instructions that, when executed, are useful in policy-based provisioning and management of data storage that is independent of the underlying hardware. Such software resources 314 may be useful in implementing storage virtualization to separate the storage hardware from the software that manages the storage hardware.
The software resources 314 may also include software for facilitating and optimizing I/O operations that are directed to the storage system 306. For example, the software resources 314 may include software modules that perform various data reduction techniques (e.g., data compression, data deduplication, and others). The software resources 314 may include software modules that intelligently group I/O operations together to facilitate better usage of the underlying storage resources 308, software modules that perform data migration operations to migrate data from within the storage system, and software modules that perform other functions. Such software resources 314 may be embodied as one or more software containers or in many other ways.
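As a simplified illustration of one of the data reduction techniques mentioned above, the following Python sketch shows inline deduplication by fingerprinting incoming blocks; the index and backing store are in-memory stand-ins assumed for the example.

import hashlib

fingerprint_index = {}   # fingerprint -> physical location
physical_store = []      # list standing in for backend storage


def write_block(block: bytes):
    fingerprint = hashlib.sha256(block).hexdigest()
    if fingerprint in fingerprint_index:
        return fingerprint_index[fingerprint]   # duplicate: store a reference only
    physical_store.append(block)
    location = len(physical_store) - 1
    fingerprint_index[fingerprint] = location
    return location


locations = [write_block(b"A" * 4096), write_block(b"B" * 4096), write_block(b"A" * 4096)]
print(locations)             # -> [0, 1, 0]
print(len(physical_store))   # only two unique blocks are stored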
For further explanation, fig. 3C sets forth an example of a cloud-based storage system 318 according to some embodiments of the present disclosure. In the example depicted in fig. 3C, the cloud-based storage system 318 is created entirely in a cloud computing environment 316 (e.g., Amazon Web Services ('AWS')™, Microsoft Azure™, Google Cloud Platform™, IBM Cloud™, Oracle Cloud™, and others). The cloud-based storage system 318 may be used to provide services similar to the services that may be provided by the storage systems described above.
The cloud-based storage system 318 depicted in fig. 3C includes two cloud computing instances 320, 322 that are each used to support the execution of a storage controller application 324, 326. The cloud computing instances 320, 322 may be embodied, for example, as instances of cloud computing resources (e.g., virtual machines) that may be provided by the cloud computing environment 316 to support the execution of software applications such as the storage controller applications 324, 326. For example, each of the cloud computing instances 320, 322 may execute on an Azure VM, where each Azure VM may include high-speed temporary storage that may be leveraged as a cache (e.g., as a read cache). In one embodiment, the cloud computing instances 320, 322 may be embodied as Amazon Elastic Compute Cloud ('EC2') instances. In such an example, an Amazon Machine Image ('AMI') that includes the storage controller application 324, 326 may be booted to create and configure a virtual machine that may execute the storage controller application 324, 326.
In the example method depicted in fig. 3C, the storage controller applications 324, 326 may be embodied as modules of computer program instructions that, when executed, carry out various storage tasks. For example, the storage controller applications 324, 326 may be embodied as modules of computer program instructions that, when executed, carry out the same tasks as the controllers 110A, 110B in fig. 1A described above, such as writing data to the cloud-based storage system 318, erasing data from the cloud-based storage system 318, retrieving data from the cloud-based storage system 318, monitoring and reporting disk utilization and performance, performing redundancy operations such as RAID or RAID-like data redundancy operations, compressing data, encrypting data, deduplicating data, and so forth. The reader will appreciate that since there are two cloud computing instances 320, 322 that each include a storage controller application 324, 326, in some embodiments one cloud computing instance 320 may operate as the primary controller as described above, while the other cloud computing instance 322 may operate as the secondary controller as described above. The reader will appreciate that the storage controller applications 324, 326 depicted in fig. 3C may include identical source code that is executed within different cloud computing instances 320, 322 (e.g., distinct EC2 instances).
The reader will appreciate that other embodiments that do not include a primary and secondary controller are within the scope of the present disclosure. For example, each cloud computing instance 320, 322 may operate as a primary controller for some portion of the address space supported by the cloud-based storage system 318, each cloud computing instance 320, 322 may operate as a primary controller where the servicing of I/O operations directed to the cloud-based storage system 318 is divided in some other way, and so on. In fact, in other embodiments where cost savings may be prioritized over performance demands, only a single cloud computing instance may exist that contains the storage controller application.
The cloud-based storage system 318 depicted in fig. 3C includes cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338. The cloud computing instances 340a, 340b, 340n may be embodied, for example, as instances of cloud computing resources that may be provided by the cloud computing environment 316 to support the execution of software applications. The cloud computing instances 340a, 340b, 340n of fig. 3C may differ from the cloud computing instances 320, 322 described above in that the cloud computing instances 340a, 340b, 340n of fig. 3C have local storage 330, 334, 338 resources, whereas the cloud computing instances 320, 322 that support the execution of the storage controller applications 324, 326 need not have local storage resources. The cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 may be embodied, for example, as EC2 M5 instances that include one or more SSDs, as EC2 R5 instances that include one or more SSDs, as EC2 I3 instances that include one or more SSDs, and so on. In some embodiments, the local storage 330, 334, 338 must be embodied as solid-state storage (e.g., SSDs) rather than storage that makes use of hard disk drives.
In the example depicted in fig. 3C, each of the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 may include a software daemon 328, 332, 336 that, when executed by a cloud computing instance 340a, 340b, 340n, may present itself to the storage controller applications 324, 326 as if the cloud computing instance 340a, 340b, 340n were a physical storage device (e.g., one or more SSDs). In such an example, the software daemon 328, 332, 336 may include computer program instructions similar to those that would normally be contained on a storage device, such that the storage controller applications 324, 326 can send and receive the same commands that a storage controller would send to storage devices. In this way, the storage controller applications 324, 326 may include code that is identical to (or substantially identical to) the code that would be executed by the controllers in the storage systems described above. In these and similar embodiments, communications between the storage controller applications 324, 326 and the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 may utilize iSCSI, NVMe over TCP, messaging, a custom protocol, or some other mechanism.
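As a purely illustrative sketch of the kind of command dispatch such a software daemon might implement, the following Python example accepts drive-style read and write commands against an instance's local storage; the message format and command names are hypothetical.

class VirtualDriveDaemon:
    def __init__(self):
        self.local_storage = {}  # logical block address -> data

    def handle(self, command):
        # 'command' mimics a request a storage controller might send to a drive.
        if command["op"] == "WRITE":
            self.local_storage[command["lba"]] = command["data"]
            return {"status": "ok"}
        if command["op"] == "READ":
            return {"status": "ok", "data": self.local_storage.get(command["lba"])}
        return {"status": "unsupported"}


daemon = VirtualDriveDaemon()
daemon.handle({"op": "WRITE", "lba": 7, "data": b"payload"})
print(daemon.handle({"op": "READ", "lba": 7}))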
In the example depicted in fig. 3C, each of the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 may also be coupled to block storage 342, 344, 346 that is offered by the cloud computing environment 316, such as Amazon Elastic Block Store ('EBS') volumes. In such an example, the block storage 342, 344, 346 offered by the cloud computing environment 316 may be utilized in a manner similar to how the NVRAM devices described above are utilized, as the software daemon 328, 332, 336 (or some other module) executing within a particular cloud computing instance 340a, 340b, 340n may, upon receiving a request to write data, initiate a write of the data to its attached EBS volume as well as a write of the data to its local storage 330, 334, 338 resources. In some alternative embodiments, data may only be written to the local storage 330, 334, 338 resources within a particular cloud computing instance 340a, 340b, 340n. In an alternative embodiment, rather than using the block storage 342, 344, 346 offered by the cloud computing environment 316 as NVRAM, actual RAM on each of the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 may be used as NVRAM, thereby decreasing the network utilization costs that would be associated with using an EBS volume as NVRAM. In yet another embodiment, high-performance block storage resources, such as one or more Azure Ultra Disks, may be utilized as NVRAM.
The storage controller applications 324, 326 may be used to perform various tasks such as deduplicating data contained in a request, compressing data contained in the request, determining where to write data contained in the request, etc., and then ultimately sending the request to write a deduplicated, encrypted, or otherwise potentially updated version of data to one or more cloud computing instances 340a, 340b, 340n having local storage 330, 334, 338. In some embodiments, either cloud computing instance 320, 322 may receive a request to read data from cloud-based storage system 318, and may ultimately send a request to read data to one or more of cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338.
When a particular cloud computing instance 340a, 340b, 340n with local storage 330, 334, 338 receives a request to write data, the software daemon 328, 332, 336 may be configured to not only write the data to its own local storage 330, 334, 338 resources and any appropriate block storage 342, 344, 346 resources, but the software daemon 328, 332, 336 may also be configured to write the data to cloud-based object storage 348 that is attached to the particular cloud computing instance 340a, 340b, 340n. The cloud-based object storage 348 that is attached to the particular cloud computing instance 340a, 340b, 340n may be embodied, for example, as Amazon Simple Storage Service ('S3'). In other embodiments, the cloud computing instances 320, 322 that each include a storage controller application 324, 326 may initiate the storage of the data in the local storage 330, 334, 338 of the cloud computing instances 340a, 340b, 340n and the cloud-based object storage 348. In other embodiments, rather than using both the cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338 (also referred to herein as 'virtual drives') and the cloud-based object storage 348 to store data, a persistent storage layer may be implemented in other ways. For example, one or more Azure Ultra Disks may be used to persistently store data (e.g., after the data has been written to the NVRAM layer).
While the local storage 330, 334, 338 resources and block storage 342, 344, 346 resources utilized by the cloud computing instances 340a, 340b, 340n may support block-level access, the cloud-based object storage 348 attached to a particular cloud computing instance 340a, 340b, 340n only supports object-based access. Thus, software daemons 328, 332, 336 may be configured to obtain blocks of data, package those blocks into objects, and write the objects to cloud-based object storage 348 attached to particular cloud computing instances 340a, 340b, 340 n.
Consider an example in which data is written in 1MB blocks to local storage 330, 334, 338 resources and block storage 342, 344, 346 resources utilized by cloud computing instances 340a, 340b, 340 n. In this example, assume that a user of cloud-based storage system 318 issues a request to write data, which results in the need to write 5MB of data after the data is compressed and de-duplicated by storage controller applications 324, 326. In this example, writing data to the local storage 330, 334, 338 and block storage 342, 344, 346 resources utilized by the cloud computing instances 340a, 340b, 340n is relatively straightforward, as 5 blocks of size 1MB are written to the local storage 330, 334, 338 and block storage 342, 344, 346 resources utilized by the cloud computing instances 340a, 340b, 340 n. In this example, software daemons 328, 332, 336 may also be configured to create five objects that contain different 1MB blocks of data. As such, in some embodiments, each object written to cloud-based object store 348 may be the same (or nearly the same) size. The reader will appreciate that in this example, metadata associated with the data itself may be included in each object (e.g., the first 1MB of the object is data and the remainder is metadata associated with the data). Readers will appreciate that cloud-based object store 348 may be incorporated into cloud-based storage system 318 to increase the persistence of cloud-based storage system 318.
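As a purely illustrative sketch of the packaging step described in this example, the following Python code splits written data into 1 MB blocks and wraps each block, together with metadata describing it, into an object; the key format and metadata fields are assumptions made for the example.

import json
import uuid

BLOCK_SIZE = 1024 * 1024  # 1 MB blocks, as in the example above


def package_blocks(data: bytes, volume_id: str):
    objects = []
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        metadata = json.dumps({
            "volume": volume_id,
            "byte_offset": offset,
            "length": len(block),
        }).encode()
        objects.append({
            "key": f"{volume_id}/{uuid.uuid4()}",
            "body": block + metadata,  # data first, metadata appended after it
        })
    return objects


objs = package_blocks(b"\x00" * (5 * BLOCK_SIZE), "vol-123")
print(len(objs))  # -> 5 objects, one per 1 MB block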
In some embodiments, all data stored by cloud-based storage system 318 may be stored in both: 1) Cloud-based object storage 348, and 2) at least one of local storage 330, 334, 338 resources or block storage 342, 344, 346 resources utilized by cloud computing instances 340a, 340b, 340 n. In such embodiments, the local storage 330, 334, 338 resources and block storage 342, 344, 346 resources utilized by the cloud computing instances 340a, 340b, 340n may effectively operate as a cache that typically includes all of the data also stored in S3, such that all reads of the data may be serviced by the cloud computing instances 340a, 340b, 340n without the cloud computing instances 340a, 340b, 340n accessing the cloud-based object storage 348. However, readers will appreciate that in other embodiments, all data stored by the cloud-based storage system 318 may be stored in the cloud-based object storage 348, but less than all data stored by the cloud-based storage system 318 may be stored in at least one of the local storage 330, 334, 338 resources or the block storage 342, 344, 346 resources utilized by the cloud computing instances 340a, 340b, 340 n. In this example, various policies may be utilized to determine which subset of data stored by cloud-based storage system 318 should reside in both: 1) Cloud-based object storage 348, and 2) at least one of local storage 330, 334, 338 resources or block storage 342, 344, 346 resources utilized by cloud computing instances 340a, 340b, 340 n.
One or more modules of computer program instructions executing within cloud-based storage system 318 (e.g., a monitoring module executing on its own EC2 instance) may be designed to handle failure of one or more of cloud computing instances 340a, 340b, 340n with local storage 330, 334, 338. In this example, the monitoring module may handle failure of one or more of the cloud computing instances 340a, 340b, 340n with the local storage 330, 334, 338 by creating one or more new cloud computing instances with the local storage, retrieving data stored on the failed cloud computing instance 340a, 340b, 340n from the cloud-based object storage 348, and storing the data retrieved from the cloud-based object storage 348 in the local storage on the newly created cloud computing instance. The reader will appreciate that many variations of this process may be implemented.
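As a purely illustrative sketch of that recovery flow, the following Python example replaces the cloud provider's APIs with plain functions: when a virtual-drive instance fails, a replacement is launched and its local storage is repopulated from the object store. All names are hypothetical.

cloud_object_store = {"vol-1/obj-a": b"...", "vol-1/obj-b": b"..."}


def launch_instance_with_local_storage(name):
    return {"name": name, "local_storage": {}}  # stand-in for a new cloud instance


def recover_failed_instance(failed_name, owned_keys):
    replacement = launch_instance_with_local_storage(failed_name + "-replacement")
    for key in owned_keys:
        # Rehydrate the replacement's local storage from the object store.
        replacement["local_storage"][key] = cloud_object_store[key]
    return replacement


new_instance = recover_failed_instance("drive-instance-3", ["vol-1/obj-a", "vol-1/obj-b"])
print(new_instance["name"], len(new_instance["local_storage"]))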
The reader will appreciate that various performance aspects of the cloud-based storage system 318 may be monitored (e.g., by a monitoring module executing in an EC2 instance) such that the cloud-based storage system 318 may be expanded longitudinally or laterally (scaled up or scaled out) as needed. In such an example, if the cloud computing instances 320, 322 that are used to support the execution of the storage controller applications 324, 326 are undersized and are not sufficiently servicing the I/O requests issued by users of the cloud-based storage system 318, the monitoring module may create a new, more powerful cloud computing instance (e.g., a cloud computing instance of a type that includes more processing power, more memory, etc.) that includes the storage controller application, such that the new, more powerful cloud computing instance may begin operating as the primary controller. Likewise, if the monitoring module determines that the cloud computing instances 320, 322 that are used to support the execution of the storage controller applications 324, 326 are oversized and that cost savings may be gained by switching to a smaller, less powerful cloud computing instance, the monitoring module may create a new, less powerful (and less expensive) cloud computing instance that contains the storage controller application, such that the new, less powerful cloud computing instance may begin operating as the primary controller.
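As a purely illustrative sketch of the sizing decision such a monitoring module might make, the following Python example steps a controller instance up or down an instance-type ladder; the instance types, utilization thresholds, and queue-depth thresholds are assumptions made for the example.

INSTANCE_LADDER = ["small", "medium", "large", "xlarge"]


def pick_instance_type(current, cpu_utilization, queue_depth):
    index = INSTANCE_LADDER.index(current)
    if (cpu_utilization > 0.85 or queue_depth > 64) and index < len(INSTANCE_LADDER) - 1:
        return INSTANCE_LADDER[index + 1]  # scale up the controller instance
    if cpu_utilization < 0.20 and queue_depth < 4 and index > 0:
        return INSTANCE_LADDER[index - 1]  # scale down to save cost
    return current


print(pick_instance_type("medium", cpu_utilization=0.92, queue_depth=80))  # -> large
print(pick_instance_type("large", cpu_utilization=0.10, queue_depth=1))    # -> medium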
The storage systems described above may carry out intelligent data backup techniques through which data stored in the storage system may be copied and stored in a distinct location to avoid data loss in the event of equipment failure or some other form of catastrophe. For example, the storage systems described above may be configured to examine each backup to avoid restoring the storage system to an undesirable state. Consider an example in which malware infects the storage system. In such an example, the storage system may include software resources 314 that can scan each backup to identify backups that were captured before the malware infected the storage system and those that were captured after the malware infected the storage system. In such an example, the storage system may restore itself from a backup that does not include the malware, or at least not restore the portions of a backup that contained the malware. In such an example, the storage system may include software resources 314 that can scan each backup to identify the presence of malware (or a virus, or some other undesirable software), for example, by identifying write operations that were serviced by the storage system and originated from a network subnet that is suspected to have delivered the malware, by identifying write operations that were serviced by the storage system and originated from a user that is suspected to have delivered the malware, by identifying write operations that were serviced by the storage system and examining the content of the write operation against fingerprints of the malware, and in many other ways.
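As a simplified illustration of scanning retained backups against malware fingerprints, the following Python sketch marks a backup as clean only if none of its blocks match a known fingerprint, so that a restore can be taken from the most recent clean backup; the backups and fingerprints are illustrative.

import hashlib

MALWARE_FINGERPRINTS = {hashlib.sha256(b"malicious payload").hexdigest()}


def backup_is_clean(backup_blocks):
    return all(
        hashlib.sha256(block).hexdigest() not in MALWARE_FINGERPRINTS
        for block in backup_blocks
    )


backups = [  # ordered oldest to newest
    ("monday", [b"ordinary data"]),
    ("tuesday", [b"ordinary data", b"malicious payload"]),
]
clean = [name for name, blocks in backups if backup_is_clean(blocks)]
print("restore candidate:", clean[-1])  # most recent clean backup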
Readers will further appreciate that backups (often in the form of one or more snapshots) may also be utilized to perform a rapid recovery of the storage system. Consider an example in which the storage system is infected with ransomware that locks users out of the storage system. In such an example, software resources 314 within the storage system may be configured to detect the presence of the ransomware and may be further configured to restore the storage system to a point in time prior to the point in time at which the ransomware infected the storage system, using a retained backup. In such an example, the presence of the ransomware may be explicitly detected through the use of software tools utilized by the system, through the use of a key (e.g., a USB drive) that is inserted into the storage system, or in a similar way. Likewise, the presence of the ransomware may be inferred in response to system activity meeting a predetermined fingerprint, such as no reads or writes coming into the system for a predetermined period of time.
Readers will appreciate that the various components described above may be grouped into one or more optimized computing packages as a converged infrastructure. Such a converged infrastructure may include pools of computers, storage, and networking resources that may be shared by multiple applications and managed in a collective manner using policy-driven processes. Such a converged infrastructure may be implemented with a converged infrastructure reference architecture, with standalone appliances, with a software-driven hyper-converged approach (e.g., a hyper-converged infrastructure), or in other ways.
Readers will appreciate that the storage systems described in this disclosure may be useful for supporting various types of software applications. In fact, the storage systems may be 'application aware' in the sense that the storage systems may obtain, maintain, or otherwise have access to information describing connected applications (e.g., applications that utilize the storage systems) to optimize the operation of the storage system based on intelligence about the applications and their utilization patterns. For example, the storage system may optimize data layouts, optimize caching behaviors, optimize 'QoS' levels, or perform some other optimization designed to improve the storage performance that is experienced by the application.
As an example of one type of application that may be supported by the storage systems described herein, the storage system 306 may be useful in supporting artificial intelligence ('AI') applications, database applications, XOps projects (e.g., DevOps projects, DataOps projects, MLOps projects, ModelOps projects, PlatformOps projects), electronic design automation tools, event-driven software applications, high-performance computing applications, simulation applications, high-speed data capture and analysis applications, machine learning applications, media production applications, media serving applications, picture archiving and communication systems ('PACS') applications, software development applications, virtual reality applications, augmented reality applications, and many other types of applications by providing storage resources to such applications.
In view of the fact that the storage system includes computing resources, storage resources, and a wide variety of other resources, the storage system may be well suited to support resource-intensive applications, such as AI applications. AI applications can be deployed in a variety of fields, including: predictive maintenance in manufacturing and related fields, healthcare applications such as patient data and risk analysis, retail and marketing deployments (e.g., search advertisements, social media advertisements), supply chain solutions, financial technology solutions such as business analysis and reporting tools, operational deployments such as real-time analysis tools, application performance management tools, IT infrastructure management tools, and many other fields.
Such AI applications may enable devices to perceive their environment and take actions that maximize their chance of success at some goal. Examples of such AI applications may include IBM Watson™, Microsoft Oxford™, Google DeepMind™, Baidu Minwa™, and others.
The storage system described above may also be well suited to support other types of resource-intensive applications, such as machine learning applications. The machine learning application may perform various types of data analysis to automate analytical model construction. Using an algorithm that iteratively learns from the data, the machine learning application may cause the computer to learn without explicit programming. One particular area of machine learning is known as reinforcement learning, which involves taking appropriate action to maximize rewards under certain circumstances.
In addition to the resources already described, the storage system described above may also contain a graphics processing unit ('GPU'), sometimes referred to as a visual processing unit ('VPU'). Such GPUs may be embodied as specialized electronic circuits that quickly manipulate and alter memory to speed up the creation of images in a frame buffer intended for output to a display device. Such GPUs may be included in any of the computing devices that are part of the storage system described above, including as one of many individually scalable components of the storage system, wherein other examples of individually scalable components of such storage system may include storage components, memory components, computing components (e.g., CPU, FPGA, ASIC), network components, software components, and other components. In addition to GPUs, the storage systems described above may also include a neural network processor ('NNP') for use in various aspects of neural network processing. Such NNPs may be used in place of (or in addition to) GPUs, and may also scale independently.
As described above, the storage systems described herein may be configured to support artificial intelligence applications, machine learning applications, big data analytics applications, and many other types of applications. The rapid growth of these sorts of applications is being driven by three technologies: deep learning (DL), GPU processors, and big data. Deep learning is a computing model that makes use of massively parallel neural networks inspired by the human brain. Instead of experts handcrafting software, a deep learning model writes its own software by learning from large numbers of examples. Such GPUs may include thousands of cores that are well suited to running algorithms that loosely represent the parallel nature of the human brain.
Advances in deep neural networks, including the development of multi-layer neural networks, have ignited a new wave of algorithms and tools for data scientists to tap their data with artificial intelligence (AI). With improved algorithms, larger data sets, and various frameworks (including open-source software libraries for machine learning across a range of tasks), data scientists are tackling new use cases such as autonomous driving vehicles, natural language processing and understanding, computer vision, machine reasoning, strong AI, and many others. Applications of such techniques may include: machine and vehicular object detection, identification, and avoidance; visual recognition, classification, and tagging; algorithmic financial trading strategy performance management; simultaneous localization and mapping; predictive maintenance of high-value machinery; prevention of cybersecurity threats and expertise automation; image recognition and classification; question answering; robotics; text analytics (extraction and classification) as well as text generation and translation; and many others. Applications of AI techniques have materialized in a wide array of products, for example, Amazon Echo's speech recognition technology that allows users to talk to their machines, Google Translate™ which allows for machine-based language translation, Spotify's Discover Weekly which provides recommendations on new songs and artists that a user may like based on the user's usage and traffic analysis, Quill's text generation offering that takes structured data and turns it into narrative stories, chatbots that provide real-time, contextually specific answers to questions in a dialog format, and many others.
Data is the heart of modern AI and deep learning algorithms. Before training can begin, one problem that must be addressed revolves around collecting the labeled data that is crucial for training an accurate AI model. A full-scale AI deployment may be required to continuously collect, clean, transform, label, and store large amounts of data. Adding additional high-quality data points translates directly into more accurate models and better insights. Data samples may undergo a series of processing steps including, but not limited to: 1) ingesting the data from an external source into the training system and storing the data in raw form, 2) cleaning and transforming the data into a format convenient for training, including linking data samples to the appropriate label, 3) exploring parameters and models, quickly testing with a smaller dataset, and iterating to converge on the most promising models to push into the production cluster, 4) executing training phases to select random batches of input data, including both new and older samples, and feeding those into production GPU servers for computation to update model parameters, and 5) evaluating, including using a held-out portion of the data that was not used in training in order to evaluate model accuracy on the held-out data. This lifecycle may apply to any type of parallelized machine learning, not just neural networks or deep learning. For example, standard machine learning frameworks may rely on CPUs instead of GPUs, but the data ingest and training workflows may be the same. Readers will appreciate that a single shared storage data hub creates a coordination point throughout the lifecycle without the need for extra data copies among the ingest, preprocessing, and training stages. The ingested data is rarely used for one purpose only, and shared storage gives the flexibility to train multiple different models or apply traditional analytics to the data.
Readers will appreciate that each stage in the AI data pipeline may have varying requirements from the data hub (e.g., the storage system or collection of storage systems). A laterally expanding (scale-out) storage system must deliver uncompromised performance for all manner of access types and patterns (small, metadata-heavy to large files, random to sequential access patterns, and low to high concurrency). The storage systems described above may serve as an ideal AI data hub because the systems may service unstructured workloads. In the first stage, data is ideally ingested and stored onto the same data hub that later stages will use, in order to avoid excess data copying. The next two steps can be done on a standard compute server that optionally includes a GPU, and then in the fourth and last stage, full training production jobs are run on powerful GPU-accelerated servers. Often, a production pipeline exists alongside an experimental pipeline operating on the same data set. Furthermore, the GPU-accelerated servers can be used independently for different models or joined together to train one larger model, even spanning multiple systems in a distributed fashion. If the shared storage tier is slow, then data must be copied to local storage for each phase, resulting in wasted time staging data onto different servers. The ideal data hub for the AI training pipeline delivers performance similar to data stored locally on the server node while also having the simplicity and performance to enable all pipeline stages to operate concurrently.
In order for the storage systems described above to serve as a data hub or as part of an AI deployment, in some embodiments the storage systems may be configured to provide DMA between storage devices that are included in the storage systems and one or more GPUs that are used in an AI or big data analytics pipeline. The one or more GPUs may be coupled to the storage system, for example, via NVMe over fabrics ('NVMe-oF') such that bottlenecks such as the host CPU can be bypassed and the storage system (or one of the components contained therein) can directly access GPU memory. In such an example, the storage systems may leverage API hooks to the GPUs to transfer data directly to the GPUs. For example, the GPUs may be embodied as NVIDIA GPUs and the storage systems may support GPUDirect Storage ('GDS') software, or have similar proprietary software, that enables the storage system to transfer data to the GPUs via RDMA or similar mechanisms.
While the preceding paragraphs discuss deep learning applications, readers will appreciate that the storage systems described herein may also be part of a distributed deep learning ('DDL') platform to support the execution of DDL algorithms. The storage systems described above may also be paired with other technologies, such as TensorFlow, an open-source software library for dataflow programming across a range of tasks that may be used for machine learning applications such as neural networks, to facilitate the development of such machine learning models, applications, and so on.
The storage systems described above may also be used in a neuromorphic computing environment. Neuromorphic computing is a form of computing that mimics brain cells. To support neuromorphic computing, an architecture of interconnected 'neurons' replaces traditional computing models with low-powered signals that travel directly between neurons for more efficient computation. Neuromorphic computing may make use of very-large-scale integration (VLSI) systems containing electronic analog circuits to mimic neuro-biological architectures present in the nervous system, as well as analog, digital, mixed-mode analog/digital VLSI, and software systems that implement models of neural systems for perception, motor control, or multisensory integration.
Readers will appreciate that the storage systems described above may be configured to support the storage or use of (among other types of data) blockchains and derivative items, such as open-source blockchains and related tools that are part of the IBM™ Hyperledger project, permissioned blockchains in which a certain number of trusted parties are allowed to access the blockchain, blockchain products that enable developers to build their own distributed ledger projects, and others. Blockchains and the storage systems described herein may be leveraged to support on-chain storage of data as well as off-chain storage of data.
Off-chain storage of data can be implemented in a variety of ways and can occur when the data itself is not stored within the blockchain. For example, in one embodiment, a hash function may be utilized and the data itself may be fed into the hash function to generate a hash value. In such an example, the hashes of large pieces of data may be embedded within transactions, instead of the data itself. Readers will appreciate that, in other embodiments, alternatives to blockchains may be used to facilitate the decentralized storage of information. For example, one alternative to a blockchain that may be used is a blockweave. While a conventional blockchain stores every transaction to achieve validation, a blockweave permits secure decentralization without the usage of the entire chain, thereby enabling low-cost on-chain storage of data. Such blockweaves may utilize a consensus mechanism that is based on proof of access (PoA) and proof of work (PoW).
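As a minimal illustration of off-chain storage as described above, the following Python sketch keeps the data itself in a simulated storage system and embeds only its hash in a simulated ledger transaction, against which the data can later be verified.

import hashlib

ledger = []            # stand-in for a blockchain
off_chain_store = {}   # stand-in for the storage system


def store_off_chain(data: bytes):
    digest = hashlib.sha256(data).hexdigest()
    off_chain_store[digest] = data
    ledger.append({"tx": len(ledger), "data_hash": digest})  # only the hash goes on-chain
    return digest


def verify(digest):
    data = off_chain_store[digest]
    recorded = any(tx["data_hash"] == digest for tx in ledger)
    return recorded and hashlib.sha256(data).hexdigest() == digest


d = store_off_chain(b"large document that should not live on-chain")
print(verify(d))  # -> True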
The storage systems described above may be used alone or in combination with other computing devices to support in-memory computing applications. In-memory computing involves storing information in RAM distributed across clusters of computers. Readers will appreciate that the storage systems described above, particularly those systems that are configurable with customizable amounts of processing resources, storage resources, and memory resources (e.g., those systems in which the blade contains a configurable amount of each type of resource), may be configured in a manner that provides an infrastructure that can support in-memory computing. Likewise, the storage system described above may include component parts (e.g., NVDIMMs, 3D cross-point storage that provides persistent fast random access memory) that may actually provide an improved in-memory computing environment as compared to in-memory computing environments that rely on RAM distributed across dedicated servers.
In some embodiments, the storage systems described above may be configured to operate as a hybrid in-memory computing environment that includes a universal interface to all storage media (e.g., RAM, flash storage, 3D cross-point storage). In such embodiments, users may have no knowledge regarding the details of where their data is stored, but they can still use the same full, unified API to address data. In such embodiments, the storage system may (in the background) move data to the fastest layer available, including intelligently placing the data in dependence upon various characteristics of the data or upon some other heuristic. In such an example, the storage systems may even make use of existing products such as Apache Ignite and GridGain to move data between the various storage layers, or the storage systems may make use of custom software to move data between the various storage layers. The storage systems described herein may implement various optimizations to improve the performance of in-memory computing, such as having computations occur as close to the data as possible.
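As a purely illustrative sketch of such background movement, the following Python example promotes frequently read items toward the fastest tier while data remains addressable through one uniform placement map; the tier ordering and promotion rule are assumed heuristics.

TIERS = ["ram", "3d-crosspoint", "flash"]  # fastest to slowest in this sketch

placement = {"key-a": "3d-crosspoint", "key-b": "flash"}
read_counts = {"key-a": 500, "key-b": 2}


def rebalance():
    for key, tier in placement.items():
        index = TIERS.index(tier)
        if read_counts.get(key, 0) > 100 and index > 0:
            placement[key] = TIERS[index - 1]  # promote hot data one tier up


rebalance()
print(placement)  # key-a promoted toward RAM, key-b left in place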
The reader will further appreciate that in some embodiments, the storage systems described above may be paired with other resources to support the applications described above. For example, one infrastructure could include primary compute in the form of servers and workstations that specialize in using general-purpose computing on graphics processing units ('GPGPU') to accelerate deep learning applications, interconnected into a computation engine to train parameters for deep neural networks. Each system may have Ethernet external connectivity, InfiniBand external connectivity, some other form of external connectivity, or some combination thereof. In such an example, the GPUs can be grouped for a single large training job or used independently to train multiple models. The infrastructure could also include a storage system, such as those described above, to provide, for example, a scale-out all-flash file or object store through which data can be accessed via high-performance protocols such as NFS, S3, and so on. The infrastructure can also include redundant top-of-rack Ethernet switches connected to storage and compute via ports in MLAG port channels for redundancy. The infrastructure could also include additional compute in the form of whitebox servers, optionally with GPUs, for data ingestion, preprocessing, and model debugging. Readers will appreciate that additional infrastructures are also possible.
The reader will appreciate that the storage systems described above, either alone or in coordination with other computing machinery, may be configured to support other AI-related tools. For example, the storage systems may make use of tools like ONNX or other open neural network exchange formats that make it easier to transfer models written in different AI frameworks. Likewise, the storage systems may be configured to support tools like Gluon that allow developers to prototype, build, and train deep learning models. In fact, the storage systems described above may be part of a larger platform, such as IBM™ Cloud Private for Data, that includes integrated data science, data engineering, and application building services.
Readers will further appreciate that the storage systems described above may also be deployed as an edge solution. Such an edge solution may be in place to optimize cloud computing systems by performing data processing at the edge of the network, near the source of the data. Edge computing can push applications, data, and computing power (i.e., services) away from centralized points to the logical extremes of a network. Through the use of edge solutions such as the storage systems described above, computational tasks may be performed using the compute resources provided by such storage systems, data may be stored using the storage resources of the storage system, and cloud-based services may be accessed through the use of various resources of the storage system (including networking resources). By performing computational tasks on the edge solution, storing data on the edge solution, and generally making use of the edge solution, the consumption of expensive cloud-based resources may be avoided and, in fact, performance improvements may be experienced relative to a heavier reliance on cloud-based resources.
While many tasks may benefit from the utilization of an edge solution, some particular uses may be especially suited for deployment in such an environment. For example, devices like drones, autonomous cars, robots, and others may require extremely rapid processing, so fast, in fact, that sending data up to a cloud environment and back to receive data processing support may simply be too slow. As an additional example, some IoT devices, such as connected video cameras, may not be well-suited for the utilization of cloud-based resources, as it may be impractical (not only from a privacy perspective, a security perspective, or a financial perspective) to send the data to the cloud simply because of the sheer volume of data involved. As such, many tasks that really do involve data processing, storage, or communication may be better suited for platforms that include edge solutions such as the storage systems described above.
The storage systems described above may, alone or in combination with other computing resources, serve as a network edge platform that combines compute resources, storage resources, networking resources, cloud technologies, network virtualization technologies, and so on. As part of the network, the edge may take on characteristics similar to other network facilities, from the customer premises and backhaul aggregation facilities to points of presence (PoPs) and regional data centers. Readers will appreciate that network workloads, such as virtual network functions (VNFs) and others, will reside on the network edge platform. Enabled by a combination of containers and virtual machines, the network edge platform may rely on controllers and schedulers that are no longer geographically co-located with the data processing resources. The functions, as microservices, may be split into control planes, user and data planes, or even state machines, allowing independent optimization and scaling techniques to be applied. Such user and data planes may be enabled through accelerators, both those residing in server platforms (e.g., FPGAs and SmartNICs) and those implemented through SDN-capable merchant silicon and programmable ASICs.
The storage systems described above may also be optimized for use in big data analytics, including being leveraged as part of a composable data analytics pipeline where containerized analytics architectures, for example, make analytics capabilities more composable. Big data analytics may be generally described as the process of examining large and varied data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful information that can help organizations make more informed business decisions. As part of that process, semi-structured and unstructured data (e.g., internet clickstream data, web server logs, social media content, text from customer emails and survey responses, mobile phone call detail records, IoT sensor data, and other data) may be converted to a structured form.
The storage systems described above may also support (including implementing as a system interface) applications that perform tasks in response to human speech. For example, the storage systems may support the execution of intelligent personal assistant applications such as Alexa™, Apple Siri™, Google Voice™, Samsung Bixby™, Microsoft Cortana™, and others. While the examples described in the previous sentence make use of voice as input, the storage systems described above may also support chatbots, talkbots, chatterbots, artificial conversational entities, or other applications that are configured to conduct a conversation via auditory or textual methods. Likewise, the storage system may actually execute such an application to enable a user, such as a system administrator, to interact with the storage system via speech. While such applications may serve as interfaces to various system management operations in embodiments in accordance with the present disclosure, such applications are generally capable of voice interaction, music playback, making to-do lists, setting alarms, streaming podcasts, playing audiobooks, and providing weather, traffic, and other real-time information such as news.
The storage systems described above may also implement an AI platform for delivering on the vision of self-driving storage. Such AI platforms may be configured to deliver global predictive intelligence by collecting and analyzing large amounts of storage system telemetry data points to enable effortless management, analytics, and support. In fact, such storage systems may be capable of predicting both capacity and performance, as well as generating intelligent advice on workload deployment, interaction, and optimization. Such AI platforms may be configured to scan all incoming storage system telemetry data against a library of issue fingerprints to predict and resolve incidents in real time, before they impact customer environments, and to capture hundreds of performance-related variables used to forecast performance load.
The storage system described above may support the serialized or concurrent execution of artificial intelligence applications, machine learning applications, data analytics applications, data transformations, and other tasks that together may form an AI ladder. Such an AI ladder may effectively be formed by combining such elements to form a complete data science pipeline, where dependencies exist between elements of the AI ladder. For example, AI may require some form of machine learning, machine learning may require some form of analytics, analytics may require some form of data and information architecture, and so on. As such, each element may be viewed as a rung in the AI ladder that collectively can form a complete and sophisticated AI solution.
The storage system described above may also be used, alone or in combination with other computing environments, to deliver an AI-everywhere experience, where AI permeates wide and expansive aspects of business and life. For example, AI may play an important role in delivering deep learning solutions, deep reinforcement learning solutions, artificial general intelligence solutions, self-driving vehicles, cognitive computing solutions, commercial UAVs or drones, conversational user interfaces, enterprise taxonomies, ontology management solutions, machine learning solutions, smart dust, smart robots, smart workplaces, and many others.
The storage systems described above may also be used, alone or in combination with other computing environments, to provide a wide range of transparent immersive experiences, including those using digital twinning of various "things" such as people, places, processes, systems, and the like, in which technology can introduce transparency between people, businesses, and things. Such transparent immersive experience can be provided as augmented reality technology, connected home, virtual reality technology, brain-to-machine interface, human body augmentation technology, nanotube electronics, volumetric display, 4D printing technology, or other technology.
The storage systems described above may also be used, alone or in combination with other computing environments, to support a wide variety of digital platforms. For example, such digital platforms may include 5G wireless systems and platforms, digital twin platforms, edge computing platforms, IoT platforms, quantum computing platforms, serverless PaaS, software-defined security, neuromorphic computing platforms, and the like.
The storage system described above may also be part of a multi-cloud environment, in which multiple cloud computing and storage services are deployed in a single heterogeneous architecture. To facilitate the operation of such a multi-cloud environment, DevOps tools may be deployed to enable orchestration across the clouds. Likewise, continuous development and continuous integration tools may be deployed to standardize processes around continuous integration and delivery, new feature rollout, and provisioning of cloud workloads. By standardizing these processes, a multi-cloud strategy may be implemented that enables the best provider to be utilized for each workload.
The storage system described above may be used as part of a platform that enables the use of crypto-anchors, which may be used to authenticate a product's origin and contents to ensure that it matches a blockchain record associated with the product. Similarly, the storage systems described above may implement various encryption techniques and schemes, including lattice cryptography, as part of a suite of tools to protect data stored on the storage systems. Lattice cryptography may involve the construction of cryptographic primitives that involve lattices, either in the construction itself or in the security proof. Unlike public-key schemes such as RSA, Diffie-Hellman, or elliptic-curve cryptography, which are vulnerable to attack by a quantum computer, some lattice-based constructions appear to be resistant to attack by both classical and quantum computers.
Quantum computers are devices that perform quantum computing. Quantum computing uses quantum-mechanical phenomena, such as superposition and entanglement, to perform computation. Quantum computers differ from transistor-based traditional computers in that such traditional computers require that data be encoded into binary digits (bits), each of which is always in one of two definite states (0 or 1). In contrast to traditional computers, quantum computers use quantum bits, which can be in superpositions of states. A quantum computer maintains a sequence of qubits, where a single qubit may represent a 1, a 0, or any quantum superposition of those two qubit states. A pair of qubits can be in any quantum superposition of 4 states, and three qubits can be in any superposition of 8 states. A quantum computer with n qubits can generally be in an arbitrary superposition of up to 2^n different states simultaneously, whereas a traditional computer can be in only one of these states at any one time. A quantum Turing machine is a theoretical model of such a computer.
The storage system described above may also be paired with FPGA-accelerated servers as part of a larger AI or ML infrastructure. Such FPGA-accelerated servers may reside near the storage systems described above (e.g., in the same data center) or may even be incorporated into an appliance that includes one or more storage systems, one or more FPGA-accelerated servers, networking infrastructure that supports communication between the one or more storage systems and the one or more FPGA-accelerated servers, as well as other hardware and software components. Alternatively, the FPGA-accelerated servers may reside within a cloud computing environment that may be used to perform compute-related tasks for AI and ML jobs. Any of the embodiments described above may be used collectively to serve as an FPGA-based AI or ML platform. Readers will appreciate that, in some embodiments of an FPGA-based AI or ML platform, the FPGAs contained within the FPGA-accelerated servers may be reconfigured for different types of ML models (e.g., LSTMs, CNNs, GRUs). The ability to reconfigure the FPGAs contained within the FPGA-accelerated servers may enable acceleration of an ML or AI application based on the most optimal numerical precision and memory model being used. Readers will appreciate that by treating a collection of FPGA-accelerated servers as a pool of FPGAs, any CPU in the data center may utilize the pool of FPGAs as a shared hardware microservice, rather than limiting a server to dedicated accelerators plugged into it.
The FPGA acceleration server and GPU acceleration server described above may implement a computational model in which the machine learning model and parameters are fixed into high bandwidth on-chip memory (where large amounts of data flow through the high bandwidth on-chip memory) rather than keeping small amounts of data in the CPU and running long instruction streams on it as in more traditional computational models. For this computational model, the FPGA may even be more efficient than the GPU, as the FPGA may be programmed with only the instructions needed to run such computational model.
The storage system described above may be configured to provide parallel storage, such as by using a parallel file system, such as BeeGFS. Such parallel file systems may include a distributed metadata architecture. For example, a parallel file system may include multiple metadata servers across which metadata is distributed, as well as components including services for clients and storage servers.
The systems described above may support execution of a wide variety of software applications. Such software applications may be deployed in a variety of ways, including container-based deployment models. A variety of tools may be used to manage the containerized application. For example, the containerized application may be managed using Docker Swarm, kubernetes, and others. The containerized application may be used to facilitate a serverless, cloud-native computing deployment and management model for the software application. To support server-less, cloud-native computing deployment and management models of software applications, a container may be used as part of an event handling mechanism (e.g., AWS Lambdas) such that various events may cause the containerized application to launch to operate as an event handler.
The system described above may be deployed in a variety of ways, including in ways that support fifth-generation ('5G') networks. 5G networks may support substantially faster data communications than previous generations of mobile communications networks and, as a consequence, may lead to the disaggregation of data and computing resources, as modern massive data centers may become less prominent and may be replaced, for example, by more local, miniature data centers that are close to mobile network towers. The systems described above may be included in such local, miniature data centers and may be part of, or paired with, multi-access edge computing ('MEC') systems. Such MEC systems may enable cloud computing capabilities and an IT service environment at the edge of the cellular network. By running applications and performing related processing tasks closer to the cellular customer, network congestion may be reduced and applications may perform better.
The storage system described above may also be configured to implement NVMe zoned namespaces. Through the use of NVMe zoned namespaces, the logical address space of a namespace is divided into zones. Each zone provides a logical block address range that must be written sequentially and explicitly reset before being overwritten, thereby enabling the creation of namespaces that expose the natural boundaries of the device and offload management of internal mapping tables to the host. To implement NVMe zoned namespaces ('ZNS'), a ZNS SSD or some other form of zoned block device that exposes a namespace logical address space using zones may be utilized. With the zones aligned to the internal physical properties of the device, several inefficiencies in the placement of data can be eliminated. In such embodiments, each zone may be mapped, for example, to a separate application, such that functions like wear leveling and garbage collection can be performed on a per-zone or per-application basis rather than across the entire device. To support ZNS, the storage controllers described herein may be configured to interact with zoned block devices using, for example, the Linux kernel zoned block device interface or other tools.
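For illustration only, the following Python sketch models the zone write discipline described above: writes within a zone must land at the zone's write pointer, and a zone must be explicitly reset before its logical block addresses can be rewritten. The class name, states, and fields are assumptions made for this sketch and are not drawn from the NVMe specification or any particular device.

    # Minimal sketch of a zone's write-pointer discipline (illustrative only).
    class Zone:
        def __init__(self, start_lba: int, capacity_blocks: int):
            self.start_lba = start_lba
            self.capacity = capacity_blocks
            self.write_pointer = 0          # next block offset that may be written
            self.state = "empty"            # empty -> open -> full; reset -> empty

        def append(self, num_blocks: int) -> int:
            """Writes must land at the write pointer; returns the LBA written."""
            if self.write_pointer + num_blocks > self.capacity:
                raise ValueError("write exceeds zone capacity")
            lba = self.start_lba + self.write_pointer
            self.write_pointer += num_blocks
            self.state = "full" if self.write_pointer == self.capacity else "open"
            return lba

        def reset(self) -> None:
            """A zone must be explicitly reset before its LBAs can be rewritten."""
            self.write_pointer = 0
            self.state = "empty"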
The storage systems described above may also be configured to implement partitioned storage in other ways, such as through the use of shingled magnetic recording (SMR) storage devices. In examples where partitioned storage is used, device-managed embodiments may be deployed in which the storage devices hide this complexity by managing it in firmware, presenting an interface like any other storage device. Alternatively, partitioned storage may be implemented via host-managed embodiments that depend on the operating system knowing how to handle the drive and only writing sequentially to certain regions of the drive. Partitioned storage may similarly be implemented using host-aware embodiments in which a combination of drive-managed and host-managed implementations is deployed.
The storage systems described herein may be used to form a data lake. A data lake may operate as the first place that an organization's data flows to, where such data may be in a raw format. Metadata tagging may be implemented to facilitate searches of data elements in the data lake, especially in embodiments where the data lake contains multiple stores of data in formats that are not otherwise readily accessible or readable (e.g., unstructured data, semi-structured data, structured data). From the data lake, data may move downstream to a data warehouse, where the data may be stored in a more processed, packaged, and consumable format. The storage systems described above may also be used to implement such a data warehouse. In addition, a data mart or data hub may allow for data that is even more easily consumed, where the storage systems described above may also be used to provide the underlying storage resources necessary for a data mart or data hub. In embodiments, queries against the data lake may require a schema-on-read approach, where a schema is applied to the data when it is pulled out of a stored location, rather than when the data enters the stored location.
The storage systems described herein may also be configured to implement a recovery point objective ('RPO'), which may be established by a user, established by an administrator, established as a system default, established as part of a storage class or service that the storage system is participating in delivering, or in some other way. A 'recovery point objective' is a goal for the maximum time difference between the last update to a source data set and the last recoverable replicated data set update that would be correctly recoverable, should there be a reason to do so, from a continuously or frequently updated copy of the source data set. An update is correctly recoverable if it properly takes into account all updates that were processed on the source data set prior to the last recoverable replicated data set update.
In synchronous replication, the RPO will be zero, meaning that under normal operation, all completed updates on the source data set should exist and can be correctly restored on the duplicate data set. In best effort near synchronous replication, the RPO may be as low as a few seconds. In snapshot-based replication, the RPO may be roughly calculated as the time interval between snapshots plus the time to transfer modifications between a previously transferred snapshot and the most recent snapshot to be replicated.
An RPO can be missed if updates accumulate faster than they can be replicated. For snapshot-based replication, an RPO can be missed if more data to be replicated accumulates between two snapshots than can be replicated between taking a snapshot and replicating that snapshot's accumulated updates to the copy. Again in snapshot-based replication, if the data to be replicated accumulates faster than it can be transferred in the time between subsequent snapshots, then replication can start to fall further behind, which can extend the gap between the intended recovery point objective and the actual recovery point represented by the last correctly replicated update.
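For illustration only, the following Python sketch works through the rough arithmetic described above for snapshot-based replication: the achievable RPO is approximately the snapshot interval plus the time needed to transfer the changes accumulated since the previous snapshot, and the target is missed outright if changes accumulate faster than they can be transferred. The function names and the simple bandwidth model are assumptions made for this sketch.

    # Rough snapshot-based RPO arithmetic (illustrative only).
    def estimated_rpo_seconds(snapshot_interval_s: float,
                              changed_bytes_per_interval: float,
                              replication_bandwidth_bps: float) -> float:
        transfer_time_s = changed_bytes_per_interval / replication_bandwidth_bps
        return snapshot_interval_s + transfer_time_s

    def rpo_missed(target_rpo_s: float, snapshot_interval_s: float,
                   changed_bytes_per_interval: float,
                   replication_bandwidth_bps: float) -> bool:
        # If changes accumulate faster than they can be transferred between
        # snapshots, replication falls progressively further behind.
        transfer_time_s = changed_bytes_per_interval / replication_bandwidth_bps
        falling_behind = transfer_time_s > snapshot_interval_s
        return falling_behind or estimated_rpo_seconds(
            snapshot_interval_s, changed_bytes_per_interval,
            replication_bandwidth_bps) > target_rpo_s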
The storage system described above may also be part of a shared-nothing storage cluster. In a shared-nothing storage cluster, each node of the cluster has local storage and communicates with other nodes in the cluster over a network, with the storage used by the cluster being (typically) provided only by storage connected to each individual node. The set of nodes that synchronously replicate the data set may be one example of a shared-nothing storage cluster in that each storage system has local storage and communicates with other storage systems over a network, where those storage systems do not (typically) use storage from elsewhere with which they share access over some interconnect. In contrast, some of the storage systems described above are themselves built as shared storage clusters, as there are drive shelves shared by the paired controllers. However, other storage systems described above are built as shared-nothing storage clusters, because all storage devices are local to a particular node (e.g., blade), and all communication is through a network linking computing nodes together.
In other embodiments, other forms of shared-nothing storage clusters may include embodiments in which any node in the cluster has a local copy of all of its needed storage, as well as embodiments in which data is mirrored to other nodes in the cluster by way of synchronous replication to ensure that the data is not lost or because the other nodes are also using the storage. In this embodiment, if a new cluster node needs some data, the data may be copied to the new node from other nodes that have copies of the data.
In some embodiments, a shared-nothing storage cluster based on mirrored copies may store multiple copies of all of the cluster's stored data, with each subset of the data replicated to a particular set of nodes and different subsets of data replicated to different sets of nodes. In some variations, embodiments may store all of the cluster's stored data in all nodes, whereas in other variations the nodes may be divided up such that a first set of nodes will all store the same set of data and a second, different set of nodes will all store a different set of data.
Readers will appreciate that RAFT-based databases (e.g., etcd) may operate like shared-nothing storage clusters in which all RAFT nodes store all data. The amount of data stored in a RAFT cluster, however, may be limited so that the extra copies do not consume too much storage. A container server cluster might also be able to replicate all data to all cluster nodes, provided the containers do not get too large and their bulk data (the data manipulated by the applications that run in the containers) is stored elsewhere, such as in an S3 cluster or an external file server. In such an example, the container storage may be provided by the cluster directly through its shared-nothing storage model, with those containers providing the images that form the execution environment for parts of an application or service.
For further explanation, fig. 3D illustrates an exemplary computing device 350 that may be specifically configured to perform one or more of the processes described herein. As shown in fig. 3D, computing device 350 may include a communication interface 352, a processor 354, a storage 356, and an input/output ("I/O") module 358 communicatively connected to each other via a communication infrastructure 360. Although the exemplary computing device 350 is shown in fig. 3D, the components illustrated in fig. 3D are not intended to be limiting. Additional or alternative components may be used in other embodiments. The components of the computing device 350 shown in fig. 3D will now be described in additional detail.
The communication interface 352 may be configured to communicate with one or more computing devices. Examples of communication interface 352 include, but are not limited to, a wired network interface (e.g., a network interface card), a wireless network interface (e.g., a wireless network interface card), a modem, an audio/video connection, and any other suitable interface.
Processor 354 generally represents any type or form of processing unit capable of processing data and/or interpreting, executing, and/or directing the execution of one or more of the instructions, processes, and/or operations described herein. The processor 354 may perform operations by executing computer-executable instructions 362 (e.g., applications, software, code, and/or other executable data instances) stored in the storage 356.
Storage 356 may include one or more data storage media, devices, or configurations, and may take any type, form, and combination of data storage media and/or devices. For example, storage 356 may include, but is not limited to, any combination of non-volatile media and/or volatile media described herein. Electronic data, including data described herein, may be temporarily and/or permanently stored in the storage 356. For example, data representing computer-executable instructions 362 configured to direct processor 354 to perform any of the operations described herein may be stored within storage 356. In some examples, the data may be arranged in one or more databases residing within the storage 356.
The I/O module 358 may include one or more I/O modules configured to receive user input and provide user output. The I/O module 358 may include any hardware, firmware, software, or combination thereof that supports input and output capabilities. For example, the I/O module 358 may include hardware and/or software for capturing user input, including but not limited to a keyboard or keypad, a touch screen component (e.g., a touch screen display), a receiver (e.g., an RF or infrared receiver), a motion sensor, and/or one or more input buttons.
The I/O module 358 may include one or more devices for presenting output to a user, including but not limited to a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., a display driver), one or more audio speakers, and one or more audio drivers. In a particular embodiment, the I/O module 358 is configured to provide graphical data to a display for presentation to a user. The graphical data may represent one or more graphical user interfaces and/or any other graphical content that may serve as a particular implementation. In some examples, any of the systems, computing devices, and/or other components described herein may be implemented by computing device 350.
For further explanation, FIG. 3E illustrates an example of a storage system cluster 376 for providing storage services (also referred to herein as 'data services'). The storage system cluster 376 depicted in FIG. 3E includes a plurality of storage systems 374a, 374b, 374c, 374d, 374n, each of which may be similar to the storage systems described herein. The storage systems 374a, 374b, 374c, 374d, 374n in the storage system cluster 376 may be embodied as identical storage systems or as different types of storage systems. For example, the two storage systems 374a, 374n depicted in FIG. 3E are depicted as cloud-based storage systems, as the resources that collectively form each of the storage systems 374a, 374n are provided by distinct cloud service providers 370, 372. For example, the first cloud service provider 370 may be Amazon AWS™ and the second cloud service provider 372 may be Microsoft Azure™, although in other embodiments one or more public clouds, private clouds, or combinations thereof may be used to provide the underlying resources that form a particular storage system in the storage system cluster 376.
According to some embodiments of the present disclosure, the example depicted in fig. 3E includes an edge management service 382 for delivering storage services. The storage services (also referred to herein as 'data services') that are delivered may include, for example, services that provide a certain amount of storage to the consumer, services that provide storage to the consumer according to a predetermined service level agreement, services that provide storage to the consumer according to predetermined regulatory requirements, and many other services.
The edge management service 382 depicted in fig. 3E may be embodied as one or more modules of computer program instructions, for example, executing on computer hardware (e.g., one or more computer processors). Alternatively, the edge management service 382 may be embodied as one or more modules of computer program instructions executing on a virtualized execution environment, such as one or more virtual machines, in one or more containers, or in some other manner. In other embodiments, the edge management service 382 may be embodied as a combination of the above-described embodiments, including embodiments in which one or more modules of computer program instructions contained in the edge management service 382 are distributed across multiple physical or virtual execution environments.
The edge management service 382 may act as a gateway for providing storage services to storage consumers, where the storage services utilize storage provided by one or more storage systems 374a, 374b, 374c, 374d, 374n. For example, the edge management service 382 may be configured to provide storage services to host devices 378a, 378b, 378c, 378d, 378n that are executing one or more applications that consume the storage services. In this example, the edge management service 382 may operate as a gateway between the host devices 378a, 378b, 378c, 378d, 378n and the storage systems 374a, 374b, 374c, 374d, 374n, rather than requiring the host devices 378a, 378b, 378c, 378d, 378n to directly access the storage systems 374a, 374b, 374c, 374d, 374n.
The edge management service 382 of FIG. 3E exposes the storage service module 380 to the host devices 378a, 378b, 378c, 378d, 378n of FIG. 3E, although in other embodiments the edge management service 382 may expose the storage service module 380 to other consumers of the various storage services. The various storage services may be presented to consumers via one or more user interfaces, via one or more APIs, or through some other mechanism provided by the storage service module 380. As such, the storage service module 380 depicted in FIG. 3E may be embodied as one or more modules of computer program instructions executing on physical hardware, on a virtualized execution environment, or a combination thereof, where execution of such modules enables consumers of storage services to be provided with, to select, and to access the various storage services.
The edge management service 382 of fig. 3E also includes a system management service module 384. The system management service module 384 of fig. 3E includes one or more computer program instruction modules that, when executed, perform various operations in conjunction with the storage systems 374a, 374b, 374c, 374d, 374n to provide storage services to the host devices 378a, 378b, 378c, 378d, 378 n. The system management service module 384 may be configured to, for example, perform tasks, such as providing storage resources from the storage systems 374a, 374b, 374c, 374d, 374n via one or more APIs disclosed by the storage systems 374a, 374b, 374c, 374d, 374n, migrating data sets or workloads among the storage systems 374a, 374b, 374c, 374d, 374n via one or more APIs disclosed by the storage systems 374a, 374b, 374d, 374n, setting one or more tunable parameters (i.e., one or more configurable settings) on the storage systems 374a, 374b, 374c, 374d, 374n via one or more APIs disclosed by the storage systems 374a, 374b, 374d, 374n, and so forth. For example, many of the services described below relate to embodiments in which the storage systems 374a, 374b, 374c, 374d, 374n are configured to operate in some manner. In such instances, the system management service module 384 may be responsible for configuring the storage systems 374a, 374b, 374c, 374d, 374n to operate in the manner described below using the APIs (or some other mechanism) provided by the storage systems 374a, 374b, 374c, 374d, 374 n.
In addition to configuring the storage systems 374a, 374b, 374c, 374d, 374n, the edge management service 382 itself may be configured to perform various tasks required to provide the various storage services. Consider an example in which a storage service includes a service that, when selected and applied, causes personally identifiable information ('PII') contained in a data set to be obfuscated when the data set is accessed. In such an example, the storage systems 374a, 374b, 374c, 374d, 374n may be configured to obfuscate PII when servicing read requests directed to the data set. Alternatively, the storage systems 374a, 374b, 374c, 374d, 374n may service reads by returning data that includes the PII, but the edge management service 382 itself may obfuscate the PII as the data passes through the edge management service 382 on its way from the storage systems 374a, 374b, 374c, 374d, 374n to the host devices 378a, 378b, 378c, 378d, 378n.
The storage systems 374a, 374b, 374c, 374D, 374n depicted in fig. 3E may be embodied as one or more of the storage systems described above with reference to fig. 1A-3D, including variations thereof. In fact, the storage systems 374a, 374b, 374c, 374d, 374n may act as a pool of storage resources, wherein individual components in the pool have different performance characteristics, different storage characteristics, and the like. For example, one storage system 374a may be a cloud-based storage system, another storage system 374b may be a storage system that provides block storage, another storage system 374c may be a storage system that provides file storage, another storage system 374d may be a relatively high-performance storage system, another storage system 374n may be a relatively low-performance storage system, and so on. In alternative embodiments, only a single storage system may be present.
The storage systems 374a, 374b, 374c, 374d, 374n depicted in fig. 3E may also be organized into different failure domains, such that the failure of one storage system 374a should be completely independent of the failure of another storage system 374 b. For example, each of the storage systems may receive power from a separate power system, each of the storage systems may be coupled for data communication via a separate data communication network, and so on. Furthermore, the storage systems in the first failure domain may be accessed via a first gateway, while the storage systems in the second failure domain may be accessed via a second gateway. For example, the first gateway may be a first instance of the edge management service 382 and the second gateway may be a second instance of the edge management service 382, including embodiments in which each instance is different or each instance is part of the distributed edge management service 382.
As an illustrative example of available storage services, storage services associated with different levels of data protection may be presented to a user. For example, storage services may be presented to the user that, when selected and enforced, guarantee that data associated with that user will be protected such that various recovery point objectives ('RPOs') can be guaranteed. A first available storage service may ensure, for example, that some data set associated with the user will be protected such that any data more than 5 seconds old can be recovered in the event of a failure of the primary data store, while a second available storage service may ensure that the data set associated with the user will be protected such that any data more than 5 minutes old can be recovered in the event of a failure of the primary data store.
Additional examples of storage services that may be presented to a user, selected by a user, and ultimately applied to a data set associated with the user include one or more data compliance services. Such data compliance services may be embodied, for example, as services provided to consumers (i.e., users) of the data compliance services to ensure that the user's data sets are managed in a way that adheres to various regulatory requirements. For example, one or more data compliance services may be offered to a user to ensure that the user's data sets are managed in a way that adheres to the General Data Protection Regulation ('GDPR'), one or more data compliance services may be offered to a user to ensure that the user's data sets are managed in a way that adheres to the Sarbanes-Oxley Act of 2002 ('SOX'), or one or more data compliance services may be offered to a user to ensure that the user's data sets are managed in a way that adheres to some other regulatory requirement. In addition, one or more data compliance services may be offered to a user to ensure that the user's data sets are managed in a way that adheres to some non-governmental guidance (e.g., adheres to best practices for auditing purposes), one or more data compliance services may be offered to a user to ensure that the user's data sets are managed in a way that adheres to a particular client's or organization's requirements, and so on.
Consider an example in which a particular data compliance service is designed to ensure that a user's data sets are managed in a way that adheres to the requirements set forth in the GDPR. While a listing of all of the requirements of the GDPR can be found in the regulation itself, for purposes of illustration, one example requirement set forth in the GDPR is that pseudonymization processes must be applied to stored data in order to transform personal data in such a way that the resulting data cannot be attributed to a specific data subject without the use of additional information. For example, data encryption techniques may be applied to render the original data unintelligible, and such data encryption techniques cannot be reversed without access to the correct decryption key. As such, the GDPR may require that the decryption key be kept separately from the pseudonymized data. A particular data compliance service may be offered to ensure adherence to the requirements set forth in this paragraph.
To provide this particular data compliance service, the data compliance service may be presented to and selected by the user (e.g., via a GUI). In response to receiving a selection of a particular data compliance service, one or more storage service policies may be applied to a data set associated with the user to perform the particular data compliance service. For example, a storage service policy may be applied that requires that the data set be encrypted before being stored in the storage system, before being stored in the cloud environment, or before being stored elsewhere. To implement this policy, not only may a requirement be implemented that the data set be encrypted at the time of storage, but a requirement may be implemented that the data set be encrypted prior to transmission (e.g., the data set is sent to another party). In this example, a storage service policy may also be enforced that requires that any encryption keys used to encrypt the data set not be stored on the same system that stores the data set itself. The reader will appreciate that many other forms of data compliance services may be provided and implemented in accordance with embodiments of the present disclosure.
The storage systems 374a, 374b, 374c, 374d, 374n in the storage system cluster 376 may be commonly managed by, for example, one or more cluster management modules. The cluster management module may be part of or separate from the system management service module 384 depicted in fig. 3E. The cluster management module may perform tasks such as monitoring the health of each storage system in the cluster, initiating an update or upgrade of one or more storage systems in the cluster, migrating a workload for load balancing or other performance purposes, and many others. As such, and for many other reasons, the storage systems 374a, 374b, 374c, 374d, 374n may be coupled to one another via one or more data communication links in order to exchange data between the storage systems 374a, 374b, 374c, 374d, 374 n.
The storage systems described herein may support various forms of data replication. For example, two or more storage systems may copy data sets synchronously with each other. In synchronous replication, different copies of a particular data set may be maintained by multiple storage systems, but all accesses (e.g., reads) to the data set should produce consistent results, regardless of which storage system the access is to. For example, a read for any of the storage systems that synchronously replicates the data set should return the same result. Thus, while updates to versions of the data set need not occur at exactly the same time, precautions must be taken to ensure consistent access to the data set. For example, if an update (e.g., a write) to a data set is received by a first storage system, the update is confirmed to be complete only when all storage systems that synchronously copy the data set have applied the update to their data set copies. In this example, synchronous replication may be performed using I/O forwarding (e.g., writes received at a first storage system are forwarded to a second storage system), communication between storage systems (e.g., each storage system indicates that it has completed an update), or other means.
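For illustration only, the following Python sketch captures the acknowledgment rule described above for synchronous replication: a write is forwarded to every storage system that replicates the data set and is confirmed to the host only after all of them have applied it. The Replica interface and function names are assumptions made for this sketch rather than any particular system's API.

    # Acknowledge a write only after every replicating system applies it (sketch).
    from concurrent.futures import ThreadPoolExecutor

    class Replica:
        def apply_write(self, offset: int, data: bytes) -> bool:
            raise NotImplementedError   # e.g., forward the I/O over the network

    def synchronous_write(replicas: list[Replica], offset: int, data: bytes) -> bool:
        # Forward the write to every replica and wait for all of them.
        with ThreadPoolExecutor() as pool:
            results = list(pool.map(lambda r: r.apply_write(offset, data), replicas))
        # Confirm completion to the host only if every copy applied the update.
        return all(results)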
In other embodiments, the data set may be replicated by using checkpoints. In checkpoint-based replication (also referred to as 'near synchronous replication'), a set of updates to a data set (e.g., one or more write operations to the data set) may occur between different checkpoints such that the data set is updated to a particular checkpoint only if all updates to the data set have been completed before the particular checkpoint. Consider an example in which a first storage system stores a real-time copy of a data set being accessed by a user of the data set. In this example, assume that a data set is copied from a first storage system to a second storage system using checkpoint-based copying. For example, the first storage system may send a first checkpoint to the second storage system (at time t=0), followed by a first set of updates to the data set, followed by a second checkpoint (at time t=1), followed by a second set of updates to the data set, followed by a third checkpoint (at time t=2). In this example, if the second storage system has performed all of the updates in the first set of updates, but has not performed all of the updates in the second set of updates, then the copy of the data set stored on the second storage system may be up to date until the second checkpoint. Alternatively, if the second storage system has performed all of the first set of updates and the second set of updates, then the copy of the data set stored on the second storage system may be up to date until the third checkpoint. Readers will appreciate that various types of checkpoints (e.g., metadata only checkpoints) may be used, which may be expanded based on various factors (e.g., time, number of operations, RPO settings), and so forth.
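For illustration only, the following Python sketch models the checkpoint rule described above: the replication target advances its recoverable image only when a checkpoint arrives, at which point every update sent before that checkpoint has been applied. The message handling shown here is an assumption made for this sketch, not a defined protocol.

    # Checkpoint-based (near synchronous) replication target (illustrative only).
    class CheckpointTarget:
        def __init__(self):
            self.pending = []               # updates received since the last checkpoint
            self.applied = {}               # recoverable copy of the data set
            self.recovery_point = None      # id of the last completed checkpoint

        def receive_update(self, offset, data):
            self.pending.append((offset, data))

        def receive_checkpoint(self, checkpoint_id):
            # All updates sent before this checkpoint are now complete, so the
            # copy becomes consistent as of this checkpoint.
            for offset, data in self.pending:
                self.applied[offset] = data
            self.pending.clear()
            self.recovery_point = checkpoint_id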
In other embodiments, the data set may be replicated by snapshot-based replication (also referred to as 'asynchronous replication'). In snapshot-based replication, a snapshot of a data set may be sent from a replication source, such as a first storage system, to a replication target, such as a second storage system. In this embodiment, each snapshot may contain the entire data set or a subset of the data set, e.g., only the portion of the data set that has changed since the last snapshot was sent from the replication source to the replication target. The reader will appreciate that the snapshots may be sent on demand, or in some other manner, based on a policy that takes into account various factors (e.g., time, number of operations, RPO settings).
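For illustration only, the following Python sketch shows the incremental transfer described above for snapshot-based replication, in which only the portions of the data set that changed since the last replicated snapshot are sent to the target. Modeling snapshots as block-to-data mappings is an assumption made for this sketch.

    # Ship only blocks changed since the last replicated snapshot (sketch).
    def snapshot_delta(previous: dict[int, bytes],
                       current: dict[int, bytes]) -> dict[int, bytes]:
        """Return the blocks that must be sent to the replication target."""
        return {block: data for block, data in current.items()
                if previous.get(block) != data}

    def apply_delta(target_copy: dict[int, bytes], delta: dict[int, bytes]) -> None:
        # After applying, the target is consistent as of the new snapshot.
        target_copy.update(delta)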
The storage systems described above, alone or in combination, may be configured to serve as a continuous data protection store. A continuous data protection store is a feature of a storage system that records updates to a data set in such a way that consistent images of prior contents of the data set can be accessed with a low time granularity (often on the order of seconds, or even less), stretching back for a reasonable period of time (often hours or days). This allows access to very recent consistent points in time for the data set, and also allows access to points in time that might have just preceded an event that, for example, caused parts of the data set to be corrupted or otherwise lost, while retaining close to the maximum number of updates that preceded that event. Conceptually, it is like a sequence of snapshots of the data set taken very frequently and kept for a long period of time, although continuous data protection stores are often implemented quite differently from snapshots. A storage system implementing a data set's continuous data protection store may further provide a means of accessing these points in time, accessing one or more of these points in time as snapshots or as cloned copies, or reverting the data set back to one of those recorded points in time.
Over time, to reduce overhead, some points in time held in a continuous data protection store can be merged with other nearby points in time, essentially deleting some of those points in time from the store. This can reduce the capacity needed to store updates. A limited number of these points in time may also be converted into longer-duration snapshots. For example, such a store might keep a low-granularity sequence of points in time stretching back a few hours from the present, with some points in time merged or deleted to reduce overhead for up to an additional day. Stretching back further into the past, some of these points in time could be converted into snapshots representing consistent point-in-time images retained only every few hours.
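For illustration only, the following Python sketch shows one way such a pruning policy could look: recent recovery points are kept at fine granularity, older points are thinned out, and the oldest are retained only at snapshot-like spacing. The specific tier boundaries and retention intervals are assumptions made for this sketch and are not taken from any described implementation.

    # Thin out recovery points as they age (illustrative policy only).
    from datetime import datetime, timedelta

    def prune_recovery_points(points: list[datetime], now: datetime) -> list[datetime]:
        kept, last_kept = [], None
        for point in sorted(points, reverse=True):            # newest first
            age = now - point
            if age <= timedelta(hours=6):
                spacing = timedelta(seconds=30)                # fine granularity
            elif age <= timedelta(days=1):
                spacing = timedelta(minutes=15)                # merged points
            else:
                spacing = timedelta(hours=4)                   # snapshot-like spacing
            if last_kept is None or last_kept - point >= spacing:
                kept.append(point)
                last_kept = point
        return kept                                            # others are merged/deleted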
Although some embodiments are described primarily in the context of a storage system, readers of skill in the art will recognize that embodiments of the present disclosure may also take the form of a computer program product disposed on computer-readable storage media for use with any suitable processing system. Such computer-readable storage media may be any storage media for machine-readable information, including magnetic media, optical media, solid-state media, or other suitable media. Examples of such media include magnetic disks in hard disk drives or floppy disks, optical disks for optical drives, magnetic tape, and other media as will occur to those of skill in the art. Those skilled in the art will also recognize that while some of the embodiments described in this specification are oriented to software installed and executing on computer hardware, nevertheless, alternative embodiments implemented as firmware or as hardware are well within the scope of the present disclosure.
In some examples, a non-transitory computer-readable medium storing computer-readable instructions may be provided in accordance with the principles described herein. The instructions, when executed by a processor of a computing device, may direct the processor and/or the computing device to perform one or more operations including one or more of the operations described herein. Such instructions may be stored and/or transmitted using any of a variety of known computer-readable media.
The non-transitory computer-readable media referred to herein may comprise any non-transitory storage medium that participates in providing data (e.g., instructions) that may be read and/or executed by a computing device (e.g., by a processor of a computing device). For example, a non-transitory computer-readable medium may include, but is not limited to, any combination of non-volatile storage media and/or volatile storage media. Exemplary non-volatile storage media include, but are not limited to, read-only memory, flash memory, solid state drives, magnetic storage devices (e.g., hard disks, floppy disks, tape, etc.), ferroelectric random access memory ("RAM"), and optical disks (e.g., compact disks, digital video disks, blu-ray disks, etc.). Exemplary volatile storage media include, but are not limited to, RAM (e.g., dynamic RAM).
The advantages and features of the present disclosure may be further described by the following statements:
1. a method, the method comprising: maintaining, by a processing device of a storage controller, a list of a plurality of available zones across respective Solid State Drive (SSD) storage portions of a plurality of partitioned storage devices of a storage system; receiving data from a plurality of sources, wherein the data is associated with processing a data set, the data set comprising a plurality of volumes and associated metadata; determining a plurality of slices of the data such that each of the plurality of slices is writable in parallel with the remaining slices of the plurality of slices; mapping the plurality of tiles to subsets of the plurality of available zones, respectively; and writing the plurality of tiles to the subset of the plurality of available zones in parallel.
2. The method of statement 1, further comprising: an erasure code corresponding to the data is written to the subset of the plurality of available bands.
3. The method of any of statements 1-2, wherein the erasure code comprises a first portion and a second portion, and wherein writing the erasure code further comprises: the first portion is written to a first zone of the subset of the plurality of available zones and the second portion is written to a second zone of the subset of the plurality of available zones in parallel with the first portion.
4. The method of any of statements 1-3, wherein the available zones of the plurality of partitioned storage devices are classified as being in an open state.
5. The method of any of statements 1-4, wherein the available zones of the plurality of partitioned storage devices are classified as being in an empty state.
6. The method of any of statements 1-5, wherein the plurality of partitioned storage devices further comprises a non-volatile random access memory (NVRAM) portion for recording the data, wherein the SSD portion and the NVRAM portion are individually addressable, and wherein the NVRAM portion is smaller than the SSD portion.
7. The method of any of statements 1-6, wherein the NVRAM portion comprises a Random Access Memory (RAM) device, an energy storage device, and a processing device.
Fig. 4 illustrates an example system 400 for block merging, according to some embodiments. In one embodiment, system 400 may include storage drives 171A-171E. The storage drives 171A-171E are operably coupled to one or more storage controllers. In one embodiment, storage drives 171A-171E include direct-mapped solid state drive (SSD) storage portions. In other embodiments, storage drives 171A-171E also include a fast-write portion to record data to be written to the direct-mapped SSD portion. In one embodiment, the fast-write portion is an NVRAM portion. The direct-mapped SSD portion and the NVRAM portion may be individually addressable. The NVRAM portion may be smaller than the direct-mapped SSD portion. The storage drives 171A-171E may be organized into write groups (e.g., write group 401). A write group may RAID-protect data and write the data in segments (e.g., SEGIO 407) composed of allocation units (e.g., allocation units 404A, 404E) located on a subset of the storage drives 171A-171E within the write group.
In one embodiment, an SSD presents a disk LBA space of numbered sectors, typically 512 bytes to 8192 bytes in size, to the storage controller. The controller may manage the SSD's LBA space in logically contiguous chunks of LBAs called allocation units (AUs), typically between 1 megabyte and 64 megabytes in size. The storage controller may align AUs with the SSD's internal storage organization to optimize performance and minimize media wear.
In one embodiment, the memory controller may allocate physical memory in segments (e.g., SEGIO 407). A segment may be made up of several AUs, each AU located on a different SSD in the same write group. An AU in a segment may be located on any AU boundary in the LBA space of its SSD. The storage controller may determine the size of a segment when allocating the segment based on (a) whether the segment will contain data or metadata and (b) the number of SSDs selected to be written to the group. In one embodiment, the data segments may consist of 8 to 16 AUs and may vary segment by segment based on redundancy or performance requirements or available storage and storage layout within the storage system.
In one embodiment, to allocate a segment, the storage controller selects a write group and the SSDs in that write group that will contribute AUs. The write group and SSD selection may be quasi-random, but for long-term I/O and media wear balancing, the selection may be biased slightly toward newer SSDs, less heavily used SSD regions, and/or SSDs containing more free AUs, or, for segments expected to remain intact for a longer period of time, the selection may be biased toward older SSDs or older SSD regions with less expected lifespan remaining. To ensure that SSDs are available to allow recovery and rebuild from failed devices, at allocation time, segments may be limited to one or two fewer AUs than the number of SSDs in the write group.
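For illustration only, the following Python sketch shows a quasi-random, biased selection of the kind described above: drives with more free allocation units (and, depending on the expected lifetime of the segment, newer or older drives) score higher, and one or two drives are left out of each segment so that a rebuild remains possible. The scoring weights and SSD attributes are assumptions made for this sketch.

    # Biased, quasi-random SSD selection for a segment (illustrative only).
    import random

    def choose_ssds_for_segment(ssds: list[dict], requested_width: int,
                                long_lived_segment: bool) -> list[dict]:
        def score(ssd):
            s = ssd["free_aus"]                                   # favor more free AUs
            if long_lived_segment:
                s += ssd["age_years"]                             # favor older media
            else:
                s += max(0, 5 - ssd["age_years"])                 # favor newer media
            return s + random.random()                            # quasi-random tie-break
        ranked = sorted(ssds, key=score, reverse=True)
        width = min(requested_width, len(ssds) - 2)               # keep spares for rebuild
        return ranked[:width]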
In one embodiment, the storage controller may designate AUs in a segment as, for example, a column of 1 megabyte slices (e.g., slices 402A-402E). Corresponding slices in each of the segmented AUs may be collectively referred to as SEGIO (e.g., SEGIO 407). The SEGIO 407 may be a unit in which the storage controller packages data or metadata before writing the data or metadata to the flash, and the storage controller calculates a RAID-3D checksum over the SEGIO. In one embodiment, the RAID checksums (erasure codes) may be rotated (e.g., occupy different tile locations in each of the segmented SEGIOs).
In one embodiment, the storage controller divides the slices into columns of logical pages aligned with SSD flash pages, which are the units of internal SSD writes and ECC-protected data. In one embodiment, 4 kilobyte logical pages may be used. In other embodiments, smaller or larger logical pages may be utilized. The storage controller may compute a separate RAID-3D checksum across each set of corresponding logical pages in a SEGIO. Advantageously, when an uncorrectable read error occurs, the storage controller may reconstruct the unreadable logical page from the erasure codes stored elsewhere in the SEGIO, either from elsewhere within the same slice to cover a single-page read error, or from slices on other devices.
In one embodiment, to optimize read latency, the array packs data and metadata into a slice's buffered logical pages in an order that minimizes the splitting of larger blocks. The packing order may not affect write performance, because the array may write entire slices. The storage controller may compute and store page checksums (e.g., P 406 and Q 408) based on the data written to the storage drives. A checksum may be initialized with the corresponding page offset within the slice and a monotonically increasing segment number. In one embodiment, when the storage controller reads a logical page, it recomputes the logical page's checksum and compares the result with the stored value to verify both the content and the source of the page.
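For illustration only, the following Python sketch shows a page checksum seeded with the page offset within the slice and the segment number, so that a stale or misplaced page fails verification on read. The use of CRC32 and the field widths are assumptions made for this sketch rather than the checksum of any particular controller.

    # Seed the page checksum with its location so misplaced pages fail to verify.
    import zlib

    def page_checksum(page: bytes, page_offset: int, segment_number: int) -> int:
        seed = zlib.crc32(page_offset.to_bytes(8, "little") +
                          segment_number.to_bytes(8, "little"))
        return zlib.crc32(page, seed)

    def verify_page(page: bytes, page_offset: int, segment_number: int,
                    stored_checksum: int) -> bool:
        # Recompute on read and compare with the stored value to confirm both
        # the content and the source (location) of the page.
        return page_checksum(page, page_offset, segment_number) == stored_checksum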
In one embodiment, the storage controller organizes data into content blocks (cblocks) for storage on flash. The storage controller may divide bulk writes into cblocks based on size and alignment; in one embodiment, a cblock may contain up to 64 kilobytes. In one embodiment, the storage controller may reduce (compress) data before storing it by: eliminating sectors whose entire contents consist of repeated zeroes or other simple patterns (repeated single bytes or words, well-known patterns such as database free blocks); deduplicating by eliminating sector sequences identical to already-stored data; and/or compressing data that is neither a simple pattern nor a copy of stored data.
The storage controller may pack the reduced cblocks into logical page buffers in the approximate order of arrival or of writing to backing storage, effectively creating a log of the data written by hosts or by the storage system. The storage controller may pack cblocks densely, each beginning where the previous cblock ended. Advantageously, this may mean that no space is wasted by "rounding up" to 4 kilobyte, 8 kilobyte, 16 kilobyte, or other boundaries, or even to 512 byte block boundaries. Dense packing uses the media more efficiently than alternatives that map LBAs directly to physical storage or that locate data based on its content. Moreover, it helps extend media life by reducing the total accumulated data written to flash.
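For illustration only, the following Python sketch shows dense cblock packing: each reduced cblock begins exactly where the previous one ended, with padding added only at the tail of the buffer so that whole logical pages can be written. The buffer layout returned here is an assumption made for this sketch.

    # Densely pack cblocks into a logical page buffer (illustrative only).
    def pack_cblocks(cblocks: list[bytes], logical_page_size: int = 4096):
        buffer = bytearray()
        layout = []                              # (offset, length) per cblock, log-style
        for cblock in cblocks:
            layout.append((len(buffer), len(cblock)))
            buffer.extend(cblock)                # no rounding up between cblocks
        # Pad only the tail so whole logical pages can be written to flash.
        tail = (-len(buffer)) % logical_page_size
        buffer.extend(b"\x00" * tail)
        return bytes(buffer), layout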
Fig. 5 illustrates an example system 500 for block merging, according to some embodiments. In one implementation, block merge module 248 of FIG. 2 may implement the operations of system 500, including writing, erasing, and replacing cblocks and erase blocks.
In one implementation, the system 500 includes compressed data 504. The compressed data may be received 502 from many different sources and may include any type of data. In one implementation, the compressed data 504 may include write data, unmap data, write-same data, file system data, object data, and other types of data. In one embodiment, operations are received from various sources and may include writes of write data, unmap requests, write-same requests, virtual block range copy requests, snapshot requests, and other operations. These operations, combined with internal array operations, may produce data and metadata that is compressed in accordance with the description above. The received data is to be stored on one or more storage drives (e.g., storage drive 171). In another embodiment, the data may not be compressed.
In one embodiment, an index table may be generated for the data to be stored, and the data may be compressed (or further compressed, in embodiments where the data is already compressed). In another embodiment, the data may remain uncompressed. The index table may point to the data (e.g., using index digests to index into the table). In response to a write being performed, the corresponding index table may be updated to include the write. When the index table is generated, processing logic may write the received data in subsets (e.g., cblocks) 506, 508, 510. A subset may be 3D-RAID protected such that the subset is recoverable while the next subset is being written. To increase throughput, more than one subset may be written in parallel with other subsets. Furthermore, more than one allocation unit may be written in parallel with additional allocation units.
In one embodiment, the subset (e.g., cblock) is part of the allocation units 512, 514. As described above, the allocation unit may be a software structure within the storage drive. In one embodiment, in response to an allocation unit being populated with subsets, the allocation unit is written to a corresponding storage drive. In another embodiment, data may be written to portions of multiple allocation units in parallel without filling a first allocation unit before writing to another allocation unit. In one example, a certain number of storage drives may be partitioned into a certain number of allocation units that map to an erase block. The erase block may be mapped directly onto the storage device address space. In an embodiment, a single erase block is an allocation unit. In another implementation, the allocation unit may include a plurality of erase blocks, or the erase block may include a plurality of allocation units.
In one implementation, processing logic maintains a list of available allocation units across a number of flash devices 516, 518. Allocation units may be classified as available, in use, or unavailable. In one embodiment, a subset may not be individually overwritable. In another embodiment, a subset may be erased, marked as available, and overwritten as part of a garbage collection operation. As part of the same operation, a subsequent allocation unit (e.g., 514) may be written to a different storage drive 518 in parallel. In one embodiment, a controller (e.g., controller 110) monitors free allocation units across multiple drives. Further, as described above, partially filled allocation units may be written. This may be particularly beneficial when allocation units are large; performing writes more frequently may be more efficient than waiting to fill a large allocation unit, while it may be more efficient for erase blocks (as an example) to be larger, or for garbage collection to track and manage larger rather than smaller allocation units.
Fig. 6 illustrates a flow diagram for block merging, according to some embodiments. The method 600 may be performed by processing logic that comprises hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc.), software (e.g., instructions run on a processing device to perform hardware simulation), or a combination thereof.
Referring to fig. 6, at block 602, processing logic maintains a list of available allocation units across a plurality of flash devices of a flash storage system (e.g., storage drive 171 of storage system 100). In one implementation, the flash devices map erase blocks to directly addressable regions of memory without drive controller translation, or with relatively minimal translation, or translate the entire writable area of an erase block to contiguous addresses, where the translation may or may not be fixed for the lifetime of the storage device. The flash storage system may classify erase blocks as available, in use, or unavailable. In one implementation, at least a portion of an erase block may be designated as an allocation unit. In another embodiment, multiple erase blocks may be designated as an allocation unit. In yet another implementation, a single erase block may be designated as multiple allocation units. At block 604, processing logic receives data from a plurality of sources. In one embodiment, the data may be associated with processing a data set, and the data set may include a plurality of file systems and associated metadata. In some implementations, storage devices, or even portions of a storage device, may have erase blocks of different sizes. This may result in allocation units being composed of one number of erase blocks in some cases and another number of erase blocks in other cases, even within the same storage system.
At block 606, processing logic determines a plurality of subsets of data such that each subset can be written in parallel with the remaining subsets. At block 608, processing logic maps each of the plurality of subsets to an available allocation unit, and at block 610, processing logic writes the plurality of subsets to the plurality of available allocation units in parallel. As described above, in one embodiment, multiple subsets may be written serially in situations where processing logic fills a first allocation unit before writing to a next allocation unit.
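A minimal sketch of blocks 602-610 follows, assuming a hypothetical drive call (write_au) and modeling parallel writes with a thread pool; it is illustrative, not the claimed implementation.

# Sketch of the flow in blocks 606-610 (split, map, write in parallel).
from concurrent.futures import ThreadPoolExecutor


def split_into_subsets(data: bytes, subset_size: int):
    """Block 606: split incoming data so each piece can be written independently."""
    return [data[i:i + subset_size] for i in range(0, len(data), subset_size)]


def write_au(drive_id: int, au_id: int, payload: bytes) -> int:
    # Placeholder for an actual drive write; returns bytes "written".
    return len(payload)


def merge_and_write(data: bytes, available_aus, subset_size=1 << 20):
    subsets = split_into_subsets(data, subset_size)            # block 606
    mapping = list(zip(subsets, available_aus))                # block 608
    with ThreadPoolExecutor() as pool:                         # block 610
        futures = [pool.submit(write_au, drive, au, s)
                   for s, (drive, au) in mapping]
        return sum(f.result() for f in futures)


# Usage: two free allocation units on drives 0 and 1.
total = merge_and_write(b"x" * (2 << 20), [(0, 17), (1, 42)])
print(total)  # 2097152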
FIG. 7 illustrates an example system 700 for parallel zone writing, according to some embodiments. As previously described, the system 700 may include storage drives 171A-171E with partitioned storage. The storage drives 171A-171E are operably coupled to one or more storage controllers. In some embodiments, storage drives 171A-171E may include a fast write portion to record data to be written to the partitioned storage. In an embodiment, the fast write portion is an NVRAM portion. The partitioned storage portion and the NVRAM portion may be individually addressable. The NVRAM portion may be smaller than the partitioned storage portion. The storage drives 171A-171E may be organized into write groups (e.g., write group 701). A write group may RAID-protect the data and write the data in a segment (e.g., SEGIO 707) that includes zones (e.g., zones 704A, 704E) located on a subset of the storage drives 171A-171E within the write group.
In an embodiment, the storage controller may allocate physical memory in segments (e.g., SEGIO 707). A segment may include several zones across several SSDs in the same write group. In an embodiment, multiple zones of a segment may be located on the same SSD in the write group. In some embodiments, several SSDs in the write group may each contribute multiple zones to a segment. The storage controller may determine the size of a segment when allocating it, based on (a) whether the segment will contain data or metadata and (b) the number of SSDs selected from the write group to be written. In some embodiments, data segments may span 8-16 zones, and the segment width may vary from segment to segment based on redundancy or performance requirements, the available storage and storage layout within the storage system, or differences in zone sizes across SSDs.
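One possible, purely illustrative policy for choosing a segment's width from (a) whether it holds data or metadata and (b) the number of SSDs in the write group is sketched below; the specific limits and the two-drive reservation are assumptions, not values stated in the disclosure.

# Illustrative segment-sizing policy.
def choose_segment_width(is_metadata: bool, ssds_in_write_group: int) -> int:
    """Return how many zones (one per SSD) a new segment should span."""
    # Leave one or two SSDs out so a failed device can be rebuilt onto a spare.
    usable = max(ssds_in_write_group - 2, 1)
    if is_metadata:
        return min(usable, 4)          # small, frequently rewritten segments
    return min(usable, 16)             # data segments: up to 16 zones wide


print(choose_segment_width(False, 24))  # 16
print(choose_segment_width(True, 24))   # 4
print(choose_segment_width(False, 10))  # 8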
In an embodiment, to allocate a segment, the storage controller selects a write group and the SSDs within that group that will contribute zones to the segment. The write group and SSD selections may be quasi-random, but for long-term I/O and media wear leveling, the selections may be slightly biased toward newer SSDs and/or SSDs that contain more free zones, or, for segments that are expected to remain intact for longer periods of time, toward older SSDs with less remaining expected life. To ensure that SSDs remain available for recovery and rebuild from a failed device, at allocation time a segment may be limited to one or two fewer zones than the number of SSDs in the write group.
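The biased quasi-random selection described above might look like the following sketch; the weighting formula and the SsdInfo fields are assumptions made for illustration.

# Possible way to bias quasi-random SSD selection toward newer drives and
# drives with more free zones (or toward older drives for long-lived segments).
import random
from dataclasses import dataclass


@dataclass
class SsdInfo:
    ssd_id: int
    age_writes: int        # lifetime writes so far (proxy for wear/age)
    free_zones: int


def pick_ssds(candidates, zones_needed, long_lived_segment=False):
    def weight(s):
        if long_lived_segment:
            wear = 1 + s.age_writes            # bias long-lived data toward older drives
        else:
            wear = 1.0 / (1 + s.age_writes)    # otherwise favor newer drives
        return wear * (1 + s.free_zones)       # and drives with more free zones

    chosen, pool = [], list(candidates)
    for _ in range(min(zones_needed, len(pool))):
        pick = random.choices(pool, weights=[weight(s) for s in pool], k=1)[0]
        chosen.append(pick)
        pool.remove(pick)
    return chosen


ssds = [SsdInfo(i, age_writes=i * 1000, free_zones=100 - i) for i in range(12)]
print([s.ssd_id for s in pick_ssds(ssds, zones_needed=10)])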
In some embodiments, the storage controller may designate each zone in a segment as, for example, a column of 1-megabyte shards (e.g., shards 702A-702E). The corresponding shards in each of the segment's zones may be collectively referred to as a SEGIO (e.g., SEGIO 707). A SEGIO may be the unit in which the storage controller packs data or metadata before writing the data or metadata to flash, and the storage controller calculates an erasure code across the SEGIO. In an embodiment, the erasure code may be rotated (e.g., occupy a different shard position in each of the segment's SEGIOs). Segments may be written in RAID-like stripes across multiple SSDs of the storage system.
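A small sketch of how parity (erasure-code) shard positions might rotate across the SEGIOs of a segment follows; the rotation pattern shown is an assumption, chosen only to illustrate the idea.

# Rotating parity shard positions across SEGIOs, similar to rotating parity in RAID.
def segio_layout(num_shards: int, num_segios: int, parity_shards: int = 2):
    """Return, per SEGIO, which shard indices hold parity (P/Q) for that SEGIO."""
    layout = []
    for segio in range(num_segios):
        start = segio % num_shards
        parity = [(start + k) % num_shards for k in range(parity_shards)]
        layout.append(parity)
    return layout


# A segment with 8 shards and 4 SEGIOs: parity occupies a different pair of
# shard positions in each SEGIO.
for i, parity in enumerate(segio_layout(8, 4)):
    print(f"SEGIO {i}: parity in shards {parity}")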
In an embodiment, the storage controller designates each shard as a column of logical pages aligned with SSD flash pages, which are the units of internal SSD writes and ECC-protected data. In other embodiments, smaller or larger logical pages may be utilized. The storage controller may calculate a separate erasure code over each set of corresponding logical pages in a SEGIO. Advantageously, when an uncorrectable read error occurs, the storage controller may reconstruct the unreadable logical page from the erasure code stored elsewhere in the SEGIO, either from elsewhere within the same shard (to cover a single-page read error) or from a shard on another device.
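The reconstruction path can be illustrated with single XOR parity standing in for the erasure code; a real array would use a stronger code (e.g., P/Q), so this is a sketch of the concept rather than the actual scheme.

# Rebuild an unreadable logical page from surviving pages plus XOR parity.
def xor_pages(pages):
    out = bytearray(len(pages[0]))
    for page in pages:
        for i, b in enumerate(page):
            out[i] ^= b
    return bytes(out)


# Corresponding logical pages, one per shard, plus their parity page.
pages = [bytes([i] * 8) for i in range(1, 5)]
parity = xor_pages(pages)

# If page 2 is unreadable, rebuild it from the surviving pages and the parity.
lost_index = 2
survivors = [p for i, p in enumerate(pages) if i != lost_index]
rebuilt = xor_pages(survivors + [parity])
assert rebuilt == pages[lost_index]
print("rebuilt page:", rebuilt.hex())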
In an embodiment, to optimize read time, the array packs the data and metadata into a shard's buffered logical pages in an order that minimizes fragmentation of larger blocks. The packing order may not affect write performance, because the array writes the entire shard. The storage controller may calculate and store page checksums (e.g., P 706 and Q 708) based on the data written to the storage drives. In some embodiments, a checksum may be initialized with the corresponding page offset within the shard and a monotonically increasing segment number. In one embodiment, when the storage controller reads a logical page, it recalculates the checksum of the logical page and compares the result to the stored value to verify the content and the source of the page.
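A sketch of the checksum seeding and verification idea follows, using CRC32 as a stand-in for whatever checksum the array actually computes; the seed layout is an assumption.

# Seed the page checksum with the page's offset within the shard and a
# monotonically increasing segment number, so a stale or misplaced page
# fails verification.
import struct
import zlib


def page_checksum(payload: bytes, page_offset: int, segment_number: int) -> int:
    seed = struct.pack("<QQ", page_offset, segment_number)
    return zlib.crc32(payload, zlib.crc32(seed))


def verify_page(payload, page_offset, segment_number, stored_checksum) -> bool:
    return page_checksum(payload, page_offset, segment_number) == stored_checksum


data = b"logical page contents"
stored = page_checksum(data, page_offset=3 << 12, segment_number=8114)

print(verify_page(data, 3 << 12, 8114, stored))   # True
print(verify_page(data, 4 << 12, 8114, stored))   # False: wrong offset/location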
FIG. 8 illustrates an example system 800 for writing to zones of a partitioned storage device in parallel, according to some embodiments. The system 800 includes compressed data 804, which may be received (802) from many different sources and may include any type of data. In an embodiment, the compressed data 804 may include write data, unmap data, write-same data, file system data, object data, or other types of data. In an embodiment, operations are received from various sources and may include writes, unmap requests, write-same requests, virtual block range copy requests, snapshot requests, and other types of requests. These operations, in combination with internal array operations, may produce data and metadata that is compressed as described above. The received data is to be stored on one or more storage drives (e.g., storage drive 171) comprising partitioned storage. In some embodiments, the data may not be compressed.
In embodiments, an index table may be generated for the data to be stored, and the data may be compressed (or further compressed in embodiments in which the data has already been compressed). In some embodiments, the data may remain uncompressed. The index table may point to the data, using, for example, digests of the data (e.g., index digests) to index into the table. In response to a write being performed, the corresponding index table may be updated to include the write. When generating the index table, processing logic may write the received data in subsets (e.g., cblocks) 806, 808, 810. A subset may be erasure-code encoded such that the subset is recoverable while the next subset is subsequently written. To increase throughput, more than one subset may be written in parallel with other subsets. Furthermore, more than one zone may be written in parallel with additional zones. In an embodiment, the subset (e.g., cblock) is part of the zones 812A, 812B. Data may be written to portions of multiple zones in parallel, without filling a first zone before writing to another zone.
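The index-table idea can be sketched as a digest-keyed map from written cblocks to their zone locations; the table structure and the use of SHA-256 digests are illustrative assumptions, not details from the disclosure.

# Illustrative index table: each written subset (cblock) is recorded under a
# digest of its contents, so later reads or deduplication can locate it.
import hashlib


class IndexTable:
    def __init__(self):
        self._entries = {}          # digest -> (zone_id, offset, length)

    def record_write(self, payload: bytes, zone_id: int, offset: int):
        digest = hashlib.sha256(payload).digest()
        self._entries[digest] = (zone_id, offset, len(payload))
        return digest

    def lookup(self, payload: bytes):
        return self._entries.get(hashlib.sha256(payload).digest())


table = IndexTable()
cblock = b"compressed block contents"
table.record_write(cblock, zone_id=812, offset=0)
print(table.lookup(cblock))        # (812, 0, 25)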
The processing logic of the storage controller may maintain a list of available zones across a number of flash devices 816, 818. As previously described, the zones may be classified as having various states. In an embodiment, the list includes available zones that have been classified as having an open state or an empty state. As part of the same operation, subsequent zones (e.g., 812B) may be written to different storage drives 818 in parallel. A storage controller (e.g., controller 110) may monitor the available zones across multiple drives.
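A minimal sketch of this zone bookkeeping follows, assuming conventional zoned-storage states (empty, open, full) and hypothetical class names.

# Zone bookkeeping a storage controller might keep across drives.
from enum import Enum


class ZoneState(Enum):
    EMPTY = "empty"
    OPEN = "open"
    FULL = "full"
    OFFLINE = "offline"


class ZoneDirectory:
    def __init__(self):
        self._zones = {}            # (drive_id, zone_id) -> ZoneState

    def add_zone(self, drive_id, zone_id, state=ZoneState.EMPTY):
        self._zones[(drive_id, zone_id)] = state

    def available(self):
        """Zones that can accept writes: classified empty or open."""
        return [k for k, s in self._zones.items()
                if s in (ZoneState.EMPTY, ZoneState.OPEN)]

    def mark(self, drive_id, zone_id, state):
        self._zones[(drive_id, zone_id)] = state


zd = ZoneDirectory()
for drive in (816, 818):
    for zone in range(4):
        zd.add_zone(drive, zone)
zd.mark(816, 0, ZoneState.FULL)
print(zd.available())   # every zone except (816, 0)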
FIG. 9 is an example method 900 to efficiently write data in a partitioned drive storage system in accordance with an embodiment of the present disclosure. In general, method 900 may be performed by processing logic that may comprise hardware (e.g., a processing device, circuitry, dedicated logic, programmable logic, microcode, hardware of the device, integrated circuits, etc.), software (e.g., instructions run or executed on a processing device), or a combination thereof. In some embodiments, as previously described at fig. 1A-3E, processing logic of a storage controller of a storage system may perform method 900.
The method 900 may begin at block 910, where processing logic maintains a list of available zones across respective Solid State Drive (SSD) storage portions of the partitioned storage devices of a storage system.
At block 920, processing logic receives data from a plurality of sources, including a plurality of volumes and associated metadata.
At block 930, processing logic determines a plurality of shards of the data such that each shard can be written in parallel with the remaining shards of the plurality of shards.
At block 940, processing logic maps the plurality of shards to a subset of the available zones.
At block 950, processing logic writes the plurality of shards to the subset of the available zones in parallel.
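Taken together, blocks 910-950 can be sketched as follows, with a placeholder zone-append call standing in for the actual drive interface; the shard size and function names are assumptions made for illustration.

# Compact end-to-end sketch of blocks 910-950: shard incoming data and write
# the shards in parallel to available zones on different SSDs.
from concurrent.futures import ThreadPoolExecutor


def shard_data(data: bytes, shard_size: int):
    return [data[i:i + shard_size] for i in range(0, len(data), shard_size)]


def write_zone(drive_id: int, zone_id: int, shard: bytes) -> int:
    return len(shard)                       # placeholder for a zone-append write


def write_dataset(data: bytes, available_zones, shard_size=1 << 20):
    shards = shard_data(data, shard_size)                     # block 930
    targets = available_zones[:len(shards)]                   # block 940
    with ThreadPoolExecutor() as pool:                        # block 950
        futures = [pool.submit(write_zone, d, z, s)
                   for s, (d, z) in zip(shards, targets)]
        return sum(f.result() for f in futures)


zones = [(171, 0), (172, 0), (173, 0), (174, 0)]
print(write_dataset(b"v" * (3 << 20), zones))  # 3145728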

Claims (20)

1. A storage system, comprising:
a plurality of partitioned storage devices, wherein each partitioned storage device of the plurality of partitioned storage devices includes a Solid State Drive (SSD) storage portion; and
A storage controller operatively coupled to the plurality of partitioned storage devices, the storage controller comprising a processing device for:
maintaining a list of a plurality of available zones across the respective SSD storage portions of the plurality of partitioned storage devices;
receiving data from a plurality of sources, wherein the data is associated with processing a data set, the data set comprising a plurality of volumes and associated metadata;
determining a plurality of shards of the data such that each shard of the plurality of shards can be written in parallel with the remaining shards of the plurality of shards;
mapping the plurality of shards, respectively, to a subset of the plurality of available zones; and
writing the plurality of shards in parallel to the subset of the plurality of available zones.
2. The storage system of claim 1, wherein the processing device is further to:
write an erasure code corresponding to the data to the subset of the plurality of available zones.
3. The storage system of claim 2, wherein the erasure code comprises a first portion and a second portion, and wherein to write the erasure code, the processing device is further to:
write the first portion to a first zone of the subset of the plurality of available zones and write the second portion, in parallel with the first portion, to a second zone of the subset of the plurality of available zones.
4. The storage system of claim 1, wherein the available zones of the plurality of partitioned storage devices are classified as being in an open state.
5. The storage system of claim 1, wherein the available zones of the plurality of partitioned storage devices are classified as being in an empty state.
6. The storage system of claim 1, further comprising a non-volatile random access memory (NVRAM) portion for recording the data, wherein the SSD portion and the NVRAM portion are individually addressable, and wherein the NVRAM portion is smaller than the SSD portion.
7. The storage system of claim 6, wherein the NVRAM portion comprises a Random Access Memory (RAM) device, an energy storage device, and a processing device.
8. A method, comprising:
maintaining, by a processing device of a storage controller, a list of a plurality of available zones across respective Solid State Drive (SSD) storage portions of a plurality of partitioned storage devices of a storage system;
receiving data from a plurality of sources, wherein the data is associated with processing a data set, the data set comprising a plurality of volumes and associated metadata;
determining a plurality of shards of the data such that each shard of the plurality of shards can be written in parallel with the remaining shards of the plurality of shards;
mapping the plurality of shards, respectively, to a subset of the plurality of available zones; and
writing the plurality of shards in parallel to the subset of the plurality of available zones.
9. The method as recited in claim 8, further comprising:
writing an erasure code corresponding to the data to the subset of the plurality of available zones.
10. The method of claim 9, wherein the erasure code comprises a first portion and a second portion, and wherein writing the erasure code further comprises:
writing the first portion to a first zone of the subset of the plurality of available zones and writing the second portion, in parallel with the first portion, to a second zone of the subset of the plurality of available zones.
11. The method of claim 8, wherein the available zones of the plurality of partitioned storage devices are classified as being in an open state.
12. The method of claim 8, wherein the available zones of the plurality of partitioned storage devices are classified as being in an empty state.
13. The method of claim 8, wherein the plurality of partitioned storage devices further comprises a non-volatile random access memory (NVRAM) portion for recording the data, wherein the SSD portion and the NVRAM portion are individually addressable, and wherein the NVRAM portion is smaller than the SSD portion.
14. The method of claim 13, wherein the NVRAM portion comprises a Random Access Memory (RAM) device, an energy storage device, and a processing device.
15. A non-transitory computer-readable storage medium comprising instructions that, when executed by a processing device of a storage controller, cause the processing device to:
maintaining, by the processing device of the storage controller, a list of a plurality of available zones across respective Solid State Drive (SSD) storage portions of a plurality of partitioned storage devices of a storage system;
receiving data from a plurality of sources, wherein the data is associated with processing a data set, the data set comprising a plurality of volumes and associated metadata;
determining a plurality of shards of the data such that each shard of the plurality of shards can be written in parallel with the remaining shards of the plurality of shards;
mapping the plurality of shards, respectively, to a subset of the plurality of available zones; and
writing the plurality of shards in parallel to the subset of the plurality of available zones.
16. The non-transitory computer-readable storage medium of claim 15, wherein the processing device is further to:
write an erasure code corresponding to the data to the subset of the plurality of available zones.
17. The non-transitory computer-readable storage medium of claim 16, wherein the erasure code comprises a first portion and a second portion, and wherein to write the erasure code, the processing device is further to:
write the first portion to a first zone of the subset of the plurality of available zones and write the second portion, in parallel with the first portion, to a second zone of the subset of the plurality of available zones.
18. The non-transitory computer-readable storage medium of claim 15, wherein the available zones of the plurality of partitioned storage devices are classified as being in an open state.
19. The non-transitory computer-readable storage medium of claim 15, wherein the available zones of the plurality of partitioned storage devices are classified as being in an empty state.
20. The non-transitory computer-readable storage medium of claim 15, further comprising a non-volatile random access memory (NVRAM) portion for recording the data, wherein the SSD portion and the NVRAM portion are individually addressable, and wherein the NVRAM portion is smaller than the SSD portion.
CN202280049124.1A 2021-06-24 2022-05-27 Efficient writing of data in a partitioned drive storage system Pending CN117616378A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US17/356,870 US11550481B2 (en) 2016-12-19 2021-06-24 Efficiently writing data in a zoned drive storage system
US17/356,870 2021-06-24
PCT/US2022/031418 WO2022271412A1 (en) 2021-06-24 2022-05-27 Efficiently writing data in a zoned drive storage system

Publications (1)

Publication Number Publication Date
CN117616378A true CN117616378A (en) 2024-02-27

Family

ID=82100661

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280049124.1A Pending CN117616378A (en) 2021-06-24 2022-05-27 Efficient writing of data in a partitioned drive storage system

Country Status (3)

Country Link
EP (1) EP4359899A1 (en)
CN (1) CN117616378A (en)
WO (1) WO2022271412A1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101490327B1 (en) * 2006-12-06 2015-02-05 퓨전-아이오, 인크. Apparatus, system and method for managing commands of solid-state storage using bank interleave
US9563555B2 (en) * 2011-03-18 2017-02-07 Sandisk Technologies Llc Systems and methods for storage allocation
US10452290B2 (en) * 2016-12-19 2019-10-22 Pure Storage, Inc. Block consolidation in a direct-mapped flash storage system

Also Published As

Publication number Publication date
EP4359899A1 (en) 2024-05-01
WO2022271412A1 (en) 2022-12-29
WO2022271412A9 (en) 2023-11-16

Similar Documents

Publication Publication Date Title
US11846968B2 (en) Relocation of data for heterogeneous storage systems
CN113302584B (en) Storage management for cloud-based storage systems
US11842053B2 (en) Zone namespace
US11614880B2 (en) Storage system with selectable write paths
US20210286517A1 (en) Utilizing Allocation Shares To Improve Parallelism In A Zoned Drive Storage System
US11789626B2 (en) Optimizing block allocation in a data storage system
CN116601596A (en) Selecting segments for garbage collection using data similarity
US11520514B2 (en) Optimized relocation of data based on data characteristics
US11874733B2 (en) Recovering a container storage system
US20240012752A1 (en) Guaranteeing Physical Deletion of Data in a Storage System
CN117377941A (en) Generating a dataset using an approximate baseline
US20230244399A1 (en) Selecting Storage Resources Based On Data Characteristics
US20220405200A1 (en) Compressed data management in zones
US20220357857A1 (en) Resiliency scheme to enhance storage performance
US11860780B2 (en) Storage cache management
US11762764B1 (en) Writing data in a storage system that includes a first type of storage device and a second type of storage device
US20240184472A1 (en) Optimized read request processing for relocated data
US20240004546A1 (en) IO Profiles in a Distributed Storage System
US20240037259A1 (en) Volume Dependencies in a Storage System
US20240069781A1 (en) Optimizing Data Deletion Settings in a Storage System
US20240069729A1 (en) Optimizing Data Deletion in a Storage System
US20230236794A1 (en) Hybrid cascaded sorting pipeline
US20230237065A1 (en) Reducing Storage System Load Using Snapshot Distributions
US20230229359A1 (en) Containers as Volumes in a Container-Aware Storage System
US20230367479A1 (en) Projecting Capacity Utilization For Snapshots

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination