CN116391169A - Creating virtual storage systems - Google Patents

Creating virtual storage systems

Info

Publication number
CN116391169A
Authority
CN
China
Prior art keywords
storage
storage system
data
virtual storage
data set
Prior art date
Legal status
Pending
Application number
CN202180070563.6A
Other languages
Chinese (zh)
Inventor
Ronald Karr
Par Botes
Current Assignee
Pure Storage Inc
Original Assignee
Pure Storage Inc
Priority date
Filing date
Publication date
Priority claimed from US 17/070,372 (US 11,422,751 B2)
Application filed by Pure Storage Inc filed Critical Pure Storage Inc
Publication of CN116391169A publication Critical patent/CN116391169A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from the processing unit to the output unit, e.g. interface arrangements
    • G06F 3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F 3/0601 Interfaces specially adapted for storage systems
    • G06F 3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F 3/0604 Improving or facilitating administration, e.g. storage management
    • G06F 3/0605 Improving or facilitating administration by facilitating the interaction with a user or administrator
    • G06F 3/0607 Improving or facilitating administration by facilitating the process of upgrading existing storage systems, e.g. for improving compatibility between host and storage device
    • G06F 3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F 3/0646 Horizontal data movement in storage systems, i.e. moving data between storage devices or systems
    • G06F 3/0647 Migration mechanisms
    • G06F 3/0662 Virtualisation aspects
    • G06F 3/0665 Virtualisation aspects at area level, e.g. provisioning of virtual or logical volumes
    • G06F 3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F 3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G06F 3/0671 In-line storage system
    • G06F 3/0683 Plurality of storage devices
    • G06F 3/0685 Hybrid storage combining heterogeneous device types, e.g. hierarchical storage, hybrid arrays

Abstract

The invention relates to creating a virtual storage system comprising: instantiating one or more virtual storage controllers; instantiating one or more virtual storage devices, each comprising a plurality of storage tiers; and constructing a virtual storage system, wherein the one or more virtual storage devices are coupled to each of the one or more virtual storage controllers.
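As a rough illustration of the claimed steps, the following sketch instantiates controllers and tiered devices and couples them together. The class names, tier labels, and counts are assumptions made for the example, not terminology from the claims.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VirtualStorageDevice:
    # Each virtual storage device exposes multiple storage tiers,
    # e.g. a fast write-staging tier and a bulk durable tier (assumed names).
    tiers: List[str]

@dataclass
class VirtualStorageController:
    name: str
    devices: List[VirtualStorageDevice] = field(default_factory=list)

def create_virtual_storage_system(num_controllers: int,
                                  num_devices: int,
                                  tiers: List[str]):
    """Instantiate controllers and tiered devices, then couple every
    device to every controller to compose the virtual storage system."""
    controllers = [VirtualStorageController(f"vsc-{i}") for i in range(num_controllers)]
    devices = [VirtualStorageDevice(list(tiers)) for _ in range(num_devices)]
    for controller in controllers:
        controller.devices.extend(devices)   # couple devices to each controller
    return controllers, devices

controllers, devices = create_virtual_storage_system(
    num_controllers=2, num_devices=4, tiers=["block-staging", "durable-object"])
```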

Description

Creating virtual storage systems
Drawings
FIG. 1A illustrates a first example system for data storage according to some embodiments.
FIG. 1B illustrates a second example system for data storage according to some embodiments.
FIG. 1C illustrates a third example system for data storage according to some embodiments.
FIG. 1D illustrates a fourth example system for data storage according to some embodiments.
FIG. 2A is a perspective view of a storage cluster having multiple storage nodes and internal storage coupled to each storage node to provide network attached storage, according to some embodiments.
Fig. 2B is a block diagram showing an interconnect switch coupling multiple storage nodes, according to some embodiments.
FIG. 2C is a multi-level block diagram showing the contents of a storage node and the contents of one of the non-volatile solid state storage units, according to some embodiments.
FIG. 2D shows a storage server environment using embodiments of storage nodes and storage units of some previous figures, according to some embodiments.
FIG. 2E is a blade hardware block diagram showing a control plane, compute and storage planes, and authorities interacting with underlying physical resources, according to some embodiments.
Fig. 2F depicts a resilient software layer in a blade of a storage cluster, according to some embodiments.
FIG. 2G depicts authorities and storage resources in a blade of a storage cluster, according to some embodiments.
Fig. 3A sets forth a diagram of a storage system coupled for data communication with a cloud service provider according to some embodiments of the present disclosure.
Fig. 3B sets forth a diagram of a storage system according to some embodiments of the present disclosure.
FIG. 3C illustrates an exemplary computing device that may be explicitly configured to perform one or more of the processes described herein.
Fig. 3D sets forth a block diagram illustrating a plurality of storage systems that support a pod according to some embodiments of the present disclosure.
FIG. 3E sets forth a flow chart illustrating an example method for servicing I/O operations directed to a data set synchronized across multiple storage systems according to some embodiments of the present disclosure.
Fig. 4 sets forth an example of a cloud-based storage system according to some embodiments of the present disclosure.
Fig. 5 sets forth an example of an additional cloud-based storage system according to some embodiments of the present disclosure.
FIG. 6 sets forth a flow chart illustrating an example method of servicing I/O operations in a cloud-based storage system.
FIG. 7 sets forth a flow chart illustrating an example method of servicing I/O operations in a cloud-based storage system.
FIG. 8 sets forth a flow chart illustrating an additional example method of servicing I/O operations in a cloud-based storage system.
FIG. 9 sets forth a flow chart illustrating an additional example method of servicing I/O operations in a cloud-based storage system.
FIG. 10 sets forth a flow chart illustrating an additional example method of servicing I/O operations in a cloud-based storage system.
FIG. 11 sets forth a flow chart illustrating an additional example method of servicing I/O operations in a cloud-based storage system.
FIG. 12 illustrates an example virtual storage system architecture, according to some embodiments of the present disclosure.
FIG. 13 illustrates an additional example virtual storage system architecture according to some embodiments of the present disclosure.
FIG. 14 illustrates an additional example virtual storage system architecture according to some embodiments of the present disclosure.
FIG. 15 illustrates an additional example virtual storage system architecture according to some embodiments of the present disclosure.
FIG. 16 illustrates an additional example virtual storage system architecture according to some embodiments of the present disclosure.
FIG. 17 sets forth a flow chart illustrating an additional example method of servicing I/O operations in a virtual storage system.
FIG. 18 sets forth a flow chart illustrating an additional example method of servicing I/O operations in a virtual storage system.
FIG. 19 illustrates an additional example virtual storage system architecture, according to some embodiments of the present disclosure.
FIG. 20 illustrates an additional example virtual storage system architecture, according to some embodiments of the present disclosure.
FIG. 21 sets forth a flow chart illustrating an example method of creating a virtual storage system.
FIG. 22 sets forth a flow chart illustrating an additional example method of creating a virtual storage system.
FIG. 23 sets forth a flow chart illustrating an additional example method of creating a virtual storage system.
FIG. 24 sets forth a flow chart illustrating an additional example method of creating a virtual storage system.
Detailed Description
Example methods, apparatus, and articles of manufacture for a virtual storage system architecture according to embodiments of the present disclosure are described with reference to the accompanying drawings, beginning with FIG. 1A. FIG. 1A illustrates an example system for data storage according to some embodiments. For purposes of illustration and not limitation, system 100 (also referred to herein as a "storage system") includes numerous elements. It may be noted that in other embodiments, system 100 may include the same, more, or fewer elements configured in the same or different ways.
The system 100 includes a number of computing devices 164A-B. A computing device (also referred to herein as a "client device") may be embodied as, for example, a server, workstation, personal computer, notebook, or the like in a data center. The computing devices 164A-B may be coupled for data communication with one or more storage arrays 102A-B through a storage area network ('SAN') 158 or a local area network ('LAN') 160.
SAN 158 may be implemented with various data communication architectures, devices, and protocols. For example, the architecture for SAN 158 may include fibre channel, ethernet, infiniband, serial attached small computer system interface ('SAS'), or the like. The data communication protocols used with SAN 158 may include advanced technology attachment ('ATA'), fibre channel protocol, small computer system interface ('SCSI'), internet small computer system interface ('iSCSI'), HyperSCSI, non-volatile memory express ('NVMe') over fabrics, or the like. It is noted that SAN 158 is provided for purposes of illustration and not limitation. Other data communication couplings may be implemented between the computing devices 164A-B and the storage arrays 102A-B.
LAN 160 may also be implemented with a variety of architectures, devices, and protocols. For example, an architecture for the LAN 160 may include ethernet (802.3), wireless (802.11), or the like. The data communication protocols used in the LAN 160 may include transmission control protocol ('TCP'), user datagram protocol ('UDP'), internet protocol ('IP'), hypertext transfer protocol ('HTTP'), wireless access protocol ('WAP'), hand-held device transmission protocol ('HDTP'), session initiation protocol ('SIP'), real-time protocol ('RTP'), or the like.
The storage arrays 102A-B may provide persistent data storage for the computing devices 164A-B. In implementations, the storage array 102A may be housed in a chassis (not shown) and the storage array 102B may be housed in another chassis (not shown). The storage arrays 102A and 102B may include one or more storage array controllers 110A-D (also referred to herein as "controllers"). The storage array controllers 110A-D may be embodied as modules of an automated computing machine including computer hardware, computer software, or a combination of computer hardware and software. In some implementations, the storage array controllers 110A-D may be configured to perform various storage tasks. Storage tasks may include writing data received from computing devices 164A-B to storage arrays 102A-B, erasing data from storage arrays 102A-B, retrieving data from storage arrays 102A-B and providing data to computing devices 164A-B, monitoring and reporting disk utilization and performance, performing redundant operations (e.g., redundant array of independent drives ('RAID') or RAID-like data redundancy operations), compressing data, encrypting data, and so forth.
The storage array controllers 110A-D may be implemented in various ways, including as a field programmable gate array ('FPGA'), a programmable logic chip ('PLC'), an application specific integrated circuit ('ASIC'), a system on a chip ('SOC'), or any computing device including discrete components (e.g., a processing device, a central processing unit, a computer memory, or various adapters). The storage array controllers 110A-D may include, for example, data communications adapters configured to support communications via the SAN 158 or LAN 160. In some implementations, the storage array controllers 110A-D may be independently coupled to the LAN 160. In an implementation, storage array controllers 110A-D may include I/O controllers or the like that couple storage array controllers 110A-D for data communications to persistent storage resources 170A-B (also referred to herein as "storage resources") through a midplane (not shown). Persistent storage resources 170A-B generally include any number of storage drives 171A-F (also referred to herein as "storage") and any number of non-volatile random access memory ('NVRAM') devices (not shown).
In some implementations, NVRAM devices of persistent storage resources 170A-B may be configured to receive data from the storage array controllers 110A-D to be stored in the storage drives 171A-F. In some examples, the data may originate from computing devices 164A-B. In some examples, writing data to the NVRAM device may be performed faster than writing data directly to the storage drives 171A-F. In an implementation, the storage array controllers 110A-D may be configured to utilize NVRAM devices as quickly accessible buffers for data that is intended to be written to the storage drives 171A-F. The latency of write requests using NVRAM devices as buffers may be improved relative to systems in which storage array controllers 110A-D write data directly to the storage drives 171A-F. In some embodiments, the NVRAM device may be implemented with computer memory in the form of high-bandwidth, low-latency RAM. The NVRAM device is referred to as "non-volatile" because the NVRAM device may receive or include a unique power source that maintains the state of the RAM after main power is lost to the NVRAM device. Such a power source may be a battery, one or more capacitors, or the like. In response to losing power, the NVRAM device may be configured to write the contents of the RAM to persistent storage, such as the storage drives 171A-F.
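The write path described above, in which a write is acknowledged once the data lands in the NVRAM buffer and is later destaged to the storage drives, can be sketched roughly as follows. This is a simplified model; the buffer limit and destage policy are assumptions.

```python
class NvramBufferedWriter:
    """Toy model of NVRAM used as a fast-access write buffer in front of
    slower storage drives, with a flush path for loss of main power."""

    def __init__(self, drives, buffer_limit=64):
        self.drives = drives          # list of dict-like backing stores
        self.buffer = {}              # stands in for battery/capacitor-backed RAM
        self.buffer_limit = buffer_limit

    def write(self, address, data):
        # Low-latency acknowledgment: data is considered durable once in NVRAM.
        self.buffer[address] = data
        if len(self.buffer) >= self.buffer_limit:
            self.destage()
        return "ack"

    def destage(self):
        # Background path: move buffered writes to the storage drives.
        for address, data in self.buffer.items():
            self.drives[address % len(self.drives)][address] = data
        self.buffer.clear()

    def on_power_loss(self):
        # Stored energy is used to push RAM contents to persistent storage.
        self.destage()

drives = [dict(), dict()]
writer = NvramBufferedWriter(drives)
writer.write(10, b"hello")
writer.on_power_loss()
```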
In an implementation, storage drives 171A-F may refer to any device configured to record data persistently, where "persistently" or "permanently" refers to the device's ability to maintain the recorded data after losing power. In some implementations, the storage drives 171A-F may correspond to non-disk storage media. For example, storage drives 171A-F may be one or more solid state drives ('SSDs'), flash memory-based storage devices, any type of solid state non-volatile memory, or any other type of non-mechanical storage device. In other embodiments, storage drives 171A-F may comprise mechanical or rotating hard disks, such as hard disk drives ('HDDs').
In some implementations, the storage array controllers 110A-D may be configured to offload device management responsibilities from the storage drives 171A-F in the storage arrays 102A-B. For example, the storage array controllers 110A-D may manage control information that may describe the state of one or more memory blocks in the storage drives 171A-F. The control information may indicate, for example, that a particular memory block has failed and should no longer be written to, that a particular memory block contains the boot code of the storage array controllers 110A-D, the number of program-erase ('P/E') cycles that have been performed on a particular memory block, the age of data stored in a particular memory block, the type of data stored in a particular memory block, and so forth. In some implementations, control information may be stored with an associated memory block as metadata. In other implementations, control information for the storage drives 171A-F may be stored in one or more particular memory blocks of the storage drives 171A-F selected by the storage array controllers 110A-D. The selected memory block may be marked with an identifier indicating that the selected memory block contains control information. The identifiers may be used by the storage array controllers 110A-D along with the storage drives 171A-F to quickly identify memory blocks containing control information. For example, the storage array controllers 110A-D may issue commands to locate memory blocks containing control information. It may be noted that the control information may be so large that portions of it are stored in multiple locations, that the control information may be stored in multiple locations for redundancy purposes, for example, or that the control information may otherwise be distributed across multiple memory blocks in the storage drives 171A-F.
In an embodiment, the storage array controllers 110A-D may offload device management responsibilities from the storage drives 171A-F of the storage arrays 102A-B by retrieving control information from the storage drives 171A-F describing the state of one or more memory blocks in the storage drives 171A-F. Retrieving control information from storage drives 171A-F may be performed, for example, by storage array controllers 110A-D querying storage drives 171A-F for the location of control information for a particular storage drive 171A-F. The storage drives 171A-F may be configured to execute instructions that enable the storage drives 171A-F to identify the location of the control information. The instructions may be executed by a controller (not shown) associated with or otherwise located on the storage drives 171A-F and may cause the storage drives 171A-F to scan a portion of each memory block to identify the memory block storing the control information of the storage drives 171A-F. The storage drives 171A-F may respond by sending response messages to the storage array controllers 110A-D containing the locations of the control information of the storage drives 171A-F. In response to receiving the response message, the storage array controllers 110A-D may issue a request to read data stored at addresses associated with the locations of the control information of the storage drives 171A-F.
In other embodiments, the storage array controllers 110A-D may further offload device management responsibilities from the storage drives 171A-F by performing storage drive management operations in response to receiving control information. Storage drive management operations may include, for example, operations typically performed by storage drives 171A-F, such as controllers (not shown) associated with particular storage drives 171A-F. The storage drive management operations may include, for example, ensuring that data is not written to the failed memory blocks within storage drives 171A-F, ensuring that data is written to the memory blocks within storage drives 171A-F in a manner such that adequate wear leveling is achieved, and so forth.
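The exchange described in the last two paragraphs, in which the controller asks a drive where its control information lives, reads it, and then takes over block-level housekeeping, might look like the following outline. All message and field names here are assumptions for illustration.

```python
class StorageDrive:
    """Drive-side view: it only has to report where control information is kept."""
    def __init__(self):
        # Block 7 is assumed to be marked as holding control information.
        self.blocks = {7: {"failed_blocks": [3], "pe_cycles": {5: 1200}}}
        self.control_block = 7

    def locate_control_info(self):
        return self.control_block

    def read_block(self, block_id):
        return self.blocks[block_id]

def offload_management(controller_state, drive):
    # The array controller retrieves control information and then performs the
    # drive-management decisions itself (e.g., avoid writing to failed blocks).
    location = drive.locate_control_info()
    control_info = drive.read_block(location)
    controller_state["avoid_blocks"] = set(control_info["failed_blocks"])
    return controller_state

state = offload_management({}, StorageDrive())
```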
In an implementation, the storage arrays 102A-B may implement two or more storage array controllers 110A-D. For example, the storage array 102A may include a storage array controller 110A and a storage array controller 110B. In a given example, a single storage array controller 110A-D (e.g., storage array controller 110A) of storage system 100 may be designated as having a primary state (also referred to herein as a "primary controller"), and other storage array controllers 110A-D (e.g., storage array controller 110B) may be designated as having a secondary state (also referred to herein as a "secondary controller"). The primary controller may have certain rights, such as permission to change data in persistent storage resources 170A-B (e.g., write data to persistent storage resources 170A-B). At least some of the rights of the primary controller may supersede the rights of the secondary controller. For example, the secondary controller may lack permission to change data in persistent storage resources 170A-B when the primary controller has that permission. The states of the storage array controllers 110A-D may change. For example, the storage array controller 110A may be designated as having a secondary state and the storage array controller 110B may be designated as having a primary state.
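A reduced sketch of the primary/secondary designation just described, in which only the controller holding primary status may alter data and the roles can be swapped, is shown below; the method names are illustrative assumptions.

```python
class ArrayController:
    def __init__(self, name):
        self.name = name
        self.status = "secondary"

    def write(self, store, address, data):
        # Only the controller currently designated primary may alter data
        # in the persistent storage resources.
        if self.status != "primary":
            raise PermissionError(f"{self.name} is secondary; writes not permitted")
        store[address] = data

def promote(new_primary, old_primary):
    # Role change: swap the primary and secondary designations.
    old_primary.status = "secondary"
    new_primary.status = "primary"

store = {}
ctrl_a, ctrl_b = ArrayController("110A"), ArrayController("110B")
ctrl_a.status = "primary"
ctrl_a.write(store, 0, b"data")
promote(ctrl_b, ctrl_a)       # 110B becomes primary, 110A becomes secondary
ctrl_b.write(store, 1, b"more data")
```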
In some implementations, a primary controller (e.g., storage array controller 110A) may serve as the primary controller for one or more storage arrays 102A-B, and a secondary controller (e.g., storage array controller 110B) may serve as the secondary controller for the one or more storage arrays 102A-B. For example, storage array controller 110A may be the primary controller of storage arrays 102A and 102B, and storage array controller 110B may be the secondary controller of storage arrays 102A and 102B. In some implementations, the storage array controllers 110C and 110D (also referred to as "storage processing modules") may have neither a primary nor a secondary state. The storage array controllers 110C and 110D, implemented as storage processing modules, may serve as a communication interface between the primary and secondary controllers (e.g., storage array controllers 110A and 110B, respectively) and the storage array 102B. For example, the storage array controller 110A of the storage array 102A may send a write request to the storage array 102B via the SAN 158. The write request may be received by both storage array controllers 110C and 110D of storage array 102B. The storage array controllers 110C and 110D facilitate the communication, such as sending the write request to the appropriate storage drives 171A-F. It may be noted that in some embodiments, the storage processing modules may be used to increase the number of storage drives controlled by the primary and secondary controllers.
In implementations, the storage array controllers 110A-D are communicatively coupled to one or more storage drives 171A-F via a midplane (not shown) and to one or more NVRAM devices (not shown) included as part of the storage arrays 102A-B. The storage array controllers 110A-D may be coupled to the midplane via one or more data communication links, and the midplane may be coupled to the storage drives 171A-F and NVRAM devices via one or more data communication links. For example, the data communication links described herein are collectively illustrated by data communication links 108A-D and may include a peripheral component interconnect express ('PCIe') bus.
FIG. 1B illustrates an example system for data storage according to some embodiments. The storage array controller 101 illustrated in FIG. 1B may be similar to the storage array controllers 110A-D described with respect to FIG. 1A. In one example, storage array controller 101 may be similar to storage array controller 110A or storage array controller 110B. For purposes of illustration and not limitation, the storage array controller 101 includes numerous elements. It may be noted that in other embodiments, the storage array controller 101 may contain the same, more, or fewer elements configured in the same or different ways. It may be noted that elements of fig. 1A may be included below to help illustrate features of the storage array controller 101.
The storage array controller 101 may include one or more processing devices 104 and random access memory ('RAM') 111. The processing device 104 (or controller 101) represents one or more general-purpose processing devices, such as a microprocessor, central processing unit, or the like. More particularly, the processing device 104 (or controller 101) may be a complex instruction set computing ('CISC') microprocessor, a reduced instruction set computing ('RISC') microprocessor, a very long instruction word ('VLIW') microprocessor, a processor implementing other instruction sets, or processors implementing a combination of instruction sets. The processing device 104 (or controller 101) may also be one or more special-purpose processing devices, such as an application specific integrated circuit ('ASIC'), a field programmable gate array ('FPGA'), a digital signal processor ('DSP'), a network processor, or the like.
The processing device 104 may be connected to the RAM 111 via a data communication link 106, which data communication link 106 may be embodied as a high-speed memory bus, such as a double data rate 4 ('DDR 4') bus. Stored in RAM 111 is operating system 112. In some implementations, the instructions 113 are stored in the RAM 111. The instructions 113 may include computer program instructions for performing operations in a direct mapped flash memory storage system. In one embodiment, a direct mapped flash memory storage system is a system that directly addresses data blocks within a flash drive and does not require address translation performed by the memory controller of the flash drive.
In an implementation, the storage array controller 101 includes one or more host bus adapters 103A-C coupled to the processing device 104 via data communication links 105A-C. In implementations, the host bus adapters 103A-C can be computer hardware that connects a host system (e.g., a storage array controller) to other networks and storage arrays. In some examples, host bus adapters 103A-C may be fibre channel adapters enabling storage array controller 101 to connect to a SAN, ethernet adapters enabling storage array controller 101 to connect to a LAN, or the like. Host bus adapters 103A-C may be coupled to processing device 104 via data communication links 105A-C (e.g., such as a PCIe bus).
In an implementation, the storage array controller 101 may include a host bus adapter 114 coupled to the expander 115. Expander 115 may be used to attach host systems to a larger number of storage drives. In embodiments in which host bus adapter 114 is embodied as a SAS controller, expander 115 may be, for example, a SAS expander for enabling host bus adapter 114 to be attached to a storage drive.
In an implementation, the storage array controller 101 may include a switch 116 coupled to the processing device 104 via a data communication link 109. Switch 116 may be a computer hardware device that may create multiple endpoints from a single endpoint, thereby enabling multiple devices to share a single endpoint. Switch 116 may be, for example, a PCIe switch coupled to a PCIe bus (e.g., data communication link 109) and presenting multiple PCIe connection points to the midplane.
In an embodiment, the storage array controller 101 includes a data communication link 107 for coupling the storage array controller 101 to other storage array controllers. In some examples, data communication link 107 may be a Quick Path Interconnect (QPI) interconnect.
A conventional storage system using a conventional flash drive may implement a process across flash drives that are part of the conventional storage system. For example, higher level processes of a storage system may initiate and control processes across flash drives. However, the flash drive of a conventional storage system may include its own storage controller that also performs the process. Thus, for a traditional storage system, both higher-level processes (e.g., initiated by the storage system) and lower-level processes (e.g., initiated by a storage controller of the storage system) may be performed.
To address various shortcomings of conventional storage systems, operations may be performed by higher-level processes rather than by lower-level processes. For example, a flash memory storage system may include a flash drive that does not include a storage controller that provides a process. Thus, the operating system of the flash memory storage system itself may initiate and control the process. This may be accomplished by a direct mapped flash memory storage system that directly addresses blocks of data within a flash drive and does not require address translation performed by the memory controller of the flash drive.
The operating system of the flash memory storage system may identify and maintain a list of allocation units across multiple flash drives of the flash memory storage system. The allocation unit may be a full erase block or a plurality of erase blocks. The operating system may maintain a map or address range that maps addresses directly to erase blocks of a flash drive of the flash memory storage system.
An erase block that is mapped directly to a flash drive may be used to rewrite data and erase data. For example, an operation may be performed on one or more allocation units that include first data and second data, where the first data is to be retained and the second data is no longer used by the flash memory storage system. The operating system may initiate a process to write the first data to a new location within the other allocation unit and erase the second data and mark the allocation unit as available for subsequent data. Thus, the process may be performed only by the higher level operating system of the flash memory storage system, with additional lower level processes not necessarily being performed by the controller of the flash drive.
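A compact sketch of the rewrite-and-reclaim flow just described, with allocation units modeled as whole erase blocks, follows. The unit layout and the keep/drop marking are assumptions made for illustration.

```python
class DirectMappedFlash:
    """Operating-system-level view of allocation units (erase blocks) that are
    addressed directly, without a drive-side flash translation layer."""

    def __init__(self, num_units):
        self.units = [dict() for _ in range(num_units)]   # slot -> data
        self.free_units = set(range(num_units))

    def allocate(self):
        # Hand out an available allocation unit.
        return self.free_units.pop()

    def reclaim(self, unit, keep):
        # Copy still-live data ("first data") into a fresh allocation unit,
        # then erase the old unit and mark it available for subsequent data.
        target = self.allocate()
        for slot, data in self.units[unit].items():
            if slot in keep:
                self.units[target][slot] = data
        self.units[unit].clear()          # erase the old block
        self.free_units.add(unit)
        return target

flash = DirectMappedFlash(num_units=8)
u = flash.allocate()
flash.units[u] = {0: b"keep-me", 1: b"stale"}
new_unit = flash.reclaim(u, keep={0})
```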
Advantages of the process being performed only by the operating system of the flash memory storage system include improving the reliability of the flash drive of the flash memory storage system, as no unnecessary or redundant write operations are performed during the process. One possible novelty is the concept of initiating and controlling processes at the operating system of the flash memory storage system. In addition, the process may be controlled by the operating system across multiple flash drives. This is in contrast to the process performed by the memory controller of the flash drive.
The storage system may consist of two storage array controllers sharing a set of drives for failover purposes, or it may consist of a single storage array controller providing storage services that utilize multiple drives, or it may consist of a distributed network of storage array controllers each having a certain number of drives or a certain amount of flash storage, where the storage array controllers in the network cooperate to provide complete storage services and cooperate in various aspects of the storage services, including storage allocation and garbage collection.
FIG. 1C illustrates a third example system 117 for data storage according to some embodiments. For purposes of illustration and not limitation, system 117 (also referred to herein as a "storage system") includes numerous elements. It may be noted that in other embodiments, the system 117 may include the same, more, or fewer elements configured in the same or different ways.
In one embodiment, the system 117 includes a dual peripheral component interconnect ('PCI') flash storage device 118 with individually addressable fast write storage. The system 117 may include a storage device controller 119. In one embodiment, the storage device controllers 119A-D may be a CPU, ASIC, FPGA, or any other circuitry that may implement the necessary control structures according to the present disclosure. In one embodiment, the system 117 includes flash memory devices (e.g., including flash memory devices 120 a-n) operatively coupled to respective channels of the storage device controller 119. The flash memory devices 120 a-n may be presented to the controllers 119A-D as an addressable collection of flash pages, erase blocks, and/or control elements sufficient to allow the storage device controllers 119A-D to program and retrieve various aspects of the flash memory. In one embodiment, the storage device controllers 119A-D may perform operations on the flash memory devices 120 a-n, including storing and retrieving data content of pages, arranging and erasing blocks, tracking statistics related to the use and reuse of flash memory pages, erase blocks and cells, tracking and predicting error codes and faults within the flash memory, controlling voltage levels associated with programming and retrieving the contents of flash memory cells, and the like.
In one embodiment, the system 117 may include RAM 121 to store individually addressable fast write data. In one embodiment, RAM 121 may be one or more separate discrete devices. In another embodiment, RAM 121 may be integrated into storage device controllers 119A-D or multiple storage device controllers. RAM 121 may also be used for other purposes, such as for storing temporary programming memory for a processing device (e.g., CPU) in device controller 119.
In one embodiment, the system 117 may include an energy storage device 122, such as a rechargeable battery or capacitor. The energy storage device 122 may store energy sufficient to power the storage device controller 119, some amount of RAM (e.g., RAM 121), and some amount of flash memory (e.g., flash memory 120 a-120 n) for a sufficient time to write the contents of the RAM to the flash memory. In one embodiment, if the storage device controller detects an external power loss, the storage device controllers 119A-D may write the contents of RAM to flash memory.
In one embodiment, the system 117 includes two data communication links 123a, 123b. In one embodiment, the data communication links 123a, 123b may be PCI interfaces. In another embodiment, the data communication links 123a, 123b may be based on other communication standards (e.g., HyperTransport, InfiniBand, etc.). The data communication links 123a, 123b may be based on the non-volatile memory express ('NVMe') or NVMe over fabrics ('NVMf') specifications that allow external connection to the storage device controllers 119A-D from other components in the storage system 117. It should be noted that, for convenience, the data communication links may be interchangeably referred to herein as PCI buses.
The system 117 may also include an external power source (not shown), which may be provided via one or both data communication links 123a, 123b, or may be provided separately. Alternative embodiments include a separate flash memory (not shown) dedicated to storing the contents of RAM 121. The storage device controllers 119A-D may present a logical device (which may include an addressable fast write logical device) or a distinct portion of the logical address space of the storage device 118 (which may be presented as a PCI memory or as a persistent storage device) via a PCI bus. In one embodiment, operations stored into the device are directed into RAM 121. In the event of a power failure, the storage device controllers 119A-D may write stored content associated with the addressable fast write logical storage to flash memory (e.g., flash memory 120 a-n) for long-term persistent storage.
In one embodiment, the logic device may include some sort of rendering of some or all of the contents of flash memory devices 120 a-n, where the rendering allows a storage system including storage device 118 (e.g., storage system 117) to directly address flash memory pages and directly reprogram erase blocks from storage system components external to the storage device over the PCI bus. The presentation may also allow one or more of the external components to control and retrieve other aspects of the flash memory, including some or all of: tracking statistics related to the use and reuse of flash memory pages, erase blocks, and cells across all flash memory devices; tracking and predicting error codes and faults within and across flash memory devices; controlling a voltage level associated with programming and retrieving the contents of the flash memory cells; etc.
In one embodiment, the energy storage device 122 may be sufficient to ensure that ongoing operations on the flash memory devices 120 a-120 n are completed; for those operations, as well as for storing fast write RAM to flash memory, the energy storage device 122 may power the storage device controllers 119A-D and the associated flash memory devices (e.g., 120 a-n). The energy storage device 122 may be used to store accumulated statistics and other parameters maintained and tracked by the flash memory devices 120 a-n and/or the storage device controller 119. Separate capacitors or energy storage devices (e.g., smaller capacitors near or embedded within the flash memory devices themselves) may be used for some or all of the operations described herein.
Various schemes may be used to track and optimize the lifetime of the energy storage component, such as adjusting voltage levels over time, partially discharging the energy storage device 122 to measure corresponding discharge characteristics, and so forth. If the available energy decreases over time, the effective available capacity of the addressable fast write storage device may be reduced to ensure that it can be safely written to based on the stored energy currently available.
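The capacity adjustment mentioned here can be expressed as a simple proportional rule. The specific formula and parameter values below are assumed for illustration rather than taken from the disclosure.

```python
def usable_fast_write_capacity(nominal_bytes: int,
                               measured_energy_joules: float,
                               energy_per_byte_joules: float) -> int:
    """Limit the advertised fast-write capacity to what the remaining stored
    energy can safely destage to flash after a power loss."""
    safe_bytes = int(measured_energy_joules / energy_per_byte_joules)
    return min(nominal_bytes, safe_bytes)

# Example: an aged capacitor bank that can only flush about 3 GB safely,
# so only 3 GB of the nominal 8 GiB fast-write region is exposed.
print(usable_fast_write_capacity(8 << 30, 1.5, 0.5e-9))
```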
FIG. 1D illustrates a fourth example system 124 for data storage according to some embodiments. In one embodiment, the system 124 includes storage controllers 125a, 125b. In one embodiment, the storage controllers 125a, 125b are operatively coupled to dual PCI storage devices 119a, 119b and 119c, 119d, respectively. The storage controllers 125a, 125b are operably coupled (e.g., via a storage network 130) to a number of host computers 127 a-n.
In one embodiment, two storage controllers (e.g., 125a and 125 b) provide storage services, such as SCSI block storage arrays, file servers, object servers, databases or data analysis services, and the like. The storage controllers 125a, 125b may provide services to host computers 127 a-n external to the storage system 124 through a number of network interfaces (e.g., 126 a-d). The storage controllers 125a, 125b may provide integrated services or applications entirely within the storage system 124, forming a converged storage and computing system. The storage controllers 125a, 125b may utilize fast write memory within the storage devices 119 a-d or across the storage devices 119 a-d to journal ongoing operations to ensure that operations are not lost in the event of a power failure, storage controller removal, storage controller or storage system shutdown, or some failure of one or more software or hardware components within the storage system 124.
In one embodiment, the controllers 125a, 125b operate as PCI masters for one or the other of the PCI buses 128a, 128 b. In another embodiment, 128a, 128b may be based on other communication standards (e.g., HyperTransport, InfiniBand, etc.). Other storage system embodiments may operate the storage controllers 125a, 125b as multiple masters for both PCI buses 128a, 128 b. Alternatively, a PCI/NVMe/NVMf switching infrastructure or fabric may connect multiple storage controllers. Some storage system embodiments may allow storage devices to communicate directly with each other, rather than only with a storage controller. In one embodiment, the storage device controller 119a may operate under direction from the storage controller 125a to synthesize and transfer data to be stored into the flash memory device from data already stored in RAM (e.g., RAM 121 of fig. 1C). For example, a recalculated version of the RAM content may be transferred after the storage controller has determined that an operation has been fully committed across the storage system, or when the flash memory on the device has reached a particular used capacity, or after a particular amount of time, to ensure improved safety of the data or to free up addressable flash memory capacity for reuse. This mechanism may be used, for example, to avoid secondary transfers from the storage controllers 125a, 125b via the buses (e.g., 128a, 128 b). In one embodiment, the recalculation may comprise compressing the data, appending an index or other metadata, combining multiple data segments together, performing erasure code calculations, and so forth.
In one embodiment, under direction from the storage controllers 125a, 125b, the storage device controllers 119a, 119b are operable to calculate data from data stored in RAM (e.g., RAM 121 of fig. 1C) and transfer the data to other storage devices without involving the storage controllers 125a, 125b. This operation may be used to mirror data stored in one controller 125a to another controller 125b, or it may be used to offload compression, data aggregation, and/or erasure coding calculations and transfers to a storage device to reduce the load on the storage controller or storage controller interface 129a, 129b to the PCI bus 128a, 128 b.
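The offload described in the last two paragraphs, where the device controller recomputes or transforms buffered data and ships it onward without another pass through the storage controllers, is sketched here with zlib compression standing in for the recalculation step; that substitution is an assumption made for the example.

```python
import zlib

class DeviceController:
    """Device-side agent that, under direction from a storage controller,
    transforms staged RAM contents and transfers them to a peer device."""

    def __init__(self, staged_ram: dict):
        self.staged_ram = staged_ram

    def recalculate(self) -> bytes:
        # Recalculation may combine segments and compress them; here we just
        # concatenate and compress the staged writes.
        combined = b"".join(self.staged_ram[k] for k in sorted(self.staged_ram))
        return zlib.compress(combined)

    def transfer_to(self, peer_flash: dict, key: str):
        # Direct device-to-device transfer, avoiding a second pass over the
        # storage controllers' PCI buses.
        peer_flash[key] = self.recalculate()

peer = {}
DeviceController({0: b"a" * 100, 1: b"b" * 100}).transfer_to(peer, "mirror-0")
```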
The storage device controllers 119A-D may include mechanisms for implementing high availability primitives for use by other components of the storage system external to the dual-PCI storage device 118. For example, a reservation or exclusion primitive may be provided such that in a storage system having two storage controllers providing highly available storage services, one storage controller may prevent another storage controller from accessing or continuing to access the storage device. For example, this may be used in situations where one controller detects that the other controller is not functioning properly or where the interconnect between two storage controllers itself may not function properly.
In one embodiment, a storage system for use with dual PCI direct mapped storage devices with individually addressable fast write storage includes a system that manages erase blocks or groups of erase blocks as allocation units for storing data on behalf of a storage service, for storing metadata (e.g., indexes, logs, etc.) associated with the storage service, or for the proper management of the storage system itself. Flash pages, which may be several kilobytes in size, may be written when data arrives or when the storage system will hold the data for a long interval (e.g., exceeding a defined time threshold). To commit data faster, or to reduce the number of writes to the flash memory devices, the storage controllers may first write data to the individually addressable fast write storage on one or more storage devices.
In one embodiment, the storage controllers 125a, 125b may initiate the use of erase blocks within and across storage devices (e.g., 118) according to the age and expected remaining life of the storage devices, or based on other statistics. The storage controllers 125a, 125b may initiate garbage collection and data migration between storage devices based on pages that are no longer needed, as well as to manage flash page and erase block lifetimes and to manage overall system performance.
In one embodiment, storage system 124 may utilize a mirroring and/or erasure coding scheme as part of storing data into an addressable fast write storage and/or as part of writing data into allocation units associated with an erase block. Erasure codes can be used across storage devices, within erase blocks or allocation units, or within and across flash memory devices on a single storage device to provide redundancy against single or multiple storage device failures or to prevent internal corruption of flash memory pages caused by flash memory operations or by flash memory cell degradation. Mirroring and erasure coding at various levels can be used to recover from multiple types of failures occurring alone or in combination.
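As one concrete and deliberately minimal illustration of erasure coding across devices, a single-parity scheme can tolerate the loss of any one device; real systems layer schemes such as this, together with mirroring, at several levels.

```python
from functools import reduce

def xor_blocks(blocks):
    # Byte-wise XOR of equally sized blocks.
    return bytes(reduce(lambda a, b: a ^ b, chunk) for chunk in zip(*blocks))

def encode_stripe(data_blocks):
    """RAID-5-style single parity: each data block lives on its own device,
    plus one parity block on another device."""
    return data_blocks + [xor_blocks(data_blocks)]

def recover_lost(stripe, lost_index):
    # Any single missing block is the XOR of the surviving blocks.
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)

stripe = encode_stripe([b"\x01\x02", b"\x0f\x10", b"\xaa\xbb"])
assert recover_lost(stripe, 1) == b"\x0f\x10"
```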
The embodiments depicted with reference to fig. 2A-G illustrate a storage cluster storing user data, such as user data originating from one or more users or client systems or other sources external to the storage cluster. The storage clusters distribute user data across storage nodes housed within a chassis or across multiple chassis using erasure coding and redundant copies of metadata. Erasure coding refers to a method of data protection or reconstruction in which data is stored across a set of different locations (e.g., disks, storage nodes, or geographic locations). Flash memory is one type of solid state memory that may be integrated with embodiments, but embodiments may be extended to other types of solid state memory or other storage media, including non-solid state memory. Control of storage locations and workloads is distributed across storage locations in a clustered peer-to-peer system. Tasks such as mediating communications between various storage nodes, detecting when a storage node becomes unavailable, and balancing I/O (input and output) across various storage nodes are all handled on a distributed basis. In some embodiments, data is laid out or distributed across multiple storage nodes in data segments or stripes that support data recovery. Ownership of data may be reassigned within a cluster independent of input and output patterns. This architecture, described in more detail below, allows storage nodes in the cluster to fail while the system remains operational because data can be reconstructed from other storage nodes and thus remain available for input and output operations. In various embodiments, a storage node may be referred to as a cluster node, a blade, or a server.
The storage clusters may be housed within a chassis (i.e., a housing that houses one or more storage nodes). A mechanism (e.g., a power distribution bus) to provide power to each storage node and a communication mechanism (e.g., a communication bus to enable communication between storage nodes) are included within the chassis. According to some embodiments, the storage clusters may operate as stand-alone systems in one location. In one embodiment, the chassis houses at least two instances of both the power distribution and communication buses that may be individually enabled or disabled. The internal communication bus may be an ethernet bus; however, other technologies such as PCIe, infiniband, and others are equally suitable. The chassis provides ports for an external communication bus for communication between multiple chassis and with the client systems, either directly or through a switch. External communications may use technologies such as ethernet, infiniband, fibre channel, etc. In some embodiments, the external communication bus uses different communication bus technologies for inter-chassis and client communications. If a switch is deployed within a chassis or between chassis, the switch may be used as a translation between multiple protocols or technologies. When multiple chassis are connected to define a storage cluster, the storage cluster may be accessed by a client using a proprietary interface or a standard interface (e.g., network file system ('NFS'), common internet file system ('CIFS'), small computer system interface ('SCSI'), or hypertext transfer protocol ('HTTP')). Translation from the client protocol may occur at the switch, at the chassis external communication bus, or within each storage node. In some embodiments, multiple chassis may be coupled or connected to each other through an aggregator switch. A portion and/or all of the coupled or connected chassis may be designated as a storage cluster. As discussed above, each chassis may have multiple blades, each with a media access control ('MAC') address, but in some embodiments the storage cluster appears to the external network as having a single cluster IP address and a single MAC address.
Each storage node may be one or more storage servers, and each storage server is connected to one or more non-volatile solid-state memory units, which may be referred to as storage units or storage devices. One embodiment includes a single storage server in each storage node and between 1-8 non-volatile solid-state memory units, however, this one example is not meant to be limiting. The storage server may include processors, DRAMs, and interfaces for internal communication buses and power distribution for each of the power buses. In some embodiments, the interface and storage units share a communication bus, such as PCI express, within the storage node. The non-volatile solid state memory unit may directly access the internal communication bus interface through the storage node communication bus or request the storage node to access the bus interface. The non-volatile solid-state memory unit contains an embedded CPU, a solid-state memory controller, and a number of solid-state mass storage devices, for example, between 2 and 32 terabytes ('TB') in some embodiments. Embedded volatile storage media, such as DRAM, and energy storage devices are included in non-volatile solid state memory cells. In some embodiments, the energy reserve device is a capacitor, super capacitor, or battery that enables transfer of a subset of the DRAM content to a stable storage medium in the event of a loss of power. In some embodiments, the nonvolatile solid state memory cells are configured with storage class memory, such as phase change or magnetoresistive random access memory ('MRAM') that replaces DRAM and enables a reduced power retention device.
One of the many features of storage nodes and non-volatile solid state storage devices is the ability to actively reconstruct data in a storage cluster. The storage nodes and non-volatile solid state storage devices may determine when a storage node or non-volatile solid state storage device in a storage cluster is unreachable, independent of whether an attempt is made to read data related to that storage node or non-volatile solid state storage device. The storage nodes and nonvolatile solid state storage then cooperate to recover and reconstruct data in at least a portion of the new locations. This constitutes an active rebuild in that the system does not need to wait until a read access initiated from a client system employing the storage cluster requires data to be rebuilt. These and additional details of the memory and its operation are discussed below.
FIG. 2A is a perspective view of a storage cluster 161 having a plurality of storage nodes 150 and internal solid state memory coupled to each storage node to provide a network attached storage or storage area network, according to some embodiments. The network attached storage, storage area network, or storage cluster, or other storage memory, may include one or more storage clusters 161, each having one or more storage nodes 150, arranging both physical components and the amount of storage memory provided thereby in a flexible and reconfigurable manner. Storage clusters 161 are designed to fit in racks, and one or more racks may be set up and filled as needed for storage of memory. Storage cluster 161 has a chassis 138 with a plurality of slots 142. It should be appreciated that the chassis 138 may be referred to as a shell, housing, or rack unit. In one embodiment, the chassis 138 has fourteen slots 142, although other numbers of slots may be readily designed. For example, some embodiments have 4 slots, 8 slots, 16 slots, 32 slots, or other suitable number of slots. In some embodiments, each slot 142 may house one storage node 150. The chassis 138 includes tabs 148 that may be used to mount the chassis 138 to a rack. Fan 144 provides air circulation for cooling storage node 150 and its components, although other cooling components may be used, or embodiments without cooling components may be designed. The switch fabric 146 couples storage nodes 150 within the chassis 138 together and to a network for communication with memory. In the embodiment depicted herein, for illustrative purposes, the slots 142 to the left of the switch fabric 146 and fans 144 are shown occupied by storage nodes 150, while the slots 142 to the right of the switch fabric 146 and fans 144 are empty and available for insertion of storage nodes 150. This configuration is one example, and in various additional arrangements, one or more storage nodes 150 may occupy the slot 142. In some embodiments, the storage node arrangement need not be sequential or contiguous. Storage node 150 is hot pluggable, meaning that storage node 150 may be inserted into slot 142 in chassis 138 or removed from slot 142 without having to stop or power down the system. After insertion or removal of storage node 150 from slot 142, the system automatically reconfigures to recognize and accommodate the change. In some embodiments, reconfiguring includes re-storing redundancy and/or re-balancing data or loads.
Each storage node 150 may have multiple components. In the embodiment shown here, the storage node 150 includes a printed circuit board 159 populated by the CPU 156 (i.e., processor), memory 154 coupled to the CPU 156, and non-volatile solid state storage 152 coupled to the CPU 156, although in other embodiments other mounts and/or components may be used. The memory 154 has instructions executed by the CPU 156 and/or data operated on by the CPU 156. As further explained below, the non-volatile solid-state storage 152 includes flash memory, or in further embodiments, other types of solid-state memory.
Referring to FIG. 2A, storage cluster 161 is scalable, meaning that storage capacity with non-uniform storage sizes can be easily added, as described above. One or more storage nodes 150 may be inserted into or removed from each chassis, and in some embodiments the storage cluster is self-configuring. The plug-in storage nodes 150, whether installed in the chassis at the time of delivery or added later, may be of different sizes. For example, in one embodiment, a storage node 150 may have any multiple of 4TB, such as 8TB, 12TB, 16TB, 32TB, and so on. In further embodiments, a storage node 150 may have any multiple of other storage amounts or capacities. The storage capacity of each storage node 150 is broadcast and influences decisions about how to stripe the data. For maximum storage efficiency, an embodiment may self-configure stripes as wide as possible, subject to a predetermined requirement of continued operation with the loss of up to one, or up to two, non-volatile solid state storage units 152 or storage nodes 150 within the chassis.
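One way to read the self-configuration rule is as a small optimization: pick the widest stripe whose redundancy still covers the configured number of tolerated unit losses. The parameters and the cap below are assumptions, not values prescribed by the text.

```python
def widest_stripe(total_units: int, tolerated_losses: int, max_width: int = 16):
    """Choose data and redundancy shard counts: as wide as the chassis allows
    while still surviving `tolerated_losses` lost storage units or nodes."""
    width = min(total_units, max_width)
    data_shards = width - tolerated_losses
    if data_shards < 1:
        raise ValueError("not enough units for the requested fault tolerance")
    return data_shards, tolerated_losses

# e.g. ten blades, tolerate two lost units: 8 data shards + 2 redundancy shards
print(widest_stripe(total_units=10, tolerated_losses=2))
```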
Fig. 2B is a block diagram showing a communication interconnect 173 and a power distribution bus 172 coupling a plurality of storage nodes 150. Referring back to fig. 2A, in some embodiments, the communication interconnect 173 may be included in or implemented with the switch fabric 146. In some embodiments in which multiple storage clusters 161 occupy a rack, the communication interconnect 173 may be included in or implemented with a top of rack switch. As illustrated in FIG. 2B, storage cluster 161 is enclosed within a single chassis 138. External port 176 is coupled to storage nodes 150 through communication interconnect 173, while external port 174 is coupled directly to a storage node. An external power port 178 is coupled to the power distribution bus 172. Storage nodes 150 may include different amounts and different capacities of non-volatile solid-state storage 152, as described with reference to fig. 2A. Additionally, one or more storage nodes 150 may be compute-only storage nodes, as illustrated in fig. 2B. Authorities 168 are implemented on the non-volatile solid state storage 152, for example as lists or other data structures stored in memory. In some embodiments, the authorities are stored within the non-volatile solid state storage 152 and are supported by software executing on a controller or other processor of the non-volatile solid state storage 152. In another embodiment, authorities 168 are implemented on the storage node 150, for example as lists or other data structures stored in the memory 154 and supported by software executing on the CPU 156 of the storage node 150. In some embodiments, the authorities 168 control how and where data is stored in the non-volatile solid state storage 152. This control assists in determining which type of erasure coding scheme is applied to the data and which storage nodes 150 have which portions of the data. Each authority 168 may be assigned to a non-volatile solid state storage 152. In various embodiments, each authority may control a range of index node (inode) numbers, segment numbers, or other data identifiers assigned to data by a file system, by the storage nodes 150, or by the non-volatile solid state storage 152.
In some embodiments, every piece of data and every piece of metadata has redundancy in the system. In addition, every piece of data and every piece of metadata has an owner, which may be referred to as an authority. If that authority is unreachable, for example through failure of a storage node, there is a plan of succession for how to find that data or that metadata. In various embodiments, there are redundant copies of authorities 168. In some embodiments, authorities 168 have a relationship to storage nodes 150 and non-volatile solid state storage 152. Each authority 168, covering a range of data segment numbers or other identifiers of the data, may be assigned to a specific non-volatile solid state storage 152. In some embodiments, the authorities 168 for all of such ranges are distributed over the non-volatile solid state storage 152 of a storage cluster. Each storage node 150 has a network port that provides access to the non-volatile solid state storage 152 of that storage node 150. Data may be stored in a segment, which in some embodiments is associated with a segment number, and that segment number is an indirection for a configuration of a RAID (redundant array of independent disks) stripe. The assignment and use of the authorities 168 thus establishes an indirection to data. According to some embodiments, indirection may be referred to as the ability to reference data indirectly, in this case via an authority 168. A segment identifies a set of non-volatile solid state storage 152 and a local identifier into the set of non-volatile solid state storage 152 that may contain data. In some embodiments, the local identifier is an offset into the device and may be reused sequentially by multiple segments. In other embodiments, the local identifier is unique to a specific segment and is never reused. The offsets in the non-volatile solid state storage 152 are applied to locating data for writing to or reading from the non-volatile solid state storage 152 (in the form of a RAID stripe). Data is striped across multiple units of non-volatile solid state storage 152, which may include or be different from the non-volatile solid state storage 152 having the authority 168 for a particular data segment.
If there is a change in where a particular segment of data is located, for example during a data move or a data reconstruction, the authority 168 for that data segment should be consulted, at the non-volatile solid state storage 152 or storage node 150 having that authority 168. In order to locate a particular piece of data, embodiments calculate a hash value for a data segment or apply an inode number or a data segment number. The output of this operation points to the non-volatile solid state storage 152 having the authority 168 for that particular piece of data. In some embodiments there are two stages to this operation. The first stage maps an entity identifier (ID), e.g. a segment number, inode number, or directory number, to an authority identifier. This mapping may include a calculation such as a hash or a bit mask. The second stage maps the authority identifier to a particular non-volatile solid state storage 152, which may be done through an explicit mapping. The operation is repeatable, so that when the calculation is performed, the result of the calculation repeatably and reliably points to the particular non-volatile solid state storage 152 having that authority 168. The operation may include the set of reachable storage nodes as input. If the set of reachable non-volatile solid state storage units changes, the optimal set changes. In some embodiments, the value that is persisted is the current assignment (which is always true), and the calculated value is the target assignment the cluster will attempt to reconfigure towards. This calculation may be used to determine the optimal non-volatile solid state storage 152 for an authority in the presence of a set of non-volatile solid state storage 152 that are reachable and constitute the same cluster. The calculation also determines an ordered set of peer non-volatile solid state storage 152 that will also record the authority to non-volatile solid state storage mapping, so that the authority may be determined even if the assigned non-volatile solid state storage is unreachable. In some embodiments, a duplicate or substitute authority 168 may be consulted if a specific authority 168 is unavailable.
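The two-stage lookup described above can be illustrated with a short Python sketch. The hash function, bit mask width, and explicit mapping table below are hypothetical choices made only for illustration; the document leaves the exact calculation open:

    import hashlib

    AUTHORITY_COUNT = 128   # assumed number of authorities, for illustration only

    def entity_to_authority(entity_id: str) -> int:
        """Stage one: map an entity ID (segment, inode, or directory number)
        to an authority identifier using a hash and a bit mask."""
        digest = hashlib.sha256(entity_id.encode()).digest()
        return int.from_bytes(digest[:4], "big") & (AUTHORITY_COUNT - 1)

    def authority_to_storage(authority_id: int, explicit_map: dict, peers: dict):
        """Stage two: map the authority identifier to a particular storage unit
        through an explicit mapping, falling back to an ordered peer list if
        the assigned unit is unreachable."""
        primary = explicit_map[authority_id]
        if primary.get("reachable", True):
            return primary["unit"]
        for peer in peers.get(authority_id, []):
            if peer.get("reachable", True):
                return peer["unit"]
        raise RuntimeError("no reachable storage unit records this authority")

    # Example usage with a made-up explicit mapping of authorities to units.
    explicit = {a: {"unit": f"nvss-{a % 8}", "reachable": True} for a in range(AUTHORITY_COUNT)}
    aid = entity_to_authority("inode:42")
    print(aid, authority_to_storage(aid, explicit, peers={}))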
Referring to fig. 2A and 2B, two of the many tasks of the CPU 156 on a storage node 150 are to break up write data and to reassemble read data. When the system has determined that data is to be written, the authority 168 for that data is located as described above. When the segment ID for the data has already been determined, the request to write is forwarded to the non-volatile solid state storage 152 currently determined to be the host of the authority 168 determined from the segment. The host CPU 156 of the storage node 150, on which the non-volatile solid state storage 152 and corresponding authority 168 reside, then breaks up or shards the data and transmits the data out to the various non-volatile solid state storage 152. The transmitted data is written as a data stripe in accordance with an erasure coding scheme. In some embodiments, data is requested to be pulled, and in other embodiments, data is pushed. Conversely, when data is read, the authority 168 for the segment ID containing the data is located as described above. The host CPU 156 of the storage node 150, on which the non-volatile solid state storage 152 and corresponding authority 168 reside, requests the data from the non-volatile solid state storage and corresponding storage nodes pointed to by the authority. In some embodiments, the data is read from flash storage as a data stripe. The host CPU 156 of the storage node 150 then reassembles the read data, correcting any errors (if present) according to the appropriate erasure coding scheme, and forwards the reassembled data to the network. In further embodiments, some or all of these tasks may be handled in the non-volatile solid state storage 152. In some embodiments, the segment host requests the data be sent to storage node 150 by requesting pages from storage and then sending the data to the storage node making the original request.
In some systems, for example in UNIX-style file systems, data is handled with an index node or inode, which specifies a data structure that represents an object in the file system. The object could be a file or a directory, for example. Metadata may accompany the object, as attributes such as permission data and a creation timestamp, among other attributes. A segment number could be assigned to all or a portion of such an object in a file system. In other systems, data segments are handled with a segment number assigned elsewhere. For purposes of discussion, the unit of distribution is an entity, and an entity can be a file, a directory, or a segment. That is, an entity is a unit of data or metadata stored by a storage system. Entities are grouped into sets called authorities. Each authority has an authority owner, which is a storage node that has the exclusive right to update the entities in the authority. In other words, a storage node contains the authority, and the authority, in turn, contains entities.
A segment is a logical container of data in accordance with some embodiments. A segment is an address space between the medium address space and the physical flash locations, i.e., the data segment number is in this address space. Segments may also contain metadata, which enables data redundancy to be restored (rewritten to different flash locations or devices) without the involvement of higher level software. In one embodiment, the internal format of a segment contains client data and medium mappings to determine the position of that data. Where applicable, each data segment is protected, e.g., against memory and other failures, by breaking the segment into a number of data and parity shards. The data and parity shards are distributed, i.e., striped, across the non-volatile solid state storage 152 coupled to the host CPUs 156 in accordance with an erasure coding scheme (see fig. 2E and 2G). In some embodiments, use of the term segment refers to the container and its place in the address space of segments. According to some embodiments, use of the term stripe refers to the same set of shards as a segment and includes how the shards are distributed along with redundancy or parity information.
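As a minimal sketch of how a segment could be broken into data and parity shards, the Python below uses a single XOR parity shard; this only illustrates the striping idea and is not the specific erasure code used by the described system (which could, for example, tolerate two losses rather than one):

    def shard_segment(segment: bytes, data_shards: int):
        """Split a segment into `data_shards` equal pieces plus one XOR parity
        shard. Any single missing data shard can be rebuilt from the rest."""
        size = -(-len(segment) // data_shards)              # ceiling division
        padded = segment.ljust(size * data_shards, b"\0")
        shards = [padded[i * size:(i + 1) * size] for i in range(data_shards)]
        parity = bytearray(size)
        for shard in shards:
            for i, byte in enumerate(shard):
                parity[i] ^= byte
        return shards + [bytes(parity)]

    def rebuild_missing(shards, missing_index: int):
        """Recover one missing data shard by XOR-ing the surviving shards."""
        size = len(next(s for s in shards if s is not None))
        rebuilt = bytearray(size)
        for idx, shard in enumerate(shards):
            if idx == missing_index or shard is None:
                continue
            for i, byte in enumerate(shard):
                rebuilt[i] ^= byte
        return bytes(rebuilt)

    # Example: lose one data shard and rebuild it from the survivors.
    shards = shard_segment(b"client data for one segment", 4)
    damaged = shards[:]
    damaged[1] = None                       # simulate a lost storage unit
    assert rebuild_missing(damaged, 1) == shards[1]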
A series of address space transformations takes place across the entire storage system. At the top are directory entries (file names) which link to an inode. Inodes point into a medium address space, where data is logically stored. Medium addresses may be mapped through a series of indirect mediums to spread the load of large files, or to implement data services such as deduplication or snapshots. Segment addresses are then translated into physical flash locations. According to some embodiments, physical flash locations have an address range bounded by the amount of flash in the system. Medium addresses and segment addresses are logical containers, and in some embodiments use a 128 bit or larger identifier so as to be practically infinite, with the likelihood of reuse calculated as longer than the expected life of the system. In some embodiments, addresses from the logical containers are allocated in a hierarchical fashion. Initially, each non-volatile solid state storage unit 152 may be assigned a range of address space. Within this assigned range, the non-volatile solid state storage 152 is able to allocate addresses without synchronization with other non-volatile solid state storage 152.
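A compact Python sketch of this chain of translations, with made-up table contents purely for illustration, might look like the following, where each lookup layer is simply a dictionary:

    # Hypothetical translation tables illustrating the chain:
    # directory entry -> inode -> medium address -> segment -> physical flash location.

    directory = {"/data/report.txt": 42}                 # file name -> inode number
    inodes = {42: ("medium", 0x1000)}                    # inode -> medium address
    medium_to_segment = {0x1000: ("segment", 7, 0x200)}  # medium addr -> (segment, offset)
    segment_table = {7: [("unit-3", 0x9000), ("unit-5", 0x9000)]}  # segment -> flash locations

    def resolve(path: str):
        """Walk the address-space transformations for one file path."""
        inode = directory[path]
        _, medium_addr = inodes[inode]
        _, segment, offset = medium_to_segment[medium_addr]
        locations = segment_table[segment]
        return [(unit, base + offset) for unit, base in locations]

    print(resolve("/data/report.txt"))   # e.g. [('unit-3', 0x9200), ('unit-5', 0x9200)]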
Data and metadata are stored by a set of underlying storage layouts that are optimized for varying workload patterns and storage devices. These layouts incorporate multiple redundancy schemes, compression formats, and index algorithms. Some of these layouts store information about authorities and authority masters, while others store file metadata and file data. The redundancy schemes include error correction codes that tolerate corrupted bits within a single storage device (such as a NAND flash chip), erasure codes that tolerate the failure of multiple storage nodes, and replication schemes that tolerate data center or regional failures. In some embodiments, a low density parity check ('LDPC') code is used within a single storage unit. In some embodiments, reed-Solomon (Reed-Solomon) encoding is used within a storage cluster, and mirroring is used within a storage grid. Metadata may be stored using an ordered log-structured index (such as a log-structured merge tree), while large data may not be stored in a log-structured layout.
To maintain consistency across multiple copies of an entity, the storage nodes agree implicitly, through calculation, on two things: (1) the authority that contains the entity, and (2) the storage node that contains the authority. The assignment of entities to authorities can be done by pseudorandomly assigning entities to authorities, by splitting entities into ranges based upon an externally produced key, or by placing a single entity into each authority. Examples of pseudorandom schemes are linear hashing and the replication under scalable hashing ('RUSH') family of hashes, including controlled replication under scalable hashing ('CRUSH'). In some embodiments, pseudorandom assignment is used only for assigning authorities to nodes, because the set of nodes can change. The set of authorities cannot change, so any subjective function may be applied in these embodiments. Some placement schemes automatically place authorities on storage nodes, while other placement schemes rely on an explicit mapping of authorities to storage nodes. In some embodiments, a pseudorandom scheme is used to map from each authority to a set of candidate authority owners. A pseudorandom data distribution function related to CRUSH may assign authorities to storage nodes and create a list of where the authorities are assigned. Each storage node has a copy of the pseudorandom data distribution function and can arrive at the same calculation for distributing, and later finding or locating, an authority. In some embodiments, each of the pseudorandom schemes requires the set of reachable storage nodes as input in order to arrive at the same target nodes. Once an entity has been placed in an authority, the entity may be stored on physical devices so that no expected failure will lead to unexpected data loss. In some embodiments, rebalancing algorithms attempt to store the copies of all entities within an authority in the same layout and on the same set of machines.
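As a hedged illustration of a pseudorandom placement function that takes the set of reachable storage nodes as input and yields a repeatable, ordered list of candidate authority owners, the following Python sketch uses rendezvous (highest-random-weight) hashing; the described system is said only to use a CRUSH-related function, so treat this as an analogy rather than the actual algorithm:

    import hashlib

    def candidate_owners(authority_id: int, reachable_nodes: list, copies: int = 3):
        """Deterministically rank reachable nodes for one authority; every node
        running the same calculation arrives at the same ordered list."""
        def weight(node: str) -> int:
            key = f"{authority_id}:{node}".encode()
            return int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
        ranked = sorted(reachable_nodes, key=weight, reverse=True)
        return ranked[:copies]

    nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]
    print(candidate_owners(authority_id=17, reachable_nodes=nodes))
    # Removing a node from the reachable set only re-ranks the survivors, so
    # assignments to nodes that remain reachable stay stable.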
Examples of expected failures include device failures, stolen machines, data center fires, and regional disasters, such as nuclear or geological events. Different failures lead to different levels of acceptable data loss. In some embodiments, a stolen storage node affects neither the security nor the reliability of the system, while depending on system configuration, a regional event could lead to no loss of data, a few seconds or minutes of lost updates, or even complete data loss.
In an embodiment, the placement of redundancy for data storage is independent of the placement of authorities for data consistency. In some embodiments, storage nodes that contain authorities do not contain any persistent storage. Instead, they are connected to non-volatile solid state storage units that do not contain authorities. The communications interconnect between storage nodes and non-volatile solid state storage units consists of multiple communication technologies and has non-uniform performance and fault tolerance characteristics. In some embodiments, as mentioned above, non-volatile solid state storage units are connected to storage nodes via PCI Express, storage nodes are connected together within a single chassis using an Ethernet backplane, and chassis are connected together to form a storage cluster. In some embodiments, storage clusters are connected to clients using Ethernet or fibre channel. If multiple storage clusters are configured into a storage grid, the multiple storage clusters are connected using the Internet or other long-distance networking links (such as a "metro scale" link or a private link that does not traverse the Internet).
The owner of an authority has the exclusive right to modify entities, to migrate entities from one non-volatile solid state storage unit to another, and to add and remove copies of entities. This allows the redundancy of the underlying data to be maintained. When an authority owner fails, is going to be decommissioned, or is overloaded, the authority is transferred to a new storage node. Transient failures make it important to ensure that all non-faulty machines agree upon the new location of the authority. The ambiguity that arises due to transient failures can be resolved automatically by a consensus protocol (such as Paxos), by hot-warm failover schemes, via manual intervention by a remote system administrator, or by a local hardware administrator (such as by physically removing the failed machine from the cluster, or pressing a button on the failed machine). In some embodiments, a consensus protocol is used, and failover is automatic. According to some embodiments, if too many failures or replication events occur in too short a time period, the system enters a self-preservation mode and halts replication and data movement activities until an administrator intervenes.
As authorities are transferred between storage nodes, and as authority owners update entities in their authorities, the system transfers messages between the storage nodes and the non-volatile solid state storage units. With regard to persistent messages, messages that have different purposes are of different types. Depending on the type of the message, the system maintains different ordering and durability guarantees. As the persistent messages are being processed, the messages are temporarily stored in multiple durable and non-durable storage hardware technologies. In some embodiments, messages are stored in RAM, NVRAM, and on NAND flash devices, and a variety of protocols are used in order to make efficient use of each storage medium. Latency-sensitive client requests may be persisted in replicated NVRAM, and then later in NAND, while background rebalancing operations are persisted directly to NAND.
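A minimal sketch of routing persistent messages to storage tiers by type might look like the following in Python; the message types and tier names are invented for illustration and do not come from the document:

    from enum import Enum, auto

    class MessageType(Enum):
        LATENCY_SENSITIVE_CLIENT_REQUEST = auto()
        BACKGROUND_REBALANCE = auto()

    def persist_message(msg_type: MessageType, payload: bytes, nvram_replicas, nand):
        """Route a message to the storage technologies appropriate to its type."""
        if msg_type is MessageType.LATENCY_SENSITIVE_CLIENT_REQUEST:
            # Hold in replicated NVRAM first; destage to NAND later.
            for replica in nvram_replicas:
                replica.append(payload)
        else:
            # Background work is persisted straight to NAND.
            nand.append(payload)

    # Example with plain lists standing in for the storage media.
    nvram = [[], [], []]        # e.g. three mirrored NVRAM partitions
    flash = []
    persist_message(MessageType.LATENCY_SENSITIVE_CLIENT_REQUEST, b"write", nvram, flash)
    persist_message(MessageType.BACKGROUND_REBALANCE, b"rebalance", nvram, flash)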
Persistent messages are persistently stored prior to being transmitted. This allows the system to continue to serve client requests despite failures and component replacement. Although many hardware components contain unique identifiers that are visible to the system administrator, manufacturer, hardware supply chain, and ongoing monitoring quality control infrastructure, applications running on top of the infrastructure address virtualized addresses. These virtualized addresses do not change over the lifetime of the storage system, regardless of component failures and replacements. This allows each component of the storage system to be replaced over time without reconfiguration or disruption of client request processing, i.e., the system supports non-disruptive upgrades.
In some embodiments, virtualized addresses are stored with sufficient redundancy. The continuous monitoring system correlates hardware and software status with a hardware identifier. This allows for detection and prediction of faults due to faulty components and manufacturing details. In some embodiments, the monitoring system is also capable of proactively transferring authorization and entities away from the affected devices prior to failure by removing components from the critical path.
FIG. 2C is a multi-level block diagram showing the contents of storage node 150 and the contents of non-volatile solid state storage 152 of storage node 150. In some embodiments, data is transferred to storage node 150 and from storage node 150 by a network interface controller ('NIC') 202. Each storage node 150 has a CPU 156 and one or more non-volatile solid state storage devices 152, as discussed above. Moving one level down in fig. 2C, each non-volatile solid-state storage 152 has relatively fast non-volatile solid-state memory, such as non-volatile random access memory ('NVRAM') 204 and flash memory 206. In some embodiments, the NVRAM 204 may be a component (DRAM, MRAM, PCM) that does not require a program/erase cycle, and may be memory capable of supporting writing much more frequently than reading from memory. Moving down another level in fig. 2C, in one embodiment, NVRAM 204 is implemented as high-speed volatile memory backed up by energy reserve 218, such as Dynamic Random Access Memory (DRAM) 216. The energy reserve 218 provides sufficient power to keep the DRAM 216 powered on long enough to transfer content to the flash memory 206 in the event of a power failure. In some embodiments, the energy reserve 218 is a capacitor, ultracapacitor, battery, or other device that supplies a suitable energy supply sufficient to enable the contents of the DRAM 216 to be transferred to a stable storage medium in the event of a loss of power. The flash memory 206 is implemented as a plurality of flash dies 222, which may be referred to as a flash die 222 package or an array of flash dies 222. It should be appreciated that the flash memory die 222 may be packaged in any number of ways, with a single die per package, multiple dies per package (i.e., multi-chip packages), in a hybrid package, as bare dies on a printed circuit board or other substrate, as encapsulated dies, etc. In the embodiment shown, the non-volatile solid-state storage 152 has a controller 212 or other processor and an input output (I/O) port 210 coupled to the controller 212. The I/O port 210 is coupled to the CPU 156 and/or the network interface controller 202 of the flash storage node 150. A flash input output (I/O) port 220 is coupled to a flash die 222, and a direct memory access unit (DMA) 214 is coupled to the controller 212, the DRAM 216, and the flash die 222. In the embodiment shown, I/O ports 210, controller 212, DMA unit 214, and flash I/O ports 220 are implemented on a programmable logic device ('PLD') 208, such as a Field Programmable Gate Array (FPGA). In this embodiment, each flash die 222 has pages organized as 16kB (kilobyte) pages 224 and registers 226 through which data may be written to the flash die 222 or read from the flash die 222. In further embodiments, other types of solid state memory are used in place of or in addition to the flash memory illustrated within flash die 222.
In various embodiments as disclosed herein, the storage cluster 161 may generally be contrasted with a storage array. The storage nodes 150 are part of a collection that creates the storage cluster 161. Each storage node 150 owns a slice of data and the computing required to provide that data. Multiple storage nodes 150 cooperate to store and retrieve the data. Storage memory or storage devices, as generally used in storage arrays, are less involved with processing and manipulating the data. A storage memory or storage device in a storage array receives commands to read, write, or erase data. The storage memory or storage devices in a storage array are not aware of the larger system in which they are embedded, or what the data means. Storage memory or storage devices in a storage array may include various types of storage memory, such as RAM, solid state drives, hard drives, and so on. The storage units 152 described herein have multiple interfaces that are active simultaneously and serve multiple purposes. In some embodiments, some of the functionality of a storage node 150 is shifted into a storage unit 152, transforming the storage unit 152 into a combination of storage unit 152 and storage node 150. Placing computation (as opposed to just storing the data) into the storage unit 152 places this computation closer to the data itself. Various system embodiments have a hierarchy of storage node layers with different capabilities. By contrast, in a storage array, a controller owns and knows everything about all of the data that the controller manages in a shelf or storage devices. In a storage cluster 161, as described herein, multiple storage units 152 and/or multiple controllers in storage nodes 150 cooperate in various ways (e.g., for erasure coding, data sharding, metadata communication and redundancy, storage capacity expansion or contraction, data recovery, and so on).
Fig. 2D shows a storage server environment using an embodiment of storage node 150 and storage unit 152 of fig. 2A-C. In this version, each storage unit 152 has a processor, such as controller 212 (see fig. 2C), an FPGA (field programmable gate array), flash memory 206, and NVRAM 204 (which is a supercapacitor-backed DRAM 216, see fig. 2B and 2C) on a PCIe (peripheral component interconnect express) board in chassis 138 (see fig. 2A). The storage unit 152 may be implemented as a single board containing the storage devices and may be the largest tolerable fault domain inside the chassis. In some embodiments, up to two storage units 152 may fail and the device will continue without data loss.
In some embodiments, the physical storage is divided into named regions based on application usage. The NVRAM 204 is a contiguous block of memory reserved in the storage unit 152 DRAM 216, and is backed by NAND flash. The NVRAM 204 is logically divided into multiple memory regions, two of which are written as spools (e.g., spool_region). Space within the NVRAM 204 spools is managed by each authority 168 independently. Each device provides an amount of storage space to each authority 168. That authority 168 further manages lifetimes and allocations within that space. Examples of spools include distributed transactions or concepts. When primary power to a storage unit 152 fails, the onboard super-capacitors provide a short duration of power holdup. During this holdup interval, the contents of the NVRAM 204 are flushed to flash memory 206. On the next power-on, the contents of the NVRAM 204 are recovered from the flash memory 206.
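The per-authority management of NVRAM spool space, together with the flush-on-power-loss and restore-on-power-up behavior, could be sketched as follows in Python (the class and method names are invented for illustration):

    class NvramSpool:
        """Toy model of an NVRAM region: per-authority space that is flushed
        to flash on power loss and recovered on the next power-on."""

        def __init__(self, capacity_per_authority: int):
            self.capacity = capacity_per_authority
            self.regions = {}          # authority_id -> list of records (models DRAM)
            self.flash_copy = {}       # models the NAND backing store

        def allocate(self, authority_id: int):
            # Each authority independently manages its own spool space.
            self.regions.setdefault(authority_id, [])

        def append(self, authority_id: int, record: bytes):
            spool = self.regions[authority_id]
            if sum(len(r) for r in spool) + len(record) > self.capacity:
                raise MemoryError("spool full; authority must destage first")
            spool.append(record)

        def on_power_loss(self):
            # Super-capacitor holdup window: flush DRAM contents to flash.
            self.flash_copy = {a: list(recs) for a, recs in self.regions.items()}

        def on_power_on(self):
            # Recover NVRAM contents from flash.
            self.regions = {a: list(recs) for a, recs in self.flash_copy.items()}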
With respect to storage unit controllers, the responsibilities of the logical "controller" are distributed across each of the blades that contain the authority 168. This logic control distribution is shown in FIG. 2D as host controller 242, middle tier controller 244, and storage unit controller 246. The management of the control plane and the storage plane is treated independently, but the components may be physically co-located on the same blade. Each authority 168 effectively acts as an independent controller. Each authority 168 provides its own data and metadata structure, its own background workers, and maintains its own lifecycle.
FIG. 2E is a block diagram of blade 252 hardware showing control plane 254, compute and store planes 256, 258, and authorities 168 interacting with underlying physical resources in the storage server environment of FIG. 2D, using the embodiments of storage nodes 150 and storage units 152 of FIGS. 2A-C. The control plane 254 is partitioned into a number of authorities 168 that can run on any of the blades 252 using computing resources in the computing plane 256. The storage plane 258 is partitioned into a set of devices, each of which provides access to the flash memory 206 and NVRAM 204 resources. In one embodiment, the computing plane 256 may perform operations of a storage array controller on one or more devices of the storage plane 258 (e.g., a storage array), as described herein.
In the compute and store planes 256, 258 of fig. 2E, the authorities 168 interact with the underlying physical resources (i.e., devices). From the point of view of an authority 168, its resources are striped over all of the physical devices. From the point of view of a device, it provides resources to all authorities 168, wherever the authorities happen to run. Each authority 168 has allocated or has been allocated one or more partitions 260 of storage memory in the storage units 152, e.g., partitions 260 in flash memory 206 and NVRAM 204. Each authority 168 uses those allotted partitions 260 that belong to it for writing or reading user data. Authorities can be associated with differing amounts of physical storage of the system. For example, one authority 168 could have a larger number of partitions 260 or larger-sized partitions 260 in one or more storage units 152 than one or more other authorities 168.
FIG. 2F depicts elasticity software layers in blades 252 of a storage cluster, according to some embodiments. In the elasticity structure, the elasticity software is symmetric, i.e., each blade's computing module 270 runs the three identical layers of processes depicted in FIG. 2F. Storage managers 274 execute read and write requests from other blades 252 for data and metadata stored in the local storage unit 152 NVRAM 204 and flash memory 206. Authorities 168 fulfill client requests by issuing the necessary reads and writes to the blades 252 on whose storage units 152 the corresponding data or metadata resides. Endpoints 272 parse client connection requests received from the switch fabric 146 supervisory software, relay the client connection requests to the authorities 168 responsible for fulfillment, and relay the responses of the authorities 168 to the clients. The symmetric three-layer structure enables the storage system's high degree of concurrency. In these embodiments, elasticity laterally expands efficiently and reliably. In addition, elasticity implements a unique lateral expansion technique that balances work evenly across all resources regardless of client access pattern, and maximizes concurrency by eliminating much of the need for inter-blade coordination that typically occurs with conventional distributed locking.
Still referring to fig. 2F, the authorities 168 running in the computing modules 270 of the blades 252 perform the internal operations required to fulfill client requests. One feature of elasticity is that the authorities 168 are stateless, i.e., they cache active data and metadata in the DRAM of their own blades 252 for fast access, but each authority stores every update in its NVRAM 204 partitions on three separate blades 252 until the update has been written to the flash memory 206. In some embodiments, all storage system writes to NVRAM 204 are made in triplicate to partitions on three separate blades 252. With triple-mirrored NVRAM 204 and persistent storage protected by parity and Reed-Solomon RAID checksums, the storage system can survive concurrent failure of two blades 252 without losing data, metadata, or access to either.
Because the authorities 168 are stateless, they can migrate between blades 252. Each authority 168 has a unique identifier. In some embodiments, the partitions of NVRAM 204 and flash 206 are associated with the identifiers of the authorities 168, not with the blades 252 on which they are running. Thus, when an authority 168 migrates, the authority 168 continues to manage the same storage partitions from its new location. When a new blade 252 is installed in an embodiment of the storage cluster, the system automatically rebalances load by: partitioning the storage of the new blade 252 for use by the system's authorities 168, migrating selected authorities 168 to the new blade 252, starting endpoints 272 on the new blade 252, and including them in the client connection distribution algorithm of the switch fabric 146.
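The rebalancing step described above, migrating selected authorities onto a newly installed blade, could be sketched as follows (the load metric and selection policy here are simple placeholders, not the system's actual algorithm):

    def rebalance_on_new_blade(blade_authorities: dict, new_blade: str):
        """Move authorities from the most loaded blades onto a new, empty blade
        until the per-blade counts are roughly even."""
        blade_authorities[new_blade] = []
        total = sum(len(a) for a in blade_authorities.values())
        target = total // len(blade_authorities)
        while len(blade_authorities[new_blade]) < target:
            donor = max(blade_authorities, key=lambda b: len(blade_authorities[b]))
            if donor == new_blade or len(blade_authorities[donor]) <= target:
                break
            migrated = blade_authorities[donor].pop()
            blade_authorities[new_blade].append(migrated)   # authority keeps its identifier
        return blade_authorities

    cluster = {"blade-1": [1, 2, 3, 4], "blade-2": [5, 6, 7, 8]}
    print(rebalance_on_new_blade(cluster, "blade-3"))
    # e.g. {'blade-1': [1, 2, 3], 'blade-2': [5, 6, 7], 'blade-3': [4, 8]}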
From its new location, a migrated authority 168 persists the contents of its NVRAM 204 partitions on the flash memory 206, processes read and write requests from other authorities 168, and fulfills the client requests that endpoints 272 direct to it. Similarly, if a blade 252 fails or is removed, the system redistributes its authorities 168 among the remaining blades 252 of the system. The redistributed authorities 168 continue to perform their original functions from their new locations.
FIG. 2G depicts authorities 168 and storage resources in a blade 252 of a storage cluster, according to some embodiments. Each authority 168 is exclusively responsible for a partition of the flash memory 206 and NVRAM 204 on each blade 252. The authority 168 manages the content and integrity of its partitions independently of the other authorities 168. The authority 168 compresses incoming data and preserves it temporarily in its NVRAM 204 partitions, and then consolidates, RAID-protects, and persists the data in segments of the storage in its flash memory 206 partitions. As the authority 168 writes data to the flash memory 206, the storage manager 274 performs the necessary flash translation to optimize write performance and maximize media longevity. In the background, the authority 168 "garbage collects," or reclaims space occupied by data that clients have made obsolete by overwriting the data. It should be appreciated that because the partitions of the authorities 168 are disjoint, there is no need for distributed locking to execute client read and write operations or to perform background functions.
The embodiments described herein may utilize various software, communication, and/or networking protocols. In addition, the configuration of the hardware and/or software may be adapted to accommodate various protocols. For example, embodiments may utilize Active Directory, which is a database-based system that provides authentication, directory, policy, and other services in a WINDOWS™ environment. In these embodiments, LDAP (lightweight directory access protocol) is one example application protocol for querying and modifying items in directory service providers such as Active Directory. In some embodiments, a network lock manager ('NLM') is utilized as a facility that works in cooperation with the network file system ('NFS') to provide a System V style of advisory file and record locking over a network. The server message block ('SMB') protocol, one version of which is also known as the common internet file system ('CIFS'), may be integrated with the storage systems discussed herein. SMB operates as an application-layer network protocol typically used for providing shared access to files, printers, and serial ports, and miscellaneous communications between nodes on a network. SMB also provides an authenticated inter-process communication mechanism. AMAZON™ S3 (simple storage service) is a web service offered by Amazon Web Services, and the systems described herein may interface with Amazon S3 through web services interfaces (REST (representational state transfer), SOAP (simple object access protocol), and BitTorrent). A RESTful API (application programming interface) breaks down a transaction to create a series of small modules. Each module addresses a particular underlying part of the transaction. The controls or permissions provided with these embodiments, especially for object data, may include the utilization of an access control list ('ACL'). An ACL is a list of permissions attached to an object, and specifies which users or system processes are granted access to the object, as well as what operations are allowed on the given object. The systems may utilize internet protocol version 6 ('IPv6'), as well as IPv4, as the communication protocol that provides identification and location of computers on a network and routes traffic across the internet. The routing of packets between networked systems may include equal-cost multi-path routing ('ECMP'), which is a routing strategy in which next-hop packet forwarding to a single destination can occur over multiple "best paths" which tie for top place in routing metric calculations. Multi-path routing can be used in conjunction with most routing protocols, because it is a per-hop decision limited to a single router. The software may support multi-tenancy, which is an architecture in which a single instance of a software application serves multiple customers. Each customer may be referred to as a tenant. In some embodiments, tenants may be given the ability to customize some parts of the application, but may not be able to customize the application's code. Embodiments may maintain audit logs. An audit log is a document that records events in a computing system. In addition to documenting which resources were accessed, audit log entries typically include destination and source addresses, a timestamp, and user login information for compliance with various regulations. Embodiments may support various key management policies, such as encryption key rotation.
In addition, the system may support a dynamic root password or some variation of dynamically changing passwords.
Fig. 3A sets forth a diagram of a storage system 306 coupled for data communication with a cloud service provider 302 according to some embodiments of the present disclosure. Although depicted in less detail, the storage system 306 depicted in fig. 3A may be similar to the storage system described above with reference to fig. 1A-1D and 2A-2G. In some embodiments, the storage system 306 depicted in fig. 3A may be embodied as a storage system including unbalanced active/active controllers, a storage system including balanced active/active controllers, a storage system including active/active controllers (where not all resources of each controller are utilized such that each controller has reserved resources available to support failover), a storage system including fully active/active controllers, a storage system including data set isolation controllers, a storage system including a dual layer architecture having front-end controllers and back-end integrated storage controllers, a storage system including a laterally expanding cluster of dual controller arrays, and combinations of such embodiments.
In the example depicted in fig. 3A, storage system 306 is coupled to cloud service provider 302 via data communication link 304. The data communication link 304 may be embodied as a dedicated data communication link, a data communication path provided through the use of one or more data communication networks, such as a wide area network ('WAN') or a local area network ('LAN'), or some other mechanism capable of transmitting digital information between the storage system 306 and the cloud service provider 302. This data communication link 304 may be entirely wired, entirely wireless, or some aggregation of wired and wireless data communication paths. In this example, digital information may be exchanged between storage system 306 and cloud service provider 302 via data communication link 304 using one or more data communication protocols. For example, digital information may be exchanged with the cloud service provider 302 via the data communication link 304 using a handheld device transport protocol ('HDTP'), hypertext transport protocol ('HTTP'), internet protocol ('IP'), real-time transport protocol ('RTP'), transmission control protocol ('TCP'), user datagram protocol ('UDP'), wireless application protocol ('WAP'), or other protocol.
For example, the cloud service provider 302 depicted in fig. 3A may be embodied as a system and computing environment that provides services to users of the cloud service provider 302 by sharing computing resources via the data communication link 304. Cloud service provider 302 may provide on-demand access to a pool of shared configurable computing resources (e.g., computer networks, servers, storage, applications, and services, etc.). The shared configurable resource pool may be quickly built and released to users of cloud service provider 302 with minimal management effort. In general, the user of cloud service provider 302 is unaware of the exact computing resources used by cloud service provider 302 to provide the service. Although in many cases this cloud service provider 302 may be accessed via the internet, readers of skill in the art will recognize that any system that abstracts the use of shared resources to provide services to users over any data communication link may be considered a cloud service provider 302.
In the example depicted in fig. 3A, cloud service provider 302 may be configured to provide various services to storage system 306 and users of storage system 306 by implementing various service models. For example, cloud service provider 302 may be configured to provide services to storage system 306 and users of storage system 306 by implementing an infrastructure as a service ('IaaS') service model, wherein cloud service provider 302 provides computing infrastructure (e.g., virtual machines and other resources) as services to subscribers. In addition, cloud service provider 302 may be configured to provide services to storage system 306 and users of storage system 306 through an implementation platform as a service ('PaaS') service model, where cloud service provider 302 provides a development environment to application developers. For example, such a development environment may include an operating system, a programming language execution environment, a database, a web server, or other components that may be used by application developers to develop and run software solutions on a cloud platform. Further, cloud service provider 302 may be configured to provide services to storage system 306 and users of storage system 306 by implementing a software as a service ('SaaS') service model, wherein cloud service provider 302 provides application software, databases, and a platform for running applications to storage system 306 and users of storage system 306, thereby providing on-demand software to storage system 306 and users of storage system 306 and eliminating the need to install and run applications on local computers, which may simplify maintenance and support of applications. The cloud service provider 302 may be further configured to provide services to the storage system 306 and users of the storage system 306 by implementing an authentication-as-a-service ('AaaS') service model, wherein the cloud service provider 302 provides authentication services that may be used for secure access to applications, data sources, or other resources. Cloud service provider 302 may also be configured to provide services to storage system 306 and users of storage system 306 by implementing a storage-as-a-service model, wherein cloud service provider 302 provides access to its storage infrastructure for use by storage system 306 and users of storage system 306. Readers will appreciate that cloud service provider 302 may be configured to provide additional services to storage system 306 and users of storage system 306 by implementing additional service models, because the service models described above are included for purposes of explanation only, and in no way represent limitations on services that may be provided by cloud service provider 302 or limitations on service models that may be implemented by cloud service provider 302.
In the example depicted in fig. 3A, cloud service provider 302 may be embodied as, for example, a private cloud, a public cloud, or a combination of private and public clouds. In embodiments in which cloud service provider 302 is embodied as a private cloud, cloud service provider 302 may be dedicated to providing services to a single organization, rather than to multiple organizations. In embodiments in which cloud service provider 302 is embodied as a public cloud, cloud service provider 302 may provide services to multiple organizations. Public and private cloud deployment models may be different and may be accompanied by various advantages and disadvantages. For example, because public cloud deployment involves sharing computing infrastructure across different organizations, such deployment may be less than ideal for organizations with security issues, critical task workload, uptime requirements, and so forth. While private cloud deployment may address some of these issues, private cloud deployment may require local deployment (on-premises) personnel to manage the private cloud. In yet another alternative embodiment, cloud service provider 302 may be embodied as a hybrid of private and public cloud services with a hybrid cloud deployment.
Although not explicitly depicted in fig. 3A, readers will appreciate that additional hardware components and additional software components may be necessary to facilitate the delivery of cloud services to the storage system 306 and users of the storage system 306. For example, the storage system 306 may be coupled to (or even include) a cloud storage gateway. Such a cloud storage gateway may be embodied, for example, as a hardware-based or software-based appliance that is located on premises with the storage system 306. Such a cloud storage gateway may operate as a bridge between local applications executing on the storage array 306 and remote, cloud-based storage utilized by the storage array 306. Through the use of a cloud storage gateway, an organization may move primary iSCSI or NAS to the cloud service provider 302, thereby enabling the organization to save space on its locally deployed storage systems. Such a cloud storage gateway may be configured to emulate a disk array, a block-based device, a file server, or another storage system that can translate SCSI commands, file server commands, or other appropriate commands into REST-space protocols that facilitate communication with the cloud service provider 302.
In order to enable the storage system 306 and users of the storage system 306 to make use of the services provided by the cloud service provider 302, a cloud migration process may take place, during which data, applications, or other elements from an organization's local systems (or even from another cloud environment) are moved to the cloud service provider 302. To successfully migrate data, applications, or other elements to the environment of the cloud service provider 302, middleware such as a cloud migration tool may be used to bridge the gap between the environment of the cloud service provider 302 and the environment of the organization. Such cloud migration tools may also be configured to address the potentially high network costs and long transfer times associated with migrating large volumes of data to the cloud service provider 302, as well as to address security concerns associated with transferring sensitive data to the cloud service provider 302 over a data communications network. In order to further enable the storage system 306 and users of the storage system 306 to make use of the services provided by the cloud service provider 302, a cloud orchestrator may also be used to arrange and coordinate automated tasks to create a consolidated process or workflow. Such a cloud orchestrator may perform tasks such as configuring various components, whether those components are cloud components or locally deployed components, and managing the interconnections between such components. The cloud orchestrator can simplify inter-component communication and connections to ensure that links are correctly configured and maintained.
In the example depicted in fig. 3A, and as briefly described above, cloud service provider 302 may be configured to provide services to storage system 306 and users of storage system 306 by using a SaaS service model, wherein cloud service provider 302 provides application software, databases, and a platform for running applications to storage system 306 and users of storage system 306, thereby providing on-demand software to storage system 306 and users of storage system 306 and eliminating the need to install and run applications on local computers, which may simplify maintenance and support of applications. Such applications may take many forms according to various embodiments of the present disclosure. For example, cloud service provider 302 may be configured to provide storage system 306 and users of storage system 306 with access to data analysis applications. Such data analysis applications may be configured, for example, to receive telemetry data returned (transmitted home) by the storage system 306. Such telemetry data may describe various operational characteristics of the storage system 306 and may be analyzed, for example, to determine the health of the storage system 306, identify workloads executing on the storage system 306, predict when the storage system 306 will run out of various resources, recommend configuration changes, hardware or software upgrades, workflow migration, or other actions that may improve the operation of the storage system 306.
Cloud service provider 302 may also be configured to provide storage system 306 and users of storage system 306 with access to virtualized computing environments. For example, such virtualized computing environments may be embodied as virtual machines or other virtualized computer hardware platforms, virtual storage, virtualized computer network resources, and the like. Examples of such virtualized environments may include virtual machines created to emulate an actual computer, virtualized desktop environments that separate logical desktops from physical machines, virtualized file systems that allow uniform access to different types of specific file systems, and many others.
For further explanation, fig. 3B sets forth a diagram of a storage system 306 according to some embodiments of the present disclosure. Although depicted in less detail, the storage system 306 depicted in fig. 3B may be similar to the storage system described above with reference to fig. 1A-1D and 2A-2G, as the storage system may include many of the components described above.
The storage system 306 depicted in fig. 3B may include a storage resource 308, which may be embodied in many forms. For example, in some embodiments, the storage resources 308 may include nano-RAM or another form of non-volatile random access memory utilizing carbon nanotubes deposited on a substrate. In some embodiments, the memory resource 308 may include a 3D cross-point non-volatile memory, where bit storage is based on variations in bulk resistance and a stackable cross-grid data access array. In some embodiments, the storage resources 308 may include flash memory (including single level cell ('SLC') NAND flash memory, multi-level cell ('MLC') NAND flash memory, three-level cell ('TLC') NAND flash memory, four-level cell ('QLC') NAND flash memory), and others. In some embodiments, the storage resource 308 may include a non-volatile magnetoresistive random access memory ('MRAM'), including spin transfer torque ('STT') MRAM, in which data is stored using magnetic storage elements. In some embodiments, example storage resources 308 may include non-volatile phase change memory ('PCM') that may have the ability to hold multiple bits in a single cell while the cell may achieve several distinct intermediate states. In some embodiments, storage resources 308 may include quantum memory that allows photon quantum information to be stored and retrieved. In some embodiments, example memory resources 308 may include resistive random access memory ('ReRAM') in which data is stored by varying the resistance across a dielectric solid state material. In some embodiments, the memory resources 308 may include storage class memory ('SCM'), where solid state non-volatile memory may be fabricated at high density using some combination of sub-lithographic patterning techniques, multiple bits per cell, multiple device layers, and so forth. Readers will appreciate that other forms of computer memory and storage may be utilized with the storage system described above, including DRAM, SRAM, EEPROM, general purpose memory, and many others. The storage resources 308 depicted in fig. 3A may be embodied in various form factors including, but not limited to, dual inline memory modules ('DIMMs'), non-volatile dual inline memory modules ('NVDIMMs'), m.2, U.2, and others.
The storage resources 308 depicted in fig. 3A may include various forms of storage class memory ('SCM'). The SCM can effectively treat fast non-volatile memory (e.g., NAND flash) as an extension of DRAM such that the entire data set can be considered an in-memory data set that resides entirely in DRAM. The SCM may include non-volatile media such as, for example, NAND flash memory. Such NAND flash memory can be accessed utilizing NVMe, which can use the PCIe bus as its transport, providing relatively low access latency compared to older protocols. In practice, the network protocols for SSDs in full flash arrays may include NVMe using Ethernet (ROCE, NVME TCP), fibre channel (NVMe FC), infiniband (iWARP), and others that make it possible to treat fast non-volatile memory as an extension of DRAM. In view of the fact that DRAMs are typically byte-addressable and fast non-volatile memory (e.g., NAND flash) is block-addressable, a controller software/hardware stack may be required to convert block data into bytes stored in a medium. Examples of media and software that may be used as SCM may include, for example, 3D XPoint, intel (Intel) memory drive technology, the Z-SSD of Samsung, and others.
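The controller software/hardware stack mentioned above, which converts between the byte-addressable view presented to software and the block-addressable NAND underneath, can be sketched in Python as a read-modify-write layer; the 4 KiB block size and class names are illustrative assumptions rather than details from the document:

    class ByteAddressableView:
        """Toy translation layer: byte reads/writes on top of a block device."""

        BLOCK_SIZE = 4096   # assumed NAND-style block size

        def __init__(self, num_blocks: int):
            self.blocks = [bytearray(self.BLOCK_SIZE) for _ in range(num_blocks)]

        def read(self, offset: int, length: int) -> bytes:
            out = bytearray()
            while length > 0:
                blk, pos = divmod(offset, self.BLOCK_SIZE)
                chunk = min(length, self.BLOCK_SIZE - pos)
                out += self.blocks[blk][pos:pos + chunk]
                offset += chunk
                length -= chunk
            return bytes(out)

        def write(self, offset: int, data: bytes):
            while data:
                blk, pos = divmod(offset, self.BLOCK_SIZE)
                chunk = min(len(data), self.BLOCK_SIZE - pos)
                # Read-modify-write of the containing block.
                self.blocks[blk][pos:pos + chunk] = data[:chunk]
                offset += chunk
                data = data[chunk:]

    scm = ByteAddressableView(num_blocks=4)
    scm.write(4094, b"hello")            # spans the boundary of blocks 0 and 1
    print(scm.read(4094, 5))             # b'hello'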
The example storage system 306 depicted in fig. 3B may implement various storage architectures. For example, a storage system according to some embodiments of the present disclosure may utilize a block storage device, wherein data is stored in blocks, and each block essentially serves as an individual hard drive. A storage system according to some embodiments of the present disclosure may utilize an object store in which data is managed as objects. Each object may include the data itself, variable amounts of metadata, and a globally unique identifier, where object storage may be implemented at multiple levels (e.g., device level, system level, interface level). A storage system according to some embodiments of the present disclosure utilizes file storage in which data is stored in a hierarchical structure. This data may be saved in files and folders and presented to both the system storing it and the system retrieving it in the same format.
The example storage system 306 depicted in fig. 3B may be embodied as a storage system in which additional storage resources may be added using a longitudinal expansion (scale-up) model, additional storage resources may be added using a lateral expansion (scale-out) model, or some combination thereof. In the longitudinal expansion model, additional storage may be added by adding additional storage devices. In the lateral expansion model, however, additional storage nodes may be added to a cluster of storage nodes, where such storage nodes may include additional processing resources, additional networking resources, and so on.
The storage system 306 depicted in fig. 3B also includes communication resources 310 that may be used to facilitate data communication between components within the storage system 306, as well as between the storage system 306 and computing devices external to the storage system 306. The communication resources 310 may be configured to utilize a variety of different protocols and data communication fabrics to facilitate data communication between components within the storage system and computing devices external to the storage system. For example, the communication resources 310 may include fibre channel ('FC') technologies, such as FC fabrics and the FC protocol, that can transport SCSI commands over FC networks. The communication resources 310 may also include FC over Ethernet ('FCoE') technologies, through which FC frames are encapsulated and transmitted over Ethernet networks. The communication resources 310 may also include InfiniBand ('IB') technologies, in which a switched fabric topology is used to facilitate transmissions between channel adapters. The communication resources 310 may also include NVM Express ('NVMe') technologies and NVMe over fabrics ('NVMe-oF') technologies, through which non-volatile storage media attached via a PCI Express ('PCIe') bus may be accessed. The communication resources 310 may also include mechanisms for accessing the storage resources 308 within the storage system 306 utilizing serial attached SCSI ('SAS'), serial ATA ('SATA') bus interfaces for connecting the storage resources 308 within the storage system 306 to host bus adapters within the storage system 306, internet small computer systems interface ('iSCSI') technologies for providing block-level access to the storage resources 308 within the storage system 306, and other communication resources that may be useful in facilitating data communication between components within the storage system 306, as well as between the storage system 306 and computing devices external to the storage system 306.
The storage system 306 depicted in fig. 3B also includes processing resources 312 that may be used to execute computer program instructions and to perform other computing tasks within the storage system 306. The processing resources 312 may include one or more application-specific integrated circuits ('ASICs') that are customized for a particular purpose, as well as one or more central processing units ('CPUs'). The processing resources 312 may also include one or more digital signal processors ('DSPs'), one or more field-programmable gate arrays ('FPGAs'), one or more systems-on-a-chip ('SoCs'), or other forms of processing resources 312. The storage system 306 may utilize the processing resources 312 to perform a variety of tasks, including but not limited to supporting the execution of the software resources 314 that will be described in more detail below.
The storage system 306 depicted in fig. 3B also includes software resources 314 that, when executed by the processing resources 312 within the storage system 306, may perform various tasks. The software resources 314 may include, for example, one or more modules of computer program instructions that when executed by the processing resources 312 within the storage system 306 are used to implement various data protection techniques to preserve the integrity of data stored within the storage system. Readers will appreciate that such data protection techniques may be carried out, for example, by system software executing on computer hardware within a storage system, by a cloud service provider, or in other ways. Such data protection techniques may include, for example: data archiving techniques that cause data that is no longer actively usable to be moved to a separate storage device or separate storage system for long-term retention; data backup techniques by which data stored in a storage system may be replicated and stored in disparate locations to avoid loss of data in the event of equipment failure or some other form of disaster of the storage system; a data replication technique by which data stored in a storage system is replicated to another storage system such that the data is accessible via a plurality of storage systems; data snapshot techniques by which the state of data within a storage system is captured at various points in time; data and database cloning techniques by which duplicate copies of data and databases can be created; and other data protection techniques. By using such data protection techniques, business continuity and disaster recovery goals may be met, as failure of a storage system may not result in loss of data stored in the storage system.
The software resources 314 may also include software for implementing software-defined storage ('SDS'). In such an example, the software resources 314 may include one or more modules of computer program instructions that, when executed, are useful in policy-based provisioning and management of data storage that is independent of the underlying hardware. Such software resources 314 may be used to implement storage virtualization in order to separate the storage hardware from the software that manages the storage hardware.
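As a rough illustration of policy-based provisioning that is decoupled from the underlying hardware, the following Python sketch (the policy names, attributes, and function shown are hypothetical assumptions, not an API of any particular SDS product) translates a named policy into a volume specification that an SDS layer could then map onto whatever hardware is actually present:

    # Hypothetical policy catalog: each policy describes intent, not hardware.
    POLICIES = {
        "gold":   {"replicas": 3, "tier": "flash", "snapshots_per_day": 24},
        "silver": {"replicas": 2, "tier": "hybrid", "snapshots_per_day": 4},
    }

    def provision_volume(name: str, size_gb: int, policy_name: str) -> dict:
        """Translate a policy into a volume specification; an SDS layer would
        map this specification onto whatever hardware is actually present."""
        policy = POLICIES[policy_name]
        return {
            "volume": name,
            "size_gb": size_gb,
            "replicas": policy["replicas"],
            "preferred_tier": policy["tier"],
            "snapshot_schedule": f'every {24 // policy["snapshots_per_day"]} hours',
        }

    print(provision_volume("analytics-scratch", 512, "silver"))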
The software resources 314 may also include software for facilitating and optimizing I/O operations that are directed to the storage resources 308 in the storage system 306. For example, the software resources 314 may include software modules that, when executed, perform various data reduction techniques such as, for example, data compression, data deduplication, and others. The software resources 314 may include software modules that intelligently group I/O operations together to facilitate better usage of the underlying storage resources 308, software modules that perform data migration operations to migrate data from within the storage system, as well as software modules that perform other functions. Such software resources 314 may be embodied as one or more software containers or in many other ways.
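One of the data reduction techniques mentioned above, data deduplication, may be illustrated by the following simplified content-addressed sketch in Python (a conceptual model only, not the deduplication implementation of any particular storage system):

    import hashlib

    class DedupStore:
        """Store fixed-size chunks by content hash so identical chunks are kept once."""

        CHUNK_SIZE = 4096

        def __init__(self):
            self.chunks = {}   # fingerprint -> chunk bytes
            self.files = {}    # file name -> ordered list of fingerprints

        def write(self, name: str, data: bytes) -> None:
            fingerprints = []
            for i in range(0, len(data), self.CHUNK_SIZE):
                chunk = data[i:i + self.CHUNK_SIZE]
                fp = hashlib.sha256(chunk).hexdigest()
                self.chunks.setdefault(fp, chunk)   # duplicate chunks are stored once
                fingerprints.append(fp)
            self.files[name] = fingerprints

        def read(self, name: str) -> bytes:
            return b"".join(self.chunks[fp] for fp in self.files[name])

    store = DedupStore()
    store.write("a.bin", b"x" * 8192)
    store.write("b.bin", b"x" * 8192)   # fully deduplicated against a.bin
    print(len(store.chunks), store.read("b.bin") == b"x" * 8192)   # 1 True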
Readers will appreciate that the presence of such software resources 314 may provide an improved user experience of the storage system 306, an extension of the functionality supported by the storage system 306, and many other benefits. Considering a particular example of a software resource 314 implementing a data backup technique, data stored in a storage system may be replicated by the data backup technique and stored in disparate locations to avoid losing data in the event of an equipment failure or some other form of disaster. In this example, the system described herein may perform backup operations more reliably (and with less burden on the user) than an interactive backup management system that requires a high degree of user interaction, provides less robust automation and feature sets, and so forth.
The storage system described above may implement intelligent data backup techniques by which data stored in the storage system may be replicated and stored in disparate locations to avoid losing data in the event of equipment failure or some other form of disaster. For example, the storage system described above may be configured to check each backup to avoid restoring the storage system to an unexpected state. Consider an example in which malware infects a storage system. In this example, the storage system may include a software resource 314 that may scan each backup to identify those backups captured before and those backups captured after the malware infects the storage system. In this example, the storage system may restore itself from a backup that does not contain malware-or at least does not restore portions of the backup that contain malware. In this example, the storage system may include a software resource 314 that may scan each backup to identify the presence of malware (or viruses or some other unexpected thing), for example, by identifying write operations serviced by the storage system and originating from a network subnet suspected of delivering malware, by identifying write operations serviced by the storage system and originating from users suspected of delivering malware, by identifying the contents of write operations serviced by the storage system and checking the content of write operations against fingerprints of malware, and in many other ways.
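A simplified Python sketch of the backup-scanning approach described above follows; the fingerprint library, the backup representation, and the function names are all illustrative assumptions:

    import hashlib

    # Hypothetical fingerprint library: content hashes known to belong to malware.
    MALWARE_FINGERPRINTS = {
        hashlib.sha256(b"evil payload").hexdigest(),
    }

    def backup_is_clean(backup: dict) -> bool:
        """A backup is modeled here as a mapping of paths to file contents."""
        for content in backup["files"].values():
            if hashlib.sha256(content).hexdigest() in MALWARE_FINGERPRINTS:
                return False
        return True

    def last_clean_backup(backups: list) -> dict:
        """Return the most recent backup containing no known malware fingerprint."""
        for backup in sorted(backups, key=lambda b: b["timestamp"], reverse=True):
            if backup_is_clean(backup):
                return backup
        raise RuntimeError("no clean backup available")

    backups = [
        {"timestamp": 1, "files": {"/data/report": b"quarterly numbers"}},
        {"timestamp": 2, "files": {"/data/report": b"quarterly numbers",
                                   "/tmp/dropper": b"evil payload"}},
    ]
    print(last_clean_backup(backups)["timestamp"])   # -> 1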
The reader will further appreciate that backups (often in the form of one or more snapshots) may also be used to perform a rapid recovery of the storage system. Consider an example in which the storage system is infected with ransomware that locks users out of the storage system. In this example, the software resources 314 within the storage system may be configured to detect the presence of the ransomware, and may be further configured to use retained backups to restore the storage system to a point in time prior to the point in time at which the ransomware infected the storage system. In this example, the presence of the ransomware may be explicitly detected through the use of software tools utilized by the system, through the use of a key (e.g., a USB drive) that is inserted into the storage system, or in a similar way. Likewise, the presence of the ransomware may be inferred in response to system activity meeting a predetermined fingerprint such as, for example, no reads or writes coming into the system for a predetermined period of time.
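The activity-based inference described above may be sketched as follows in Python: if no reads or writes have been serviced for longer than a threshold, possible ransomware is flagged and the newest snapshot captured before the suspicious window is selected as the restore point (the threshold and data structures are illustrative assumptions, not parameters of an actual product):

    QUIET_THRESHOLD_SECONDS = 3600   # illustrative: an hour with no I/O is suspicious

    def ransomware_suspected(last_io_timestamp: float, now: float) -> bool:
        """Infer possible ransomware from an unusual quiet period, as one heuristic."""
        return (now - last_io_timestamp) > QUIET_THRESHOLD_SECONDS

    def restore_point(snapshots: list, suspected_infection_time: float) -> dict:
        """Pick the newest snapshot captured before the suspected infection time."""
        candidates = [s for s in snapshots if s["taken_at"] < suspected_infection_time]
        if not candidates:
            raise RuntimeError("no snapshot predates the suspected infection")
        return max(candidates, key=lambda s: s["taken_at"])

    snapshots = [{"taken_at": 100.0, "id": "snap-1"},
                 {"taken_at": 200.0, "id": "snap-2"},
                 {"taken_at": 300.0, "id": "snap-3"}]

    now, last_io = 5000.0, 250.0
    if ransomware_suspected(last_io, now):
        print("restoring from", restore_point(snapshots, last_io)["id"])   # snap-2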
The reader will appreciate that the various components depicted in fig. 3B may be grouped into one or more optimized computing packages as a converged infrastructure. This converged infrastructure can include a pool of computer, storage, and networking resources that can be shared by multiple applications and managed in a collective manner using policy-driven processes. Such a converged infrastructure may minimize compatibility issues between various components within the storage system 306 while also reducing various costs associated with the establishment and operation of the storage system 306. Such a converged infrastructure may be implemented with a converged infrastructure reference architecture, with stand-alone devices, with a software-driven super-converged approach (e.g., a super-converged infrastructure), or in other ways.
The reader will appreciate that the storage system 306 depicted in FIG. 3B may be used to support various types of software applications. For example, the storage system 306 may be used to support artificial intelligence ('AI') applications, database applications, devOps projects, electronic design automation tools, event driven software applications, high performance computing applications, simulation applications, high speed data capture and analysis applications, machine learning applications, media production applications, media services applications, picture archiving and communication systems ('PACS') applications, software development applications, virtual reality applications, augmented reality applications, and many other types of applications by providing storage resources to such applications.
The storage systems described above are operable to support a wide variety of applications. In view of the fact that the storage systems include compute resources, storage resources, and a wide variety of other resources, the storage systems may be well suited to support applications that are resource intensive such as, for example, AI applications. Such AI applications may enable devices to perceive their environment and take actions that maximize their chance of success at some goal. Examples of such AI applications can include IBM Watson, Microsoft Oxford, Google DeepMind, Baidu Minwa, and others. The storage systems described above may also be well suited to support other types of applications that are resource intensive such as, for example, machine learning applications. Machine learning applications may perform various types of data analysis to automate analytical model building. Using algorithms that iteratively learn from data, machine learning applications can enable computers to learn without being explicitly programmed. One particular area of machine learning is referred to as reinforcement learning, which involves taking suitable actions to maximize reward in a particular situation. Reinforcement learning may be employed to find the best possible behavior or path that a particular software application or machine should take in a specific situation. Reinforcement learning differs from other areas of machine learning (e.g., supervised learning, unsupervised learning) in that correct input/output pairs need not be presented for reinforcement learning and sub-optimal actions need not be explicitly corrected.
In addition to the resources already described, the storage systems described above may also include graphics processing units ('GPUs'), occasionally referred to as visual processing units ('VPUs'). Such GPUs may be embodied as specialized electronic circuits that rapidly manipulate and alter memory to accelerate the creation of images in a frame buffer intended for output to a display device. Such GPUs may be included within any of the computing devices that are part of the storage systems described above, including as one of many individually scalable components of a storage system, where other examples of individually scalable components of such a storage system can include storage components, memory components, compute components (e.g., CPUs, FPGAs, ASICs), networking components, software components, and others. In addition to GPUs, the storage systems described above may also include neural network processors ('NNPs') for use in various aspects of neural network processing. Such NNPs may be used in place of (or in addition to) GPUs, and they may also be independently scalable.
As described above, the storage systems described herein may be configured to support artificial intelligence applications, machine learning applications, big data analytics applications, and many other types of applications. The rapid growth of these sorts of applications is being driven by three technologies: deep learning ('DL'), GPU processors, and big data. Deep learning is a computing model that makes use of massively parallel neural networks inspired by the human brain. Instead of experts handcrafting software, a deep learning model writes its own software by learning from lots of examples. GPUs are modern processors with thousands of cores, well suited to run algorithms that loosely represent the parallel nature of the human brain.
Advances in deep neural networks have ignited a new wave of algorithms and tools for data scientists to tap into their data with artificial intelligence (AI). With improved algorithms, larger data sets, and various frameworks (including open source software libraries for machine learning across a range of tasks), data scientists are tackling new use cases such as autonomous driving vehicles, natural language processing and understanding, computer vision, machine reasoning, strong AI, and many others. Applications of such techniques may include: machine and vehicular object detection, identification, and avoidance; visual recognition, classification, and tagging; algorithmic financial trading strategy performance management; simultaneous localization and mapping; predictive maintenance of high-value machinery; prevention against cyber security threats, expertise automation; image recognition and classification; question answering; robotics; text analytics (extraction, classification) and text generation and translation; and many others. Applications of AI techniques have materialized in a wide array of products including, for example: the speech recognition technology of Amazon Echo, which allows users to talk to their machines; Google Translate™, which allows for machine-based language translation; Spotify's Discover Weekly, which provides recommendations of new songs and artists that a user may like based on the user's usage and traffic analysis; Quill's text generation offering, which takes structured data and turns it into narrative stories; chatbots, which provide real-time, contextually specific answers to questions in a dialog format; and many others. Furthermore, AI may impact a wide variety of industries and sectors. For example, AI solutions may be used in healthcare to take clinical notes, patient files, research data, and other inputs to generate potential treatment options for doctors to explore. Likewise, AI solutions may be used by retailers to personalize consumer recommendations based on a person's digital footprint of behaviors, profile data, or other data.
However, training deep neural networks requires both high quality input data and a large amount of computation. GPUs are massively parallel processors capable of operating on large amounts of data simultaneously. When combined into a multi-GPU cluster, a high throughput pipeline may be required to feed input data from the storage to the compute engine. Deep learning is not just building and training models. There is also an entire data pipeline that must be designed for the scale, iterations, and experiments that are required for a data science team to succeed.
Data is the heart of modern AI and deep learning algorithms. Before training can begin, one problem that must be addressed is collecting the labeled data that is crucial for training an accurate AI model. A full-scale AI deployment may be required to continuously collect, clean, transform, label, and store large amounts of data. Adding additional high quality data points translates directly into more accurate models and better insights. Data samples may undergo a series of processing steps including, but not limited to: 1) ingesting the data from an external source into the training system and storing the data in raw form; 2) cleaning and transforming the data in a format convenient for training, including linking data samples to the appropriate label; 3) exploring parameters and models, quickly testing with a smaller dataset, and iterating to converge on the most promising models to push into the production cluster; 4) executing training phases to select random batches of input data, including both new and older samples, and feeding those into production GPU servers for computation to update model parameters; and 5) an evaluation phase that includes using a holdout portion of the data not used in training in order to evaluate model accuracy on the holdout data. This lifecycle may apply for any type of parallelized machine learning, not just neural networks or deep learning. For example, standard machine learning frameworks may rely on CPUs instead of GPUs, but the data ingest and training workflows may be the same. Readers will appreciate that a single shared storage data hub creates a coordination point throughout the lifecycle without the need for extra data copies among the ingest, preprocessing, and training stages. Rarely is the ingested data used for only one purpose, and shared storage gives the flexibility to train multiple different models or apply traditional analytics to the data.
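The five processing steps listed above may be summarized by the following Python sketch of the lifecycle, in which the function bodies are placeholders standing in for real ingest, transformation, and training code:

    import random

    def ingest(source):                 # 1) capture raw data from an external source
        return [{"raw": x} for x in source]

    def clean_and_label(raw_samples):   # 2) transform into a training-friendly form
        return [{"features": s["raw"], "label": s["raw"] % 2} for s in raw_samples]

    def split(samples, holdout_fraction=0.2):   # 5) reserve a holdout set
        random.shuffle(samples)
        cut = int(len(samples) * (1 - holdout_fraction))
        return samples[:cut], samples[cut:]

    def train(train_set, epochs=3):     # 3) + 4) iterate over random batches
        for _ in range(epochs):
            batch = random.sample(train_set, k=min(8, len(train_set)))
            _ = batch                   # a real pipeline would update model parameters here
        return "model"

    raw = ingest(range(100))
    labeled = clean_and_label(raw)
    train_set, holdout = split(labeled)
    model = train(train_set)
    print(len(train_set), len(holdout), model)   # the holdout is used only for evaluation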
The reader will appreciate that each stage in the AI data pipeline may have varying requirements from the data hub (e.g., the storage system or collection of storage systems). Scale-out storage systems must deliver uncompromising performance for all manner of access types and patterns, from small, metadata-heavy to large files, from random to sequential access patterns, and from low to high concurrency. The storage systems described above may serve as an ideal AI data hub as the systems may service unstructured workloads. In the first stage, data is ideally ingested and stored on to the same data hub that following stages will use, in order to avoid excess data copying. The next two steps can be done on a standard compute server that optionally includes a GPU, and then in the fourth and last stage, full training production jobs are run on powerful GPU-accelerated servers. Often, there is a production pipeline alongside an experimental pipeline operating on the same dataset. Further, the GPU-accelerated servers can be used independently for different models or joined together to train on one larger model, even spanning multiple systems for distributed training. If the shared storage tier is slow, then data must be copied to local storage for each phase, resulting in wasted time staging data onto different servers. The ideal data hub for the AI training pipeline delivers performance similar to data stored locally on the server node while also having the simplicity and performance to enable all pipeline stages to operate concurrently.
Data scientists work to improve the usefulness of the trained model through a wide variety of approaches: more data, better data, smarter training, and deeper models. In many cases, there will be teams of data scientists sharing the same datasets and working in parallel to produce new and improved training models. Often, a team of data scientists will work concurrently on the same shared datasets during these phases. Multiple, concurrent workloads of data processing, experimentation, and full-scale training layer the demands of multiple access patterns on the storage tier. In other words, storage cannot just satisfy large file reads, but must contend with a mix of large and small file reads and writes. Finally, with multiple data scientists exploring datasets and models, it may be critical to store data in its native format to provide flexibility for each user to transform, clean, and use the data in a unique way. The storage systems described above may provide a natural shared storage home for the dataset, with data protection redundancy (e.g., by using RAID 6) and the performance necessary to be a common access point for multiple developers and multiple experiments. Using the storage systems described above may avoid the need to carefully copy subsets of the data for local work, saving both engineering time and GPU-accelerated server use time. These copies become a constant and growing tax as the raw data set and desired transformations constantly update and change.
Readers will appreciate that deep learning has achieved tremendous success, the root cause being the continued improvement of models with larger dataset sizes. In contrast, classical machine learning algorithms, such as logistic regression, stop improving accuracy at smaller dataset sizes. Thus, the separation of computing and storage resources may also allow for independent expansion of each tier level, avoiding much of the complexity inherent in managing both simultaneously. As data sets grow in size or new data sets are considered, laterally expanding storage systems must be able to easily expand. Similarly, if more concurrent training is required, additional GPUs or other computing resources may be added, regardless of their internal storage. Furthermore, the storage system described above may facilitate the construction, operation, and development of AI systems due to the random read bandwidth provided by the storage system, the ability of the storage system to randomly read small files (50 KB) at a high rate (which means that no additional work is required to aggregate individual data points to form a larger storage friendly file), the ability of the storage system to expand capacity and performance as data set growth or throughput requirements grow, the ability of the storage system to support files or objects, the ability of the storage system to tune performance of large or small files (i.e., users do not have to build file systems), the ability of the storage system to support non-interfering upgrades of hardware and software even during the generation of model training, and for many other reasons.
Small file performance of the storage tier may be critical as many types of inputs, including text, audio, or images, will be natively stored as small files. If the storage tier does not handle small files well, an extra step will be required to pre-process and group samples into larger files. Storage, built on top of spinning disks, that relies on SSD as a caching tier may fall short of the performance needed. Because training with random input batches results in more accurate models, the entire data set must be accessible with full performance. SSD caches only provide high performance for a small subset of the data and will be ineffective at hiding the latency of spinning drives.
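The pre-processing step mentioned above, grouping small samples into larger files when the storage tier handles small files poorly, might look like the following generic packing sketch in Python (not tied to any particular storage product):

    def pack_samples(samples: list, target_size: int = 1 << 20) -> list:
        """Concatenate many small samples into shards of roughly target_size bytes,
        returning (shard_bytes, index) pairs so individual samples stay addressable."""
        shards, current, index, offset = [], bytearray(), [], 0
        for sample_id, payload in samples:
            index.append((sample_id, offset, len(payload)))
            current.extend(payload)
            offset += len(payload)
            if offset >= target_size:
                shards.append((bytes(current), index))
                current, index, offset = bytearray(), [], 0
        if current:
            shards.append((bytes(current), index))
        return shards

    small_files = [(f"img-{i}", b"\x00" * 50_000) for i in range(64)]  # ~50 KB samples
    shards = pack_samples(small_files)
    print(len(shards), [len(s[0]) for s in shards][:2])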
While the preceding paragraphs discuss deep learning applications, readers will appreciate that the storage systems described herein may also be part of a distributed deep learning ('DDL') platform to support the execution of DDL algorithms. Distributed deep learning may be used to significantly accelerate deep learning with distributed computing on GPUs (or other forms of accelerators or computer program instruction executors), such that parallelism can be achieved. In addition, the output of training machine learning and deep learning models, such as a fully trained machine learning model, may be used for a variety of purposes and in conjunction with other tools. For example, trained machine learning models may be used in conjunction with tools like Core ML to integrate a wide variety of machine learning model types into an application. In fact, trained models may be run through Core ML converter tools and inserted into a custom application that can be deployed on compatible devices. The storage systems described above may also be paired with other technologies, such as TensorFlow, an open source software library for dataflow programming across a range of tasks that may be used for machine learning applications such as neural networks, to facilitate the development of such machine learning models, applications, and so on.
Readers will further appreciate that the systems described above may be deployed in a variety of ways to support the democratization of AI, as AI becomes more available for mass consumption. The democratization of AI may include, for example, the ability to offer AI as a Platform-as-a-Service, the growth of artificial general intelligence offerings, the proliferation of autonomous level 4 and level 5 vehicles, the availability of autonomous mobile robots, the development of conversational AI platforms, and many others. For example, the systems described above may be deployed in cloud environments, edge environments, or other environments that are useful in supporting the democratization of AI. As part of the democratization of AI, a movement may occur from narrow AI, which consists of highly scoped machine learning solutions that target a particular task, to artificial general intelligence, where the use of machine learning is expanded to handle a broad range of use cases that could essentially perform any intelligent task that a human could perform and could learn dynamically, much like a human.
The storage systems described above may also be used in a neuromorphic computing environment. Neuromorphic computing is a form of computing that mimics brain cells. To support neuromorphic computing, an architecture of interconnected 'neurons' replaces the traditional computing model, with low-power signals passing directly between neurons for more efficient computation. Neuromorphic computing may make use of very-large-scale integration (VLSI) systems containing electronic analog circuits to mimic neuro-biological architectures present in the nervous system, as well as analog, digital, and mixed-mode analog/digital VLSI, and software systems that implement models of neural systems for perception, motor control, or multisensory integration.
Readers will appreciate that the storage systems described above may be configured to support the storage or use of (among other types of data) blockchains. Such blockchains may be embodied as a continuously growing list of records, called blocks, which are linked and secured using cryptography. Each block in a blockchain may contain a hash pointer as a link to a previous block, a timestamp, transaction data, and so on. Blockchains may be designed to be resistant to modification of the data and can serve as an open, distributed ledger that can record transactions between two parties efficiently and in a verifiable and permanent way. This makes blockchains potentially suitable for the recording of events, medical records, and other record-management activities, such as identity management, transaction processing, and others. In addition to supporting the storage and use of blockchain technologies, the storage systems described above may also support the storage and use of derivative items such as, for example, open source blockchains and related tools that are part of the IBM Hyperledger project, permissioned blockchains in which a certain number of trusted parties are allowed to access the blockchain, blockchain products that enable developers to build their own distributed ledger projects, and others. Readers will appreciate that blockchain technologies may impact a wide variety of industries and sectors. For example, blockchain technologies may be used in real estate transactions as blockchain-based contracts, the use of which can eliminate the need for third parties and enable self-executing actions when conditions are met. Likewise, universal health records can be created by aggregating and placing a person's health history onto a blockchain ledger for any healthcare provider, or permissioned healthcare providers, to access and update.
Readers will appreciate that the use of blockchains is not limited to financial transactions, contracts, and the like. In fact, blockchains may be leveraged to enable the decentralized aggregation, ordering, timestamping, and archiving of any type of information, including structured data, correspondence, documentation, or other data. Through the use of blockchains, participants can provably and permanently agree on exactly what data was entered, when, and by whom, without relying on a trusted intermediary. For example, SAP's recently launched blockchain platform, which supports MultiChain and Hyperledger Fabric, targets a broad range of supply chain and other non-financial applications.
One way to use a blockchain for recording data is to embed each piece of data directly inside a transaction. Every blockchain transaction may be digitally signed by one or more parties, replicated to a plurality of nodes, ordered and timestamped by the chain's consensus algorithm, and stored permanently in a tamper-proof way. Any data within the transaction will therefore be stored identically but independently by every node, along with a proof of who wrote it and when. The chain's users are able to retrieve this information at any future time. This type of storage may be referred to as on-chain storage. On-chain storage may not be particularly practical, however, when attempting to store a very large dataset. As such, in accordance with embodiments of the present disclosure, blockchains and the storage systems described herein may be leveraged to support on-chain storage of data as well as off-chain storage of data.
Off-chain storage of data can be implemented in a variety of ways and can occur when the data itself is not stored within the blockchain. For example, in one embodiment, a hash function may be utilized and the data itself may be fed into the hash function to generate a hash value. In such an example, the hashes of large pieces of data may be embedded within transactions, instead of the data itself. Each hash may serve as a commitment to its input data, with the data itself being stored outside of the blockchain. Readers will appreciate that any blockchain participant that needs an off-chain piece of data cannot reproduce the data from its hash, but if the data can be retrieved in some other way, then the on-chain hash serves to confirm who created it and when. Just like regular on-chain data, the hash may be embedded inside a digitally signed transaction, which was included in the chain by consensus.
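The hash-commitment scheme described above may be sketched as follows in Python; only the fingerprint of the data is embedded in a (simulated) transaction, while the data itself remains in ordinary off-chain storage:

    import hashlib, time

    off_chain_store = {}     # ordinary storage, e.g. a storage system described above
    chain = []               # simulated ledger of transactions

    def commit(data: bytes, author: str) -> str:
        digest = hashlib.sha256(data).hexdigest()
        off_chain_store[digest] = data                 # the data stays off-chain
        chain.append({"hash": digest,                  # only the commitment goes on-chain
                      "author": author,
                      "timestamp": time.time()})
        return digest

    def verify(digest: str) -> bool:
        """Confirm the retrieved off-chain data still matches its on-chain commitment."""
        data = off_chain_store.get(digest)
        recorded = any(tx["hash"] == digest for tx in chain)
        return recorded and data is not None and \
            hashlib.sha256(data).hexdigest() == digest

    d = commit(b"large dataset that would be impractical to store on-chain", "alice")
    print(verify(d))   # True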
Readers will appreciate that, in other embodiments, alternatives to blockchains may be used to facilitate the decentralized storage of information. For example, one alternative to a blockchain that may be used is a blockweave. While conventional blockchains store every transaction to achieve validation, a blockweave permits secure decentralization without the usage of the entire chain, thereby enabling low-cost on-chain storage of data. Such blockweaves may utilize a consensus mechanism that is based on proof of access (PoA) and proof of work (PoW). While typical PoW systems only depend on the previous block in order to generate each successive block, the PoA algorithm may incorporate data from a randomly chosen previous block. Combined with the blockweave data structure, miners do not need to store all blocks (forming a blockchain), but rather can store any previous blocks, forming a weave of blocks (a blockweave). This enables increased levels of scalability, speed, and low cost, and reduces the cost of data storage, in part because miners need not store all blocks, thereby resulting in a substantial reduction in the amount of electricity consumed during the mining process: as the network expands, a blockweave demands less and less hashing power for consensus as data is added to the system. Furthermore, blockweaves may be deployed on a decentralized storage network in which incentives are created to encourage rapid data sharing. Such decentralized storage networks may also make use of blockshadowing techniques, in which nodes only send a minimal block 'shadow' to other nodes that allows peers to reconstruct a full block, instead of transmitting the full block itself.
The storage systems described above may be used, alone or in combination with other computing devices, to support in-memory computing applications. In-memory computing involves the storage of information in RAM that is distributed across a cluster of computers. In-memory computing helps business customers, including retailers, banks, and utilities, to quickly detect patterns, analyze massive data volumes on the fly, and perform their operations quickly. Readers will appreciate that the storage systems described above, especially those that are configurable with customizable amounts of processing resources, storage resources, and memory resources (e.g., those systems in which blades contain configurable amounts of each type of resource), may be configured in a way so as to provide an infrastructure that can support in-memory computing. Likewise, the storage systems described above may include component parts (e.g., NVDIMMs and 3D XPoint storage that provide fast, persistent random access memory) that can actually provide an improved in-memory computing environment when compared to in-memory computing environments that rely on RAM distributed across dedicated servers.
In some embodiments, the storage systems described above may be configured to operate as a hybrid in-memory computing environment that includes a universal interface to all storage media (e.g., RAM, flash storage, 3D XPoint storage). In such embodiments, users may have no knowledge regarding the details of where their data is stored, but they can still use the same full, unified API to address data. In such embodiments, the storage system may (in the background) move data to the fastest layer available, including intelligently placing the data in dependence upon various characteristics of the data or in dependence upon some other heuristic. In such an example, the storage system may even make use of existing products such as Apache Ignite and GridGain to move data between the various storage layers, or the storage system may make use of custom software to move data between the various storage layers. The storage systems described herein may implement various optimizations to improve the performance of in-memory computing such as, for example, having computations occur as close to the data as possible.
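The background tiering behavior described above, in which data is placed on the fastest tier in dependence upon simple heuristics, may be sketched roughly as follows; the tiers, threshold, and method names are illustrative assumptions and do not describe the behavior of Apache Ignite, GridGain, or any specific product:

    from collections import Counter

    class HybridStore:
        """Toy model: hot keys migrate toward RAM; cold keys remain on lower tiers
        (e.g. 3D XPoint or flash in a fastest-to-slowest ordering)."""

        def __init__(self):
            self.placement = {}      # key -> tier
            self.hits = Counter()    # key -> access count

        def put(self, key, tier="flash"):
            self.placement[key] = tier

        def access(self, key):
            self.hits[key] += 1
            self._rebalance(key)
            return self.placement.get(key)

        def _rebalance(self, key):
            # Simple heuristic: frequently accessed data is promoted to RAM.
            # A real system would move the bytes in the background.
            if self.hits[key] >= 3 and self.placement.get(key) != "ram":
                self.placement[key] = "ram"

    store = HybridStore()
    store.put("orders-2023")
    for _ in range(3):
        tier = store.access("orders-2023")
    print(tier)    # "ram" once the key becomes hot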
Readers will further appreciate that, in some embodiments, the storage systems described above may be paired with other resources to support the applications described above. For example, one infrastructure could include primary compute in the form of servers and workstations that specialize in using general-purpose computing on graphics processing units ('GPGPU') to accelerate deep learning applications that are interconnected into a computation engine to train parameters for deep neural networks. Each system may have Ethernet external connectivity, InfiniBand external connectivity, some other form of external connectivity, or some combination thereof. In such an example, the GPUs can be grouped for a single large training job or used independently to train multiple models. The infrastructure could also include a storage system, such as the storage systems described above, to provide, for example, a scale-out all-flash file or object store through which data can be accessed via high-performance protocols such as NFS, S3, and so on. The infrastructure can also include redundant top-of-rack Ethernet switches connected to storage and compute via ports in MLAG port channels for redundancy. The infrastructure could also include additional compute in the form of whitebox servers, optionally with GPUs, for data ingestion, pre-processing, and model debugging. Readers will appreciate that additional infrastructures are also possible.
Readers will appreciate that the systems described above may be better suited for the applications described above relative to other systems that may include, for example, a distributed direct-attached storage (DDAS) solution deployed in server nodes. Such DDAS solutions may be built for handling large, less sequential accesses but may be less able to handle small, random accesses. Readers will further appreciate that the storage systems described above may be utilized to provide a platform for the applications described above that is preferable to the utilization of cloud-based resources, as the storage systems may be included in an on-site or in-house infrastructure that is more secure, more locally and internally managed, more robust in feature sets and performance, or otherwise preferable to utilizing cloud-based resources as part of a platform to support the applications described above. For example, services built on platforms such as IBM's Watson may require a business enterprise to distribute individual user information, such as financial transaction information or identifiable patient records, to other institutions. As such, cloud-based offerings of AI as a service may be less desirable than internally managed and offered AI as a service that is supported by storage systems such as the storage systems described above, for a wide array of technical reasons as well as for various business reasons.
Readers will appreciate that the storage systems described above, either alone or in coordination with other computing machinery, may be configured to support other AI-related tools. For example, the storage systems may make use of tools like ONNX or other open neural network exchange formats that make it easier to transfer models written in different AI frameworks. Likewise, the storage systems may be configured to support tools like Amazon's Gluon that allow developers to prototype, build, and train deep learning models. In fact, the storage systems described above may be part of a larger platform, such as IBM Cloud Private for Data, that includes integrated data science, data engineering, and application building services. Such platforms may seamlessly collect, organize, secure, and analyze data across an enterprise, as well as simplify hybrid data management, unified data governance and integration, data science, and business analytics with a single solution.
Readers will further appreciate that the storage system described above may also be deployed as an edge solution. This edge solution may be in place to optimize the cloud computing system by performing data processing near the source of the data at the edge of the network. Edge computing can push applications, data, and computing power (i.e., services) from a centralized point to the logical extremity of the network. By using an edge solution, such as the storage system described above, computing tasks can be performed using computing resources provided by such storage system, data can be stored using storage resources of the storage system, and cloud-based services can be accessed using various resources (including networking resources) of the storage system. By performing computing tasks on the edge solution, storing data on the edge solution, and generally using the edge solution, consumption of expensive cloud-based resources may be avoided, and in fact, performance improvements may be experienced relative to heavier dependencies on cloud-based resources.
While many tasks may benefit from the utilization of an edge solution, some particular uses may be especially suited for deployment in such an environment. For example, devices like drones, autonomous cars, robots, and others may require extremely rapid processing, so fast, in fact, that sending data up to a cloud environment and back to receive data processing support may simply be too slow. Likewise, machines like locomotives and gas turbines that generate large amounts of information through the use of a wide array of data-generating sensors may benefit from the rapid data processing capabilities of an edge solution. As an additional example, some IoT devices, such as connected video cameras, may not be well suited for the utilization of cloud-based resources, as it may be impractical (not only from a privacy perspective, a security perspective, or a financial perspective) to send the data to the cloud simply because of the sheer volume of data that is involved. As such, many tasks that really involve data processing, storage, or communication may be better suited by platforms that include edge solutions such as the storage systems described above.
Consider a specific example of inventory management in a warehouse, distribution center, or similar location. A large inventory, warehousing, shipping, order-fulfillment, manufacturing, or other operation has a large amount of inventory on inventory racks and high-resolution digital cameras that produce a firehose of large data. All of this data may be taken into an image processing system, which may reduce the amount of data to a firehose of small data. All of the small data may be stored on-premises in storage. The on-premises storage, at the edge of the facility, may be coupled to the cloud for external reports, real-time control, and cloud storage. Inventory management may be performed with the results of the image processing, so that inventory can be tracked on the shelves and restocked, moved, shipped, modified with new products, or discontinued/obsolescent products deleted, and so on. The above scenario is a prime candidate for an embodiment of the configurable processing and storage systems described above. A combination of compute-only blades and offload blades suited for the image processing, perhaps with deep learning on offload-FPGA or offload-custom blades, could take in the firehose of large data from all of the digital cameras and produce the firehose of small data. All of the small data could then be stored by storage nodes, operating with storage units in whichever combination of types of storage blades best handles the data flow. This is an example of storage and function acceleration and integration. Depending on external communication needs with the cloud and external processing in the cloud, and depending on the reliability of the network connection and cloud resources, the system may be sized for storage and compute management with bursty workloads and variable connectivity reliability. Also, depending on other inventory management aspects, the system may be configured for scheduling and resource management in a hybrid edge/cloud environment.
The storage systems described above may, either alone or in combination with other computing resources, serve as a network edge platform that combines compute resources, storage resources, networking resources, cloud technologies, network virtualization technologies, and so on. As part of the network, the edge may take on characteristics similar to other network facilities, from the customer premises and backhaul aggregation facilities to points of presence ('PoPs') and regional data centers. Readers will appreciate that network workloads, such as virtual network functions (VNFs) and others, will reside on the network edge platform. Enabled by a combination of containers and virtual machines, the network edge platform may rely on controllers and schedulers that are no longer geographically co-located with the data processing resources. The functions, as microservices, may split into control planes, user and data planes, or even state machines, allowing for independent optimization and scaling techniques to be applied. Such user and data planes may be enabled through increased accelerators, both those residing in server platforms such as FPGAs and smart NICs, and through SDN-enabled merchant silicon and programmable ASICs.
The storage systems described above may also be optimized for use in big data analytics. Big data analytics may be generally described as the process of examining large and varied data sets to uncover hidden patterns, unknown correlations, market trends, customer preferences, and other useful information that can help organizations make more informed business decisions. Big data analytics applications enable data scientists, predictive modelers, statisticians, and other analytics professionals to analyze growing volumes of structured transaction data, plus other forms of data that are often left untapped by conventional business intelligence (BI) and analytics programs. As part of that process, semi-structured and unstructured data such as, for example, internet clickstream data, web server logs, social media content, text from customer emails and survey responses, mobile phone call detail records, IoT sensor data, and other data may be converted to a structured form. Big data analytics is a form of advanced analytics, which involves complex applications with elements such as predictive models, statistical algorithms, and what-if analyses powered by high-performance analytics systems.
The storage systems described above may also support (including implementing as a system interface) applications that perform tasks in response to human speech. For example, the storage systems may support the execution of intelligent personal assistant applications such as, for example, Amazon's Alexa, Apple Siri, Google Voice, Samsung Bixby, Microsoft Cortana, and others. While the examples described in the previous sentence make use of voice as input, the storage systems described above may also support chatbots, talkbots, chatterbots, artificial conversational entities, or other applications that are configured to conduct a conversation via auditory or textual methods. Likewise, the storage system may actually execute such an application to enable a user, such as a system administrator, to interact with the storage system via speech. Such applications are generally capable of voice interaction, music playback, making to-do lists, setting alarms, streaming podcasts, playing audiobooks, and providing weather, traffic, and other real-time information, such as news, although in embodiments in accordance with the present disclosure, such applications may be utilized as interfaces to various system management operations.
The storage systems described above may also implement an AI platform for delivering on the vision of self-driving storage. Such AI platforms may be configured to deliver global predictive intelligence by collecting and analyzing large amounts of storage system telemetry data points to enable effortless management, analytics, and support. In fact, such storage systems may be capable of predicting both capacity and performance, as well as generating intelligent advice on workload deployment, interaction, and optimization. Such AI platforms may be configured to scan all incoming storage system telemetry data against a library of issue fingerprints to predict and resolve incidents in real time, before they impact customer environments, and may capture hundreds of variables related to performance that are used to forecast performance load.
The storage system described above may support serial or simultaneous execution of artificial intelligence applications, machine learning applications, data analysis applications, data conversion, and other tasks that may together form an AI ladder. By combining such elements to form a complete data science pipeline, this AI ladder can be efficiently formed, where dependencies exist between the elements of the AI ladder. For example, an AI may require some form of machine learning, machine learning may require some form of analysis, analysis may require some form of data and information architecture, and so forth. Thus, each element may be considered a step in an AI ladder, which may together form a complete and complex AI solution.
The storage systems described above may also, either alone or in combination with other computing environments, be used to deliver an AI-everywhere experience in which AI permeates wide and expansive aspects of business and life. For example, AI may play an important role in the delivery of deep learning solutions, deep reinforcement learning solutions, artificial general intelligence solutions, autonomous vehicles, cognitive computing solutions, commercial UAVs or drones, conversational user interfaces, enterprise taxonomies, ontology management solutions, machine learning solutions, smart dust, smart robots, smart workplaces, and many others. The storage systems described above may also, either alone or in combination with other computing environments, be used to deliver a wide range of transparently immersive experiences in which technology can introduce transparency between people, businesses, and things. Such transparently immersive experiences may be delivered as augmented reality technologies, connected homes, virtual reality technologies, brain-computer interfaces, human augmentation technologies, nanotube electronics, volumetric displays, 4D printing technologies, or others. The storage systems described above may also, either alone or in combination with other computing environments, be used to support a wide variety of digital platforms. Such digital platforms can include, for example, 5G wireless systems and platforms, digital twin platforms, edge computing platforms, IoT platforms, quantum computing platforms, serverless PaaS, software-defined security, neuromorphic computing platforms, and so on.
Readers will appreciate that some transparently immersive experiences may involve the use of digital twins of various "things" such as people, places, processes, systems, and so on. Such digital twins and other immersive technologies can alter the way that humans interact with technology, as conversational platforms, augmented reality, virtual reality, and mixed reality provide a more natural and immersive interaction with the digital world. In fact, digital twins may be linked with the real world, perhaps even in real time, to understand the state of a thing or system, respond to changes, and so on. Because digital twins consolidate massive amounts of information on individual assets and groups of assets (possibly even providing control of those assets), digital twins may communicate with each other to form digital factory models of multiple linked digital twins.
The storage systems described above may also be part of a multi-cloud environment in which multiple cloud computing and storage services are deployed in a single heterogeneous architecture. In order to facilitate the operation of such a multi-cloud environment, DevOps tools may be deployed to enable orchestration across clouds. Likewise, continuous development and continuous integration tools may be deployed to standardize processes around continuous integration and delivery, new feature rollout, and the provisioning of cloud workloads. By standardizing these processes, a multi-cloud strategy may be implemented that enables the utilization of the best provider for each workload. Furthermore, application monitoring and visibility tools may be deployed to move application workloads around different clouds, identify performance issues, and perform other tasks. In addition, security and compliance tools may be deployed to ensure compliance with security requirements, government regulations, and so on. Such a multi-cloud environment may also include tools for application delivery and smart workload management to ensure efficient application delivery and to help direct workloads across the distributed and heterogeneous infrastructure, as well as tools that ease the deployment and maintenance of packaged and custom applications in the cloud and enable portability amongst clouds. The multi-cloud environment may similarly include tools for data portability.
The storage systems described above may be used as part of a platform to enable the use of crypto-anchors, which may be used to authenticate a product's origin and contents to ensure that it matches a blockchain record associated with the product. Such crypto-anchors may take many forms including, for example, as edible ink, as a motion sensor, as a microchip, and others. Similarly, as part of a suite of tools to secure data stored on the storage system, the storage systems described above may implement various encryption technologies and schemes, including lattice cryptography. Lattice cryptography can involve constructions of cryptographic primitives that involve lattices, either in the construction itself or in the security proof. Unlike public-key schemes such as RSA, Diffie-Hellman, or elliptic-curve cryptosystems, which are vulnerable to attack by a quantum computer, some lattice-based constructions appear to be resistant to attack by both classical and quantum computers.
A quantum computer is a device that performs quantum computing. Quantum computing uses quantum-mechanical phenomena, such as superposition and entanglement, to perform computation. Quantum computers differ from traditional computers that are based on transistors, as such traditional computers require that data be encoded into binary digits (bits), each of which is always in one of two definite states (0 or 1). In contrast to traditional computers, quantum computers use quantum bits, which can be in superpositions of states. A quantum computer maintains a sequence of qubits, where a single qubit can represent a one, a zero, or any quantum superposition of those two qubit states. A pair of qubits can be in any quantum superposition of 4 states, and three qubits in any superposition of 8 states. A quantum computer with n qubits can generally be in an arbitrary superposition of up to 2^n different states simultaneously, whereas a traditional computer can only be in one of these states at any one time. A quantum Turing machine is a theoretical model of such a computer.
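The statement above that n qubits can be in a superposition of up to 2^n states corresponds to representing the quantum state as a vector of 2^n complex amplitudes, as the following small classical simulation in Python illustrates (for illustration only):

    import numpy as np

    def uniform_superposition(n_qubits: int) -> np.ndarray:
        """Return the state vector of n qubits in an equal superposition of all 2**n basis states."""
        dim = 2 ** n_qubits
        return np.full(dim, 1 / np.sqrt(dim), dtype=complex)

    state = uniform_superposition(3)
    print(len(state))                                     # 8 amplitudes for 3 qubits
    print(np.isclose(np.sum(np.abs(state) ** 2), 1.0))    # the amplitudes are normalized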
The storage system described above may also be paired with an FPGA acceleration server as part of a larger AI or ML infrastructure. Such FPGA acceleration servers may reside near the storage systems described above (e.g., in the same data center) or even be incorporated into an apparatus that includes one or more storage systems, one or more FPGA acceleration servers, networking infrastructure supporting communications between the one or more storage systems and the one or more FPGA acceleration servers, and other hardware and software components. Alternatively, the FPGA acceleration server may reside within a cloud computing environment that may be used to perform computing-related tasks for AI and ML jobs. Any of the embodiments described above may be used together as an FPGA-based AI or ML platform. The reader will appreciate that in some embodiments of an FPGA-based AI or ML platform, the FPGAs contained within the FPGA acceleration server may be reconfigured for different types of ML models (e.g., LSTM, CNN, GRU). The ability to reconfigure the FPGA contained within the FPGA acceleration server can enable acceleration of the ML or AI application based on the best numerical accuracy and memory model used. The reader will appreciate that by treating the collection of FPGA acceleration servers as a pool of FPGAs, any CPU in the data center can use the pool of FPGAs as a shared hardware micro-service, rather than limiting the servers to dedicated accelerators inserted therein.
The FPGA acceleration servers and GPU acceleration servers described above may implement a computing model in which, rather than keeping a small amount of data in a CPU and running a long stream of instructions over it as occurs in more traditional computing models, the machine learning model and parameters are pinned into high-bandwidth on-chip memory, with lots of data streaming through that high-bandwidth on-chip memory. For this computing model, FPGAs may even be more efficient than GPUs, as the FPGAs can be programmed with only the instructions needed to run this kind of computing model.
The storage system described above may be configured to provide parallel storage, for example, by using a parallel file system such as BeeGFS. Such parallel file systems may include a distributed metadata architecture. For example, a parallel file system may include multiple metadata servers across which metadata is distributed, as well as components including services for clients and storage servers. By using a parallel file system, file content may be distributed over multiple storage servers using striping, and metadata may be distributed over multiple metadata servers at the directory level, where each server stores a portion of a complete file system tree. Readers will appreciate that in some embodiments, the storage server and metadata server may run in user space atop an existing local file system. Furthermore, the client service, metadata server, or hardware server does not require dedicated hardware, as the metadata server, storage server, and even the client service may run on the same machine.
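Striping file content over multiple storage servers, as in the parallel file system described above, may be sketched in Python as a simple round-robin distribution of fixed-size stripes (a conceptual illustration, not the on-disk layout used by BeeGFS):

    from itertools import chain, zip_longest

    def stripe(data: bytes, num_servers: int, stripe_size: int = 4) -> dict:
        """Split data into fixed-size stripes and assign them round-robin to servers."""
        layout = {server: [] for server in range(num_servers)}
        for i in range(0, len(data), stripe_size):
            layout[(i // stripe_size) % num_servers].append(data[i:i + stripe_size])
        return layout

    def reassemble(layout: dict) -> bytes:
        """Interleave the stripes back together in round-robin order."""
        rows = zip_longest(*(layout[s] for s in sorted(layout)), fillvalue=b"")
        return b"".join(chain.from_iterable(rows))

    layout = stripe(b"parallel file systems stripe data", num_servers=3)
    print(reassemble(layout) == b"parallel file systems stripe data")   # True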
Readers will appreciate that, due in part to the emergence of many of the technologies discussed above, including mobile devices, cloud services, social networks, big data analytics, and so on, an information technology platform may be needed to integrate all of these technologies and drive new business opportunities by quickly delivering revenue-generating products, services, and experiences, rather than merely providing the technology to automate internal business processes. Information technology organizations may need to balance the resources and investments needed to keep core legacy systems up and running while also integrating technologies to build an information technology platform that can provide the speed and flexibility in areas such as, for example, exploiting big data, managing unstructured data, and working with cloud applications and services. One possible embodiment of such an information technology platform is a composable infrastructure that includes fluid resource pools, such as many of the systems described above, that can meet the changing needs of applications by allowing for the composition and recomposition of blocks of disaggregated compute, storage, and fabric infrastructure. Such a composable infrastructure can also include a single management interface to eliminate complexity, and a unified API to discover, search, inventory, configure, provision, update, and diagnose the composable infrastructure.
The systems described above may support the execution of a wide array of software applications. Such software applications may be deployed in a variety of ways, including container-based deployment models. Containerized applications may be managed using a variety of tools. For example, containerized applications may be managed using Docker Swarm, a clustering and scheduling tool that enables IT administrators and developers to establish and manage a cluster of Docker nodes as a single virtual system. Likewise, containerized applications may be managed through the use of Kubernetes, a container orchestration system for automating the deployment, scaling, and management of containerized applications. Kubernetes may execute on top of operating systems such as, for example, Red Hat Enterprise Linux, Ubuntu Server, SUSE Linux Enterprise Server, and others. In such examples, a master node may assign tasks to worker nodes. Kubernetes may include a set of components (e.g., kubelet, kube-proxy, cAdvisor) that manage individual nodes, as well as a set of components (e.g., etcd, API server, scheduler, controller manager) that form a control plane. Various controllers (e.g., a replication controller, a DaemonSet controller) may drive the state of a Kubernetes cluster by managing a set of pods, each of which includes one or more containers that are deployed on a single node. Containerized applications may be used to facilitate a serverless, cloud-native computing deployment and management model for software applications. In support of a serverless, cloud-native computing deployment and management model for software applications, containers may be used as part of an event handling mechanism (e.g., AWS Lambda) such that various events cause a containerized application to be spun up to operate as an event handler.
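As a rough illustration of how such controllers drive cluster state, the following sketch models the reconcile-loop pattern (compare desired state with observed state and act to close the gap). The class and function names are hypothetical stand-ins, not Kubernetes API objects.

```python
# Illustrative sketch only: the reconciliation pattern that controllers such as
# a replication controller follow -- compare desired state with observed state
# and act to close the gap.  Names here are hypothetical, not Kubernetes APIs.
from dataclasses import dataclass, field

@dataclass
class ReplicaSetSpec:
    name: str
    desired_replicas: int
    running: list = field(default_factory=list)    # names of currently running pods

def reconcile(spec: ReplicaSetSpec) -> None:
    """One pass of the control loop: start or stop pods to match the spec."""
    while len(spec.running) < spec.desired_replicas:
        pod = f"{spec.name}-{len(spec.running)}"
        spec.running.append(pod)                   # stand-in for scheduling a pod
        print(f"started {pod}")
    while len(spec.running) > spec.desired_replicas:
        print(f"stopped {spec.running.pop()}")     # stand-in for deleting a pod

spec = ReplicaSetSpec("web", desired_replicas=3)
reconcile(spec)              # scales up to 3 pods
spec.desired_replicas = 1
reconcile(spec)              # scales back down to 1 pod
```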
The systems described above may be deployed in a variety of ways, including ways that support fifth generation ('5G') networks. 5G networks may support substantially faster data communications than previous generations of mobile communications networks and, as a consequence, may lead to the disaggregation of data and computing resources, as modern massive data centers may become less prominent and may be replaced, for example, by more local, micro data centers that are close to the mobile network towers. The systems described above may be included in such local, micro data centers and may be part of or paired with multi-access edge computing ('MEC') systems. Such MEC systems may enable cloud computing capabilities and an IT service environment at the edge of the cellular network. By running applications and performing related processing tasks closer to the cellular customer, network congestion may be reduced and applications may perform better. MEC technology is designed to be implemented at the cellular base stations or other edge nodes, and enables flexible and rapid deployment of new applications and services for customers. MEC may also allow cellular operators to open their radio access network ('RAN') to authorized third parties, such as application developers and content providers. Furthermore, edge computing and micro data centers may substantially reduce the cost of smartphones that work with the 5G network, because customers may not need devices with such intensive processing power and the expensive components that would otherwise be necessary.
The reader will appreciate that 5G networks may generate more data than previous generations of networks, especially in view of the fact that the high network bandwidth offered by 5G networks may enable the 5G networks to handle amounts and types of data (e.g., sensor data from self-driving cars, data generated by AR/VR technologies) that were not as feasible for previous generations of networks. In such examples, the scalability offered by the systems described above may be very valuable as the amount of data increases, the adoption of emerging technologies increases, and so on.
For further explanation, FIG. 3C illustrates an exemplary computing device 350 that may be specifically configured to perform one or more of the processes described herein. As shown in FIG. 3C, computing device 350 may include a communication interface 352, a processor 354, a storage device 356, and an input/output ("I/O") module 358 communicatively connected to one another via a communication infrastructure 360. While the exemplary computing device 350 is shown in FIG. 3C, the components illustrated in FIG. 3C are not intended to be limiting. Additional or alternative components may be used in other embodiments. The components of the computing device 350 shown in FIG. 3C will now be described in additional detail.
The communication interface 352 may be configured to communicate with one or more computing devices. Examples of communication interface 352 include, but are not limited to, a wired network interface (e.g., a network interface card), a wireless network interface (e.g., a wireless network interface card), a modem, an audio/video connection, and any other suitable interface.
Processor 354 generally represents any type or form of processing unit capable of processing data and/or interpreting, executing, and/or directing the execution of one or more of the instructions, processes, and/or operations described herein. Processor 354 may perform operations by executing computer-executable instructions 362 (e.g., an application, software, code, and/or other executable data instance) stored in storage device 356.
Storage device 356 may include one or more data storage media, devices, or configurations and may take any type and form of data storage media and/or device, and combinations thereof. For example, storage device 356 may include, but is not limited to, any combination of the non-volatile media and/or volatile media described herein. Electronic data, including data described herein, may be temporarily and/or permanently stored in storage device 356. For example, data representative of computer-executable instructions 362 configured to direct processor 354 to perform any of the operations described herein may be stored within storage device 356. In some examples, data may be arranged in one or more databases residing within storage device 356.
The I/O module 358 may include one or more I/O modules configured to receive user input and provide user output. The I/O module 358 may include any hardware, firmware, software, or combination thereof that supports input and output capabilities. For example, the I/O module 358 may include hardware and/or software for capturing user input, including but not limited to a keyboard or keypad, a touch screen component (e.g., a touch screen display), a receiver (e.g., an RF or infrared receiver), a motion sensor, and/or one or more input buttons.
The I/O module 358 may include one or more devices for presenting output to a user, including but not limited to a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., a display driver), one or more audio speakers, and one or more audio drivers. In certain embodiments, the I/O module 358 is configured to provide graphical data to a display for presentation to a user. The graphical data may represent one or more graphical user interfaces and/or any other graphical content that may serve a particular implementation. In some examples, any of the systems, computing devices, and/or other components described herein may be implemented by computing device 350.
For further explanation, FIG. 3D sets forth a block diagram illustrating a plurality of storage systems (311-402, 311-404, 311-406) supporting bins according to some embodiments of the present disclosure. Although depicted in less detail, the storage systems (311-402, 311-404, 311-406) depicted in FIG. 3D may be similar to the storage systems described above with reference to FIGS. 1A-1D, 2A-2G, 3A-3B, or any combination thereof. In practice, the storage systems (311-402, 311-404, 311-406) depicted in FIG. 3D may include the same, fewer, or additional components than the storage systems described above.
In the example depicted in FIG. 3D, each of the storage systems (311-402, 311-404, 311-406) is depicted as having at least one computer processor (311-408, 311-410, 311-412), computer memory (311-414, 311-416, 311-418), and computer storage (311-420, 311-422, 311-424). While in some embodiments the computer memory (311-414, 311-416, 311-418) and the computer storage (311-420, 311-422, 311-424) may be part of the same hardware devices, in other embodiments the computer memory (311-414, 311-416, 311-418) and the computer storage (311-420, 311-422, 311-424) may be part of different hardware devices. The distinction between the computer memory (311-414, 311-416, 311-418) and the computer storage (311-420, 311-422, 311-424) in this particular example may be that the computer memory (311-414, 311-416, 311-418) is physically proximate to the computer processors (311-408, 311-410, 311-412) and may store computer program instructions that are executed by the computer processors (311-408, 311-410, 311-412), while the computer storage (311-420, 311-422, 311-424) is embodied as non-volatile storage for storing user data, metadata describing the user data, and so on. Referring to the example of FIG. 1A above, for example, the computer processors (311-408, 311-410, 311-412) and computer memory (311-414, 311-416, 311-418) for a particular storage system (311-402, 311-404, 311-406) may reside within one or more of the controllers (110A-110D), while the attached storage devices (171A-171F) may serve as the computer storage (311-420, 311-422, 311-424) within a particular storage system (311-402, 311-404, 311-406).
In the example depicted in FIG. 3D, the depicted storage systems (311-402, 311-404, 311-406) may be attached to one or more bins (311-430, 311-432) in accordance with some embodiments of the present disclosure. Each of the bins (311-430, 311-432) depicted in FIG. 3D may include a data set (311-426, 311-428). For example, the first bin (311-430), to which three storage systems (311-402, 311-404, 311-406) have been attached, includes a first data set (311-426), while the second bin (311-432), to which two storage systems (311-404, 311-406) have been attached, includes a second data set (311-428). In this example, when a particular storage system is attached to a bin, the bin's data set is copied to that particular storage system and then kept up to date as the data set is modified. A storage system may be removed from a bin, with the result that the data set is no longer kept up to date on the removed storage system. In the example depicted in FIG. 3D, any storage system that is active for a bin (that is, an up-to-date, operating, non-faulted member of a non-faulted bin) may receive and process requests to modify or read the bin's data set.
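A minimal sketch of this attach/detach behavior, using hypothetical Python classes rather than any actual storage system interface, might look as follows: attaching copies the bin's data set to the new member, and subsequent writes are mirrored to every attached member so that each local copy stays current.

```python
# Toy model of a bin: attaching copies the data set, writes update every member.
class StorageSystem:
    def __init__(self, name):
        self.name = name
        self.datasets = {}        # bin name -> this system's local copy of the data set

class Bin:
    def __init__(self, name, dataset):
        self.name = name
        self.dataset = dict(dataset)
        self.members = []

    def attach(self, system: StorageSystem):
        system.datasets[self.name] = dict(self.dataset)   # copy the data set to the new member
        self.members.append(system)

    def detach(self, system: StorageSystem):
        self.members.remove(system)                       # its copy is no longer kept current

    def write(self, key, value):
        self.dataset[key] = value
        for member in self.members:                       # keep every attached copy up to date
            member.datasets[self.name][key] = value

bin1 = Bin("bin-430", {"block0": b"old"})
a, b = StorageSystem("A"), StorageSystem("B")
bin1.attach(a)
bin1.attach(b)
bin1.write("block0", b"new")
assert a.datasets["bin-430"] == b.datasets["bin-430"]     # both copies reflect the write
```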
In the example depicted in FIG. 3D, each bin (311-430, 311-432) may also include a set of managed objects and management operations, as well as a set of access operations to modify or read the data set (311-426, 311-428) that is associated with the particular bin (311-430, 311-432). In this example, the management operations may modify or query managed objects equivalently through any of the storage systems. Likewise, access operations to read or modify the data set may operate equivalently through any of the storage systems. In this example, while each storage system stores a separate copy of the data set as a proper subset of the data sets stored and advertised for use by the storage system, the operations to modify managed objects or the data set that are performed and completed through any one storage system are reflected in subsequent management operations to query the bin and in subsequent access operations to read the data set.
Readers will appreciate that bins may implement more capabilities than just a clustered, synchronously replicated data set. For example, bins may be used to implement tenants, whereby data sets are in some way securely isolated from one another. Bins may also be used to implement virtual arrays or virtual storage systems, where each bin is presented as a unique storage entity on a network (e.g., a storage area network or an internet protocol network) with separate addresses. In the case of a multi-storage-system bin implementing a virtual storage system, all of the physical storage systems associated with the bin may present themselves in some way as the same storage system (e.g., as if the multiple physical storage systems were no different than multiple network ports into a single storage system).
Readers will appreciate that a bin may also be a unit of administration, representing a collection of volumes, file systems, object/analytics stores, snapshots, and other administrative entities, where management changes made on any one storage system (e.g., name changes, property changes, managing exports or permissions for some part of the bin's data set) are automatically reflected to all active storage systems associated with the bin. In addition, a bin may also be a unit of data collection and data analysis, where performance and capacity metrics are presented in ways that aggregate across all active storage systems for the bin, that call out data collection and analysis separately for each bin, or perhaps that present each attached storage system's contribution to the incoming content and performance of each bin.
One model for bin membership may be defined as a list of storage systems together with a subset of that list in which the storage systems are considered to be in sync for the bin. A storage system may be considered to be in sync for a bin if it is at least within a recovery of having identical idle content for the last written copy of the data set associated with the bin. Idle content is the content after any in-progress modifications have completed and no new modifications are being processed. Sometimes this is referred to as 'crash recoverable' consistency. Recovery of a bin carries out the process of reconciling differences in applying concurrent updates to the in-sync storage systems in the bin. Recovery may resolve any inconsistencies between the storage systems arising from concurrent modifications that had been requested against the various members of the bin but that were not signaled to any requester as having completed successfully. Storage systems that are listed as bin members but that are not listed as in sync for the bin may be described as 'detached' from the bin. Storage systems that are listed as bin members, that are in sync for the bin, and that are currently available for actively serving data for the bin are 'online' for the bin.
Each storage system member of a bin may have its own copy of the membership, including which storage systems it last knew were in sync and which storage systems it last knew comprised the entire set of bin members. To be online for a bin, a storage system must consider itself to be in sync for the bin and must be communicating with all other storage systems it considers to be in sync for the bin. If a storage system cannot be certain that it is in sync and communicating with all other storage systems that are in sync, then it must stop processing new incoming requests for the bin (or must complete them with an error or exception) until it can be certain that it is in sync and communicating with all other storage systems that are in sync. A first storage system may conclude that a second paired storage system should be detached, which will allow the first storage system to continue since it is now in sync with all storage systems remaining on the list. Care must be taken, however, to ensure that the second storage system does not instead conclude that the first storage system should be detached while both continue to operate, as this would lead to a 'split brain' condition that can result in irreconcilable data sets, data set corruption, or application corruption, among other dangers.
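The online condition described above can be expressed compactly. The following sketch assumes a simple in-memory representation of a storage system's locally stored membership view; it is illustrative only.

```python
# Sketch under assumed data structures: each storage system keeps its own view of
# which bin members exist and which are in sync, and may only come online for the
# bin if it believes it is in sync and can reach every other in-sync member.
def can_go_online(self_name, my_view, reachable):
    """my_view: {'members': set, 'in_sync': set}; reachable: set of reachable systems."""
    if self_name not in my_view["in_sync"]:
        return False                              # believes itself out of date: must wait
    others = my_view["in_sync"] - {self_name}
    return others <= reachable                    # must talk to every other in-sync member

view = {"members": {"A", "B", "C"}, "in_sync": {"A", "B"}}
print(can_go_online("A", view, reachable={"B"}))  # True: the other in-sync member is reachable
print(can_go_online("A", view, reachable=set()))  # False: must wait, go offline, or safely detach B
```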
Situations in which a storage system must determine how to proceed when it is not communicating with a paired storage system can arise while the storage system is running normally and then notices lost communications, while it is recovering from some previous fault, while it is rebooting or resuming from a temporary power loss or a recovered communications outage, while it is switching operations from one set of storage system controllers to another for whatever reason, or during or after any combination of these or other kinds of events. In fact, any time a storage system that is associated with a bin cannot communicate with all known non-detached members, the storage system may either wait briefly until communications can be established, go offline and continue waiting, or determine in some way that it is safe to detach the non-communicating storage system without the risk of incurring a split brain resulting from the non-communicating storage system concluding the alternative view, and then continue. If a safe detach can happen quickly enough, the storage system can remain online for the bin with little more than a short delay and with no resulting application outages for applications that can issue requests to the remaining online storage systems.
One example of such a situation is when a storage system may know that it is out of date. This can happen, for example, when a first storage system is first added to a bin that is already associated with one or more storage systems, or when a first storage system reconnects to another storage system and finds that the other storage system had marked the first storage system as detached. In this case, the first storage system will simply wait until it connects to some other set of storage systems that are in sync for the bin.
This model demands some degree of consideration for how storage systems are added to or removed from bins or from the in-sync bin member list. Since each storage system will have its own copy of the list, and since two independent storage systems cannot update their local copies at exactly the same time, and since the local copy is all that is available on a reboot or in various fault scenarios, care must be taken to ensure that transient inconsistencies do not cause problems. For example, suppose one storage system is in sync for a bin and a second storage system is added. If the second storage system is updated to list both storage systems as in sync first, and both storage systems then fail and restart, the second storage system may start up and wait to connect to the first storage system, while the first storage system may be unaware that it should or could wait for the second storage system. If the second storage system then responds to an inability to connect with the first storage system by going through a process to detach it, the second storage system may succeed in completing a process that the first storage system is unaware of, resulting in a split brain. As such, it may be necessary to ensure that storage systems do not disagree inappropriately on whether they might opt to go through a detach process if they are not communicating.
One way to ensure that storage systems do not disagree inappropriately on whether they might opt to go through a detach process if they are not communicating is to ensure that, when adding a new storage system to the in-sync member list for a bin, the new storage system first stores that it is a detached member (and perhaps that it is being added as an in-sync member). The existing in-sync storage systems may then locally store that the new storage system is an in-sync bin member before the new storage system locally stores that same fact. If there is a set of reboots or network outages before the new storage system stores its in-sync status, the original storage systems may detach the new storage system due to non-communication, but the new storage system will wait. Removing a communicating storage system from a bin may similarly proceed as the reverse of this change: first, the removed storage system stores that it is no longer in sync, then the storage systems that will remain store that the removed storage system is no longer in sync, and then all storage systems delete the removed storage system from their bin membership lists. Depending on the implementation, an intermediate persisted detached state may not be necessary. Whether such care is required for locally stored copies of the membership lists may depend on the model the storage systems use to monitor each other or to validate their membership. If a consensus model is used for both, or if an external system (or an external distributed or clustered system) is used to store and validate bin membership, then inconsistencies in locally stored membership lists may not matter.
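The ordering argument above can be sketched as a small sequence of persistence steps. The function names and states below are illustrative assumptions; the point is only that the new member records an in-sync state last, so a crash at any intermediate step leaves at most one side willing to detach the other.

```python
# Hedged sketch of the add-member ordering (names are illustrative): the new system
# persists itself as attaching (not in sync) first, existing in-sync members persist
# the addition next, and only then does the new system persist that it is in sync.
def add_member(existing_systems, new_system, persist):
    persist(new_system, state="attaching")         # step 1: new system records it is NOT yet in sync
    for system in existing_systems:                # step 2: current in-sync members record the new member
        persist(system, new_in_sync_member=new_system)
    persist(new_system, state="in_sync")           # step 3: new system may now consider itself in sync

def persist(system, **update):
    print(f"durably store on {system}: {update}")  # stand-in for writing local membership metadata

add_member(["A"], "B", persist)
```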
When communications fail, or one or several storage systems in a bin fail, or when a storage system starts up (or fails over to a secondary controller) and cannot communicate with the paired storage systems for a bin, and it is time for one or more storage systems to decide to detach one or more paired storage systems, some algorithm or mechanism must be employed for deciding that it is safe to do so and for following through on the detach. One means of resolving detaches is to use a majority (or quorum) model for membership. With three storage systems, as long as two are communicating, the two can agree to detach a third storage system that is not communicating, but that third storage system cannot by itself choose to detach either of the other two. Confusion can arise when storage system communication is inconsistent. For example, storage system A might be communicating with storage system B but not with C, while storage system B might be communicating with both A and C. So, A and B could detach C, or B and C could detach A, but more communication between bin members may be needed to figure this out.
Care needs to be taken in a quorum membership model when adding and removing storage systems. For example, if a fourth storage system is added, then a 'majority' of storage systems is at that point three. The transition from three storage systems (with two constituting a majority) to a bin including a fourth storage system (with three constituting a majority) may require something similar to the model described previously for carefully adding a storage system to the in-sync list. For example, the fourth storage system might start in an attaching state but not yet be attached, in which case it would never instigate a vote over quorum. Once in that state, the original three bin members may each be updated to be aware of the fourth member and of the new requirement for a majority of three storage systems to detach the fourth. Removing a storage system from a bin might similarly move that storage system to a locally stored 'detaching' state before updating the other bin members. A variation on this is to use a distributed consensus mechanism (e.g., PAXOS or RAFT) to implement any membership changes or to process detach requests.
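A majority-based detach decision reduces to a simple counting rule, sketched below with toy data structures. Real implementations would layer this onto persisted membership state or a consensus protocol such as RAFT or PAXOS, as noted above.

```python
# Simplified sketch of majority-based detach: a storage system may only detach
# unreachable peers if the set it can still reach (including itself) is a strict
# majority of the bin's membership.  Purely illustrative.
def may_detach_unreachable(members, reachable_including_self):
    return len(reachable_including_self) * 2 > len(members)

members = {"A", "B", "C"}
print(may_detach_unreachable(members, {"A", "B"}))   # True: two of three may detach C
print(may_detach_unreachable(members, {"C"}))        # False: C alone must wait or go offline
```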
A further means of managing membership transitions is to use an external system, outside of the storage systems themselves, to handle bin membership. In order to become online for a bin, a storage system must first contact the external bin membership system to verify that it is in sync for the bin. Any storage system that is online for a bin should then remain in communication with the bin membership system and should wait or go offline if it loses communication. The external bin membership manager may be implemented as a highly available cluster using various cluster tools, such as Oracle RAC, Linux HA, Veritas Cluster Server, IBM HACMP, or others. The external bin membership manager may also use distributed configuration tools such as etcd or Zookeeper, or a reliable distributed database such as Amazon's DynamoDB.
In the example depicted in FIG. 3D, the depicted storage systems (311-402, 311-404, 311-406) may receive a request to read a portion of the data set (311-426, 311-428) and process the request to read the portion of the data set locally, in accordance with some embodiments of the present disclosure. Readers will appreciate that although requests to modify (e.g., write to) the data set (311-426, 311-428) require coordination among the storage systems (311-402, 311-404, 311-406) in a bin, responding to a request to read a portion of the data set (311-426, 311-428) does not require similar coordination among the storage systems (311-402, 311-404, 311-406), because the data set (311-426, 311-428) should be consistent across all of the storage systems (311-402, 311-404, 311-406) in the bin. As such, a particular storage system that receives a read request may service the read request locally by reading the portion of the data set (311-426, 311-428) that is stored within the storage system's storage devices, with no synchronous communication with the other storage systems in the bin. Read requests received by one storage system for a replicated data set in a replicated cluster are expected to avoid any communication in the vast majority of cases, at least when received by a storage system that is running within a cluster that is also running nominally. Such reads should normally be processed simply by reading from the local copy of the clustered data set, with no further interaction required with the other storage systems in the cluster.
The reader will appreciate that the storage systems may take steps to ensure read consistency such that a read request will return the same result regardless of which storage system processes the read request. For example, the resulting clustered data set content for any set of updates received by any set of storage systems in the cluster should be consistent across the cluster, at least at any time updates are idle (all previous modifying operations have been indicated as complete and no new update requests have been received and processed in any way). More specifically, the instances of the clustered data set across the set of storage systems can differ only as a result of updates that have not yet completed. This means, for example, that any two write requests that overlap in their volume block range, or any combination of a write request and an overlapping snapshot, compare-and-write, or virtual block range copy, must yield a consistent result on all copies of the data set. Two operations should not yield a result as if they happened in one order on one storage system and in a different order on another storage system in the replicated cluster.
Furthermore, read requests may be made time order consistent. For example, if one read request is received on a replicated cluster and completed, and that read is then followed by another read request to an overlapping address range that is received by the replicated cluster, and one or both reads in any way overlap in time and volume address range with a modification request received by the replicated cluster (whether any of the reads or modifications are received by the same storage system or by different storage systems in the replicated cluster), then if the first read reflects the result of the update, the second read should also reflect the result of that update, rather than possibly returning data that preceded the update. If the first read does not reflect the update, then the second read may or may not reflect the update. This ensures that the 'time' of a data segment cannot roll backward between two read requests.
In the example depicted in FIG. 3D, the depicted storage systems (311-402, 311-404, 311-406) may also detect a disruption in data communications with one or more of the other storage systems and determine whether the particular storage system should remain in the bin. A disruption in data communications with one or more of the other storage systems may occur for a variety of reasons. For example, a disruption in data communications with one or more of the other storage systems may occur because one of the storage systems has failed, because a network interconnect has failed, or for some other reason. An important aspect of synchronous replicated clustering is ensuring that any fault handling does not result in unrecoverable inconsistencies, or in any inconsistency in responses. For example, if a network fails between two storage systems, at most one of the storage systems may continue processing newly incoming I/O requests for the bin. And if one storage system continues processing, the other storage system cannot process any new requests to completion, including read requests.
In the example depicted in FIG. 3D, the depicted storage systems (311-402, 311-404, 311-406) may also determine whether the particular storage system should remain in the bin in response to detecting a disruption in data communications with one or more of the other storage systems. As mentioned above, to be 'online' as part of a bin, a storage system must consider itself to be in sync for the bin and must be communicating with all other storage systems it considers to be in sync for the bin. If a storage system cannot be certain that it is in sync and communicating with all other storage systems that are in sync, then it may stop processing new incoming requests to access the data set (311-426, 311-428). A storage system may therefore determine whether the particular storage system should remain online as part of the bin, for example, by determining whether it can communicate with all of the other storage systems it considers to be in sync for the bin (e.g., via one or more test messages), by determining whether all of the other storage systems it considers to be in sync for the bin also consider the storage system to be attached to the bin, through a combination of both steps in which the particular storage system must confirm that it can communicate with all of the other storage systems it considers to be in sync for the bin and that all of those other storage systems also consider the storage system to be attached to the bin, or through some other mechanism.
In the example depicted in FIG. 3D, the depicted storage systems (311-402, 311-404, 311-406) may also keep the data set on the particular storage system accessible for management and data set operations in response to determining that the particular storage system should remain in the bin. The storage system may keep the data set on the particular storage system accessible for management and data set operations, for example, by accepting and processing requests to access the version of the data set (311-426, 311-428) that is stored on the storage system, by accepting and processing management operations associated with the data set (311-426, 311-428) that are issued by a host or an authorized administrator, by accepting and processing management operations associated with the data set (311-426, 311-428) that are issued by one of the other storage systems, or in some other way.
In the example depicted in FIG. 3D, however, the depicted storage systems (311-402, 311-404, 311-406) may make the data set on the particular storage system inaccessible for management and data set operations in response to determining that the particular storage system should not remain in the bin. The storage system may make the data set (311-426, 311-428) on the particular storage system inaccessible for management and data set operations, for example, by rejecting requests to access the version of the data set (311-426, 311-428) that is stored on the storage system, by rejecting management operations associated with the data set (311-426, 311-428) that are issued by a host or other authorized administrator, by rejecting management operations associated with the data set (311-426, 311-428) that are issued by one of the other storage systems in the bin, or in some other way.
In the example depicted in FIG. 3D, the depicted storage systems (311-402, 311-404, 311-406) may also detect that the disruption in data communications with one or more of the other storage systems has been repaired and, in response, make the data set on the particular storage system accessible for management and data set operations. The storage system may detect that the disruption in data communications with one or more of the other storage systems has been repaired, for example, by receiving a message from one or more of the other storage systems. In response to detecting that the disruption in data communications with one or more of the other storage systems has been repaired, the storage system may make the data set (311-426, 311-428) on the particular storage system accessible for management and data set operations once the previously detached storage system has been resynchronized with the storage systems that remained attached to the bin.
In the example depicted in FIG. 3D, the depicted storage systems (311-402, 311-404, 311-406) may also go offline from the bin such that the particular storage system no longer allows management and data set operations. The depicted storage systems (311-402, 311-404, 311-406) may go offline from the bin such that the particular storage system no longer allows management and data set operations for a variety of reasons. For example, the depicted storage systems (311-402, 311-404, 311-406) may go offline from the bin due to some fault with the storage system itself, because an update or some other maintenance is occurring on the storage system, due to communications faults, or for many other reasons. In such an example, the depicted storage systems (311-402, 311-404, 311-406) may subsequently update the data set on the particular storage system to include all updates to the data set that were made while the particular storage system was offline, and go back online with the bin such that the particular storage system again allows management and data set operations, as will be described in greater detail in the resynchronization sections included below.
In the example depicted in FIG. 3D, the depicted storage systems (311-402, 311-404, 311-406) may also identify a target storage system for asynchronously receiving the data set, where the target storage system is not one of the plurality of storage systems across which the data set is synchronously replicated. Such a target storage system may represent, for example, a backup storage system, some storage system that makes use of the synchronously replicated data set, and so on. In fact, synchronous replication can be leveraged to distribute copies of the data set closer to some racks of servers for better local read performance. One such case is smaller top-of-rack storage systems symmetrically replicated to larger storage systems that are centrally located in the data center or campus, where those larger storage systems are managed more carefully for reliability or are connected to external networks for asynchronous replication or backup services.
In the example depicted in FIG. 3D, the depicted storage systems (311-402, 311-404, 311-406) may also identify portions of the data set that were not asynchronously copied to the target storage system by any of the other storage systems, and asynchronously copy portions of the data set that were not asynchronously copied to the target storage system by any of the other storage systems to the target storage system, where two or more storage systems together copy the entire data set to the target storage system. In this way, the work associated with asynchronously replicating a particular data set can be split among the members of the bin such that each storage system in the bin is responsible for asynchronously replicating only a subset of the data set to the target storage system.
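One possible (purely illustrative) way to divide the asynchronous replication work deterministically among bin members is sketched below: each member computes a disjoint slice of the data set keys that it is responsible for shipping to the target storage system, so that together the members transmit the whole data set.

```python
# Illustrative sketch of dividing asynchronous replication work: each in-sync
# member of the bin takes responsibility for a disjoint slice of the data set.
def my_replication_slice(all_keys, members, me):
    members = sorted(members)
    my_index = members.index(me)
    # deterministic partition: the i-th key (in sorted order) belongs to member i mod N
    return [k for i, k in enumerate(sorted(all_keys)) if i % len(members) == my_index]

keys = [f"block{i}" for i in range(10)]
for system in ("A", "B", "C"):
    print(system, my_replication_slice(keys, ["A", "B", "C"], system))
```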
In the example depicted in FIG. 3D, the depicted storage systems (311-402, 311-404, 311-406) may also detach from the bin, such that the particular storage system that has detached from the bin is no longer included in the set of storage systems across which the data set is synchronously replicated. For example, if storage system (311-404) in FIG. 3D detached from the bin (311-430) illustrated in FIG. 3D, the bin (311-430) would include only storage systems (311-402, 311-406) as the storage systems across which the data set (311-426) that is included in the bin (311-430) would be synchronously replicated. In such an example, detaching the storage system from the bin may also include removing the data set from the particular storage system that detached from the bin. Continuing with the example in which the storage system (311-404) in FIG. 3D detached from the bin (311-430) illustrated in FIG. 3D, the data set (311-426) that is included in the bin (311-430) may be deleted or otherwise removed from the storage system (311-404).
The reader will appreciate that there are a number of unique administrative capabilities enabled by the bin model that may further be supported. Moreover, the bin model itself introduces some issues that may be addressed by an implementation. For example, when a storage system is offline for a bin but is otherwise running, such as because an interconnect failed and another storage system for the bin won out in mediation, there may still be a desire or need to access the offline bin's data set on the offline storage system. One solution may be simply to enable the bin in some detached mode and allow the data set to be accessed. However, that solution can be dangerous and can cause the bin's metadata and data to be much more difficult to reconcile when the storage systems do regain communication. Furthermore, there could still be a separate path for hosts to access the offline storage system as well as the storage systems that are still online. In that case, a host might issue I/O to both storage systems even though they are no longer being kept in sync, because the host sees target ports reporting volumes with the same identifiers and the host I/O drivers presume that they see additional paths to the same volume. This can result in fairly damaging data corruption, as reads and writes issued to both storage systems are no longer consistent even though the host presumes they are. As a variant of this case, in a clustered application, such as a shared storage clustered database, the clustered application running on one host might be reading or writing to one storage system while the same clustered application running on another host might be reading or writing to the 'detached' storage system, yet the two instances of the clustered application are communicating with each other on the presumption that the data set each sees is entirely consistent for completed writes. Since the data sets are not consistent, that presumption is violated and the application's data set (e.g., the database) can quickly end up being corrupted.
One way to solve both of these problems is to allow an offline bin, or perhaps a snapshot of an offline bin, to be copied to a new bin with new volumes that have sufficiently new identities that host I/O drivers and clustered applications will not confuse the copied volumes with the same volumes that are still online on another storage system. Since each bin maintains a complete copy of the data set, which is crash consistent but perhaps slightly different from the copy of the bin data set on another storage system, and since each bin has an independent copy of all the data and metadata needed to operate on the bin contents, it is a straightforward problem to make a virtual copy of some or all volumes or snapshots in the bin to new volumes in a new bin. In a logical extent map implementation, for example, all that is needed is to define new volumes in the new bin that reference the logical extent maps from the copied bin associated with the bin's volumes or snapshots, with the logical extent maps being marked as copy-on-write. The new volumes should be treated as new volumes, similarly to how volume snapshots copied to new volumes might be implemented. The volumes may have the same administrative names, though within the new bin's namespace. But they should have different underlying identifiers and different logical unit identifiers from the original volumes.
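The copy-on-write idea behind such a virtual copy can be sketched with a toy extent map. The structures below are hypothetical simplifications: the copied volume initially shares the source volume's mapping and only diverges when either side is written, while carrying its own new identity.

```python
# Minimal sketch of copy-on-write extent maps: a new volume in the new bin
# initially shares the source volume's logical extent map and only diverges
# when either side is written.  Purely illustrative structures.
class Volume:
    def __init__(self, name, extent_map):
        self.name = name
        self.extent_map = extent_map        # logical extent -> physical extent id
        self.shared = True                  # marked copy-on-write

    def write(self, logical_extent, new_physical):
        if self.shared:
            self.extent_map = dict(self.extent_map)   # break sharing on first write
            self.shared = False
        self.extent_map[logical_extent] = new_physical

original = Volume("vol1", {0: "P0", 1: "P1"})
copy = Volume("vol1-copy", original.extent_map)        # virtual copy into the new bin
copy.write(1, "P9")                                    # only the copy's mapping changes
print(original.extent_map[1], copy.extent_map[1])      # prints: P1 P9
```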
In some cases, it may be possible to use virtual network isolation techniques (e.g., by creating a virtual LAN in the case of IP networks or a virtual SAN in the case of fibre channel networks) in such a way that volumes presented to some interfaces can be assured to be inaccessible from the host network interfaces or host SCSI initiator ports that might also see the original volumes. In such cases, it may be safe to provide the copies of the volumes with the same SCSI or other storage identifiers as the original volumes. This could be used, for example, in cases where the applications expect to see a particular set of storage identifiers in order to function without an undue burden of reconfiguration.
Some of the techniques described herein could also be used outside of an active fault context to test readiness for handling faults. Readiness testing (sometimes referred to as a 'fire drill') is commonly required for disaster recovery configurations, where frequent and repeated testing is considered necessary to ensure that most or all aspects of the disaster recovery plan are correct and account for any recent changes to applications, data sets, or equipment. Readiness testing should be non-disruptive to current production operations, including replication. In many cases the real operations cannot actually be invoked on the active configuration, but a good way to get close is to use storage operations to make copies of the production data sets, and then perhaps to couple that with the use of virtual networking to create an isolated environment containing all of the data needed for the important applications that must be brought up successfully in the case of a disaster. Making such a copy of a synchronously replicated (or even an asynchronously replicated) data set available within a site (or collection of sites) that is expected to perform a disaster recovery readiness test procedure, and then starting the important applications on that data set to make sure they can start and function, is a great tool, since it helps ensure that no important parts of the application data sets were left out of the disaster recovery plan. If necessary, and practical, this could be coupled with virtual isolated networks, perhaps together with isolated collections of physical or virtual machines, to get as close as possible to a real-world disaster recovery takeover scenario. Virtually copying a bin (or set of bins) to another bin as a point-in-time image of the bin data sets immediately creates an isolated data set that contains all of the copied elements, which can then be operated on essentially identically to the original bins, while also allowing isolation to a single site (or a few sites) separately from the original bins. Further, these are fast operations and they can be torn down and repeated easily, allowing testing to be repeated as often as is desired.
Some enhancements could be made to get even further toward perfect disaster recovery testing. For example, in conjunction with isolated networks, SCSI logical unit identities or other types of identities could be copied into the target bin so that the test servers, virtual machines, and applications see the same identities. Further, the administrative environment of the servers could be configured to respond to requests and operations on the original bin names from within the particular virtual set of virtual networks, so that scripts do not require the use of test variants with alternate 'test' versions of object names. A further enhancement can be used in cases where the host-side server infrastructure that would take over in the case of a disaster can be used during the test. This includes cases where a disaster recovery data center is completely stocked with alternative server infrastructure that generally would not be used until directed to do so by a disaster. It also includes cases where that infrastructure might be used for non-critical operations (e.g., running analytics on production data, or other functions that support only application development or that may be important but can be halted if more critical functions are needed). Specifically, host definitions and configurations, and the server infrastructure that will use them, can be set up as they would be for an actual disaster recovery takeover event and tested as part of disaster recovery takeover testing, with the volumes being tested connected to these host definitions from the virtual bin copies used to provide the snapshots of the data sets. From the standpoint of the storage systems involved, these host definitions and configurations used for testing, and the volume-to-host connection configurations used during testing, can then be reused when an actual disaster takeover event is triggered, greatly minimizing the configuration differences between the test configuration and the real configuration that would be used in the case of a disaster recovery takeover.
In some cases, it may make sense to move volumes out of a first bin and into a new second bin that includes just those volumes. The bin membership and the high-availability and recovery characteristics can then be adjusted separately, and the administration of the two resulting bin data sets can then be isolated from each other. An operation that can be done in one direction may also be done in the other direction. At some point, it may make sense to take two bins and merge them into one, so that the volumes in each of the original two bins will now track each other for storage system membership, high availability, and recovery characteristics and events. Both operations can be accomplished safely, and with reasonably minimal or no disruption to running applications, by relying on the characteristics suggested for changing mediation or quorum properties for a bin, as discussed in an earlier section. With mediation, for example, the mediator for a bin may be changed using a sequence consisting of a step in which each storage system in the bin is changed to depend on both a first mediator and a second mediator, followed by a step in which each is changed to depend only on the second mediator. If a fault occurs in the middle of the sequence, some storage systems may depend on both the first mediator and the second mediator, but in no case will recovery and fault handling result in some storage systems depending only on the first mediator and other storage systems depending only on the second mediator. Quorum may be handled similarly by temporarily depending on winning against both a first quorum model and a second quorum model in order to proceed to recovery. This may result in a very short time period in which the availability of the bin in the face of faults depends on additional resources, thus reducing potential availability, but that time period is very short and the reduction in availability is often very small. With mediation, if the change in mediator parameters is nothing more than a change in the key used for mediation and the mediation service used is the same, then the potential reduction in availability is even less, since it now depends only on two calls to the same service versus one call to that service, and not on separate calls to two separate services.
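The two-step mediator change can be sketched as follows. The persistence function is a stand-in for durably recording each storage system's high-availability dependencies; the ordering guarantees that a fault mid-sequence never leaves some members depending only on the first mediator while others depend only on the second.

```python
# Sketch of the two-phase mediator change: every in-sync member first depends on
# BOTH mediators, and only after all members have done so does each switch to
# depending on the second mediator alone.  Names are illustrative placeholders.
def change_mediator(members, first, second, persist):
    for m in members:
        persist(m, depends_on={first, second})     # phase 1: depend on both mediators
    for m in members:
        persist(m, depends_on={second})            # phase 2: drop the first mediator

def persist(member, depends_on):
    print(f"{member} now requires winning: {sorted(depends_on)}")

change_mediator(["A", "B"], "mediator-1", "mediator-2", persist)
```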
Readers will note that changing the quorum model may be quite complex. An additional step may be necessary in which storage systems will participate in the second quorum model but will not yet depend on winning in that second quorum model, which is then followed by the step of also depending on the second quorum model. This may be necessary to account for the fact that if only one system has processed the change to depend on the new quorum model, it will never win quorum since there will never be a majority. With this model in place for properly changing the high-availability parameters (mediation relationship, quorum model, takeover preferences), we can create a safe procedure for these operations to split a bin into two or to merge two bins into one. This may require adding one other capability: linking a second bin to a first bin for high availability, such that if the two bins include compatible high-availability parameters, the second bin linked to the first bin can depend on the first bin for determining and instigating detach-related processing and operations, offline and in-sync states, and recovery and resynchronization actions.
To split a bin into two, which is an operation to move some volumes into a newly created bin, a distributed operation may be formed that could be described as: form a second bin into which a set of volumes previously in a first bin will be moved, copy the high-availability parameters from the first bin into the second bin to ensure that they are compatible for linking, and link the second bin to the first bin for high availability. This operation may be encoded as messages and should be implemented by each storage system in the bin in such a way that the storage system ensures that the operation happens completely on that storage system, or does not happen at all if processing is interrupted by a fault. Once all in-sync storage systems for the two bins have processed this operation, the storage systems can then process a subsequent operation that changes the second bin so that it is no longer linked to the first bin. As with other changes to the high-availability characteristics of a bin, this involves first having each in-sync storage system change to rely on both the previous model (that model being high availability linked to the first bin) and the new model (that model being its own, now independent, high availability). In the case of mediation or quorum, this means that storage systems that have processed this change will first depend on mediation or quorum, as appropriate, as implemented for the first bin, and will additionally depend on a new separate mediation (e.g., a new mediation key) or quorum implemented for the second bin, before the second bin can proceed following a fault that requires mediation or testing for quorum. As with the previous description of changing quorum models, an intermediate step may set storage systems to participate in quorum for the second bin before the step in which storage systems participate in and depend on quorum for the second bin. Once all of the in-sync storage systems have processed the change to depend on the new parameters for mediation or quorum for both the first bin and the second bin, the split is complete.
Merging a second bin into a first bin operates essentially in reverse. First, the second bin must be adjusted to be compatible with the first bin, by having an identical list of storage systems and by having a compatible high-availability model. This may involve some set of steps, such as those described elsewhere herein, to add or remove storage systems or to change the mediator and quorum models. Depending on the implementation, it may be necessary only to reach an identical list of storage systems. Merging proceeds by processing an operation on each in-sync storage system to link the second bin to the first bin for high availability. Each storage system processing that operation will then depend on the first bin for high availability and then on the second bin for high availability. Once all in-sync storage systems for the second bin have processed that operation, the storage systems will each process a subsequent operation to eliminate the link between the second bin and the first bin, migrate the volumes from the second bin into the first bin, and delete the second bin. Host or application data set access can be preserved throughout these operations, as long as the implementation allows host or application data set modification or read operations to be properly directed to the volumes by identity, and as long as the identity is preserved as appropriate to the storage protocol or storage model (e.g., in the case of SCSI, as long as the logical unit identifiers of the volumes and the use of target ports for accessing the volumes are preserved).
Migrating a volume between bins may present issues. If the bins have an identical set of in-sync membership storage systems, then it may be straightforward: temporarily suspend operations on the volumes being migrated, switch control over operations on those volumes to the software and structures controlling the new bin, and then resume operations. This allows for a seamless migration with continuous uptime for applications, apart from the very brief operation suspension, provided the network and ports migrate properly between the bins. Depending on the implementation, suspending operations may not even be necessary, or may be so internal to the system that the suspension of operations has no impact. Copying volumes between bins with different in-sync membership sets is more of a problem. If the target bin for the copy has a subset of the in-sync members of the source bin, this is not much of a problem: the dropped member storage systems can be dropped safely enough without having to do more work. But if the target bin adds in-sync member storage systems to the volume beyond those of the source bin, then the added storage systems must be synchronized to include the volume's contents before they can be used. Until synchronized, this leaves the copied volumes distinctly different from the already synchronized volumes, in that fault handling differs and request handling from the not-yet-synced member storage systems either will not work or must be forwarded, or will not be as fast because reads will have to traverse the interconnect. Also, the internal implementation will have to handle some volumes being in sync and ready for fault handling while other volumes are not yet in sync.
There are further problems relating to the reliability of the operation in the face of faults. Coordinating a migration of volumes between bins spanning multiple storage systems is a distributed operation. If bins are the unit of fault handling and recovery, and if mediation or quorum or other means are used to avoid split-brain situations, then switching volumes from one bin, with its particular set of state, configurations, and relationships for fault handling, recovery, mediation, and quorum, to another bin means that the storage systems in the bins must carefully coordinate the changes related to that handling for any volume. Operations cannot be atomically distributed between storage systems, but must be staged in some way. Mediation and quorum models essentially provide bins with the tools for implementing distributed transactional atomicity, but this may not extend to inter-bin operations without adding to the implementation.
Consider even a simple migration of a volume from a first bin to a second bin, even for two bins that share the same first and second storage systems. At some point the storage systems will coordinate to define that the volume is now in the second bin and is no longer in the first bin. If there is no inherent mechanism for transactional atomicity across the storage systems for the two bins, then a naive implementation could leave the volume in the first bin on the first storage system and in the second bin on the second storage system at the time of a network fault that results in fault handling to detach storage systems from the two bins. If the bins separately determine which storage system succeeds in detaching the other, then the result could either be that the same storage system detaches the other storage system for both bins, in which case the result of the volume migration recovery should be consistent, or it could result in different storage systems detaching each other for the two bins. If the first storage system detaches the second storage system for the first bin and the second storage system detaches the first storage system for the second bin, then recovery might result in the volume being recovered into the first bin on the first storage system and into the second bin on the second storage system, with the volume then running and exported to hosts and storage applications on both storage systems. If instead the second storage system detaches the first storage system for the first bin and the first storage system detaches the second storage system for the second bin, then recovery might result in the volume being discarded from the second bin by the first storage system and the volume being discarded from the first bin by the second storage system, resulting in the volume disappearing entirely. If the bins that a volume is being migrated between are on differing sets of storage systems, then things can get even more complicated.
A solution to these problems may be to use an intermediate bin along with the techniques described previously for splitting and merging bins. This intermediate bin may never be presented as a visible managed object associated with the storage systems. In this model, the volumes to be moved from a first bin to a second bin are first split from the first bin into a new intermediate bin using the split operation described previously. The storage system members for the intermediate bin can then be adjusted to match the membership of the second bin, by adding or removing storage systems from the bin as necessary. Subsequently, the intermediate bin can be merged with the second bin.
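A sketch of the intermediate-bin technique, using toy dictionaries in place of real bin state, is shown below. The three steps mirror the description above: split the volumes out, adjust membership to match the target, then merge.

```python
# Hedged sketch with toy data structures (real bins carry far more state):
# split the volumes into a hidden intermediate bin, adjust its membership to
# match the target bin, then merge the intermediate bin into the target.
def migrate_volumes(volumes, source_bin, target_bin):
    # step 1: split -- move the volumes into a new intermediate bin that is
    # initially linked to the source bin's high-availability parameters
    intermediate = {"volumes": set(volumes), "members": set(source_bin["members"])}
    source_bin["volumes"] -= set(volumes)
    # step 2: adjust membership of the intermediate bin to match the target bin
    intermediate["members"] = set(target_bin["members"])
    # step 3: merge the intermediate bin into the target bin and discard it
    target_bin["volumes"] |= intermediate["volumes"]

src = {"volumes": {"v1", "v2", "v3"}, "members": {"A", "B"}}
dst = {"volumes": {"v9"}, "members": {"B", "C"}}
migrate_volumes({"v2"}, src, dst)
print(src["volumes"], dst["volumes"])   # v2 has moved from the source bin to the target bin
```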
For further explanation, FIG. 3E sets forth a flow chart illustrating an example method for servicing I/O operations directed to a data set (311-42) synchronized across multiple storage systems (311-38, 311-40) according to some embodiments of the present disclosure. Although depicted in less detail, the storage systems (311-38, 311-40) depicted in FIG. 3E may be similar to the storage systems described above with reference to FIGS. 1A-1D, 2A-2G, 3A-3B, or any combination thereof. In practice, the storage system depicted in FIG. 3E may include the same, fewer, or additional components than the storage system described above.
The data set (311-42) depicted in FIG. 3E may be embodied, for example, as the contents of a particular volume, as particular shared contents of a volume, or as any other collection of one or more data elements. The data set (311-42) may be synchronized across the multiple storage systems (311-38, 311-40) such that each storage system (311-38, 311-40) retains a local copy of the data set (311-42). In the examples described herein, such a data set (311-42) is synchronously replicated across the storage systems (311-38, 311-40) in such a way that the data set (311-42) can be accessed through any of the storage systems (311-38, 311-40) with performance characteristics such that, at least as long as the cluster and the particular storage system being accessed are operating nominally, access through one storage system in the cluster is not substantially better than access through any other storage system in the cluster. In such systems, modifications to the data set (311-42) should be made to the copy of the data set that resides on each storage system (311-38, 311-40) in such a way that accessing the data set (311-42) on any storage system (311-38, 311-40) will yield consistent results. For example, a write request issued to the data set must be serviced on all storage systems (311-38, 311-40) that are nominally running at the start of the write and that remain nominally running until the write completes. Likewise, some groups of operations (e.g., two write operations directed to the same location within the data set) must be performed in the same order, or other steps must be taken on all storage systems (311-38, 311-40), as described in more detail below, so that the data set ends up identical on all storage systems (311-38, 311-40). Modifications to the data set (311-42) need not be made at exactly the same time, but some actions (e.g., acknowledging completion of a write request directed to the data set, or enabling read access to locations within the data set that are targeted by write requests that have not yet completed on both storage systems) may need to be delayed until the copy of the data set on each storage system (311-38, 311-40) has been modified.
In the example method depicted in FIG. 3E, designating one storage system (311-40) as the 'leader' and another storage system (311-38) as the 'follower' may refer to the respective relationship of each storage system for the purpose of synchronously replicating a particular data set across the storage systems. In this example, and as will be described in more detail below, the leader storage system (311-40) may be responsible for performing some processing of an incoming I/O operation and passing the resulting information to the follower storage system (311-38), or for performing other tasks that are not required of the follower storage system (311-38). The leader storage system (311-40) may be responsible for performing such tasks for all incoming I/O operations, or alternatively, the leader-follower relationship may apply only to a subset of the I/O operations received by either storage system. For example, the leader-follower relationship may be specific to I/O operations directed to a first volume, a first group of volumes, a first group of logical addresses, a first group of physical addresses, or some other logical or physical descriptor. In this way, a first storage system may act as the leader storage system for I/O operations directed to a first set of volumes (or other descriptors), while a second storage system may act as the leader storage system for I/O operations directed to a second set of volumes (or other descriptors). The example method depicted in FIG. 3E illustrates an embodiment in which synchronizing the plurality of storage systems (311-38, 311-40) occurs in response to the leader storage system (311-40) receiving a request (311-04) to modify the data set (311-42), but synchronizing the plurality of storage systems (311-38, 311-40) may also be carried out in response to the follower storage system (311-38) receiving a request (311-04) to modify the data set (311-42), as will be described in more detail below.
The example method depicted in FIG. 3E includes receiving (311-06) a request (311-04) by a leader storage system (311-40) to modify a data set (311-42). The request (311-04) to modify the data set (311-42) may be embodied, for example, as a request to write data to a location of data contained within the data set (311-42) within the storage system (311-40), a request to write data to a volume containing data contained within the data set (311-42), a request to take a snapshot of the data set (311-42), a virtual range copy, an UNMAP operation that substantially represents deleting a portion of the data in the data set (311-42), a modification transformation of the data set (311-42) other than a change in a portion of the data within the data set, or some other operation that results in a change in a portion of the data contained within the data set (311-42). In the example method depicted in FIG. 3E, a request (311-04) to modify a data set (311-42) is issued by a host (311-02), which may be embodied, for example, as an application executing on a virtual machine, an application executing on a computing device connected to a storage system (311-40), or some other entity configured to access the storage system (311-40).
The example method depicted in FIG. 3E also includes generating (311-08), by the leader storage system (311-40), information (311-10) describing the modification to the data set (311-42). The leader storage system (311-40) may generate (311-08) information (311-10) describing the modification to the data set (311-42), for example, by determining the ordering of any other operations in progress, by determining the correct result of the overlapping modification (e.g., the appropriate result of two requests to modify the same storage location), calculating any distributed state changes to common elements of metadata across all members of the bin (e.g., across all storage systems across which the data set is synchronously replicated), and so forth. The information (311-10) describing the modification to the data set (311-42) may be embodied, for example, as system level information describing I/O operations to be performed by the storage system. The leader storage system (311-40) may generate (311-08) information (311-10) describing modifications to the data set (311-42) by processing the request (311-04) to modify the data set (311-42) only enough to calculate what should happen to service the request (311-04) to modify the data set (311-42). For example, the leader storage system (311-40) may determine whether some sort of ordering of execution of the request (311-04) to modify the data set (311-42) with respect to other requests to modify the data set (311-42) is required, or whether some other step must be taken to produce an equivalent result on each storage system (311-38, 311-40) as described in more detail below.
Consider an example in which a request (311-04) to modify a data set (311-42) is embodied as a request to copy a block from a first address range in the data set (311-42) to a second address range in the data set (311-42). In this example, assume that three other write operations (write A, write B, write C) are directed to a first address range in the data set (311-42). In this example, if the leader storage system (311-40) served write A and write B (but not write C) before copying the block from the first address range in data set (311-42) to the second address range in data set (311-42), then the follower storage system (311-38) must also serve write A and write B (but not write C) before copying the block from the first address range in data set (311-42) to the second address range in data set (311-42) in order to produce consistent results. Thus, when the leader storage system (311-40) generates (311-08) information (311-10) describing the modification to the dataset (311-42), in this example, the leader storage system (311-40) may generate information (e.g., sequence numbers of write A and write B) identifying other operations that must be completed before the follower storage system (311-38) can process the request (311-04) to modify the dataset (311-42).
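As a concrete illustration of the kind of ordering information the leader might generate for this example, the following sketch records the sequence numbers of the overlapping writes that the leader applied before the range copy, so that the follower can apply the same set first. The descriptor fields, the sequence-number scheme, and the helper names are hypothetical; they are not the patent's actual on-the-wire format.

```python
# Sketch of a leader recording which already-applied writes must complete on the
# follower before a range copy is applied there.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class CompletedWrite:
    seq: int        # ordering identifier assigned by the leader
    offset: int
    length: int

@dataclass
class ModificationDescriptor:
    op: str
    args: Dict[str, int]
    predecessors: List[int] = field(default_factory=list)  # must complete on the follower first

def describe_range_copy(src_off, length, dst_off, applied_before_copy):
    """Build the information describing a copy of [src_off, src_off + length)."""
    overlapping = [w.seq for w in applied_before_copy
                   if w.offset < src_off + length and src_off < w.offset + w.length]
    return ModificationDescriptor(
        op="range_copy",
        args={"src": src_off, "dst": dst_off, "len": length},
        predecessors=sorted(overlapping),
    )

# The leader applied write A (seq 1) and write B (seq 2) to the source range before
# the copy, but not write C, so the descriptor lists only sequence numbers 1 and 2.
applied = [CompletedWrite(1, 0, 4096), CompletedWrite(2, 4096, 4096)]
print(describe_range_copy(0, 8192, 1 << 30, applied).predecessors)   # [1, 2]
```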
Consider an additional instance in which two requests (e.g., write a and write B) are directed to overlapping portions of data sets (311-42). In this example, if the leader storage system (311-40) is servicing write A and then servicing write B, and the follower storage system (311-38) is servicing write B and then servicing write A, then the data set (311-42) will not be consistent across the two storage systems (311-38, 311-40). Thus, when the leader storage system (311-40) generates (311-08) information (311-10) describing modifications to the data set (311-42), in this example, the leader storage system (311-40) may generate information (e.g., sequence numbers of write A and write B) identifying the order in which the requests should be performed. Alternatively, rather than generating information (311-10) describing modifications to the data set (311-42) that require intermediate behavior from each storage system (311-38, 311-40), the leader storage system (311-40) may generate (311-08) information (311-10) describing modifications to the data set (311-42) that contains information identifying the correct outcome of the two requests. For example, if write B logically follows write A (and overlaps with write A), then the end result must be that the data set (311-42) contains the portion of write B that overlaps with write A, rather than the portion of write A that overlaps with write B. This result may be facilitated by merging the results in memory and writing such merged results to the data set (311-42), rather than strictly requiring the particular storage system (311-38, 311-40) to perform write A, and then subsequently perform write B. Readers will appreciate that more subtle circumstances are associated with snapshots and virtual address range copies.
The reader will further appreciate that the correct result of any operation must be committed to a recoverable point before the operation can be acknowledged. However, multiple operations may be committed together, or operations may be partially committed if recovery will still ensure correctness. For example, a snapshot may be committed locally with a record that it depends on the expected writes A and B, even though neither A nor B has itself been committed. The snapshot cannot be acknowledged, and if the missing I/O cannot be recovered from another array, recovery may end up backing the snapshot out. Also, if write B overlaps with write A, the leader may 'order' B after A, but A may in fact be discarded, in which case operations need only wait for B. Writes A, B, C, and D, coupled with a snapshot between A, B and C, D, may commit and/or acknowledge some or all portions together, as long as recovery cannot result in the snapshot being inconsistent across arrays, and as long as later operations are not acknowledged before earlier operations have been persisted to a point where they are guaranteed to be recoverable.
The example method depicted in FIG. 3E also includes sending (311-12) information (311-10) describing the modification to the data set (311-42) from the leader storage system (311-40) to the follower storage system (311-38). Transmitting (311-12) information (311-10) describing the modification to the data set (311-42) from the leader storage system (311-40) to the follower storage system (311-38) may be carried out, for example, by the leader storage system (311-40) transmitting one or more messages to the follower storage system (311-38). The leader storage system (311-40) may also send the I/O payload (311-14) of the request (311-04) for modifying the data set (311-42) in the same message or in one or more different messages. When the request (311-04) to modify the data set (311-42) is embodied as a request to write data to the data set (311-42), the I/O payload (311-14) may be embodied, for example, as data to be written to a storage device within the follower storage system (311-38). In this example, because the request (311-04) to modify the data set (311-42) is received (311-06) by the leader storage system (311-40), the follower storage system (311-38) has not received the I/O payload (311-14) associated with the request (311-04) to modify the data set (311-42). In the example method depicted in FIG. 3E, the information (311-10) describing the modification to the dataset (311-42) and the I/O payload (311-14) associated with the request (311-04) to modify the dataset (311-42) may be sent (311-12) from the leader storage system (311-40) to the follower storage system (311-38) via one or more data communication networks coupling the leader storage system (311-40), via one or more dedicated data communication links (e.g., a first link for sending the I/O payload and a second link for sending the information describing the modification to the dataset) coupling the leader storage system (311-40) to the follower storage system (311-38), or via some other mechanism.
The example method depicted in FIG. 3E also includes receiving (311-16), by the follower storage system (311-38), information (311-10) describing a modification to the data set (311-42). The follower storage system (311-38) may receive (311-16) information (311-10) describing modifications to the data set (311-42) and the I/O payload (311-14) from the leader storage system (311-40), e.g., via one or more messages sent from the leader storage system (311-40) to the follower storage system (311-38). One or more messages may be written by the leader storage system (311-40) to a predetermined memory location (e.g., a location of a queue) on the follower storage system (311-38) using RDMA or similar mechanisms or otherwise sent from the leader storage system (311-40) to the follower storage system (311-38) via one or more dedicated data communication links between the two storage systems (311-38, 311-40).
In one embodiment, the follower storage system (311-38) may receive (311-16) information (311-10) describing modifications to the data set (311-42) and the I/O payload (311-14) from the leader storage system (311-40) using SCSI requests (written from the sender to the receiver or read from the receiver to the sender) as a communication mechanism. In this embodiment, the SCSI write request is used to encode information that is intended to be sent (which includes any data and metadata) and that can be delivered to a particular pseudo device or via a specially configured SCSI network or by any other agreed upon addressing mechanism. Or alternatively, the model may use a special device, a specially configured SCSI network, or other agreed upon mechanism to issue a set of open SCSI read requests from the receiver to the sender. The encoded information including the data and metadata will be delivered to the receiver as a response to one or more of these open SCSI requests. This model may be implemented via a fibre channel SCSI network (which is typically deployed as a "dark fibre" storage network infrastructure between data centers). This model also allows for host-to-remote array multipath and bulk array-to-array communication using the same network line.
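To make the 'open SCSI read' pattern above more concrete, the following toy sketch models only the shape of the exchange: the receiver keeps several read requests outstanding against the sender, the sender completes one whenever message data is available, and an empty completion serves as a keep-alive so the receiver's SCSI layer does not time out. The classes and callbacks are illustrative stand-ins; no actual SCSI CDB handling or transport is shown.

```python
# Toy simulation of messaging over open SCSI read requests.

from collections import deque

class MessageSender:                      # plays the role of the SCSI target
    def __init__(self):
        self.open_reads = deque()         # outstanding read requests from the receiver
        self.outbox = deque()             # messages waiting to be delivered

    def post_read(self, completion_callback):
        self.open_reads.append(completion_callback)
        self._pump()

    def send(self, message):
        self.outbox.append(message)
        self._pump()

    def keep_alive(self):
        """Complete one read with empty data so the receiver's SCSI layer never times out."""
        if self.open_reads:
            self.open_reads.popleft()(b"")

    def _pump(self):
        while self.open_reads and self.outbox:
            self.open_reads.popleft()(self.outbox.popleft())


class MessageReceiver:                    # plays the role of the SCSI initiator
    def __init__(self, sender, outstanding=4):
        self.sender = sender
        self.received = []
        for _ in range(outstanding):      # keep several read requests open at all times
            self.sender.post_read(self._on_complete)

    def _on_complete(self, data):
        if data:
            self.received.append(data)
        self.sender.post_read(self._on_complete)   # immediately re-post the read


if __name__ == "__main__":
    sender = MessageSender()
    receiver = MessageReceiver(sender)
    sender.send(b"descriptor and payload for a request")
    sender.keep_alive()
    print(receiver.received)   # [b'descriptor and payload for a request']
```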
The example method depicted in FIG. 3E also includes processing (311-18) a request (311-04) to modify the data set (311-42) by the follower storage system (311-38). In the example method depicted in FIG. 3E, the follower storage system (311-38) may process (311-18) the request (311-04) to modify the data set (311-42) by modifying the contents of one or more storage devices (e.g., NVRAM devices, SSDs, HDDs) included in the follower storage system (311-38) in dependence upon the information (311-10) describing the modification to the data set (311-42) and the I/O payload (311-14) received from the leader storage system (311-40). Consider an example in which a request (311-04) to modify a data set (311-42) is embodied as a write operation directed to a volume contained in the data set (311-42) and information (311-10) describing the modification to the data set (311-42) indicates that the write operation can only be performed after a previously issued write operation has been processed. In this example, processing (311-18) a request (311-04) to modify the dataset (311-42) may be carried out by: the follower storage system (311-38) first verifies that a previously issued write operation has been processed on the follower storage system (311-38), and then writes the I/O payload (311-14) associated with the write operation to one or more storage devices included in the follower storage system (311-38). In this example, the request (311-04) to modify the data set (311-42) may be considered complete and successfully processed, e.g., when the I/O payload (311-14) has been committed to persistent storage within the follower storage system (311-38).
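As an illustration of the follower-side ordering check described above, here is a small sketch in which a descriptor is applied, and acknowledged back to the leader, only once every predecessor operation it lists has completed locally. The Follower class, the descriptor dictionary keys, and the placeholder persistence call are assumptions made for the example, not the patent's implementation.

```python
# Sketch of follower-side processing: apply a modification only after all of the
# predecessor operations named in the leader's descriptor have completed locally.

class Follower:
    def __init__(self):
        self.completed = set()   # sequence numbers that are durably applied locally
        self.waiting = []        # descriptors still blocked on predecessors

    def on_descriptor(self, desc, payload, ack):
        """Called when information describing a modification (and its payload) arrives."""
        self.waiting.append((desc, payload, ack))
        self._drain()

    def _drain(self):
        progressed = True
        while progressed:
            progressed = False
            still_waiting = []
            for desc, payload, ack in self.waiting:
                if all(seq in self.completed for seq in desc["predecessors"]):
                    self._write_persistent(desc["offset"], payload)   # commit to NVRAM/SSD
                    self.completed.add(desc["seq"])
                    ack(desc["seq"])                                  # confirm completion to the leader
                    progressed = True
                else:
                    still_waiting.append((desc, payload, ack))
            self.waiting = still_waiting

    def _write_persistent(self, offset, payload):
        pass   # placeholder for the actual write to persistent storage


if __name__ == "__main__":
    follower, acks = Follower(), []
    follower.on_descriptor({"seq": 2, "offset": 4096, "predecessors": [1]}, b"B", acks.append)
    follower.on_descriptor({"seq": 1, "offset": 0, "predecessors": []}, b"A", acks.append)
    print(acks)   # [1, 2] (the later write waited for its predecessor)
```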
The example method depicted in FIG. 3E also includes confirming (311-20), by the follower storage system (311-38) to the leader storage system (311-40), that the request (311-04) to modify the data set (311-42) has been completed. In the example method depicted in FIG. 3E, confirming (311-20), by the follower storage system (311-38) to the leader storage system (311-40), that the request (311-04) to modify the data set (311-42) has been completed may be carried out by the follower storage system (311-38) sending an acknowledgment (311-22) message to the leader storage system (311-40). Such a message may include, for example, information identifying the particular request (311-04) to modify the data set (311-42) that has been completed, as well as any additional information useful for confirming (311-20), by the follower storage system (311-38), that the request (311-04) to modify the data set (311-42) has been completed. In the example method depicted in FIG. 3E, this confirmation is illustrated by the follower storage system (311-38) sending an acknowledgment (311-22) message to the leader storage system (311-40).
The example method depicted in FIG. 3E also includes processing (311-24) a request (311-04) by the leader storage system (311-40) to modify the data set (311-42). In the example method depicted in FIG. 3E, the leader storage system (311-40) may process (311-24) the request (311-04) to modify the data set (311-42) by modifying the contents of one or more storage devices (e.g., NVRAM devices, SSDs, HDDs) included in the leader storage system (311-40) in dependence upon the information (311-10) describing the modification to the data set (311-42) and the I/O payload (311-14) received as part of the request (311-04) to modify the data set (311-42). Consider an example in which a request (311-04) to modify a data set (311-42) is embodied as a write operation directed to a volume contained in the data set (311-42) and information (311-10) describing the modification to the data set (311-42) indicates that the write operation can only be performed after a previously issued write operation has been processed. In this example, processing (311-24) the request (311-04) to modify the dataset (311-42) may be carried out by: the leader storage system (311-40) first verifies that a previously issued write operation has been processed by the leader storage system (311-40) and then writes the I/O payload (311-14) associated with the write operation to one or more storage devices included in the leader storage system (311-40). In this example, the request (311-04) to modify the data set (311-42) may be considered complete and successfully processed, e.g., when the I/O payload (311-14) has been committed to persistent storage within the leader storage system (311-40).
The example method depicted in FIG. 3E also includes receiving (311-26) from the follower storage system (311-38) an indication that the follower storage system (311-38) has processed a request (311-04) to modify the data set (311-42). In this example, the indication that the follower storage system (311-38) has processed the request (311-04) to modify the data set (311-42) is embodied as a confirmation (311-22) message sent from the follower storage system (311-38) to the leader storage system (311-40). The reader will appreciate that while many of the steps described above are depicted and described as occurring in a particular order, the particular order is not actually required. In practice, because the follower storage systems (311-38) and the leader storage systems (311-40) are independent storage systems, each storage system may perform some of the steps described above in parallel. For example, the follower storage system (311-38) may receive (311-16) information (311-10) describing a modification to the data set (311-42), process (311-18) a request (311-04) to modify the data set (311-42), or confirm (311-20) that the request (311-04) to modify the data set (311-42) is completed before the leader storage system (311-40) has processed (311-24) the request (311-04) to modify the data set (311-42). Alternatively, the leader storage system (311-40) may have processed (311-24) the request (311-04) to modify the dataset (311-42) before the follower storage system (311-38) has received (311-16) information (311-10) describing the modification to the dataset (311-42), processed (311-18) the request (311-04) to modify the dataset (311-42), or acknowledged (311-20) the request (311-04) to modify the dataset (311-42).
The example method depicted in FIG. 3E also includes confirming (311-34), by the leader storage system (311-40), that the request (311-04) to modify the data set (311-42) has been completed. In the example method depicted in FIG. 3E, confirming (311-34) that the request (311-04) to modify the data set (311-42) has been completed may be carried out using one or more acknowledgment (311-36) messages sent from the leader storage system (311-40) to the host (311-02), or via some other suitable mechanism. In the example method depicted in FIG. 3E, the leader storage system (311-40) may determine (311-28) whether the request (311-04) to modify the data set (311-42) has been processed (311-18) by the follower storage system (311-38) before confirming (311-34) that the request (311-04) to modify the data set (311-42) has been completed. The leader storage system (311-40) may determine (311-28) whether the request (311-04) to modify the data set (311-42) has been processed (311-18) by the follower storage system (311-38), for example, by determining whether the leader storage system (311-40) has received an acknowledgment message, or some other message, from the follower storage system (311-38) indicating that the request (311-04) to modify the data set (311-42) has been processed (311-18) by the follower storage system (311-38). In this example, if the leader storage system (311-40) affirmatively (311-30) determines that the request (311-04) to modify the data set (311-42) has been processed (311-18) by the follower storage system (311-38) and has also been processed (311-24) by the leader storage system (311-40), the leader storage system (311-40) may proceed by confirming (311-34), to the host (311-02) that initiated the request (311-04) to modify the data set (311-42), that the request (311-04) has been completed. However, if the leader storage system (311-40) determines that the request (311-04) to modify the data set (311-42) has not yet been processed (311-18) by the follower storage system (311-38), or has not yet been processed (311-24) by the leader storage system (311-40), then the leader storage system (311-40) may not yet confirm (311-34) completion to the host (311-02). This is because the leader storage system (311-40) may confirm (311-34), to the host (311-02) that initiated the request (311-04) to modify the data set (311-42), that the request (311-04) has been completed only once the request (311-04) has been successfully processed on all storage systems (311-38, 311-40) across which the data set (311-42) is synchronously replicated.
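A minimal sketch of this completion gating follows, assuming a single follower and a simple thread-safe flag structure; the class and method names are hypothetical. The only point it illustrates is that the acknowledgment to the host is withheld until both the local commit and the follower's acknowledgment have been observed.

```python
# Sketch of gating the host acknowledgment on local and follower completion.

import threading

class RequestCompletion:
    def __init__(self, followers):
        self._lock = threading.Lock()
        self._local_done = False
        self._pending_followers = set(followers)
        self._host_acked = threading.Event()

    def local_committed(self):
        with self._lock:
            self._local_done = True
            self._maybe_ack()

    def follower_acked(self, follower_id):
        with self._lock:
            self._pending_followers.discard(follower_id)
            self._maybe_ack()

    def _maybe_ack(self):
        # Acknowledge only once the modification is durable everywhere.
        if self._local_done and not self._pending_followers:
            self._host_acked.set()

    def wait_for_host_ack(self, timeout=None):
        return self._host_acked.wait(timeout)


if __name__ == "__main__":
    c = RequestCompletion(followers={"follower"})
    c.local_committed()
    print(c.wait_for_host_ack(timeout=0.01))   # False: follower has not yet acknowledged
    c.follower_acked("follower")
    print(c.wait_for_host_ack())               # True: safe to acknowledge the host
```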
The reader will appreciate that in the example method depicted in FIG. 3E, sending (311-12) information (311-10) describing the modification to the data set (311-42) from the leader storage system (311-40) to the follower storage system (311-38) and acknowledging (311-20) by the follower storage system (311-38) to the leader storage system (311-40) that the request (311-04) to modify the data set (311-42) has completed may be performed using a single round trip messaging. For example, single round trip messaging may be used by using fibre channel as the data interconnect. Typically, the SCSI protocol is used with fibre channel. Such interconnections are typically laid out between data centers, as some older replication techniques may be implemented to essentially replicate data as SCSI transactions over a fibre channel network. Moreover, historically, fibre channel SCSI infrastructure has less overhead and lower latency than Ethernet and TCP/IP based networks. In addition, when a data center is internally connected to a block storage array using fibre channel, the fibre channel network may be extended to other data centers so that hosts in one data center may switch to accessing storage arrays in a remote data center when a local storage array fails.
SCSI may be used as a general communication mechanism, even though it is typically designed for use with block storage protocols for storing and retrieving data in block-oriented volumes (or for magnetic tape). For example, SCSI reads or SCSI writes may be used to deliver or retrieve message data between the storage controllers of paired storage systems. A typical implementation of a SCSI write requires two message round trips: the SCSI initiator sends a SCSI CDB describing the SCSI write operation, the SCSI target receives that CDB, and the SCSI target sends a 'ready to receive' message to the SCSI initiator. The SCSI initiator then sends the data to the SCSI target, and when the SCSI write completes, the SCSI target responds to the SCSI initiator with a successful completion. A SCSI read request, on the other hand, requires only one round trip: the SCSI initiator sends a SCSI CDB describing the SCSI read operation, and the SCSI target receives that CDB and responds with the data followed by a successful completion. As a result, over distance, the distance-dependent latency induced by a SCSI read is half that of a SCSI write. Because of this, a data communication receiver can use SCSI read requests to receive messages faster than a message sender could deliver the data using SCSI write requests. Using SCSI reads in this way merely requires the message sender to operate as a SCSI target and the message receiver to operate as a SCSI initiator. The message receiver may send some number of SCSI CDB read requests to any message sender, and the message sender responds to one of the outstanding CDB read requests when message data is available. Because the SCSI subsystem may time out if a read request is left uncompleted for too long (e.g., 10 seconds), read requests should be responded to within a few seconds even when there is no message data to send.
As described in the SCSI Stream Commands standard from the T10 Technical Committee of the InterNational Committee on Information Technology Standards, SCSI tape requests support variable response data, which can be used more flexibly to return message data of variable size. The SCSI standard also supports an immediate mode for SCSI write requests, which may allow single-round-trip SCSI write commands. The reader will appreciate that many of the embodiments described below likewise utilize single round trip messaging.
For further explanation, FIG. 4 sets forth an example of a cloud-based storage system (403) according to some embodiments of the present disclosure. In the example depicted in FIG. 4, the cloud-based storage system (403) is created entirely in a cloud computing environment (402) such as, for example, Amazon Web Services ('AWS'), Microsoft Azure, Google Cloud Platform, IBM Cloud, Oracle Cloud, and others. The cloud-based storage system (403) may be used to provide services similar to those that may be provided by the storage systems described above. For example, the cloud-based storage system (403) may be used to provide block storage services to users of the cloud-based storage system (403), the cloud-based storage system (403) may be used to provide storage services to users of the cloud-based storage system (403) by using solid-state storage, and so on.
The cloud-based storage system (403) depicted in FIG. 4 includes two cloud computing instances (404, 406), each of which is used to support the execution of a storage controller application (408, 410). The cloud computing instances (404, 406) may be embodied, for example, as instances of cloud computing resources (e.g., virtual machines) that may be provided by the cloud computing environment (402) to support the execution of software applications such as the storage controller applications (408, 410). In one embodiment, the cloud computing instances (404, 406) may be embodied as Amazon Elastic Compute Cloud ('EC2') instances. In such an example, an Amazon Machine Image ('AMI') that includes the storage controller application (408, 410) may be launched to create and configure a virtual machine that may execute the storage controller application (408, 410).
In the example method depicted in FIG. 4, the storage controller application (408, 410) may be embodied as computer program instruction modules that, when executed, perform various storage tasks. For example, the storage controller application (408, 410) may be embodied as a module of computer program instructions that, when executed, performs the same tasks as the controller (110A, 110B in fig. 1A) described above, such as writing data received from a user of the cloud-based storage system (403) to the cloud-based storage system (403), erasing data from the cloud-based storage system (403), retrieving data from the cloud-based storage system (403) and providing such data to a user of the cloud-based storage system (403), monitoring and reporting disk utilization and performance, performing redundant operations (e.g., redundant array of independent drives ('RAID') or RAID-like data redundancy operations), compressing data, encrypting data, deduplicating data, and so forth. Readers will appreciate that because there are two cloud computing instances (404, 406) that each include a storage controller application (408, 410), in some embodiments one cloud computing instance (404) may operate as a primary controller as described above, while the other cloud computing instance (406) may operate as a secondary controller as described above. In this example, to save costs, cloud computing instances operating as primary controllers (404) may be deployed on relatively high performance and relatively expensive cloud computing instances, while cloud computing instances operating as secondary controllers (406) may be deployed on relatively low performance and relatively inexpensive cloud computing instances. Readers will appreciate that the storage controller applications (408, 410) depicted in FIG. 4 may contain the same source code executing within different cloud computing instances (404, 406).
Consider an example in which the cloud computing environment (402) is embodied as AWS and the cloud computing instances are embodied as EC2 instances. In such an example, AWS offers many types of EC2 instances. For example, AWS offers a suite of general-purpose EC2 instances that include varying levels of memory and processing power. In such an example, the cloud computing instance (404) that operates as the primary controller may be deployed on one of the instance types that has a relatively large amount of memory and processing power, while the cloud computing instance (406) that operates as the secondary controller may be deployed on one of the instance types that has a relatively small amount of memory and processing power. In such an example, upon the occurrence of a failover event in which the roles of primary and secondary are switched, a double failover may actually be carried out such that: 1) a first failover event occurs in which the cloud computing instance (406) that previously operated as the secondary controller begins to operate as the primary controller; and 2) a second failover event occurs in which a third cloud computing instance (not shown), of an instance type that has a relatively large amount of memory and processing power, is started with a copy of the storage controller application, where the third cloud computing instance begins operating as the primary controller and the cloud computing instance (406) that originally operated as the secondary controller begins operating as the secondary controller again. In such an example, the cloud computing instance (404) that previously operated as the primary controller may be terminated. Readers will appreciate that in alternative embodiments, the cloud computing instance (404) that operates as the secondary controller after the failover event may continue to operate as the secondary controller, and the cloud computing instance (406) that operates as the primary controller after the occurrence of the failover event may be terminated once the third cloud computing instance (not shown) has assumed the primary role.
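The double failover sequence can be summarized with the following sketch; the ControllerInstance class, the instance type labels, and the launch and terminate helpers are hypothetical stand-ins for the cloud provider's own provisioning APIs.

```python
# Sketch of the 'double failover': the small secondary takes over immediately,
# then a newly launched large instance assumes the primary role and the small
# instance returns to being the secondary.

class ControllerInstance:
    def __init__(self, instance_type, role):
        self.instance_type = instance_type
        self.role = role
        self.running = True

def launch_instance(instance_type, role):
    return ControllerInstance(instance_type, role)

def terminate_instance(instance):
    instance.running = False

def double_failover(failed_primary, secondary):
    # 1) first failover: the inexpensive secondary immediately becomes the acting primary
    secondary.role = "primary"
    terminate_instance(failed_primary)

    # 2) second failover: launch a replacement controller on a high-performance
    #    instance type and let it assume the primary role
    replacement = launch_instance(instance_type="large-memory-and-cpu", role="primary")

    # 3) the small instance steps back down to its original secondary role
    secondary.role = "secondary"
    return replacement

if __name__ == "__main__":
    primary = ControllerInstance("large-memory-and-cpu", "primary")
    secondary = ControllerInstance("small-and-inexpensive", "secondary")
    primary.running = False                       # the primary controller fails
    new_primary = double_failover(primary, secondary)
    print(new_primary.role, secondary.role)       # primary secondary
```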
Readers will appreciate that while the above-described embodiments relate to embodiments in which one cloud computing instance (404) operates as a primary controller and a second cloud computing instance (406) operates as a secondary controller, other embodiments are within the scope of the present disclosure. For example, each cloud computing instance (404, 406) may operate as a primary controller for some portion of the address space supported by the cloud-based storage system (403), each cloud computing instance (404, 406) may operate as a primary controller in the event that services directed to I/O operations of the cloud-based storage system (403) are partitioned in some other manner, and so on. Indeed, in other embodiments where cost savings may be prioritized over performance requirements, there may be only a single cloud computing instance containing a storage controller application. In this example, a controller failure may take more time to recover because a new cloud computing instance containing a storage controller application will need to be started, rather than having the created cloud computing instance assume the role of servicing the I/O operations that would otherwise be handled by the failed cloud computing instance.
The cloud-based storage system (403) depicted in FIG. 4 includes cloud computing instances (424 a, 424b, 424 n) with local storage (414, 418, 422). The cloud computing instances (424 a, 424b, 424 n) depicted in FIG. 4 may be embodied, for example, as instances of cloud computing resources that may be provided by the cloud computing environment (402) to support the execution of software applications. The cloud computing instances (424 a, 424b, 424 n) of FIG. 4 may differ from the cloud computing instances (404, 406) described above in that the cloud computing instances (424 a, 424b, 424 n) of FIG. 4 have local storage (414, 418, 422) resources, whereas the cloud computing instances (404, 406) that support the execution of the storage controller applications (408, 410) need not have local storage resources. The cloud computing instances (424 a, 424b, 424 n) with local storage (414, 418, 422) may be embodied, for example, as EC2 M5 instances that include one or more SSDs, as EC2 R5 instances that include one or more SSDs, as EC2 I3 instances that include one or more SSDs, and so on. In some embodiments, the local storage (414, 418, 422) must be embodied as solid-state storage (e.g., SSDs) rather than storage that makes use of hard disk drives.
In the example depicted in fig. 4, each of the cloud computing instances (424 a, 424b, 424 n) with the local storage (414, 418, 422) may include a software daemon (412, 416, 420) that, when executed by the cloud computing instance (424 a, 424b, 424 n), may present itself to the storage controller application (408, 410) as if the cloud computing instance (424 a, 424b, 424 n) were a physical storage (e.g., one or more SSDs). In this example, the software daemon (412, 416, 420) may include computer program instructions similar to those typically included on a storage device, so that the storage controller application (408, 410) can send and receive the same commands as the storage controller would send to the storage device. In this way, the storage controller application (408, 410) may contain the same (or substantially the same) code as would be executed by the controller in the storage system described above. In these and similar embodiments, communication between the storage controller application (408, 410) and the cloud computing instance (424 a, 424b, 424 n) with the local storage (414, 418, 422) may utilize iSCSI, TCP-based NVMe, messaging, custom protocols, or in some other mechanism.
In the example depicted in FIG. 4, each of the cloud computing instances (424 a, 424b, 424 n) with local storage (414, 418, 422) may also be coupled to block storage (426, 428, 430) provided by the cloud computing environment (402). The block storage (426, 428, 430) provided by the cloud computing environment (402) may be embodied, for example, as Amazon Elastic Block Store ('EBS') volumes. For example, a first EBS volume (426) may be coupled to a first cloud computing instance (424 a), a second EBS volume (428) may be coupled to a second cloud computing instance (424 b), and a third EBS volume (430) may be coupled to a third cloud computing instance (424 n). In such an example, the block storage (426, 428, 430) provided by the cloud computing environment (402) may be utilized in a manner similar to how the NVRAM devices described above are utilized, as the software daemon (412, 416, 420) (or some other module) executing within a particular cloud computing instance (424 a, 424b, 424 n) may, upon receiving a request to write data, initiate the writing of the data to its attached EBS volume as well as to its local storage (414, 418, 422) resources. In some alternative embodiments, data may only be written to the local storage (414, 418, 422) resources within a particular cloud computing instance (424 a, 424b, 424 n). In alternative embodiments, rather than using the block storage (426, 428, 430) provided by the cloud computing environment (402) as NVRAM, actual RAM on each of the cloud computing instances (424 a, 424b, 424 n) with local storage (414, 418, 422) may be used as NVRAM, thereby decreasing the network utilization costs that would be associated with using an EBS volume as NVRAM.
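The write path just described, in which the software daemon stages a write in the attached block-storage volume (used in the manner of NVRAM) and also places it in local storage, might be sketched as follows. The VirtualDrive and LogVolume classes are illustrative assumptions; a real daemon would speak a storage-device protocol to the controllers and write to actual devices.

```python
# Sketch of the software daemon's write path: stage the write durably in the
# attached block-store volume, then place it in local instance storage where
# subsequent reads will be served from.

class VirtualDrive:
    """Presents a cloud instance's storage to the controller as if it were one SSD."""

    def __init__(self, nvram_volume, local_ssd):
        self.nvram = nvram_volume    # e.g., an attached EBS volume used like NVRAM
        self.local = local_ssd       # e.g., instance-local SSD blocks

    def write(self, offset, data):
        # 1) make the write durable against instance loss before acknowledging it
        self.nvram.append_log_record(offset, data)
        # 2) place the data in local flash where reads will be served from
        self.local[offset] = data
        return "ok"

    def read(self, offset):
        return self.local[offset]


class LogVolume:
    def __init__(self):
        self.records = []

    def append_log_record(self, offset, data):
        self.records.append((offset, data))


if __name__ == "__main__":
    drive = VirtualDrive(nvram_volume=LogVolume(), local_ssd={})
    drive.write(0, b"block 0 contents")
    print(drive.read(0))
```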
In the example depicted in FIG. 4, the cloud computing instances (424 a, 424b, 424 n) with local storage (414, 418, 422) may be utilized by the cloud computing instances (404, 406) that support the execution of the storage controller applications (408, 410) to service I/O operations that are directed to the cloud-based storage system (403). Consider an example in which a first cloud computing instance (404) that is executing the storage controller application (408) is operating as the primary controller. In such an example, the first cloud computing instance (404) that is executing the storage controller application (408) may receive (directly or indirectly via the secondary controller) requests from users of the cloud-based storage system (403) to write data to the cloud-based storage system (403). In such an example, the first cloud computing instance (404) that is executing the storage controller application (408) may perform various tasks such as, for example, deduplicating the data contained in the request, compressing the data contained in the request, determining where to write the data contained in the request, and so on, before ultimately sending a request to write a deduplicated, encrypted, or otherwise possibly updated version of the data to one or more of the cloud computing instances (424 a, 424b, 424 n) with local storage (414, 418, 422). In some embodiments, either cloud computing instance (404, 406) may receive a request to read data from the cloud-based storage system (403) and may ultimately send a request to read data to one or more of the cloud computing instances (424 a, 424b, 424 n) with local storage (414, 418, 422).
Readers will appreciate that when a request to write data is received by a particular cloud computing instance (424 a, 424b, 424 n) having a local storage (414, 418, 422), the software daemon (412, 416, 420) or some other computer program instruction module executing on the particular cloud computing instance (424 a, 424b, 424 n) may be configured to not only write data to its own local storage (414, 418, 422) resources and any suitable block storage (426, 428, 430) provided by the cloud computing environment (402), but also the software daemon (412, 416, 420) or some other computer program instruction module executing on the particular cloud computing instance (424 a, 424b, 424 n) may be configured to write data to a cloud-based object storage (432) attached to the particular cloud computing instance (424 a, 424b, 424 n). For example, the cloud-based object store (432) attached to a particular cloud computing instance (424 a, 424b, 424 n) may be embodied as an amazon simple storage service ('S3') store accessible by the particular cloud computing instance (424 a, 424b, 424 n). In other embodiments, cloud computing instances (404, 406), each including a storage controller application (408, 410), may initiate storage of data in local storage (414, 418, 422) and cloud-based object storage (432) of the cloud computing instance (424 a, 424b, 424 n).
Readers will appreciate that the software daemon (412, 416, 420) or other modules of computer program instructions that write data to block storage (e.g., local storage (414, 418, 422) resources) and also write data to cloud-based object storage (432) may execute on different types of processing units (e.g., different types of cloud computing examples, cloud computing examples containing different processing units). In practice, software daemons (412, 416, 420) or other modules of computer program instructions that write data to block storage (e.g., local storage (414, 418, 422) resources) and also write data to cloud-based object storage (432) may migrate between different types of cloud computing instances based on demand.
Readers will appreciate that, as described above, the cloud-based storage system (403) may be used to provide block storage services to users of the cloud-based storage system (403). While the local storage (414, 418, 422) resources and the block storage (426, 428, 430) resources that are utilized by the cloud computing instances (424 a, 424b, 424 n) may support block-level access, the cloud-based object storage (432) that is attached to a particular cloud computing instance (424 a, 424b, 424 n) supports only object-based access. To address this, the software daemon (412, 416, 420) or some other module of computer program instructions that is executing on the particular cloud computing instance (424 a, 424b, 424 n) may be configured to take blocks of data, package those blocks into objects, and write the objects to the cloud-based object storage (432) that is attached to the particular cloud computing instance (424 a, 424b, 424 n).
Consider an example in which data is written in 1MB blocks to the local storage (414, 418, 422) resources and the block storage (426, 428, 430) resources that are utilized by the cloud computing instances (424 a, 424b, 424 n). In such an example, assume that a user of the cloud-based storage system (403) issues a request to write data that, after being compressed and deduplicated by the storage controller application (408, 410), results in the need to write 5MB of data. In such an example, writing the data to the local storage (414, 418, 422) resources and the block storage (426, 428, 430) resources that are utilized by the cloud computing instances (424 a, 424b, 424 n) is relatively straightforward, as five blocks that are 1MB in size are written to those local storage (414, 418, 422) resources and block storage (426, 428, 430) resources. In such an example, the software daemon (412, 416, 420) or some other module of computer program instructions that is executing on the particular cloud computing instance (424 a, 424b, 424 n) may be configured to: 1) create a first object that includes the first 1MB of the data and write the first object to the cloud-based object storage (432); 2) create a second object that includes the second 1MB of the data and write the second object to the cloud-based object storage (432); 3) create a third object that includes the third 1MB of the data and write the third object to the cloud-based object storage (432); and so on. As such, in some embodiments, each object that is written to the cloud-based object storage (432) may be identical (or nearly identical) in size. Readers will appreciate that in such an example, metadata that is associated with the data itself may be included in each object (e.g., the first 1MB of the object is data and the remaining portion is metadata associated with the data).
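A sketch of that packaging step follows, using a plain dictionary to stand in for the cloud-based object storage; the object key scheme and the choice to append a small JSON metadata trailer to each 1MB block are assumptions made for illustration.

```python
# Sketch of packaging fixed-size blocks into equally sized objects for the object
# store, with a little metadata carried alongside the data in each object.

import json

BLOCK_SIZE = 1 << 20   # 1MB blocks, matching the example above

def write_blocks_as_objects(object_store, volume, start_block, data):
    """Split 'data' into 1MB blocks and write one object per block."""
    assert len(data) % BLOCK_SIZE == 0
    for i in range(len(data) // BLOCK_SIZE):
        block = data[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        metadata = json.dumps({"volume": volume, "block": start_block + i}).encode()
        key = f"{volume}/block-{start_block + i:012d}"
        object_store[key] = block + metadata   # data first, metadata appended after it
    return len(data) // BLOCK_SIZE

if __name__ == "__main__":
    s3 = {}   # stands in for the cloud-based object store
    written = write_blocks_as_objects(s3, "vol1", start_block=0, data=b"\x00" * (5 * BLOCK_SIZE))
    print(written, sorted(s3)[0])   # 5 vol1/block-000000000000
```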
Readers will appreciate that the cloud-based object storage (432) may be incorporated into the cloud-based storage system (403) to increase the durability of the cloud-based storage system (403). Continuing with the example described above in which the cloud computing instances (424 a, 424b, 424 n) are EC2 instances, readers will understand that EC2 instances are only guaranteed to have a monthly uptime of 99.9%, and that data stored in the local instance store only persists for the lifetime of the EC2 instance. As such, relying on the cloud computing instances (424 a, 424b, 424 n) with local storage (414, 418, 422) as the only source of persistent data storage in the cloud-based storage system (403) may result in a relatively unreliable storage system. Similarly, EBS volumes are designed for 99.999% availability. As such, even relying on EBS as persistent data storage in the cloud-based storage system (403) may result in a storage system that is not sufficiently durable. Amazon S3, however, is designed to provide 99.999999999% durability, meaning that a cloud-based storage system (403) that can incorporate S3 into its pool of storage is substantially more durable than various other options.
Readers will appreciate that while cloud-based storage system (403) that may incorporate S3 into its storage pool is substantially more durable than various other options, using S3 as the primary storage pool may result in a storage system having a relatively slow response time and a relatively long I/O latency. Thus, the cloud-based storage system (403) depicted in fig. 4 not only stores data in S3, but the cloud-based storage system (403) also stores data in local storage (414, 418, 422) resources and block storage (426, 428, 430) resources utilized by the cloud computing example (424 a, 424b, 424 n) such that read operations may provide services from the local storage (414, 418, 422) resources and block storage (426, 428, 430) resources utilized by the cloud computing example (424 a, 424b, 424 n), thereby reducing read latency when a user of the cloud-based storage system (403) attempts to read data from the cloud-based storage system (403).
In some embodiments, all data stored by the cloud-based storage system (403) may be stored in both: 1) A cloud-based object store (432); and 2) at least one of local storage (414, 418, 422) resources or block storage (426, 428, 430) resources utilized by the cloud computing instance (424 a, 424b, 424 n). In such embodiments, the local storage (414, 418, 422) resources and the block storage (426, 428, 430) resources utilized by the cloud computing examples (424 a, 424b, 424 n) may effectively operate as a cache that typically contains all of the data also stored in S3, such that all reads of the data may be serviced by the cloud computing examples (424 a, 424b, 424 n) without the cloud computing examples (424 a, 424b, 424 n) accessing the cloud-based object storage (432). However, readers will appreciate that in other embodiments, all data stored by the cloud-based storage system (403) may be stored in the cloud-based object storage (432), but not all data stored by the cloud-based storage system (403) may be stored in at least one of the local storage (414, 418, 422) resources or the block storage (426, 428, 430) resources utilized by the cloud computing examples (424 a, 424b, 424 n). In this example, various policies may be utilized to determine which subset of data stored by the cloud-based storage system (403) should reside in both: 1) A cloud-based object store (432); and 2) at least one of local storage (414, 418, 422) resources or block storage (426, 428, 430) resources utilized by the cloud computing instance (424 a, 424b, 424 n).
As described above, when the cloud computing instance (424 a, 424b, 424 n) with the local storage (414, 418, 422) is embodied as an EC2 instance, only the cloud computing instance (424 a, 424b, 424 n) with the local storage (414, 418, 422) is guaranteed to have a normal run time of 99.9% per month, and the data stored in the local instance storage area is maintained only during the lifetime of each cloud computing instance (424 a, 424b, 424 n) with the local storage (414, 418, 422). As such, one or more modules of computer program instructions executing within the cloud-based storage system (403) (e.g., a monitoring module executing on its own EC2 instance) may be designed to handle failures in one or more of the cloud computing instances (424 a, 424b, 424 n) having the local storage (414, 418, 422). In this example, the monitoring module may handle failure of one or more of the cloud computing instances (424 a, 424b, 424 n) with the local storage (414, 418, 422) by: creating one or more new cloud computing instances with local storage, retrieving data stored on the failed cloud computing instance (424 a, 424b, 424 n) from the cloud-based object storage (432), and storing the data retrieved from the cloud-based object storage (432) in the local storage on the newly created cloud computing instance. The reader will appreciate that many variations of this process may be implemented.
Consider an example in which all of the cloud computing instances (424 a, 424b, 424 n) with local storage (414, 418, 422) fail. In such an example, the monitoring module may create a new cloud computing instance with local storage, where a high-bandwidth instance type is selected that allows for the maximum data transfer rate between the newly created high-bandwidth cloud computing instance with local storage and the cloud-based object storage (432). Readers will appreciate that an instance type that allows for the maximum data transfer rate between the new cloud computing instance and the cloud-based object storage (432) is selected so that the new high-bandwidth cloud computing instance can be rehydrated with data from the cloud-based object storage (432) as quickly as possible. Once the new high-bandwidth cloud computing instance has been rehydrated with data from the cloud-based object storage (432), a cheaper lower-bandwidth cloud computing instance may be created, the data may be migrated to the cheaper lower-bandwidth cloud computing instance, and the high-bandwidth cloud computing instance may be terminated.
Readers will appreciate that in some embodiments, the number of new cloud computing instances created may substantially exceed the number of cloud computing instances that need to locally store all of the data stored by the cloud-based storage system (403). The number of new cloud computing instances created may substantially exceed the number of cloud computing instances that need to locally store all of the data stored by the cloud-based storage system (403) in order to more quickly pull the data from the cloud-based object storage (432) and into the new cloud computing instances, as each new cloud computing instance may retrieve (in parallel) some portion of the data stored by the cloud-based storage system (403). In such embodiments, once data stored by the cloud-based storage system (403) has been pulled into the newly created cloud computing instances, the data may be consolidated within a subset of the newly created cloud computing instances and those too many newly created cloud computing instances may be terminated.
Consider an instance in which 1000 cloud computing instances are required in order to locally store all valid data that a user of the cloud-based storage system (403) has written to the cloud-based storage system (403). In this example, it is assumed that all of 1,000 cloud computing instances fail. In this example, the monitoring module may cause 100,000 cloud computing instances to be created, where each cloud computing instance is responsible for retrieving, from the cloud-based object store (432), a distinct 1/100,000 partition of valid data that a user of the cloud-based storage system (403) has written to the cloud-based storage system (403), and storing the distinct partition of its retrieved data set locally. In this example, because each of the 100,000 cloud computing instances may retrieve data from the cloud-based object storage (432) in parallel, the recovery of the cache layer may be 100 times faster than an embodiment in which the monitoring module creates only 1000 alternative cloud computing instances. In this example, over time, data stored locally in 100,000 may be consolidated into 1,000 cloud computing instances, and the remaining 99,000 cloud computing instances may be terminated.
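Scaled down to small numbers, the recover-wide-then-consolidate idea might look like the following sketch, where plain dictionaries stand in for cloud computing instances and for the object store; the partitioning-by-modulo scheme is an assumption made for illustration.

```python
# Sketch of parallel rehydration: many temporary instances each pull a disjoint
# partition of the objects, the data is then consolidated onto the long-lived
# instances, and the temporary instances would be terminated.

def rehydrate(object_store, temp_instance_count, final_instance_count):
    keys = sorted(object_store)

    # 1) each temporary instance fetches its own 1/N partition (in parallel, in practice)
    temp_instances = [dict() for _ in range(temp_instance_count)]
    for i, key in enumerate(keys):
        temp_instances[i % temp_instance_count][key] = object_store[key]

    # 2) consolidate the partitions onto the smaller set of permanent instances
    final_instances = [dict() for _ in range(final_instance_count)]
    for i, key in enumerate(keys):
        owner = i % final_instance_count
        source = temp_instances[i % temp_instance_count]
        final_instances[owner][key] = source[key]

    # 3) the temporary instances would now be terminated
    return final_instances

if __name__ == "__main__":
    store = {f"vol1/block-{i:04d}": b"..." for i in range(1000)}
    finals = rehydrate(store, temp_instance_count=100, final_instance_count=10)
    print(sum(len(f) for f in finals), len(finals))   # 1000 10
```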
The reader will appreciate that various performance aspects of the cloud-based storage system (403) may be monitored (e.g., by a monitoring module implemented in the EC2 example) such that the cloud-based storage system (403) may be expanded longitudinally or laterally as desired. Consider the example: wherein the monitoring module monitors performance of the cloud-based storage system (403) via communication with one or more of the cloud computing instances (404, 406) each for supporting execution of the storage controller application (408, 410), via communication between the monitoring cloud computing instances (404, 406, 424a, 424b, 424 n) and the cloud-based object storage (432), or in some other manner. In this example, it is assumed that the monitoring module determines that the cloud computing instance (404, 406) for supporting execution of the storage controller application (408, 410) is too small and insufficient to service I/O requests issued by users of the cloud-based storage system (403). In this example, the monitoring module may create a new more powerful cloud computing instance (e.g., a type of cloud computing instance that includes more processing power, more memory, etc.), which includes a storage controller application such that the new more powerful cloud computing instance may begin to operate as a primary controller. Likewise, if the monitoring module determines that the cloud computing instance (404, 406) for supporting execution of the storage controller application (408, 410) is too large and may obtain cost savings by switching to a smaller less powerful cloud computing instance, the monitoring module may create a new less powerful (and cheaper) cloud computing instance that contains the storage controller application such that the new less powerful cloud computing instance may begin to operate as the primary controller.
Consider an example in which the monitoring module determines that the utilization of local storage provided by the cloud computing examples (424 a, 424b, 424 n) together has reached a predetermined utilization threshold (e.g., 95%), as an additional example of dynamically sizing the cloud-based storage system (403). In this example, the monitoring module may create additional cloud computing instances with local storage to extend the local storage pool provided by the cloud computing instances. Alternatively, the monitoring module may create one or more new cloud computing instances having a greater amount of local storage than the existing cloud computing instances (424 a, 424b, 424 n) such that data stored in the existing cloud computing instances (424 a, 424b, 424 n) may migrate to the one or more new cloud computing instances and may terminate the existing cloud computing instances (424 a, 424b, 424 n), thereby expanding the local storage pool provided by the cloud computing instances. Likewise, if the local storage pool provided by the cloud computing instance does not need to be large, the data may be consolidated and some cloud computing instances may be terminated.
Readers will appreciate that the cloud-based storage system (403) may be dynamically scaled up and down by a monitoring module applying a predetermined set of rules that may be relatively simple or relatively complicated. In fact, the monitoring module may not only take into account the current state of the cloud-based storage system (403), but the monitoring module may also apply predictive policies that are based on, for example, observed behavior (e.g., every night from 10 pm until 6 am, usage of the storage system is relatively light), predetermined fingerprints (e.g., every time a virtual desktop infrastructure adds 100 virtual desktops, the number of IOPS directed to the storage system increases by X), and so on. In this example, the dynamic scaling of the cloud-based storage system (403) may be based on current performance metrics, predicted workloads, and many other factors, including combinations thereof.
Readers will further appreciate that because the cloud-based storage system (403) may be dynamically scaled, the cloud-based storage system (403) may even operate in a way that is more dynamic. Consider, as an example, garbage collection. In a traditional storage system, the amount of storage is fixed. As such, at some point the storage system may be forced to perform garbage collection because the amount of available storage has become so constrained that the storage system is on the verge of running out of storage. In contrast, the cloud-based storage system (403) described here can always 'add' additional storage (e.g., by adding more cloud computing instances with local storage). Because the cloud-based storage system (403) described here can always 'add' additional storage, the cloud-based storage system (403) can make more intelligent decisions regarding when to perform garbage collection. For example, the cloud-based storage system (403) may implement a policy under which garbage collection is performed only when the number of IOPS being serviced by the cloud-based storage system (403) falls below a certain level. In some embodiments, other system-level functions (e.g., deduplication, compression) may also be turned off and on in response to system load, given that the size of the cloud-based storage system (403) is not constrained in the same way that traditional storage systems are constrained.
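A minimal sketch of such a policy appears below; the names and the particular background tasks listed are illustrative assumptions rather than the policy of any specific embodiment.

```python
# Gate garbage collection (and, optionally, other background work) on the
# currently observed IOPS, since capacity can always be added instead.
def should_run_garbage_collection(current_iops: float, iops_threshold: float) -> bool:
    return current_iops < iops_threshold

def background_tasks_to_run(current_iops: float, iops_threshold: float):
    tasks = []
    if should_run_garbage_collection(current_iops, iops_threshold):
        # Only schedule heavy background work while the system is lightly loaded.
        tasks.extend(["garbage-collection", "deep-compression", "dedup-scan"])
    return tasks

# Example: at 50,000 IOPS against a 10,000 IOPS threshold, nothing is scheduled.
print(background_tasks_to_run(current_iops=50_000, iops_threshold=10_000))
```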
Readers will appreciate that embodiments of the present disclosure resolve an issue with the block storage services offered by some cloud computing environments, as some cloud computing environments only allow one cloud computing instance to connect to a block storage volume at a time. For example, in Amazon AWS, only a single EC2 instance may be connected to an EBS volume. Through the use of EC2 instances with local storage, embodiments of the present disclosure can offer multi-connect capabilities where multiple EC2 instances can connect to another EC2 instance with local storage (a 'drive instance'). In such embodiments, the drive instance may include software executing within the drive instance that allows the drive instance to support I/O directed to a particular volume from each connected EC2 instance. As such, some embodiments of the present disclosure may be embodied as a multi-connect block storage service that may not include all of the components depicted in fig. 4.
In some embodiments, especially in embodiments where the cloud-based object storage (432) resources are embodied as Amazon S3, the cloud-based storage system (403) may include one or more modules (e.g., a module of computer program instructions executing on an EC2 instance) that are configured to ensure that when the local storage of a particular cloud computing instance is rehydrated with data from S3, the appropriate data is actually in S3. This issue arises largely because S3 implements an eventual consistency model where, when an existing object is overwritten, reads of the object will eventually (but not necessarily immediately) become consistent and will eventually (but not necessarily immediately) return the overwritten version of the object. To address this issue, in some embodiments of the present disclosure, objects in S3 are never overwritten. Instead, a traditional 'overwrite' results in the creation of a new object (that includes the updated version of the data) and the eventual deletion of the old object (that includes the previous version of the data).
In some embodiments of the present disclosure, as part of an attempt to never (or almost never) overwrite an object, when data is written to S3 the resultant object may be tagged with a sequence number. In some embodiments, these sequence numbers may be persisted elsewhere (e.g., in a database) such that, at any point in time, the sequence number associated with the most up-to-date version of some piece of data can be known. In such a way, a determination can be made as to whether S3 has the latest version of some piece of data simply by reading the sequence number associated with an object, and without actually reading the data from S3. The ability to make this determination may be particularly important when a cloud computing instance with local storage crashes, as it would be undesirable to rehydrate the local storage of a replacement cloud computing instance with out-of-date data. In fact, because the cloud-based storage system (403) does not need to access the data to verify its validity, the data can stay encrypted and access charges can be avoided.
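The following sketch illustrates, with hypothetical in-memory tables standing in for whatever database an embodiment uses, how a sequence-number comparison can establish whether the object store holds the latest version of a piece of data without reading the data itself.

```python
# Sequence numbers issued by the system vs. the version confirmed durable in
# the object store; writes never overwrite, they create a new object.
latest_sequence = {}   # data_id -> newest sequence number issued by the system
durable_in_s3 = {}     # data_id -> sequence number confirmed durable in the object store

def issue_write(data_id: str) -> int:
    """Each write creates a new object (never an overwrite) with the next sequence number."""
    seq = latest_sequence.get(data_id, 0) + 1
    latest_sequence[data_id] = seq
    return seq

def confirm_durable(data_id: str, seq: int) -> None:
    """Called once the new object is known to be durable in the object store."""
    durable_in_s3[data_id] = max(durable_in_s3.get(data_id, 0), seq)

def safe_to_rehydrate(data_id: str) -> bool:
    """Only rebuild local storage from the object store if it holds the latest version."""
    return durable_in_s3.get(data_id) == latest_sequence.get(data_id)
```

The check compares only sequence numbers, so no object payload ever needs to be read (or decrypted) to decide whether rehydration from the object store would produce stale data.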
In the example depicted in fig. 4, and as described above, the cloud computing instances (404, 406) that are used to support the execution of a storage controller application (408, 410) may operate in a primary/secondary configuration where one of the cloud computing instances (404, 406) that are used to support the execution of a storage controller application (408, 410) is responsible for writing data to the local storage (414, 418, 422) that is attached to the cloud computing instances (424a, 424b, 424n) with local storage. In this example, however, because each of the cloud computing instances (404, 406) that are used to support the execution of a storage controller application (408, 410) can access the cloud computing instances (424a, 424b, 424n) with local storage, both of the cloud computing instances (404, 406) that are used to support the execution of a storage controller application (408, 410) may service requests to read data from the cloud-based storage system (403).
For further explanation, fig. 5 sets forth an example of an additional cloud-based storage system (502) in accordance with some embodiments of the present disclosure. In the example depicted in fig. 5, the cloud-based storage system (502) is created entirely in a cloud computing environment (402) such as, for example, AWS, Microsoft Azure, Google Cloud Platform, IBM Cloud, Oracle Cloud, and others. The cloud-based storage system (502) may be used to provide services similar to the services that may be provided by the storage systems described above. For example, the cloud-based storage system (502) may be used to provide block storage services to users of the cloud-based storage system (502), the cloud-based storage system (502) may be used to provide storage services to users of the cloud-based storage system (502) through the use of solid-state storage, and so on.
The cloud-based storage system (502) depicted in fig. 5 may operate in a manner that is somewhat similar to the cloud-based storage system (403) depicted in fig. 4, as the cloud-based storage system (502) depicted in fig. 5 includes a storage controller application (506) that is being executed in a cloud computing instance (504). In the example depicted in fig. 5, however, the cloud computing instance (504) that executes the storage controller application (506) is a cloud computing instance (504) with local storage (508). In such an example, data written to the cloud-based storage system (502) may be stored in both the local storage (508) of the cloud computing instance (504) and also in the cloud-based object storage (510), in the same manner that the cloud-based object storage (510) was used above. In some embodiments, for example, the storage controller application (506) may be responsible for writing data to the local storage (508) of the cloud computing instance (504), while a software daemon (512) may be responsible for ensuring that the data is written to the cloud-based object storage (510) in the same manner that the cloud-based object storage (510) was used above. In other embodiments, the same entity (e.g., the storage controller application) may be responsible for writing data to the local storage (508) of the cloud computing instance (504) and also responsible for ensuring that the data is written to the cloud-based object storage (510) in the same manner that the cloud-based object storage (510) was used above.
Readers will appreciate that the cloud-based storage system (502) depicted in fig. 5 may represent a less expensive, less robust version of the cloud-based storage system depicted in fig. 4. In alternative embodiments, the cloud-based storage system (502) depicted in fig. 5 could include additional cloud computing instances with local storage that support the execution of the storage controller application (506), such that failover can occur if the cloud computing instance (504) that executes the storage controller application (506) fails. Likewise, in other embodiments, the cloud-based storage system (502) depicted in fig. 5 can include additional cloud computing instances with local storage to expand the amount of local storage that is offered by the cloud computing instances in the cloud-based storage system (502).
The reader will appreciate that many of the failure scenarios described above with reference to fig. 4 would also apply to the cloud-based storage system (502) depicted in fig. 5. Likewise, the cloud-based storage system (502) depicted in fig. 5 may be dynamically scaled up and down in a similar manner as described above. Various system-level tasks may also be executed intelligently by the cloud-based storage system (502) depicted in fig. 5, as described above.
The reader will appreciate that, in an effort to increase the resiliency of the cloud-based storage systems described above, various components may be located within different availability zones. For example, a first cloud computing instance that supports the execution of the storage controller application may be located within a first availability zone, while a second cloud computing instance that also supports the execution of the storage controller application may be located within a second availability zone. Likewise, the cloud computing instances with local storage may be distributed across multiple availability zones. In fact, in some embodiments, an entire second cloud-based storage system could be created in a different availability zone, where data in the original cloud-based storage system is replicated (synchronously or asynchronously) to the second cloud-based storage system, so that if the entire original cloud-based storage system went down, a replacement cloud-based storage system (the second cloud-based storage system) could be brought up in a trivial amount of time.
Readers will appreciate that the cloud-based storage systems described herein may be used as part of a fleet of storage systems. In fact, the cloud-based storage systems described herein may be paired with locally deployed storage systems. In such an example, data stored in the locally deployed storage system may be replicated (synchronously or asynchronously) to the cloud-based storage system, and vice versa.
For further explanation, FIG. 6 sets forth a flow chart illustrating an example method of servicing I/O operations in a cloud-based storage system (604). Although depicted in less detail, the cloud-based storage system (604) depicted in fig. 6 may be similar to the cloud-based storage system described above and may be supported by the cloud computing environment (602).
The example method depicted in fig. 6 includes receiving (606), by the cloud-based storage system (604), a request to write data to the cloud-based storage system (604). The request to write data may be received, for example, from a user of the storage system that is communicatively coupled to the cloud computing environment, from an application that is executing in the cloud computing environment, or in some other way. In such an example, the request can include the data that is to be written to the cloud-based storage system (604). In other embodiments, the request to write data to the cloud-based storage system (604) may occur at boot time, when the cloud-based storage system (604) is being brought up.
The example method depicted in fig. 6 also includes deduplicating (608) the data. Data deduplication is a data reduction technique for eliminating duplicate copies of repeating data. The cloud-based storage system (604) may deduplicate (608) the data, for example, by comparing one or more portions of the data to data that is already stored in the cloud-based storage system (604), by comparing fingerprints for one or more portions of the data to fingerprints for data that is already stored in the cloud-based storage system (604), or in other ways. In such an example, duplicate data may be removed and replaced by a reference to an already existing copy of the data that is stored in the cloud-based storage system (604).
The example method depicted in fig. 6 also includes compressing (610) the data. Data compression is a data reduction technique whereby information is encoded using fewer bits than the original representation. The cloud-based storage system (604) may compress (610) the data by applying one or more data compression algorithms to the data, which, at this point, may not include data that is already stored in the cloud-based storage system (604).
The example method depicted in fig. 6 also includes encrypting (612) the data. Data encryption is a technique that involves the conversion of data from a readable format into an encoded format that can only be read or processed after the data has been decrypted. The cloud-based storage system (604) may encrypt (612) the data, which, at this point, may have already been deduplicated and compressed, using an encryption key. The reader will appreciate that although the embodiment depicted in fig. 6 involves deduplicating (608), compressing (610), and encrypting (612) the data, other embodiments exist in which fewer of these steps are performed, and embodiments exist in which the same number of steps, or a subset of the steps, are performed in a different order.
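The following sketch illustrates one possible ordering of the deduplicate, compress, and encrypt steps described above; the fingerprint index, the use of SHA-256 fingerprints, zlib compression, and Fernet encryption (from the third-party 'cryptography' package) are illustrative assumptions rather than the actual implementation of any embodiment.

```python
# Write-path sketch: skip duplicate blocks, then compress and encrypt the rest.
import hashlib
import zlib
from cryptography.fernet import Fernet   # requires the third-party 'cryptography' package

fingerprint_index = {}                    # fingerprint -> handle for the stored unique copy
encryption_key = Fernet.generate_key()    # in practice, keys would come from a key manager
fernet = Fernet(encryption_key)

def ingest_block(block: bytes):
    fingerprint = hashlib.sha256(block).hexdigest()
    if fingerprint in fingerprint_index:
        # Duplicate data: replace it with a reference to the existing copy.
        return ("reference", fingerprint_index[fingerprint])
    compressed = zlib.compress(block)       # encode with fewer bits than the original
    encrypted = fernet.encrypt(compressed)  # unreadable until decrypted with the key
    fingerprint_index[fingerprint] = fingerprint  # remember that this unique block now exists
    return ("stored", encrypted)
```

The ordering matters: deduplicating first avoids compressing and encrypting data the system already holds, and compressing before encrypting preserves the compressibility of the data.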
The example method depicted in fig. 6 also includes storing (614) the data in block storage of the cloud-based storage system (604). Storing (614) the data in block storage of the cloud-based storage system (604) may be carried out, for example, by storing (616) the data in solid-state storage (e.g., the local SSD storage) of one or more cloud computing instances, as described in more detail above. In such an example, the data and parity data may be striped across the local storage of many cloud computing instances in order to achieve RAID or RAID-like data redundancy.
The example method depicted in fig. 6 also includes storing (618) the data in object storage of the cloud-based storage system (604). Storing (618) the data in object storage of the cloud-based storage system may include creating (620) one or more equal-sized objects, where each equal-sized object includes a distinct chunk of the data. In such an example, because each object includes data as well as metadata, the data portion of each object may be of equal size. In other embodiments, the data portion of each created object may not be of equal size. For example, each object could include the data from a predetermined number of blocks in the block storage described in the preceding paragraph, or the objects could be organized in some other way.
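The following sketch illustrates one way a write might be cut into equal-sized chunks, each destined to become its own object; the chunk size and object key naming are assumptions made for the example.

```python
# Cut an incoming write into equal-sized chunks, one object per chunk.
CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB data portion per object (example value only)

def objects_for_write(volume_id: str, offset: int, data: bytes):
    """Yield (object_key, chunk) pairs; the final chunk may be shorter than CHUNK_SIZE."""
    for i in range(0, len(data), CHUNK_SIZE):
        chunk = data[i:i + CHUNK_SIZE]
        key = f"{volume_id}/{offset + i:016x}"   # illustrative key scheme: volume + byte offset
        yield key, chunk

# Example: a 10 MiB write produces three objects (4 MiB, 4 MiB, 2 MiB).
objects = list(objects_for_write("vol-1", offset=0, data=b"\x00" * (10 * 1024 * 1024)))
```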
The example method depicted in fig. 6 also includes receiving (622), by the cloud-based storage system, a request to read data from the cloud-based storage system (604). The request to read data from the cloud-based storage system (604) may be received, for example, from a user of the storage system that is communicatively coupled to the cloud computing environment, from an application that is executing in the cloud computing environment, or in some other way. The request can include, for example, a logical address of the data that is to be read from the cloud-based storage system (604).
The example method depicted in fig. 6 also includes retrieving (624) data from a block storage of the cloud-based storage system (604). Readers will appreciate that, for example, the cloud-based storage system (604) may retrieve (624) data from the block storage of the cloud-based storage system (604) by the storage controller application forwarding the read request to a cloud computing instance that contains the requested data in its local storage. Readers will appreciate that by retrieving (624) data from the block storage of the cloud-based storage system (604), the data may be retrieved faster than if the data were read from the cloud-based object storage, even if the cloud-based object storage does contain a copy of the data.
Readers will appreciate that in the example method depicted in fig. 6, the block storage of the cloud-based storage system (604) is characterized by low read latency relative to the object storage of the cloud-based storage system. Thus, by servicing read operations from a block store (rather than an object store), the cloud-based storage system (604) may be able to service read operations using low-latency block stores while still providing the flexibility associated with object store solutions provided by cloud service providers. Furthermore, the block storage of the cloud-based storage system (604) may provide relatively high bandwidth. The block storage of the cloud-based storage system (604) may be implemented in a variety of ways as will occur to readers of the present disclosure.
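As an illustrative sketch of the read-path preference described above, the following assumes hypothetical block-tier and object-tier interfaces and services a read from the low-latency block tier first, falling back to the object tier only when necessary.

```python
# Prefer the low-latency block tier; fall back to the durable object tier.
def service_read(logical_address, block_tier, object_tier):
    """block_tier and object_tier are assumed to expose read(address) -> bytes | None."""
    data = block_tier.read(logical_address)
    if data is not None:
        return data                            # fast path: local, SSD-backed block storage
    return object_tier.read(logical_address)   # slower path: the object tier holds a copy
```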
For further explanation, FIG. 7 sets forth a flow chart illustrating an additional example method of servicing I/O operations in a cloud-based storage system (604). The example method depicted in fig. 7 is similar to the example method depicted in fig. 6 in that the example method depicted in fig. 7 also includes: receiving (606) a request to write data to a cloud-based storage system (604); storing (614) the data in a block store of the cloud-based storage system (604); and storing (618) the data in an object store of the cloud-based storage system (604).
The example method depicted in fig. 7 also includes detecting (702) that at least some portion of the block storage of the cloud-based storage system has become unavailable. Detecting (702) that at least some portion of the block storage of the cloud-based storage system has become unavailable may be carried out, for example, by detecting that one or more of the cloud computing instances that include local storage has become unavailable, as described in more detail below.
The example method depicted in fig. 7 also includes identifying (704) data stored in a portion of the block storage of the cloud-based storage system that has become unavailable. For example, identifying (704) data stored in a portion of a block storage device of a cloud-based storage system that has become unavailable may be performed using metadata that maps some identifier (e.g., serial number, address) of a piece of data to a location where the data is stored. This metadata or separate metadata may also map the piece of data to one or more object identifiers that identify objects stored in object storage of the cloud-based storage system containing the piece of data.
The example method depicted in fig. 7 also includes retrieving (706), from the object storage of the cloud-based storage system, the data that was stored in the portion of the block storage of the cloud-based storage system that has become unavailable. Retrieving (706) this data may be carried out, for example, by using the metadata described above that maps each piece of data stored in the portion of the block storage that has become unavailable to the one or more objects stored in the object storage of the cloud-based storage system that contain that piece of data. In such an example, retrieving (706) the data may be carried out by reading the objects that map to the data from the object storage of the cloud-based storage system.
The example method depicted in fig. 7 also includes storing (708) the retrieved data in block storage of the cloud-based storage system. Storing (708) the retrieved data in block storage of the cloud-based storage system may be carried out, for example, by creating replacement cloud computing instances with local storage and storing the data in the local storage of one or more of the replacement cloud computing instances, as described in more detail above.
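The following is a minimal sketch, under assumed metadata structures and interfaces, of the identify, retrieve, and restore flow of fig. 7: find the pieces of data whose block-tier location is gone, look up the objects that contain them, read those objects from the object tier, and rewrite the data into replacement block storage.

```python
# Hypothetical metadata maps used to drive recovery of an unavailable instance.
block_location = {}    # data_id -> (instance_id, local_address)
object_for_data = {}   # data_id -> object key in the object store

def data_ids_on_instance(failed_instance_id: str):
    """Identify the data whose block-tier copy lived on the failed instance."""
    return [d for d, (inst, _) in block_location.items() if inst == failed_instance_id]

def recover_instance(failed_instance_id: str, object_tier, replacement_block_tier):
    """object_tier.get(key) and replacement_block_tier.put(data_id, data) are assumed interfaces."""
    for data_id in data_ids_on_instance(failed_instance_id):
        obj_key = object_for_data[data_id]
        data = object_tier.get(obj_key)               # retrieve from durable object storage
        new_location = replacement_block_tier.put(data_id, data)
        block_location[data_id] = new_location        # update the metadata map
```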
Readers will appreciate that although the embodiments described above relate to embodiments in which the data that was stored in the portion of the block storage of the cloud-based storage system that has become unavailable is essentially brought back into the block storage layer of the cloud-based storage system by retrieving the data from the object storage layer of the cloud-based storage system, other embodiments are within the scope of the present disclosure. For example, because data may be distributed across the local storage of multiple cloud computing instances using data redundancy techniques such as RAID, in some embodiments the lost data may be brought back into the block storage layer of the cloud-based storage system through a RAID rebuild.
For further explanation, FIG. 8 sets forth a flow chart illustrating an example method of servicing I/O operations in a cloud-based storage system (804). Although depicted in less detail, the cloud-based storage system (804) depicted in fig. 8 may be similar to the cloud-based storage system described above and may be supported by a cloud computing environment (802).
The example method depicted in fig. 8 includes receiving (806), by the cloud-based storage system (804), a request to write data to the cloud-based storage system (804). The request to write data may be received, for example, from a user of the storage system that is communicatively coupled to the cloud computing environment, from an application that is executing in the cloud computing environment, or in some other way. In such an example, the request can include the data that is to be written to the cloud-based storage system (804). In other embodiments, the request to write data to the cloud-based storage system (804) may occur at boot time, when the cloud-based storage system (804) is being brought up.
The example method depicted in fig. 8 also includes deduplicating (808) the data. Data deduplication is a data reduction technique for eliminating duplicate copies of repeating data. The cloud-based storage system (804) may deduplicate (808) the data, for example, by comparing one or more portions of the data to data that is already stored in the cloud-based storage system (804), by comparing fingerprints for one or more portions of the data to fingerprints for data that is already stored in the cloud-based storage system (804), or in other ways. In such an example, duplicate data may be removed and replaced by a reference to an already existing copy of the data that is stored in the cloud-based storage system (804).
The example method depicted in fig. 8 also includes compressing (810) the data. Data compression is a data reduction technique whereby information is encoded using fewer bits than the original representation. The cloud-based storage system (804) may compress (810) the data by applying one or more data compression algorithms to the data, which, at this point, may not include data that is already stored in the cloud-based storage system (804).
The example method depicted in fig. 8 also includes encrypting (812) the data. Data encryption is a technique that involves the conversion of data from a readable format into an encoded format that can only be read or processed after the data has been decrypted. The cloud-based storage system (804) may encrypt (812) the data, which, at this point, may have already been deduplicated and compressed, using an encryption key. The reader will appreciate that although the embodiment depicted in fig. 8 involves deduplicating (808), compressing (810), and encrypting (812) the data, other embodiments exist in which fewer of these steps are performed, and embodiments exist in which the same number of steps, or a subset of the steps, are performed in a different order.
The example method depicted in fig. 8 also includes storing (814) the data in block storage of the cloud-based storage system (804). Storing (814) the data in block storage of the cloud-based storage system (804) may be carried out, for example, by storing (816) the data in the local storage (e.g., SSD storage) of one or more cloud computing instances, as described in more detail above. In such an example, the data and parity data may be striped across the local storage of multiple cloud computing instances in order to achieve RAID or RAID-like data redundancy.
The example method depicted in fig. 8 also includes storing (818) the data in object storage of the cloud-based storage system (804). Storing (818) the data in object storage of the cloud-based storage system may include creating (820) one or more equal-sized objects, where each equal-sized object includes a distinct chunk of the data, as described in more detail above.
The example method depicted in fig. 8 also includes receiving (822), by the cloud-based storage system, a request to read data from the cloud-based storage system (804). The request to read data from the cloud-based storage system (804) may be received, for example, from a user of the storage system that is communicatively coupled to the cloud computing environment, from an application that is executing in the cloud computing environment, or in some other way. The request can include, for example, a logical address of the data that is to be read from the cloud-based storage system (804).
The example method depicted in fig. 8 also includes retrieving (824) data from a block storage of the cloud-based storage system (804). Readers will appreciate that, for example, the cloud-based storage system (804) may retrieve (824) data from the block storage of the cloud-based storage system (804) by the storage controller application forwarding the read request to a cloud computing instance that contains the requested data in its local storage. Readers will appreciate that by retrieving (824) data from the block storage of the cloud-based storage system (804), the data may be retrieved faster than if the data were read from the cloud-based object storage, even if the cloud-based object storage does contain a copy of the data.
For further explanation, FIG. 9 sets forth a flow chart illustrating an additional example method of servicing I/O operations in a cloud-based storage system (804). The example method depicted in fig. 9 is similar to the example method depicted in fig. 8 in that the example method depicted in fig. 9 also includes: receiving (806) a request to write data to a cloud-based storage system (804); storing (814) the data in a block store of the cloud-based storage system (804); and storing (818) the data in an object store of the cloud-based storage system (804).
The example method depicted in fig. 9 also includes detecting (902) that at least some portion of the block storage of the cloud-based storage system has become unavailable. Detecting (902) that at least some portion of the block storage of the cloud-based storage system has become unavailable may be carried out, for example, by detecting that one or more of the cloud computing instances that include local storage has become unavailable, as described in more detail below.
The example method depicted in fig. 9 also includes identifying (904) data stored in a portion of the block storage of the cloud-based storage system that has become unavailable. For example, identifying (904) data stored in a portion of a block storage device of a cloud-based storage system that has become unavailable may be performed using metadata that maps some identifier (e.g., serial number, address) of a piece of data to a location where the data is stored. This metadata or separate metadata may also map the piece of data to one or more object identifiers that identify objects stored in object storage of the cloud-based storage system containing the piece of data.
The example method depicted in fig. 9 also includes retrieving (906), from the object storage of the cloud-based storage system, the data that was stored in the portion of the block storage of the cloud-based storage system that has become unavailable. Retrieving (906) this data may be carried out, for example, by using the metadata described above that maps each piece of data stored in the portion of the block storage that has become unavailable to the one or more objects stored in the object storage of the cloud-based storage system that contain that piece of data. In such an example, retrieving (906) the data may be carried out by reading the objects that map to the data from the object storage of the cloud-based storage system.
The example method depicted in fig. 9 also includes storing (908) the retrieved data in block storage of the cloud-based storage system. Storing (908) the retrieved data in block storage of the cloud-based storage system may be carried out, for example, by creating replacement cloud computing instances with local storage and storing the data in the local storage of one or more of the replacement cloud computing instances, as described in more detail above.
For further explanation, FIG. 10 sets forth a flow chart illustrating an additional example method of servicing I/O operations in a cloud-based storage system (604). The example method depicted in fig. 10 is similar to the example method depicted in many of the above figures, in that the example method depicted in fig. 10 also includes: receiving (606) a request to write data to a cloud-based storage system (604); storing (614) the data in a block store of the cloud-based storage system (604); and storing (618) the data in an object store of the cloud-based storage system (604).
In the example method depicted in fig. 10, receiving (606) the request to write data to the cloud-based storage system may include receiving (1002), by a storage controller application that is executing in a cloud computing instance, the request to write data to the cloud-based storage system. The storage controller application that is executing in a cloud computing instance may be similar to the storage controller applications described above and may be executing, for example, on an EC2 instance, as described in more detail above. In fact, the cloud-based storage system (604) may actually include multiple EC2 instances or similar cloud computing instances, where each of the multiple cloud computing instances executes a storage controller application.
In the example method depicted in fig. 10, storing (614) the data in block storage of the cloud-based storage system may include issuing (1004), by the storage controller application that is executing in a cloud computing instance, an instruction to write the data to local storage within one or more cloud computing instances with local storage. The one or more cloud computing instances with local storage may be similar to the cloud computing instances with local storage described above. In the example method depicted in fig. 10, the storage controller application that is executing in a cloud computing instance may be coupled for data communications with a plurality of cloud computing instances with local storage. In such a way, the storage controller application that is executing in a cloud computing instance may treat the plurality of cloud computing instances with local storage as individual storage devices, such that the storage controller application that is executing in a cloud computing instance can issue (1004) the instruction to write the data to local storage within one or more cloud computing instances with local storage by issuing the same set of commands that the storage controller application would issue when writing data to a connected storage device. Readers will appreciate that because the storage controller application that is executing in a cloud computing instance may be coupled for data communications with a plurality of cloud computing instances with local storage, the storage controller application may be connected to multiple sources of block storage, whereas a storage array controller that was configured to use EBS as its block storage could only be connected to a single EBS volume.
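The following sketch illustrates, with hypothetical class and function names, how a controller might treat several drive instances as individual storage devices and spread writes across them, rather than being limited to a single attached block volume.

```python
# Treat each cloud computing instance with local storage as an individual drive
# and stripe writes across the drives by logical block address (LBA).
class DriveInstance:
    def __init__(self, instance_id: str):
        self.instance_id = instance_id
        self.blocks = {}

    def write(self, lba: int, data: bytes) -> None:
        self.blocks[lba] = data          # stands in for a block-protocol write to the instance

def drive_for_lba(lba: int, drives: list) -> DriveInstance:
    return drives[lba % len(drives)]     # simple striping across the available drive instances

drives = [DriveInstance(f"drive-{i}") for i in range(3)]
drive_for_lba(7, drives).write(7, b"example block")
```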
In the example method depicted in fig. 10, one or more of the plurality of cloud computing instances with local storage may be coupled for data communications with a plurality of cloud computing instances that each execute the storage controller application. Readers will appreciate that, in some embodiments, because there are multiple cloud computing instances that each execute the storage controller application, the storage controller application that is executing on a first cloud computing instance may serve as the primary controller, while additional storage controller applications that are executing on additional cloud computing instances may serve as secondary controllers that can take over for the primary controller upon the occurrence of some event (e.g., a failure of the primary controller).
For further explanation, FIG. 11 sets forth a flow chart illustrating an additional example method of servicing I/O operations in the cloud-based storage system (604). The example method depicted in fig. 11 is similar to the example method depicted in many of the above figures, in that the example method depicted in fig. 11 also includes: receiving (606) a request to write data to a cloud-based storage system (604); storing (614) the data in a block store of the cloud-based storage system (604); and storing (618) the data in an object store of the cloud-based storage system (604).
In the example method depicted in fig. 11, storing (614) the data in block storage of the cloud-based storage system may include writing (1102) the data into one or more blocks of the block storage using a block-level protocol. In the example method depicted in fig. 11, the block storage may be embodied, for example, as one or more block storage devices such as NAND flash memory in which data is stored in blocks, where each block can be used to store data of a maximum size (i.e., the block size). The data may be written (1102) to such storage using a block-level protocol such as, for example, iSCSI, Fibre Channel, FCoE (Fibre Channel over Ethernet), and so on. The reader will appreciate that by writing (1102) the data into one or more blocks of the block storage using a block-level protocol, the data that is written to the block storage of the cloud-based storage system is consequently stored in blocks.
In the example method depicted in fig. 11, storing (618) the data in object storage of the cloud-based storage system can include writing (1104) the data into one or more objects in the object storage using an object-level protocol. In the example method depicted in fig. 11, the object storage may be configured to manage data as objects, as opposed to other storage architectures such as file systems, which manage data as a file hierarchy, and block storage, which manages data as blocks. Such object storage can be implemented at the device level (object storage devices), at the system level, at the interface level, or in some other way. The data may be written (1104) to the object storage using an object-level protocol such as, for example, the SCSI command set for object storage devices, RESTful/HTTP protocols, the AWS S3 API, the Cloud Data Management Interface for accessing cloud storage, and others. The reader will appreciate that by writing (1104) the data into one or more objects in the object storage using an object-level protocol, the data that is written to the object storage of the cloud-based storage system is consequently stored in objects, rather than in blocks as was the case in the preceding paragraph.
In the example method depicted in fig. 11, the data contained in a particular block may, for each block of data, be written into a unique object. The reader will appreciate that each object that is written (1104) to the object storage may include the data itself as well as its associated metadata, and each object may be associated with a globally unique identifier (rather than a file name and file path, a block number, and so on). As such, the data contained in a particular block may be written into a unique object in the sense that the unique object includes the data itself, metadata associated with the data, and a globally unique identifier. In such embodiments, the cloud-based storage system may therefore maintain a mapping between each block of data stored in the block storage of the cloud-based storage system and each object stored in the object storage of the cloud-based storage system. In some embodiments, each object may include the data contained in multiple blocks, but the data contained in those multiple blocks need only be stored in a single object.
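The following sketch illustrates, under assumed naming, the kind of block-to-object mapping described above, in which data written as blocks is also written as objects that carry globally unique identifiers.

```python
# Maintain a mapping from each stored block to the object that also holds it.
import uuid

block_to_object = {}   # (volume_id, block_number) -> globally unique object identifier

def record_object_for_blocks(volume_id: str, block_numbers, payload: bytes) -> str:
    """Register one object holding the data of one or more blocks; return its identifier."""
    object_id = str(uuid.uuid4())        # globally unique identifier, not a file path or block number
    for block_number in block_numbers:
        block_to_object[(volume_id, block_number)] = object_id
    return object_id                     # the caller writes 'payload' under this id via an object-level protocol

# Example: three consecutive blocks stored together in a single object.
oid = record_object_for_blocks("vol-1", [100, 101, 102], b"...block data...")
```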
For further explanation, FIG. 12 sets forth an example virtual storage system architecture 1200 according to some embodiments. The virtual storage system architecture may include cloud-based computing resources similar to the cloud-based storage systems described above with reference to fig. 4-11.
As described above with reference to fig. 1A-3E, in some embodiments of a physical storage system, the physical storage system may include one or more controllers that provide storage services to one or more hosts, where the physical storage system includes durable storage (e.g., solid state drives or hard disks) and also includes some fast durable storage (e.g., NVRAM). In some examples, the fast durable storage may be used for staging or transaction commits, or for speeding up the acknowledgement of operation durability, in order to reduce the latency of host requests.
In general, fast durable storage is typically used for intent logging, fast completions, or quickly ensuring transaction consistency; such (or similar) purposes are referred to herein as hierarchical memory. Generally, both physical and virtual storage systems may have one or more controllers and may have specialized storage components, such as dedicated storage devices in the case of a physical storage system. Further, in some cases, in both physical and virtual storage systems, the hierarchical memory may be organized and reorganized in a variety of ways, such as in the examples described later. In some examples, irrespective of how memory components or storage devices are constructed, obtained, or organized, there may be a set of storage system logic that executes to implement a set of advertised storage services and that stores bulk data for indefinite durations, and there may also be some quantity of hierarchical memory.
In some examples, the controller logic that operates a physical storage system (e.g., the physical storage systems described with reference to fig. 1A-3E) may be implemented within a virtual storage system by providing suitable virtual components to, individually or in the aggregate, serve as substitutes for hardware components in the physical storage system, where the virtual components are configured to operate the controller logic and to interact with other virtual components that are configured to replace physical components other than the controllers.
Continuing with this example, the virtual components that execute the controller logic may implement and/or adapt high availability models that keep the virtual storage system operating in the event of a failure. As another example, the virtual components that execute the controller logic may implement protocols that keep the virtual storage system from losing data in the face of transient failures, including transient failures that may exceed what the virtual storage system can tolerate while continuing to operate.
In some implementations, and particularly with respect to the various virtual storage system architectures described with reference to fig. 12-17, a computing environment may include a typical set of constructs that are made available by cloud-based infrastructure-as-a-service platforms (e.g., cloud infrastructures provided by Amazon Web Services™, Microsoft Azure™, and/or Google Cloud Platform™). In some implementations, example constructs, and construct characteristics, within such cloud platforms may include:
Compute instances, where the compute instances may be executed or run as virtual machines that are flexibly assigned to physical host servers;
Separate geographic regions, across which computing resources may be distributed or divided, such that users within the same region as a given cloud computing resource may experience faster and/or higher-bandwidth access than users in a region that differs from that of the computing resource;
"Availability" zones, into which the resources within a geographic region are divided, with individual availability and survivability in the case of large-scale data center outages, network failures, grid failures, administrative errors, and so on. Moreover, in some examples, resources of a particular cloud platform that are in separate availability zones within the same geographic region generally have relatively high bandwidth and relatively low latency between one another;
Local instance storage that may provide private storage to a compute instance, e.g., hard drives, solid state drives, or rack-local storage. Other examples of local instance storage are described above with reference to fig. 4 through 11;
Block stores that are relatively high-speed and durable, and that can be connected to a virtual machine and whose attachment can be migrated between compute instances. Some examples include EBS (Elastic Block Store) in AWS™, Managed Disks in Microsoft Azure™, and Persistent Disks in the Google Cloud Platform™ Compute Engine. EBS in AWS operates within a single availability zone, but is otherwise reasonably reliable and available, and is intended for long-term use by compute instances, even as those compute instances move between physical systems and racks;
Object stores, for example Amazon S3™, or object stores that use a protocol derived from S3, that are compatible with S3, or that have characteristics similar to those of S3 (e.g., Microsoft Azure Blob Storage™). In general, object stores are highly durable, surviving extensive outages through replication across availability zones and across geographies;
Various object store classes, or other storage types, that a cloud platform may support and that may differ in their combinations of capacity price, access price, expected latency, expected throughput, availability guarantees, or durability guarantees. For example, in AWS™ the Standard and Infrequent Access S3 storage classes (referred to herein as standard and write-mostly storage classes) differ in availability (rather than durability) as well as in capacity and access price, with the Infrequent Access storage class being cheaper per unit of capacity but more expensive to retrieve from, and having one tenth the expected availability. Infrequent Access S3 also supports an even cheaper variant that cannot tolerate the complete loss of an availability zone, referred to herein as a single-availability-zone durable store. AWS further supports archive classes (e.g., Glacier™ and Deep Glacier™) that provide the lowest capacity price but very high access latencies (on the order of minutes to hours for Glacier, and up to 12 hours for Deep Glacier, with limits on retrieval frequency). Glacier and Deep Glacier are referred to herein as examples of archive and deep archive storage classes;
Databases, typically of many different kinds, including high-scale key-value store databases with reasonable durability (similar to the high-speed durable block stores) and a convenient set of atomic update primitives. Some examples of durable key-value databases include AWS DynamoDB™, Google Cloud Platform Bigtable™, and Microsoft Azure CosmosDB™; and
Dynamic functions, i.e., code snippets that can be configured to run dynamically within the cloud platform infrastructure in response to events or actions associated with that configuration. For example, in AWS these dynamic functions are called AWS Lambdas™, while Microsoft Azure and Google Cloud Platform refer to such dynamic functions as Azure Functions™ and Cloud Functions™, respectively.
In some embodiments, local instance storage is not intended for long-term use, and in some examples the local instance storage may not be migrated when a virtual machine migrates between host systems. In some cases, local instance storage also may not be shared between virtual machines, and it may come with few availability guarantees due to its local nature (it is likely to survive local power and software faults, but not necessarily more extensive failures). Furthermore, in some instances, local instance storage may be relatively inexpensive compared to durable block storage, and its use may not be billed based on the I/Os issued against it, as is commonly the case with the more durable block storage services.
In some implementations, objects within an object store are easy to create (e.g., a web service PUT operation to create an object with a name within some bucket associated with an account) and to retrieve (e.g., a web service GET operation), and parallel creates and retrievals across a sufficient number of objects can yield significant bandwidth. However, in some cases, latency is generally poor, and modifications or replacements of objects may complete in unpredictable amounts of time, or it may be difficult to determine when an object is fully durable and consistently available across the cloud platform infrastructure. Further, in general, the availability of object stores (as opposed to their durability) is often low, which is a problem for many services running in a cloud environment.
In some implementations, as an example baseline, the virtual storage system may include one or more of the following virtual components and concepts for constructing, building, and/or defining a virtual storage system that is built on a cloud platform:
virtual controllers, such as virtual storage system controllers running on computing instances within the cloud platform's infrastructure or cloud computing environment. In some examples, the virtual controller may run on a virtual machine, in a container, or on a bare metal server;
A virtual drive, where the virtual drive may be a particular storage object provided to a virtual storage system controller to represent a data set; for example, a virtual drive may be a volume or an emulated disk drive that can serve, within the virtual storage system, a role similar to that of a "storage device" in a physical storage system. Further, virtual drives may be provided to the virtual storage system controller by a "virtual drive server";
A virtual drive server, which may be implemented by a compute instance, where the virtual drive server may present storage, such as virtual drives, built out of the available components provided by the cloud platform (e.g., the various types of local storage options), and where the virtual drive server implements logic that provides the virtual drives to one or more virtual storage system controllers, or in some cases to one or more virtual storage systems;
A hierarchical memory that may be fast and durable, or at least reasonably fast and reasonably durable, where "reasonably durable" may be specified in terms of a durability metric and "reasonably fast" may be specified in terms of a performance metric (e.g., IOPS);
A virtual storage system dataset, which may be a defined collection of data and metadata representing coherently managed content, where the defined collection represents a set of file systems, volumes, objects, and other similarly addressable portions of memory;
An object store, which may provide a durable back-end object store behind the hierarchical memory. As illustrated in fig. 12, the cloud-based object store 432 may be managed by the virtual drives 1210-1216;
Segments, which may be designated as medium-sized chunks of data. For example, a segment may be defined as being in the range of 1MB to 64MB in size, where a segment may hold a combination of data and metadata; and
Virtual storage system logic, which may be a set of algorithms that run on at least the one or more virtual controllers 408, 410, and in some cases some of the virtual storage system logic may also run on the one or more virtual drives 1210-1216.
In some implementations, the virtual controller may accept or receive I/O operations and/or configuration requests from the client hosts 1260, 1262 (possibly through an intermediary server, not depicted) or from a management interface or tool, and then ensure that the I/O requests and other operations run until completion.
In some examples, the virtual controller may present a file system, a block-based volume, an object store, and/or a particular kind of mass storage database or key/value store, and may provide data services such as snapshot, replication, migration services, provisioning, host connectivity management, deduplication, compression, encryption, secure sharing, and other such storage system services.
In the example virtual storage system 1200 architecture illustrated in FIG. 12, the virtual storage system 1200 includes two virtual controllers, where one virtual controller operates in one availability zone (zone 1251) and the other virtual controller operates in another availability zone (zone 1252). In this example, the two virtual controllers are depicted, respectively, as storage controller application 408 running within cloud computing instance 404 and storage controller application 410 running within cloud computing instance 406.
In some implementations, a virtual drive server, as discussed above, may present to hosts something that resembles a physical storage device, such as a disk drive or a solid state drive, where that physical storage device would operate within the context of a physical storage system.
However, while a virtual drive similarly appears to a host as a physical storage device in this example, the virtual drive is implemented by a virtual storage system architecture, which may be any of the architectures depicted in fig. 4-16. Furthermore, in contrast to virtual drives, which have physical storage devices as their emulated counterparts, virtual drive servers, as implemented within the example virtual storage system architectures, may have no emulated counterpart within the context of a physical storage system. In particular, in some examples, a virtual drive server may implement logic that goes beyond what is typical of storage devices in a physical storage system, and in some cases may rely on non-standard storage system protocols, between the virtual drive server and the virtual storage system controllers, that are not emulations of anything found in a physical storage system. Conceptually, however, a virtual drive server may share similarities with a scale-out, shared-nothing, or software-defined storage system.
In some implementations, referring to fig. 12, the respective virtual drive servers 1210-1216 may implement respective software applications or daemons 1230-1236 to present virtual drives whose functionality is similar, or even identical, to that of physical storage devices, which allows easier migration of storage system software or applications that were designed for physical storage systems. For example, a daemon may implement standard SAS, SCSI, or NVMe protocols, or it may implement these protocols with minor or significantly non-standard extensions.
In some implementations, referring to fig. 12, the hierarchical memory may be implemented by one or more virtual drives 1210-1216, where the one or more virtual drives 1210-1216 store data within respective block storage volumes 1240-1246 and local storage 1220-1226. In this example, a block storage volume may be an AWS EBS volume that can be attached to two or more different virtual drives, although only to one at a time (as depicted in fig. 12). As illustrated in fig. 12, block storage volume 1240 is attached to virtual drive 1212, block storage volume 1242 is attached to virtual drive 1214, and so on.
In some embodiments, segments may be designated as part of an erasure coded set, for example based on RAID-style schemes, where some segments store parity content that is calculated, using erasure codes (e.g., RAID-style P and Q data), from the content of other segments. In some examples, segment content may be created once and not modified after the segment is created and filled, until the segment is discarded or garbage collected.
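As an illustrative aside, the following sketch shows single-parity (P) calculation over the data segments of an erasure coded set; the Q parity used by RAID-6-style schemes requires Galois-field arithmetic and is omitted here, and the helper names are assumptions made for the example.

```python
# Compute the P parity segment as the bytewise XOR of equal-length data segments.
def xor_parity(segments: list) -> bytes:
    parity = bytearray(len(segments[0]))
    for segment in segments:
        for i, byte in enumerate(segment):
            parity[i] ^= byte
    return bytes(parity)

# Any one lost data segment can be rebuilt by XOR-ing the parity with the
# surviving segments.
data = [b"\x01\x02", b"\x0f\x00", b"\x10\x20"]
p = xor_parity(data)
rebuilt = xor_parity([p, data[1], data[2]])
assert rebuilt == data[0]
```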
In some embodiments, virtual storage system logic may also run from other virtual storage system components, such as dynamic functions. Virtual storage system logic may provide a complete implementation of the capabilities and services advertised by virtual storage system 1200, where virtual storage system 1200 reliably and with appropriate durability implements these services using one or more available cloud platform components, such as the cloud platform components described above.
Although the example virtual storage system 1200 illustrated in FIG. 12 includes two virtual controllers, more generally, other virtual storage system architectures can have more or fewer virtual controllers, as illustrated in FIGS. 13-16. Further, in some implementations, and similar to the physical storage systems described in fig. 1A-4, the virtual storage system may include an active virtual controller and one or more passive virtual controllers.
For further explanation, FIG. 13 sets forth an example virtual storage system architecture 1300 according to some embodiments. The virtual storage system architecture may include cloud-based computing resources similar to the cloud-based storage systems described above with reference to fig. 4-12.
In this embodiment, the virtual storage system may run virtual storage system logic concurrently on multiple virtual controllers, for example by dividing up the data set or by carefully implementing concurrent distributed algorithms, as described above with reference to fig. 12. In this example, the multiple virtual controllers 1320, 408, 410, 1322 are implemented within respective cloud computing instances 1310, 404, 406, 1312.
As described above with reference to fig. 12, in some embodiments, a particular set of hosts may be directed preferentially or exclusively to a subset of virtual controllers for a data set, while a different particular set of hosts may be directed preferentially or exclusively to a different subset of controllers for that same data set. For example, SCSI ALUA (asymmetric logical unit access) or NVMe ANA (asymmetric namespace access) or some similar mechanism may be used to establish preferred (sometimes referred to as "optimized") path preferences from one host to a subset of controllers, where traffic is typically directed to a preferred subset of controllers, but where the traffic may be redirected to a different subset of virtual storage system controllers, for example in the event of a request failure or network failure or virtual storage system controller failure. Alternatively, SCSI/NVMe volume advertising or network constraints or some similar alternative mechanism may force all traffic from a particular set of hosts exclusively to one subset of controllers, or may force traffic from a different particular set of hosts to a different subset of controllers.
As illustrated in fig. 13, the virtual storage system may preferentially or exclusively direct I/O requests from host 1260 to virtual storage controllers 1320 and 408 (with storage controller 410, and possibly 1322, potentially available to host 1260 for use in the event of a request failure), and may preferentially or exclusively direct I/O requests from host 1262 to virtual storage controllers 410 and 1322 (with storage controller 408, and possibly 1320, potentially available to host 1262 for use in the event of a request failure). In some implementations, a host may be directed to issue I/O requests to one or more virtual storage controllers within the same availability zone as the host, with virtual storage controllers in an availability zone that differs from that of the host being available for use in the event of a failure.
For further explanation, FIG. 14 illustrates an example virtual storage system architecture 1400 according to some embodiments. The virtual storage system architecture may include cloud-based computing resources similar to the cloud-based storage systems described above with reference to fig. 4-13.
In some implementations, the boundary between the virtual controller and the virtual drive server hosting the virtual drive may be flexible. Moreover, in some examples, boundaries between virtual components may not be visible to client hosts 1450a-1450p, and client hosts 1450a-1450p may not detect any difference between virtual storage systems of two different architectures that provide the same set of storage system services.
For example, virtual controllers and virtual drives may be combined into a single virtual entity that may provide functionality similar to a traditional blade-based scale-out storage system. In this example, virtual storage system 1400 includes n virtual blades (virtual blades 1402a-1402n), where each respective virtual blade 1402a-1402n may include a respective virtual controller 1404a-1404n and also includes respective local storage 1220-1226, 1240-1246, but where the storage function may use a platform that provides object storage, as in the case of the previously described virtual drive implementations.
In some embodiments, because the virtual drive server supports general purpose computing, this virtual storage system architecture supports migration of functionality between the virtual storage system controller and the virtual drive server. Furthermore, in other cases, this virtual storage system architecture supports other kinds of optimizations, such as those described above that may be performed within the hierarchical memory. Further, virtual blades may be configured with different levels of processing power, where the performance specifications of a given virtual blade or blades may be based on the intended optimizations to be performed.
For further explanation, FIG. 15 sets forth an example virtual storage system architecture 1500 according to some embodiments. The virtual storage system architecture may include cloud-based computing resources similar to the cloud-based storage systems described above with reference to fig. 4-14.
In this embodiment, the virtual storage system 1500 may be adapted to different availability zones, wherein this virtual storage system 1500 may use storage-system-to-storage-system synchronous replication logic to isolate as many portions of a virtual storage system instance as possible within one availability zone. For example, the presented virtual storage system 1500 may be constructed from a first virtual storage system 1502 in one availability zone (zone 1), the first virtual storage system 1502 synchronously replicating data to a second virtual storage system 1504 in another availability zone (zone 2), such that the presented virtual storage system may continue to operate and provide its services even if data or availability is lost in one availability zone or the other. This implementation may further share the use of durable objects, such that storing data into the object store is coordinated and the two virtual storage systems do not duplicate the stored content. Furthermore, in this implementation, the two synchronously replicating storage systems may synchronously replicate updates to the hierarchical memory, and possibly to the local instance storage within each of their availability zones, to significantly reduce the chance of data loss, while coordinating updates to the object store as later asynchronous activity to significantly reduce the capacity cost of what is stored in the object store.
In this example, virtual storage system 1504 is implemented within cloud computing environment 1501. Further, in this example, virtual storage system 1502 can use cloud-based object storage 1550 and virtual storage system 1504 can use cloud-based object storage 1552, where in some cases, such as AWS S3, the different object stores 1550, 1552 can be the same cloud object storage service with different buckets.
Continuing with this example, in some cases, virtual storage system 1502 may synchronously copy data to other virtual or physical storage systems in other availability zones (not depicted).
In some implementations, the virtual storage system architectures of virtual storage systems 1502 and 1504 can be dissimilar, and even incompatible, where synchronous replication can instead depend on a protocol-compatible synchronous replication model. Synchronous replication is described in more detail above with reference to fig. 3D and 3E.
In some implementations, virtual storage system 1502 may be implemented similar to virtual storage system 1400 described above with reference to fig. 14, and virtual storage system 1504 may be implemented similar to virtual storage system 1200 described above with reference to fig. 12.
For further explanation, FIG. 16 sets forth an example virtual storage system architecture 1600 according to some embodiments. The virtual storage system architecture may include cloud-based computing resources similar to the cloud-based storage systems described above with reference to fig. 4-15.
In some implementations, similar to the example virtual storage system 1500 described above with reference to fig. 15, the virtual storage system 1600 may include multiple virtual storage systems 1502, 1504 that coordinate to perform synchronous replication from one virtual storage system to another.
However, in contrast to the example virtual storage system 1500 described above, the virtual storage system 1600 illustrated in fig. 16 provides a single cloud-based object storage 1650 shared among the virtual storage systems 1502, 1504.
In this example, shared cloud-based object store 1650 can be viewed as an additional data replica target whose updates are delayed, using the mechanisms and logic associated with a consistent but asynchronous replication model. In this way, a single cloud-based object store 1650 may be consistently shared among the multiple individual virtual storage systems 1502, 1504 of virtual storage system 1600.
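The write path implied by FIG. 15 and FIG. 16, synchronous staging in two virtual storage systems followed by a deferred, deduplicated update of a shared object store, might be pictured with the following Python sketch. The interfaces (stage, exists, put) and the in-memory stores are illustrative assumptions, not the patented implementation.

```python
import queue
import threading

class DictStore:
    """Trivial in-memory stand-in for a zone's staging layer or the shared object store."""
    def __init__(self):
        self.data = {}
    def stage(self, key, value):
        self.data[key] = value
    def exists(self, key):
        return key in self.data
    def put(self, key, value):
        self.data[key] = value

class SyncReplicatedWriter:
    """Acknowledge writes after both zones stage them; flush to shared object storage later."""
    def __init__(self, zone1_staging, zone2_staging, shared_object_store):
        self.zones = [zone1_staging, zone2_staging]
        self.object_store = shared_object_store
        self.pending = queue.Queue()
        threading.Thread(target=self._flusher, daemon=True).start()

    def write(self, key, data):
        # Synchronous part: both availability zones must stage the update durably.
        for zone in self.zones:
            zone.stage(key, data)
        self.pending.put((key, data))      # asynchronous part is deferred
        return "ack"                       # the host sees completion here

    def _flusher(self):
        while True:
            key, data = self.pending.get()
            # Only one copy is written to the shared store, so content is not duplicated there.
            if not self.object_store.exists(key):
                self.object_store.put(key, data)

zone1, zone2, shared = DictStore(), DictStore(), DictStore()
writer = SyncReplicatedWriter(zone1, zone2, shared)
writer.write("block-7", b"payload")        # acknowledged before the shared object store is updated
```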
In each of these example virtual storage systems, the virtual storage system logic may generally incorporate distributed programming concepts to implement the core logic of the virtual storage system. In other words, when applied to a virtual storage system, the virtual storage system logic may be distributed across scale-out implementations of virtual storage system controllers, implementations with combined virtual system controllers and virtual drive servers, and implementations that split or otherwise optimize processing between the virtual storage system controllers and the virtual drive servers.
For further explanation, FIG. 17 sets forth a flow chart illustrating an example method of data flow within virtual storage system 1700. The example method depicted in fig. 17 may be implemented on any of the virtual storage systems described above with reference to fig. 12-16. In other words, virtual storage system 1700 may be implemented by any of virtual storage systems 1200, 1300, 1400, 1500, or 1600.
As depicted in fig. 17, an example method includes: receiving (1702), by virtual storage system 1700, a request to write data to virtual storage system 1700; storing (1704) the data 1754 in hierarchical memory provided by one or more virtual drives of the virtual storage system 1700; and migrating (1706) at least a portion of the data stored within the hierarchical memory from the hierarchical memory to a more durable data storage device provided by the cloud service provider.
The request to write data to virtual storage system 1700 received (1702) by virtual storage system 1700 may be performed as described above with reference to fig. 4-16, wherein the data may be contained within one or more received storage operations 1752, and the request may be received using one or more communication protocols or one or more API calls provided by cloud computing environment 402 hosting virtual storage system 1700.
Storing (1704) data 1754 within hierarchical memory provided by one or more virtual drives of virtual storage system 1700 may be performed as described above with reference to virtual storage systems 1200-1600, wherein a virtual storage system (e.g., virtual storage system 1200) receives data from client host 1260 at virtual controllers 408, 410, and wherein virtual controllers 408, 410 store the data within the hierarchical memory layered on the local storage of virtual drives 1210-1216. The hierarchical memory provided by the virtual drives is described in more detail above with reference to fig. 12.
Migrating (1706) at least a portion of the data stored within the hierarchical memory from the hierarchical memory to a more durable data storage provided by the cloud service provider may be effectuated as described above with reference to fig. 4-16, wherein the data is migrated from the hierarchical memory to the cloud-based object storage.
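A minimal sketch of the three recited steps, using hypothetical in-memory dictionaries as stand-ins for the virtual drives' hierarchical (staging) memory and the cloud provider's more durable object storage, may help fix the flow of FIG. 17:

```python
class VirtualStorageSystem:
    def __init__(self):
        self.staging = {}        # hierarchical (staging) memory across virtual drives
        self.object_store = {}   # more durable storage provided by the cloud service provider

    def write(self, key, data):                 # step 1702: receive the write request
        self.staging[key] = data                # step 1704: store into hierarchical memory
        return "ack"

    def migrate(self, keys=None):               # step 1706: migrate staged data to durable storage
        for key in list(keys or self.staging):
            self.object_store[key] = self.staging.pop(key)

vss = VirtualStorageSystem()
vss.write("block-0", b"payload")
vss.migrate()
assert "block-0" in vss.object_store and "block-0" not in vss.staging
```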
Additional examples of receiving data and storing the data within a hierarchical memory, and then migrating the data from the hierarchical memory to a more durable storage device, are described in co-pending patent application No. 16/524,861, which is incorporated herein in its entirety for all purposes. In particular, all migration techniques described in co-pending patent application Ser. No. 16/524,861 describe storing data within a hierarchical memory (also referred to as a first storage tier) and optionally processing, modifying, or optimizing data within the hierarchical memory prior to migrating the hierarchical memory data to a more durable memory or cloud-based object storage device based on migration events.
For further explanation, FIG. 18 sets forth a flow chart illustrating an example method of data flow within virtual storage system 1700. The example method depicted in fig. 18 may be implemented by any of the virtual storage systems described above with reference to fig. 4-16. In other words, virtual storage system 1700 may be implemented by at least virtual storage system 1200, 1300, 1400, 1500, 1502, 1504, or 1600, individually, or through a combination of individual features.
One implementation of data flow through the storage tiers of a virtual storage system is described above with respect to the example of FIG. 17, and more specifically, the flow of data from hierarchical memory to more durable object storage. More generally, however, data flow through a virtual storage system may occur between any pair of the multiple different storage tiers. In particular, in this example, the different storage tiers may be: (1) virtual controller storage; (2) hierarchical memory for transaction consistency and fast completion; (3) storage within a virtual drive provided by a virtual drive server; (4) virtual drive server local instance storage; and (5) an object store provided by the cloud service provider.
As depicted in fig. 18, an example method includes: receiving (1802) by virtual storage system 1700 a request to write data to virtual storage system 1700; storing (1804) data 1854 in a storage device provided by a first storage tier of virtual storage system 1700; and migrating (1806) at least a portion of the data stored within the first storage tier from the first storage tier to the second storage tier.
The request to write data 1854 to virtual storage system 1700 received (1802) by virtual storage system 1700 may be carried out as described above with reference to fig. 4-17, wherein the data may be contained within one or more received storage operations 1852 from a host computer or application, and the request may be received using one or more communication protocols or one or more API calls provided by cloud computing environment 402 hosting virtual storage system 1700.
Storing (1804) data 1854 within storage provided by a first storage tier of virtual storage system 1700 may be performed as described above with reference to fig. 4-17, wherein one or more virtual controllers may be configured to receive and handle storage operations 1852, including processing write requests and storing the corresponding write data into one or more storage tiers of virtual storage system 1700. The five example storage tiers of the virtual storage system are listed at the beginning of the description of FIG. 18 above.
Migrating (1806) at least a portion of data stored within the first storage tier from the first storage tier to the second storage tier may be performed as described above with respect to the movement of data through the respective storage tiers. Further, in some examples, as described above, as data flows from one or more virtual controllers through virtual storage system 1700 into back-end storage (including one or more of the object storage devices and any of the storage class options described below), the data may be transformed in various ways including deduplication, overwriting, aggregation into segments, and other transformations, resulting in recovery metadata or continuous data protection metadata.
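The five storage tiers listed for FIG. 18 can be pictured as an ordered set of stores with a migrate operation between any pair of them; the following Python sketch (the tier names and the deduplication step are illustrative assumptions) also shows a simple content-hash transformation applied as data reaches back-end object storage:

```python
import hashlib

TIERS = [
    "virtual_controller_storage",
    "hierarchical_memory",
    "virtual_drive_storage",
    "local_instance_storage",
    "object_storage",
]

class TieredStore:
    def __init__(self):
        self.tiers = {tier: {} for tier in TIERS}
        self.seen_hashes = set()                 # used for simple deduplication at the last hop

    def put(self, key, data, tier=TIERS[0]):
        self.tiers[tier][key] = data

    def migrate(self, key, src, dst):
        """Move one segment between any pair of tiers, transforming it on the way to object storage."""
        data = self.tiers[src].pop(key)
        if dst == "object_storage":
            digest = hashlib.sha256(data).hexdigest()
            if digest in self.seen_hashes:       # duplicate content: keep only a reference
                self.tiers[dst][key] = ("ref", digest)
                return
            self.seen_hashes.add(digest)
            self.tiers[dst][key] = ("segment", digest, data)
        else:
            self.tiers[dst][key] = data

store = TieredStore()
store.put("seg-1", b"abc", tier="hierarchical_memory")
store.migrate("seg-1", "hierarchical_memory", "object_storage")
```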
The virtual storage system may dynamically adjust cloud platform resource usage based on the cloud platform pricing structure in response to changes in cost requirements, as described in more detail below.
Under various conditions, budget, capacity, usage, and/or performance requirements may change, and cost predictions and various cost calculation scenarios may be presented to a user; these may include modeling increases in the number of servers or storage components, the available types of components, the platforms that can provide suitable components, and/or how alternatives to the current setup might perform in the future, along with their calculated costs. In some examples, such cost predictions may include the cost of migration between alternatives, where migration often involves management overhead, where network transfers incur costs, and where additional total capacity may be required for the duration of the data transfer between the several types of storage devices or providers until the necessary services are fully operational.
Further, in some implementations, instead of pricing what is used and providing configuration options based on their potential costs, the user may instead provide a budget or otherwise specify a cost threshold, and the storage system service may generate a virtual storage system configuration with specified resource usage such that the storage system service operates within the budget or cost threshold.
Continuing with this example of a storage system service operating within a budget or cost threshold, and given that computing resources ultimately limit performance, costs may be managed by modifying the configuration of virtual application servers, virtual storage system controllers, and other virtual storage system components, adding, removing, or replacing them with faster or slower virtual storage system components. In some examples, if the cost or budget is considered over a given length of time (e.g., monthly, quarterly, or yearly billing), then more computing resources may be made available in response to an increase in workload by using the cost savings obtained from reducing virtual computing resources in response to a decrease in workload.
Further, in some examples, in response to determining that given workloads may be executed at flexible times, those workloads may be scheduled to execute during periods of time when the computing resources for operating or starting the virtual storage system are less expensive. In some examples, costs and usage may be monitored during the billing period to determine whether earlier usage in the billing period may affect the ability to run at an expected or acceptable performance level later in the billing period, or whether lower-than-expected usage during portions of the billing period suggests that there is sufficient budget remaining to run optional work or that renegotiating terms would reduce costs.
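One way such budget tracking might be expressed is sketched below: a hypothetical helper compares spend so far in a billing period against a linear budget and recommends how many virtual controller instances the remaining budget supports. All prices, thresholds, and parameter names are assumptions for illustration.

```python
def recommend_instances(budget, spent_so_far, days_elapsed, days_in_period,
                        current_instances, cost_per_instance_day):
    """Return how many virtual controller instances the remaining budget supports for the period."""
    remaining_budget = budget - spent_so_far
    remaining_days = max(days_in_period - days_elapsed, 1)
    affordable = int(remaining_budget // (cost_per_instance_day * remaining_days))
    # Never scale below one instance; scale up only if spend is under the linear budget so far.
    expected_spend = budget * days_elapsed / days_in_period
    if spent_so_far > expected_spend:
        return max(1, min(current_instances, affordable))
    return max(1, affordable)

# Example: $900 monthly budget, $200 spent after 10 of 30 days, $2 per instance-day.
print(recommend_instances(900, 200, 10, 30, current_instances=8, cost_per_instance_day=2))  # -> 17
```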
Continuing with this example, this model of dynamically adjusting the virtual storage system in response to cost or resource constraints may be extended from computing resources to also include storage resources. However, a different consideration for storage resources is that their costs are less elastic than those of computing resources, as stored data continues to occupy storage resources for as long as it is retained.
Further, in some examples, there may be transfer costs associated with migrating data between storage services having different capacities and transfer prices within the cloud platform. Each of these costs of maintaining virtual storage system resources must be considered and can serve as a basis for configuring, deploying, and modifying computing and/or storage resources within the virtual storage system.
In some cases, the virtual storage system may be adjusted in response to predicted storage costs, which may include comparing the ongoing storage cost of using existing resources to the combination of the transfer cost of moving the stored content and the storage cost of less expensive storage resources (e.g., storage provided by different cloud platforms, or storage on storage hardware in a customer-managed data center or on customer-managed hardware maintained in a colocated shared-management data center). In this way, the budget-constrained virtual storage system model can be adjusted in response to different cost or budget constraints or requirements over a time span long enough to support the data transfer, and in some cases based on a predictable usage pattern.
In some implementations, as capacity grows in response to the accumulation of stored data, and as workloads fluctuate around some average or trend line over a period of time, a dynamically configurable virtual storage system may calculate whether transferring an amount of data to some cheaper storage class or cheaper storage location is likely to keep costs within a given budget or a given budget change. In some examples, the virtual storage system may evaluate such a storage transfer based on costs over a period of time spanning one or more billing cycles, and in this way prevent exceeding a budget or cost in a subsequent billing cycle.
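The comparison described above, the sustained cost of keeping data where it is versus the one-time transfer cost plus the cheaper class's ongoing cost, can be sketched as a break-even calculation; the prices below are purely illustrative.

```python
def months_to_break_even(gib, current_price, target_price, transfer_price_per_gib):
    """Months after which moving `gib` of data to a cheaper storage class pays back its transfer cost."""
    monthly_saving = gib * (current_price - target_price)
    if monthly_saving <= 0:
        return None                      # the move never pays off
    return (gib * transfer_price_per_gib) / monthly_saving

# 50 TiB at $0.023/GiB-month moved to a $0.004/GiB-month class with a $0.02/GiB transfer fee.
horizon = months_to_break_even(50 * 1024, 0.023, 0.004, 0.02)
print(round(horizon, 2), "months")       # ~1.05 months; a shorter planning horizon would not recoup it
```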
In some embodiments, cost-managed or cost-constrained virtual storage systems, in other words, virtual storage systems that reconfigure themselves in response to cost constraints or other resource constraints, may also use the write-mostly, archive, or deep archive storage classes available from cloud infrastructure providers. Moreover, in some cases, the virtual storage system may operate in accordance with the models and constraints described elsewhere for implementing a storage system that works with storage classes of different behaviors.
For example, the virtual storage system may merge data determined to have a low access likelihood, e.g., merging segments of data having similar access patterns or similar access-likelihood characteristics, and may automatically use the write-mostly storage class for such data upon determining that cost or budget may be saved and reused for other purposes.
Further, in some cases, the merged data segments may then be migrated to a write-mostly storage class or other lower-cost storage class. In some instances, using local instance storage on virtual drives may yield a cost reduction, which allows virtual storage system resource adjustments that in turn reduce costs to meet cost or budget constraints. In some cases, the local instance storage may use a write-mostly object store as its back end, and since read loads are typically borne entirely by the local instance storage, the local instance storage may operate primarily as a cache rather than storing a complete copy of the current data set.
In some examples, if an identifiable data set does not need to survive the loss of an availability zone, a single-availability-zone durable store may also be used, and this use may serve as a cost-effective basis for dynamically reconfiguring the virtual storage system. In some cases, using a single availability zone for a data set may involve explicit specification for the data set or indirect specification through some storage policy.
Furthermore, the designation or storage policy may also include an association with a particular availability zone; however, in some cases, the particular availability zone may be determined by the data set's association with a host system that accesses the virtual storage system from within that availability zone, for example. In other words, in this example, the particular availability zone may be determined to be the same availability zone as that of the host system.
In some implementations, if the virtual storage system is capable of providing or meeting performance requirements while storage operations are limited by the constraints of the archive and/or deep archive storage classes, the virtual storage system may be dynamically reconfigured to use the archive or deep archive storage class. Further, in some cases, old snapshots, continuous data protection datasets, or other datasets that are no longer active may be transferred to the archive storage class based on a storage policy that specifies data transfer in response to a particular activity level, or based on a storage policy that specifies data transfer for data that has not been accessed within a specified period of time. In other examples, the virtual storage system may transfer data to the archive storage class in response to a particular user request.
Further, given that retrieval from an archive storage class may take minutes, hours, or days, the virtual storage system may require a user of a particular data set stored in an archive or deep archive storage class to provide explicit approval of the time required to retrieve the data set. In some instances, where a deep archive storage class is used, there may also be a limit on how frequently data access is allowed, which may impose additional constraints on the cases in which a data set may be stored in the archive or deep archive storage class.
Implementing a virtual storage system to work with storage classes of different behavior may be performed using a variety of techniques, as described in more detail below.
In various implementations, for some types of storage, such as write-mostly storage classes, storing and maintaining data may be less expensive than accessing and retrieving it. In some examples, if data can be identified or determined to be rarely retrieved, or retrieved below a specified threshold frequency, cost can be reduced by storing that data in a write-mostly storage class. In some cases, this write-mostly storage class may become an additional storage tier that may be used by virtual storage systems with access to one or more cloud infrastructures that provide such a storage class.
For example, a storage policy may specify that a write-mostly storage class or other archive storage class may be used to store segments of data from a snapshot, checkpoint, or historical continuous data protection dataset that have been overwritten or deleted from the most recent instance of the dataset being tracked. Further, in some cases, the segments may qualify because they have not been accessed within a time limit, wherein the time limit may be specified in a storage policy and corresponds to a low likelihood of retrieval, except in cases such as inadvertent deletion or corruption that requires access to older historical copies of the data set, forensic investigation of faults or larger-scale disasters, criminal events, administrative errors such as inadvertent deletion of recent data, or the encryption or deletion of part or all of the data set and of its recent snapshots, clones, or continuous data protection tracking images as part of a ransomware attack.
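A policy of this kind might be evaluated roughly as follows; the segment metadata fields (last access time and whether the segment is still referenced by the live data set) are hypothetical, and the sketch only selects candidates for the write-mostly storage class rather than performing the transfer.

```python
import time

def select_for_write_mostly(segments, idle_limit_seconds, now=None):
    """Pick segments whose data has been overwritten/deleted from the live data set and
    that have not been read within the policy's time limit."""
    now = now or time.time()
    candidates = []
    for seg in segments:
        only_historical = seg["referenced_by_live_dataset"] is False
        idle = (now - seg["last_access"]) > idle_limit_seconds
        if only_historical and idle:
            candidates.append(seg["id"])
    return candidates

segments = [
    {"id": "seg-1", "referenced_by_live_dataset": False, "last_access": time.time() - 90 * 86400},
    {"id": "seg-2", "referenced_by_live_dataset": True,  "last_access": time.time() - 90 * 86400},
    {"id": "seg-3", "referenced_by_live_dataset": False, "last_access": time.time() - 3600},
]
print(select_for_write_mostly(segments, idle_limit_seconds=30 * 86400))   # ['seg-1']
```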
In some embodiments, using a cloud platform's write-mostly storage class may save costs, which in turn may be used to provision computing resources to improve the performance of the virtual storage system. In some examples, if the virtual storage system tracks and maintains storage access information, such as data age, and uses snapshot/clone/continuous data protection aware garbage collection or segment merging and/or migration algorithms, the virtual storage system may use its segment model to establish valid metadata references while minimizing the amount of data transferred to the write-mostly storage class.
Further, in some embodiments, virtual storage systems that integrate snapshot, clone, or continuous data protection tracking information may also reduce the amount of data that must be read back from the write-mostly store, because data that has not been overwritten or deleted since the snapshot, clone, or continuous data protection recovery point was written to the write-mostly store can instead be served from cheaper storage classes that still hold it (e.g., local instance storage on virtual drives, or objects stored in the cloud platform's standard storage class). Furthermore, in some instances, data retrieved from a write-mostly storage class may be written to some other storage class, such as a virtual drive's local instance storage, for further use, in some cases avoiding being charged for retrieving it again.
In some implementations, additional levels of recoverable content may be provided based on the methods and techniques described above for recovering from the loss of hierarchical memory content, where these additional levels of recoverable content may be used to provide recoverability back to some past point of consistency entirely from the data stored in these secondary stores, including objects stored in these other storage classes.
Furthermore, in this example, the recoverability may be based on using information maintained entirely within that storage class to record what is needed to roll back to some consistent point (e.g., a snapshot or checkpoint). In some examples, this implementation may be based on a storage class that includes a complete past image of the data set (rather than just the data that has been overwritten or deleted, where overwriting or deleting removes data from the most recent content of the data set). While this example implementation may increase costs, in return the virtual storage system may provide valuable services, such as recovering from a ransomware attack, where protection against ransomware attacks may be based on requiring additional permissions or access levels before objects stored in a given storage class can be deleted or overwritten.
In some implementations, virtual storage systems may use archive storage classes and/or deep archive storage classes, in addition to or instead of write-mostly storage classes, for content that is even less likely to be accessed than content in write-mostly storage classes, or that is needed only for rare disaster events, but for which the ability to retrieve the content is worth the high cost. Examples of such low-access content may include, for example, historical versions, snapshots, or clones of data sets that may be needed in rare instances, such as the discovery phase in litigation or some other similar event, particularly where another party may be expected to pay for retrieval.
However, as described above, maintaining a historical version, snapshot, or clone of a data set in case of a ransomware attack may be another example. In some instances, such as litigation, and to reduce the amount of data stored, the virtual storage system may store only previous versions of data within a data set that has been overwritten or deleted. In other instances, such as ransomware or disaster recovery, as described above, the virtual storage system may store complete data sets in an archive or deep archive storage class, including storing any data needed to recover consistent data sets from at least several different points in time, in addition to storage controls that eliminate the possibility of unauthorized deletion or overwriting of objects stored in a given archive or deep archive storage class.
In some implementations, the differences in how the virtual storage system uses (a) objects stored in write-mostly storage classes and (b) objects stored in archive or deep archive storage classes may include how snapshots, clones, or continuous data protection checkpoints stored in a given storage class are accessed. In the case of a write-mostly storage class, objects may be retrieved with similar or possibly the same latency as objects stored in the standard storage class provided by the virtual storage system's cloud platform, although retrieval costs from the write-mostly storage class may be higher than from the standard storage class.
In some examples, the virtual storage system may implement the use of write-mostly storage classes as a minor variant of its regular model for accessing content corresponding to segments that are currently available only from objects in the standard storage class. In particular, in this example, the data may be retrieved when some operation reads it, for example a read at a logical offset of a snapshot of a tracked volume. In some cases, the virtual storage system may ask the user to agree to pay a premium for any such retrieval when requesting access to a snapshot or other type of storage image, and the retrieved data may be stored in local instance storage associated with a virtual drive or copied (or converted) into an object in the standard storage class, to avoid continuing to pay the higher retrieval fees of the other storage classes.
In some implementations, the latency or procedure associated with retrieving objects from an archive or deep archive storage class may make such an implementation impractical, in contrast to the negligible latency of the write-mostly storage class discussed above. In some cases, if it takes hours or days to retrieve an object from an archive or deep archive storage class, an alternative procedure may be implemented. For example, a user may request access to a snapshot for which at least some segments are known to be stored in an archive or deep archive storage class, and in response, instead of reading any such segments on demand, the virtual storage system may determine the list of segments that comprise the requested data set (or snapshot, clone, or continuous data protection recovery point) and that are stored in the archive or deep archive storage.
In this way, in this example, the virtual storage system may request that the segments in the determined segment list be retrieved so that they can be copied into objects in, for example, the standard storage class, or onto a virtual drive to be stored in local instance storage. In this example, retrieval of the segment list may take hours or days, but from a performance and cost standpoint it is preferable to request the entire segment list at once rather than making individual requests as needed. To conclude this example, after the segments have been retrieved from archive or deep archive storage, access to the retrieved snapshot, clone, or continuous data protection recovery point may then be provided.
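Where the underlying object store is, for example, Amazon S3, the batch retrieval described above could look roughly like the following boto3 sketch: one restore request is issued per segment object up front, and the segments are copied back into the standard storage class once their restores complete. The bucket name, key list, and polling interval are assumptions for illustration.

```python
import time

import boto3

s3 = boto3.client("s3")
BUCKET = "virtual-storage-system-segments"       # hypothetical bucket name

def restore_segments(segment_keys, days=7, tier="Bulk"):
    """Request archive retrieval for every segment in the determined list at once."""
    for key in segment_keys:
        s3.restore_object(
            Bucket=BUCKET, Key=key,
            RestoreRequest={"Days": days, "GlacierJobParameters": {"Tier": tier}},
        )

def wait_and_copy_to_standard(segment_keys, poll_seconds=3600):
    """Poll until each restore finishes, then copy the object back into the standard storage class."""
    remaining = set(segment_keys)
    while remaining:
        for key in list(remaining):
            head = s3.head_object(Bucket=BUCKET, Key=key)
            if 'ongoing-request="false"' in head.get("Restore", ""):
                s3.copy_object(Bucket=BUCKET, Key=key,
                               CopySource={"Bucket": BUCKET, "Key": key},
                               StorageClass="STANDARD")
                remaining.discard(key)
        if remaining:
            time.sleep(poll_seconds)             # deep archive restores can take hours or days
```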
Readers will appreciate that while the embodiments described above relate to embodiments in which data that was stored in a portion of the block storage of a cloud-based storage system that has become unavailable is brought back into the block storage layer of the cloud-based storage system by retrieving data from the object storage layer of the cloud-based storage system, other embodiments are within the scope of the present disclosure. For example, because data may be distributed across the local storage of multiple cloud computing instances using data redundancy techniques such as RAID, in some embodiments lost data may be brought back into the block storage layer of the cloud-based storage system through a RAID rebuild.
Readers will further appreciate that while the preceding paragraphs describe a cloud-based storage system and its operation, the cloud-based storage system described above may be used to provide block storage as a service, as the cloud-based storage system may be spun up and used to provide block services in an on-demand, as-needed manner. In this example, providing block storage as a service in a cloud computing environment may include: receiving a request for the block storage service from a user; creating a volume for use by the user; receiving I/O operations directed to the volume; and forwarding the I/O operations to a storage system that is co-located with the hardware resources for the cloud computing environment.
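A minimal sketch of that block-storage-as-a-service flow, with hypothetical interfaces standing in for the co-located storage system, is shown below:

```python
import uuid

class BlockStorageService:
    def __init__(self, colocated_storage_system):
        self.backend = colocated_storage_system          # system co-located with the cloud hardware
        self.volumes = {}

    def create_volume(self, user, size_gib):
        """Handle a block storage service request by creating a volume for the user."""
        volume_id = str(uuid.uuid4())
        self.volumes[volume_id] = {"user": user, "size_gib": size_gib}
        self.backend.provision(volume_id, size_gib)
        return volume_id

    def io(self, volume_id, operation):
        """Receive an I/O operation directed to a volume and forward it to the co-located system."""
        if volume_id not in self.volumes:
            raise KeyError("unknown volume")
        return self.backend.forward(volume_id, operation)

class FakeBackend:                                        # stand-in for the co-located storage system
    def provision(self, volume_id, size_gib):
        pass
    def forward(self, volume_id, operation):
        return "done"

svc = BlockStorageService(FakeBackend())
vol = svc.create_volume("user-a", size_gib=100)
print(svc.io(vol, {"op": "write", "lba": 0, "data": b"..."}))   # -> done
```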
For further explanation, FIG. 19 sets forth an example virtual storage system 1900 architecture according to some embodiments. The virtual storage system architecture may include virtual components and architecture similar to the cloud-based storage systems described above with reference to fig. 4-18. However, the virtual storage system 1900 architecture depicted in FIG. 19 is a locally deployed virtual storage system that is deployed in a virtual environment 1902 supported by locally deployed physical storage resources. Here, "locally deployed" refers to physical storage resources owned or leased by an enterprise or organization and located in a private data center, rather than cloud-based storage resources provided by a cloud service provider in a public cloud infrastructure. While the locally deployed virtual storage system differs from a cloud-based virtual storage system in that the configuration of the underlying physical storage resources may be serviced and managed by enterprise personnel, the virtual environment 1902 itself may be a cloud computing environment, such as a private cloud platform that presents an abstraction of the locally deployed physical resources. Thus, the management and configuration of the storage services provided by the locally deployed virtual storage system 1900 may be decoupled from the management and configuration of the physical locally deployed resources hosting the virtual storage system 1900, allowing the locally deployed virtual storage system to be managed in the same manner and using the same interfaces as if it were provisioned on resources provided by a cloud service provider. As will be explained in greater detail below, hosting the virtual environment 1902 on locally deployed resources allows virtual components of the virtual storage system 1900 to be copied or reconfigured into a cloud computing environment (or vice versa), e.g., to facilitate scale-out of the virtual storage system, migration of the virtual storage system, and movement of virtual storage system data sets between the locally deployed virtual storage system and a cloud-based virtual storage system.
In the example depicted in fig. 19, virtual storage system 1900 includes one or more virtual controllers implemented in one or more computing instances, where the computing instances can be executed or run as virtual machines flexibly assigned to locally deployed physical host servers. Like virtual controllers 408, 410, the virtual controllers may accept or receive I/O operations and/or configuration requests from client hosts 1260, 1262 (possibly through an intermediary server, not depicted) or from a management interface or tool, and then ensure that the I/O requests and other operations run to completion. In some examples, the virtual controller may present a file system, a block-based volume, an object store, and/or some kind of bulk-storage database or key/value store, and may provide data services such as snapshots, replication, migration services, provisioning, host connectivity management, deduplication, compression, encryption, secure sharing, and other such storage system services.
In the example depicted in fig. 19, two virtual controllers are depicted as a storage controller application 1908 running within computing instance 1904 and a storage controller application 1910 running within computing instance 1906, respectively. The computing instances 1904, 1906 may execute on virtual machines within the virtual environment 1902 hosted on locally deployed physical resources. For example, multiple computing instances running a storage controller application may be hosted on different servers within one or more data centers, such that in the event of a server failure, the storage controller application in a computing instance hosted on a different server may continue to service storage operations directed to the virtual storage system.
In the example depicted in fig. 19, virtual storage system 1900 includes one or more virtual drives 1910-1916 implemented in one or more computing instances that can be executed or run as virtual machines flexibly allocated to locally deployed physical host servers. Similar to virtual drives 1210-1216, virtual drives 1910-1916 provide persistent storage (e.g., block-level storage, object storage) to the virtual controllers (e.g., storage controller applications 1908, 1910). In some implementations, the hierarchical memory may be implemented by one or more virtual drives 1910-1916, where the one or more virtual drives 1910-1916 store data, for example, within local storage 1920-1926. In some examples, local storage 1920-1926 may be one or more SSDs of the respective locally deployed physical hosts on which the computing instances implementing the virtual drives run, or other forms of storage, such as one or more direct flash modules. In some implementations, the contents of the local storage 1920-1926 of one or more virtual drives 1910-1916 can be replicated or mirrored across multiple virtual drives for data recovery and high availability of data. In additional implementations, data in the local storage 1920-1926 of one or more virtual drives 1910-1916 may be striped across multiple virtual drives 1910-1916 in a RAID configuration.
In some implementations, hierarchical memory implemented by one or more virtual drives (e.g., virtual drives 1910 and 1916) can store data within respective block storage volumes 1940 and 1946. Readers will appreciate that while the remaining description of FIG. 19 relates to an embodiment in which virtual drives 1910 and 1916 store data within block storage volumes 1940 and 1946, this description is included merely for ease of explanation and does not represent a limitation on the types of storage (e.g., block storage, object storage, file storage) that may be provided by virtual drives 1910-1916. Readers will appreciate that a virtual drive may or may not contain a block storage volume. In some implementations, block storage volumes 1940 and 1946 may be block storage volumes in one or more locally deployed physical storage systems. The physical storage systems may operate as described above. For example, the physical storage systems may implement synchronous replication such that one or more of block storage volumes 1940 and 1946 may be synchronously replicated across multiple physical storage systems. In some implementations, the locations and deployment of block storage volumes 1940 and 1946 within the locally deployed resources are not visible to an administrator of a host application or of the storage services provided by the virtual storage system, such that block storage volumes 1940 and 1946 may behave like cloud-based block storage volumes (e.g., Amazon EBS volumes). A block storage volume may be attached to one virtual drive and then to another, one after the other (as depicted in fig. 12). In some implementations, the block storage volume may be a cloud-based block storage volume (e.g., an AWS EBS volume) provided by a cloud service provider.
In the example depicted in fig. 19, virtual drives 1910-1916 are coupled to an object store that provides back-end durable object storage, such as cloud-based object storage 432. As illustrated in fig. 19, committing data to cloud-based object storage 432 may be managed by virtual drives 1910-1916. In some implementations, the software daemons 1230-1236, or some other module of computer program instructions executing on the virtual drive instances 1910-1916, can be configured to write data not only to their own local storage 1920-1926 resources and any appropriate block storage 1940 and 1946 provided by the virtual computing environment 1902, but also to the cloud-based object storage 432 attached to the particular virtual drive 1910-1916. For example, data written to the storage resources of the locally hosted virtual drives 1910-1916 may be automatically replicated to the cloud-based object storage, as previously discussed.
Readers will appreciate that the locally deployed virtual storage system 1900 constructed using the architecture set forth above allows a host application or administrator to treat the locally deployed virtual storage system 1900 as if it were a cloud-based virtual storage system, such that the virtual storage system 1900 allows users to build storage resources from multiple storage tiers based on performance and durability characteristics while remaining agnostic to the configuration of the locally deployed physical resources supporting the virtual storage system. Readers will also appreciate that the locally deployed virtual storage system 1900 may provide a set of storage services and interfaces that are similar, if not identical, to those of cloud-based virtual storage systems, thereby facilitating interoperability between locally deployed storage resources and cloud-native applications. For example, the locally deployed virtual storage system 1900 provides the same set of virtual controllers, drive instances, block-level storage services, object storage services, and interfaces as provided by the cloud-based virtual storage systems depicted in fig. 4-18. In one example, the same APIs used to construct the locally deployed virtual storage system 1900 may be used to construct the cloud-based virtual storage systems depicted in FIGS. 4-18. Readers will also appreciate that the locally deployed virtual storage system 1900 can be readily scaled out into, or migrated to and from, a cloud computing environment, e.g., according to a cost model. For example, the virtual storage system service may launch instances of virtual controllers and/or instances of virtual drives in the cloud computing environment and connect those instances to the locally deployed virtual storage system 1900.
In some embodiments, the locally deployed virtual storage system 1900 may be provided to a customer as a "cloud in a box" containing the virtual environment, hardware infrastructure, and storage resources for hosting the locally deployed virtual storage system 1900. In this example, the locally deployed virtual storage system 1900 may include VM templates to create the virtual machines that host the virtual controllers and virtual drives. As such, the locally deployed virtual storage system 1900 may contain pre-installed storage controller applications that are compatible with the storage controller applications used to manage other locally deployed physical resources (e.g., NFS or storage arrays). By implementing a storage controller application that may be hosted on a cloud-based virtual storage system or a locally deployed virtual storage system and that is compatible with the storage controller application for physical storage resources, a unified data experience may be provided to a customer. Further, by providing a locally deployed virtual storage system that utilizes the customer's locally deployed physical resources, the customer may allow its personnel to configure the virtual storage system as if it were a cloud-based storage system (e.g., by setting up quotas, creating volumes and other storage components, monitoring performance, defining access controls, applying policies), while leaving management of the physical environment (e.g., provisioning the virtual storage system, moving the virtual storage system across physical infrastructure, load balancing, replication policies) to the customer's or the provider's technicians.
For further explanation, FIG. 20 sets forth an example virtual storage system 2000 architecture according to some embodiments. The virtual storage system architecture may include virtual components similar to the cloud-based virtual storage system and the locally deployed virtual storage system described above with reference to fig. 4-19.
In this embodiment, virtual storage system 2000 includes an instance of a locally deployed virtual storage system 2002 and an instance of a cloud-based virtual storage system 2004. In some instances, virtual storage system 2000 is constructed by reconfiguring the locally deployed virtual storage system 2002 in cloud computing environment 402 to create the cloud-based virtual storage system 2004, e.g., as part of a scale-out operation or a migration of a virtual storage system data set to cloud computing environment 402. In some instances, the virtual storage system 2000 is constructed by reconfiguring the cloud-based virtual storage system 2004 in virtual computing environment 1902 to create the locally deployed virtual storage system 2002, e.g., to reduce latency by moving the virtual storage system closer to physical storage resources deployed locally in a data center. In some examples, the locally deployed virtual storage system 2002 and the cloud-based virtual storage system 2004 may be configured to synchronously replicate data between the two virtual storage systems so that the presented virtual storage system 2000 may continue to function and provide its services even in the event of a loss of data or availability in either virtual storage system instance. While in the example depicted in fig. 20 the locally deployed virtual storage system 2002 and the cloud-based virtual storage system 2004 share the cloud-based object storage 432 as durable back-end storage, it should also be appreciated that in some implementations the locally deployed virtual storage system 2002 and the cloud-based virtual storage system 2004 may be attached to respective object stores or to respective buckets in an object store.
Consider an example in which a data set or a portion thereof is migrated from the locally deployed virtual storage system 2002 to the cloud-based virtual storage system 2004, e.g., in response to a user request or the detection of a failure. The virtual storage system logic may launch instances of the virtual controllers and virtual drives of the locally deployed virtual storage system 2002 in cloud computing instances of the cloud computing environment (e.g., by implementing the virtual controllers in AWS EC2 instances and the virtual drives in AWS EC2 instances having local instance storage). The virtual storage system logic can then migrate data in the local storage and/or block storage volumes of the locally deployed virtual storage system 2002 to the local storage and block storage volumes of the cloud computing environment (e.g., by copying the data to AWS EC2 instances having local storage and attached EBS volumes). In the event of a failure in the locally deployed virtual storage system 2002, the local storage and the block storage volumes of the cloud-based virtual storage system 2004 may be restored with data from the shared cloud-based object storage. Further, the virtual storage system logic may apply the same connectivity, policies, and other configuration of the locally deployed virtual storage system 2002 to the cloud-based virtual storage system 2004. The process may be reversed, for example, by creating computing instances in the virtual environment 1902, migrating the virtual controllers and virtual drives from the cloud computing instances to the computing instances of the virtual environment 1902, and copying data from the local storage and block storage of the cloud-based virtual storage system 2004 to the locally deployed virtual storage system 2002. In some examples, the computing instances 1904, 1906 and the virtual drive instances 1910-1916 may be AWS EC2 instances hosted in the virtual environment 1902 on locally deployed physical resources.
In some examples, the locally deployed virtual storage system 2002 and the cloud-based virtual storage system 2004 may be configured to synchronously replicate data between the two virtual storage systems such that the presented virtual storage system 2000 may continue to function and provide its services even in the event of a loss of data or availability in either virtual storage system instance. This implementation may further share the use of durable objects, such that storing data into the object store is coordinated and the two virtual storage systems 2002, 2004 do not duplicate the stored content. Furthermore, in this implementation, the two synchronously replicating virtual storage systems 2002, 2004 can synchronously replicate updates to the hierarchical memory, and possibly to the local instance storage, to significantly reduce the chance of data loss, while coordinating updates to the object store as later asynchronous activity to significantly reduce the capacity cost of what is stored in the object store.
For further explanation, FIG. 21 sets forth a flow chart illustrating an example method of creating a virtual storage system 2100. The example method depicted in fig. 21 may be implemented on any of the virtual storage systems described above with reference to fig. 12-20. In other words, the virtual storage system 2100 may be implemented as any of virtual storage systems 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900, or 2000. Thus, the example method depicted in fig. 21 may include creating a cloud-based virtual storage system, a locally deployed virtual storage system, or a combination thereof.
In some embodiments, creating virtual storage system 2100 can be performed on a virtual platform 2130 (e.g., cloud computing environment 402 or virtual computing environment 1902). In some instances, creating the virtual storage system 2100 may be performed by the virtual storage system service 2110. Such a virtual storage system service may be defined as a service that can dynamically create virtual storage systems within and across multiple local and cloud platforms, and that can use the various virtual components available in various local or cloud-based platforms to provide storage services backed by multiple classes of storage, including local and cloud-based block storage, object storage, file system storage, and the other classes discussed above, thereby enabling such services to be presented to clients with a choice of local and cloud platforms and of the various optional storage classes.
As depicted in fig. 21, an example method includes instantiating (2102) one or more virtual storage controllers. Instantiating (2102) one or more virtual storage controllers may be carried out by creating any of the virtual storage controllers or storage controller applications discussed above with respect to the storage system architectures depicted in fig. 4-20. In some examples, instantiating (2102) one or more virtual storage controllers is carried out by creating one or more computing instances hosting a storage controller application. In one example, a virtual controller may be instantiated in cloud computing environment 402 using services provided by a cloud service provider by creating cloud computing instances 404, 406 to host storage controller applications 408, 410. In another example, a virtual controller can be instantiated in a virtual computing environment 1902 hosted on locally deployed physical resources by creating computing instances 1904, 1906 to host storage controller applications 408, 410. Instantiating (2102) one or more virtual storage controllers may be carried out via, for example, an API call to the virtual storage system service 2110.
The example method depicted in fig. 21 also includes instantiating (2104) one or more virtual storage devices each including a plurality of storage tiers. Instantiating (2104) one or more virtual storage devices each including multiple storage tiers may be carried out as discussed above with respect to the storage system architectures depicted in fig. 4-20. In some implementations, instantiating (2104) one or more virtual storage devices each including multiple storage tiers can be carried out by creating one or more computing instances hosting the virtual storage devices. In various examples, the virtual storage device may be a virtual drive, virtual server, or virtual blade including attached local storage, an attached block storage volume, and attached object storage (e.g., cloud-based object storage 432), as discussed above. In one example, virtual storage devices may be instantiated in cloud computing environment 402 using services provided by a cloud service provider by creating virtual drive cloud computing instances (e.g., virtual drives 1210-1216) or virtual blade cloud computing instances (e.g., virtual blades 1410-1416) with attached local storage (e.g., local storage 1220-1226) and cloud-based block storage (e.g., block storage volumes 1240-1246). In another example, a virtual storage device may be instantiated in a virtual environment 1902 hosted on locally deployed physical resources by creating a virtual drive (e.g., virtual drives 1910-1916) with attached local storage (e.g., local storage 1920-1926) and, in some implementations, an attached block storage volume (e.g., block storage volume 1940). In these examples, the virtual storage devices provide access to tiers and classes of storage that differ in terms of their bandwidth, capacity, durability, availability, and write frequency. In one example, instantiating (2104) one or more virtual storage devices each including multiple storage tiers may be carried out via, for example, an API call to the virtual storage system service 2110.
The example method depicted in fig. 21 also includes constructing (2106) a virtual storage system 2100 in which one or more virtual storage devices are coupled to each of the one or more virtual storage controllers. Constructing (2106) a virtual storage system 2100 in which one or more virtual storage devices are coupled to each of the one or more virtual storage controllers can be carried out by implementing any of the virtual storage system architectures described above. In some implementations, constructing (2106) a virtual storage system 2100 in which one or more virtual storage devices are coupled to each of the one or more virtual storage controllers is performed by attaching the virtual storage devices (e.g., virtual drives 1210-1216) to the virtual controllers (e.g., storage controller applications 408, 410) and presenting storage services and a namespace to a client host as if the virtual storage system were a physical storage system. In some implementations, constructing (2106) a virtual storage system 2100 in which one or more virtual storage devices are coupled to each of the one or more virtual storage controllers is further carried out by attaching a block storage volume to each virtual storage device. In some implementations, constructing (2106) a virtual storage system 2100 in which one or more virtual storage devices are coupled to each of the one or more virtual storage controllers is further carried out by attaching a cloud-based object store to each virtual storage device. In one example, constructing (2106) a virtual storage system 2100 in which one or more virtual storage devices are coupled to each of the one or more virtual storage controllers can be carried out via, for example, an API call to the virtual storage system service 2110.
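Steps 2102, 2104, and 2106 can be pictured as three calls against a virtual storage system service. The Python sketch below uses an entirely hypothetical client API (the disclosure does not define one) simply to show the ordering: controllers first, then tiered virtual storage devices, then coupling every device to every controller.

```python
class VirtualStorageSystemService:
    """Hypothetical stand-in for the service 2110 that fulfils the instantiation API calls."""
    def __init__(self, platform):
        self.platform = platform          # e.g., a cloud provider or a locally hosted virtual environment

    def create_controller(self, name):
        return {"kind": "controller", "name": name, "platform": self.platform}

    def create_virtual_drive(self, name, tiers):
        return {"kind": "drive", "name": name, "tiers": list(tiers)}

    def construct(self, controllers, drives):
        # Couple every virtual storage device to every virtual storage controller (step 2106).
        return {"controllers": controllers, "drives": drives,
                "paths": [(c["name"], d["name"]) for c in controllers for d in drives]}

svc = VirtualStorageSystemService(platform="cloud")
controllers = [svc.create_controller(n) for n in ("vc-408", "vc-410")]               # step 2102
drives = [svc.create_virtual_drive(n, ["local_instance", "block_volume", "object_store"])
          for n in ("vd-1210", "vd-1212")]                                           # step 2104
system = svc.construct(controllers, drives)                                          # step 2106
print(len(system["paths"]))   # 4: each virtual drive is reachable from each controller
```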
For further explanation, FIG. 22 sets forth a flow chart illustrating an additional example method of creating a virtual storage system according to some embodiments of the present disclosure. The example method depicted in fig. 22 is similar to the example method depicted in fig. 21 in that the example method depicted in fig. 22 also includes: instantiating (2102) one or more virtual storage controllers; instantiating (2104) one or more virtual storage devices each comprising a plurality of storage tiers; and constructing (2106) a virtual storage system in which one or more virtual storage devices are coupled to each of the one or more virtual storage controllers.
The example method depicted in fig. 22 includes migrating (2202) a data set from a virtual storage system to another virtual storage system, wherein at least one of the virtual storage systems is a locally deployed virtual storage system utilizing locally deployed physical storage resources. Migrating (2202) a data set from a virtual storage system to another virtual storage system, wherein at least one of the virtual storage systems is a locally deployed virtual storage system utilizing locally deployed physical storage resources, may be carried out as discussed above, for example, with respect to the example depicted in fig. 20. In some embodiments, migrating the data set from the locally deployed virtual storage system is performed by launching cloud computing instances to recreate the virtual components of the locally deployed virtual storage system and migrating data from the locally deployed virtual storage system to the cloud-based virtual storage system by designating the cloud-based virtual storage system as a replication target. In some embodiments, migrating the data set from the cloud-based virtual storage system is performed by launching computing instances in the locally hosted virtual environment to recreate the virtual components of the cloud-based virtual storage system and migrating data from the cloud-based virtual storage system to the locally deployed virtual storage system by designating the locally deployed virtual storage system as a replication target.
While the components, data, and policies of a virtual storage system in a public cloud infrastructure can be easily grouped together by account, a physical locally deployed storage system has no such analog. Thus, a locally deployed virtual storage system may embody a management unit representing volumes, file systems, object storage, analytics storage, snapshots, policies, connectivity, and other management entities related by the virtual storage system, with management changes made to the management unit (e.g., moving the management unit across storage systems, moving the management unit between storage classes) operating on each of the management entities in the management unit. In some embodiments, membership in a management unit may be defined by a user (e.g., an administrator) selecting or creating entities for inclusion in the management unit. In some implementations, membership in the management unit can be determined based on a set of affinity rules. Using a set of affinity rules, membership may be inferred from commonalities such as stored data sets, policies, replication or synchronization queues, host attachments, physical location, and other shared characteristics. In some implementations, membership in the management unit can be determined based on a set of anti-affinity rules. Using a set of anti-affinity rules, membership may be inferred from differences in stored data sets, policies, replication or synchronization queues, host attachments, physical location, and other characteristics that are not shared or, in some cases, are incompatible or unlikely to be shared.
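For illustration, the sketch below infers management-unit membership from affinity rules and excludes candidates under an anti-affinity rule; the entity fields used here (data_set, policy, host, region) and the grouping heuristic are assumptions made for the example, not requirements of this disclosure.

```python
# Group entities into a management unit when they share an affinity key with a
# seed entity, unless an anti-affinity rule marks them as incompatible.

def infer_management_unit(entities, affinity_keys=("data_set", "policy", "host"),
                          anti_affinity=lambda a, b: a.get("region") != b.get("region")):
    if not entities:
        return []
    seed = entities[0]
    unit = [seed]
    for candidate in entities[1:]:
        shares_affinity = any(
            candidate.get(k) is not None and candidate.get(k) == seed.get(k)
            for k in affinity_keys)
        if shares_affinity and not anti_affinity(seed, candidate):
            unit.append(candidate)
    return unit


if __name__ == "__main__":
    entities = [
        {"name": "vol-1", "data_set": "ds-a", "policy": "gold", "region": "east"},
        {"name": "snap-1", "data_set": "ds-a", "policy": "gold", "region": "east"},
        {"name": "vol-2", "data_set": "ds-b", "policy": "silver", "region": "west"},
    ]
    print([e["name"] for e in infer_management_unit(entities)])  # ['vol-1', 'snap-1']
```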
Readers will appreciate that migrating a data set from a locally deployed virtual storage system to another virtual or physical storage system includes migrating the policies, metadata, and connectivity of the data set so that the virtual storage system can be reconfigured in the replication target.
For further explanation, FIG. 23 sets forth a flow chart illustrating an additional example method of creating a virtual storage system according to some embodiments of the present disclosure. The example method depicted in fig. 23 is similar to the example method depicted in fig. 21 in that the example method depicted in fig. 23 also includes: instantiating (2102) one or more virtual storage controllers; instantiating (2104) one or more virtual storage devices each comprising a plurality of storage tiers; and constructing (2106) a virtual storage system in which one or more virtual storage devices are coupled to each of the one or more virtual storage controllers.
The example method depicted in fig. 23 also includes migrating (2302) the data set from the locally deployed virtual storage system to a local environment executing on the physical storage system. Migrating (2302) a data set from a locally deployed virtual storage system to a local environment executing on a physical storage system may be carried out by: reconstructing the locally deployed virtual storage system (e.g., locally deployed virtual storage system 1900 depicted in fig. 19) using storage controller resources local to the physical storage system and storage resources available in the physical storage system (e.g., the storage systems depicted in figs. 1A-1D, 2A-2G, and 3A-3B); and migrating the data in the virtual storage system to the physical storage system. In some implementations, the virtual storage controllers may implement the same storage controller application that is hosted in the physical storage environment. In this example, the storage controller application (e.g., a storage operating system) of the local operating environment may be configured as the target storage controller for the migration. In one example, virtual storage devices in the virtual storage system that are characterized by performance, capacity, durability, and other metrics are approximated using physical storage resources available in the physical environment. In this example, connectivity among the storage controllers and the physical storage resources is also recreated. For example, a virtual storage system may operate as a test and development model prior to deploying the storage system on a physical storage array.
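The following sketch illustrates one way the approximation step might work, matching each virtual storage device to the closest available physical resource by capacity and throughput; the metric names and the greedy selection heuristic are assumptions for illustration and are not part of this disclosure.

```python
# Approximate each virtual storage device with the physical storage resource
# whose capacity and throughput characteristics are closest, removing each
# physical resource from the pool once it has been assigned.

def approximate_physical_resources(virtual_devices, physical_pool):
    mapping = {}
    available = list(physical_pool)
    for vdev in virtual_devices:
        best = min(
            available,
            key=lambda p: abs(p["capacity_gib"] - vdev["capacity_gib"])
                          + abs(p["throughput_mbps"] - vdev["throughput_mbps"]),
        )
        mapping[vdev["name"]] = best["name"]
        available.remove(best)
    return mapping


if __name__ == "__main__":
    vdevs = [{"name": "vsd-0", "capacity_gib": 2048, "throughput_mbps": 500}]
    pool = [{"name": "shelf-0-drive-3", "capacity_gib": 1920, "throughput_mbps": 550},
            {"name": "shelf-1-drive-7", "capacity_gib": 8000, "throughput_mbps": 2000}]
    print(approximate_physical_resources(vdevs, pool))  # {'vsd-0': 'shelf-0-drive-3'}
```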
For further explanation, FIG. 24 sets forth a flow chart illustrating an additional example method of creating a virtual storage system according to some embodiments of the present disclosure. The example method depicted in fig. 24 is similar to the example method depicted in fig. 21 in that the example method depicted in fig. 24 also includes: instantiating (2102) one or more virtual storage controllers; instantiating (2104) one or more virtual storage devices each comprising a plurality of storage tiers; and constructing (2106) a virtual storage system in which one or more virtual storage devices are coupled to each of the one or more virtual storage controllers.
The example method depicted in fig. 24 also includes exposing (2402) a first set of interfaces to a first role of the virtual storage system, wherein the first set of interfaces configures a physical environment hosting the virtual storage system. In some examples, exposing (2402) the first set of interfaces to a first role of the virtual storage system (where the first set of interfaces configures a physical environment hosting the virtual storage system) is carried out by exposing APIs for building the virtual storage system on the physical storage resources such that the APIs are accessible to an infrastructure administrator. In these examples, the infrastructure administrator role manages the physical storage environment supporting the virtual storage system, including the capacity consumed by the virtual storage system; hardware failures; network connectivity and topology; protection policies (e.g., snapshot and replication policies); load-balancing moves of virtual storage systems across the physical infrastructure; physical-system-wide performance reservations and trending; physical-system-wide storage capacity and reservations; replication status to other regions; and virtual storage system-level replication and protection policies. The infrastructure administrator is also responsible for creating virtual storage system administrators and mapping virtual storage system administrators to virtual storage systems, defining virtual storage system quotas and other virtual storage system-level constraints, cloning virtual storage systems to new virtual storage systems, and creating deduplication reports across virtual storage systems, arrays, and zones.
In the example of FIG. 24, the virtual storage system infrastructure may expose services and APIs to create and manage the infrastructure that supports the virtual storage system. In one embodiment, the set of APIs available to the infrastructure administrator may comprise interfaces for: creating a virtual storage system; defining an administrator for the virtual storage system; defining a replication relationship for a virtual storage system; adding a logical quota to the virtual storage system; recovering the virtual storage system from a snapshot; creating a replication link for a volume; moving volumes between virtual storage systems; setting a QoS target for a virtual storage system; and querying volume size and deduplication efficiency.
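A hedged sketch of how the infrastructure-administrator interfaces listed above might be laid out as a REST-style API surface; the paths, HTTP methods, and operation names are hypothetical and are not defined by this disclosure.

```python
# Hypothetical endpoint layout for the infrastructure-administrator operations
# listed above; purely illustrative.

INFRA_ADMIN_ENDPOINTS = {
    "create_virtual_storage_system":    ("POST", "/infra/v1/virtual-storage-systems"),
    "define_system_administrator":      ("POST", "/infra/v1/virtual-storage-systems/{vss}/admins"),
    "define_replication_relationship":  ("POST", "/infra/v1/virtual-storage-systems/{vss}/replication"),
    "add_logical_quota":                ("PUT",  "/infra/v1/virtual-storage-systems/{vss}/quota"),
    "recover_from_snapshot":            ("POST", "/infra/v1/virtual-storage-systems/{vss}/recover"),
    "create_volume_replication_link":   ("POST", "/infra/v1/volumes/{volume}/replication-links"),
    "move_volume":                      ("POST", "/infra/v1/volumes/{volume}/move"),
    "set_qos_target":                   ("PUT",  "/infra/v1/virtual-storage-systems/{vss}/qos"),
    "query_dedup_efficiency":           ("GET",  "/infra/v1/virtual-storage-systems/{vss}/dedup"),
}


def describe_infra_api():
    """Print the hypothetical endpoint for each infrastructure-administrator operation."""
    for operation, (method, path) in INFRA_ADMIN_ENDPOINTS.items():
        print(f"{operation}: {method} {path}")


if __name__ == "__main__":
    describe_infra_api()
```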
The example method depicted in fig. 24 also includes exposing (2404) a second set of interfaces to a second role of the virtual storage system, wherein the second set of interfaces configures virtual components in the virtual storage system. In some examples, exposing (2404) the second set of interfaces to a second role of the virtual storage system (where the second set of interfaces configures virtual components in the virtual storage system) is carried out by exposing virtual computing environment APIs or cloud computing environment APIs such that system administrators of the virtual storage system can access these APIs to configure virtual components. In these examples, the system administrator role configures and maintains the virtual storage system for use by applications, including tasks such as: managing virtual storage system data sets, including creating volumes, buckets, file systems, and other addressable portions of storage within the virtual storage system; attaching logical limits (e.g., quotas or sizes) to virtual storage system components; defining access control lists for virtual storage resources; monitoring the logical space consumption of the virtual storage system and its components; monitoring the performance of the virtual storage system and its components; monitoring connectivity performance within the virtual storage system and to external application hosts; and defining snapshot policies.
In the example of fig. 24, the virtual storage system may expose services and APIs to configure and manage the virtual storage system. In one embodiment, the set of APIs available to a system administrator may contain interfaces for: creating a volume; taking a snapshot of the virtual storage system, a data set, or a volume; connecting to a host; creating a virtual storage system snapshot schedule; manually deleting a snapshot; creating a manual snapshot; adding a host to a connection policy; deleting a volume; growing a volume; shrinking a volume; restoring a volume from a snapshot; setting a QoS target for a volume; querying volume connectivity status; querying volume performance statistics; and querying volume access points.
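Illustrative only: a toy, in-memory stand-in for the system-administrator interface summarized above (creating a volume, snapshotting it, connecting a host, resizing it, and setting a QoS target); the class and method names are assumptions and do not represent an API defined by this disclosure.

```python
# In-memory mock of a system-administrator client for a virtual storage system.

class SystemAdminClient:
    def __init__(self):
        self._volumes = {}

    def create_volume(self, name, size_gib):
        self._volumes[name] = {"size_gib": size_gib, "snapshots": [],
                               "hosts": set(), "qos": None}

    def snapshot_volume(self, name, snapshot_name):
        self._volumes[name]["snapshots"].append(snapshot_name)

    def connect_host(self, name, host):
        self._volumes[name]["hosts"].add(host)

    def resize_volume(self, name, new_size_gib):
        # Covers both growing and shrinking a volume.
        self._volumes[name]["size_gib"] = new_size_gib

    def set_qos_target(self, name, iops):
        self._volumes[name]["qos"] = {"iops": iops}

    def query_volume(self, name):
        return self._volumes[name]


if __name__ == "__main__":
    admin = SystemAdminClient()
    admin.create_volume("vol-app1", size_gib=512)
    admin.snapshot_volume("vol-app1", "vol-app1.snap-0")
    admin.connect_host("vol-app1", "host-17")
    print(admin.query_volume("vol-app1"))
```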
Readers will appreciate that management of the virtual storage system may be divided between the system administrator role and the infrastructure administrator role, such that the system administrator role is provided with an interface to configure the virtual storage system without configuring the underlying infrastructure, or even without any knowledge of the physical hardware. This aspect is particularly advantageous in connection with a locally deployed virtual storage system, where a system administrator of the virtual storage system manages storage services abstracted from the physical hardware. Unlike cloud services provided by cloud service providers, where consumers of storage resources are unable to configure the hardware infrastructure, in a locally deployed virtual storage system the underlying physical infrastructure is configured to host the virtual storage system by, for example, an infrastructure administrator.
Readers will appreciate that whereas a locally deployed virtual storage system must be built on locally deployed hardware resources based on some awareness of available resources, a system administrator of the virtual storage system can configure storage services and storage components without such awareness. In this way, a system administrator of the virtual storage system may manage the storage system using an infrastructure-independent interface (e.g., an interface provided in a cloud-based platform); for example, by selecting configurations and policies for the virtual storage system based on desired performance (e.g., bandwidth, capacity), reliability (e.g., endurance, availability), and data characteristics. In some embodiments, the set of APIs for configuring and maintaining the locally deployed virtual storage system is exclusively available for the system administrator role, while the set of APIs for configuring and managing the infrastructure supporting the locally deployed virtual storage system is exclusively available for the infrastructure administrator role, such that the system administrator role is distinct from the infrastructure administrator role.
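A small sketch of the role separation just described, under the assumption that each API is tagged with the single role to which it is exclusively available; the API names, role names, and the authorize() helper are illustrative assumptions, not definitions from this disclosure.

```python
# Role-based gating: an API call is allowed only when the API belongs to the
# caller's role, keeping the system administrator and infrastructure
# administrator API sets mutually exclusive.

INFRA_ADMIN = "infrastructure_administrator"
SYSTEM_ADMIN = "system_administrator"

API_ROLE = {
    "create_virtual_storage_system":       INFRA_ADMIN,
    "define_virtual_storage_system_admin": INFRA_ADMIN,
    "set_system_qos_target":               INFRA_ADMIN,
    "create_volume":                       SYSTEM_ADMIN,
    "create_snapshot_schedule":            SYSTEM_ADMIN,
    "connect_host":                        SYSTEM_ADMIN,
}


def authorize(caller_role, api_name):
    """Allow a call only if the API is exclusively available to the caller's role."""
    required = API_ROLE.get(api_name)
    if required is None:
        raise KeyError(f"unknown API: {api_name}")
    return caller_role == required


assert authorize(SYSTEM_ADMIN, "create_volume")
assert not authorize(SYSTEM_ADMIN, "create_virtual_storage_system")
```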
Example embodiments are described primarily in the context of a fully functional computer system. However, readers of skill in the art will recognize that the present disclosure may also be embodied in a computer program product disposed on computer readable storage media for use with any suitable data processing system. Such computer readable storage media may be any storage media for machine-readable information, including magnetic media, optical media, or other suitable media. Examples of such media include magnetic disks in hard drives or floppy diskettes, optical disks for optical drives, magnetic tape, and others as will occur to those of skill in the art. Persons of ordinary skill in the art will immediately recognize that any computer system having suitable programming means will be capable of executing the steps of a method as embodied in a computer program product. Persons skilled in the art will also recognize that, although some of the example embodiments described in this specification are oriented to software installed and executing on computer hardware, alternative embodiments implemented as firmware or hardware are within the scope of the present disclosure.
Embodiments may include systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium (or multiple media) having computer-readable program instructions thereon for causing a processor to carry out aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or Flash memory), a Static Random Access Memory (SRAM), a portable Compact Disc Read-Only Memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanically encoded device such as a punch card or a raised structure in a groove having instructions recorded thereon, and any suitable combination of the foregoing. As used herein, a computer readable storage medium is not to be construed as being a transitory signal per se, such as a radio wave or other freely propagating electromagnetic wave, an electromagnetic wave propagating through a waveguide or other transmission medium (e.g., a pulse of light passing through a fiber optic cable), or an electrical signal transmitted through a wire.
The computer readable program instructions described herein may be downloaded from a computer readable storage medium to a corresponding computing/processing device or to an external computer or external storage device via a network (e.g., the internet, a local area network, a wide area network, and/or a wireless network). The network may include copper transmission cables, optical transmission fibers, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.
Computer readable program instructions for carrying out operations of the present disclosure may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, Field Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present disclosure.
Aspects of the present disclosure, according to some embodiments of the present disclosure, are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable storage medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The advantages and features of the present disclosure may be further described by the following statements:
Statement 1. A method of servicing I/O operations in a virtual storage system, the method comprising: receiving, by the virtual storage system, a request to write data to the virtual storage system; storing the data in storage provided by a first storage tier of the virtual storage system; and migrating at least a portion of the data stored within the first storage tier from the first storage tier to a second storage tier of the virtual storage system that is more durable than the first storage tier.
Statement 2. The method of statement 1, wherein migrating the at least the portion of the data stored within the hierarchical memory is in response to detecting a condition for transferring data from the hierarchical memory to durable data storage provided by a cloud service provider.
Statement 3. The method of statement 2 or statement 1, wherein the hierarchical memory comprises a plurality of virtual drive servers.
Statement 4. The method of statement 3, statement 2, or statement 1, wherein the plurality of virtual drive servers includes respective local storage.
Statement 5. The method of statement 4, statement 3, statement 2, or statement 1, wherein the plurality of virtual drive servers provide virtual drives as block-type data storage.
Statement 6. The method of statement 5, statement 4, statement 3, statement 2, or statement 1, wherein the request to write data to the virtual storage system is received by one or more virtual controllers running within a virtual machine, container, or bare metal server.
Statement 7. The method of statement 6, statement 5, statement 4, statement 3, statement 2, or statement 1, wherein the hierarchical memory is provided by a plurality of virtual drive servers that each include both a virtual controller and local memory.
Statement 8. The method of statement 7, statement 6, statement 5, statement 4, statement 3, statement 2, or statement 1, wherein the at least the portion of the data stored within the hierarchical memory is deduplicated, encrypted, or compressed prior to migration from the hierarchical memory to the durable data storage.
Statement 9. The method of statement 8, statement 7, statement 6, statement 5, statement 4, statement 3, statement 2, or statement 1, wherein the hierarchical memory of the virtual storage system is characterized by a low read latency relative to the durable data storage provided by the cloud service provider.
Statement 10. The method of statement 9, statement 8, statement 7, statement 6, statement 5, statement 4, statement 3, statement 2, or statement 1, wherein the first storage tier comprises hierarchical memory providing transaction consistency and write acknowledgement, and wherein the second storage tier comprises virtual drives provided by virtual drive servers of the virtual storage system.
Statement 11. The method of statement 10, statement 9, statement 8, statement 7, statement 6, statement 5, statement 4, statement 3, statement 2, or statement 1, wherein the first storage tier comprises the virtual drives provided by virtual drive servers of the virtual storage system, and wherein the second storage tier comprises object storage provided by a cloud service provider that provides object storage independent of the virtual storage system.
The advantages and features of the present disclosure may be further described by the following statements:
Statement 1. A virtual storage system within a cloud computing environment, the virtual storage system comprising: one or more virtual drives providing hierarchical memory for storage operations; and one or more virtual controllers, each virtual controller executing in a cloud computing instance, wherein the one or more virtual controllers are configured to: receive a request to write data to the virtual storage system; store the data in storage provided by a first storage tier of the virtual storage system; and migrate at least a portion of the data stored within the hierarchical memory from the first storage tier to a second storage tier of the virtual storage system that is more durable than the first storage tier.
Statement 2. The virtual storage system of statement 1, wherein the local storage of a given virtual drive is connected to one or more other virtual drives.
Statement 3. The virtual storage system of statement 2 or statement 1, wherein a first subset of the virtual controllers is located within a first availability zone, and wherein a second subset of the virtual controllers is located within a second availability zone.
Statement 4. The virtual storage system of statement 3, statement 2, or statement 1, wherein a first subset of the virtual drives is located within a first availability zone, and wherein a second subset of the virtual drives is located within a second availability zone.
Statement 5. The virtual storage system of statement 4, statement 3, statement 2, or statement 1, wherein the first subset of the virtual drives in the first availability zone and the second subset of the virtual drives in the second availability zone both use the same cloud-based object storage.
Statement 6. The virtual storage system of statement 5, statement 4, statement 3, statement 2, or statement 1, wherein migrating the at least the portion of the data stored within the hierarchical memory is in response to detecting a condition for transferring data from the hierarchical memory to durable data storage provided by the cloud service provider.
Statement 7. The virtual storage system of statement 6, statement 5, statement 4, statement 3, statement 2, or statement 1, wherein the hierarchical memory includes a plurality of virtual drive servers.
Statement 8. The virtual storage system of statement 7, statement 6, statement 5, statement 4, statement 3, statement 2, or statement 1, wherein each virtual drive includes both a respective virtual controller and respective local storage.
Statement 9. The virtual storage system of statement 8, statement 7, statement 6, statement 5, statement 4, statement 3, statement 2, or statement 1, wherein the hierarchical memory is provided by a plurality of virtual drive servers that each include both a virtual controller and local memory.
Statement 10. The virtual storage system of statement 9, statement 8, statement 7, statement 6, statement 5, statement 4, statement 3, statement 2, or statement 1, wherein the virtual storage system synchronously replicates the data with one or more other virtual storage systems.
Statement 11. The virtual storage system of statement 10, statement 9, statement 8, statement 7, statement 6, statement 5, statement 4, statement 3, statement 2, or statement 1, wherein the virtual storage system architecture implementing the virtual storage system is different from the virtual storage system architecture implementing at least one of the one or more other virtual storage systems.

Claims (20)

1. A method, comprising:
instantiating one or more virtual storage controllers;
instantiating one or more virtual storage devices, each comprising a plurality of storage tiers; and
constructing a virtual storage system, wherein the one or more virtual storage devices are coupled to each of the one or more virtual storage controllers.
2. The method of claim 1, wherein the one or more virtual storage devices each comprise a local storage device.
3. The method of claim 1, wherein the one or more virtual storage devices are attached to a cloud-based object storage device.
4. The method of claim 1, wherein the virtual storage system is a cloud-based virtual storage system created using a service provided by a cloud service provider.
5. The method of claim 4, wherein the one or more virtual controllers are implemented in respective cloud computing instances of a cloud platform; and wherein the one or more virtual storage devices are implemented in respective cloud computing instances of the cloud platform.
6. The method of claim 1, wherein the virtual storage system is a locally deployed virtual storage system created in a virtual environment supported by locally deployed physical storage resources.
7. The method of claim 1, further comprising migrating a data set from the virtual storage system to another virtual storage system, wherein at least one of the virtual storage systems is a locally deployed virtual storage system that utilizes locally deployed physical storage resources.
8. The method of claim 1, further comprising migrating a data set from a locally deployed virtual storage system to a local environment executing on a physical storage system.
9. The method of claim 1, further comprising copying data sets from the virtual storage system to another virtual storage system, wherein at least one of the virtual storage systems is a locally deployed virtual storage system that utilizes locally deployed physical storage resources.
10. The method of claim 1, further comprising copying a data set from a locally deployed virtual storage system to a local environment executing on a physical storage system.
11. The method as recited in claim 1, further comprising:
exposing a first set of interfaces to a first role of the virtual storage system, wherein the first set of interfaces configures a physical environment hosting the virtual storage system; and
exposing a second set of interfaces to a second role of the virtual storage system, wherein the second set of interfaces configures virtual components in the virtual storage system.
12. An apparatus comprising a computer processor and a computer memory operatively coupled to the computer processor, the computer memory having disposed within it computer program instructions that, when executed by the computer processor, cause the apparatus to carry out the steps of:
instantiating one or more virtual storage controllers;
instantiating one or more virtual storage devices, each comprising a plurality of storage tiers; and
constructing a virtual storage system, wherein the one or more virtual storage devices are coupled to each of the one or more virtual storage controllers.
13. The apparatus of claim 12, further comprising computer program instructions that, when executed by the computer processor, cause the apparatus to perform the step of migrating a data set from the virtual storage system to another virtual storage system, wherein at least one of the virtual storage systems is a locally deployed virtual storage system that utilizes locally deployed physical storage resources.
14. The apparatus of claim 12, further comprising computer program instructions that, when executed by the computer processor, cause the apparatus to perform the step of migrating a data set from a locally deployed virtual storage system to a local environment executing on a physical storage system.
15. The apparatus of claim 12, further comprising computer program instructions that, when executed by the computer processor, cause the apparatus to perform the step of copying a data set from the virtual storage system to another virtual storage system, wherein at least one of the virtual storage systems is a locally deployed virtual storage system that utilizes locally deployed physical storage resources.
16. The apparatus of claim 12, further comprising computer program instructions that, when executed by the computer processor, cause the apparatus to perform the step of copying a data set from a locally deployed virtual storage system to a local environment executing on a physical storage system.
17. The apparatus of claim 12, further comprising computer program instructions that, when executed by the computer processor, cause the apparatus to carry out the steps of:
exposing a first set of interfaces to a first role of the virtual storage system, wherein the first set of interfaces configures a physical environment hosting the virtual storage system; and
exposing a second set of interfaces to a second role of the virtual storage system, wherein the second set of interfaces configures virtual components in the virtual storage system.
18. A computer program product disposed on a computer readable medium, the computer program product comprising computer program instructions that, when executed, cause a computer to carry out the steps of:
instantiating one or more virtual storage controllers;
instantiating one or more virtual storage devices, each comprising a plurality of storage tiers; and
constructing a virtual storage system, wherein the one or more virtual storage devices are coupled to each of the one or more virtual storage controllers.
19. The computer program product of claim 18, further comprising computer program instructions that, when executed, cause the computer to perform the step of migrating a data set from the virtual storage system to another virtual storage system, wherein at least one of the virtual storage systems is a locally deployed virtual storage system that utilizes locally deployed physical storage resources.
20. The computer program product of claim 18, further comprising computer program instructions that, when executed, cause the computer to perform the step of migrating a data set from a locally deployed virtual storage system to a local environment executing on a physical storage system.