US20180210848A1 - Storage in Multi-Queue Storage Devices Using Queue Multiplexing and Access Control - Google Patents

Storage in Multi-Queue Storage Devices Using Queue Multiplexing and Access Control

Info

Publication number
US20180210848A1
Authority
US
United States
Prior art keywords
storage
queue
server
commands
running
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US15/847,992
Other versions
US10031872B1 (en)
Inventor
Alex Friedman
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Amazon Technologies Inc
Original Assignee
E8 Storage Systems Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by E8 Storage Systems Ltd
Priority to US15/847,992 (US10031872B1)
Assigned to E8 STORAGE SYSTEMS LTD. Assignment of assignors interest (see document for details). Assignors: FRIEDMAN, ALEX
Application granted
Publication of US10031872B1
Publication of US20180210848A1
Assigned to AMAZON TECHNOLOGIES, INC. Assignment of assignors interest (see document for details). Assignors: E8 STORAGE SYSTEMS LTD.
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F13/00 Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
    • G06F13/14 Handling requests for interconnection or transfer
    • G06F13/20 Handling requests for interconnection or transfer for access to input/output bus
    • G06F13/28 Handling requests for interconnection or transfer for access to input/output bus using burst mode transfer, e.g. direct memory access DMA, cycle steal
    • G06F13/287 Multiplexed DMA
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061 Improving I/O performance
    • G06F3/0613 Improving I/O performance in relation to throughput
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655 Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0659 Command handling arrangements, e.g. command buffers, queues, command scheduling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/10 Network architectures or network communication protocols for network security for controlling access to devices or network resources
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/20 Network architectures or network communication protocols for network security for managing network security; network security policies in general
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Definitions

  • the present invention relates generally to data storage, and particularly to methods and systems for distributed storage.
  • U.S. Pat. No. 9,112,890 whose disclosure is incorporated herein by reference, describes a method for data storage including, in a system that includes one or more storage controllers, multiple servers and multiple multi-queue storage devices, assigning in each storage device server-specific queues for queuing data-path storage commands exchanged with the respective servers. At least some of the data-path storage commands are exchanged directly between the servers and the storage devices, not via the storage controllers, to be queued and executed in accordance with the corresponding server-specific queues.
  • An embodiment of the present invention that is described herein provides a method for data storage including, in a system that includes multiple servers, multiple multi-queue storage devices and at least one storage controller that communicate over a network, running, in a server among the servers, multiple data-path instances (DPs) that operate independently of one another and issue storage commands for execution in the multi-queue storage devices.
  • the storage commands, issued by the multiple DPs running in the server, are multiplexed using an Input-Output Multiplexer (I/O MUX) process.
  • the multiplexed storage commands are executed in the multi-queue storage devices.
  • executing the multiplexed storage commands includes, in a given multi-queue storage device, queuing the storage commands issued by the multiple DPs running in the server in a single queue pair (QP) associated with the I/O MUX process.
  • multiplexing the storage commands includes running the I/O MUX process in the server. In an alternative embodiment, multiplexing the storage commands includes running the I/O MUX process in a gateway separate from the server. The method may further include running in the gateway an access-control process that enforces an access-control policy on the storage commands.
  • multiplexing and executing the storage commands includes accessing the multi-queue storage devices using remote direct memory access, without running code on the at least one storage controller.
  • a computing system including multiple multi-queue storage devices, at least one storage controller, and multiple servers.
  • a server among the servers is configured to run multiple data-path instances (DPs) that operate independently of one another and issue storage commands for execution in the multi-queue storage devices.
  • a processor in the computing system is configured to multiplex the storage commands issued by the multiple DPs running in the server using an Input-Output Multiplexer (I/O MUX) process, so as to execute the multiplexed storage commands in the multi-queue storage devices.
  • FIG. 1 is a block diagram that schematically illustrates a computing system that uses distributed data storage, in accordance with an embodiment of the present invention.
  • FIG. 2 is a block diagram that schematically illustrates elements of a storage agent used in the system of FIG. 1 , in accordance with an embodiment of the present invention.
  • FIG. 3 is a block diagram that schematically illustrates queuing and I/O multiplexing elements in the computing system of FIG. 1 , in accordance with an embodiment of the present invention.
  • FIG. 4 is a block diagram that schematically illustrates a computing system that uses distributed data storage, in accordance with an alternative embodiment of the present invention.
  • Embodiments of the present invention that are described herein provide improved methods and systems for distributed data storage.
  • the disclosed techniques are typically implemented in a computing system comprising multiple servers that store data in multiple shared multi-queue storage devices, and one or more storage controllers.
  • the servers run data-path instances (DPs) that execute storage commands in the storage devices on behalf of user applications.
  • the DPs perform logical-to-physical address translation and implement redundant storage such as RAID.
  • Computing systems of this sort are described, for example, in U.S. Pat. Nos. 9,112,890, 9,274,720, 9,519,666, 9,521,201, 9,525,737 and 9,529,542, whose disclosures are incorporated herein by reference.
  • a given server may run two or more DPs in parallel.
  • the server further comprises an Input/Output multiplexer (I/O MUX) that multiplexes storage commands (e.g., write and read commands) from the various DPs vis-à-vis the storage devices.
  • I/O MUXs are not implemented in the servers but on separate gateways. In the latter embodiments, the I/O MUXs may also carry out additional operations, such as enforcing access-control policies.
  • the multi-queue storage devices are able to hold a respective queue (or queue pair—QP) per I/O MUX rather than per data-path instance (DP).
  • FIG. 1 is a block diagram that schematically illustrates a computing system 20 , in accordance with an embodiment of the present invention.
  • System 20 may comprise, for example, a data center, a High-Performance Computing (HPC) cluster, or any other suitable system.
  • System 20 comprises multiple servers 24 denoted S1 . . . Sn, and multiple storage devices 28 denoted D1 . . . Dm.
  • the servers and storage devices are interconnected by a communication network 32 .
  • the system further comprises one or more storage controllers 36 that manage the storage of data in storage devices 28 .
  • data-path operations such as writing and readout are performed directly between the servers and the storage devices, without having to trigger or run code on the storage controller CPUs.
  • the storage controller CPUs are involved only in relatively rare control-path operations.
  • the servers do not need to, and typically do not, communicate with one another or otherwise coordinate storage operations with one another. Coordination is typically performed by the servers accessing shared data structures that reside, for example, in the memories of the storage controllers.
  • storage devices 28 are comprised in a storage-device enclosure 30 , e.g., a rack, drawer or cabinet.
  • Enclosure 30 further comprises a staging Random Access Memory (RAM) unit 42 that comprises multiple staging RAMS 43 .
  • the staging RAM unit is used as a front-end for temporary caching of I/O commands en-route from servers 24 to storage devices 28 .
  • Staging RAMS 43 are therefore also referred to herein as interim memory.
  • Enclosure 30 may also comprise a Central Processing Unit (CPU—not shown).
  • the staging RAM and staging RAM unit are also referred to herein as “NVRAM cache” or “cache memory.”
  • the use of staging RAMS 43 is advantageous, for example, in various recovery processes.
  • the use of staging RAMS 43 essentially converts write workloads having low queue depth (e.g., queue depth of 1) into write workloads having a high queue depth, thereby significantly improving the average write latency, throughput and IOPS of such workloads.
  • Agents 40 typically comprise software modules installed and running on the respective servers.
  • a storage agent 40 may comprise one or more Data-Path (DP) modules 46 .
  • DP modules 46 are also referred to herein as “DP instances” or simply as DPs.
  • DPs 46 perform storage related functions for user applications running in the server.
  • a given agent 40 on a given server 24 may run multiple DPs 46 , e.g., for increasing throughput and IOPS.
  • each DP is pinned to a specific CPU core of the server, and each DP operates independently of the other DPs.
  • an I/O multiplexer (MUX) 47 multiplexes the communication between DPs 46 of that agent 40 vis-à-vis staging RAMS 43 and storage devices 28 .
  • Running multiple independent DPs on a server is highly effective in achieving scalability and performance improvement, while avoiding various kinds of resource contention.
  • each agent 40 (and thus each server 24 ) comprises two DPs 46 .
  • different agents 40 (and thus different servers 24 ) may comprise different numbers of DPs 46 .
  • Any agent 40 (and thus each server 24 ) may comprise any suitable number of DPs 46 .
  • MUX 47 in that agent may be omitted.
  • I/O MUX 47 may run, for example, in a dedicated software thread on the CPU of server 24 .
  • MUX 47 may run in the same software thread as one of DPs 46 .
  • a given server 24 may comprise two or more MUX 47 , each serving a subset of the DPs 46 running in the server.
  • MUX 47 need not necessarily run in server 24 , and may alternatively run on any suitable processor in system 20 .
  • FIG. 4 below depicts an alternative system configuration in which I/O MUXs 47 run in gateways that are separate from servers 24 .
  • Servers 24 may comprise any suitable computing platforms that run any suitable applications.
  • the term “server” includes both physical servers and virtual servers.
  • a virtual server may be implemented using a Virtual Machine (VM) that is hosted in some physical computer.
  • Storage controllers 36 may be physical or virtual.
  • the storage controllers may be implemented as software modules that run on one or more physical servers 24 .
  • Storage devices 28 may comprise any suitable storage medium, such as, for example, Solid State Drives (SSD), Non-Volatile Random Access Memory (NVRAM) devices or Hard Disk Drives (HDDs).
  • storage devices 28 comprise multi-queued SSDs that operate in accordance with the NVMe specification.
  • each storage device 28 provides multiple queues for storage commands.
  • the storage devices typically have the freedom to queue, schedule and reorder execution of storage commands.
  • The terms “storage commands” and “I/Os” are used interchangeably herein.
  • Network 32 may operate in accordance with any suitable communication protocol, such as Ethernet or Infiniband.
  • the embodiments described below refer mainly to RDMA protocols, by way of example.
  • Any variant of RDMA may be used for this purpose, e.g., Infiniband (IB), RDMA over Converged Ethernet (RoCE), Virtual Interface Architecture and Internet Wide Area RDMA Protocol (iWARP).
  • the disclosed techniques can be implemented using any other form of direct memory access over a network, e.g., Direct Memory Access (DMA), various Peripheral Component Interconnect Express (PCIe) schemes, or any other suitable protocol.
  • system 20 may comprise any suitable number of servers, storage devices and storage controllers.
  • the system comprises two storage controllers denoted C1 and C2, for resilience.
  • both controllers are continuously active and provide backup to one another, and the system is designed to survive failure of a single controller.
  • the assumption is that any server 24 is able to communicate with any storage device 28 , but there is no need for the servers to communicate with one another.
  • Storage controllers 36 are assumed to be able to communicate with all servers 24 and storage devices 28 , as well as with one another.
  • FIG. 2 is a block diagram that schematically illustrates elements of storage agent 40 , in accordance with an embodiment of the present invention.
  • a respective storage agent of this sort typically runs on each server 24 .
  • servers 24 may comprise physical and/or virtual servers. Thus, a certain physical computer may run multiple virtual servers 24 , each having its own respective storage agent 40 .
  • FIG. 2 depicts an example in which agent 40 comprises two DPs 46 . As noted above, agent 40 may alternatively comprise any suitable number of DPs, or even a single DP.
  • each DP 46 performs storage-related functions for one or more user applications 44 running on server 24 .
  • different DPs 46 in a given storage agent 40 (in a given server 24 ) access storage devices 28 independently of one another.
  • Each DP 46 comprises a Redundant Array of Independent Disks (RAID) layer 48 and a user-volume layer 52 .
  • RAID layer 48 carries out a redundant storage scheme over storage devices 28 , including handling storage resiliency, detection of storage device failures, rebuilding of failed storage devices and rebalancing of data in case of maintenance or other evacuation of a storage device.
  • RAID layer 48 also typically stripes data across multiple storage devices 28 for improving storage performance.
  • RAID layer 48 implements a RAID-10 scheme, i.e., replicates and stores two copies of each data item on two different storage devices 28 .
  • One of the two copies is defined as primary and the other as secondary.
  • the primary copy is used for readout as long as it is available. If the primary copy is unavailable, for example due to storage-device failure, the RAID layer falls back to reading the secondary copy.
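The primary/secondary readout rule described above can be sketched as follows. This is a minimal illustration, not the implementation of RAID layer 48: the `DeviceUnavailable` exception, the `read_block` callback and the toy device names and addresses are all assumptions introduced for the example.

```python
class DeviceUnavailable(Exception):
    """Hypothetical error raised when a storage device cannot serve a read."""


def read_replicated(read_block, primary, secondary):
    """Read the primary copy; fall back to the secondary if it is unavailable.

    `primary` and `secondary` are (device, physical_address) pairs.
    """
    try:
        return read_block(*primary)
    except DeviceUnavailable:
        return read_block(*secondary)


# Toy backing store (assumed): D1 has failed, D2 holds the secondary copy.
failed = {"D1"}
blocks = {("D1", 100): b"data", ("D2", 900): b"data"}
accessed = []  # records which devices were tried, in order


def read_block(device, address):
    accessed.append(device)
    if device in failed:
        raise DeviceUnavailable(device)
    return blocks[(device, address)]


result = read_replicated(read_block, ("D1", 100), ("D2", 900))
```

In this toy run the read is attempted on D1 first and served from D2 after D1 reports failure.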
  • RAID layer 48 may implement any other suitable redundant storage scheme (RAID-based or otherwise), such as schemes based on erasure codes, RAID-1, RAID-4, RAID-5, RAID-6, RAID-10, RAID-50, or multi-dimensional RAID schemes.
  • RAID layer 48 stores data in stripes that are distributed over multiple storage devices, each stripe comprising multiple data elements and one or more redundancy elements (e.g., parity) computed over the data elements.
  • the stripes are made up of data and redundancy blocks, but the disclosed techniques can be used with other suitable types of data and redundancy elements.
  • The terms “parity” and “redundancy” are used interchangeably herein.
  • An example is RAID-6, in which each stripe comprises N data blocks and two parity blocks.
  • RAID layer 48 accesses storage devices 28 using physical addressing.
  • RAID layer 48 exchanges with storage devices 28 read and write commands, as well as responses and retrieved data, which directly specify physical addresses (physical storage locations) on the storage devices.
  • all logical-to-physical address translations are performed in DPs 46 within agents 40 in servers 24 , and none in storage devices 28 .
  • RAID layer 48 maps between physical addresses and Logical Volumes (LVs) to be used by user-volume layer 52 . Each LV is mapped to two or more physical-address ranges on two or more different storage devices. The two or more ranges are used for storing the replicated copies of the LV data as part of the redundant storage scheme.
  • the redundant storage scheme (e.g., RAID) is thus hidden from user-volume layer 52 .
  • Layer 52 views the storage medium as a set of guaranteed-storage LVs.
  • User-volume layer 52 is typically unaware of storage device failure, recovery, maintenance and rebuilding, which are handled transparently by RAID layer 48 . (Nevertheless, some optimizations may benefit from such awareness by layer 52 . For example, there is no need to rebuild unallocated storage space.)
  • User-volume layer 52 provides storage resources to applications 44 by exposing user volumes that are identified by respective Logical Unit Numbers (LUNs).
  • The terms “user volume” and “LUN” are used interchangeably herein.
  • a user application 44 views the storage system as a collection of user volumes, and issues storage commands having user-volume addresses.
  • the user-volume addresses are also referred to as User Block Addresses (UBAs) and the LV addresses are also referred to as RAID Block Addresses (RBAs).
  • Each DP 46 translates between the different address spaces using a RAID table 56 and a volume map 60 .
  • RAID table 56 holds the translation between LV addresses and physical addresses, and volume map 60 holds the translation between user-volume addresses and LV addresses.
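The two-stage translation (user-volume address to LV address via the volume map, then LV address to physical addresses via the RAID table) might look like the sketch below. The extent-based layout, the `EXTENT` size, and all table contents are assumptions made for illustration; the source does not specify how RAID table 56 or volume map 60 are organized.

```python
EXTENT = 1024  # hypothetical extent size, in blocks

# Assumed volume map: (user volume, extent index) -> (LV, LV extent index)
volume_map = {
    ("vol1", 0): ("lv0", 2),
    ("vol1", 1): ("lv0", 5),
}

# Assumed RAID table: (LV, LV extent index) -> replicas as (device, base address).
# Two entries per extent model the two copies kept on two different devices.
raid_table = {
    ("lv0", 2): [("D1", 0), ("D2", 4096)],
    ("lv0", 5): [("D1", 8192), ("D3", 0)],
}


def translate(volume, uba):
    """Translate a User Block Address into an RBA and its physical locations."""
    index, offset = divmod(uba, EXTENT)
    lv, lv_index = volume_map[(volume, index)]        # UBA -> LV address
    rba = lv_index * EXTENT + offset
    physical = [(dev, base + offset)                  # LV address -> physical
                for dev, base in raid_table[(lv, lv_index)]]
    return lv, rba, physical


lv, rba, physical = translate("vol1", 1030)
```

Both lookups run inside the DP on the server, which matches the statement that no logical-to-physical translation is performed in the storage devices.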
  • any server 24 may attach to any user volume.
  • a given user volume may have multiple servers attached thereto.
  • storage controllers 36 define and maintain a global volume map that specifies all user volumes in system 20 .
  • Volume map 60 in each DP 46 comprises a locally-cached copy of at least part of the global volume map.
  • volume maps 60 of DPs 46 hold at least the mapping of the user volumes (LUNs) to which this server is attached.
  • volume map 60 supports thin provisioning.
  • each I/O MUX 47 multiplexes storage commands of two or more DPs 46 .
  • the use of MUXs 47 enables each multi-queue storage device 28 to hold a respective queue per MUX 47 , rather than per DP 46 . Since storage devices are typically limited in the number of queues they are able to support, the disclosed embodiments enhance the scalability of system 20 considerably.
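A rough count shows why this matters. The sketch assumes that every issuing entity (a DP or a MUX) on every server holds one queue pair on each storage device; the server and DP counts are hypothetical numbers chosen only for the arithmetic.

```python
def queue_pairs_per_device(num_servers, issuers_per_server):
    """Queue pairs a single storage device must hold, one per issuing entity."""
    return num_servers * issuers_per_server


# Hypothetical deployment: 100 servers, each running 8 DPs and a single MUX.
per_dp = queue_pairs_per_device(100, 8)    # one QP per DP
per_mux = queue_pairs_per_device(100, 1)   # one QP per MUX
```

Under these assumptions the per-device queue count drops by a factor equal to the number of DPs served by each MUX.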
  • FIG. 3 is a block diagram that schematically illustrates queuing and I/O multiplexing elements in system 20 , in accordance with an embodiment of the present invention.
  • storage agent 40 comprises multiple data-path instances (DPs) 46 .
  • Agent 40 maintains a respective Queue Pair (QP) per DP 46 .
  • Each QP comprises a DP Submission Queue (DPSQ) 62 and a DP Completion Queue (DPCQ) 63.
  • each storage device 28 maintains a respective QP that comprises a Storage-Device Submission Queue (SDSQ) 64 and a Storage-Device Completion Queue (SDCQ) 65.
  • each DPSQ 62 and each DPCQ 63 may be implemented as an in-memory shared cyclic queue, with a phase bit that toggles between “0” and “1” on each wraparound of the queue.
  • the phase bit is typically used to determine the location of the producer.
  • the phase bit is initially set to “1” such that all commands submitted in the first cycle of the submission queue (until the first wraparound) have their phase bit set to “1”.
  • After the first wraparound, the phase bit is set to “0”, so that any new command is submitted with a “0” phase bit. Toggling continues in a similar manner for subsequent cycles. It is assumed that the queue is initially zeroed. Using this technique, a consumer can easily detect the position of the most-recently submitted command in the cyclic queue. The location of the consumer of a DPSQ is sent through the corresponding DPCQ, and vice versa.
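The phase-bit scheme can be modeled in a few lines. This is a single-process sketch under stated assumptions: the class name `PhaseQueue` and the `outstanding` counter are hypothetical, and the counter stands in for the consumer location that, in the described system, would reach the producer through the opposite queue.

```python
class PhaseQueue:
    """Cyclic queue whose entries carry a phase bit that flips on each wraparound."""

    def __init__(self, size):
        self.size = size
        self.slots = [(0, None)] * size            # queue is initially zeroed
        self.tail, self.producer_phase = 0, 1      # first cycle uses phase "1"
        self.head, self.consumer_phase = 0, 1
        self.outstanding = 0                       # stand-in for consumer-location feedback

    def submit(self, command):
        """Producer side: write the command tagged with the current phase bit."""
        if self.outstanding == self.size:
            return False                           # full; would overwrite unread entries
        self.slots[self.tail] = (self.producer_phase, command)
        self.outstanding += 1
        self.tail += 1
        if self.tail == self.size:                 # wraparound: toggle the phase bit
            self.tail = 0
            self.producer_phase ^= 1
        return True

    def poll(self):
        """Consumer side: an entry is new only if its phase bit matches ours."""
        phase, command = self.slots[self.head]
        if phase != self.consumer_phase:
            return None                            # stale or never-written slot
        self.outstanding -= 1
        self.head += 1
        if self.head == self.size:
            self.head = 0
            self.consumer_phase ^= 1
        return command


queue = PhaseQueue(4)
for name in ("a", "b", "c"):
    queue.submit(name)
first_cycle = [queue.poll() for _ in range(4)]
```

Because a slot left over from the previous cycle carries the opposite phase bit, the consumer needs no separate producer index to tell new entries from old ones.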
  • a storage command issued by a DP 46 typically comprises (i) the type of command (e.g., read or write), (ii) relevant storage-device parameters such as LBA, (iii) remote staging RAM locations, and (iv) local data pointers for the RDMA transfer.
  • write commands are executed as follows:
  • read commands are executed as follows:
  • write-command and read-command execution flows described above are example flows, which are depicted purely for the sake of conceptual clarity. In alternative embodiments, any other suitable flows can be used.
  • FIG. 4 is a block diagram that schematically illustrates a computing system that uses distributed data storage, in accordance with an alternative embodiment of the present invention.
  • I/O MUXs 47 are implemented in separate Gateways (GWs) 70 and not in servers 24 .
  • the system comprises a respective GW 70 for each storage device 28 .
  • GWs 70 may comprise, for example, dedicated computers or other processors, located at any suitable location in the system.
  • each GW 70 further comprises a respective access control module (ACC) 74 .
  • Each ACC 74 permits or denies execution of storage commands on the respective storage device 28 , in accordance with a certain access-control policy.
  • the policy may specify that each DP is allowed to access only volumes that are mapped thereto. Without access control, malicious hosts may illegitimately gain access to information they are not entitled to access on storage devices 28 .
  • each ACC 74 is notified of the volumes to which the various DPs 46 are mapped, and is therefore able to permit or deny each storage command, as appropriate.
  • each ACC 74 is provided with a “reverse volume map” that maps RBA ranges to the (one or more) servers mapped to the volumes having these RBA ranges.
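An access check against such a reverse volume map could be sketched as below. The table layout (half-open RBA ranges paired with sets of server names), the function name `is_permitted` and the sample entries are assumptions; the sketch also takes the conservative simplification of denying any command that is not fully contained in a single range.

```python
# Assumed reverse volume map: (start RBA, end RBA exclusive, servers mapped to it)
reverse_volume_map = [
    (0, 1000, {"S1"}),
    (1000, 3000, {"S1", "S2"}),
]


def is_permitted(server, rba, length):
    """Permit a command only if its whole RBA range lies in a range mapped to `server`."""
    end = rba + length
    for start, stop, servers in reverse_volume_map:
        if start <= rba and end <= stop:
            return server in servers
    return False  # unmapped or straddling ranges are denied in this sketch


allowed = is_permitted("S2", 1500, 8)
denied = is_permitted("S2", 10, 8)
```

A linear scan is enough for illustration; a real gateway handling many ranges would more likely use a sorted structure with binary search.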
  • ACCs 74 may enforce any other suitable access-control policy. ACCs are typically configured with the policy by storage controllers 36 .
  • MUX 47 and ACC 74 are implemented as separate software processes on GW 70 . In other embodiments, the functions of MUX 47 and ACC 74 on a given GW 70 may be carried out jointly by a single software module.
  • each agent 40 runs one or more “MUX DP” (MDP) modules, and each GW 70 runs one or more host-specific (i.e., server-specific) submission queues (HSQs) and one or more respective host-specific (i.e., server-specific) completion queues (HCQs).
  • An MDP in a certain agent 40 is defined as the consumer of some or all of DPSQs 62 of DPs 46 in that agent 40 .
  • an MDP in an agent 40 is defined as the producer of some or all of DPCQs 63 of DPs 46 in that agent 40 .
  • each agent comprises only a single MDP.
  • a given agent 40 may comprise two or more MDPs that run in different contexts, for enhancing performance.
  • each MDP is typically assigned to a respective subset of the DPs in agent 40 .
  • An MDP may run in the same context (e.g., same thread or same CPU core) as one of the DPs, or in a dedicated context.
  • Each MDP is configured to aggregate its assigned DPSQs in a fair manner, e.g., Round Robin, and to write each command (using RDMA) to the appropriate HSQ on the appropriate GW 70 .
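The round-robin aggregation can be illustrated with in-memory queues. In this sketch the DPSQs are `deque` objects, each command is tagged with a hypothetical target gateway name, and appending to a per-gateway list stands in for the RDMA write to the HSQ; none of these names come from the source.

```python
from collections import defaultdict, deque


def drain_round_robin(dpsqs, hsqs):
    """Take one command from each non-empty DPSQ in turn until all are empty."""
    while any(dpsqs):
        for dpsq in dpsqs:
            if dpsq:
                gateway, command = dpsq.popleft()
                hsqs[gateway].append(command)  # stands in for an RDMA write to the HSQ


# Three DPs with unequal backlogs; commands are (target gateway, command id).
dpsqs = [
    deque([("GW1", "a1"), ("GW2", "a2"), ("GW1", "a3")]),
    deque([("GW1", "b1")]),
    deque([("GW2", "c1"), ("GW1", "c2")]),
]
hsqs = defaultdict(list)
drain_round_robin(dpsqs, hsqs)
```

Serving one command per DP per pass keeps a DP with a long backlog from starving the others, which is the fairness property the text asks for.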
  • the same QPs are used for (i) RDMA writes that write data of write commands, and (ii) RDMA writes that write the write commands themselves.
  • write commands are executed in the system of FIG. 4 as follows:
  • read commands in the system of FIG. 4 are executed as follows:
  • a given I/O MUX 47 running on a given GW 70 , may be implemented as a single-threaded or multi-threaded (or otherwise multi-core) process.
  • each thread of the MUX typically runs on a dedicated core of GW 70 and is responsible for a configurable subset of servers 24 .
  • MUX 47 typically polls the HSQs of its designated servers 24 , issues storage commands (I/Os) to storage devices 28 , and polls the CQs of the storage devices. The CQs are multiplexed back into HCQs, which are in turn RDMA-written back to the servers.
  • a phase bit and a consumer location are passed in the HSQ and HCQ.
  • the relevant data is either RDMA-written to the server by the gateway, or RDMA-read by the server from the staging RAM.
  • write-command and read-command execution flows described above are example flows, which are depicted purely for the sake of conceptual clarity. In alternative embodiments, any other suitable flows can be used.
  • the MDPs can be omitted, meaning that each DP essentially serves as an MDP.
  • FIGS. 1-3 are example configurations, which are chosen purely for the sake of conceptual clarity. In alternative embodiments, any other suitable system configurations can be used.
  • the system may comprise only a single storage controller 36 .
  • the functionality of storage controllers 36 may be distributed among servers 24 .
  • Each server 24 typically comprises a suitable network interface for communicating over network 32 , e.g., with the storage devices and/or storage controllers, and a suitable processor that carries out the various server functions.
  • Each storage controller 36 typically comprises a suitable network interface for communicating over network 32 , e.g., with the storage devices and/or servers, and a suitable processor that carries out the various storage controller functions.
  • servers 24 and/or storage controllers 36 comprise general-purpose processors, which are programmed in software to carry out the functions described herein.
  • the software may be downloaded to the processors in electronic form, over a network, for example, or it may, alternatively or additionally, be provided and/or stored on non-transitory tangible media, such as magnetic, optical, or electronic memory.
  • I/O MUXs 47 may also be used to add extra features missing from the underlying storage devices 28 .
  • Non-limiting examples of such features may comprise atomic operations, reservations and copy offload operations.
  • a MUX 47 may coordinate such atomic operations, lock ranges to implement reservations or offload copying data from one range to another by performing the copy by itself, without involving servers 24 or storage controllers 36 .

Abstract

A method for data storage includes, in a system that includes multiple servers, multiple multi-queue storage devices and at least one storage controller that communicate over a network, running, in a server among the servers, multiple data-path instances (DPs) that operate independently of one another and issue storage commands for execution in the multi-queue storage devices. The storage commands, issued by the multiple DPs running in the server, are multiplexed using an Input-Output Multiplexer (I/O MUX) process. The multiplexed storage commands are executed in the multi-queue storage devices.

Description

    CROSS-REFERENCE TO RELATED APPLICATIONS
  • This application claims the benefit of U.S. Provisional Patent Application 62/449,131, filed Jan. 23, 2017, whose disclosure is incorporated herein by reference.
  • FIELD OF THE INVENTION
  • The present invention relates generally to data storage, and particularly to methods and systems for distributed storage.
  • BACKGROUND OF THE INVENTION
  • Various techniques for distributed data storage are known in the art. For example, U.S. Pat. No. 9,112,890, whose disclosure is incorporated herein by reference, describes a method for data storage including, in a system that includes one or more storage controllers, multiple servers and multiple multi-queue storage devices, assigning in each storage device server-specific queues for queuing data-path storage commands exchanged with the respective servers. At least some of the data-path storage commands are exchanged directly between the servers and the storage devices, not via the storage controllers, to be queued and executed in accordance with the corresponding server-specific queues.
  • SUMMARY OF THE INVENTION
  • An embodiment of the present invention that is described herein provides a method for data storage including, in a system that includes multiple servers, multiple multi-queue storage devices and at least one storage controller that communicate over a network, running, in a server among the servers, multiple data-path instances (DPs) that operate independently of one another and issue storage commands for execution in the multi-queue storage devices. The storage commands, issued by the multiple DPs running in the server, are multiplexed using an Input-Output Multiplexer (I/O MUX) process. The multiplexed storage commands are executed in the multi-queue storage devices.
  • In some embodiments, executing the multiplexed storage commands includes, in a given multi-queue storage device, queuing the storage commands issued by the multiple DPs running in the server in a single queue pair (QP) associated with the I/O MUX process.
  • In an embodiment, multiplexing the storage commands includes running the I/O MUX process in the server. In an alternative embodiment, multiplexing the storage commands includes running the I/O MUX process in a gateway separate from the server. The method may further include running in the gateway an access-control process that enforces an access-control policy on the storage commands.
  • In a disclosed embodiment, multiplexing and executing the storage commands includes accessing the multi-queue storage devices using remote direct memory access, without running code on the at least one storage controller.
  • There is additionally provided, in accordance with an embodiment of the present invention, a computing system including multiple multi-queue storage devices, at least one storage controller, and multiple servers. A server among the servers is configured to run multiple data-path instances (DPs) that operate independently of one another and issue storage commands for execution in the multi-queue storage devices. A processor in the computing system is configured to multiplex the storage commands issued by the multiple DPs running in the server using an Input-Output Multiplexer (I/O MUX) process, so as to execute the multiplexed storage commands in the multi-queue storage devices.
  • The present invention will be more fully understood from the following detailed description of the embodiments thereof, taken together with the drawings in which:
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block diagram that schematically illustrates a computing system that uses distributed data storage, in accordance with an embodiment of the present invention;
  • FIG. 2 is a block diagram that schematically illustrates elements of a storage agent used in the system of FIG. 1, in accordance with an embodiment of the present invention;
  • FIG. 3 is a block diagram that schematically illustrates queuing and I/O multiplexing elements in the computing system of FIG. 1, in accordance with an embodiment of the present invention; and
  • FIG. 4 is a block diagram that schematically illustrates a computing system that uses distributed data storage, in accordance with an alternative embodiment of the present invention.
  • DETAILED DESCRIPTION OF EMBODIMENTS
  • Overview
  • Embodiments of the present invention that are described herein provide improved methods and systems for distributed data storage. The disclosed techniques are typically implemented in a computing system comprising multiple servers that store data in multiple shared multi-queue storage devices, and one or more storage controllers. The servers run data-path instances (DPs) that execute storage commands in the storage devices on behalf of user applications. Among other tasks, the DPs perform logical-to-physical address translation and implement redundant storage such as RAID. Computing systems of this sort are described, for example, in U.S. Pat. Nos. 9,112,890, 9,274,720, 9,519,666, 9,521,201, 9,525,737 and 9,529,542, whose disclosures are incorporated herein by reference.
  • In order to improve performance (e.g., increase the number of I/O operations per second (IOPS) and increase throughput), a given server may run two or more DPs in parallel. In some embodiments, the server further comprises an Input/Output multiplexer (I/O MUX) that multiplexes storage commands (e.g., write and read commands) from the various DPs vis-à-vis the storage devices. In other embodiments, the I/O MUXs are not implemented in the servers but on separate gateways. In the latter embodiments, the I/O MUXs may also carry out additional operations, such as enforcing access-control policies.
  • By employing I/O MUXs, the multi-queue storage devices are able to hold a respective queue (or queue pair—QP) per I/O MUX rather than per data-path instance (DP). As a result, system scalability is enhanced significantly. Current NVMe disks, as a non-limiting example, are typically limited to a maximum of 128 queues per disk. Without I/O multiplexing, the system would be limited to no more than 128 DPs. By using the disclosed multiplexing schemes, the number of DPs per server, and the total number of DPs in the system, are virtually unlimited.
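  • By way of a non-limiting illustration, the queue-count arithmetic above can be sketched as follows. The function names, the server count and the DP count are hypothetical values chosen only for the example; the 128-queue limit is the figure cited above.

```python
# Illustrative sketch only: counts the queue pairs (QPs) that a single
# multi-queue storage device must hold, with and without I/O multiplexing.
# All names and the example numbers are assumptions, not taken from the text.
MAX_QUEUES_PER_DEVICE = 128  # typical NVMe limit cited in the description


def queues_needed(num_servers, dps_per_server, muxes_per_server=None):
    """Return the number of QPs one storage device must hold.

    Without multiplexing (muxes_per_server is None), every DP needs a QP.
    With multiplexing, only each I/O MUX needs a QP.
    """
    if muxes_per_server is None:
        return num_servers * dps_per_server
    return num_servers * muxes_per_server


def fits(num_queues):
    """True if the device can hold this many queues."""
    return num_queues <= MAX_QUEUES_PER_DEVICE


# Hypothetical deployment: 32 servers, 8 DPs each, one I/O MUX per server.
without_mux = queues_needed(num_servers=32, dps_per_server=8)
with_mux = queues_needed(num_servers=32, dps_per_server=8, muxes_per_server=1)
```

In this hypothetical deployment the per-DP scheme would exceed the device limit, whereas the per-MUX scheme would not, regardless of how many DPs each server runs.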
  • System Description
  • FIG. 1 is a block diagram that schematically illustrates a computing system 20, in accordance with an embodiment of the present invention. System 20 may comprise, for example, a data center, a High-Performance Computing (HPC) cluster, or any other suitable system. System 20 comprises multiple servers 24 denoted S1 . . . Sn, and multiple storage devices 28 denoted D1 . . . Dm. The servers and storage devices are interconnected by a communication network 32. The system further comprises one or more storage controllers 36 that manage the storage of data in storage devices 28.
  • In the disclosed techniques, data-path operations such as writing and readout are performed directly between the servers and the storage devices, without having to trigger or run code on the storage controller CPUs. The storage controller CPUs are involved only in relatively rare control-path operations. Moreover, the servers do not need to, and typically do not, communicate with one another or otherwise coordinate storage operations with one another. Coordination is typically performed by the servers accessing shared data structures that reside, for example, in the memories of the storage controllers.
  • In the present example, although not necessarily, storage devices 28 are comprised in a storage-device enclosure 30, e.g., a rack, drawer or cabinet. Enclosure 30 further comprises a staging Random Access Memory (RAM) unit 42 that comprises multiple staging RAMS 43. The staging RAM unit is used as a front-end for temporary caching of I/O commands en-route from servers 24 to storage devices 28. Staging RAMS 43 are therefore also referred to herein as interim memory. Enclosure 30 may also comprise a Central Processing Unit (CPU, not shown). The staging RAM and staging RAM unit are also referred to herein as “NVRAM cache” or “cache memory.” The use of staging RAMS 43 is advantageous, for example, in various recovery processes. Moreover, the use of staging RAMS 43 essentially converts write workloads having low queue depth (e.g., queue depth of 1) into write workloads having a high queue depth, thereby significantly improving the average write latency, throughput and IOPS of such workloads.
  • Storage-related functions in each server 24 are carried out by a respective storage agent 40. Agents 40 typically comprise software modules installed and running on the respective servers. In some embodiments described herein, a storage agent 40 may comprise one or more Data-Path (DP) modules 46. DP modules 46 are also referred to herein as “DP instances” or simply as DPs. DPs 46 perform storage related functions for user applications running in the server.
  • A given agent 40 on a given server 24 may run multiple DPs 46, e.g., for increasing throughput and IOPS. In a typical implementation, each DP is pinned to a specific CPU core of the server, and each DP operates independently of the other DPs. When a given agent 40 comprises two or more DPs 46, an I/O multiplexer (MUX) 47 multiplexes the communication between DPs 46 of that agent 40 vis-à-vis staging RAMS 43 and storage devices 28. Running multiple independent DPs on a server is highly effective in achieving scalability and performance improvement, while avoiding various kinds of resource contention.
  • The functions of agents 40, including DPs 46 and MUX 47, and their interaction with the other system elements, are described in detail below. In the example of FIG. 1, each agent 40 (and thus each server 24) comprises two DPs 46. Alternatively, different agents 40 (and thus different servers 24) may comprise different numbers of DPs 46. Any agent 40 (and thus any server 24) may comprise any suitable number of DPs 46. When a certain agent 40 comprises only a single DP, MUX 47 in that agent may be omitted.
  • I/O MUX 47 may run, for example, in a dedicated software thread on the CPU of server 24. Alternatively, MUX 47 may run in the same software thread as one of DPs 46. In some embodiments, a given server 24 may comprise two or more MUXs 47, each serving a subset of the DPs 46 running in the server. Further alternatively, MUX 47 need not necessarily run in server 24, and may alternatively run on any suitable processor in system 20. For example, FIG. 4 below depicts an alternative system configuration in which I/O MUXs 47 run in gateways that are separate from servers 24.
  • Servers 24 may comprise any suitable computing platforms that run any suitable applications. In the present context, the term “server” includes both physical servers and virtual servers. For example, a virtual server may be implemented using a Virtual Machine (VM) that is hosted in some physical computer. Thus, in some embodiments multiple virtual servers may run in a single physical computer. Storage controllers 36, too, may be physical or virtual. In an example embodiment, the storage controllers may be implemented as software modules that run on one or more physical servers 24.
  • Storage devices 28 may comprise any suitable storage medium, such as, for example, Solid State Drives (SSD), Non-Volatile Random Access Memory (NVRAM) devices or Hard Disk Drives (HDDs). In an example embodiment, storage devices 28 comprise multi-queued SSDs that operate in accordance with the NVMe specification. In such an embodiment, each storage device 28 provides multiple queues for storage commands. The storage devices typically have the freedom to queue, schedule and reorder execution of storage commands. The terms “storage commands” and “I/Os” are used interchangeably herein.
  • Network 32 may operate in accordance with any suitable communication protocol, such as Ethernet or Infiniband. In some embodiments, some of the disclosed techniques can be implemented using Direct Memory Access (DMA) and/or Remote Direct Memory Access (RDMA) operations. The embodiments described below refer mainly to RDMA protocols, by way of example. Various variants of RDMA may be used for this purpose, e.g., Infiniband (IB), RDMA over Converged Ethernet (RoCE), Virtual Interface Architecture and internet Wide Area RDMA Protocol (iWARP). Further alternatively, the disclosed techniques can be implemented using any other form of direct memory access over a network, e.g., Direct Memory Access (DMA), various Peripheral Component Interconnect Express (PCIe) schemes, or any other suitable protocol. In the context of the present patent application and in the claims, all such protocols are referred to as “remote direct memory access.” Any of the RDMA operations mentioned herein is performed without triggering or running code on any storage controller CPU.
  • Generally, system 20 may comprise any suitable number of servers, storage devices and storage controllers. In the present example, the system comprises two storage controllers denoted C1 and C2, for resilience. In an example embodiment, both controllers are continuously active and provide backup to one another, and the system is designed to survive failure of a single controller.
  • In the embodiments described herein, the assumption is that any server 24 is able to communicate with any storage device 28, but there is no need for the servers to communicate with one another. Storage controllers 36 are assumed to be able to communicate with all servers 24 and storage devices 28, as well as with one another.
  • FIG. 2 is a block diagram that schematically illustrates elements of storage agent 40, in accordance with an embodiment of the present invention. A respective storage agent of this sort typically runs on each server 24.
  • As noted above, servers 24 may comprise physical and/or virtual servers. Thus, a certain physical computer may run multiple virtual servers 24, each having its own respective storage agent 40. FIG. 2 depicts an example in which agent 40 comprises two DPs 46. As noted above, agent 40 may alternatively comprise any suitable number of DPs, or even a single DP.
  • In some embodiments, each DP 46 performs storage-related functions for one or more user applications 44 running on server 24. Typically, different DPs 46 in a given storage agent 40 (in a given server 24) access storage devices 28 independently of one another.
  • Each DP 46 comprises a Redundant Array of Independent Disks (RAID) layer 48 and a user-volume layer 52. RAID layer 48 carries out a redundant storage scheme over storage devices 28, including handling storage resiliency, detection of storage device failures, rebuilding of failed storage devices and rebalancing of data in case of maintenance or other evacuation of a storage device. RAID layer 48 also typically stripes data across multiple storage devices 28 for improving storage performance.
  • In one simple example embodiment, RAID layer 48 implements a RAID-10 scheme, i.e., replicates and stores two copies of each data item on two different storage devices 28. One of the two copies is defined as primary and the other as secondary. The primary copy is used for readout as long as it is available. If the primary copy is unavailable, for example due to storage-device failure, the RAID layer reverts to reading the secondary copy.
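  • The primary/secondary readout behavior described above may be sketched, purely for illustration, as follows. The Device class, the DeviceUnavailable exception and the function names are hypothetical stand-ins and do not represent an actual storage-device interface.

```python
# Illustrative sketch only: two-copy replication with readout from the
# primary copy and fallback to the secondary copy. All names are assumptions.
class DeviceUnavailable(Exception):
    """Raised by the stand-in device when it has failed."""


class Device:
    """In-memory stand-in for a storage device."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.available = True

    def write(self, addr, data):
        if not self.available:
            raise DeviceUnavailable(self.name)
        self.blocks[addr] = data

    def read(self, addr):
        if not self.available:
            raise DeviceUnavailable(self.name)
        return self.blocks[addr]


def mirrored_write(primary, secondary, addr, data):
    """Store one copy of the data item on each of two different devices."""
    primary.write(addr, data)
    secondary.write(addr, data)


def mirrored_read(primary, secondary, addr):
    """Read the primary copy; revert to the secondary if it is unavailable."""
    try:
        return primary.read(addr)
    except DeviceUnavailable:
        return secondary.read(addr)
```

For example, after `mirrored_write(d1, d2, 0, b"x")`, marking `d1.available = False` still allows `mirrored_read(d1, d2, 0)` to return the data from the secondary copy.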
  • Alternatively, however, RAID layer 48 may implement any other suitable redundant storage scheme (RAID-based or otherwise), such as schemes based on erasure codes, RAID-1, RAID-4, RAID-5, RAID-6, RAID-10, RAID-50, multi-dimensional RAID schemes, or any other suitable redundant storage scheme.
  • Typically, RAID layer 48 stores data in stripes that are distributed over multiple storage devices, each stripe comprising multiple data elements and one or more redundancy elements (e.g., parity) computed over the data elements. In some embodiments the stripes are made up of data and redundancy blocks, but the disclosed techniques can be used with other suitable types of data and redundancy elements. The terms “parity” and “redundancy” are used interchangeably herein. One non-limiting example is RAID-6, in which each stripe comprises N data blocks and two parity blocks.
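  • As a simplified illustration of a stripe having data elements and a redundancy element, the following sketch computes a single XOR parity block over the data blocks and rebuilds one lost block from the survivors. This is a single-parity scheme (in the manner of RAID-5), assumed here for brevity; it is not the two-parity RAID-6 computation, and the function names are hypothetical.

```python
# Illustrative sketch only: single XOR parity over equally sized data blocks.
# This is an assumed simplification and not a RAID-6 implementation.
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripe(data_blocks):
    """Return the data blocks followed by one parity block."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def rebuild(stripe, lost_index):
    """Recover the block at lost_index by XOR-ing all surviving blocks."""
    survivors = [blk for i, blk in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
stripe = make_stripe(data)
```

Because XOR is its own inverse, any single block of the stripe, data or parity, can be recovered from the remaining blocks.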
  • In each DP 46, RAID layer 48 accesses storage devices 28 using physical addressing. In other words, RAID layer 48 exchanges with storage devices 28 read and write commands, as well as responses and retrieved data, which directly specify physical addresses (physical storage locations) on the storage devices. In this embodiment, all logical-to-physical address translations are performed in DPs 46 within agents 40 in servers 24, and none in storage devices 28.
  • RAID layer 48 maps between physical addresses and Logical Volumes (LVs) to be used by user-volume layer 52. Each LV is mapped to two or more physical-address ranges on two or more different storage devices. The two or more ranges are used for storing the replicated copies of the LV data as part of the redundant storage scheme.
  • The redundant storage scheme (e.g., RAID) is thus hidden from user-volume layer 52. Layer 52 views the storage medium as a set of guaranteed-storage LVs. User-volume layer 52 is typically unaware of storage device failure, recovery, maintenance and rebuilding, which are handled transparently by RAID layer 48. (Nevertheless, some optimizations may benefit from such awareness by layer 52. For example, there is no need to rebuild unallocated storage space.)
  • User-volume layer 52 provides storage resources to applications 44 by exposing user volumes that are identified by respective Logical Unit Numbers (LUNs). The terms “user volume” and “LUN” are used interchangeably herein. In other words, a user application 44 views the storage system as a collection of user volumes, and issues storage commands having user-volume addresses.
  • In the embodiments described herein, the user-volume addresses are also referred to as User Block Addresses (UBAs) and the LV addresses are also referred to as RAID Block Addresses (RBAs). Thus, layer 52 in each DP 46 (within agent 40 running in server 24) translates between UBAs and RBAs.
  • Each DP 46 translates between the different address spaces using a RAID table 56 and a volume map 60. RAID table 56 holds the translation between LV addresses and physical addresses, and volume map 60 holds the translation between user-volume addresses and LV addresses.
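  • The two-stage translation described above, from user-volume addresses (UBAs) to LV addresses (RBAs) and from RBAs to physical addresses, may be sketched as follows. The extent-based table layouts, the example entries and the function names are assumptions made for the example; the actual formats of volume map 60 and RAID table 56 are not specified here.

```python
# Illustrative sketch only: UBA -> RBA -> physical translation using
# hypothetical extent tables.
# Volume map entries: (lun, uba_start, length, rba_start)
VOLUME_MAP = [
    ("lun0", 0, 100, 1000),
    ("lun1", 0, 50, 1100),
]
# RAID table entries: (rba_start, length, [(device_id, phys_start), ...])
RAID_TABLE = [
    (1000, 150, [("D1", 0), ("D2", 5000)]),
]


def uba_to_rba(lun, uba):
    """Translate a user-volume address to an LV address (volume map)."""
    for v_lun, uba_start, length, rba_start in VOLUME_MAP:
        if v_lun == lun and uba_start <= uba < uba_start + length:
            return rba_start + (uba - uba_start)
    raise KeyError((lun, uba))


def rba_to_physical(rba):
    """Translate an LV address to one physical address per replica."""
    for rba_start, length, replicas in RAID_TABLE:
        if rba_start <= rba < rba_start + length:
            offset = rba - rba_start
            return [(dev, phys + offset) for dev, phys in replicas]
    raise KeyError(rba)


def translate(lun, uba):
    """Full translation as performed inside a DP in this sketch."""
    return rba_to_physical(uba_to_rba(lun, uba))
```

In this sketch each RBA range maps to two physical ranges on two different devices, matching the replicated-copy arrangement described above.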
  • Typically, any server 24 may attach to any user volume. A given user volume may have multiple servers attached thereto. In some embodiments, storage controllers 36 define and maintain a global volume map that specifies all user volumes in system 20. Volume map 60 in each DP 46 comprises a locally-cached copy of at least part of the global volume map. In agent 40 of a given server, volume maps 60 of DPs 46 hold at least the mapping of the user volumes (LUNs) to which this server is attached. In an embodiment, volume map 60 supports thin provisioning.
  • Multiplexing of Multiple Data-Path Instances
  • In the embodiments described herein, each I/O MUX 47 multiplexes storage commands of two or more DPs 46. The use of MUXs 47 enables each multi-queue storage device 28 to hold a respective queue per MUX 47, rather than per DP 46. Since storage devices are typically limited in the number of queues they are able to support, the disclosed embodiments enhance the scalability of system 20 considerably.
  • FIG. 3 is a block diagram that schematically illustrates queuing and I/O multiplexing elements in system 20, in accordance with an embodiment of the present invention. In this example, storage agent 40 comprises multiple data-path instances (DPs) 46. Agent 40 maintains a respective Queue Pair (QP) per DP 46. Each QP comprises a DP Submission Queue 62 (DPSQ) and a DP Completion Queue (DPCQ) 63. In addition, each storage device 28 maintains a respective QP that comprises a Storage-Device Submission Queue (SDSQ) 64 and a Storage-Device Completion Queue (SDCQ) 65.
  • In one non-limiting example, each DPSQ 62 and each DPCQ 63 may be implemented as an in-memory shared cyclic queue, with a phase bit that toggles between “0” and “1” on each wraparound of the queue. The phase bit is typically used to determine the location of the producer. In one example implementation, the phase bit is initially set to “1” such that all commands submitted in the first cycle of the submission queue (until the first wraparound) have their phase bit set to “1”. At the end of the cycle, when the cyclic queue wraps around, the phase bit is set to “0”, so that any new command is submitted with a “0” phase bit. Toggling continues in a similar manner for subsequent cycles. It is assumed that the queue is initially zeroed. Using this technique, a consumer can easily detect the position of the most-recently submitted command in the cyclic queue. The location of the consumer of a DPSQ is sent through the corresponding DPCQ, and vice versa.
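  • A minimal single-process sketch of such a phase-bit cyclic queue is given below. The class and method names are hypothetical, the one-empty-slot fullness test is an assumed convention, and the reporting of the consumer location (which in the description travels through the opposite queue) is modeled as a direct method call.

```python
# Illustrative sketch only: cyclic queue whose entries carry a phase bit
# that toggles on each wraparound. Names and the fullness rule are assumptions.
class PhaseQueue:
    def __init__(self, size):
        self.size = size
        self.slots = [(0, None)] * size  # queue is initially zeroed
        self.tail = 0                    # producer position
        self.prod_phase = 1              # phase written during the first cycle
        self.head = 0                    # consumer position
        self.cons_phase = 1              # phase the consumer expects
        self.known_head = 0              # consumer location last reported

    def submit(self, cmd):
        """Producer side: write cmd with the current phase bit."""
        if (self.tail + 1) % self.size == self.known_head:
            return False                 # would overrun the consumer
        self.slots[self.tail] = (self.prod_phase, cmd)
        self.tail = (self.tail + 1) % self.size
        if self.tail == 0:
            self.prod_phase ^= 1         # toggle on wraparound
        return True

    def poll(self):
        """Consumer side: an entry is new only if its phase bit matches."""
        phase, cmd = self.slots[self.head]
        if phase != self.cons_phase:
            return None                  # stale or zeroed slot
        self.head = (self.head + 1) % self.size
        if self.head == 0:
            self.cons_phase ^= 1         # expect the toggled phase next cycle
        return cmd

    def report_consumer_location(self):
        """Stand-in for passing the consumer location back to the producer."""
        self.known_head = self.head
```

The consumer needs no shared producer index in this sketch: a slot whose phase bit differs from the expected value is either zeroed or left over from the previous cycle.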
  • A storage command issued by a DP 46 typically comprises (i) the type of command (e.g., read or write), (ii) relevant storage-device parameters such as LBA, (iii) remote staging RAM locations, and (iv) local data pointers for the RDMA transfer.
  • In an example embodiment, write commands are executed as follows:
      • Each DP 46 writes the data of the write commands it issues, using RDMA, to the locations in staging RAMS 43 specified in the commands.
      • Each DP 46 queues the write commands in its respective DPSQ 62. The queued command comprises parameters such as (i) storage-device ID, (ii) LBA and size, and (iii) staging RAM location ID.
      • I/O MUX 47 reads the write commands from the various DPSQs 62 and sends each write command to the SDSQ 64 of the appropriate storage device 28 (according to the storage device ID specified in the command). MUX 47 typically serves DPSQs 62 using a certain fair scheduling scheme, e.g., Round Robin.
      • Once a certain storage device 28 completes execution of a write command, the storage device writes a completion notification to its respective SDCQ 65.
      • MUX 47 reads the completion notifications from the various SDCQs 65 and sends each completion notification to the DPCQ 63 of the appropriate DP 46. In this direction, too, MUX 47 typically serves SDCQs 65 of the various storage devices 28 using Round Robin or other fair scheduling scheme.
      • Each DP 46 reads the completion notifications from its respective DPCQ 63, and performs completion processing accordingly. As part of the completion processing, the DP typically acknowledges completion to the appropriate application 44.
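  • The multiplexing steps of the write flow above may be sketched as follows, with in-memory queues standing in for DPSQs 62, DPCQs 63, SDSQs 64 and SDCQs 65. The class name, the dictionary-based command format, and the tagging of each command with its originating DP are assumptions introduced for the example.

```python
# Illustrative sketch only: an I/O MUX serving DPSQs in Round Robin order and
# routing completions back to the originating DP. All names are assumptions.
from collections import deque


class Mux:
    def __init__(self, num_dps, device_ids):
        self.dpsq = [deque() for _ in range(num_dps)]
        self.dpcq = [deque() for _ in range(num_dps)]
        self.sdsq = {dev: deque() for dev in device_ids}
        self.sdcq = {dev: deque() for dev in device_ids}

    def forward_submissions(self):
        """One Round Robin pass: take at most one command from each DPSQ."""
        for dp_id, queue in enumerate(self.dpsq):
            if queue:
                cmd = queue.popleft()
                # Route by the storage-device ID; remember the issuing DP.
                self.sdsq[cmd["device"]].append({**cmd, "dp": dp_id})

    def forward_completions(self):
        """One pass: take at most one completion from each SDCQ."""
        for queue in self.sdcq.values():
            if queue:
                done = queue.popleft()
                self.dpcq[done["dp"]].append(done)


def device_execute(mux, dev):
    """Stand-in for a storage device draining its SDSQ into its SDCQ."""
    while mux.sdsq[dev]:
        cmd = mux.sdsq[dev].popleft()
        mux.sdcq[dev].append({"dp": cmd["dp"], "lba": cmd["lba"], "status": "ok"})
```

Taking at most one entry per queue per pass is one simple way to obtain the fairness property; other fair scheduling schemes could be substituted.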
  • In an example embodiment, read commands are executed as follows:
      • Each DP 46 queues read commands in its respective DPSQ 62. The queued command comprises parameters such as (i) storage-device ID, (ii) LBA and size, and (iii) staging RAM location ID.
      • I/O MUX 47 reads the read commands from the various DPSQs 62 and sends each read command to the SDSQ 64 of the appropriate storage device 28.
      • Storage devices 28 read the read commands from their SDSQs, retrieve the data requested by each read command and store it in the respective staging RAM location.
      • Once a storage device 28 completes execution of a read command, the storage device writes a completion notification to its respective SDCQ 65.
      • MUX 47 reads the completion notifications from the various SDCQs 65 and sends each completion notification to the DPCQ 63 of the appropriate DP 46.
      • DP 46 reads the completion notification from its DPCQ 63, and reads the data of the read command from the staging RAM using RDMA to the buffer provided by the requesting application 44.
      • DP 46 then performs completion processing. As part of the completion processing, the DP typically acknowledges completion to application 44.
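  • The role of the staging RAM in the read flow above may be illustrated by the following sketch, in which dictionaries stand in for the storage medium and for the staging RAM, and a local copy stands in for the RDMA read. The names, the command fields, and the release of the staging location after the copy are assumptions.

```python
# Illustrative sketch only: a read lands in a staging RAM location and is
# then copied to the application buffer. All names are assumptions.
staging_ram = {}                               # staging location ID -> data
device_media = {("D1", 100): b"hello"}         # (device ID, LBA) -> stored data


def device_read(cmd):
    """Stand-in for the device: fetch data and place it in the staging RAM."""
    data = device_media[(cmd["device"], cmd["lba"])]
    staging_ram[cmd["staging_loc"]] = data
    return {"staging_loc": cmd["staging_loc"], "status": "ok"}


def dp_complete_read(completion, app_buffer):
    """Stand-in for the DP: copy from staging RAM into the application buffer.

    The copy models the RDMA read; freeing the staging location afterwards
    is an assumption of this sketch.
    """
    data = staging_ram.pop(completion["staging_loc"])
    app_buffer[: len(data)] = data
    return len(data)
```

A usage example: `dp_complete_read(device_read({"device": "D1", "lba": 100, "staging_loc": "slot7"}), bytearray(8))` copies five bytes into the buffer.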
  • The write-command and read-command execution flows described above are example flows, which are depicted purely for the sake of conceptual clarity. In alternative embodiments, any other suitable flows can be used.
  • I/O Multiplexing and Access Control Implemented on Separate Gateways
  • FIG. 4 is a block diagram that schematically illustrates a computing system that uses distributed data storage, in accordance with an alternative embodiment of the present invention. In the present example, in contrast to the example of FIG. 1, I/O MUXs 47 are implemented in separate Gateways (GWs) 70 and not in servers 24. In the embodiment of FIG. 4, the system comprises a respective GW 70 for each storage device 28. GWs 70 may comprise, for example, dedicated computers or other processors, located at any suitable location in the system.
  • In some embodiments, each GW 70 further comprises a respective access control module (ACC) 74. Each ACC 74 permits or denies execution of storage commands on the respective storage device 28, in accordance with a certain access-control policy. For example, the policy may specify that each DP is allowed to access only volumes that are mapped thereto. Without access control, malicious hosts may illegitimately gain access to information they are not entitled to access on storage devices 28. In this example, each ACC 74 is notified of the volumes to which the various DPs 46 are mapped, and is therefore able to permit or deny each storage command, as appropriate. In one embodiment, each ACC 74 is provided with a “reverse volume map” that maps RBA ranges to the (one or more) servers mapped to the volumes having these RBA ranges.
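  • An access-control check based on a reverse volume map may be sketched, for illustration only, as follows. The list-of-ranges layout, the example entries and the function name are assumptions; the sketch permits a command only if every block it touches lies in a range mapped to the issuing server.

```python
# Illustrative sketch only: permit or deny a storage command using a
# hypothetical reverse volume map of (rba_start, length, servers) entries.
REVERSE_VOLUME_MAP = [
    (0, 1000, {"S1"}),
    (1000, 500, {"S1", "S2"}),
]


def is_permitted(server, rba, size):
    """True if [rba, rba + size) is fully covered by ranges mapped to server."""
    end = rba + size
    pos = rba
    for start, length, servers in sorted(REVERSE_VOLUME_MAP, key=lambda r: r[0]):
        if start <= pos < start + length:
            if server not in servers:
                return False             # range belongs to other servers only
            pos = min(end, start + length)
            if pos >= end:
                return True              # whole command range is covered
    return False                         # part of the range is unmapped
```

In this example map, server S2 may access RBAs 1000 through 1499 but is denied any command that touches RBAs below 1000.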
  • Additionally or alternatively, ACCs 74 may enforce any other suitable access-control policy. ACCs are typically configured with the policy by storage controllers 36.
  • In some embodiments, MUX 47 and ACC 74 are implemented as separate software processes on GW 70. In other embodiments, the functions of MUX 47 and ACC 74 on a given GW 70 may be carried out jointly by a single software module.
  • In the embodiment of FIG. 4, in which MUXs 47 are remote from agents 40, each agent 40 runs one or more “MUX DP” (MDP) modules, and each GW 70 runs one or more host-specific (i.e., server-specific) submission queues (HSQs) and one or more respective host-specific (i.e., server-specific) completion queues (HCQs). The MDPs, HSQs and HCQs are not shown in the figure for the sake of clarity.
  • An MDP in a certain agent 40 is defined as the consumer of some or all of DPSQs 62 of DPs 46 in that agent 40. Similarly, an MDP in an agent 40 is defined as the producer of some or all of DPCQs 63 of DPs 46 in that agent 40. In some embodiments, each agent comprises only a single MDP. In other embodiments, a given agent 40 may comprise two or more MDPs that run in different contexts, for enhancing performance. In such embodiments, each MDP is typically assigned to a respective subset of the DPs in agent 40. An MDP may run in the same context (e.g., same thread or same CPU core) as one of the DPs, or in a dedicated context.
  • Each MDP is configured to aggregate its assigned DPSQs in a fair manner, e.g., Round Robin, and to write each command (using RDMA) to the appropriate HSQ on the appropriate GW 70. In an embodiment, in order to preserve ordering, the same QPs are used for (i) RDMA writes that write data of write commands, and (ii) RDMA writes that write the write commands themselves.
  • In an example embodiment, write commands are executed in the system of FIG. 4 as follows:
      • Each DP 46 writes the data of the write commands it issues, using RDMA, to the locations in staging RAMS 43 specified in the commands.
      • Each DP 46 queues the write commands in its respective DPSQ 62. The queued command comprises parameters such as (i) storage-device ID, (ii) LBA and size, and (iii) staging RAM location ID.
      • Each MDP copies the write commands from its assigned DPSQs 62 to the appropriate HSQ on the appropriate GW 70. The MDPs keep track of the original commands, in order to be able to post the completion notifications to the correct DPCQs. An MDP may coalesce and send two or more commands to a given HSQ, possibly originating from different DPSQs, at the same time.
      • Each I/O MUX 47 reads the write commands from the HSQs on its respective GW 70, and sends each write command to the appropriate SDSQ 64 of the storage device 28. At this stage, ACC 74 may perform access-control checks, and selectively permit or deny access to the storage device, per the access-control policy.
      • Once a certain storage device 28 completes execution of a write command, the storage device writes a completion notification to its respective SDCQ 65.
      • I/O MUXs 47 read the completion notifications from the various SDCQs 65. Each MUX 47 posts the completion notifications it reads on the respective HCQ in the respective GW 70.
      • The MDPs read the completion notifications from the HCQs and send each completion notification to the DPCQ 63 of the appropriate DP 46.
      • Each DP 46 reads the completion notifications from its respective DPCQ 63, and performs completion processing accordingly. As part of the completion processing, the DP typically acknowledges completion to the appropriate application 44.
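  • The MDP behavior in the flow above, namely aggregating its assigned DPSQs fairly, coalescing several commands for one HSQ, and tracking the original commands so that completions reach the correct DP, may be sketched as follows. The class name, the tag mechanism, the batch-size parameter and the single-HSQ simplification are assumptions.

```python
# Illustrative sketch only: an MDP that coalesces commands from several
# DPSQs and routes completions back by tag. All names are assumptions.
from collections import deque
from itertools import count


class Mdp:
    def __init__(self, num_dps):
        self.dpsq = [deque() for _ in range(num_dps)]
        self.dpcq = [deque() for _ in range(num_dps)]
        self.pending = {}        # tag -> index of the originating DP
        self._tags = count()

    def collect_batch(self, max_batch):
        """Round Robin over the DPSQs, coalescing up to max_batch commands.

        The returned list models what would be written to one HSQ in a
        single RDMA write.
        """
        batch = []
        progressed = True
        while len(batch) < max_batch and progressed:
            progressed = False
            for dp_id, queue in enumerate(self.dpsq):
                if queue and len(batch) < max_batch:
                    tag = next(self._tags)
                    self.pending[tag] = dp_id
                    batch.append({**queue.popleft(), "tag": tag})
                    progressed = True
        return batch

    def deliver(self, completion):
        """Post a completion read from the HCQ to the originating DP's DPCQ."""
        dp_id = self.pending.pop(completion["tag"])
        self.dpcq[dp_id].append(completion)
```

Tagging each forwarded command is one possible way of tracking the original commands; any other bookkeeping that maps a completion back to its DP would serve equally well.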
  • In an example embodiment, read commands in the system of FIG. 4 are executed as follows:
      • Each DP 46 queues read commands in its respective DPSQ 62. The queued command comprises parameters such as (i) storage-device ID, (ii) LBA and size, (iii) staging RAM location ID, and (iv) local address and rkey.
      • The MDPs copy the read commands from the various DPSQs 62 to the appropriate HSQs. The MDPs keep track of the original commands, in order to be able to post the completion notifications to the correct DPCQs. An MDP may coalesce and send two or more commands to a given HSQ, possibly originating from different DPSQs, at the same time.
      • Each I/O MUX 47 reads the read commands from the HSQs on its GW 70, and sends each read command to the SDSQ 64 of the appropriate storage device 28. At this stage, ACC 74 may perform access-control checks, and selectively permit or deny access to the storage device, per the access-control policy.
      • Storage devices 28 read the read commands from their SDSQs, retrieve the data requested by each read command, and store it in the staging RAM.
      • Once a storage device 28 completes execution of a read command, the storage device writes a completion notification to its respective SDCQ 65.
      • Each MUX 47 reads the completion notifications from the various SDCQs 65 of its assigned storage device, and posts each completion notification on the appropriate HCQ.
      • The MDPs read the completion notifications from the HCQs and send each completion notification to the DPCQ 63 of the appropriate DP 46.
      • DP 46 reads the completion notification from its DPCQ 63, and uses RDMA to read the data of the read command from the staging RAM into the buffer provided by the requesting application 44. Alternatively, the gateway may perform an RDMA write of the data directly to the buffer provided by application 44, along with posting a completion to the HCQ.
      • DP 46 then performs completion processing. As part of the completion processing, the DP typically acknowledges completion to application 44.
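  • The multiplexing pattern in the flows above (several per-DP queues merged into one shared queue, with completions routed back to the originating DP) can be sketched as follows. This is an illustrative model only: the tag-based bookkeeping and all class and attribute names are assumptions, since the description says only that the MDPs keep track of the original commands.

```python
from collections import deque
from itertools import count


class Multiplexer:
    """Merges several submission queues into one and routes completions back."""

    def __init__(self, num_sources):
        self.submission_queues = [deque() for _ in range(num_sources)]  # e.g. DPSQs
        self.completion_queues = [deque() for _ in range(num_sources)]  # e.g. DPCQs
        self.shared_sq = deque()   # e.g. an HSQ
        self.shared_cq = deque()   # e.g. an HCQ
        self._origin = {}          # tag -> index of the source queue
        self._tags = count()

    def pump_submissions(self):
        """Copy every pending command to the shared queue, remembering its origin."""
        for source, sq in enumerate(self.submission_queues):
            while sq:
                tag = next(self._tags)
                self._origin[tag] = source
                self.shared_sq.append((tag, sq.popleft()))

    def pump_completions(self):
        """Deliver each completion to the completion queue of its originating source."""
        while self.shared_cq:
            tag, status = self.shared_cq.popleft()
            source = self._origin.pop(tag)
            self.completion_queues[source].append(status)
```

  • Because each command carries its own tag, completions may arrive in any order and still reach the correct completion queue.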
  • A given I/O MUX 47, running on a given GW 70, may be implemented as a single-threaded or multi-threaded (or otherwise multi-core) process. In a multi-threaded MUX 47, each thread of the MUX typically runs on a dedicated core of GW 70 and is responsible for a configurable subset of servers 24. MUX 47 typically polls the HSQs of its designated servers 24, issues storage commands (I/Os) to storage devices 28, and polls the CQs of the storage devices. The CQs are multiplexed back into HCQs, which are in turn RDMA-written back to the servers. Similarly to the example implementation of the DPSQs and DPCQs, a phase bit and a consumer location are passed in the HSQ and HCQ. Upon completion of a read command, the relevant data is either RDMA-written to the server by the gateway, or RDMA-read by the server from the staging RAM.
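  • One common way to use a phase bit and a consumer location in a polled ring queue is sketched below. This is a generic single-producer, single-consumer model, not the implementation described in the source: the slot layout, the wrap-around rule and the names are assumptions. The producer stamps each entry with its current phase and flips the phase on every wrap; the consumer treats a slot as new only if its phase matches the phase it expects, and reports its location so that the producer does not overwrite unread entries.

```python
class PhaseBitRing:
    """Polled ring queue; payloads are assumed to be non-None."""

    def __init__(self, size):
        self.size = size
        self.slots = [(0, None)] * size  # (phase, payload); phase 0 = never written
        self.tail = 0                    # next slot the producer writes
        self.producer_phase = 1
        self.head = 0                    # next slot the consumer reads
        self.consumer_phase = 1
        self.consumer_location = 0       # head value last reported to the producer

    def produce(self, payload):
        # Full when advancing the tail would reach the reported consumer location.
        if (self.tail + 1) % self.size == self.consumer_location:
            return False
        self.slots[self.tail] = (self.producer_phase, payload)
        self.tail = (self.tail + 1) % self.size
        if self.tail == 0:
            self.producer_phase ^= 1     # flip the phase on wrap-around
        return True

    def poll(self):
        phase, payload = self.slots[self.head]
        if phase != self.consumer_phase:
            return None                  # slot holds stale data: nothing new yet
        self.head = (self.head + 1) % self.size
        if self.head == 0:
            self.consumer_phase ^= 1
        self.consumer_location = self.head
        return payload
```

  • The phase bit lets the consumer detect new entries by inspecting only the slot itself, without reading a producer index, which suits queues that are written remotely (for example by RDMA).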
  • The write-command and read-command execution flows described above are example flows, which are depicted purely for the sake of conceptual clarity. In alternative embodiments, any other suitable flows can be used. For example, in some embodiments the MDPs can be omitted, meaning that each DP essentially serves as an MDP.
  • The system configurations shown in FIGS. 1-3 are example configurations, which are chosen purely for the sake of conceptual clarity. In alternative embodiments, any other suitable system configurations can be used. For example, the system may comprise only a single storage controller 36. In other embodiments, the functionality of storage controllers 36 may be distributed among servers 24.
  • Certain aspects of distributed storage systems of the sort shown in FIGS. 1 and 2 are also addressed in U.S. Pat. Nos. 9,112,890, 9,274,720, 9,519,666, 9,521,201, 9,525,737, 9,529,542 and 9,842,084, cited above.
  • The different system elements may be implemented using suitable hardware, using software, or using a combination of hardware and software elements. Each server 24 typically comprises a suitable network interface for communicating over network 32, e.g., with the storage devices and/or storage controllers, and a suitable processor that carries out the various server functions. Each storage controller 36 typically comprises a suitable network interface for communicating over network 32, e.g., with the storage devices and/or servers, and a suitable processor that carries out the various storage controller functions.
  • In some embodiments, servers 24 and/or storage controllers 36 comprise general-purpose processors, which are programmed in software to carry out the functions described herein. The software may be downloaded to the processors in electronic form, over a network, for example, or it may, alternatively or additionally, be provided and/or stored on non-transitory tangible media, such as magnetic, optical, or electronic memory.
  • In some embodiments, I/O MUXs 47 may also be used to add extra features missing from the underlying storage devices 28. Non-limiting examples of such features may comprise atomic operations, reservations and copy offload operations. When processing storage commands, a MUX 47 may coordinate such atomic operations, lock ranges to implement reservations, or offload the copying of data from one range to another by performing the copy itself, without involving servers 24 or storage controllers 36.
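  • As an illustration of range locking for reservations, the following minimal sketch tracks locked LBA ranges on one storage device and refuses a lock that overlaps a range held by a different owner. The data structure, the half-open range convention and the names are hypothetical; the description does not say how a MUX implements reservations.

```python
class RangeLocks:
    """Tracks locked [start, end) LBA ranges on one storage device."""

    def __init__(self):
        self._held = {}  # (start, end) -> owner

    def try_lock(self, owner, start, end):
        """Lock [start, end) unless it overlaps a range held by another owner."""
        for (s, e), holder in self._held.items():
            if start < e and s < end and holder != owner:
                return False
        self._held[(start, end)] = owner
        return True

    def unlock(self, owner, start, end):
        """Release a range, but only if this owner holds exactly that range."""
        if self._held.get((start, end)) == owner:
            del self._held[(start, end)]
```

  • A MUX built along these lines could check `try_lock` before forwarding a command and defer or reject commands that touch a range reserved by another server.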
  • Although the embodiments described herein mainly address block storage applications, the methods and systems described herein can also be used in other applications, such as in file and object storage applications, as well as in database storage.
  • It will thus be appreciated that the embodiments described above are cited by way of example, and that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention includes both combinations and sub-combinations of the various features described hereinabove, as well as variations and modifications thereof which would occur to persons skilled in the art upon reading the foregoing description and which are not disclosed in the prior art. Documents incorporated by reference in the present patent application are to be considered an integral part of the application except that to the extent any terms are defined in these incorporated documents in a manner that conflicts with the definitions made explicitly or implicitly in the present specification, only the definitions in the present specification should be considered.

Claims (12)

1. A method for data storage, comprising:
in a system that comprises multiple servers, multiple multi-queue storage devices and at least one storage controller that communicate over a network, running, in a server among the servers, multiple data-path instances (DPs) that operate independently of one another and issue storage commands for execution in the multi-queue storage devices;
using an Input-Output Multiplexer (I/O MUX) process, multiplexing the storage commands issued by the multiple DPs running in the server; and
executing the multiplexed storage commands in the multi-queue storage devices, including, in a given multi-queue storage device, queuing the storage commands issued by the multiple DPs running in the server in a single queue pair (QP) associated with the I/O MUX process.
2. (canceled)
3. The method according to claim 1, wherein multiplexing the storage commands comprises running the I/O MUX process in the server.
4. The method according to claim 1, wherein multiplexing the storage commands comprises running the I/O MUX process in a gateway separate from the server.
5. The method according to claim 4, further comprising running in the gateway an access-control process that enforces an access-control policy on the storage commands.
6. The method according to claim 1, wherein multiplexing and executing the storage commands comprises accessing the multi-queue storage devices using remote direct memory access, without running code on the at least one storage controller.
7. A computing system, comprising:
multiple multi-queue storage devices;
at least one storage controller;
multiple servers, wherein a server among the servers is configured to run multiple data-path instances (DPs) that operate independently of one another and issue storage commands for execution in the multi-queue storage devices; and
a processor configured to multiplex the storage commands issued by the multiple DPs running in the server using an Input-Output Multiplexer (I/O MUX) process, so as to execute the multiplexed storage commands in the multi-queue storage devices,
wherein a given multi-queue storage device is configured to queue the storage commands issued by the multiple DPs running in the server in a single queue pair (QP) associated with the I/O MUX process.
8. (canceled)
9. The system according to claim 7, wherein the processor, which is configured to multiplex the storage commands, is a processor in the server.
10. The system according to claim 7, wherein the processor, which is configured to multiplex the storage commands, is a processor in a gateway separate from the server.
11. The system according to claim 10, wherein the processor in the gateway is further configured to run an access-control process that enforces an access-control policy on the storage commands.
12. The system according to claim 7, wherein the server and the processor are configured to multiplex and execute the storage commands by accessing the multi-queue storage devices using remote direct memory access, without running code on the at least one storage controller.
US15/847,992 2017-01-23 2017-12-20 Storage in multi-queue storage devices using queue multiplexing and access control Expired - Fee Related US10031872B1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/847,992 US10031872B1 (en) 2017-01-23 2017-12-20 Storage in multi-queue storage devices using queue multiplexing and access control

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201762449131P 2017-01-23 2017-01-23
US15/847,992 US10031872B1 (en) 2017-01-23 2017-12-20 Storage in multi-queue storage devices using queue multiplexing and access control

Publications (2)

Publication Number Publication Date
US10031872B1 US10031872B1 (en) 2018-07-24
US20180210848A1 US20180210848A1 (en) 2018-07-26

Family

ID=62874188

Family Applications (1)

Application Number Title Priority Date Filing Date
US15/847,992 Expired - Fee Related US10031872B1 (en) 2017-01-23 2017-12-20 Storage in multi-queue storage devices using queue multiplexing and access control

Country Status (1)

Country Link
US (1) US10031872B1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10685010B2 (en) 2017-09-11 2020-06-16 Amazon Technologies, Inc. Shared volumes in distributed RAID over shared multi-queue storage devices
US10659469B2 (en) 2018-02-13 2020-05-19 Bank Of America Corporation Vertically integrated access control system for managing user entitlements to computing resources
US10607022B2 (en) * 2018-02-13 2020-03-31 Bank Of America Corporation Vertically integrated access control system for identifying and remediating flagged combinations of capabilities resulting from user entitlements to computing resources
JP2022048716A (en) * 2020-09-15 2022-03-28 キオクシア株式会社 Storage system

Family Cites Families (55)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6446220B1 (en) 1998-08-04 2002-09-03 International Business Machines Corporation Updating data and parity data with and without read caches
US6584517B1 (en) 1999-07-02 2003-06-24 Cypress Semiconductor Corp. Circuit and method for supporting multicast/broadcast operations in multi-queue storage devices
DE50014591D1 (en) 2000-11-27 2007-10-04 Siemens Ag Bandwidth reservation in data networks
US7225242B2 (en) 2001-01-26 2007-05-29 Dell Products L.P. System and method for matching storage device queue depth to server command queue depth
FR2820922B1 (en) 2001-02-12 2005-02-18 Thomson Csf METHOD FOR ENSURING LATENCY TIME OF COMMUNICATIONS BETWEEN AT LEAST TWO DATA PASSING POINTS
TW579463B (en) 2001-06-30 2004-03-11 Ibm System and method for a caching mechanism for a central synchronization server
US6671778B2 (en) 2001-08-03 2003-12-30 Hewlett-Packard Development Company, L.P. Atomic resolution storage device configured as a redundant array of independent storage devices
US20030105830A1 (en) * 2001-12-03 2003-06-05 Duc Pham Scalable network media access controller and methods
US20050050273A1 (en) 2003-08-27 2005-03-03 Horn Robert L. RAID controller architecture with integrated map-and-forward function, virtualization, scalability, and mirror consistency
US7975018B2 (en) 2004-07-07 2011-07-05 Emc Corporation Systems and methods for providing distributed cache coherence
US20060179197A1 (en) * 2005-02-10 2006-08-10 International Business Machines Corporation Data processing system, method and interconnect fabric having a partial response rebroadcast
US7743214B2 (en) 2005-08-16 2010-06-22 Mark Adams Generating storage system commands
US7500071B2 (en) 2005-08-31 2009-03-03 International Business Machines Corporation Method for out of user space I/O with server authentication
US8595313B2 (en) 2005-11-29 2013-11-26 Netapp. Inc. Systems and method for simple scale-out storage clusters
US8095763B2 (en) 2007-10-18 2012-01-10 Datadirect Networks, Inc. Method for reducing latency in a raid memory system while maintaining data integrity
US8195912B2 (en) 2007-12-06 2012-06-05 Fusion-io, Inc Apparatus, system, and method for efficient mapping of virtual and physical addresses
WO2010030996A1 (en) 2008-09-15 2010-03-18 Virsto Software Storage management system for virtual machines
US9164689B2 (en) 2009-03-30 2015-10-20 Oracle America, Inc. Data storage system and method of processing a data access request
US8364923B2 (en) 2009-03-30 2013-01-29 Oracle America, Inc. Data storage system manager and method for managing a data storage system
US9973446B2 (en) 2009-08-20 2018-05-15 Oracle International Corporation Remote shared server peripherals over an Ethernet network for resource virtualization
US8601222B2 (en) 2010-05-13 2013-12-03 Fusion-Io, Inc. Apparatus, system, and method for conditional and atomic storage operations
EP2476079A4 (en) 2009-09-09 2013-07-03 Fusion Io Inc Apparatus, system, and method for allocating storage
JP5358736B2 (en) 2009-11-10 2013-12-04 株式会社日立製作所 Storage system with multiple controllers
US8510265B1 (en) 2010-03-31 2013-08-13 Emc Corporation Configuration utility for a data storage system using a file mapping protocol for access to distributed file systems
EP2553872A1 (en) 2010-04-01 2013-02-06 Research In Motion Limited Methods and apparatus to collboratively manage a client using multiple servers
US8725934B2 (en) 2011-12-22 2014-05-13 Fusion-Io, Inc. Methods and appratuses for atomic storage operations
US8468318B2 (en) 2010-09-15 2013-06-18 Pure Storage Inc. Scheduling of I/O writes in a storage environment
US8775868B2 (en) 2010-09-28 2014-07-08 Pure Storage, Inc. Adaptive RAID for an SSD environment
US20120144110A1 (en) 2010-12-02 2012-06-07 Lsi Corporation Methods and structure for storage migration using storage array managed server agents
US8812450B1 (en) 2011-04-29 2014-08-19 Netapp, Inc. Systems and methods for instantaneous cloning
EP2541416B1 (en) 2011-06-27 2019-07-24 Alcatel Lucent Protection against a failure in a computer network
US9223502B2 (en) 2011-08-01 2015-12-29 Infinidat Ltd. Method of migrating stored data and system thereof
US8806160B2 (en) 2011-08-16 2014-08-12 Pure Storage, Inc. Mapping in a storage system
WO2013024485A2 (en) 2011-08-17 2013-02-21 Scaleio Inc. Methods and systems of managing a distributed replica based storage
US8966172B2 (en) 2011-11-15 2015-02-24 Pavilion Data Systems, Inc. Processor agnostic data storage in a PCIE based shared storage enviroment
US8897315B1 (en) 2012-01-06 2014-11-25 Marvell Israel (M.I.S.L) Ltd. Fabric traffic management in a network device
US9817582B2 (en) 2012-01-09 2017-11-14 Microsoft Technology Licensing, Llc Offload read and write offload provider
US9251052B2 (en) 2012-01-12 2016-02-02 Intelligent Intellectual Property Holdings 2 Llc Systems and methods for profiling a non-volatile cache having a logical-to-physical translation layer
US10360176B2 (en) 2012-01-17 2019-07-23 Intel Corporation Techniques for command validation for access to a storage device by a remote client
JP2014130420A (en) 2012-12-28 2014-07-10 Hitachi Ltd Computer system and control method of computer
US8875295B2 (en) * 2013-02-22 2014-10-28 Bitdefender IPR Management Ltd. Memory introspection engine for integrity protection of virtual machines
US20150212752A1 (en) 2013-04-08 2015-07-30 Avalanche Technology, Inc. Storage system redundant array of solid state disk array
US8595385B1 (en) 2013-05-28 2013-11-26 DSSD, Inc. Method and system for submission queue acceleration
US20150012699A1 (en) 2013-07-02 2015-01-08 Lsi Corporation System and method of versioning cache for a clustering topology
US10365858B2 (en) 2013-11-06 2019-07-30 Pure Storage, Inc. Thin provisioning in a storage device
US9916248B2 (en) 2013-12-12 2018-03-13 Hitachi, Ltd. Storage device and method for controlling storage device with compressed and uncompressed volumes and storing compressed data in cache
US9658782B2 (en) 2014-07-30 2017-05-23 Excelero Storage Ltd. Scalable data using RDMA and MMIO
US9112890B1 (en) * 2014-08-20 2015-08-18 E8 Storage Systems Ltd. Distributed storage over shared multi-queued storage device
US9274720B1 (en) 2014-09-15 2016-03-01 E8 Storage Systems Ltd. Distributed RAID over shared multi-queued storage devices
US9519666B2 (en) 2014-11-27 2016-12-13 E8 Storage Systems Ltd. Snapshots and thin-provisioning in distributed storage over shared storage devices
US20160162209A1 (en) 2014-12-05 2016-06-09 Hybrid Logic Ltd Data storage controller
US9529542B2 (en) 2015-04-14 2016-12-27 E8 Storage Systems Ltd. Lockless distributed redundant storage and NVRAM caching of compressed data in a highly-distributed shared topology with direct memory access capable interconnect
US9525737B2 (en) 2015-04-14 2016-12-20 E8 Storage Systems Ltd. Lockless distributed redundant storage and NVRAM cache in a highly-distributed shared topology with direct memory access capable interconnect
US10496626B2 (en) 2015-06-11 2019-12-03 E8 Storage Systems Ltd. Deduplication in a highly-distributed shared topology with direct-memory-access capable interconnect
US9842084B2 (en) 2016-04-05 2017-12-12 E8 Storage Systems Ltd. Write cache and write-hole recovery in distributed raid over shared multi-queue storage devices

Also Published As

Publication number Publication date
US10031872B1 (en) 2018-07-24

Legal Events

Date Code Title Description
AS Assignment

Owner name: E8 STORAGE SYSTEMS LTD., ISRAEL

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:FRIEDMAN, ALEX;REEL/FRAME:044443/0730

Effective date: 20171219

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

STCF Information on status: patent grant

Free format text: PATENTED CASE

AS Assignment

Owner name: AMAZON TECHNOLOGIES, INC., WASHINGTON

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:E8 STORAGE SYSTEMS LTD.;REEL/FRAME:051014/0168

Effective date: 20190730

FEPP Fee payment procedure

Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

LAPS Lapse for failure to pay maintenance fees

Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY

STCH Information on status: patent discontinuation

Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362

FP Lapsed due to failure to pay maintenance fee

Effective date: 20220724