EP3295321A1 - Accessing multiple storage devices from multiple hosts without remote direct memory access (rdma) - Google Patents
Accessing multiple storage devices from multiple hosts without remote direct memory access (RDMA)
- Publication number
- EP3295321A1 (application EP16793169.0A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- data
- storage
- data storage
- queue
- compute
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0602—Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
- G06F3/061—Improving I/O performance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
- G06F3/0659—Command handling arrangements, e.g. command buffers, queues, command scheduling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0668—Interfaces specially adapted for storage systems adopting a particular infrastructure
- G06F3/067—Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
Definitions
- the disclosed subject matter relates to the field of data access storage methods and systems.
- FIGS. 1 through 3 illustrate a Clustered Direct Attach Storage (Clustered DAS) configuration of conventional systems
- FIGS. 4 and 5 illustrate a Network Attached Storage (NAS) or Storage Area Network (SAN) configuration of conventional systems
- Figure 6 illustrates an example embodiment of a switched direct attached shared storage architecture
- Figure 7 illustrates the physical storage media assignment to compute nodes in an example embodiment
- Figure 8 illustrates how each storage slice is the physical unit of abstraction that can be plugged into a storage media container in an example embodiment
- NVMe non-volatile memory express
- Figure 10 illustrates the process in an example embodiment for device management
- Figure 11 illustrates the procedure in an example embodiment for data flow from a compute node to one or more storage slices
- Figure 12 illustrates the procedure in an example embodiment for storage slice sharing:
- Figure 13 illustrates the data flow in a switched DAS architecture of an example embodiment using Ethernet as the transport fabric protocol
- Figure 14 illustrates the encapsulation of an IO operation into a standard Ethernet frame in an example embodiment
- Figures 15 and 16 illustrate an example embodiment for implementing instrumentation hooks to monitor, measure, and enforce performance metrics into the compute, memory, network and storage resources;
- Figures 17 and 18 illustrate an example embodiment for continuous monitoring of the health of all resources to predict failures and proactively adjust/update the cluster resources;
- Figure 19 illustrates the standard NVM Express 1.1 specification wherein an example embodiment implements input/output (IO) acceleration by use of an Ethernet connection;
- Figure 20 illustrates a server to server configuration of the messaging protocol of an example embodiment;
- Figure 21 illustrates the data flow for a sample message using the messaging protocol of an example embodiment
- Figure 22 shows the basic organization of the current flash media
- Figure 23 illustrates the object tag format for the object store of the example embodiment
- Figure 24 shows a specific example of the conventional system shown in Figure 4, where storage is attached via Ethernet using conventional protocols;
- Figure 25 illustrates how NVM Express devices are accessed when locally installed in a server
- Figure 26 illustrates a typical RDMA hardware and software stack required to implement remote access of NVM Express devices
- Figure 27 illustrates an embodiment of the data storage access system of the example embodiments described herein showing the savings in complexity to be gained by use of the example embodiments over the conventional implementation shown in Figure 26;
- Figure 28 illustrates the configuration of queues in the host bus adapter (HBA) or host network interface controller (NIC) in an example embodiment;
- HBA host bus adapter
- NIC host network interface controller
- Figure 29 illustrates a detail of the configuration of queues in the host bus adapter
- HBA host bus adapter
- NIC host network interface controller
- Figure 30 illustrates an architectural view of the storage controller of an example embodiment in network communication with a plurality of host server systems via a storage network;
- Figure 31 illustrates an example of a method for a host/server to communicate I/O requests to devices installed within the data storage access system of an example embodiment
- Figure 32 illustrates example contents of a single Shadow Queue Element in the data storage access system of an example embodiment
- Figures 33 and 34 illustrate example register sets of the data storage access system of an example embodiment used to set up and control the various request and completion queues as described herein;
- Figures 35 and 36 illustrate examples of how a host IO request flows through the data storage access system of an example embodiment
- Figure 37 illustrates a node to node protocol in an example embodiment providing the ability for a plurality of data storage access systems to inter-communicate via unicast, multicast, or broadcast data transmissions using the queues described herein;
- Figure 38 illustrates an example embodiment of a component of the data storage access system of an example embodiment as used within an existing host/server;
- Figure 39 is a flow diagram illustrating the basic processing flow for a particular example embodiment of the data storage access system as described herein;
- Figure 40 shows a diagrammatic representation of a machine in the example form of a data processor within which a set of instructions, for causing the machine to perform any one or more of the methodologies described herein, may be executed.
- Cluster of nodes with integrated storage: in storage industry parlance, this topology is often referred to as a "Clustered Direct Attached Storage" (Clustered DAS or DAS) configuration;
- VSAN Virtual Storage Area Network
- NAS Network Attached Storage
- SAN Storage Area Networks
- FIG. 1 illustrates an example of the conventional Clustered DAS topology.
- Clustered DAS is typically dedicated to a single server and is not sharable among multiple servers.
- Figure 2 illustrates a software representation of the Clustered DAS with a user-space distributed file system.
- Figure 3 illustrates a software representation of this Clustered DAS with a kernel-space distributed file system.
- VSAN virtual storage area network
- a VSAN enables management software to serve data storage on cluster nodes to other cluster nodes.
- Figure 4 illustrates a software representation of the NAS/SAN.
- Figure 5 illustrates an example of the conventional NAS/SAN topology.
- NAS/SAN can be shared among several server applications.
- sockets and bays for 12 drives in such systems are sparsely populated;
- when flash-based storage media is used with one or two controller heads,
- data can be out of access when a compute node is down until after a coarse recovery, and
- the controller heads become a bottleneck.
- a cluster represents a cluster of nodes, wherein each node has integrated compute capabilities and data storage.
- the architecture of the various embodiments described herein leverages the following features:
- the Switched DAS architecture of an example embodiment has the flexibility to adapt to numerous underlying storage media interface protocols, and can also be extended to other clustering interconnect technologies via protocol encapsulation.
- the various embodiments described herein can be implemented with the most popular and standards based native storage media protocols, such as: NVMe (NVM Express).
- NVMe NVM Express
- SOP SCSI over PCIe
- NVM is an acronym for non-volatile memory, as used in SSDs.
- NVM Express is a specification for accessing solid-state drives (SSDs) attached through the PCI Express (PCIe) bus.
- PCIe Peripheral Component Interconnect Express
- SATA Serial Advanced Technology Attachment
- SAS Serial Attached Small Computer System Interface - SCSI
- Fibre Channel for interfacing with the rest of a computer system.
- SATA has been the most typical way for connecting SSDs in personal computers; however, SATA was designed for mechanical hard disk drives, and has become inadequate for SSDs. For example, unlike hard disk drives, some SSDs are limited by the maximum throughput of SATA.
- Serial Attached SCSI (SAS) is a point-to-point serial protocol that moves data to and from computer storage devices such as hard drives and tape drives.
- a data store switch fabric is implemented using Ethernet protocol and Ethernet data encapsulation. The following sections detail the specific procedures used in an example embodiment for: physical storage media assignment to compute nodes; data flow to/from the compute nodes and storage slices; and sharing of storage media in a Switched DAS cluster via a data store switch fabric.
- Figures 6 and 7 illustrate the physical storage media assignment to compute nodes in an example embodiment.
- Figures 6 and 7 illustrate the physical configuration of the system hardware in an example embodiment.
- the plurality of compute nodes 150 can be interconnected with one or more data storage slices 171 of the physical storage media pool or storage media container 170 via a data store switch fabric 160.
- the compute nodes or servers 150 can also be in data communication with each other via a local area network 165 as shown in Figure 6.
- each data storage slice 171 is the physical unit of abstraction that can be plugged into or otherwise connected with a storage media container 170.
- each storage slice 171 can be associated with the storage controller 172 residing on or in data communication with the storage slice 171.
- Figure 9 illustrates a procedure 801 for assigning storage slices to compute nodes with NVMe storage.
- the procedure includes a cluster manager that distributes storage slice resources by assigning them to one or multiple Virtual Devices or NVMe Logic Units (NLUN) on one or multiple compute nodes.
- NLUN Virtual Devices or NVMe Logic Units
- Each compute node will have an NLUN that consists of physical storage on one or multiple storage slices. Any portion of a storage slice can be shared by one or multiple compute nodes (processing block 810).
- the storage slice, represented by a combination of NVMe storage devices and a corresponding storage controller, can be identified using a media access control address (MAC address).
- MAC address media access control address
- BIOS basic input output system
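The slice-to-NLUN assignment described above (processing block 810) can be sketched as follows. This is an illustrative Python model, not code from the patent; the names (StorageSlice, NLUN, assign_nlun) and the greedy allocation policy are assumptions, while the MAC-address identification of a slice follows the surrounding text.

```python
# Illustrative sketch: a cluster manager builds an NLUN for a compute
# node from free space on one or more storage slices, each slice
# identified by its storage controller's MAC address.

from dataclasses import dataclass, field

@dataclass
class StorageSlice:
    mac: str                  # MAC address of the slice's storage controller
    capacity_gb: int
    free_gb: int = None

    def __post_init__(self):
        if self.free_gb is None:
            self.free_gb = self.capacity_gb

@dataclass
class NLUN:
    node: str
    extents: list = field(default_factory=list)   # (slice MAC, size_gb)

def assign_nlun(node, size_gb, slices):
    """Greedily back an NLUN for `node` with free space from the pool."""
    nlun, remaining = NLUN(node=node), size_gb
    for s in slices:
        if remaining == 0:
            break
        take = min(s.free_gb, remaining)
        if take:
            s.free_gb -= take
            nlun.extents.append((s.mac, take))
            remaining -= take
    if remaining:
        raise RuntimeError("insufficient pooled capacity")
    return nlun

slices = [StorageSlice("aa:bb:cc:00:00:01", 100),
          StorageSlice("aa:bb:cc:00:00:02", 100)]
nlun = assign_nlun("compute-node-1", 150, slices)
print(nlun.extents)   # the NLUN spans both slices
```

A single NLUN here spans two physical slices, matching the statement above that an NLUN "consists of physical storage on one or multiple storage slices."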
- a switched DAS architecture of an example embodiment allows multiple compute nodes to have access to storage slices from different storage containers to increase the data accessibility in the presence of hardware failures.
- three compute nodes (902, 904, and 906) are shown in Figure 7. Each of these compute nodes can be assigned with storage slices (912, 914, and 916), respectively, from two different storage containers 920 and 930.
- Each of the storage containers 920 and 930 and compute nodes (902, 904, and 906) can be configured with the location of the physical hardware.
- Storage container to compute node assignment can use the physical location as required to manage the data accessibility in the presence of hardware failures.
- the same architecture, implemented with an Ethernet infrastructure as described herein, can be extended to use protocol specific identifiers and assignment with SAS/SATA protocols connected over an SAS expander, and SOP protocol connected over a PCIe switch.
- Figure 10 illustrates the process in an example embodiment for device management.
- a switched DAS storage system with a pool of readily available drive shelves allows the flexibility of removing and adding storage to the pool of drives. This type of system needs to track each drive as it gets moved throughout the system and identify it as unique.
- a hash is calculated based on a unique device identifier (ID). This hash is used to address into a device ID table. The table entry is marked as being occupied and the device ID is placed into the table. This is shown in Figure 10. The table has additional information along with the Device ID to identify the device location within the switched DAS storage network.
- the management entity of the local storage controller will hash into the device ID table, removing the physical location of the device from the table but leaving the Device ID information in the table so the device can be identified if the device is returned to the storage pool.
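The device-tracking behavior described above (hash into the table, mark the entry occupied, clear the location on removal but keep the ID) can be sketched as follows. The table size, hash function, and linear probing on collision are illustrative assumptions, not details from the patent.

```python
# Illustrative device ID table: the hash of a unique device ID indexes
# the table; the entry stores the ID plus the device's location in the
# switched DAS network. Removal clears the location but keeps the ID so
# a returning device is recognized.

import hashlib

TABLE_SIZE = 64
table = [None] * TABLE_SIZE   # entries: {"dev_id": ..., "location": ...}

def _index(dev_id):
    h = hashlib.sha256(dev_id.encode()).digest()
    return int.from_bytes(h[:4], "big") % TABLE_SIZE

def add_device(dev_id, location):
    i = _index(dev_id)
    while table[i] is not None and table[i]["dev_id"] != dev_id:
        i = (i + 1) % TABLE_SIZE          # linear probe on collision
    table[i] = {"dev_id": dev_id, "location": location}

def remove_device(dev_id):
    i = _index(dev_id)
    while table[i] is not None:
        if table[i]["dev_id"] == dev_id:
            table[i]["location"] = None   # keep ID, clear location
            return
        i = (i + 1) % TABLE_SIZE

def lookup(dev_id):
    i = _index(dev_id)
    while table[i] is not None:
        if table[i]["dev_id"] == dev_id:
            return table[i]
        i = (i + 1) % TABLE_SIZE
    return None

add_device("SN-1234", ("shelf-2", "slot-7"))
remove_device("SN-1234")
print(lookup("SN-1234"))   # ID survives removal; location is cleared
```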
- Figure 11 illustrates the procedure 1201 in an example embodiment for data flow from a compute node to one or more storage slices.
- a file system or block access layer sends native storage commands through the disk device driver that is attached to a storage slice (processing block 1210).
- the native storage command and results are encapsulated in a transport protocol (e.g., Ethernet, PCIe, etc.) per the respective protocols.
- the storage slice responds to the native storage command per native storage standards.
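The encapsulation round trip just described (native command wrapped in a transport protocol, shipped, unwrapped, and answered in kind) can be sketched minimally as follows. The frame layout and EtherType value are placeholders for illustration, not the patent's wire format.

```python
# Minimal sketch: wrap a native storage command in an Ethernet-style
# header, then strip the header at the far end to recover the command.

import struct

ETHERTYPE = 0x88B5   # an experimental EtherType, used here as a placeholder

def encapsulate(dst_mac, src_mac, native_cmd):
    # dst MAC (6) + src MAC (6) + EtherType (2) + native command payload
    return dst_mac + src_mac + struct.pack("!H", ETHERTYPE) + native_cmd

def decapsulate(frame):
    dst, src = frame[0:6], frame[6:12]
    (ethertype,) = struct.unpack("!H", frame[12:14])
    assert ethertype == ETHERTYPE
    return dst, src, frame[14:]            # payload = the native command

cmd = b"\x01READ lba=42 len=8"             # stand-in for a native NVMe command
frame = encapsulate(b"\xaa" * 6, b"\xbb" * 6, cmd)
_, _, payload = decapsulate(frame)
print(payload == cmd)                      # the command arrives unmodified
```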
- Figure 12 illustrates a procedure 1300 in an example embodiment for storage slice sharing.
- the compute node writes to the storage slice to which it is assigned (processing block 1305).
- a virtual function (VF) associated with the same physical function/virtual function (PF/VF) of the compute node is assigned to the remote compute node looking to share the data (processing block 1315).
- the remote compute node is informed of the storage slice location, identity, offset, and length of the data (processing block 1325).
- the remote compute node accesses the data.
- the remote compute node informs the originating compute node of the task completion (processing block 1335).
- the originating compute node reclaims control and continues with operations (processing block 1345).
- a virtual drive or NLUN is used to distribute and share portions of the physical data devices or drives of multiple data storage slices (processing block 1355).
- a logical unit number (LUN) is used as a shared object between compute nodes (processing block 1365).
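The sharing handoff above can be sketched as a small state machine: the owning node writes data, a remote node is granted the (offset, length) reference, and on completion control returns to the originator. All class and method names here are illustrative, not from the patent.

```python
# Illustrative sketch of the storage-slice sharing handoff.

class SliceShare:
    def __init__(self):
        self.owner = None

    def write(self, node, data):
        # owning node writes to its assigned slice
        self.owner, self.data = node, data

    def grant(self, remote_node, offset, length):
        # the remote node is told where the data lives
        self.owner = remote_node
        return {"offset": offset, "length": length}

    def complete(self, remote_node, originating_node):
        # remote node signals completion; the originator reclaims control
        assert self.owner == remote_node
        self.owner = originating_node

share = SliceShare()
share.write("node-A", b"payload")
ref = share.grant("node-B", offset=0, length=7)
share.complete("node-B", "node-A")
print(share.owner)   # control has returned to node-A
```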
- One of the key advantages of centralizing storage media is to enable dynamic sharing by cooperating compute nodes.
- the switched DAS architecture of the example embodiments enables this feature.
- the example embodiment shows a basic data storage configuration that represents the common compute and storage interconnection scheme.
- the various example embodiments described herein use this basic topology and improve the way that data is moved through the system. The improvements lead to a drastic improvement in overall system performance and response time without impacting system reliability and availability.
- the disclosed architecture reduces protocol layers in both the compute server and storage device end of the system.
- the architecture of the various example embodiments described herein eliminates complicated high latency IP (Internet Protocol) based storage protocols and their software based retries with long IO (input/output) time-outs. These protocols are used to work around Ethernet's lossy nature to create a reliable storage protocol.
- the architecture of the various example embodiments described herein uses a data store switch fabric 160 to tunnel directly between nodes using server-based IO protocols across the network, resulting in directly exposing high performance storage devices 171 to the network. As a result, all the performance of the storage devices is made available to the network. This greatly benefits the compute server applications.
- Figure 13 illustrates a data flow 1301 in a switched DAS architecture of an example embodiment using Ethernet as the transport fabric protocol.
- an IO operation is initiated in the same manner as if the storage device 171 were internal to the compute server 150.
- the compute node sends native storage commands through the disk device driver, as if the storage slice were directly attached (processing block 1310).
- this IO operation, data request, or native storage operation (e.g., commands, data, etc.) is encapsulated in an Ethernet frame.
- the Ethernet frame is then shipped via the data store switch fabric 160 to a storage device 171 at the other end of the network (processing block 1330).
- the Ethernet tunnel is undone, the Ethernet encapsulation is removed, leaving native storage operations, and the IO protocol is passed to the storage device 171 as if the storage device 171 were connected via a direct method to the compute server 150 (processing block 1340).
- the storage slice responds to the native storage command, as if the compute node were directly attached (processing block 1350).
- the data store switch fabric 160 enables data communications between a plurality of compute nodes 150 and the plurality of data storage devices 171 in a manner to emulate a direct data connection.
- the storage device 171 can be a solid-state drive (SSD).
- SSD solid-state drive
- a solid-state drive (SSD) is a type of data storage device, such as a flash memory device, which uses memory technology rather than conventional rotating media.
- the encapsulation of IO operations into a standards based Layer 2 Ethernet frame is shown in Figure 14.
- the architecture of the example embodiment uses standard Ethernet protocol as an integral part of the storage system of a particular embodiment. As a result, it is extremely efficient and effective to use VLAN (virtual local area network) features to segregate and prioritize the storage traffic that is built with Ethernet as its core fabric. It will be apparent to those of ordinary skill in the art in view of the disclosure herein that many other alternative implementations can be used to segregate and prioritize storage traffic.
- the architecture of the example embodiment can utilize information available in the creation of the IO traffic where the tunnel is constructed to decide how to prioritize or segment the Ethernet flows.
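As one hedged illustration of VLAN-based segregation and prioritization, an IEEE 802.1Q tag carrying a priority code point (PCP) could be built as follows. The specific VLAN ID and priority values are assumptions for illustration; the patent does not mandate them.

```python
# Sketch: build an 802.1Q VLAN tag so storage traffic can be segregated
# (by VLAN ID) and prioritized (by the 3-bit PCP field).

import struct

def vlan_tag(pcp, vlan_id):
    # 802.1Q tag: TPID 0x8100, then TCI = PCP (3 bits), DEI (1), VID (12)
    tci = (pcp << 13) | vlan_id
    return struct.pack("!HH", 0x8100, tci)

STORAGE_VLAN, STORAGE_PRIORITY = 100, 5     # assumed example values
tag = vlan_tag(STORAGE_PRIORITY, STORAGE_VLAN)
tpid, tci = struct.unpack("!HH", tag)
print(hex(tpid), tci >> 13, tci & 0xFFF)    # TPID, priority, VLAN ID
```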
- the architecture also provides a hardware-based packet loss detection and recovery feature. Moving the packet loss detection and recovery to a fast, close-to-the-network mechanism improves the performance of the overall system over previous implementations.
- the example embodiment provides a very novel approach with significant benefits over today's storage architectures. Due to the high performance and small form factor of solid state memory devices currently on the market, old methods of external storage based on devices behind a single controller or banks of IO controllers, typically Intel® based motherboards, are too costly and woefully under provisioned.
- the data storage architecture of an example embodiment described herein moves the SAN/NAS type of storage processing software onto the compute nodes. This removes both cost from the system as well as performance bottlenecks of the external SAN/NAS or object storage architecture.
- the architecture of the example embodiments utilizes externally switched DAS storage that exposes the performance of the drives directly to a storage network. This allows for SAN/NAS type reliability, manageability, and availability that internal storage cannot offer.
- Removing storage from the compute servers now allows the compute environment and storage to scale independently.
- the removal of storage from the compute server allows for a more dense performance point.
- the density of the distributed storage solution of the example embodiments is far greater than that of internal storage, thereby reducing both power and footprint of the implementation.
- the various example embodiments provide technology and a software platform for: instrumentation hooks to monitor, measure, and enforce performance metrics into the compute, memory, network and storage resources; and continuous monitoring of the health of all resources to predict failures and proactively adjust/update the cluster resources. Details of the software platform in an example embodiment are provided below.
- Instrumentation hooks to monitor, measure, and enforce performance metrics into the compute, memory, network and storage resources.
- a first step in an example embodiment is to perform resource awareness flow. This includes creating a catalog of available hardware and their respective performance levels (e.g., flash devices or device types, number of NIC links per compute node, throughput and IOPS of storage devices, switch fabric infrastructure, connectivity, and timing, etc.).
- a second step is to perform predictive Service Level Agreement (SLA) requirement analysis. All resources that are required to run a job are virtualized, namely Central Processing Unit (CPU), memory, network, and storage. Jobs can be implemented as Hadoop jobs.
- Hadoop is a well-known open-source software framework from the Apache Software Foundation for storage and large-scale processing of data-sets on clusters of commodity hardware.
- Apache Hadoop is a registered trademark of the Apache Software Foundation.
- Platform software is made aware of the performance capabilities such as throughput, IOPS (input/output operations per second), latency, number of queues, command queue-depth, etc. of all the underlying hardware resources in the storage platform.
- the platform software will run matching algorithms to align the resource usage of a specific job against the hardware capabilities, and assign virtualized resources to meet a specific job. As cluster usage changes, the platform software continuously maps delivered SLAs against predicted SLAs, and adjusts predicted SLAs.
- an example embodiment illustrates a process 1500 to perform resource awareness flow.
- 1) cluster management applications are made aware of the raw performance capabilities of all hardware resources in the cluster (e.g., number of NIC (network interface controller) links per compute node; throughput and IOPS of underlying storage devices, switch fabric infrastructure, connectivity, and timing, etc.); 2) the cluster manager creates a catalog of available hardware and their respective performance levels (e.g., flash devices or device types, number of NIC links per compute node, throughput and IOPS of storage devices, switch fabric infrastructure, connectivity, and timing, etc.); and 3) the cluster manager creates and manages IO usage statistics (processing block 1510).
- NIC network interface controller
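The catalog and IO usage statistics of processing block 1510 might look like the following sketch. All field names and values are illustrative assumptions; the patent does not prescribe a data layout.

```python
# Illustrative hardware catalog plus simple IO usage counters, as a
# cluster manager might maintain them.

catalog = {
    "compute-node-1": {"nic_links": 2, "nic_gbps": 10},
    "slice-aa:bb:cc:00:00:01": {"media": "flash", "iops": 500_000,
                                "throughput_mbps": 3_000},
}
io_stats = {name: {"reads": 0, "writes": 0} for name in catalog}

def record_io(resource, kind):
    # accumulate per-resource usage statistics
    io_stats[resource][kind] += 1

record_io("slice-aa:bb:cc:00:00:01", "reads")
print(io_stats["slice-aa:bb:cc:00:00:01"])   # read counter incremented
```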
- an example embodiment illustrates a process 1700 to perform predictive service level agreement requirement processing.
- a job is submitted into the cluster with job meta data (processing block 1710).
- the process can review and/or initialize statistics based on the job performance or the job profile (processing block 1720).
- the process can predict the expected time it would take for the job to complete on the cluster based on the job's statistics, available resources, and profiling results (processing block 1730).
- the process can match the job's statistics and profiling results against the hardware catalog performance metrics and provide an estimated amount of time to complete the job at the assigned priority level and an expected amount of standard deviation seen on the cluster (processing block 1740).
- the process can monitor job progress and periodically assess the completion time and match it against the predicted job completion time.
- the process can adjust the resource assignment of the job to meet the predicted completion times.
- the process can warn an operator or a cluster management application of excessive delays (processing block 1750).
- the process can store the job's resource requirements and track the job's actual execution time.
- the process can adjust the predicted time as the job gets executed and update statistics (processing block 1760).
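The prediction-and-refinement loop above (processing blocks 1730 through 1760) can be sketched with a deliberately simple work/IOPS model; the patent does not specify the estimator, so the model, names, and numbers here are assumptions.

```python
# Sketch: estimate job completion time from its IO demand and assigned
# IOPS, then re-project the estimate from observed progress as it runs.

def predict_seconds(job_io_ops, assigned_iops):
    # initial estimate from the hardware catalog's performance metrics
    return job_io_ops / assigned_iops

def refine(predicted, fraction_done, elapsed_seconds):
    # re-project total time from observed progress (cf. block 1760)
    if fraction_done == 0:
        return predicted
    return elapsed_seconds / fraction_done

predicted = predict_seconds(job_io_ops=1_000_000, assigned_iops=50_000)
print(predicted)                      # initial estimate in seconds
print(refine(predicted, 0.25, 6.0))  # updated after observing progress
```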
- the platform software of an example embodiment continuously monitors the health of all critical hardware components across various compute nodes and storage containers.
- the system also performs resource monitoring to avoid failures.
- Platform software is made aware of the failure characteristics, such as wear-levels of flash storage, failure ratings of power supplies, fans, network and storage errors, etc., of all the underlying hardware resources in the storage platform.
- the platform software implements hooks to monitor the health of hardware resources into the respective software control blocks.
- the platform software runs continuous failure models and proactively alerts an operator or a cluster management application to attend to or update the hardware resource that is in question. When a change in resource is imminent, the platform software proactively reduces the usage of affected hardware, rebalances the storage, network and compute tasks, and isolates the affected hardware for quick and easy replacement.
- an example embodiment illustrates a process 1800 to perform platform software resource monitoring for failure avoidance.
- the platform software periodically polls the health, usage, wear-level of flash, error levels on NIC interfaces, and performance levels of all hardware components (processing block 1810).
- the process runs failure prediction analysis on components that are heavily used (processing block 1820). For components that are closer to failing based on a pre-configured probability and earlier than a pre-configured time limit, the process starts the resource migration activity and does not take any new usage on the affected component(s) (processing block 1830). After resource migration is complete, the process automatically marks the affected components as off-line (processing block 1840).
- the process automatically re-adjusts the projected completion times for outstanding jobs (processing block 1850) and generates alerts to an operator or a cluster management application for any needed corrective actions (processing block 1860).
- areas of the flash drives which are showing high levels of wearing (or bad cell sites) can be used for the storage of lightly written data (e.g., cold data storage). In this manner, the worn areas of the flash drives can still be used without wasting storage.
- Step 2 of the IO flow shown in Fig. 19 identifies a host write of a doorbell.
- the Host NIC 156 (network interface controller shown in Figure 6) of an example embodiment forwards the doorbell down the Ethernet connection of the data store switch fabric 160 to the storage controller 172 as shown in Figures 6 and 8, where the doorbell eventually gets passed to the storage device 171 (e.g., a flash drive or other SSD).
- the Host NIC 156 acts on the doorbell and fetches the command from the Submission Queue as identified in step 3 of Figure 19.
- the Host NIC can start to process the command before the storage device has seen the command.
- the Host NIC 156 can send the relevant information across the data store switch fabric 160 (e.g., the Ethernet connection) to the storage controller 172.
- when the storage device 171 sees the doorbell, the head information of the command has already been fetched and is either on the way or has arrived in the local packet buffer of the storage controller 172.
- This method of prefetching commands and data and overlapping processing operations effectively hides latency and improves performance of the IO system. Additionally, by being IO aware, the hardware can handle the lossy nature of Ethernet and more reliably handle packet drops.
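The prefetch overlap described above can be sketched as follows: on the doorbell write, the NIC immediately fetches the new submission-queue entry and ships the command ahead of the doorbell, so the command can arrive at the storage controller before the device acts on the doorbell. The queue structures and message ordering are illustrative.

```python
# Sketch: NIC snoops the doorbell, prefetches the submission-queue
# entry, and forwards the command across the fabric before the doorbell.

submission_queue = []   # host-side SQ entries
fabric = []             # in-flight messages to the storage controller

def ring_doorbell(tail):
    # NIC observes the doorbell write and prefetches the new entry
    cmd = submission_queue[tail - 1]
    fabric.append(("CMD", cmd))        # command ships first
    fabric.append(("DOORBELL", tail))  # doorbell follows

def host_submit(cmd):
    submission_queue.append(cmd)
    ring_doorbell(len(submission_queue))

host_submit({"opc": "READ", "lba": 42})
print([kind for kind, _ in fabric])    # command precedes the doorbell
```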
- the example embodiment shows the basic system interconnect where a Host 150 with an Ethernet NIC 156 is connected via an Ethernet connection infrastructure of data store switch fabric 160, which is then connected to an Ethernet based storage controller 172.
- the storage controller 172 is connected to an SSD 171.
- This is the basic physical configuration of the storage system of an example embodiment.
- the Host NIC 156 presents a virtual SSD to the server 150.
- the storage controller 172 presents a virtualized root complex to the SSD 171.
- the Host NIC 156 presents an endpoint to the compute node 150.
- the storage protocol is tunneled across the Ethernet connection infrastructure.
- Tunneling the protocol limits the complexity, power and latency of the Host NIC 156 and storage controller 172.
- the virtualization allows any host to be able to communicate to any number of storage controllers to utilize a portion of or the entire addressable space of the SSDs to which it is connected.
- Virtualizing the devices allows the example embodiments to use host resident storage management software 155 that can then implement features common to enterprise SAN and NAS systems at a much higher performance level, lower power level, and lower system cost.
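The per-host address-space virtualization described above can be sketched as a translation map from a host's virtual LBA ranges to (controller, physical LBA) pairs. The mapping structure and names are assumptions for illustration; the patent does not define this layout.

```python
# Sketch: a per-host map lets a host address portions of the SSD space
# behind multiple storage controllers through one virtual LBA range.

# host -> list of (virtual LBA start, (controller, physical start, length))
host_map = {
    "host-1": [(0,    ("ctrl-A", 0,   1000)),
               (1000, ("ctrl-B", 500, 1000))],
}

def translate(host, vlba):
    for start, (ctrl, pstart, length) in host_map[host]:
        if start <= vlba < start + length:
            return ctrl, pstart + (vlba - start)
    raise ValueError("unmapped LBA")

print(translate("host-1", 1500))   # falls in the second controller's extent
```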
- a low latency reliable secure messaging protocol is an important part of the data storage architecture described herein.
- the messaging protocol provided in an example embodiment uses the same connectivity infrastructure that customer I/O operations use.
- the architecture of the protocol permits a responding compute server to directly send indexes and meta data to the locations where a requesting compute server will use the data, eliminating any memory copies. This saves valuable system bandwidth as well as increasing storage software performance.
- the messaging protocol also reduces system response latencies. Performance is also optimized as hardware can snoop the message entries while moving the data to obtain information used to ensure the memory integrity of the system receiving the indexes and meta data, thereby eliminating another queue or table.
- Figure 20 illustrates a compute server to compute server configuration of the messaging protocol of an example embodiment.
- compute nodes or servers can be in data communication with each other via a local area network 165.
- the messaging protocol of an example embodiment can be used to facilitate this data communication.
- the term Initiator is used to identify the server that is sending a Request Message to get information from a Target server that sends a Response.
- a response is a generic term for the data that is being used by the storage system software of an example embodiment. This data can include index data or other meta data or system status.
- the messaging protocol described herein is a peer to peer (P2P) protocol.
- the Initiator starts 301 a conversation by placing an entry into a work queue 320.
- the Initiator then rings a doorbell telling the Target a work queue entry is available.
- the Target reads 302 the work queue entry.
- a side effect 303 of the work queue entry read moves check information into the Address Translation Unit (ATU) 330 of the hardware.
- the Target receives 304 the work queue entry, processes the work queue entry, and builds the appropriate response packet.
- the response packet is then sent 305 to the Initiator where the response packet is processed by the Address Translation Unit (ATU) 330.
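The five-step exchange above (301 through 305) can be sketched in a few lines of Python. This is a toy model, not the patented implementation: class and field names are illustrative, and the ATU "check information" is reduced to a request identifier.

```python
from collections import deque

class ATU:
    """Model of the Address Translation Unit (ATU) 330: holds check
    information captured as a side effect of the work-queue read (303)
    and validates the response before it lands in memory (305)."""
    def __init__(self):
        self.check_info = None

    def load(self, check_info):
        self.check_info = check_info

    def validate(self, response):
        return response["request_id"] == self.check_info

class Initiator:
    def __init__(self):
        self.work_queue = deque()   # 301: Initiator places an entry here
        self.atu = ATU()
        self.responses = []

    def send_request(self, target, request_id, payload):
        self.work_queue.append({"request_id": request_id, "payload": payload})
        target.doorbell(self)       # 301: ring the Target's doorbell

class Target:
    def doorbell(self, initiator):
        entry = initiator.work_queue.popleft()    # 302: read the entry
        initiator.atu.load(entry["request_id"])   # 303: side-effect ATU load
        response = {"request_id": entry["request_id"],
                    "data": entry["payload"]}     # 304: build the response
        if initiator.atu.validate(response):      # 305: ATU screens the packet
            initiator.responses.append(response)

init, tgt = Initiator(), Target()
init.send_request(tgt, request_id=7, payload="index meta data")
print(init.responses[0]["data"])
```

The key property modeled is that the ATU check information is loaded as a side effect of the Target's work-queue read, so no separate queue or table is needed to screen the inbound response.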
- the example embodiment shows the basic structure of distributed storage network connectivity in an example embodiment.
- the example embodiment utilizes this network topology to implement storage features without impacting the compute servers and the links to the compute servers. Examples of these features include mirroring disks and building or rebuilding replicas of drives. Again, this is all done independently of the compute servers. This saves valuable bandwidth resources of the compute servers. These features also increase overall storage performance and efficiencies as well as lower the overall power of the storage implementation.
- Figure 22 shows the basic organization of the current flash media.
- An enterprise class SSD is made up of many assembled chips of flash devices. The devices could be assemblies of multiple die in one package. Each die is made up of multiple blocks with many pages per block. The memory is addressed at a logical block boundary.
- Flash media is a media that does not allow direct writes. If new data is to be written, a blank area must be found or an existing area must be erased. The unit of space that is bulk erased at one time is generally called the erase block. Because of this lack of direct write capability for this type of memory device, there is a management overhead. This management overhead includes managing the logical data blocks as virtual in that they don't exist in a specific physical location, but over time are moved around the physical memory as various writes and reads occur to the die. Additionally, the media will wear out over time. Spare area is maintained to allow for user physical locations to fail without losing user data.
- an example embodiment provides an I/O layer that virtualizes the storage from the application or operating system and then optimizes that storage to get the best performance out of the media, particularly flash memory devices.
- the example embodiment enables the implementation to avoid the performance pitfalls which can occur when the media is not used optimally.
- This virtualization software layer, which is flash memory device aware, formats the physical media to optimize writes so as to limit the need for the flash memory devices to perform garbage collection. This is done by ensuring all files or records are flash erase bank aligned and a multiple of the erase bank size. Additionally, block size is a multiple of the erase bank size.
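The alignment rule described above (every record a multiple of the erase bank size) amounts to rounding record sizes up to an erase-block boundary. A minimal sketch, assuming a hypothetical 2 MiB erase block, since the actual erase bank size depends on the flash device:

```python
def erase_aligned_size(record_bytes, erase_block_bytes):
    """Round a record size up to the next multiple of the erase block so
    that records stay erase-bank aligned and device-side garbage
    collection is minimized."""
    blocks = -(-record_bytes // erase_block_bytes)   # ceiling division
    return blocks * erase_block_bytes

ERASE_BLOCK = 2 * 1024 * 1024        # hypothetical 2 MiB erase bank
print(erase_aligned_size(3_000_000, ERASE_BLOCK))   # 4194304 (two erase banks)
```

A write sized this way never straddles an erase-block boundary, so erasing one record's space never forces the device to relocate a neighbor's pages.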
- the ability to format a drive and write records with an erase buffer in mind also helps reduce the need for spare pages. This frees up the pages from the spare pool and makes the pages available to customer applications.
- the example embodiment increases the density of a current flash device due to the optimized usage of the device. This creates a more cost effective solution for customers.
- Today's storage stacks are developed to provide the optimal performance for an average I/O and storage workload the system will see, or the user can force the system to use preferred settings. Some systems will allow the user to characterize their workloads and then the user can set the systems to use a given set of settings.
- the various embodiments of the data storage system described herein are designed to enable adjusting to the I/O traffic and storage characteristics as the traffic profile changes.
- the various embodiments can also be programmed to alert the operator or cluster management application when the traffic pattern is seen to cross preset limits.
- the various embodiments allow different segments of the storage to utilize completely different I/O and storage logical block settings to optimize performance.
- the various embodiments described herein maintain real-time knowledge statistics of flash drives, which allows the system to avoid failures. Areas of the flash drives which are showing high levels of wearing (or bad cell sites) can be avoided when writing data. The cell use and the latency are monitored to determine wear. Based on monitored wear, data can be re-allocated to alternate drives and the storage meta data maintained on the compute nodes can be updated. As individual flash drives near preset wear leveling targets, data can be slotted to other drives and meta data updated. If the user selects this feature, data can also be moved to alternate SSDs autonomously when these target thresholds are crossed.
- areas of the flash drives which are showing high levels of wearing (or bad cell sites) can be used for the storage of lightly written data (e.g., cold data storage). In this manner, the worn areas of the flash drives can still be used without wasting storage.
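The wear-aware placement policy just described can be sketched as a small selection function. This is an illustrative model only: the drive names, the single wear fraction per drive, and the 0.8 threshold are assumptions, whereas a real system would track wear per region from the monitored cell use and latency.

```python
def choose_drive(drive_wear, data_is_cold, wear_threshold=0.8):
    """drive_wear maps a drive id to an observed wear fraction (0.0-1.0)
    derived from monitored cell use and latency. Cold (lightly written)
    data may be parked on worn drives; hot data avoids them."""
    worn = {d: w for d, w in drive_wear.items() if w >= wear_threshold}
    fresh = {d: w for d, w in drive_wear.items() if w < wear_threshold}
    if data_is_cold and worn:
        return max(worn, key=worn.get)    # reuse the most-worn area for cold data
    return min(fresh, key=fresh.get)      # write hot data to the least-worn drive

drives = {"ssd0": 0.15, "ssd1": 0.85, "ssd2": 0.40}
print(choose_drive(drives, data_is_cold=False))   # ssd0
print(choose_drive(drives, data_is_cold=True))    # ssd1
```

Routing hot writes away from worn areas postpones failures, while directing cold data to those same areas keeps them productive instead of retiring them early.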
- Storage Meta Data Structure: Referring again to Figure 6, the example embodiment shows a basic compute environment where compute servers are attached to storage devices. Applications can run on the servers and the application data as well as operating data can reside on the storage devices.
- the environment enables object storage devices to perform at comparable or greater levels to compute servers with internal storage and vastly outperform other methods of external storage devices and storage systems, such as SAN and NAS storage as described above. This improved efficiency frees up the user to independently scale the compute and storage needs of their compute clusters without adversely impacting the performance.
- the distributed object store will have unmatched performance density for cluster based computing with the availability features of SAN or NAS storage.
- Figure 23 shows the object tag format for the object store of the example embodiment.
- the type field is used to define what fields are present in the rest of the tag, as some fields are optional and some fields can be duplicated. This is done to enable and disable storage of each object stored.
- the object source is a network pointer to where the object resides in the network. This object source is generated to allow current commercial switches to locate the object source in an Ethernet network with hardware speed or the smallest possible latency.
- the object tag is used to move that I/O command to the correct location for the command to be processed.
- the object locater field is used to find the data object the command is processing or accessing.
- the object feature field is used to track any special requirement or actions an object requires. It is also used to determine any special requirements of the object. Agents can use this field to make decisions or perform actions related to the object.
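The four tag fields described above (type, source, locater, feature) can be sketched as a simple structure. The field names follow the text, but the widths, the hex values, and the bit-flag interpretation of the feature field are illustrative assumptions, not the patented wire format.

```python
from dataclasses import dataclass

@dataclass
class ObjectTag:
    tag_type: int         # defines which optional fields are present
    object_source: bytes  # network pointer locatable by commodity switches
    object_locater: int   # finds the data object a command addresses
    object_feature: int   # flags for special requirements or actions

    def has_feature(self, bit):
        """Agents can test feature bits to make decisions about the object."""
        return bool(self.object_feature & (1 << bit))

tag = ObjectTag(tag_type=1,
                object_source=bytes.fromhex("0002c9a1b2c3"),
                object_locater=0x42,
                object_feature=0b0101)
print(tag.has_feature(0), tag.has_feature(1))   # True False
```

Keeping the source field in a switch-resolvable form is what lets an ordinary Ethernet switch route the I/O command to the object at hardware speed.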
- the Switched DAS architecture of an example embodiment has a wide variety of use cases.
- the following list presents a few of these use cases:
- RDBMS relational database management system
- OLTP/OLAP online transaction processing / online analytical processing
- VDI virtual device interface
- Figure 4 depicts conventional host server interconnections using existing protocols, such as Fibre Channel or serially attached Small Computer System Interface (SCSI). These protocols add significant overhead to each input/output (I/O) operation. Historically, this was not a major issue, because the rotating physical media on the storage devices was slow and the devices were then sufficient to fulfill application needs.
- Figure 5 depicts similar interconnections using existing networks instead of protocol-specific interconnects, such as in Figure 4. While such implementations make the interconnects non-proprietary and less dependent on a specific protocol, additional networking overhead is introduced, so that the benefits of longer attachment distances, sharing, and use of existing network infrastructure can be lost due to the even slower access.
- Figure 24 shows a specific example of the conventional system shown in Figure 4, where storage is attached via Ethernet using conventional protocols, such as Internet SCSI (iSCSI), Fibre Channel over Ethernet (FCoE), Advanced Technology Attachment (ATA) over Ethernet (AoE), and so forth.
- FIG 25 illustrates how NVM Express devices are accessed when locally installed in a server.
- NVMe, or Non-Volatile Memory Host Controller Interface Specification (NVMHCI), is a specification for accessing solid-state drives (SSDs) attached through the PCI Express (Peripheral Component Interconnect Express or PCIe) bus.
- NVM is an acronym for non-volatile memory, which is used in SSDs.
- I/O operation requests, or "submissions," are placed on producer queues. When data are transferred and the operation is completed, the results are posted to a "completion" queue.
- This protocol results in very fast, low latency operations, but does not lend itself to being used in a multi-server environment, because the protocol is designed to be a point-to-point solution.
- PCIe also has bus length, topology, and reconfiguration limitations.
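The producer/completion queue pattern just described can be modeled with a small ring buffer: the host advances a tail pointer (the doorbell write in real hardware) and the device consumes entries and posts results. A minimal sketch, with illustrative names and a toy command format rather than the NVMe binary layout:

```python
class NVMeLikeQueuePair:
    """Toy model of the submission/completion queue pattern: the host
    advances a tail pointer and rings a doorbell; the device consumes
    entries up to the tail and posts results to the completion queue."""
    def __init__(self, depth=8):
        self.depth = depth
        self.sq = [None] * depth   # submission (producer) ring
        self.cq = []               # completion queue
        self.tail = 0              # host-owned producer pointer
        self.head = 0              # device-owned consumer pointer

    def submit(self, command):
        next_tail = (self.tail + 1) % self.depth
        if next_tail == self.head:
            raise RuntimeError("submission queue full")
        self.sq[self.tail] = command
        self.tail = next_tail      # in hardware: doorbell register write

    def device_poll(self):
        while self.head != self.tail:          # device drains new entries
            cmd = self.sq[self.head]
            self.head = (self.head + 1) % self.depth
            self.cq.append({"cmd": cmd, "status": "OK"})

qp = NVMeLikeQueuePair()
qp.submit({"opcode": "READ", "lba": 0, "blocks": 8})
qp.device_poll()
print(qp.cq)
```

The point-to-point limitation discussed above is visible here: both pointers live in one host's memory, which is exactly what the tunneling approach of the example embodiments works around.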
- Figure 26 illustrates a typical RDMA hardware and software stack required to implement remote access of NVM Express devices. Note the large number of layers of software required to transport a request. The multiple-layer overhead far exceeds the native device speeds.
- One purpose of the data storage access system of various example embodiments is to allow a plurality of host servers to access a plurality of storage devices efficiently, while minimizing the hardware, firmware, software, and protocol overhead and cost. This results in the benefits described below.
- Figure 27 illustrates an embodiment 2700 of the data storage access system of the example embodiments described herein showing the savings in complexity to be gained by use of the example embodiments over the conventional implementation, for example, shown in Figure 26. Many layers of protocol and additional messages are no longer needed, resulting in much improved performance.
- Figure 27 illustrates an embodiment 2700 of the data storage access system, which includes a host system 2710 in data communication with a data storage controller system 2712.
- the data communication between one or more host systems 2710 and the data storage controller system 2712 is provided by an NVMe tunnel 2714.
- the NVMe tunnel 2714 can effect the high-speed transfer of data to/from the data storage controller system 2712 using an Ethernet data transfer fabric.
- the NVMe tunnel 2714 provides a high-speed (e.g., 40 Gigabit Ethernet) Layer 2 data conduit between the one or more host systems 2710 and the data storage controller system 2712.
- the details of an embodiment of the data storage access system 2700 and the NVMe tunnel 2714 are provided below and in the referenced figures.
- Figure 28 illustrates the configuration of queues in the host bus adapter (HBA) or host network interface controller (NIC) in an example embodiment.
- Figure 28 also illustrates the positioning of the HBA or NIC between the network endpoint (e.g., PCIe endpoint) and the data transmission fabric (e.g., Ethernet).
- Figure 29 illustrates a detail of the configuration of queues in the host bus adapter (HBA) or host network interface controller (NIC) in an example embodiment.
- the set of queues of the HBA or Host NIC in an example embodiment includes a set of management queues and a set of data path queues.
- the management queues include a set of administrative submission queues, a set of administrative receive queues, and a set of administrative completion queues.
- the management queues enable the transfer of control or configuration messages between nodes (e.g., servers, hosts, storage controllers, or other fabric-connected components) without interruption of the operational data flows transferred via the set of data path queues.
- the set of data path queues includes an Input/Output (I/O) submission queue and a set of completion queues corresponding to each of a plurality of processing logic components or cores.
- the set of queues of the HBA or Host NIC in an example embodiment enable the high speed transfer of data across the NVMe tunnel to/from nodes on the other side of the data communication fabric.
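The queue layout described above (management queues plus per-core data-path completion queues) can be sketched as a simple constructor. The dictionary keys are illustrative names, not the product's register map, and the per-core completion-queue count is an assumption drawn from the text.

```python
def build_host_queues(num_cores):
    """Queue layout sketch for the HBA/Host NIC of the example embodiment:
    management (administrative) queues for control/configuration traffic,
    and data-path queues with one completion queue per core so completions
    never cross cores."""
    return {
        "management": {
            "admin_submission": [],
            "admin_receive": [],
            "admin_completion": [],
        },
        "data_path": {
            "io_submission": [],
            "completion": {core: [] for core in range(num_cores)},
        },
    }

queues = build_host_queues(num_cores=4)
print(len(queues["data_path"]["completion"]))   # 4
```

Separating the management queues from the data path is what lets configuration messages flow without interrupting operational I/O.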
- Figure 30 illustrates an architectural view of the storage controller 2712 of an example embodiment in network communication with a plurality of host server systems 2710 via a storage network 3010.
- the storage controller 2712 in an example embodiment can be configured with sets of queues to handle the flow of data traffic between a node (e.g., a host system 2710) and the data storage repository 3012.
- the set of queues of the storage controller 2712 in an example embodiment includes a set of management queues and a set of data path queues.
- the management queues include a set of administrative submission queues and a set of administrative completion queues.
- the management queues enable the transfer of control or configuration messages between nodes (e.g., servers, hosts, storage controllers, or other fabric-connected components) and the data storage repository 3012 without interruption of the operational data flows transferred via the set of data path queues.
- the set of data path queues includes an Input/Output (I/O) submission queue and a completion queue.
- a context cache is provided to cache the context information retained for each operation.
- the example embodiment can retain information needed to instruct the data storage controller of an example embodiment how to present the ending status of an operation.
- the context information can assist in defining the disposition of the request.
- the request disposition can represent the number or identifier of a completion queue to which the ending status is posted (e.g., Completion Queue Context).
- a request may direct the data storage controller to post ending status as soon as the request is transmitted, for example, to signify that a stateless broadcast was sent (e.g., submission Queue Context).
- the context information can be used to differentiate among a plurality of outbound data paths and corresponding outbound data path queue sets.
- the set of queues of the storage controller 2712 in an example embodiment enable the high speed transfer of data across the NVMe tunnel between nodes on the other side of the data communication fabric 3010 and the data storage repository 3012.
- Figure 31 illustrates an example of a method for a host server to communicate I/O requests to devices installed within the data storage access system of an example embodiment.
- the I/O requests are handled by a host bus adapter (HBA) or network interface controller (NIC) on the host system.
- Requests to the data storage access system of the example embodiment can be placed on the first two queues (SQ0 and SQ1), one queue to transmit requests, the other queue to transmit completions. These requests take priority over any other requests, allowing for a path to issue error-recovery directives, such as component resets or overall system configuration updates.
- Queue two (SQ2) is used to direct administrative requests to devices.
- Administrative requests are used to control the physical aspects of a device, such as formatting its media, or to issue error-recovery commands, such as individual operation aborts or device resets. Administrative queue requests take priority over device 10 operations.
- the remaining queues (SQ3 through SQ7) are used to issue application-related I/O operations (e.g., reading and writing of application data). Multiple queues may exist to allow ordering of operations to devices or to alter the priority of queued operations.
- Each queue contains a variable number of elements, each of which represents a unit of work for the data storage access system of the example embodiment to perform.
- Each element can be comprised of the following two parts:
- Shadow queue element containing information over and above the request itself.
- Submission queue element containing sufficient NVMe-compatible information to execute an individual operation.
- the submission queue contents exactly match those defined in the NVM Express Specification for NVM Express devices.
- the shadow and submission queues may be distinct regions in memory, or may be combined into a single element. However, the request receiving component of the data storage access system can receive both the shadow queue element and the submission queue element together. When transmitted across a network, unused or reserved fields of the request may be omitted to save time and network bandwidth.
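The two-part work element described above can be sketched as a pair of structures: a shadow part carrying routing and disposition metadata, plus the NVMe-format submission entry. Field names and values here are illustrative assumptions; the real submission entry follows the NVM Express Specification layout, which this dictionary only stands in for.

```python
from dataclasses import dataclass, asdict

@dataclass
class ShadowElement:
    dest_mac: str      # unicast, multicast, or broadcast destination
    src_mac: str       # sending port's MAC address
    command: int       # system-level request vs. individual-device request
    disposition: int   # e.g., completion queue number for ending status

@dataclass
class WorkElement:
    """One unit of work: the shadow part carries information over and
    above the request itself; the submission part holds the NVMe-format
    command entry."""
    shadow: ShadowElement
    submission: dict   # stand-in for an NVMe submission queue entry

    def wire_form(self):
        # Omit unused/reserved (None) fields before transmission to save
        # time and network bandwidth, as described above.
        merged = {**asdict(self.shadow), **self.submission}
        return {k: v for k, v in merged.items() if v is not None}

elem = WorkElement(
    ShadowElement("ff:ff:ff:ff:ff:ff", "02:00:00:00:00:01",
                  command=1, disposition=3),
    {"opcode": 0x02, "nsid": 1, "prp1": None})
print(elem.wire_form())
```

Transmitting the shadow and submission parts together is what lets the receiving component route and complete the request without any additional lookup.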
- Figure 32 illustrates example contents of a single Shadow Queue Element in the data storage access system of an example embodiment.
- This element includes:
- a destination Ethernet Media Access Control (MAC) address or other network-specific addresses can be provided.
- the network-specific address can address a single component via a unicast, multiple components via a multicast, or all components via a broadcast.
- a destination MAC address and a source MAC address can be provided.
- the destination MAC address can address a single component via a unicast, multiple components via a multicast, or all components via a broadcast.
- the source MAC address can be the sending port's MAC address. It will be apparent to those of ordinary skill in the art in view of the disclosure herein that the network-specific addresses can be Ethernet MAC addresses or other types of device addresses compatible with a particular data communication fabric.
- the "Command" field designates the current request as a request submitted to the data storage access system, or to an individual device.
- the class of request field can override the implicit type by queue number.
- this element defines the disposition of the request.
- the request disposition can represent the number or identifier of a completion queue to which the ending status is posted (e.g., Completion Queue Context).
- a request may direct the data storage access system to post ending status as soon as the request is transmitted, for example, to signify that a stateless broadcast was sent (e.g., submission Queue Context).
- VLAN Virtual Local Area Network
- each issued request results in a completion event, which is placed into the next available slot in a completion queue.
- the format of the completion queue can be identical to the format defined in the NVMe Specification. As such, the format is not discussed further here.
- Figures 33 and 34 illustrate example register sets of the data storage access system of an example embodiment used to set up and control the various request and completion queues as described herein.
- Figure 33 illustrates the Submission Queue (SQ) Registers in the example embodiment.
- Figure 34 illustrates the Completion Queue (CQ) Registers in the example embodiment.
- these register sets, one per queue in the example embodiment, define the storage address of that particular queue, the queue length in number of elements, and "producer" and "consumer" queue pointers for the task adding requests to the queue (the "producer") and the task servicing requests (the "consumer").
- Other fields in the register sets define the type of queue and provide other debugging and statistical information.
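The per-queue register set just described can be sketched as a small structure with producer/consumer pointers. Field names are illustrative, and the debugging/statistics fields are omitted; a real register set would be memory-mapped hardware, not a Python object.

```python
from dataclasses import dataclass

@dataclass
class QueueRegisters:
    """Per-queue register set sketch: base address, length in elements,
    and the producer/consumer pointers described above."""
    base_addr: int   # storage address of this queue
    length: int      # queue length in number of elements
    producer: int = 0
    consumer: int = 0

    def entries_pending(self):
        # Requests added by the producer but not yet serviced by the
        # consumer; modulo arithmetic handles pointer wrap-around.
        return (self.producer - self.consumer) % self.length

sq = QueueRegisters(base_addr=0x1000_0000, length=64)
sq.producer = 5   # the producer task has added five requests
sq.consumer = 2   # the consumer task has serviced two
print(sq.entries_pending())   # 3
```

Comparing the two pointers is how each side knows, without locks or messages, whether there is work outstanding in the queue.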
- Figures 35 and 36 illustrate examples of how a host I/O request flows through the data storage access system of an example embodiment.
- Figure 35 illustrates a typical data read transaction flow in an example embodiment.
- Figure 36 illustrates a typical data write transaction flow in an example embodiment.
- a host server-specific tag is included with the transaction in order to identify which host server memory is used to transfer data between the host server and the device(s) being addressed.
- FIG. 35 a typical data read transaction flow (read data transferred from storage device to Host) in an example embodiment is illustrated.
- the basic sequence of processing operations for handling a read data transaction in an example embodiment is set forth below with the processing operation numbers listed below corresponding to the operation numbers shown in Figure 35:
- Host writes a new queue index tail pointer to the HBA Submission Queue Tail Doorbell register.
- HBA generates a read request to access a Host memory Submission Queue Shadow entry using the queue index head pointer and Submission Queue Shadow base address information.
- HBA generates a read request to access a Host memory Submission Queue NVMe command entry.
- Shadow and NVMe command entries are used to generate a message that contains the proper fabric information to reach the storage controller and drive.
- the message is encapsulated within a tunnel header and fabric (e.g., Ethernet) header, and sent across the fabric (e.g., Ethernet).
- the storage controller receives the message and stores off the NVMe command entry and fabric information to be used later in the NVMe I/O.
- the storage controller tags the upper bits of the Physical Region Page (PRP) addresses of the NVMe command with an I/O context tag and saves off the replaced bits.
- the tag field of the address is used to determine to which I/O and host the data phase request belongs.
- the storage controller writes NVMe submission Queue doorbell register of the drive and the drive reads the local NVMe entry.
- the tag field bits of the address are used to perform an I/O context lookup.
- the tag field of the address is restored to its original value (stored in step 5c) and the Transaction Layer Packet (TLP) is directed back to the requesting host based on the fabric information from the I/O context (stored in step 5b).
- the storage controller intercepts the NVMe completion and directs it back to the requesting host's proper completion queue based on the fabric information stored off in step 5b.
- HBA writes a Message Signaled Interrupt (MSI-X) based on the completion shadow data.
- Storage controller writes queue index head pointer to the drive Completion Queue Head Doorbell register.
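The PRP address tagging that threads through steps 5b, 5c, 7, and 8 of the read flow above can be sketched as a small lookup table. The 48-bit tag position, the table layout, and the host identifier are illustrative assumptions; the real controller does this in hardware on Transaction Layer Packets.

```python
class IOContextTable:
    """Sketch of the read-flow tagging: the storage controller replaces
    the upper bits of each PRP address with an I/O context tag (saving
    the replaced bits, step 5c, and the fabric info, step 5b), then uses
    the tag on returning TLPs to restore the address and pick the
    requesting host (steps 7 and 8)."""
    TAG_SHIFT = 48
    LOW_MASK = (1 << TAG_SHIFT) - 1

    def __init__(self):
        self.table = {}
        self.next_tag = 1

    def tag_address(self, prp_addr, host_info):
        tag = self.next_tag
        self.next_tag += 1
        self.table[tag] = {"host": host_info,                   # 5b: fabric info
                           "bits": prp_addr >> self.TAG_SHIFT}  # 5c: saved bits
        return (tag << self.TAG_SHIFT) | (prp_addr & self.LOW_MASK)

    def restore(self, tagged_addr):
        tag = tagged_addr >> self.TAG_SHIFT                     # 7: context lookup
        ctx = self.table[tag]
        original = (ctx["bits"] << self.TAG_SHIFT) | (tagged_addr & self.LOW_MASK)
        return original, ctx["host"]                            # 8: route to host

ctx_table = IOContextTable()
tagged = ctx_table.tag_address(0x0000_7FFE_DEAD_B000, host_info="host-A")
addr, host = ctx_table.restore(tagged)
print(hex(addr), host)
```

Because the tag rides inside the address itself, the drive needs no modification: it issues ordinary memory transactions, and the controller transparently redirects them across the fabric.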
- FIG. 36 a typical data write transaction flow (written data transferred from Host to storage device) in an example embodiment is illustrated.
- the basic sequence of processing operations for handling a write data transaction in an example embodiment is set forth below, with the processing operation numbers listed below corresponding to the operation numbers shown in Figure 36:
- Host writes new queue index tail pointer to HBA Submission Queue Tail Doorbell register.
- HBA generates a read request to access a Host memory submission Queue Shadow entry using queue index head pointer and submission Queue Shadow base address information.
- Host returns the requested submission Queue Shadow entry.
- HBA generates a read request to access a Host memory Submission Queue NVMe command entry.
- Host returns the requested submission Queue NVMe command entry.
- Shadow and NVMe entries are used to generate a message that contains the proper fabric information to reach the storage controller and drive.
- the message is encapsulated within a tunnel header and fabric (e.g., Ethernet) header, and sent across the fabric (e.g., Ethernet)
- the storage controller receives the message and stores off the NVMe command entry and fabric information to be used later in the NVMe I/O.
- the storage controller tags the upper bits of the Physical Region Page (PRP) addresses of the NVMe command with an I/O context tag and saves off the replaced bits.
- the tag field of the address is used to determine to which I/O and host the data phase request belongs.
- the storage controller writes NVMe submission Queue doorbell register of the drive and the drive reads the local NVMe entry.
- SSD generates Transaction Layer Packet (TLP) read requests for the NVMe write data.
- the tag field bits of the address are used to perform an I/O context lookup.
- the tag field of the address is restored to its original value (stored in step 5c) and the TLP is directed back to the requesting host based on the fabric information from the I/O context (stored in step 5b).
- the HBA receives the TLP read request and stores off fabric information to be used for the TLP read completion.
- Host returns TLP read completion data for the NVMe write.
- the HBA intercepts the TLP read completion and uses the information stored off in step 7b to direct the TLP back to the proper drive.
- SSD writes NVMe completion to the storage controller Completion queue.
- the storage controller intercepts the NVMe completion and directs it back to the requesting host's proper completion queue based on the fabric information stored off in step 5b.
- HBA writes a Message Signaled Interrupt (MSI-X) based on the completion shadow data.
- Storage controller writes queue index head pointer to the drive Completion Queue Head Doorbell register.
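The distinctive part of the write flow above is the HBA's role in steps 7b through 9: it remembers fabric routing information for each outstanding TLP read request so that the host's read-completion data can be steered back to the drive that asked for it. A minimal sketch, with illustrative names and host memory reduced to a dictionary:

```python
class HBA:
    """Sketch of write-flow steps 7b-9: the HBA stores fabric info per
    outstanding TLP read request, lets the host return the write data,
    and directs each TLP read completion back to the proper drive."""
    def __init__(self, host_memory):
        self.host_memory = host_memory
        self.pending = {}   # tlp_tag -> drive fabric address (step 7b)

    def on_tlp_read_request(self, tlp_tag, addr, drive_fabric_addr):
        self.pending[tlp_tag] = drive_fabric_addr   # 7b: store fabric info
        data = self.host_memory[addr]               # 8: host returns the data
        return self.on_tlp_completion(tlp_tag, data)

    def on_tlp_completion(self, tlp_tag, data):
        drive = self.pending.pop(tlp_tag)           # 9: direct back to the drive
        return {"to": drive, "data": data}

hba = HBA(host_memory={0x2000: b"write payload"})
pkt = hba.on_tlp_read_request(tlp_tag=11, addr=0x2000,
                              drive_fabric_addr="ssd-3")
print(pkt)
```

As in the read flow, the SSD behaves as if it were locally attached: it simply issues TLP reads for the write data, and the HBA supplies the fabric routing it cannot know about.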
- Figure 37 illustrates a node to node protocol in an example embodiment providing the ability for a plurality of data storage access systems to inter-communicate via unicast, multicast, or broadcast data transmissions using the queuing methodologies described herein.
- nodes can be servers, hosts, storage controllers, or other fabric-connected components. Requests and completions can be submitted and posted, generally using the first two queues. This allows for information to be moved amongst the servers, hosts, storage controllers, and other components connected together or sharing the same interconnection network.
- the protocol is beneficial and useful for a variety of reasons, including:
- Figure 38 illustrates an example embodiment of a component of the data storage access system as used within an existing host server.
- Figure 38 illustrates the Host Server Bus Adapter component of an example embodiment. This component implements the queues, tagging, and data transfer for host-to-array-device and/or host-to-host communications.
- Figure 39 is a flow diagram illustrating the basic processing flow 401 for a particular embodiment of a method for accessing multiple storage devices from multiple hosts without use of remote direct memory access (RDMA).
- an example embodiment includes: providing a data store switch fabric enabling data communications between a data storage access system and a plurality of compute nodes, each compute node having integrated compute capabilities, data storage, and a network interface controller (Host NIC) (processing block 410); providing a plurality of physical data storage devices (processing block 420); providing a host bus adapter (HBA) in data communication with the plurality of physical data storage devices and the plurality of compute nodes via the data store switch fabric, the HBA including at least one submission queue and a corresponding shadow queue (processing block 430); receiving an input/output (I/O) request from the plurality of compute nodes (processing block 440); including an element of the I/O request to the at least one submission queue (processing block 450); and
- Figure 40 shows a diagrammatic representation of a machine in the example form of a mobile computing and/or communication system 700 within which a set of instructions when executed and/or processing logic when activated may cause the machine to perform any one or more of the methodologies described and/or claimed herein.
- the machine operates as a standalone device or may be connected (e.g., networked) to other machines.
- the machine may operate in the capacity of a server or a client machine in a server-client network environment, or as a peer machine in a peer-to-peer (or distributed) network environment.
- the machine may be a server computer, a personal computer (PC), a laptop computer, a tablet computing system, a Personal Digital Assistant (PDA), a cellular telephone, a smartphone, a web appliance, a set-top box (STB), a network router, switch or bridge, or any machine capable of executing a set of instructions (sequential or otherwise) or activating processing logic that specify actions to be taken by that machine.
- the example mobile computing and/or communication system 700 includes a data processor 702 (e.g., a System-on-a-Chip (SoC), general processing core, graphics core, and optionally other processing logic) and a memory 704, which can communicate with each other via a bus or other data transfer system 706.
- the mobile computing and/or communication system 700 may further include various input/output (I/O) devices and/or interfaces 710, such as a touchscreen display, an audio jack, and optionally a network interface 712.
- the network interface 712 can include a standard wired network interface, such as an Ethernet connection, or one or more radio transceivers configured for compatibility with any one or more standard wireless and/or cellular protocols or access technologies (e.g., 2nd (2G), 2.5G, 3rd (3G), 4th (4G) generation, and future generation radio access for cellular systems).
- GSM Global System for Mobile communication
- GPRS General Packet Radio Services
- EDGE Enhanced Data GSM Environment
- WCDMA Wideband Code Division Multiple Access
- LTE, CDMA2000, WLAN, Wireless Router (WR) mesh, and the like.
- Network interface 712 may also be configured for use with various other wired and/or wireless communication protocols, including TCP/IP, UDP, SIP, SMS, RTP, WAP, CDMA, TDMA, UMTS, UWB, Wi-Fi, WiMax, Bluetooth, IEEE 802.11x, and the like.
- network interface 712 may include or support virtually any wired and/or wireless communication mechanisms by which information may travel between the mobile computing and/or communication system 700 and another computing or communication system via network 714.
- Sensor logic 720 provides the sensor hardware and/or software to capture sensor input from a user action or system event that is used to assist in the configuration of the data storage system as described above.
- the memory 704 can represent a machine-readable medium on which is stored one or more sets of instructions, software, firmware, or other processing logic (e.g., logic 708) embodying any one or more of the methodologies or functions described and/or claimed herein.
- the logic 708, or a portion thereof, may also reside, completely or at least partially, within the processor 702 during execution thereof by the mobile computing and/or communication system 700.
- the memory 704 and the processor 702 may also constitute machine-readable media.
- the logic 708, or a portion thereof, may also be configured as processing logic or logic, at least a portion of which is partially implemented in hardware.
- the logic 708, or a portion thereof, may further be transmitted or received over a network 714 via the network interface 712.
- machine-readable medium of an example embodiment can be a single medium
- machine-readable medium should be taken to include a single non-transitory medium or multiple non-transitory media (e.g., a centralized or distributed database, and/or associated caches and computing systems) that store the one or more sets of instructions.
- machine-readable medium can also be taken to include any non-transitory medium that is capable of storing, encoding or carrying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies of the various embodiments, or that is capable of storing, encoding or carrying data structures utilized by or associated with such a set of instructions.
- machine-readable medium can accordingly be taken to include, but not be limited to, solid-state memories, optical media, and magnetic media.
- Applications that may include the apparatus and systems of various embodiments broadly include a variety of electronic devices and computer systems. Some embodiments implement functions in two or more specific interconnected hardware modules or devices with related control and data signals communicated between and through the modules, or as portions of an application-specific integrated circuit. Thus, the example system is applicable to software, firmware, and hardware implementations.
- module that is configured and operates to perform certain operations as described herein.
- the "module” may be implemented mechanically or electronically.
- a module may comprise dedicated circuitry or logic that is permanently configured (e.g., within a special-purpose processor) to perform certain operations.
- a module may also comprise programmable logic or circuitry (e.g., as encompassed within a general-purpose processor or other programmable processor) that is temporarily configured by software to perform certain operations. It will be appreciated that the decision to implement a module mechanically, in the dedicated and permanently configured circuitry, or in temporarily configured circuitry (e.g., configured by software) may be driven by cost and time considerations.
- a module should be understood to encompass a functional entity, be that an entity that is physically constructed, permanently configured (e.g., hardwired) or temporarily configured (e.g., programmed) to operate in a certain manner and/or to perform certain operations described herein.
- machine-readable medium 704 or 708 is shown in an example embodiment to be a single medium, the term “machine-readable medium” should be taken to include a single medium or multiple media (e.g., a centralized or distributed database, and/or associated caches and servers) that store the one or more sets of instructions.
- the term “machine- readable medium” shall also be taken to include any non-transitory medium that is capable of storing, encoding or embodying a set of instructions for execution by the machine and that cause the machine to perform any one or more of the methodologies described herein.
- the term “machine-readable medium” shall accordingly be taken to include, but not be limited to, solid- state memories, optical media, and magnetic media.
- the software and/or related data may be transmitted over a network using a transmission medium.
- transmission medium shall be taken to include any medium that is capable of storing, encoding or carrying instructions for transmission to and execution by the machine, and includes digital or analog communication signals or other intangible media to facilitate transmission and communication of such software and/or data.
- terms such as “first”, “second”, etc. are used for descriptive purposes only and are not to be construed as limiting.
- the elements, materials, geometries, dimensions, and sequence of operations may all be varied to suit particular applications. Parts of some embodiments may be included in, or substituted for, those of other embodiments. While the foregoing examples of dimensions and ranges are considered typical, the various embodiments are not limited to such dimensions or ranges.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
Claims
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US14/712,372 US9483431B2 (en) | 2013-04-17 | 2015-05-14 | Method and apparatus for accessing multiple storage devices from multiple hosts without use of remote direct memory access (RDMA) |
PCT/US2016/029856 WO2016182756A1 (en) | 2015-05-14 | 2016-04-28 | Accessing multiple storage devices from multiple hosts without remote direct memory access (rdma) |
Publications (2)
Publication Number | Publication Date |
---|---|
EP3295321A1 true EP3295321A1 (en) | 2018-03-21 |
EP3295321A4 EP3295321A4 (en) | 2019-04-24 |
Family
ID=57248375
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP16793169.0A Withdrawn EP3295321A4 (en) | 2015-05-14 | 2016-04-28 | Accessing multiple storage devices from multiple hosts without remote direct memory access (rdma) |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP3295321A4 (en) |
WO (1) | WO2016182756A1 (en) |
Families Citing this family (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10452316B2 (en) | 2013-04-17 | 2019-10-22 | Apeiron Data Systems | Switched direct attached shared storage architecture |
CN108228082B (en) * | 2016-12-21 | 2021-04-02 | 伊姆西Ip控股有限责任公司 | Storage system and method for storage control |
US11023275B2 (en) * | 2017-02-09 | 2021-06-01 | Intel Corporation | Technologies for queue management by a host fabric interface |
US10733137B2 (en) | 2017-04-25 | 2020-08-04 | Samsung Electronics Co., Ltd. | Low latency direct access block storage in NVME-of ethernet SSD |
US11366610B2 (en) | 2018-12-20 | 2022-06-21 | Marvell Asia Pte Ltd | Solid-state drive with initiator mode |
WO2020186270A1 (en) * | 2019-03-14 | 2020-09-17 | Marvell Asia Pte, Ltd. | Ethernet enabled solid state drive (ssd) |
EP3939237B1 (en) | 2019-03-14 | 2024-05-15 | Marvell Asia Pte, Ltd. | Transferring data between solid state drives (ssds) via a connection between the ssds |
WO2020183246A2 (en) | 2019-03-14 | 2020-09-17 | Marvell Asia Pte, Ltd. | Termination of non-volatile memory networking messages at the drive level |
CN114138178B (en) * | 2021-10-15 | 2023-06-09 | 苏州浪潮智能科技有限公司 | IO processing method and system |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
DE10123821A1 (en) * | 2000-06-02 | 2001-12-20 | Ibm | Switched Ethernet network has a method for assigning priorities to user groups so that a quality of service guarantee can be provided by ensuring that packets for one or more groups are given priority over other groups |
US7290086B2 (en) * | 2003-05-28 | 2007-10-30 | International Business Machines Corporation | Method, apparatus and program storage device for providing asynchronous status messaging in a data storage system |
US7633955B1 (en) * | 2004-02-13 | 2009-12-15 | Habanero Holdings, Inc. | SCSI transport for fabric-backplane enterprise servers |
US7602774B1 (en) * | 2005-07-11 | 2009-10-13 | Xsigo Systems | Quality of service for server applications |
US20090080428A1 (en) * | 2007-09-25 | 2009-03-26 | Maxxan Systems, Inc. | System and method for scalable switch fabric for computer network |
US8677027B2 (en) * | 2011-06-01 | 2014-03-18 | International Business Machines Corporation | Fibre channel input/output data routing system and method |
US9727501B2 (en) * | 2011-10-31 | 2017-08-08 | Brocade Communications Systems, Inc. | SAN fabric online path diagnostics |
US9176799B2 (en) * | 2012-12-31 | 2015-11-03 | Advanced Micro Devices, Inc. | Hop-by-hop error detection in a server system |
US9756128B2 (en) * | 2013-04-17 | 2017-09-05 | Apeiron Data Systems | Switched direct attached shared storage architecture |
US9430412B2 (en) | 2013-06-26 | 2016-08-30 | Cnex Labs, Inc. | NVM express controller for remote access of memory and I/O over Ethernet-type networks |
2016
- 2016-04-28 EP EP16793169.0A patent/EP3295321A4/en not_active Withdrawn
- 2016-04-28 WO PCT/US2016/029856 patent/WO2016182756A1/en unknown
Also Published As
Publication number | Publication date |
---|---|
EP3295321A4 (en) | 2019-04-24 |
WO2016182756A1 (en) | 2016-11-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US9898427B2 (en) | Method and apparatus for accessing multiple storage devices from multiple hosts without use of remote direct memory access (RDMA) | |
US9756128B2 (en) | Switched direct attached shared storage architecture | |
US10452316B2 (en) | Switched direct attached shared storage architecture | |
US11269518B2 (en) | Single-step configuration of storage and network devices in a virtualized cluster of storage resources | |
US11580041B2 (en) | Enabling use of non-volatile media—express (NVME) over a network | |
EP3295321A1 (en) | Accessing multiple storage devices from multiple hosts without remote direct memory access (rdma) | |
US11922070B2 (en) | Granting access to a storage device based on reservations | |
US9934194B2 (en) | Memory packet, data structure and hierarchy within a memory appliance for accessing memory | |
KR102318477B1 (en) | Stream identifier based storage system for managing array of ssds | |
US9720606B2 (en) | Methods and structure for online migration of data in storage systems comprising a plurality of storage devices | |
WO2016196766A2 (en) | Enabling use of non-volatile media - express (nvme) over a network | |
US20150106578A1 (en) | Systems, methods and devices for implementing data management in a distributed data storage system | |
US20160132541A1 (en) | Efficient implementations for mapreduce systems | |
CN111722786A (en) | Storage system based on NVMe equipment | |
KR20160037827A (en) | Offload processor modules for connection to system memory | |
CN103595799A (en) | Method for achieving distributed shared data bank | |
US11416176B2 (en) | Function processing using storage controllers for load sharing | |
KR20210124082A (en) | Systems and methods for composable coherent devices | |
EP3679478A1 (en) | Scalable storage system | |
US20230328008A1 (en) | Network interface and buffer control method thereof | |
US20200310658A1 (en) | Machine learning for local caching of remote data in a clustered computing environment | |
US20190129855A1 (en) | Cache Sharing in Virtual Clusters | |
WO2014077451A1 (en) | Network distributed file system and method using iscsi storage system | |
US11921658B2 (en) | Enabling use of non-volatile media-express (NVMe) over a network | |
Krevat et al. | Understanding Inefficiencies in Data-Intensive Computing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE |
|
PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE |
|
17P | Request for examination filed |
Effective date: 20171114 |
|
AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
AX | Request for extension of the european patent |
Extension state: BA ME |
|
RIN1 | Information on inventor provided before grant (corrected) |
Inventor name: LOMELINO, LAWRENCE W. Inventor name: CHRIST, CHRISTOPHER Inventor name: LAHR, STEVEN R. Inventor name: BERGSTEN, JAMES R. |
|
DAV | Request for validation of the european patent (deleted) | ||
DAX | Request for extension of the european patent (deleted) | ||
A4 | Supplementary search report drawn up and despatched |
Effective date: 20190327 |
|
RIC1 | Information provided on ipc code assigned before grant |
Ipc: G06F 12/02 20060101ALI20190321BHEP Ipc: G06F 15/16 20060101ALI20190321BHEP Ipc: G06F 3/06 20060101ALI20190321BHEP Ipc: H04L 29/08 20060101AFI20190321BHEP Ipc: G06F 13/16 20060101ALI20190321BHEP Ipc: G06F 13/38 20060101ALI20190321BHEP |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: EXAMINATION IS IN PROGRESS |
|
17Q | First examination report despatched |
Effective date: 20200708 |
|
RAP1 | Party data changed (applicant data changed or rights of an application transferred) |
Owner name: MONETA VENTURES FUND II LP |
|
RAP1 | Party data changed (applicant data changed or rights of an application transferred) |
Owner name: WHITE ROOK TECHNOLOGIES, INC. |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: EXAMINATION IS IN PROGRESS |
|
STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
18D | Application deemed to be withdrawn |
Effective date: 20210119 |