WO2023244948A1 - Graph-based storage management - Google Patents

Graph-based storage management

Info

Publication number
WO2023244948A1
Authority
WO
WIPO (PCT)
Prior art keywords
volume
storage
graph
node
function
Prior art date
Application number
PCT/US2023/068250
Other languages
English (en)
Inventor
Jaspal Kohli
Shwetashree Virajamangala
Sudip Chandra TALUKDER
Stimit Kishor OAK
Hari Krishna MUDALIAR
Original Assignee
Microsoft Technology Licensing, LLC
Priority date
Filing date
Publication date
Application filed by Microsoft Technology Licensing, LLC
Priority claimed from US18/332,461 (published as US20230409215A1)
Publication of WO2023244948A1

Classifications

    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers, including:
    • G06F3/0605 Improving or facilitating administration, e.g. storage management, by facilitating the interaction with a user or administrator
    • G06F3/061 Improving I/O performance
    • G06F3/0617 Improving the reliability of storage systems in relation to availability
    • G06F3/0631 Configuration or reconfiguration of storage systems by allocating resources to storage systems
    • G06F3/0644 Management of space entities, e.g. partitions, extents, pools
    • G06F3/065 Replication mechanisms
    • G06F3/0664 Virtualisation aspects at device level, e.g. emulation of a storage device or system
    • G06F3/0665 Virtualisation aspects at area level, e.g. provisioning of virtual or logical volumes
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • G06F3/0688 Non-volatile semiconductor memory arrays
    • G06F3/0689 Disk arrays, e.g. RAID, JBOD

Definitions

  • a data center may comprise a facility that hosts applications and services for subscribers, e.g., customers or tenants of the data center.
  • the data center may, for example, host all of the infrastructure equipment, such as compute nodes, networking and storage systems, power systems, and environmental control systems.
  • clusters of storage systems and application servers are interconnected via a high-speed switch fabric provided by one or more tiers of physical network switches and routers.
  • Data centers vary greatly in size, with some public data centers containing hundreds of thousands of servers, and are usually distributed across multiple geographies for redundancy.
  • this disclosure describes operations performed by a compute node, storage node, computing system, network device, and/or storage cluster in accordance with one or more aspects of this disclosure.
  • this disclosure describes a method comprising allocating, by a storage cluster having a plurality of storage nodes, a volume of storage within the storage cluster; generating a volume graph of the volume, wherein the volume graph represents one or more functional elements in a data plane of the volume; and managing the volume based on the volume graph.
  • this disclosure describes a computing system comprising: a network interface for interconnecting the computing system with at least one other computing system to form a plurality of computing systems over a network; and at least one storage node, wherein the at least one storage node is part of a storage cluster formed by the plurality of computing systems, and wherein the computing system is configured to: allocate a volume of storage within the storage cluster; generate a volume graph of the volume, wherein the volume graph represents one or more functional elements in a data plane of the volume; and manage the volume based on the volume graph.
  • this disclosure describes a computer-readable storage medium comprising instructions that, when executed, cause one or more processors to: allocate a volume of storage within a storage cluster; generate a volume graph of the volume, wherein the volume graph represents one or more functional elements in a data plane of the volume; and manage the volume based on the volume graph.
  • FIG. 1A is a block diagram illustrating an example system including one or more computing devices configured to support graph-based storage management, in accordance with one or more aspects of the present disclosure.
  • FIG. 1B is a simplified block diagram illustrating an example storage cluster, in accordance with one or more aspects of the present disclosure.
  • FIG. 2A is a block diagram illustrating a system having a data processing unit (DPU) configured to support graph-based storage management, in accordance with the techniques described in this disclosure.
  • FIG. 2B is a block diagram illustrating hardware components of an example DPU, in accordance with the techniques of this disclosure.
  • FIG. 5 illustrates an example graph-based representation of snapshots of a volume, in accordance with the techniques described in this disclosure.
  • FIG. 6 is a block diagram illustrating an example interaction of a volume with a clone of a user volume for read and write operations, in accordance with the techniques described in this disclosure.
  • FIGS. 7A-7B are block diagrams illustrating an example failover operation using a volume graph, in accordance with the techniques described in this disclosure.
  • FIG. 8 is a flow diagram of an example method for graph-based storage management, in accordance with the techniques described in this disclosure.
  • the scale-out model is enabled by a data plane of the volume hosted on a full-mesh of interconnected storage nodes, and management and control planes of the volume may logically tie together operation of the storage nodes.
  • Storage nodes within a storage cluster can be configured to generate a graph-based representation of functional elements in a data plane of the volume for which the storage nodes are allocated.
  • the graph representation of the volume, also referred to as a “volume graph,” can be used to manage the volume.
  • the storage nodes allocated for a volume may perform various functions, such as storage functions and/or functions offloaded from servers, such as security functions (e.g., encryption), compression and regular expression processing, data durability functions, data storage functions and network operations.
  • a volume graph of the volume can include nodes in the graph (referred to as “function nodes”) that represent the functions of the storage nodes.
  • the volume graph also includes leaf nodes that represent the resources (e.g., storage nodes and/or storage devices) allocated for the functions.
  • the volume graph further includes root nodes that represent the tenants or customers of the storage cluster associated with the volume.
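As an illustration only, the following Python sketch models the three kinds of graph nodes described above (root nodes for tenants, function nodes for storage functions, and leaf nodes for allocated resources); all class, tenant, and device names are hypothetical and not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class VolumeGraphNode:
    kind: str    # "root" (tenant), "function" (storage function), or "leaf" (resource)
    name: str
    children: List["VolumeGraphNode"] = field(default_factory=list)

def leaves_under(node: VolumeGraphNode) -> List[str]:
    """Collect the resources (leaf nodes) reachable from a node in the graph."""
    if node.kind == "leaf":
        return [node.name]
    found: List[str] = []
    for child in node.children:
        found.extend(leaves_under(child))
    return found

# Volume graph for one volume: tenant -> erasure-coding function -> devices.
volume_graph = VolumeGraphNode("root", "tenant-J", [
    VolumeGraphNode("function", "erasure-coded-volume", [
        VolumeGraphNode("leaf", "storage-node-120A/ssd-0"),
        VolumeGraphNode("leaf", "storage-node-120B/ssd-3"),
        VolumeGraphNode("leaf", "storage-node-120D/ssd-1"),
    ]),
])

print(leaves_under(volume_graph))  # resources allocated for the volume
```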
  • Techniques described herein may provide one or more technical advantages. For example, by generating a graph-based representation of functional elements in a data plane of a volume, the complexity of managing the volume is reduced. For example, to achieve a different kind of function (e.g., a different data durability scheme) for the volume, the function node that represents one data durability scheme (e.g., erasure coding) in the volume graph may simply be replaced with another function node that represents another data durability scheme (e.g., replication), and one or more leaf nodes that represent resources allocated for the new data durability scheme may be mapped to it.
  • FIG. 1A is a block diagram illustrating an example system 108 including one or more storage nodes configured to support graph-based storage management, in accordance with one or more aspects of the present disclosure. Techniques described herein may enable storage nodes to generate a graph-based representation of the functional elements in a data plane of a volume in a storage cluster. Management of the volume can be performed using the graph representation of said volume.
  • System 108 includes a data center 101 capable of providing data processing and data storage.
  • data center 101 may represent one of many geographically distributed network data centers.
  • data center 101 provides an operating environment for applications and services for tenants 11 (e.g., customers) coupled to the data center 101.
  • Data center 101 may host infrastructure equipment, such as compute nodes, networking and storage systems, redundant power supplies, and environmental controls.
  • tenants 11 are coupled to the data center 101 by service provider network 7 and gateway device 20.
  • Service provider network 7 may be coupled to one or more networks administered by other providers and may thus form part of a large-scale public network infrastructure, e.g., the Internet.
  • data center 101 is a facility that provides information services for tenants 11.
  • Tenants 11 may be collective entities, such as enterprises and governments, or individuals.
  • a network data center may host web services for several enterprises and end users.
  • Other exemplary services may include data storage, virtual private networks, file storage services, data mining services, scientific- or super-computing services, and so on.
  • Controller 130 may also be responsible for allocating and accounting for resources for a volume, which may refer to a conceptual abstraction of a storage unit within the storage cluster.
  • a volume may be a storage container divided into fixed size blocks and be capable of being allocated and deallocated by controller 130 as well as being written to and read from by nodes or other devices within the data center 101.
  • NCSU 40A through NCSU 40N represent any number of NCSUs.
  • N may represent any number and may vary across and/or within the Figures and their descriptions.
  • nodes 17A-17N and NCSUs 40A-40N may each have a different “N.”
  • data center 101 may include many NCSUs, and multiple NCSUs 40 may be organized into racks 70, which may be logical racks and/or physical racks within data center 101.
  • two NCSUs may compose a logical rack
  • four NCSUs may compose a physical rack.
  • Such other arrangements may include nodes 17 within a rack 70 being relatively independent and not logically or physically included within any node group or NCSUs 40.
  • FIG. 1A illustrates rack 70B, which includes nodes 17A through 17N (representing any number of nodes 17).
  • Nodes 17 of rack 70B may be configured to store data within one or more storage devices 127 (included within or connected to such nodes 17) in accordance with techniques described herein.
  • Nodes 17 within rack 70B may be viewed as network interface subsystems that serve as a data storage node configured to store data across storage devices 127.
  • nodes 17 within rack 70B are not organized into groups or units, but instead, are relatively independent of each other, and are each capable of performing storage functions described herein. In other examples, however, nodes 17 of rack 70B may be logically or physically organized into groups, units, and/or logical racks.
  • Rack 70C is illustrated as being implemented in a manner similar to rack 70B, with nodes 17 serving as storage nodes configured to store data within storage devices 127 (included within or connected to such nodes 17).
  • although FIG. 1A illustrates one rack 70A with nodes 17 that support servers 12 and other racks 70B and 70C with nodes 17 serving as storage nodes, any number and combination of racks may be implemented.
  • any of racks 70 may include a mix of nodes 17 supporting servers 12 and nodes 17 serving as storage nodes.
  • Nodes 17 of rack 70B may be devices or systems that are the same as or similar to nodes 17 of rack 70A.
  • nodes 17 of rack 70B may have different capabilities than those of rack 70A and/or may be implemented differently.
  • nodes 17 of rack 70B may be somewhat more capable than nodes 17 of rack 70A (e.g., more computing power, more memory capacity, more storage capacity, and/or additional capabilities).
  • each of nodes 17 of rack 70B may be implemented by using a pair of nodes 17 of rack 70A.
  • nodes 17 of rack 70B and 70C are illustrated in FIG. 1A as being larger than nodes 17 of rack 70A.
  • each node 17 may be a highly programmable I/O processor specially designed for performing storage functions and/or for offloading certain functions from servers 12.
  • Each node 17 may be implemented as a component (e.g., electronic chip) within a device (e.g., compute node, application server, or storage server), and may be deployed on a motherboard of the device or within a removable card, such as a storage and/or network interface card.
  • each node 17 may be implemented as one or more application-specific integrated circuit (ASIC) or other hardware and software components, each supporting a subset of storage devices 127 or a subset of servers 12.
  • each node 17 includes a number of internal processor clusters, each including two or more processing cores and equipped with hardware engines that can offload certain functions from servers 12, such as security functions (e.g., encryption), acceleration (e.g., compression) and regular expression (RegEx) processing, data durability functions (e.g., erasure coding, replication, etc.), data storage functions, and network operations.
  • One or more nodes 17 may include a data durability module or unit, referred to as an “accelerator” unit, which may be implemented as a dedicated module or unit for performing data durability functions.
  • one or more computing devices may include a node including one or more data durability, data storage, or other such modules or units.
  • One or more nodes 17 may generate a graph-based representation of the functional elements in a data path of a volume of the storage cluster.
  • the volume graph may include various layers of abstraction of the volume, which represent one or more data storage schemes (e.g., data durability, data reliability, etc.).
  • One or more nodes 17 and storage devices 127 may be used to implement the one or more data storage schemes.
  • in the example of FIG. 1A, controller 130 may include a graph generation module 131 to generate volume graphs for target nodes, e.g., volume graph 135, which represents storage functions and/or functions offloaded from servers 12 and the resources allocated for those functions, such as nodes 17 and/or storage devices 127.
  • Node 17A of rack 70B may use volume graph 135 for managing the storage cluster.
  • node 17A of rack 70B may convey information based on volume graph 135 to each DPU to manage the storage cluster.
  • node 17A of rack 70B may use volume graph 135 for modifying existing volumes, resource allocation, event management, dynamic rebalancing of resources, and/or volume property modification for a storage cluster.
  • An example of a volume graph and use of the volume graph is described herein with respect to FIG. 4.
  • dotted lines from each of volumes 121J and 121K are intended to illustrate that such volumes 121 are each stored across multiple storage nodes 120. Although only two volumes are illustrated in FIG. 1B, storage cluster 102 may support any number of volumes 121 for any number of tenants. Moreover, while a single tenant is illustrated in FIG. 1B for each of volumes 121, a volume may be allocated for use by a plurality of tenants.
  • controller 130 in FIG. 1B can provide cluster management orchestration of storage resources within storage cluster 102.
  • controller 130 in FIG. 1B may be implemented through any suitable computing system, including one or more compute nodes within data center 101 or storage cluster 102. Although illustrated as a single system within storage cluster 102 in FIG. 1B, controller 130 may be implemented as multiple systems and/or as a distributed system that resides both inside and outside data center 101 and/or storage cluster 102. In other examples, some or all aspects of controller 130 may be implemented outside of data center 101, such as in a cloud-based implementation.
  • controller 130 includes storage services module 132 and data store 133.
  • Storage services module 132 of controller 130 may perform functions relating to establishing, allocating, and enabling read and write access to one or more volumes 121 within storage cluster 102.
  • storage services module 132 may perform functions that can be characterized as “cluster services” or “storage services,” which may include allocating, creating, and/or deleting volumes.
  • storage services module 132 may also provide services that help with compliance with quality of service standards for volumes 121 within storage cluster 102.
  • storage services module 132 may also manage input from one or more administrators (e.g., operating administrator device 134).
  • storage services module 132 may have a full view of all resources within storage cluster 102 and how such resources are allocated across volumes 121.
  • Data store 133 may represent any suitable data structure or storage medium for storing information related to resources within storage cluster 102 and how such resources are allocated within storage cluster 102 and/or across volumes 121. Data store 133 may be primarily maintained by storage services module 132.
  • Initiator nodes 110 illustrated in FIG. 1B may be involved in causing or initiating a read and/or write operation with storage cluster 102.
  • DPUs 118 within each of initiator nodes 110 may serve as the data-path hub for each of initiator nodes 110, connecting each of initiator nodes 110 (and storage nodes 120) through switch fabric 114.
  • one or more of initiator nodes 110 may be an x86 server that may execute NVMe (Non-Volatile Memory Express) over a communication protocol, such as TCP.
  • other protocols may be used, including “FCP” as described in United States Patent No. 11,178,262, entitled “FABRIC CONTROL PROTOCOL FOR DATA CENTER NETWORKS WITH PACKET SPRAYING OVER MULTIPLE ALTERNATE DATA PATHS”.
  • Each of storage nodes 120 may be implemented by the nodes 17 and storage devices 127 that are illustrated in FIG. 1A. Accordingly, the description of such nodes 17 and storage devices 127 in FIG. 1A may therefore apply to DPUs 117 and storage devices 127 of FIG. 1B, respectively.
  • Storage nodes 120 are illustrated in FIG. 1B such as to emphasize that, in some examples, each of storage nodes 120 may serve as storage targets for initiator nodes 110.
  • FIG. 1B also includes conceptual illustrations of volumes 121J and 121K.
  • volumes 121 may serve as storage containers for data associated with tenants of storage cluster 102, where each such volume is an abstraction intended to represent a set of data that is stored across one or more storage nodes 120.
  • each of volumes 121 may be divided into fixed size blocks and may support multiple operations.
  • such operations generally include a read operation (e.g., reading one or more fixed-size blocks from a volume) and a write operation (e.g., writing one or more fixed-size blocks to a volume).
  • Other operations are possible and are within the scope of this disclosure.
  • controller 130 may receive a request to allocate a volume. For instance, in an example that can be described with reference to FIG. 1B, controller 130 can detect input that it determines corresponds to a request to create a new volume.
  • the input originates from one or more of initiator nodes 110, seeking to allocate new storage for a tenant of storage cluster 102 (e.g., tenant “J” or tenant “K” depicted in FIG. 1B).
  • the input may originate from an administrator device (e.g., administrator device 134), which may be operated by an administrator seeking to allocate new storage on behalf of a tenant of storage cluster 102.
  • the input may originate from a different device.
  • Controller 130 may allocate a volume based on one or more rules (or criteria).
  • the rules can be based on one or more metrics, such as input/output operations per second (“IOPs”) availability, storage capacity availability, failure or fault domains, quality of service standards, and/or volume type, such as a durability schema (e.g., erasure coding, replication).
  • controller 130 may receive information describing the one or more rules, where the information is from or derived from input originating from an administrator (e.g., through administrator device 134). In other examples, such input may originate from a representative of the tenant (e.g., through a client device, not specifically shown in FIG. 1B), where the representative selects or specifies rules for storage cluster 102.
  • Controller 130 can output the information about the request to allocate a new volume to storage services module 132, which evaluates the information and determines that the request is for a new volume that is to be allocated for a specific tenant (e.g., tenant “J” in the example being described).
  • Storage services module 132 further determines, based on the input received by controller 130, information about the volume type and the quality of service to be associated with the new volume.
  • Storage services module 132 accesses data store 133 and determines which of storage nodes 120 may be allocated to support the new volume. Any number of storage nodes may be allocated to support a volume. In the depicted example of FIG. 1B, volume 121J is allocated using three storage nodes 120.
  • such a determination may involve evaluating which DPUs 117 and storage devices 127 within storage nodes 120 are available to be involved in serving read and write requests to the new volume. For example, storage services module 132 may determine which of DPUs 117 and/or storage devices 127 have enough IOPs needed for the volume. Storage services module 132 may additionally, or alternatively, determine which of storage devices 127 have enough storage capacity for the volume.
  • storage services module 132 may determine which DPUs 117 and storage devices 127 within storage nodes 120 may provide data protection for the new volume. For example, to determine which DPUs 117 and storage devices 127 within storage nodes 120 may be allocated to support the new volume, storage services module 132 may determine whether the DPUs 117 and storage devices 127 are located in different failure domains or fault domains to reduce the likelihood that more than one DPU 117 and storage device 127 will be lost or unavailable at the same time.
  • storage services module 132 may determine the usage of DPUs 117 and storage devices 127 within storage nodes 120 and allocate the DPUs 117 and storage devices 127 for the new volume in a manner that load balances the usage of DPUs 117 and storage devices 127 within storage nodes 120. For example, storage services module 132 may determine the IOPs usage of DPUs 117 and storage devices 127, apply a cost to each of DPUs 117 and storage devices 127 based on their IOPs usage, and allocate the DPUs 117 and storage devices for the new storage volume based on the cost of IOPs usage (e.g., DPUs 117 and storage devices 127 with the lowest cost can be allocated). Other criteria can be utilized.
  • storage services module 132 may determine the cost of each storage device 127 based on their storage capacity usage, and DPUs 117 and storage devices 127 can be allocated based on the determined cost (e.g., DPUs 117 and storage devices 127 with the lowest cost can be allocated).
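The cost-based selection described above might look like the following minimal Python sketch; the weighting of IOPs versus capacity usage and all device names and figures are assumed purely for illustration.

```python
def allocation_cost(device, iops_weight=0.5, capacity_weight=0.5):
    """Higher current usage means higher cost to place new load on the device."""
    return (iops_weight * device["iops_used"] / device["iops_max"]
            + capacity_weight * device["gb_used"] / device["gb_max"])

def allocate(devices, count):
    """Pick the `count` lowest-cost devices for the new volume."""
    return sorted(devices, key=allocation_cost)[:count]

candidates = [
    {"id": "120A/ssd-0", "iops_used": 40_000, "iops_max": 100_000, "gb_used": 500, "gb_max": 1000},
    {"id": "120B/ssd-3", "iops_used": 90_000, "iops_max": 100_000, "gb_used": 200, "gb_max": 1000},
    {"id": "120D/ssd-1", "iops_used": 10_000, "iops_max": 100_000, "gb_used": 100, "gb_max": 1000},
]
for device in allocate(candidates, count=2):
    print(device["id"], round(allocation_cost(device), 2))
```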
  • storage services module 132 may designate storage node 120A as a “primary” target node that serves as a primary target or interaction node for operations involving the volume, with one or more of storage nodes 120A, 120B, and 120D (the storage nodes that are included within the volume) serving as plex nodes that are used to store data associated with the volume.
  • Plex nodes may be used to store the data associated with a volume and may be managed by the primary target node.
  • a “plex” may represent a unit of data (e.g., located on an individual drive) that is a member of a particular volume (e.g., erasure coded volume).
  • volume 121J may include one or more plex nodes local and/or remote to a storage node (e.g., storage nodes 120A, 120B, and 120D).
  • a storage node 120 may have plex nodes for a plurality of volumes.
  • storage node 120B may have one or more plex nodes for volume 121J and one or more plex nodes for volume 121K.
  • volume 121J may provide journaling for data reliability, in which an intent log (i.e., journal) including data and metadata of the primary target node (e.g., storage node 120A) is replicated to the secondary target node (e.g., storage node 120B) such that any write that is acknowledged to the host server for the application (e.g., servers 112) can be reliably performed to the underlying storage media in response to failure of the primary target node.
  • storage services module 132 ensures that the designated primary target node (e.g., storage node 120A) and the secondary target node (e.g., storage node 120B) are assigned to different storage nodes 120 or fault domains.
  • Plex nodes can also be stored across different storage nodes 120 or fault domains.
  • the same storage node 120 may be used for both a plex node and the primary target (or, alternatively, as a plex node and the secondary target node).
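A minimal sketch of the journaling behavior described above, assuming simplified in-memory structures and hypothetical node names: the intent log is replicated to the secondary target before a write is acknowledged, so the secondary can replay it if the primary fails.

```python
class TargetNode:
    def __init__(self, name):
        self.name = name
        self.journal = []   # intent log of (offset, data) entries
        self.blocks = {}    # committed block storage

    def log(self, offset, data):
        self.journal.append((offset, data))

    def replay(self):
        """Apply journaled writes to the underlying storage (e.g., after failover)."""
        for offset, data in self.journal:
            self.blocks[offset] = data
        self.journal.clear()

def write(primary, secondary, offset, data):
    primary.log(offset, data)
    secondary.log(offset, data)        # replicate the intent log before acking
    return "ack"                       # safe: either node can complete the write

primary = TargetNode("storage-node-120A")
secondary = TargetNode("storage-node-120B")
print(write(primary, secondary, offset=0, data=b"block-0"))
secondary.replay()                     # e.g., primary failed before committing
print(secondary.blocks)
```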
  • a volume graph of the functional elements in the data plane of a volume (e.g., volume 121J) is used to manage the volume within a storage cluster (e.g., storage cluster 102).
  • the volume graph can be generated in various ways.
  • controller 130 may include a graph generation module 131 to generate a volume graph for target nodes that represents storage functions and/or functions offloaded from servers 112 and the allocated resources for the functions, such as storage nodes 120 and/or storage devices 127.
  • graph generation module 131 may generate volume graph 135A for a primary target node for volume 121J (e.g., storage node 120A), where volume graph 135A may represent the functional elements in the data plane of volume 121J.
  • a volume graph 135 can be configured in various graph structures.
  • volume graph 135 may represent a tree structure including function nodes representing the functions associated with a volume, with leaf nodes representing resources allocated for those functions.
  • volume graph 135A may include a root node representing host servers associated with volume 121J, and an intermediate node (e.g., function node) that represents a data durability operation implemented by volume 121J, with one or more leaf nodes that represent the resources allocated for the data durability operation.
  • Storage nodes 120 may be allocated to a volume for various functions.
  • volume graph 135A may include the storage nodes 120 allocated for a data durability scheme of volume 121J, such as an erasure coding scheme.
  • an erasure coding block size of volume 121J may be represented as m + n, where the variable m is the original amount of data and the variable n is the extra or redundant data added to provide protection from failures.
  • storage services module 132 may allocate DPUs 117 and storage devices 127 within storage nodes 120A, 120B, and 120D in accordance with the erasure coding scheme (e.g., m + n).
  • graph generation module 131 may generate volume graph 135A that represents the DPUs 117 and storage devices 127 within storage nodes 120A, 120B, and 120D that are allocated for the erasure coding scheme for volume 121J.
  • volume graph 135A may include a function node that represents an erasure coded volume for volume 121J, with leaf nodes representing the allocated resources for the erasure coded volume.
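A short worked example of the m + n notation, with the scheme and block size assumed purely for illustration:

```python
m, n = 4, 2                      # e.g., a 4 + 2 erasure-coding scheme (assumed values)
block_size = 4096                # bytes per block (assumed)

stripe_user_data = m * block_size          # user data carried per stripe
stripe_on_disk = (m + n) * block_size      # raw capacity consumed per stripe
overhead = stripe_on_disk / stripe_user_data
tolerated_failures = n                     # any n of the m + n blocks may be lost

print(f"storage overhead: {overhead:.2f}x, tolerates {tolerated_failures} failures")
# A 4 + 2 scheme stores 1.5x the user data and survives any two device failures,
# versus 3x overhead for three-way replication with the same fault tolerance.
```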
  • a volume graph can include one or more root nodes, each representing a host server executing an application that initiates read and write operations to a volume.
  • volume graph 135A can include a root node representing DPU 118A of initiator node 110A that may initiate read and write requests for an application executing on server 112A that correspond to volume 121J.
  • a volume graph 135 may be used for management of a volume.
  • volume graph 135A may be used to allocate resources for volume 121J, modify allocated resources for volume 121J, manage events associated with resources allocated for volume 121J, dynamically rebalance resources allocated for volume 121J, and/or manage volume property modification for volume 121J.
  • a volume graph 135 may be used to modify a data durability scheme of a volume. For instance, volume 121J may originally implement erasure coding.
  • graph generation module 131 may modify volume 121J by replacing the function node representing the erasure coded volume and its leaf nodes with a function node that represents a replication volume, with leaf nodes representing the allocated resources for the replication volume.
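The function-node replacement described above could be sketched as follows; the graph structure and all node and device names are hypothetical, not the disclosure's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    kind: str    # "root", "function", or "leaf"
    name: str
    children: List["Node"] = field(default_factory=list)

def replace_durability(root: Node, new_function: Node) -> None:
    """Swap the durability function node under the root, dropping its old leaves."""
    root.children = [new_function if child.kind == "function" else child
                     for child in root.children]

volume = Node("root", "tenant-J", [
    Node("function", "erasure-coded-volume",
         [Node("leaf", d) for d in ("120A/ssd-0", "120B/ssd-3", "120D/ssd-1")]),
])

replicated = Node("function", "replicated-volume",
                  [Node("leaf", d) for d in ("120A/ssd-2", "120C/ssd-0", "120D/ssd-4")])
replace_durability(volume, replicated)
print([child.name for child in volume.children])            # ['replicated-volume']
print([leaf.name for leaf in volume.children[0].children])  # new allocated resources
```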
  • volume graph 135 may be used to dynamically rebalance resources for a volume 121.
  • storage devices and/or nodes may obtain one or more metrics including storage usage, IOPs usage, health of the storage devices, bandwidth of the nodes, etc.
  • Storage devices and/or nodes may compare the metrics with a threshold and generate alerts if the metrics reach a certain threshold(s).
  • volume graph 135 may be used to modify volume 121 when one or more parameters used to allocate the volume 121 are changed.
  • Volume 121 can be allocated with a specified set of one or more parameters (e.g., block size, encryption keys, compression scheme, volume size, data protection scheme, etc.).
  • a volume can be allocated with a specified data protection scheme, such as erasure coding, replication, or none.
  • the one or more parameters may be changed after creation of the volume 121.
  • new parameters are validated, and a clone of the volume is created with the new parameters.
  • Storage node 120 may then attach to the clone of the volume created with the new parameters.
  • Modules illustrated in FIG. IB may perform operations described using software, hardware, firmware, or a mixture of hardware, software, and firmware residing in and/or executing at one or more computing devices.
  • a computing device may execute one or more of such modules with multiple processors or multiple devices.
  • a computing device may execute one or more of such modules as a virtual machine executing on underlying hardware.
  • One or more of such modules may execute as one or more services of an operating system or computing platform.
  • One or more of such modules may execute as one or more executable programs at an application layer of a computing platform.
  • functionality provided by a module could be implemented by a dedicated hardware device.
  • each module, data store, component, program, executable, data item, functional unit, or other item illustrated within a storage device may be implemented in various ways.
  • each module, data store, component, program, executable, data item, functional unit, or other item illustrated within a storage device may be implemented as a downloadable or pre-installed application or “app.”
  • each module, data store, component, program, executable, data item, functional unit, or other item illustrated within a storage device may be implemented as part of an operating system executed on a computing device.
  • FIG. 2A is a block diagram illustrating a system 201 having a data processing unit (DPU) 210 configured to support graph-based storage management, in accordance with the techniques described in this disclosure.
  • system 201 also includes CPU 240 communicatively coupled to DPU 210.
  • DPU 210 and CPU 240 each generally represent a hardware chip implemented in digital logic circuitry.
  • DPU 210 may operate substantially similar to any of the nodes 17 of FIG. 1A and DPUs 117 of FIG. 1B.
  • DPU 210 can be implemented as a highly programmable I/O processor with a plurality of processing cores (as discussed below, e.g., with respect to FIG. 2B).
  • DPU 210 includes a network interface (e.g., an Ethernet interface) to connect directly to a network, and a plurality of host interfaces (e.g., PCI-e interfaces) to connect directly to one or more application processors (e.g., CPU 240) and one or more storage devices (e.g., SSDs).
  • DPU 210 also includes a data plane operating system (OS) 212 executing on two or more of the plurality of processing cores.
  • Data plane OS 212 provides data plane 214 as an execution environment for a run-to-completion software function invoked on data plane OS 212 to process a work unit.
  • a work unit is associated with one or more stream data units (e.g., packets of a packet flow) and specifies the software function for processing the stream data units and at least one of the plurality of processing cores for executing the software function.
  • the software function invoked to process the work unit may be one of a plurality of software functions for processing stream data.
  • the software functions can be included in a library 220 provided by data plane OS 212.
  • library 220 includes network functions 222, storage functions 224, security functions 226, and analytics functions 228.
  • Network functions 222 may, for example, include network I/O data processing functions related to Ethernet, network overlays, networking protocols, encryption, and firewalls.
  • Storage functions 224 may, for example, include storage I/O data processing functions related to NVMe (non-volatile memory express), compression, encryption, replication, erasure coding, and pooling.
  • Security functions 226 may, for example, include security data processing functions related to encryption, regular expression processing, and hash processing.
  • Analytics functions 228 may, for example, include analytical data processing functions related to a customizable pipeline of data transformations.
  • data plane OS 212 can be implemented as a low-level, run-to-completion operating system running on bare metal of DPU 210 that runs hardware threads for data processing and manages work units.
  • data plane OS 212 can include the logic of a queue manager to manage work unit interfaces, enqueue and dequeue work units from queues, and invoke a software function specified by a work unit on a processing core specified by the work unit.
  • data plane OS 212 is configured to dequeue a work unit from a queue, process the work unit on the processing core, and return the results of processing the work unit to the queues.
  • DPU 210 also includes a multi-tasking control plane operating system 232 executing on one or more of the plurality of processing cores.
  • the multitasking control plane operating system 232 may comprise Linux, Unix, or a special-purpose operating system.
  • control plane OS 232 provides a control plane 216 including a control plane software stack executing on data plane OS 212.
  • the control plane software stack includes a hypervisor 230, the multi-tasking control plane OS 232 executing on hypervisor 230, and one or more control plane service agents 234 executing on control plane OS 232.
  • Hypervisor 230 may operate to isolate control plane OS 232 from the work unit and data processing performed on data plane OS 212.
  • Control plane service agents 234 executing on control plane OS 232 comprise application-level software configured to perform set up and tear down of software structures to support work unit processing performed by the software function executing on data plane OS 212.
  • control plane service agents 234 are configured to set up the packet flow for data packet processing by the software function on data plane OS 212 and tear down the packet flow once the packet processing is complete.
  • DPU 210 comprises a highly programmable processor that can run application-level processing while leveraging the underlying work unit data structure for parallelized stream processing.
  • the multi-tasking control plane OS may run on one or more independent processing cores that are dedicated to the control plane OS, where the one or more independent processing cores are different than the processing cores executing data plane OS 212.
  • a hypervisor may not be included in the control plane software stack.
  • the control plane software stack running on the independent processing core may include the multi-tasking control plane OS and one or more control plane service agents executing on the control plane OS.
  • CPU 240 is an application processor with one or more processing cores for computing-intensive tasks.
  • CPU 240 includes a plurality of host interfaces (e.g., PCI-e interfaces) to connect directly to DPU 210.
  • CPU 240 includes a hypervisor / OS 242 that supports one or more service agents 246 and one or more drivers 247.
  • CPU 240 may also include a virtual machine (VM) OS 244 executing on top of hypervisor / OS 242 that supports one or more drivers 248.
  • Application-level software, such as agents 246 or drivers 247 executing on OS 242, or drivers 248 executing on VM OS 244 of CPU 240, may determine which data processing tasks to offload from CPU 240 to DPU 210. For example, hypervisor / OS 242 of CPU 240 may offload data processing tasks to DPU 210 using physical functions (PFs) and/or virtual functions (VFs) of PCIe links. In some implementations, VM OS 244 of CPU 240 may offload data processing tasks to DPU 210 using VFs of PCIe links.
  • system 201 also includes a controller 200 in communication with both DPU 210 and CPU 240 via a control application programming interface (API).
  • Controller 200 may provide a high-level controller for configuring and managing application-level software executing on a control plane OS of each of DPU 210 and CPU 240.
  • controller 200 may configure and manage which data processing tasks are to be offloaded from CPU 240 to DPU 210.
  • controller 200 may comprise a software-defined networking (SDN) controller, which may operate substantially similar to controller 130 of FIGS. 1A and 1B.
  • controller 200 may operate in response to configuration input received from a network administrator via an orchestration API.
  • Data plane OS 212 of DPU 210 is configured to receive stream data units for processing on behalf of the application-level software executing on hypervisor / OS 242 of CPU 240.
  • the stream data units may comprise data packets of packet flows.
  • the received packet flows may include any of networking packet flows, storage packet flows, security packet flows, analytics packet flows, or any combination thereof.
  • Data plane OS 212 executing on one of the processing cores of DPU 210 may receive each of the packet flows in the form of one or more work units from a networking unit, host unit, or another one of the processing cores (as discussed below, e.g., with respect to FIG. 2B) of DPU 210.
  • Each of the work units for a received packet flow may be associated with one or more data packets of the packet flow.
  • data plane OS 212 can perform a lookup in a flow table to determine that the packet flow is legitimate and can map the packet flow to one of the processing cores of DPU 210 for serialized processing of the packets of the packet flow.
  • the flow table may comprise a hardware implemented flow table that is updated and maintained with legitimate packet flows by control plane 216 and used to assign processing cores to packet flows.
  • data plane OS 212 may send the packet flow through the slow path in control plane 216 for set up.
  • Control plane service agents 234 executing on control plane OS 232 can then determine whether the packet flow is legitimate and send an instruction to data plane OS 212 to set up the packet flow in the flow table.
  • data plane OS 212 may assign the packet flow to a particular processing core of DPU 210 that can do stream processing for the packet flow.
  • data plane OS 212 may execute a queue manager configured to receive a work unit associated with one or more data packets of the packet flow, enqueue the work unit to a work unit queue associated with the processing core for the packet flow, dequeue the work unit from the work unit queues to the processing core, and invoke the software function specified by the work unit on the processing core for processing the work unit.
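A minimal Python sketch of the flow-table lookup and per-core work-unit queues described above, with a hash-based core assignment standing in for the control-plane set-up; all structures and names are assumed for illustration only.

```python
from collections import deque

NUM_CORES = 4
flow_table = {}                                   # flow id -> assigned core
wu_queues = [deque() for _ in range(NUM_CORES)]   # one work-unit queue per core

def enqueue_packet(flow_id, packet):
    core = flow_table.get(flow_id)
    if core is None:                              # unknown flow: "slow path" set-up
        core = hash(flow_id) % NUM_CORES          # control plane would validate the flow
        flow_table[flow_id] = core
    wu_queues[core].append({"flow": flow_id, "packet": packet, "fn": "process"})

def run_core(core):
    # Packets of the same flow land on the same queue, so they run in order here.
    while wu_queues[core]:
        wu = wu_queues[core].popleft()
        print(f"core {core}: {wu['fn']} {wu['flow']} {wu['packet']}")

for i in range(3):
    enqueue_packet("10.0.0.1:443->10.0.0.9:5201", f"pkt-{i}")
run_core(flow_table["10.0.0.1:443->10.0.0.9:5201"])
```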
  • Data plane OS 212 also provides interfaces to one or more hardware accelerators of DPU 210 (as discussed below, e.g., with respect to FIG. 2B) configured to perform acceleration for various data processing functions.
  • Data plane OS 212 may use the hardware accelerators to process one or more portions of the packet flow, i.e., one or more work units, arranged as a work unit (WU) stack.
  • a work unit can include an identifier of a subsequent work unit within the WU stack for further processing of the packets upon completion of the work unit.
  • a hardware accelerator can be configured to perform one or more hardware commands included in the WU stack as input parameters of the first work unit and, upon completion of the one or more hardware commands, proceed to the subsequent work unit within the WU stack identified by the current work unit.
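The chaining of work units through a WU stack might be sketched as follows, with toy handler functions standing in for the hardware commands; everything here is illustrative, not the DPU's actual interface.

```python
def decrypt(data):    return data.replace("enc:", "")
def decompress(data): return data.replace("gz:", "")
def store(data):      return f"stored({data})"

handlers = {"decrypt": decrypt, "decompress": decompress, "store": store}

# The WU stack: each entry carries its function and the id of the subsequent work unit.
wu_stack = [
    {"fn": "decrypt",    "next": 1},
    {"fn": "decompress", "next": 2},
    {"fn": "store",      "next": None},
]

def run_wu_stack(stack, data):
    index = 0
    while index is not None:
        wu = stack[index]
        data = handlers[wu["fn"]](data)   # on completion, proceed to the next WU
        index = wu["next"]
    return data

print(run_wu_stack(wu_stack, "enc:gz:block-42"))
```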
  • the DPU utilizes fine-grain work units, work unit queues, and a queue manager executed on the data plane operating system of each processing core to serialize packet processing such that data packets of a same packet flow are processed by a same processing core.
  • the DPU is capable of processing any type of packet flow with fine granularity between processing cores and low processing overhead.
  • other multi-core systems may communicate using shared memory and locking to provide coherency in memory.
  • the locking schemes may be one or more orders of magnitude larger in grain than the work unit scheme described herein.
  • the processing overhead associated with the work unit scheme can be less than 100 clock cycles in some implementations.
  • controller 200 may include a graph generation module 280 for generating a graph-based representation of the functional elements in data plane 214 of a volume (e.g., volume 121J of FIG. 1B).
  • Graph generation module 280 may operate substantially similar to graph generation module 131 of FIGS. 1A and 1B.
  • graph generation module 280 may generate a volume graph including function nodes that represent one or more functions, such as storage functions 224, and leaf nodes that represent resources allocated for said function(s).
  • Control plane 216 of DPU 210 may include graph module 284 configured to convey information based on a volume graph to each DPU to manage the storage cluster and to use the volume graph to manage the volume.
  • the volume graph may be used to change the functionality of the volume by replacing a function node in the volume graph with another function node to achieve a different scheme of data protection, data availability, data compression, and/or any other function.
  • the volume graph enables the implementation of failover for the volume.
  • DPU 210 may operate as a secondary storage node for the volume.
  • graph generation module 280 may generate the volume graph for DPU 210 to assume the role of the primary storage node for the volume.
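A minimal sketch of graph-driven failover, assuming a simplified dictionary representation of the volume graph with hypothetical DPU names:

```python
volume_graph = {
    "volume": "121J",
    "primary_target": "dpu-120A",
    "secondary_target": "dpu-120B",
    "plex_nodes": ["dpu-120A", "dpu-120B", "dpu-120D"],
}

def failover(graph, failed_dpu):
    """Promote the secondary target if the primary target has failed."""
    if graph["primary_target"] != failed_dpu:
        return graph
    new_graph = dict(graph)
    new_graph["primary_target"] = graph["secondary_target"]
    new_graph["secondary_target"] = None   # a controller would later assign a new one
    return new_graph

print(failover(volume_graph, "dpu-120A")["primary_target"])   # dpu-120B
```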
  • graph module 284 may use the volume graph for resource allocation.
  • graph module 284 may apply one or more rules to the leaf nodes of the volume graph to allocate resources to the volume.
  • the rules may include resource availability (e.g., IOPS availability, storage capacity availability), data protection (e.g., resources in different fault domains), and/or load balancing rules (e.g., based on IOP usage, storage usage, etc.).
  • graph module 284 may use the volume graph for event management. Events may include a storage node going down, a storage device being pulled out, the deletion of a volume, and/or any other event associated with the allocated volume represented by the volume graph. For example, graph module 284 may determine that an event generated by a leaf node of the graph (e.g., storage device) is propagated up the layers of the volume stack of the volume graph to a function node that represents a data durability operation of the volume. In response, graph module 284 may instruct DPU 210 to send a message to controller 200 to inform controller 200 of the event. In response to receiving the message, graph generation module 280 of the controller 200 may rebuild the volume graph by replacing the one or more leaf nodes of the function node representing the data durability operation with another storage device.
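The event flow described above (a leaf-node failure propagating up to the durability function node, followed by a controller-side rebuild) might be sketched as follows, with assumed structures and device names:

```python
def propagate_event(volume_graph, failed_device):
    """Return the function node affected by a failed leaf, if any."""
    for function_node in volume_graph["functions"]:
        if failed_device in function_node["leaves"]:
            return function_node
    return None

def rebuild(volume_graph, failed_device, spare_device):
    """Controller-side rebuild: swap the failed leaf for a spare device."""
    function_node = propagate_event(volume_graph, failed_device)
    if function_node is not None:
        function_node["leaves"] = [spare_device if leaf == failed_device else leaf
                                   for leaf in function_node["leaves"]]
    return volume_graph

graph = {"volume": "121J",
         "functions": [{"name": "erasure-coded-volume",
                        "leaves": ["120A/ssd-0", "120B/ssd-3", "120D/ssd-1"]}]}
print(rebuild(graph, failed_device="120B/ssd-3", spare_device="120C/ssd-2"))
```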
  • graph module 284 may use the volume graph to modify the volume based on changes in volume parameters.
  • a volume can be created with a specified set of one or more parameters (e.g., block size, encryption keys, compression scheme, volume size, data protection scheme, etc.).
  • a volume can be created with a specified data protection scheme, such as erasure coding, replication, or none.
  • volume parameters may be changed after creation of the volume.
  • graph module 284 may use the volume graph to generate a clone of the volume with the modified parameters and to switch the host connection to the clone of the volume.
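A minimal sketch of the parameter-change flow, assuming hypothetical parameters and a toy validation step: the clone is created with the new parameters and the host attachment is switched to it.

```python
ALLOWED_BLOCK_SIZES = {512, 4096, 8192}

def validate(params):
    if params["block_size"] not in ALLOWED_BLOCK_SIZES:
        raise ValueError("unsupported block size")
    return params

def clone_with(volume, new_params):
    clone = dict(volume)
    clone.update(validate(new_params))       # validate before cloning
    clone["name"] = volume["name"] + "-clone"
    return clone

def switch_attachment(host, clone):
    host["attached_volume"] = clone["name"]  # host connection moves to the clone

volume = {"name": "vol-121J", "block_size": 4096, "compression": "none"}
host = {"name": "initiator-110A", "attached_volume": "vol-121J"}

clone = clone_with(volume, {"block_size": 8192, "compression": "lz4"})
switch_attachment(host, clone)
print(host["attached_volume"])   # vol-121J-clone
```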
  • graph module 284 may use the volume graph to dynamically rebalance resources for the volume. For example, graph module 284 may rebalance resources for the volume based on alerts generated by leaf nodes of the volume graph. For example, the storage usage of a storage device represented by a leaf node in the volume graph may exceed a storage usage threshold and may generate an alert that is propagated up the layers of the volume stack of the volume graph to a function node that represents a data durability operation of the volume and, in response, graph module 284 may move the load from the storage device to a new storage device.
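The threshold-driven rebalancing could be sketched as follows; the 85% threshold, device names, and the amount of data moved are assumptions for illustration.

```python
USAGE_THRESHOLD = 0.85

devices = {
    "120A/ssd-0": {"used_gb": 920, "capacity_gb": 1000},
    "120B/ssd-3": {"used_gb": 300, "capacity_gb": 1000},
    "120D/ssd-1": {"used_gb": 450, "capacity_gb": 1000},
}

def usage(device):
    return device["used_gb"] / device["capacity_gb"]

def rebalance(devices, move_gb=100):
    """Move load from any device over the threshold to the least-used device."""
    for name, device in devices.items():
        if usage(device) > USAGE_THRESHOLD:                       # alert condition
            target_name = min(devices, key=lambda n: usage(devices[n]))
            devices[name]["used_gb"] -= move_gb
            devices[target_name]["used_gb"] += move_gb
            print(f"alert on {name}: moved {move_gb} GB to {target_name}")

rebalance(devices)
print({name: round(usage(d), 2) for name, d in devices.items()})
```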
  • DPU 210 represents a high performance, hyperconverged network, storage, and data processor and input/output hub.
  • networking unit 252 may be configured to send and receive stream data units with one or more external devices, e.g., network devices.
  • Networking unit 252 may perform network interface card functionality, packet switching, and the like.
  • Networking unit 252 may use large forwarding tables and offer programmability.
  • Networking unit 252 may expose network interface (e.g., Ethernet) ports for connectivity to a network, such as network 7 and/or switch fabric 114 of FIG. 1A.
  • Each of host units 256 may expose one or more host interface (e.g., PCI-e) ports to send and receive stream data units with application processors (e.g., an x86 processor of a server device or a local CPU or GPU of the device hosting DPU 210) and/or data storage devices (e.g., SSDs).
  • DPU 210 may further include one or more high bandwidth interfaces for connectivity to off-chip external memory (not illustrated in FIG. 2B).
  • Cores 250 may comprise one or more of MIPS (microprocessor without interlocked pipeline stages) cores, ARM (advanced RISC (reduced instruction set computing) machine) cores, PowerPC (performance optimization with enhanced RISC - performance computing) cores, RISC-V (RISC five) cores, or complex instruction set computing (CISC or x86) cores.
  • Each of cores 250 may be programmed to process one or more events or activities related to a given packet flow such as, for example, a networking packet flow, a storage packet flow, a security packet flow, or an analytics packet flow.
  • Each of cores 250 may be programmable using a high-level programming language, e.g., C, C++, or the like
  • receiving a work unit can be signaled by receiving a message in a work unit receive queue (e.g., one of WU queues 262).
  • Each of WU queues 262 is associated with at least one of cores 250 and is addressable in the header of the work unit message.
  • Upon receipt of the work unit message from networking unit 252, one of host units 256, or another one of cores 250, queue manager 260 enqueues a work unit in the one of WU queues 262 associated with the one of cores 250 specified by the work unit.
  • After queue manager 260 dequeues the work unit from a WU queue 262, queue manager 260 delivers the work unit to the associated core 250. Queue manager 260 then invokes the software function specified by the work unit on the associated core 250 for processing the work unit.
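A minimal sketch, assuming a simplified software model of the work-unit mechanism described above; the queue layout, handler table, and work-unit fields are illustrative assumptions rather than the actual DPU 210 implementation.

```python
# Hypothetical sketch of per-core work-unit (WU) queue dispatch, loosely modeled on
# the queue manager behavior described above; names and structures are assumptions.
from collections import deque

HANDLERS = {
    "parse_packet": lambda wu: print(f"core {wu['core']}: parsing frame {wu['frame']}"),
    "write_block":  lambda wu: print(f"core {wu['core']}: writing block {wu['frame']}"),
}

wu_queues = {core: deque() for core in range(4)}  # one WU queue per core

def enqueue_work_unit(wu):
    # The WU header names the target core and the software function to invoke.
    wu_queues[wu["core"]].append(wu)

def dispatch(core):
    while wu_queues[core]:
        wu = wu_queues[core].popleft()
        HANDLERS[wu["function"]](wu)  # invoke the function specified by the WU

enqueue_work_unit({"core": 1, "function": "parse_packet", "frame": 42})
enqueue_work_unit({"core": 1, "function": "write_block", "frame": 42})
dispatch(1)
```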
  • the allocated volume can configure user volume 362 to stripe/concatenate the user volume to multiple logical volumes, thereby providing a 1:N mapping of user volume to logical volumes. This may, for example, facilitate scaling of user volumes across multiple DPUs as well as scale recovery when a storage device fails.
  • More information about each of the layers of abstraction (from application to device) is set forth in FIG. 3B and as described in U.S. Patent No. 10,949,303, entitled “DURABLE BLOCK STORAGE IN DATA CENTER ACCESS NODES WITH INLINE ERASURE CODING,” filed 10 December 2018, which claims the benefit of U.S. Provisional Patent Application No. 62/597,185, filed 11 December 2017.
  • the user volume may, in some implementations, map to 1:N logical volumes.
  • a given node may play one or all roles for a given volume.
  • One type of role that a node may implement is an attachment node 364.
  • the attachment node 364 is the node where an application running as a virtual machine or container on a server attaches to the volume.
  • the attachment node 364 may be the node where the PCIe link to the server is attached or where the NVMEoF (non-volatile memory express over fabrics) connection is terminated for the volume.
  • the user volume 362 function runs on the attachment node.
  • the storage node 368 is the node to which a storage device is attached.
  • the volume may include a plurality of storage nodes 368.
  • SSDs 350 can be partitioned into extents 352 (e.g., 1 GB) and accessed via the storage node 368 that is attached to the SSDs 350 via PCIe.
  • Extents 352 are provisioned into a raw volume 354 that is remotely accessible by other nodes interconnected in a cluster. In the illustrated example of FIG. 3A, raw volume 354 and extent 352 functions run on the storage nodes 368.
  • a volume can be configured with the following steps via a management plane and control plane.
  • Each configuration step includes a communication from the management plane to one or more nodes instructing the node(s) about their role(s) relative to the volume being created.
  • Volumes can each have a globally unique identifier that is used in the communication so that each node can identify the correct volume.
  • the management plane may use a variety of methods to determine which nodes to select to play the different roles for the given volume. In general, the management plane may select nodes that are outside of a same fault zone within a cluster so that multiple nodes used to support the volume are not likely to fail together. An example method for configuring a volume is described below.
  • the management plane receives a top-level specification from a management console (e.g., Openstack Cinder) that defines volume parameters including block size, volume size (number of blocks) (otherwise referred to as “capacity”), quality of service (QoS), encryption, compression, fault domains, and durability scheme (e.g., replication factor or erasure coding scheme).
  • the management plane creates raw volumes 354 on each storage node.
  • Raw volumes 354 can be created by assigning extents 352 from available SSDs 350.
  • Extents 352 may be statically sized (e.g., 1 GB) during deployment. This step may be done statically or dynamically (e.g., thin provisioning) as the storage space is accessed by the storage node.
  • the management plane creates raw volume sets 356 on each controller node.
  • the number of raw volumes per raw volume set 356 may depend on the durability scheme specified in the top-level specification for the volume (e.g., X for replication factor and m + n for erasure coding).
  • the number of raw volume sets 356 may depend on the size of the raw volumes 354 and the size specified in the top-level specification for the volume.
  • the management plane creates durable volume 358 on each controller node.
  • Parameters for durable volume 358 can include durability scheme (e.g., replication or erasure coding) and/or volume size (including additional space to allow for log compaction).
  • the management plane creates user volume 362 on each attachment node.
  • User volume 362 can receive the read and write requests for data blocks from an application running on an attached server. The read and write requests can be passed to log structured volume 360 for processing. Parameters for user volume 362 can include block size, encryption keys, compression scheme, and volume size.
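The sizing implied by the configuration steps above can be illustrated with a small calculation; this sketch assumes 1 GB extents and a hypothetical 4 + 2 erasure-coding scheme, and the helper function is not part of the disclosure.

```python
# Hypothetical sizing sketch for the configuration steps above, assuming 1 GB extents
# and a 4+2 erasure-coding scheme; the helper name and numbers are illustrative only.
EXTENT_GB = 1

def raw_volumes_needed(user_volume_gb, data_disks, parity_disks):
    """Each stripe spreads data over (data + parity) raw volumes."""
    overhead = (data_disks + parity_disks) / data_disks
    total_gb = user_volume_gb * overhead
    per_raw_volume_gb = total_gb / (data_disks + parity_disks)
    extents_per_raw_volume = -(-per_raw_volume_gb // EXTENT_GB)  # ceiling division
    return data_disks + parity_disks, int(extents_per_raw_volume)

count, extents = raw_volumes_needed(user_volume_gb=100, data_disks=4, parity_disks=2)
print(f"{count} raw volumes, each backed by ~{extents} one-GB extents")
```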
  • the volume may rely on a distribution protocol to exchange data between the associated nodes. For example, NVMEoF may be used as the base protocol.
  • the network binding may be based on transmission control protocol (TCP) or some form of reliable datagram. In some examples, the network binding may be TCP with fabric control protocol (FCP) based congestion control.
  • Various objects may have a universally unique identifier (UUID) that allows them to be addressable across the network via the distribution protocol.
  • log structured logical volume 360 may be accessed using a UUID from the attachment node or directly via an NVMEoF client.
  • raw volumes 354 may receive I/O requests from raw volume sets 356 for replicas or erasure coded pieces stored in raw volumes 354 identified by UUIDs.
  • authentication may be included as a part of NVMEoF so that a “bad actor” (non-authorized party) on the network cannot access these remotely addressable entities.
  • the volume designs described in this disclosure may support a scale-down model all the way down to a single node.
  • the raw volumes 354 can be allocated from the local node resulting in a device that is similar to a RAID (redundant array of independent disks) or an embedded erasure code implementation that is still tolerant of SSD failures.
  • FIG. 4 illustrates an example graph-based representation of functional elements in a data plane of a volume, in accordance with the techniques described in this disclosure.
  • Volume graph 400 of FIG. 4 is described with respect to an example implementation of a volume, such as volume 121J as described above with respect to FIG. 1B.
  • Volume graph 400 may be used for management of the volume, such as resource allocation, event management, and recovery at scale.
  • Volume graph 400 may span multiple nodes enabling scale-out capacity, redundancy, and performance.
  • volume graph 400 provides a graphical representation of the layers of abstraction of the volume.
  • volume graph 400 may include an allocation layer 402, durability layer 404, schema layer 406, and aggregation layer 408 of the volume.
  • Allocation layer 402 of volume graph 400 may graphically represent the resources allocated for one or more durable volumes, e.g., erasure coded volume 416 and journal volume 418.
  • Journal volume 418 provides crash resilience for the volume.
  • log structured logical volume 422 may use an intent log (i.e., journal) stored in NVM 420 of a DPU so that any write that is acknowledged to the host server for the application can be reliably performed to the underlying storage media in the presence of component failures.
  • Volume graph 400 may graphically represent the nodes allocated for journal volume 418.
  • volume graph 400 may include a leaf node of journal volume 418 that represents a node operating as a primary storage node (e.g., dpu_0) including a first copy of the journal stored in NVM 420A.
  • Volume graph 400 may include another leaf node of journal volume 418 that represents a node operating as a secondary storage node (e.g., dpu_1) including a second copy of the journal in NVM 420B.
  • the secondary storage node may represent DPU 117B of FIG. 1B having a copy of the journal of volume 121J stored in NVM of storage node 120B.
  • volume graph 400 may include one or more nodes for erasure coded volume 416 that represent remote datagram sockets (RDS), e.g., RDS 414, that each provides remote access to raw storage volumes.
  • erasure coded volume 416 is illustrated as including RDS 414, in some examples, erasure coded volume 416 may include RDS 414 and BLT 412, only BLT 412, or a combination of the two.
  • journal volume 418 may include one or more leaf nodes that represent RDS that provide remote access to the copies of the journal stored in NVM 420.
  • Durability layer 404 of volume graph 400 may graphically represent one or more durable volumes of a volume, such as, in the example of FIG. 4, erasure coded volume 416 and journal volume 418. Although volume graph 400 is illustrated with erasure coded volume 416 and journal volume 418, volume graph 400 may additionally or alternatively include other durable volumes, such as a replication volume in which data blocks are replicated a number of times based on a replication factor and distributed across the storage devices to provide high data availability to protect against device and/or node failures.
  • volume graph 400 includes a function node that represents erasure coded volume 416 and is connected to the leaf nodes that represent the raw volumes created from storage devices 410 (or to the RDS that provides remote access to the raw volumes). Volume graph 400 also includes a function node that represents journal volume 418 that is connected to nodes that represent NVMs 420 including copies of the journal (or to the RDS that provides remote access to a copy of the journal).
  • Schema layer 406 of volume graph 400 may graphically represent a log structured logical volume 422 created on each node operating as a controller node. As described above, log structured logical volume 422 can be used to gather multiple data blocks into larger chunks for inline erasure coding by erasure coded volume 416 prior to storage across multiple storage nodes. Volume graph 400 includes a function node that represents log structured logical volume 422 and is connected to nodes that represent the durable volumes, e.g., erasure coded volume 416 and journal volume 418. Although volume graph 400 is illustrated with log structured logical volume 422, volume graph 400 may include other volumes, such as a direct mapping volume or the like.
  • Aggregation layer 408 of volume graph 400 may graphically represent a partitioned volume group 424.
  • Partitioned volume group 424 may group a plurality of log structured logical volumes (not shown) to create a storage volume or split a log structured logical volume into a plurality of storage volumes.
  • Volume graph 400 includes a function node that represents partitioned volume group 424 that is connected to a node that represents the log structured logical volume 422.
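For orientation, the layered structure of volume graph 400 can be sketched as nested data; the node labels follow FIG. 4, while the dictionary representation itself is an assumption of this sketch rather than the disclosed data structure.

```python
# Hypothetical, simplified rendering of volume graph 400 as nested Python dicts;
# layer and node names follow FIG. 4, but the data structure itself is an assumption.
volume_graph_400 = {
    "aggregation": {
        "partitioned_volume_group_424": ["log_structured_logical_volume_422"],
    },
    "schema": {
        "log_structured_logical_volume_422": ["erasure_coded_volume_416",
                                               "journal_volume_418"],
    },
    "durability": {
        "erasure_coded_volume_416": ["rds_414/raw_volume_on_storage_devices_410"],
        "journal_volume_418": ["nvm_420A@dpu_0 (primary)", "nvm_420B@dpu_1 (secondary)"],
    },
    "allocation": {
        "rds_414/raw_volume_on_storage_devices_410": [],
        "nvm_420A@dpu_0 (primary)": [],
        "nvm_420B@dpu_1 (secondary)": [],
    },
}

# Walk the graph top-down and print each function node with its children.
for layer in ("aggregation", "schema", "durability", "allocation"):
    for node, children in volume_graph_400[layer].items():
        print(f"[{layer}] {node} -> {children}")
```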
  • the volume create intent APIs may include a data protection parameter that specifies the parameters for data protection techniques to be implemented for the volume, such as a number of storage nodes going down that can be tolerated (Num redundant dpus), a number of data disks for an erasure coded volume (Num data disks), a number of media failures that can be tolerated (Num failed disks), or the like.
  • the volume create intent APIs may include a compression parameter that specifies whether compression is enabled for the volume.
  • the volume create intent APIs may, in some examples, include an encryption parameter that specifies whether encryption is enabled for the volume.
  • the volume create intent APIs may include a capacity parameter that specifies the size of the volumes.
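One possible shape for such a volume-create intent is sketched below; the field names are illustrative assumptions and do not reproduce the actual intent APIs.

```python
# Hypothetical shape of a volume-create intent covering the data protection,
# compression, encryption, and capacity parameters mentioned above; field names
# are illustrative assumptions, not the actual API.
from dataclasses import dataclass

@dataclass
class VolumeCreateIntent:
    capacity_bytes: int
    num_redundant_dpus: int      # storage-node failures tolerated
    num_data_disks: int          # data pieces for an erasure coded volume
    num_failed_disks: int        # media failures tolerated
    compression_enabled: bool = False
    encryption_enabled: bool = False

intent = VolumeCreateIntent(
    capacity_bytes=100 * 2**30,
    num_redundant_dpus=1,
    num_data_disks=4,
    num_failed_disks=2,
    compression_enabled=True,
)
print(intent)
```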
  • volume graph 400 enables the implementation of failover for the volume.
  • the control plane may, in response to determining that a primary storage node (e.g., dpu_0) has failed, reconfigure volume graph 400 to replace a connection to the failed primary storage node with a connection to a secondary storage node (e.g., dpu_1), which assumes the role of primary storage node for the volume.
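A minimal sketch of the failover reconfiguration just described, assuming a simplified in-memory record of the journal volume; the record layout and function name are hypothetical.

```python
# Hypothetical sketch of the failover step: the connection to the failed primary is
# dropped and the secondary is promoted. The dictionary layout is an assumption,
# not the control plane's actual state.
journal_volume = {
    "primary": "dpu_0",
    "secondary": "dpu_1",
    "connections": {"dpu_0", "dpu_1"},
}

def fail_over(volume, failed_node):
    if volume["primary"] == failed_node:
        volume["connections"].discard(failed_node)
        # Promote the secondary; a replacement secondary would be chosen later.
        volume["primary"], volume["secondary"] = volume["secondary"], None
    return volume

print(fail_over(journal_volume, "dpu_0"))
# -> primary is now 'dpu_1' and the connection to 'dpu_0' has been removed
```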
  • control plane may use volume graph 400 for resource allocation.
  • control plane may apply one or more rules to the leaf nodes of volume graph 400 to allocate resources to the volume.
  • the rules may include a resource availability, data protection, and/or load balancing rules.
  • the control plane may apply a resource availability rule to allocate resources based on one or more rules.
  • the rules can be based on one or more metrics, such as input/output operations per second (IOPs) availability (e.g., how many read and write commands the system can perform in a second).
  • the control plane may determine whether a given storage device or given storage node has enough IOPs available for erasure coded volume 416, such as by comparing the IOPs availability of the given storage device or given storage node to an IOPs availability threshold.
  • the control plane may apply a resource availability rule to allocate resources based on one or more rules, such as storage capacity availability. For example, to allocate storage devices for erasure coded volume 416, the control plane may determine whether a given storage device has enough storage capacity available for erasure coded volume 416, such as by comparing the storage capacity availability of the given storage device to a storage capacity availability threshold. The control plane may select one or more storage devices determined to have enough storage capacity available (e.g., satisfies the storage capacity availability threshold) to be allocated for erasure coded volume 416. Volume graph 400 can be configured such that leaf node(s) of erasure coded volume 416 represents the allocated storage device(s) determined to have storage capacity available.
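A hedged sketch of such a resource-availability rule, assuming illustrative device records and thresholds; the numbers are not from the disclosure.

```python
# Hypothetical sketch of a resource-availability rule: keep only devices whose
# available IOPs and free storage capacity satisfy configurable thresholds.
devices = [
    {"name": "ssd_a", "iops_available": 90_000, "capacity_free_gb": 400},
    {"name": "ssd_b", "iops_available": 12_000, "capacity_free_gb": 900},
    {"name": "ssd_c", "iops_available": 80_000, "capacity_free_gb": 50},
]

IOPS_THRESHOLD = 50_000          # illustrative threshold values
CAPACITY_THRESHOLD_GB = 100

eligible = [
    d for d in devices
    if d["iops_available"] >= IOPS_THRESHOLD
    and d["capacity_free_gb"] >= CAPACITY_THRESHOLD_GB
]
print([d["name"] for d in eligible])  # ['ssd_a'] would be allocated to the volume
```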
  • the control plane may apply a data protection rule to allocate storage nodes that are in different fault zones within a cluster so that multiple storage nodes used to support the volume are not likely to fail together. For example, to allocate storage nodes for journal volume 418, the control plane may determine the fault zones (e.g., power zones or chassis) to which a storage node belongs. In response to determining that certain storage nodes belong to different fault zones, the control plane may select one storage node as a primary storage node and select another storage node in a different fault zone as a secondary storage node. In the illustrated example of FIG. 4, dpu_0 and dpu l may be determined to belong to different fault zones.
  • the control plane may select dpu_0 as a primary storage node for journal volume 418 and select dpu_1 as a secondary storage node for journal volume 418.
  • Volume graph 400 can be configured such that leaf nodes of journal volume 418 represent the selected primary storage node and the secondary storage node.
  • the control plane may determine the fault zones to which a storage node belongs.
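A minimal sketch of the fault-zone check described above, assuming a hypothetical mapping of storage nodes to fault zones; the selection logic is illustrative.

```python
# Hypothetical sketch of the data-protection rule: pick a primary and a secondary
# storage node from different fault zones. Zone assignments are illustrative.
nodes = {
    "dpu_0": {"fault_zone": "chassis_1"},
    "dpu_1": {"fault_zone": "chassis_2"},
    "dpu_2": {"fault_zone": "chassis_1"},
}

def pick_primary_and_secondary(nodes):
    names = sorted(nodes)
    primary = names[0]
    for candidate in names[1:]:
        if nodes[candidate]["fault_zone"] != nodes[primary]["fault_zone"]:
            return primary, candidate
    raise RuntimeError("no two nodes in distinct fault zones")

print(pick_primary_and_secondary(nodes))  # ('dpu_0', 'dpu_1')
```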
  • the control plane may apply a load balancing rule to allocate resources based on one or more rules.
  • the rules can be based on one or more metrics, such as IOPs usage and/or storage usage.
  • the control plane may determine the IOPs usage of a given storage device and/or given storage node.
  • the control plane may add a cost value based on the IOPs usage of the given storage device and/or given storage node.
  • the control plane may select storage devices and/or storage nodes with the lowest cost of IOPs usage for erasure coded volume 416.
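A sketch of a load-balancing rule along these lines; the equal weighting of IOPs usage and storage usage is an assumption of this sketch, since the text only states that a cost value is added.

```python
# Hypothetical sketch of the load-balancing rule: score each candidate by its
# current IOPs usage and storage usage, then prefer the lowest-cost candidates.
candidates = [
    {"name": "dpu_0", "iops_used_pct": 70, "storage_used_pct": 40},
    {"name": "dpu_1", "iops_used_pct": 20, "storage_used_pct": 30},
    {"name": "dpu_2", "iops_used_pct": 50, "storage_used_pct": 80},
]

def cost(c, iops_weight=0.5, storage_weight=0.5):
    # The weighting is an assumption; the disclosure only says a cost is added.
    return iops_weight * c["iops_used_pct"] + storage_weight * c["storage_used_pct"]

best = sorted(candidates, key=cost)
print([c["name"] for c in best])  # lowest-cost first: ['dpu_1', 'dpu_0', 'dpu_2']
```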
  • the control plane may use volume graph 400 for event management and to rebuild volumes based on an event.
  • Events may include a storage node going down, a storage device being pulled out, the deletion of a volume, and/or any other event associated with the allocated volume represented by volume graph 400.
  • a network device including storage device 410A may detect a failure of storage device 410A and generate an event indicating storage device 410A has failed.
  • the event can be propagated up the layers of the volume stack of volume graph 400, such as to BLT 412, which in turn propagates the event to erasure coded volume 416.
  • a network device including a storage node may detect a failure to the storage node and generate an event indicating the storage node has failed.
  • the event can be propagated up the layers of the volume stack of volume graph 400 to journal volume 418, which in turn propagates the event to log structured logical volume 422.
  • the control plane may instruct the DPU to send a message to the controller (e.g., controller 200 of FIG. 2A) to inform controller 200 of the event.
  • the control plane may use volume graph 400 to modify the volume based on changes in volume parameters.
  • a volume can be allocated with a specified set of one or more parameters (e.g., block size, encryption keys, compression scheme, volume size, data protection scheme, etc.).
  • a volume can be allocated with a specified data protection scheme, such as erasure coding, replication, none, etc.
  • volume parameters may be changed after creation of the volume.
  • new parameters are validated, and a clone of the volume can be created with the modified parameters.
  • a clone of a volume is an independent volume but relies on the source volume for its reads until the clone of the volume is fully hydrated.
  • the control plane may use volume graph 400 to rebalance resources when storage devices and/or storage nodes are added or removed from the system. For example, the control plane may determine, in response to the addition of a storage device, the storage usage of the storage devices. For instance, the control plane may determine whether there are storage devices that have a low usage (e.g., less than 20% storage usage), a medium usage (e.g., greater than 50% storage usage), or a high usage (e.g., greater than 80% storage usage). The data of the storage devices with the highest usage may be relocated.
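A sketch of this rebalancing check, assuming the example bucket boundaries above (20%, 50%, 80%) and hypothetical per-device usage figures.

```python
# Hypothetical sketch of the rebalancing check run when a storage device is added:
# bucket devices by storage usage and mark the most-used ones for relocation.
devices = {"ssd_a": 85, "ssd_b": 15, "ssd_c": 55, "ssd_new": 0}  # % storage used

def bucket(pct):
    if pct > 80:
        return "high"
    if pct > 50:
        return "medium"
    if pct < 20:
        return "low"
    return "normal"

buckets = {name: bucket(pct) for name, pct in devices.items()}
relocate_from = [n for n, b in buckets.items() if b == "high"]
print(buckets)        # ssd_a is 'high', ssd_b and ssd_new are 'low', ssd_c is 'medium'
print(relocate_from)  # data on 'ssd_a' is a candidate to move toward 'ssd_new'
```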
  • the control plane may arrange the usage of the extents in a list (e.g., [400, 400, 400, 400, 10, 10, 10, 10, 10, 10, 10, 10, 10]) and compute a ratio of the physical usage of each of the extents (r1) (e.g., [1.0, 0.5, 0, 0, 1.0, 1.0, 1.0, 0.5, 0, 0, 0, 0, 0, 0]).
  • the extent with the highest value (e.g., 0.44) may be selected for relocation.
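The exact ratio computation is only partially legible in this text; as a hedged stand-in, the sketch below simply normalizes each extent's usage against the busiest extent and selects the highest-ranked extent for relocation.

```python
# Stand-in sketch only: rank extents by usage relative to the busiest extent and
# pick the top-ranked one for relocation. This does not reproduce the precise
# ratio (r1) used in the disclosure.
extent_usage = [400, 400, 400, 400, 10, 10, 10, 10, 10, 10, 10, 10, 10]

peak = max(extent_usage)
ratios = [u / peak for u in extent_usage]
candidate = max(range(len(ratios)), key=lambda i: ratios[i])
print(f"relocate extent index {candidate} (ratio {ratios[candidate]:.2f})")
```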
  • the control plane may use volume graph 400 to rebalance volumes when failed storage nodes come back online. For example, when a storage node operating as a primary storage node (e.g., dpu_0) fails or is degraded (or is otherwise unavailable), the durable volumes on the storage node can be moved to a secondary storage node (e.g., dpu_1).
  • volume graph 400 may be used to rebalance the durable volumes back to the original primary storage node, such as by resynchronizing NVM 420A with a copy of the journal, configuring the old primary storage node (e.g., dpu_0) as a new secondary storage node, resynchronizing the leaf nodes of the new backup controller node, changing the log structured logical volume 422 to an online state, unmapping the log structured logical volume 422 from the current primary storage node (e.g., dpu_1), and mapping the log structured logical volume 422 to the new secondary controller node.
  • the partitioned volume group 424, user groups 426, and/or snapshots can then be mapped to the log structured logical volume 422.
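The failback sequence above can be summarized as an ordered procedure; the sketch below simply lists the steps, with placeholder actions rather than real APIs.

```python
# Hypothetical summary of the failback sequence described above; each "step" is a
# placeholder string, since the disclosure describes a sequence rather than an API.
def rebalance_back(volume, old_primary="dpu_0", current_primary="dpu_1"):
    steps = [
        f"resynchronize NVM journal copy on {old_primary}",
        f"configure {old_primary} as the new secondary storage node",
        "resynchronize leaf nodes of the new secondary",
        "change the log structured logical volume to an online state",
        f"unmap the log structured logical volume from {current_primary}",
        "map the log structured logical volume to the new secondary",
        "remap partitioned volume group, user groups, and snapshots",
    ]
    print(f"rebalancing {volume}:")
    for i, step in enumerate(steps, 1):
        print(f"  step {i}: {step}")

rebalance_back(volume="volume_121J")
```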
  • the control plane may use volume graph 400 to monitor volumes that are in the process of being deleted when a storage service or operating system executed on a storage node (e.g., as a micro-service) is restarted.
  • the control plane may determine whether any volume(s) were marked in a database (which records the state of each volume) as being in the process of being deleted, and whether the volume has been deleted in the operating system. If the volume is deleted in the operating system, the control plane may remove the volume from the database. If the volume has not been deleted in the operating system, the control plane may revert the deletion process (e.g., revert the deletion-in-progress flag in the database).
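A minimal sketch of this reconciliation pass, assuming a hypothetical database record per volume and a set of volumes still reported by the storage operating system.

```python
# Hypothetical sketch of the post-restart reconciliation: for each volume marked
# "deletion in progress" in the database, either finish the removal (if the OS no
# longer knows the volume) or revert the in-progress flag.
db = {
    "vol-1": {"deletion_in_progress": True},
    "vol-2": {"deletion_in_progress": True},
    "vol-3": {"deletion_in_progress": False},
}
os_volumes = {"vol-2", "vol-3"}  # volumes the storage OS still reports

for vol_id in list(db):
    if not db[vol_id]["deletion_in_progress"]:
        continue
    if vol_id not in os_volumes:
        del db[vol_id]                               # deletion completed: drop record
    else:
        db[vol_id]["deletion_in_progress"] = False   # revert the in-progress flag

print(db)  # vol-1 removed; vol-2 flag reverted; vol-3 untouched
```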
  • FIG. 8 is a flow diagram of an example method for graph-based storage management, in accordance with the techniques described in this disclosure.
  • the method 800 includes allocating a volume of storage within a storage cluster.
  • a volume represents a logical storage device and provides a level of abstraction from physical storage.
  • the volume can be allocated by a storage cluster having a plurality of storage nodes. Resources for the volume can be allocated based on various rules.
  • one or more resources can be allocated for the volume based on one or more metrics associated with the one or more resources.
  • Example metrics include input and output operations per second (IOPs) capacity, storage capacity, fault/failure domains, IOPs usage, or storage usage.
  • Another aspect provides a computer-readable storage medium for graph-based storage management, the computer-readable storage medium comprising instructions that, when executed, cause one or more processors to: allocate a volume of storage within a storage cluster; generate a volume graph of the volume, wherein the volume graph represents one or more functional elements in a data plane of the volume; and manage the volume based on the volume graph.
  • the volume graph of the volume comprises a tree graph, the tree graph comprising: a function node that specifies a function implemented by one or more resources allocated for the volume; and one or more leaf nodes to the function node, wherein the one or more leaf nodes each specifies a resource of the one or more resources allocated for the function.
  • Computer-readable media may include computer-readable storage media, which corresponds to a tangible medium such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another (e.g., pursuant to a communication protocol).
  • computer-readable media generally may correspond to (1) tangible computer-readable storage media, which is non-transitory or (2) a communication medium such as a signal or carrier wave.
  • Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this disclosure.
  • a computer program product may include a computer-readable medium.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Techniques are described in which storage nodes within a storage cluster are configured to support graph-based storage management. For example, a storage cluster comprises a network and a plurality of computing systems, each interconnected over the network, the plurality of computing systems comprising a plurality of storage nodes. A computing system of the plurality of computing systems is configured to allocate a volume of storage within the storage cluster, generate a volume graph of the volume, the volume graph representing one or more functional elements in a data plane of the volume, and manage the volume based on the volume graph.
PCT/US2023/068250 2022-06-14 2023-06-09 Gestion de stockage basée sur un graphe WO2023244948A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
IN202241033908 2022-06-14
IN202241033908 2022-06-14
US18/332,461 US20230409215A1 (en) 2022-06-14 2023-06-09 Graph-based storage management
US18/332,461 2023-06-09

Publications (1)

Publication Number Publication Date
WO2023244948A1 true WO2023244948A1 (fr) 2023-12-21

Family

ID=87280750

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/068250 WO2023244948A1 (fr) 2022-06-14 2023-06-09 Gestion de stockage basée sur un graphe

Country Status (1)

Country Link
WO (1) WO2023244948A1 (fr)

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7363457B1 (en) * 2005-07-21 2008-04-22 Sun Microsystems, Inc. Method and system for providing virtualization data services for legacy storage devices
US7818517B1 (en) * 2004-03-26 2010-10-19 Emc Corporation Architecture for virtualization of networked storage resources
EP2821925A1 (fr) * 2012-08-09 2015-01-07 Huawei Technologies Co., Ltd Procédé et appareil de traitement de données distribuées
US20160306574A1 (en) * 2015-04-14 2016-10-20 E8 Storage Systems Ltd. Lockless distributed redundant storage and nvram caching of compressed data in a highly-distributed shared topology with direct memory access capable interconnect
US20160349993A1 (en) * 2015-05-29 2016-12-01 Cisco Technology, Inc. Data-driven ceph performance optimizations
US20180196817A1 (en) * 2017-01-06 2018-07-12 Oracle International Corporation Cloud gateway for zfs snapshot generation and storage
US10540288B2 (en) 2018-02-02 2020-01-21 Fungible, Inc. Efficient work unit processing in a multicore system
US10565112B2 (en) 2017-04-10 2020-02-18 Fungible, Inc. Relay consistent memory management in a multiple processor system
US10659254B2 (en) 2017-07-10 2020-05-19 Fungible, Inc. Access node integrated circuit for data centers which includes a networking unit, a plurality of host units, processing clusters, a data network fabric, and a control network fabric
US10841245B2 (en) 2017-11-21 2020-11-17 Fungible, Inc. Work unit stack data structures in multiple core processor system for stream data processing
US10949303B2 (en) 2017-12-11 2021-03-16 Fungible, Inc. Durable block storage in data center access nodes with inline erasure coding
US20210294775A1 (en) * 2020-03-18 2021-09-23 EMC IP Holding Company LLC Assignment of longevity ranking values of storage volume snapshots based on snapshot policies
US11178262B2 (en) 2017-09-29 2021-11-16 Fungible, Inc. Fabric control protocol for data center networks with packet spraying over multiple alternate data paths

Similar Documents

Publication Publication Date Title
US10949303B2 (en) Durable block storage in data center access nodes with inline erasure coding
US11941279B2 (en) Data path virtualization
US20220334725A1 (en) Edge Management Service
US20220019350A1 (en) Application replication among storage systems synchronously replicating a dataset
US10613779B1 (en) Determining membership among storage systems synchronously replicating a dataset
US11652884B2 (en) Customized hash algorithms
US11349917B2 (en) Replication handling among distinct networks
US9817721B1 (en) High availability management techniques for cluster resources
US20220091771A1 (en) Moving Data Between Tiers In A Multi-Tiered, Cloud-Based Storage System
US9542215B2 (en) Migrating virtual machines from a source physical support environment to a target physical support environment using master image and user delta collections
US20210232331A1 (en) System having modular accelerators
US11689610B2 (en) Load balancing reset packets
US11301162B2 (en) Balancing resiliency and performance by selective use of degraded writes and spare capacity in storage systems
WO2021113488A1 (fr) Création d'une réplique d'un système de stockage
WO2021195187A1 (fr) Gestion de mises en correspondance d'hôtes à des fins de réplication de points d'extrémité
US20220232075A1 (en) Distributed protocol endpoint services for data storage systems
US11573736B2 (en) Managing host connectivity to a data storage system
EP4139782A1 (fr) Fourniture d'une gestion de données en tant que service
US20230231912A1 (en) Mesh-aware storage systems
WO2022076856A1 (fr) Virtualisation de chemin de données
US20220317912A1 (en) Non-Disruptively Moving A Storage Fleet Control Plane
US11733874B2 (en) Managing replication journal in a distributed replication system
US11068192B1 (en) Utilizing mutiple snapshot sources for creating new copy of volume in a networked environment wherein additional snapshot sources are reserved with lower performance levels than a primary snapshot source
US20230409215A1 (en) Graph-based storage management
WO2023244948A1 (fr) Gestion de stockage basée sur un graphe

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23741546

Country of ref document: EP

Kind code of ref document: A1