US20190312925A1 - Time-based congestion discounting for i/o fairness control - Google Patents

Time-based congestion discounting for i/o fairness control

Info

Publication number
US20190312925A1
Authority
US
United States
Prior art keywords
storage
class
ratio
requests
congestion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
US15/947,313
Other versions
US10965739B2
Inventor
Enning XIANG
Eric KNAUFT
Yiqi XU
Xiaochuan Shen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
VMware LLC
Original Assignee
VMware LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by VMware LLC filed Critical VMware LLC
Priority to US15/947,313
Assigned to VMWARE, INC. reassignment VMWARE, INC. ASSIGNMENT OF ASSIGNORS INTEREST (SEE DOCUMENT FOR DETAILS). Assignors: KNAUFT, Eric, SHEN, XIAOCHUAN, XIANG, ENNING, Xu, Yiqi
Publication of US20190312925A1
Application granted
Publication of US10965739B2
Assigned to VMware LLC reassignment VMware LLC CHANGE OF NAME (SEE DOCUMENT FOR DETAILS). Assignors: VMWARE, INC.
Legal status: Active
Expiration: Adjusted

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1001Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004Server selection for load balancing
    • H04L67/101Server selection for load balancing based on network conditions
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/08Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0876Network utilisation, e.g. volume of load or congestion level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0604Improving or facilitating administration, e.g. storage management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0602Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061Improving I/O performance
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0629Configuration or reconfiguration of storage systems
    • G06F3/0635Configuration or reconfiguration of storage systems by changing the path, e.g. traffic rerouting, path reconfiguration
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0653Monitoring storage devices or systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0662Virtualisation aspects
    • G06F3/0664Virtualisation aspects at device level, e.g. emulation of a storage device or system
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0668Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/08Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0876Network utilisation, e.g. volume of load or congestion level
    • H04L43/0882Utilisation of link capacity
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/10Active monitoring, e.g. heartbeat, ping or trace-route
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/16Threshold monitoring
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/20Arrangements for monitoring or testing data switching networks the monitoring system or the monitored elements being virtualised, abstracted or software-defined entities, e.g. SDN or NFV
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1001Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004Server selection for load balancing
    • H04L67/1008Server selection for load balancing based on parameters of servers, e.g. available memory or workload
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1097Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/131Protocols for games, networked simulations or virtual reality
    • H04L67/322
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/50Network services
    • H04L67/60Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources
    • H04L67/61Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources taking into account QoS or priority requirements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2206/00Indexing scheme related to dedicated interfaces for computers
    • G06F2206/10Indexing scheme related to storage interfaces for computers, indexing schema related to group G06F3/06
    • G06F2206/1012Load balancing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/22Parsing or analysis of headers

Definitions

  • As illustrated in FIGS. 3A and 3B, the components of the VSAN module 114 of a host computer operate to generate and transmit congestion signals to sources 330 of storage I/O requests.
  • The sources 330 of storage I/O requests may include the host computers 104 of the cluster 106, the VMs 124 running on the host computers 104 and software processes or routines (not shown) operating in the host computers 104.
  • Each congestion signal transmitted from the VSAN module 114 of the host computer 104 to the sources 330 provides information on the current fullness of the local storage 122 of that host computer for one or more classes of storage I/O requests.
  • Each host computer that receives a congestion signal from the VSAN module 114 may implement a delay based on the received congestion signal, which may be a time-averaged latency-based delay. Since each congestion signal is associated with one or more classes of storage I/O requests, the congestion signals from the VSAN module 114 may be used to selectively control the issuance of different classes of storage I/O requests. Thus, if one class of storage I/O requests is indicated as being heavily congested by the received congestion signals, the host computers in the cluster may use that information to apply more backpressure on that class of storage I/O requests, while less backpressure may be applied to other, less backlogged classes of storage I/O requests so that the different classes of storage I/O requests are processed in a fair manner.
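  • As a rough illustration of how a source might act on these per-class congestion signals, the sketch below converts a 0-255 congestion value into a per-class issuance delay; the linear congestion-to-delay mapping and the scaling constant are assumptions for illustration, not the patented formula.

```python
# Hedged sketch: how a source of storage I/O requests might throttle itself
# per I/O class based on received congestion signals. The linear mapping and
# the scaling constant below are illustrative assumptions, not the patented
# delay formula.
import time

DELAY_PER_CONGESTION_UNIT_MS = 0.1  # assumed scaling factor

class PerClassThrottle:
    def __init__(self):
        # Latest congestion value (0-255) reported for each I/O class.
        self.congestion_by_class = {}

    def update(self, io_class: str, congestion: int) -> None:
        self.congestion_by_class[io_class] = max(0, min(255, congestion))

    def delay_before_issue(self, io_class: str) -> None:
        # More congested classes receive more backpressure (a longer delay);
        # less backlogged classes are delayed less or not at all.
        congestion = self.congestion_by_class.get(io_class, 0)
        delay_ms = congestion * DELAY_PER_CONGESTION_UNIT_MS
        if delay_ms > 0:
            time.sleep(delay_ms / 1000.0)

# Example: a heavily congested VM I/O class is delayed more than resync I/O.
throttle = PerClassThrottle()
throttle.update("vm", 200)      # heavily congested class
throttle.update("resync", 40)   # lightly congested class
throttle.delay_before_issue("vm")      # ~20 ms under the assumed scaling
throttle.delay_before_issue("resync")  # ~4 ms under the assumed scaling
```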
  • Congestion signals for different classes of storage I/O requests are processed differently by the components of the VSAN module 114.
  • In particular, resync storage I/O requests and non-resync storage I/O requests are handled differently with respect to the congestion signals.
  • Congestion signals generated by the LSOM 206 for resync storage I/O requests may be adjusted by the time-based congestion adjuster 210.
  • Congestion signals generated by the LSOM 206 for non-resync storage I/O requests, e.g., VM I/O requests, namespace I/O requests and internal metadata I/O requests, are not adjusted by the time-based congestion adjuster 210.
  • Each congestion signal for resync storage I/O requests may be adjusted or discounted depending on the current time-based rolling average bandwidth for resync storage I/O requests and the current time-based rolling average bandwidth for storage I/O requests of another class, such as VM storage I/O requests, which are calculated by the DOM 204, as described in detail below.
  • In this way, congestion signals for resync storage I/O requests may be discounted so that more resync storage I/O requests are processed relative to non-resync storage I/O requests, such as VM storage requests, when storage constraint conditions warrant such action.
  • For a non-resync storage I/O request, the LSOM 206 generates a congestion signal CS1 when conditions in the local storage 122 warrant that such a congestion signal be issued.
  • For example, a congestion signal may be generated by the LSOM 206 when the write requests in a write buffer (not shown) exceed a certain threshold.
  • The value of a congestion signal may vary depending on how much that threshold is exceeded by the write requests in the write buffer. For example, the value of a congestion signal may be from zero (0) to two hundred fifty-five (255), where 0 indicates the minimal congestion for the local storage and 255 indicates the maximum congestion for the local storage.
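  • A minimal sketch of one way such a congestion value could be derived from write-buffer usage is shown below; the specific threshold and the linear scaling above it are assumptions, since the text only states that the value may vary with how far the threshold is exceeded.

```python
# Hedged sketch: derive a 0-255 congestion value from write-buffer usage.
# The threshold and the linear scaling above it are illustrative assumptions.
def congestion_value(buffered_bytes: int, threshold_bytes: int,
                     capacity_bytes: int) -> int:
    """Return 0 at or below the threshold, scaling linearly up to 255 as the
    write buffer approaches its full capacity."""
    if buffered_bytes <= threshold_bytes:
        return 0  # minimal congestion for the local storage
    span = max(capacity_bytes - threshold_bytes, 1)
    excess = min(buffered_bytes, capacity_bytes) - threshold_bytes
    return round(255 * excess / span)

# Example: a buffer 75% of the way from the threshold to full capacity.
print(congestion_value(buffered_bytes=875, threshold_bytes=500,
                       capacity_bytes=1000))  # -> 191
```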
  • The congestion signal CS1 is transmitted from the LSOM 206 to the time-based congestion adjuster 210 with a returned storage I/O request. Because the congestion signal CS1 is associated with a non-resync storage I/O request, the congestion signal CS1 is transmitted to the sources 330 without being adjusted by the time-based congestion adjuster 210 so that delay may be applied to non-resync storage I/O requests.
  • For a resync storage I/O request, the LSOM 206 generates a congestion signal CS2 when conditions in the local storage 122 warrant that such a congestion signal be issued.
  • The congestion signal CS2 is transmitted from the LSOM 206 to the time-based congestion adjuster 210 with a returned storage I/O request. Because the congestion signal CS2 is associated with a resync storage I/O request, the congestion signal CS2 may first be adjusted or discounted before being transmitted to the sources 330.
  • The amount or percentage by which the congestion signal CS2 is discounted depends on the current time-based rolling average bandwidth for resync storage I/O requests (resyncAB) and the current time-based rolling average bandwidth for storage I/O requests of another class (non-resyncAB), which are computed and provided to the time-based congestion adjuster 210, and on the desired ratio between these two average bandwidths, as described in more detail below. If a discount is applied to the congestion signal CS2 by the time-based congestion adjuster 210, the resulting discounted congestion signal D-CS2 is then transmitted to the sources 330 so that less delay is applied to resync storage I/O requests and the different classes of storage I/O requests are processed in a more fair manner.
  • The operation executed by the DOM 204 of the VSAN module 114 in each host computer 104 of the distributed storage system 100 to compute time-based rolling average bandwidths in accordance with an embodiment of the invention is now described with reference to the process flow diagram of FIG. 4.
  • This operation is performed after the processing of each storage I/O request by the DOM 204 has been completed, i.e., the storage I/O request has been processed by the DOM and passed to the LSOM 206.
  • In this operation, rolling average bandwidths are computed based on the elapsed time between consecutive storage I/O requests. As the elapsed time between consecutive storage I/O requests increases, the rolling average bandwidth is reduced further.
  • If the elapsed time between consecutive storage I/O requests is long enough, the rolling average bandwidth is reset to zero.
  • The elapsed time between storage I/O requests is measured using a slot gap mechanism, which determines the elapsed time from the gap between the time slots of two consecutive storage I/O requests.
  • First, the timestamp at the moment when the processing of a current storage I/O request by the DOM has completed is recorded.
  • The timestamp may be a numerical value that corresponds to the time when the timestamp is recorded.
  • Next, the timestamp for the current storage I/O request and the timestamp for the previous storage I/O request of the same class of storage I/O requests are normalized using the duration or size of predefined fixed-sized time slots, e.g., 200 milliseconds, which may be configurable. In an embodiment, each timestamp is normalized by dividing the timestamp value by the duration value of the time slots.
  • A slot index gap between the slot index of the current storage I/O request and the slot index of the previous storage I/O request is then calculated.
  • If there is no previous storage I/O request of the same class, the slot index of the previous storage I/O request is set to zero.
  • The time slot index (sometimes referred to herein simply as the "slot index") of the current storage I/O request is derived from its normalized timestamp value, i.e., the timestamp value divided by the duration value of the time slots, and the slot index gap is the difference between the slot indexes of the current and previous storage I/O requests.
  • If the two requests complete close enough in time, the slot indexes of the current and previous storage I/O requests will be the same slot index, i.e., both the current and previous storage I/O requests fall in the same time slot.
  • Otherwise, the slot indexes of the current and previous storage I/O requests will be different, and the slot index gap will be larger for a greater difference between the normalized timestamp value for the current storage I/O request and the normalized timestamp value for the previous storage I/O request.
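  • The slot-gap bookkeeping described above can be sketched as follows; treating the slot index as the timestamp divided by the (configurable) slot duration is an assumed reading of the normalization step.

```python
# Hedged sketch of the slot-gap mechanism: timestamps are normalized by a
# fixed slot duration, and the gap is the difference between slot indexes.
# Taking floor(timestamp / slot_duration) as the slot index is an assumed
# reading of the normalization step described above.
SLOT_DURATION_MS = 200  # predefined fixed-sized time slots, configurable

def slot_index(timestamp_ms: float, slot_duration_ms: int = SLOT_DURATION_MS) -> int:
    return int(timestamp_ms // slot_duration_ms)

def slot_index_gap(current_ts_ms: float, previous_ts_ms: float) -> int:
    return slot_index(current_ts_ms) - slot_index(previous_ts_ms)

# Two requests that completed 450 ms apart fall two 200 ms slots apart.
print(slot_index_gap(current_ts_ms=1650.0, previous_ts_ms=1200.0))  # -> 2
```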
  • If the slot index gap is small enough, the operation proceeds to block 412, where the time-based rolling average bandwidth for the class of storage I/O requests of the current storage I/O request is updated according to the slot index gap.
  • At block 412, the time-based rolling average bandwidth is updated by multiplying the previous time-based rolling average bandwidth for the class of storage I/O requests of the current storage I/O request by the decay weight value corresponding to the slot index gap.
  • The decay weight value may be determined using a predefined decay rate for each unit time slot. As an example, the predefined decay rate may have a default setting of 95% per time slot, i.e., the rolling average bandwidth retains 95% of its value for each subsequent time slot, which may be changed by the user.
  • Under this default setting, the decay weights for the first five (5) time slots are 95.0%, 90.3%, 85.7%, 81.5% and 77.4%, respectively.
  • Expressed as decimals, the decay weight values for the first five (5) time slots are 0.950, 0.903, 0.857, 0.815 and 0.774, respectively. The operation then proceeds to block 416.
  • If the slot index gap is too large, the operation instead proceeds to block 414, where the time-based rolling average bandwidth for the class of storage I/O requests of the current storage I/O request is set to zero. Setting the time-based rolling average bandwidth to zero is similar to multiplying the previous time-based rolling average bandwidth by the corresponding decay weight value, because the decay weight value for, e.g., the 128th time slot is 0.001, or 0.1%. The operation then proceeds to block 416.
  • At block 416, the size of the current storage I/O request is added to the time-based rolling average bandwidth for the class of storage I/O requests of the current storage I/O request to derive the current time-based rolling average bandwidth for that I/O class.
  • The size of a storage I/O request can be any size, for example, 1,024 bytes.
  • The current time-based rolling average bandwidth for the class of storage I/O requests of the current storage I/O request is then recorded.
  • The recorded time-based rolling average bandwidths for the different classes of storage I/O requests are used by the time-based congestion adjuster 210 of the VSAN module 114 for the time-based congestion discount operation, as described in detail below.
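  • A hedged sketch of the per-class rolling-average update is shown below; the 0.95 per-slot decay factor and the 128-slot reset cutoff follow the examples given above, while the remaining bookkeeping is an illustrative assumption rather than the patented implementation.

```python
# Hedged sketch of the time-based rolling average bandwidth update (FIG. 4).
# The per-slot decay of 0.95 and the reset once the gap reaches 128 slots
# follow the examples in the text; other details are illustrative assumptions.
DECAY_PER_SLOT = 0.95
RESET_GAP_SLOTS = 128

class RollingAverageBandwidth:
    def __init__(self):
        self.value = 0.0       # rolling average bandwidth for one I/O class
        self.last_slot = None  # slot index of the previous request

    def on_request_completed(self, slot_idx: int, request_size_bytes: int) -> float:
        gap = 0 if self.last_slot is None else slot_idx - self.last_slot
        if gap >= RESET_GAP_SLOTS:
            # The decay weight would be about 0.001 here, so simply reset.
            self.value = 0.0
        elif gap > 0:
            # Multiply by the decay weight for the elapsed slot gap
            # (0.950, 0.903, 0.857, ... for gaps of 1, 2, 3, ... slots).
            self.value *= DECAY_PER_SLOT ** gap
        # Add the size of the current request and record the result.
        self.value += request_size_bytes
        self.last_slot = slot_idx
        return self.value

resync_avg = RollingAverageBandwidth()
print(resync_avg.on_request_completed(slot_idx=6, request_size_bytes=1024))  # 1024.0
print(resync_avg.on_request_completed(slot_idx=8, request_size_bytes=1024))  # ~1948.16
```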
  • The time-based congestion discount operation executed by the time-based congestion adjuster 210 of the VSAN module 114 in each host computer 104 of the distributed storage system 100 in accordance with an embodiment of the invention is now described with reference to the process flow diagram of FIG. 5.
  • This operation is performed when a storage I/O request, e.g., a write request, is returned from the LSOM 206 of the VSAN module 114 due to congestion at the local storage 122.
  • First, a returned storage I/O request and a congestion signal from the LSOM 206 are received at the time-based congestion adjuster 210.
  • The congestion signal indicates the amount of congestion at the persistent layer of the host computer, i.e., the local storage devices of the host computer.
  • In an embodiment, the congestion signal includes a value from zero (0) to two hundred fifty-five (255), where 0 indicates no storage resource constraint and 255 indicates the maximum storage resource constraint.
  • The different classes of storage I/O requests may be differentiated by examining one or more flags that are set in the headers of the storage I/O requests. These flags may be set by the DOM client (which handles regular I/Os) and the DOM owner (which handles internally initiated I/Os, such as resync I/Os).
  • For example, the class of a storage I/O request may be identified by looking at an OperationType flag in the header of the storage I/O request, which may indicate that the storage I/O request is, for example but not limited to, a VM I/O request, a namespace I/O request, an internal metadata I/O request or a resync I/O request.
  • Thus, the OperationType flag of a storage I/O request can indicate whether that storage I/O request belongs to the class of resync storage I/O requests or not. If the returned storage I/O request is not a resync storage I/O request, the operation proceeds to block 522.
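  • A small sketch of such header-based classification is shown below; the RequestHeader structure and the enum values are hypothetical, since the text only states that flags such as an OperationType flag in the request header identify the I/O class.

```python
# Hedged sketch: classify a storage I/O request from a header flag. The
# RequestHeader structure and the enum values are hypothetical; the text only
# states that an OperationType flag in the header identifies the I/O class.
from dataclasses import dataclass
from enum import Enum, auto

class OperationType(Enum):
    VM_IO = auto()
    NAMESPACE_IO = auto()
    INTERNAL_METADATA_IO = auto()
    RESYNC_IO = auto()

@dataclass
class RequestHeader:
    operation_type: OperationType

def is_resync(header: RequestHeader) -> bool:
    # Only resync requests are eligible for congestion discounting; all other
    # classes have their congestion signals passed through unadjusted.
    return header.operation_type is OperationType.RESYNC_IO

print(is_resync(RequestHeader(OperationType.RESYNC_IO)))  # True
print(is_resync(RequestHeader(OperationType.VM_IO)))      # False
```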
  • If the returned storage I/O request is a resync storage I/O request, the operation proceeds to block 506, where the ratio of the time-based rolling average bandwidth for resync storage I/O requests to the time-based rolling average bandwidth for VM storage I/O requests is calculated.
  • This average bandwidth ratio will be referred to herein as the actual ratio of the time-based rolling average bandwidth for resync storage I/O requests to the time-based rolling average bandwidth for VM storage I/O requests, or simply the actual average bandwidth ratio.
  • In this manner, returned storage I/O requests are differentiated between one class of storage I/O requests, e.g., resync storage I/O requests, and the other classes of storage I/O requests, e.g., VM I/O requests, namespace I/O requests and internal metadata I/O requests.
  • Next, the actual average bandwidth ratio is normalized against an expected I/O fairness ratio of the average bandwidth for resync storage I/O requests to the average bandwidth for VM storage I/O requests to derive a normalized discounting ratio.
  • The expected average bandwidth ratio, which may be referred to herein simply as the expected ratio, may be configurable by the user. In this fashion, the actual average bandwidth ratio is compared with the expected average bandwidth ratio.
  • As an example, the default setting for the expected average bandwidth ratio may be a ratio of 4:1 for the average bandwidth for resync storage I/O requests to the average bandwidth for VM storage I/O requests.
  • The normalized discounting ratio may be expressed as a percent or a decimal.
  • If the normalized discounting ratio is greater than a second threshold, the operation proceeds to block 516, where the congestion discount is set to 100% or its equivalent. The operation then proceeds to block 520. However, if the normalized discounting ratio is less than the second threshold, i.e., less than the second threshold and greater than a first threshold, the operation proceeds to block 518, where the congestion discount is calculated using the normalized discounting ratio.
  • The value of the congestion discount, which can be between 0% and 100%, is determined linearly by the position of the normalized discounting ratio on a straight line from the first threshold to the second threshold, e.g., a straight line from 150% to 500%. Thus, for example, if the normalized discounting ratio is 325% (the midpoint of a line from 150% to 500%), then the congestion discount will be 50% (the midpoint between 0% and 100%).
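  • The linear mapping from the normalized discounting ratio to a congestion discount can be sketched as follows; the 150% and 500% thresholds follow the example above, and computing the normalized ratio as the expected ratio divided by the actual ratio is an assumption made here so that a resync workload that falls further below its expected share receives a larger discount.

```python
# Hedged sketch of the congestion discount calculation. The 150% and 500%
# thresholds follow the example in the text. Computing the normalized
# discounting ratio as expected_ratio / actual_ratio is an assumption, chosen
# so that a more-squeezed resync workload yields a larger discount.
FIRST_THRESHOLD = 1.5   # 150%: at or below this, no discount
SECOND_THRESHOLD = 5.0  # 500%: at or above this, full (100%) discount

def normalized_discounting_ratio(resync_avg_bw: float, vm_avg_bw: float,
                                 expected_ratio: float = 4.0) -> float:
    actual_ratio = resync_avg_bw / max(vm_avg_bw, 1e-9)
    return expected_ratio / max(actual_ratio, 1e-9)  # assumed direction

def congestion_discount(normalized_ratio: float) -> float:
    """Return a discount in [0.0, 1.0], linear between the two thresholds."""
    if normalized_ratio <= FIRST_THRESHOLD:
        return 0.0
    if normalized_ratio >= SECOND_THRESHOLD:
        return 1.0
    return (normalized_ratio - FIRST_THRESHOLD) / (SECOND_THRESHOLD - FIRST_THRESHOLD)

# The midpoint example from the text: a 325% normalized ratio -> 50% discount.
print(congestion_discount(3.25))  # -> 0.5
# Resync getting 0.5x VM bandwidth against a 4:1 expectation -> full discount.
print(congestion_discount(normalized_discounting_ratio(50.0, 100.0)))  # -> 1.0
```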
  • At block 520, the congestion signal for the returned storage I/O request, which is a resync storage I/O request, is updated or adjusted using the congestion discount.
  • The adjusted congestion signal is then transmitted to the sources of storage I/O requests so that a discounted delay can be applied to new storage I/O requests issued from the sources.
  • The adjusted or discounted congestion signal helps resync I/O requests incur less delay, offsets the single-OIO limit of the resync I/O pattern, increases resync I/O bandwidth and moves the system toward the expected I/O fairness ratio for the different classes of storage I/O requests.
  • In this manner, the approach described herein always rebalances more bandwidth to the low-OIO resync I/O once its bandwidth is squelched too much by high-OIO guest VM I/O due to the resource constraint congestion, and guarantees I/O fairness under the per-component resource constraint conditions.
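  • Applying the discount to the congestion signal before it is sent back to the sources might then look like the following; scaling the raw congestion value by (1 - discount) is an assumed interpretation of adjusting the signal.

```python
# Hedged sketch: produce the discounted congestion signal (D-CS2) that is
# transmitted to the sources. Scaling the raw 0-255 value by (1 - discount)
# is an assumed interpretation of "adjusting" the congestion signal.
def adjust_congestion_signal(raw_congestion: int, discount: float) -> int:
    return round(raw_congestion * (1.0 - discount))

# A 50% discount halves the congestion seen by resync I/O sources, so resync
# requests are delayed less than VM requests that see the raw value of 200.
print(adjust_congestion_signal(raw_congestion=200, discount=0.5))  # -> 100
```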
  • A method for managing storage I/O requests in a distributed storage system in accordance with an embodiment of the invention is described with reference to the flow diagram of FIG. 6.
  • First, congestion signals associated with storage requests at a host computer of the distributed storage system are generated based on congestion at local storage of the host computer that supports a virtual storage area network.
  • Next, the storage requests are differentiated between a first class of storage requests and at least one other class of storage requests.
  • For a storage request of the first class of storage requests, an actual ratio of a current average bandwidth of the first class of storage requests to a current average bandwidth of a second class of storage requests is calculated.
  • The actual ratio of the current average bandwidth of the first class of storage requests to the current average bandwidth of the second class of storage requests is compared with an expected ratio.
  • A congestion signal associated with the first class of storage requests is adjusted based on the comparison of the actual ratio to the expected ratio to produce an adjusted congestion signal.
  • Finally, the adjusted congestion signal is transmitted to at least one source of storage requests, the adjusted congestion signal being used for storage request fairness control.
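  • The overall flow of this method can be tied together in a short, self-contained sketch; it is an illustration under the same assumptions noted above (threshold values and the direction of the normalized ratio), not the patented implementation.

```python
# Hedged, self-contained sketch of the method of FIG. 6: take a raw congestion
# value, differentiate the request class, compare the actual bandwidth ratio
# with the expected ratio, and adjust the congestion signal before it is
# handed to the source. Thresholds and ratio direction are assumptions.
def adjusted_congestion(raw_congestion: int, is_resync: bool,
                        resync_avg_bw: float, vm_avg_bw: float,
                        expected_ratio: float = 4.0) -> int:
    if not is_resync:
        return raw_congestion  # other classes are not discounted
    actual_ratio = resync_avg_bw / max(vm_avg_bw, 1e-9)
    normalized = expected_ratio / max(actual_ratio, 1e-9)  # assumed direction
    if normalized <= 1.5:       # first threshold (150%)
        discount = 0.0
    elif normalized >= 5.0:     # second threshold (500%)
        discount = 1.0
    else:
        discount = (normalized - 1.5) / (5.0 - 1.5)
    return round(raw_congestion * (1.0 - discount))

# Resync is getting 1/8 of its expected 4:1 share, so its congestion signal is
# fully discounted and resync requests are not delayed further; a VM request
# with the same raw congestion still sees the full value.
print(adjusted_congestion(200, True, resync_avg_bw=50.0, vm_avg_bw=100.0))   # -> 0
print(adjusted_congestion(200, False, resync_avg_bw=50.0, vm_avg_bw=100.0))  # -> 200
```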
  • An embodiment of a computer program product includes a computer-useable storage medium to store a computer-readable program that, when executed on a computer, causes the computer to perform operations, as described herein.
  • Embodiments of at least portions of the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system.
  • A computer-usable or computer-readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • The computer-useable or computer-readable medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device), or a propagation medium.
  • Examples of a computer-readable medium include a semiconductor or solid state memory, non-volatile memory, NVMe device, persistent memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disc, and an optical disc.
  • Current examples of optical discs include a compact disc with read only memory (CD-ROM), a compact disc with read/write (CD-R/W), a digital video disc (DVD), and a Blu-ray disc.

Abstract

Computer system and method for managing storage requests in a distributed storage system uses congestion signals associated with storage requests, which are generated based on congestion at local storage of the computer system that supports a virtual storage area network. The storage requests are differentiated between a first class of storage requests and at least one other class of storage requests. For a storage request of the first class of storage requests, an actual ratio of a current average bandwidth of the first class of storage requests to a current average bandwidth of a second class of storage requests is calculated and compared with an expected ratio. The congestion signal associated with the storage request is then adjusted and transmitted to at least one source of storage requests for storage request fairness control.

Description

    BACKGROUND
  • A distributed storage system allows a cluster of host computers to aggregate local storage devices, which may be located in or attached to each host computer, to create a single and shared pool of storage. This pool of storage is accessible by all host computers in the cluster, including any virtualized instances running on the host computers, such as virtual machines. Because the shared local storage devices that make up the pool of storage may have different performance characteristics, such as capacity, input/output operations per second (IOPS) capabilities, etc., usage of such shared local storage devices to store data may be distributed among the virtual machines based on the needs of each given virtual machine.
  • This approach provides enterprises with cost-effective performance. For instance, distributed storage using pooled local storage devices is inexpensive, highly scalable, and relatively simple to manage. Because such distributed storage can use commodity storage devices, e.g., disk drives, in the cluster, enterprises do not need to invest in additional storage infrastructure. However, one issue that arises with this approach relates to contention between multiple clients, such as virtual machines on different host computers, accessing the shared storage resources. In particular, reduced overall performance and higher latency occur when multiple clients and/or other software processes need to simultaneously access the same local storage devices.
  • BRIEF DESCRIPTION OF THE DRAWINGS
  • FIG. 1 is a block diagram of a distributed storage system in accordance with an embodiment of the invention.
  • FIG. 2 is a block diagram of a virtual storage array network (VSAN) module in each host computer of the distributed storage system in accordance with an embodiment of the invention.
  • FIG. 3A illustrates a congestion signal for a non-resync I/O request being generated and transmitted to sources of storage I/O requests in accordance with an embodiment of the invention.
  • FIG. 3B illustrates a congestion signal for a resync I/O request being generated, adjusted and transmitted to the sources of storage I/O requests in accordance with an embodiment of the invention.
  • FIG. 4 is a process flow diagram of a storage request management operation executed in each host computer of the cluster in the distributed data storage system in accordance with an embodiment of the invention.
  • FIG. 5 is a process flow diagram of a storage request management operation executed in each host computer of the cluster in the distributed data storage system in accordance with an embodiment of the invention.
  • FIG. 6 is a flow diagram of a method of managing storage requests in a distributed storage system in accordance with an embodiment of the invention.
  • Throughout the description, similar reference numbers may be used to identify similar elements.
  • DETAILED DESCRIPTION
  • FIG. 1 illustrates a distributed storage system 100 in accordance with an embodiment of the invention. As shown in FIG. 1, the distributed storage system 100 provides a software-based “virtual storage area network” (VSAN) 102 that leverages local storage resources of host computers 104, which are part of a logically defined cluster 106 of host computers that is managed by a cluster management server 108. The VSAN 102 allows local storage resources of the host computers 104 to be aggregated to form a shared pool of storage resources, which allows the host computers 104, including any software entities running on the host computers, to use the shared storage resources.
  • The cluster management server 108 operates to manage and monitor the cluster 106 of host computers. The cluster management server may be configured to allow an administrator to create the cluster 106, add host computers to the cluster and delete host computers from the cluster. The cluster management server may also be configured to allow an administrator to change settings or parameters of the host computers in the cluster regarding the VSAN 102, which is formed using the local storage resources of the host computers in the cluster. The cluster management server may further be configured to monitor the current configurations of the host computers and any virtual instances running on the host computers, for example, virtual machines (VMs). The monitored configurations may include hardware configuration of each of the host computers and software configurations of each of the host computers. The monitored configurations may also include virtual instance hosting information, i.e., which virtual instances (e.g., VMs) are hosted or running on which host computers. The monitored configurations may also include information regarding the virtual instances running on the different host computers in the cluster.
  • The cluster management server 108 may also perform operations to manage the virtual instances and the host computers 104 in the cluster 106. As an example, the cluster management server may be configured to perform various resource management operations for the cluster, including virtual instance placement operations for either initial placement of virtual instances and/or load balancing. The process for initial placement of virtual instances, such as VMs, may involve selecting suitable host computers for placement of the virtual instances based on, for example, memory and CPU requirements of the virtual instances, the current memory and CPU loads on all the host computers in the cluster and the memory and CPU capacity of all the host computers in the cluster.
  • In some embodiments, the cluster management server 108 may be a physical computer. In other embodiments, the cluster management server may be implemented as one or more software programs running on one or more physical computers, such as the host computers 104 in the cluster 106, or running on one or more virtual machines, which may be hosted on any host computers. In an implementation, the cluster management server is a VMware vCenter™ server with at least some of the features available for such a server.
  • As illustrated in FIG. 1, each host computer 104 in the cluster 106 includes hardware 110, a hypervisor 112, and a VSAN module 114. The hardware 110 of each host computer includes hardware components commonly found in a physical computer system, such as one or more processors 116, one or more system memories 118, one or more network interfaces 120 and one or more local storage devices 122 (collectively referred to herein as “local storage”). Each processor 116 can be any type of a processor, such as a central processing unit (CPU) commonly found in a server. In some embodiments, each processor may be a multi-core processor, and thus, includes multiple independent processing units or cores. Each system memory 118, which may be random access memory (RAM), is the volatile memory of the host computer 104. The network interface 120 is an interface that allows the host computer to communicate with a network, such as the Internet. As an example, the network interface may be a network adapter. Each local storage device 122 is a nonvolatile storage, which may be, for example, a solid-state drive (SSD) or a magnetic disk.
  • The hypervisor 112 of each host computer 104 is a software interface layer that, using virtualization technology, enables sharing of the hardware resources of the host computer by virtual instances 124, such as VMs, running on the host computer. With the support of the hypervisor, the VMs provide isolated execution spaces for guest software.
  • The VSAN module 114 of each host computer 104 provides access to the local storage resources of that host computer (e.g., handles storage input/output (I/O) operations to data objects stored in the local storage resources as part of the VSAN 102) by other host computers 104 in the cluster 106 or any software entities, such as VMs 124, running on the host computers in the cluster. As an example, the VSAN module of each host computer allows any VM running on any of the host computers in the cluster to access data stored in the local storage resources of that host computer, which may include virtual disks (or portions thereof) of VMs running on any of the host computers and other related files of those VMs. In addition to these VM I/Os, the VSAN module may handle other types of storage I/Os, such as namespace I/Os, resync I/Os, and internal metadata I/Os. Namespace I/Os are write and read operations for configuration files for VMs, such as vmx files, log files, digest files and memory snapshots. Resync I/Os are write and read operations for data related to failed disks, host computers, racks or clusters. Internal metadata I/Os are write and read operations that are performed on internal data structures other than actual data, such as operations to read from logs, bitmaps, or policies. The VSAN module is designed to provide fairness among these different classes of storage I/O requests, which may have different I/O patterns due to their different workloads. As an example, the resync I/O traffic is one type of internal I/O traffic that needs to get its fair share compared to VM I/Os, but not so much as to significantly affect the throughput of the VM I/Os, which may be detectable by the VM users.
  • In some VSAN systems, there are two typical I/O workloads. The first is the external guest VM I/O workload, which can have a very high OIO (outstanding I/O) count. The second is the system-internal inter-component data resynchronization I/O workload, which is sequential from the perspective of the resynchronization job and always has only one OIO from the perspective of one VSAN object. For each I/O workload, there are different kinds of resource constraints in different layers of a VSAN system. For the lowest data persistent layer, generally speaking, there are two kinds of resource constraints: one is the shared resource constraint (e.g., the constraint is shared among all components within one disk group or a host computer), and the other is a non-shared constraint that is exclusively and individually applied to a data unit (e.g., a VSAN object or VSAN data component) and has no impact on other data components in the same disk group or host computer.
  • In order to avoid overwhelming the system, a conventional VSAN system may have a congestion-based flow control mechanism to propagate resource constraint notifications from the lowest data persistent layer to upper data path layers, which is used especially when the data persistent layer is close to or at its maximum resource constraint. However, the congestion-based flow control mechanism ultimately translates the resource constraint into a delay time, and the incoming I/O requests are delayed at the VSAN I/O distributed coordinator (distributed object manager (DOM) owner) or at the VSAN I/O interface layer (DOM client). Thus, if the resource constraint is not handled properly, the throughput of each I/O workload will be determined entirely by its OIO, which causes I/O unfairness between guest VM I/Os and VSAN resynchronization I/Os, as well as other types of storage I/Os. The VSAN module 114 of each host computer 104 in the distributed storage system 100 addresses this I/O fairness issue when the congestion or delay is caused by the per-component resource constraint.
  • The VSAN module 114 is designed to fairly process non-shared resource fullness, also known as component congestion, as opposed to disk group congestion. This is a challenging problem because, when only a small number of components receive large amounts of storage I/O requests, a component could be under a heavy VM I/O workload along with a resync I/O workload. In this scenario, component congestion will be more significant than disk group congestion and will dominate the per-I/O latency delay. As described in detail below, the VSAN module 114 uses the ratio of resync to non-resync I/O bandwidth to drive a subsequent throttling action, which adjusts the resync I/O discount since resync I/Os are susceptible to using low (e.g., down to 1) OIOs during the straggler phase. The resync discounting process is a feedback control loop to minimize resync I/O unfairness, which is more likely to happen than VM I/O unfairness because the VM I/O workload can always use more OIO more easily, while resync OIO is controlled to be fixed (e.g., 1) for each component. Thus, resync I/O throughput is determined by the latency of each resync I/O, which includes the delay converted from component congestion.
  • Turning now to FIG. 2, components of the VSAN module 114, which is included in each host computer 104 in the cluster 106, in accordance with an embodiment of the invention are shown. As shown in FIG. 2, the VSAN module includes a cluster level object manager (CLOM) 202, a distributed object manager (DOM) 204, a local log structured object management (LSOM) 206, a reliable datagram transport (RDT) manager 208, a time-based congestion adjuster 210 and a cluster monitoring, membership and directory service (CMMDS) 212. These components of the VSAN module may be implemented as software running on each of the host computers in the cluster.
  • The CLOM 202 operates to validate storage resource availability, and the DOM 204 operates to create components and apply configuration locally through the LSOM 206. The DOM also operates to coordinate with counterparts for component creation on other host computers 104 in the cluster 106. All subsequent reads and writes to storage objects funnel through the DOM 204, which will take them to the appropriate components. The LSOM operates to monitor the flow of storage I/O operations to the local storage 122, for example, to report whether a storage resource is congested. In an embodiment, the LSOM generates a congestion signal that indicates current storage usage, such as the current tier-1 device resource fullness, which indicates the current congestion at the local storage 122. The RDT manager 208 is the communication mechanism for storage I/Os in a VSAN network, and thus can communicate with the VSAN modules in other host computers in the cluster. The RDT manager uses the transmission control protocol (TCP) at the transport layer and is responsible for creating and destroying TCP connections (sockets) on demand. The time-based congestion adjuster 210 operates to selectively adjust or modify congestion signals from the LSOM 206 using the time-based rolling average bandwidths of different classes of storage I/O requests, which are computed by the DOM 204, to ensure fairness between the different classes of storage I/O requests, e.g., between resync storage I/O requests and non-resync storage I/O requests, with respect to the management of the storage I/O requests, as described in detail below. The CMMDS 212 is responsible for monitoring the VSAN cluster's membership, checking heartbeats between the host computers in the cluster, and publishing updates to the cluster directory. Other software components use the cluster directory to learn of changes in cluster topology and object configuration. For example, the DOM uses the contents of the cluster directory to determine the host computers in the cluster storing the components of a storage object and the paths by which those host computers are reachable.
  • In an embodiment, as illustrated in FIGS. 3A and 3B, the components of the VSAN module 114 of a host computer operate to generate and transmit congestion signals to sources 330 of storage I/O requests. In FIGS. 3A and 3B, some of the components of the VSAN module 114 are not illustrated. The sources 330 of storage I/O requests may include the host computers 104 of the cluster 106, the VMs 124 running on the host computers 104 and software processes or routines (not shown) operating in the host computers 104. Each congestion signal transmitted from the VSAN module 114 of the host computer 104 to the sources 330 provides information on the current fullness of the local storage 122 of that host computer for one or more classes of storage I/O requests. Each host computer that receives a congestion signal from the VSAN module 114 may implement a delay based on the received congestion signal, which may be a time-averaged latency-based delay. Since each congestion signal is associated with one or more classes of storage I/O requests, the congestion signals from the VSAN module 114 may be used to selectively control the issuance of different classes of storage I/O requests. Thus, if one class of storage I/O requests is indicated as being heavily congested by the received congestion signals, the host computers in the cluster may use that information to apply more backpressure on that class of storage I/O requests. Conversely, less backpressure may be applied to other, less backlogged classes of storage I/O requests so that the different classes of storage I/O requests may be processed in a fair manner.
  • The congestion signals for different classes of storage I/O requests are processed differently by the components of the VSAN module 114. In one embodiment, resync storage I/O requests and non-resync storage I/O requests are handled differently with respect to the congestion signals. In this embodiment, congestion signals generated by the LSOM 206 for resync storage I/O requests may be adjusted by the time-based congestion adjuster 210. However, congestion signals generated by the LSOM 206 for non-resync storage I/O requests, e.g., VM I/O requests, namespace I/O requests and internal metadata I/O requests, are not adjusted by the time-based congestion adjuster 210. Each congestion signal for resync storage I/O requests may be adjusted or discounted depending on the current time-based rolling average bandwidth for resync storage I/O requests and the current time-based rolling average bandwidth for storage I/O requests of another class, such as VM storage I/O requests, which are calculated by the DOM 204, as described in detail below. Thus, congestion signals for resync storage I/O requests may be discounted so that more resync storage I/O requests are processed than other non-resync storage I/O requests, such as VM storage I/O requests, when storage constraint conditions warrant such action.
  • As illustrated in FIG. 3A, for a non-resync storage I/O request, such as a VM storage I/O request, the LSOM 206 generates a congestion signal CS1 when conditions in the local storage 122 warrant that such a congestion signal be issued. As an example, a congestion signal may be generated by the LSOM 206 when the write requests in a write buffer (not shown) exceed a certain threshold. The value of a congestion signal may vary depending on how much that threshold is exceeded by the write requests in the write buffer. For example, the value of a congestion signal may be from zero (0) to two hundred fifty-five (255), where 0 indicates the minimum congestion for the local storage and 255 indicates the maximum congestion for the local storage. The congestion signal CS1 is transmitted from the LSOM 206 to the time-based congestion adjuster 210 with a returned storage I/O request. Because the congestion signal CS1 is associated with a non-resync storage I/O request, the congestion signal CS1 is transmitted to the sources 330 without being adjusted by the time-based congestion adjuster 210 so that a delay may be applied to non-resync storage I/O requests.
  • However, as illustrated in FIG. 3B, for a resync storage I/O request, the LSOM 206 generates a congestion signal CS2 when conditions in the local storage 122 warrant that such a congestion signal be issued. The congestion signal CS2 is transmitted from the LSOM 206 to the time-based congestion adjuster 210 with a returned storage I/O request. Because the congestion signal CS2 is associated with a resync storage I/O request, the congestion signal CS2 may first be adjusted or discounted before being transmitted to the sources 330. The amount or percentage by which the congestion signal CS2 is discounted depends on the current time-based rolling average bandwidth for resync storage I/O requests (resyncAB) and the current time-based rolling average bandwidth for storage I/O requests of another class (non-resyncAB), which are computed and provided to the time-based congestion adjuster 210, as well as the desired ratio between these two average bandwidths, as described in more detail below. If a discount is applied to the congestion signal CS2 by the time-based congestion adjuster 210, the resulting discounted congestion signal D-CS2 is then transmitted to the sources 330 so that less delay is applied to resync storage I/O requests and the different classes of storage I/O requests are processed in a fairer manner.
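  • The handling in FIGS. 3A and 3B can be summarized by a minimal sketch, in Python for illustration only; the class and function names (ReturnedIO, route_congestion_signal) and their fields are hypothetical and are not part of the VSAN module's actual code. The discount value itself is produced by the time-based congestion adjuster, as sketched later in connection with FIG. 5.

      from dataclasses import dataclass

      @dataclass
      class ReturnedIO:
          is_resync: bool   # derived from a header flag (e.g., an operation-type flag)
          congestion: int   # raw congestion value from the LSOM, in the range 0..255

      def route_congestion_signal(io: ReturnedIO, discount: float) -> int:
          """Pass non-resync congestion signals through unchanged; discount resync signals.

          The discount is a value in [0.0, 1.0] computed by the time-based
          congestion adjuster from the rolling average bandwidths (FIG. 5).
          """
          if not io.is_resync:
              return io.congestion                      # FIG. 3A: CS1 is forwarded as-is
          return int(io.congestion * (1.0 - discount))  # FIG. 3B: D-CS2 = CS2 * (1 - discount)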
  • The operation executed by the DOM 204 of the VSAN module 114 in each host computer 104 of the distributed storage system 100 to compute the time-based rolling average bandwidths in accordance with an embodiment of the invention is now described with reference to the process flow diagram of FIG. 4. This operation is performed after the processing of each storage I/O request by the DOM 204 has been completed, i.e., the storage I/O request has been processed by the DOM and passed to the LSOM 206. In this operation, rolling average bandwidths are computed based on the elapsed time between consecutive storage I/O requests. As the elapsed time between consecutive storage I/O requests increases, the rolling average bandwidth is reduced further. If too much time has elapsed between storage I/O requests, the rolling average bandwidth is reset to zero. As described below, the elapsed time between storage I/O requests is measured using a slot gap mechanism, which determines the elapsed time from the gap between the time slots of the two consecutive storage I/O requests.
  • At block 402, the timestamp at the moment when the processing of a current storage I/O request by the DOM has completed is recorded. The timestamp may be a numerical value that corresponds to the time when the timestamp is recorded. Next, at block 404, the timestamp for the current storage I/O request and the timestamp for the previous storage I/O request of the same class of storage I/O requests are normalized using the duration or size of predefined fixed-sized time slots, e.g., 200 milliseconds, which may be configurable. In an embodiment, each timestamp is normalized by dividing the timestamp value by the duration value of the time slots.
  • Next, at block 406, a slot index gap between the slot index of the current storage I/O request and the slot index of the previous storage I/O request is calculated. In an embodiment, the slot index of the previous storage I/O request is set to zero. The slot index of the current storage I/O request is computed by taking the difference between the normalized timestamp value for the current storage I/O request and the normalized timestamp value for the previous storage I/O request and dividing this difference by the duration value of the time slots. The resulting value is the time slot index (sometimes referred to herein simply as the “slot index”) of the current storage I/O request. Thus, if the difference between the two normalized timestamp values is less than the duration value of the time slots, the current and previous storage I/O requests will have the same slot index, i.e., both requests fall in the same time slot. However, if the difference between the two normalized timestamp values is greater than the duration value of the time slots, the current and previous storage I/O requests will have different slot indexes. In that case, the slot index gap grows with the difference between the normalized timestamp values.
  • Next, at block 408, a determination is made whether the current and previous storage I/O requests are in the same time slot, i.e., the slot index gap between the slot index of the current storage I/O request and the slot index of the previous storage I/O request is zero. If the current and previous storage I/O requests are in the same time slot, then the operation proceeds to block 416. However, if the current and previous storage I/O requests are not in the same time slot, then the operation proceeds to block 410, where a determination is made whether the slot index gap is greater than the total number of time slots, e.g., one hundred twenty-eight (128) time slots. This total number of time slots used by the DOM 204 may have a default setting of 128 time slots, but may be configurable by a user.
  • If the slot index gap is less than the total number of time slots, then the operation proceeds to block 412, where the time-based rolling average bandwidth for the class of storage I/O requests of the current storage I/O request is updated according to the slot index gap. In an embodiment, the time-based rolling average bandwidth is updated by multiplying the previous time-based rolling average bandwidth for the class of storage I/O requests of the current storage I/O request by the decay weight value for the slot index of the current storage I/O request. The decay weight value may be determined using a predefined decay rate for each unit time slot. As an example, the predefined decay rate may have a default setting of 95% decay for each subsequent time slot, which may be changed by the user. In this example, the decay weights for the first five (5) time slots are 95.0%, 90.3%, 85.7%, 81.5% and 77.4%, respectively, i.e., decay weight values of 0.950, 0.903, 0.857, 0.815 and 0.774. The operation then proceeds to block 416.
  • However, if the slot index gap is greater than the total number of time slots, then the operation proceeds to block 414, where the time-based rolling average bandwidth for the class of storage I/O requests of the current storage I/O request is set to zero. It is noted here that setting the time-based rolling average bandwidth to zero is similar to multiplying the previous time-based rolling average bandwidth for the class of storage I/O requests of the current storage I/O request by the decay weight value for the slot index of the current storage I/O request, because the decay weight value at the 128th time slot is only about 0.001, or 0.1%. The operation then proceeds to block 416.
  • At block 416, the size of the current storage I/O request is added to the time-based rolling average bandwidth for the class of storage I/O requests of the current storage I/O request to derive the current time-based rolling average bandwidth for that I/O class. The size of a storage I/O request can be any size, for example, 1,024 bytes. Next, at block 418, the current time-based rolling average bandwidth for the class of storage I/O requests of the current storage I/O request is recorded. The recorded time-based rolling average bandwidths for the different classes of storage I/O requests, e.g., resync and VM storage I/O requests, are used by the time-based congestion adjuster 210 of the VSAN module 114 for the time-based congestion discount operation, as described in detail below.
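  • The following is a minimal sketch of the rolling-average computation of blocks 402-418, under one plausible reading in which the slot index is obtained by dividing the completion timestamp by the slot duration; the class name RollingBandwidth and its members are illustrative assumptions rather than the DOM's actual implementation, and the constants shown (200 ms slots, 128 slots, 95% decay) are the default settings named above.

      SLOT_MS = 200          # time slot duration (default 200 ms, configurable)
      TOTAL_SLOTS = 128      # total number of time slots (default 128, configurable)
      DECAY_PER_SLOT = 0.95  # predefined decay rate of 95% per time slot (configurable)

      class RollingBandwidth:
          """Time-based rolling average bandwidth for one class of storage I/O requests."""

          def __init__(self):
              self.value = 0.0       # current rolling average bandwidth for this I/O class
              self.last_slot = None  # slot index of the previously completed request

          def on_io_completed(self, timestamp_ms: float, io_size_bytes: int) -> float:
              slot = int(timestamp_ms // SLOT_MS)  # blocks 402-404: timestamp -> slot index
              gap = 0 if self.last_slot is None else slot - self.last_slot  # block 406: slot index gap
              if gap == 0:
                  pass                                 # block 408: same time slot, no decay
              elif gap > TOTAL_SLOTS:
                  self.value = 0.0                     # block 414: too much time elapsed, reset
              else:
                  self.value *= DECAY_PER_SLOT ** gap  # block 412: decay by the slot index gap
              self.value += io_size_bytes              # block 416: add the current request size
              self.last_slot = slot
              return self.value                        # block 418: recorded for the adjuster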
  • The time-based congestion discount operation executed by the time-based congestion adjuster 210 of the VSAN module 114 in each host computer 104 of the distributed storage system 100 in accordance with an embodiment of the invention is now described with reference to a process flow diagram of FIG. 5. This operation is performed when a storage I/O request, e.g., a write request, is returned from the LSOM 206 of the VSAN module 114 due to congestion at the local storage 122.
  • At block 502, a returned storage I/O request and a congestion signal from the LSOM 206 are received at the time-based congestion adjuster 210. The congestion signal indicates the amount of congestion at the persistent layer of the host computer, i.e., the local storage devices of the host computer. In an embodiment, the congestion signal includes a value from zero (0) to two hundred fifty-five (255), where 0 indicates no storage resource constraint and 255 indicates the maximum storage resource constraint.
  • Next, at block 504, a determination is made whether the returned storage I/O request is a resync storage I/O request. In some embodiments, the different classes of storage I/O requests may be differentiated by examining one or more flags that are set in the headers of the storage I/O requests. These flags may be set by the DOM client (which handles regular I/Os) and the DOM owner (which handles internally initiated I/Os, such as resync I/Os). The class of a storage I/O request may be identified by looking at an OperationType flag in the header of the storage I/O request, which may indicate that the storage I/O request is, for example but not limited to, a VM I/O request, a namespace I/O request, an internal metadata I/O request or a resync I/O request. Thus, the OperationType flag of a storage I/O request can indicate whether that storage I/O request belongs to the class of resync storage I/O requests or not. If the returned storage I/O request is not a resync storage I/O request, the operation proceeds to block 522. However, if the returned storage I/O request is a resync storage I/O request, the operation proceeds to block 506, where the ratio of the time-based rolling average bandwidth for resync storage I/O requests to the time-based rolling average bandwidth for VM storage I/O requests is calculated. This average bandwidth ratio will be referred to herein as the actual ratio of the time-based rolling average bandwidth for resync storage I/O requests to the time-based rolling average bandwidth for VM storage I/O requests, or simply the actual average bandwidth ratio. Thus, returned storage I/O requests are differentiated between one class of storage I/O requests, e.g., resync storage I/O requests, and the other classes of storage I/O requests, e.g., VM I/O requests, namespace I/O requests and internal metadata I/O requests.
  • Next, at block 508, the actual average bandwidth ratio is divided by, and thereby normalized against, an expected I/O fairness ratio of the average bandwidth for resync storage I/O requests to the average bandwidth for VM storage I/O requests to derive a normalized discounting ratio. The expected average bandwidth ratio, which may be simply referred to herein as the expected ratio, may be configurable by the user. In this fashion, the actual average bandwidth ratio is compared with the expected average bandwidth ratio. In an embodiment, the default setting for the expected average bandwidth ratio may be a ratio of 4:1 for the average bandwidth for resync storage I/O requests to the average bandwidth for VM storage I/O requests. The normalized discounting ratio may be expressed as a percent or a decimal.
  • Next, at block 510, a determination is made whether the normalized discounting ratio is greater than a first threshold, which may be a configurable value expressed as a percent or a decimal. As an example, the first threshold may be set to a default setting of 150%. If the normalized discounting ratio is not greater than the first threshold, the operation proceeds to block 512, where the congestion discount is set to 0% or its equivalent. The operation then proceeds to block 520. However, if the normalized discounting ratio is greater than the first threshold, the operation proceeds to block 514, where another determination is made whether the normalized discounting ratio is less than a second threshold, which is higher than the first threshold. Like the first threshold, the second threshold may be a configurable value expressed as a percent or a decimal. As an example, the second threshold may be set to a default setting of 500%.
  • If the normalized discounting ratio is not less than the second threshold, i.e., greater than or equal to the second threshold, the operation proceeds to block 516, where the congestion discount is set to 100% or its equivalent. The operation then proceeds to block 520. However, if the normalized discounting ratio is less than the second threshold, i.e., between the first threshold and the second threshold, the operation proceeds to block 518, where the congestion discount is calculated using the normalized discounting ratio. In an embodiment, the value of the congestion discount, which can be between 0% and 100%, is determined by linear interpolation based on the position of the normalized discounting ratio between the first threshold and the second threshold, e.g., on a straight line from 150% to 500%. Thus, for example, if the normalized discounting ratio is 325% (the midpoint between 150% and 500%), then the congestion discount will be 50% (the midpoint between 0% and 100%).
  • Next, at block 520, the congestion signal for the returned storage I/O request, which is a resync storage I/O request, is updated or adjusted using the congestion discount. In an embodiment, the congestion signal for the returned storage I/O request is adjusted by multiplying the original congestion value received from the LSOM by one (1) minus the congestion discount, which can be expressed as: adjusted congestion value=original congestion value*(1−congestion discount).
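  • A minimal sketch of the discounting steps of blocks 506-520 follows, using the default settings named above (an expected ratio of 4:1 and thresholds of 150% and 500%); the function and parameter names are illustrative assumptions, not the patent's actual code, and the multiplication of block 520 corresponds to the (1 - discount) step in the routing sketch shown earlier.

      EXPECTED_RATIO = 4.0    # expected resync : VM average bandwidth ratio (default 4:1)
      FIRST_THRESHOLD = 1.5   # 150%: at or below this, no discount is applied
      SECOND_THRESHOLD = 5.0  # 500%: at or above this, the full discount is applied

      def congestion_discount(resync_avg_bw: float, vm_avg_bw: float) -> float:
          """Map the actual resync/VM average bandwidth ratio to a discount in [0.0, 1.0]."""
          if vm_avg_bw <= 0.0:
              return 0.0                              # assumption: no VM traffic to compare against
          actual_ratio = resync_avg_bw / vm_avg_bw    # block 506: actual average bandwidth ratio
          normalized = actual_ratio / EXPECTED_RATIO  # block 508: normalized discounting ratio
          if normalized <= FIRST_THRESHOLD:
              return 0.0                              # block 512: congestion discount of 0%
          if normalized >= SECOND_THRESHOLD:
              return 1.0                              # block 516: congestion discount of 100%
          # block 518: linear interpolation between the first and second thresholds
          return (normalized - FIRST_THRESHOLD) / (SECOND_THRESHOLD - FIRST_THRESHOLD)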
  • Next, at block 522, the adjusted congestion signal is transmitted to the sources of storage I/O requests so that a discounted delay can be applied to new storage I/O requests issued from the sources.
  • The adjusted or discounted congestion signal helps resync I/O requests experience less delay, balances off the single-OIO limit of the resync I/O pattern, increases resync I/O bandwidth and reaches the expected I/O fairness ratio for the different classes of storage I/O requests. Regardless of the I/O throughput of the component and the per-component resource constraint status, the approach described herein always rebalances more bandwidth to the low-OIO resync I/O once its bandwidth is squelched too much by high-OIO guest VM I/O caused by the resource constraint congestion, and guarantees I/O fairness under the per-component resource constraint conditions.
  • A method for managing storage I/O requests in a distributed storage system in accordance with an embodiment of the invention is described with reference to the flow diagram of FIG. 6. At block 602, congestion signals associated with storage requests at a host computer of the distributed storage system are generated based on congestion at local storage of the host computer that supports a virtual storage area network. At block 604, the storage requests are differentiated between a first class of storage requests and at least one other class of storage requests. At block 606, an actual ratio of a current average bandwidth of the first class of storage requests to a current average bandwidth of a second class of storage requests is calculated. At block 608, the actual ratio of the current average bandwidth of the first class of storage requests to the current average bandwidth of the second class of storage requests is compared with an expected ratio. At block 610, a congestion signal associated with the first class of storage requests is adjusted based on the comparison of the actual ratio to the expected ratio to produce an adjusted congestion signal. At block 612, the adjusted congestion signal is transmitted to at least one source of storage requests, the adjusted congestion signal being used for storage request fairness control.
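  • To tie the sketches above together, the following shows a hypothetical end-to-end use of the illustrative helpers (RollingBandwidth, congestion_discount, ReturnedIO and route_congestion_signal); all names and values are assumptions for exposition only and do not represent the actual VSAN module interfaces.

      # DOM side (FIG. 4): update the per-class rolling averages as requests complete.
      vm_bw = RollingBandwidth()
      resync_bw = RollingBandwidth()
      vm_bw.on_io_completed(timestamp_ms=1000.0, io_size_bytes=4096)
      resync_bw.on_io_completed(timestamp_ms=1050.0, io_size_bytes=65536)

      # Adjuster side (FIG. 5): discount the congestion of a returned resync request.
      discount = congestion_discount(resync_bw.value, vm_bw.value)  # about 0.71 in this example
      returned = ReturnedIO(is_resync=True, congestion=200)
      adjusted = route_congestion_signal(returned, discount)        # 200 * (1 - 0.71) ~= 57
      # The adjusted congestion signal is then transmitted to the sources (blocks 522 and 612).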
  • The components of the embodiments as generally described in this document and illustrated in the appended figures could be arranged and designed in a wide variety of different configurations. Thus, the following more detailed description of various embodiments, as represented in the figures, is not intended to limit the scope of the present disclosure, but is merely representative of various embodiments. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
  • The present invention may be embodied in other specific forms without departing from its spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by this detailed description. All changes which come within the meaning and range of equivalency of the claims are to be embraced within their scope.
  • Reference throughout this specification to features, advantages, or similar language does not imply that all of the features and advantages that may be realized with the present invention should be or are in any single embodiment of the invention. Rather, language referring to the features and advantages is understood to mean that a specific feature, advantage, or characteristic described in connection with an embodiment is included in at least one embodiment of the present invention. Thus, discussions of the features and advantages, and similar language, throughout this specification may, but do not necessarily, refer to the same embodiment.
  • Furthermore, the described features, advantages, and characteristics of the invention may be combined in any suitable manner in one or more embodiments. One skilled in the relevant art will recognize, in light of the description herein, that the invention can be practiced without one or more of the specific features or advantages of a particular embodiment. In other instances, additional features and advantages may be recognized in certain embodiments that may not be present in all embodiments of the invention.
  • Reference throughout this specification to “one embodiment,” “an embodiment,” or similar language means that a particular feature, structure, or characteristic described in connection with the indicated embodiment is included in at least one embodiment of the present invention. Thus, the phrases “in one embodiment,” “in an embodiment,” and similar language throughout this specification may, but do not necessarily, all refer to the same embodiment.
  • Although the operations of the method(s) herein are shown and described in a particular order, the order of the operations of each method may be altered so that certain operations may be performed in an inverse order or so that certain operations may be performed, at least in part, concurrently with other operations. In another embodiment, instructions or sub-operations of distinct operations may be implemented in an intermittent and/or alternating manner.
  • It should also be noted that at least some of the operations for the methods may be implemented using software instructions stored on a computer useable storage medium for execution by a computer. As an example, an embodiment of a computer program product includes a computer useable storage medium to store a computer readable program that, when executed on a computer, causes the computer to perform operations, as described herein.
  • Furthermore, embodiments of at least portions of the invention can take the form of a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system. For the purposes of this description, a computer-usable or computer readable medium can be any apparatus that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • The computer-useable or computer-readable medium can be an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system (or apparatus or device), or a propagation medium. Examples of a computer-readable medium include a semiconductor or solid state memory, non-volatile memory, NVMe device, persistent memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disc, and an optical disc. Current examples of optical discs include a compact disc with read only memory (CD-ROM), a compact disc with read/write (CD-R/W), a digital video disc (DVD), and a Blu-ray disc.
  • In the above description, specific details of various embodiments are provided. However, some embodiments may be practiced with less than all of these specific details. In other instances, certain methods, procedures, components, structures, and/or functions are described in no more detail than is necessary to enable the various embodiments of the invention, for the sake of brevity and clarity.
  • Although specific embodiments of the invention have been described and illustrated, the invention is not to be limited to the specific forms or arrangements of parts so described and illustrated. The scope of the invention is to be defined by the claims appended hereto and their equivalents.

Claims (21)

What is claimed is:
1. A method for managing storage requests in a distributed storage system, the method comprising:
generating congestion signals associated with storage requests at a host computer of the distributed storage system based on congestion at local storage of the host computer that supports a virtual storage area network;
differentiating the storage requests between a first class of storage requests and at least one other class of storage requests;
calculating an actual ratio of a current average bandwidth of the first class of storage requests to a current average bandwidth of a second class of storage requests;
comparing the actual ratio of the current average bandwidth of the first class of storage requests to the current average bandwidth of the second class of storage requests with an expected ratio;
adjusting a congestion signal associated with the first class of storage requests based on comparison of the actual ratio to the expected ratio to produce an adjusted congestion signal; and
transmitting the adjusted congestion signal to at least one source of storage requests, the adjusted congestion signal being used for storage request fairness control.
2. The method of claim 1, further comprising transmitting a second congestion signal, selected from the generated congestion signals, that is associated with another class of storage requests to the at least one source of storage requests without any adjustment.
3. The method of claim 1, wherein comparing the actual ratio of the current average bandwidth of the first class of storage requests to the current average bandwidth of the second class of storage requests with the expected ratio includes dividing the actual ratio by the expected ratio to derive a discounting ratio.
4. The method of claim 3, further comprising:
determining whether the discounting ratio is greater than a first threshold; and
setting a congestion discount to a first value if the discounting ratio is greater than a first threshold.
5. The method of claim 4, further comprising:
determining whether the discounting ratio is less than a second threshold; and
setting the congestion discount to a second value if the discounting ratio is less than a second threshold, wherein the second value is greater than the first value.
6. The method of claim 5, further comprising:
if the discounting ratio is not greater than the first threshold and not less than a second threshold, setting the congestion discount to an intermediate value between the first value and the second value using the discounting ratio.
7. The method of claim 6, wherein setting the congestion discount to an intermediate value includes determining the intermediate value by the position of the discounting ratio on a linear line from the first threshold to the second threshold.
8. The method of claim 1, further comprising:
recording a timestamp when a processing of a current storage request has completed;
determining a slot index of the current storage request, wherein the slot index indicates one of multiple time slots;
calculating a slot index gap between the slot index of the current storage request and a slot index of a previous storage request of the same class of storage requests; and
deriving an average bandwidth for the current storage request, the average bandwidth being the current average bandwidth of the first class of storage requests or the current average bandwidth of the second class of storage requests.
9. The method of claim 1, wherein differentiating the storage requests includes examining at least one flag in headers of the storage requests that indicates whether the storage requests belong to the first class of storage requests.
10. A non-transitory computer-readable storage medium containing program instructions for managing storage requests in a distributed storage system, wherein execution of the program instructions by one or more processors of a computer system causes the one or more processors to perform steps comprising:
generating congestion signals associated with storage requests at a host computer of the distributed storage system based on congestion at local storage of the host computer that supports a virtual storage area network;
differentiating the storage requests between a first class of storage requests and at least one other class of storage requests;
calculating an actual ratio of a current average bandwidth of the first class of storage requests to a current average bandwidth of a second class of storage requests;
comparing the actual ratio of a current average bandwidth of the first class of storage requests to a current average bandwidth of a second class of storage requests with an expected ratio;
adjusting a congestion signal associated with the first class of storage requests based on comparison of the actual ratio to the expected ratio to produce an adjusted congestion signal; and
transmitting the adjusted congestion signal to at least one source of storage requests, the adjusted congestion signal being used for storage request fairness control.
11. The computer-readable storage medium of claim 10, further comprising transmitting a second congestion signal, selected from the generated congestion signals, that is associated with another class of storage requests to the at least one source of storage requests without any adjustment.
12. The computer-readable storage medium of claim 10, wherein comparing the actual ratio of the current average bandwidth of the first class of storage requests to the current average bandwidth of the second class of storage requests with the expected ratio includes dividing the actual ratio by the expected ratio to derive a discounting ratio.
13. The computer-readable storage medium of claim 12, further comprising:
determining whether the discounting ratio is greater than a first threshold; and
setting a congestion discount to a first value if the discounting ratio is greater than a first threshold.
14. The computer-readable storage medium of claim 13, further comprising:
determining whether the discounting ratio is less than a second threshold; and
setting the congestion discount to a second value if the discounting ratio is less than a second threshold, wherein the second value is greater than the first value.
15. The computer-readable storage medium of claim 14, further comprising:
if the discounting ratio is not greater than the first threshold and not less than a second threshold, setting the congestion discount to an intermediate value between the first value and the second value using the discounting ratio.
16. The computer-readable storage medium of claim 15, wherein setting the congestion discount to an intermediate value includes determining the intermediate value by the position of the discounting ratio on a linear line from the first threshold to the second threshold.
17. The computer-readable storage medium of claim 10, further comprising:
recording a timestamp when a processing of a current storage request has completed;
determining a slot index of the current storage request, wherein the slot index indicates one of multiple time slots;
calculating a slot index gap between the slot index of the current storage request and a slot index of a previous storage request of the same class of storage requests; and
deriving an average bandwidth for the current storage request, the average bandwidth being the current average bandwidth of the first class of storage requests or the current average bandwidth of the second class of storage requests.
18. A computer system comprising:
memory; and
a processor configured to:
generate congestion signals associated with storage requests based on congestion at local storage of the computer system that supports a virtual storage area network;
differentiate the storage requests between a first class of storage requests and at least one other class of storage requests;
calculate an actual ratio of a current average bandwidth of the first class of storage requests to a current average bandwidth of a second class of storage requests;
compare the actual ratio of a current average bandwidth of the first class of storage requests to a current average bandwidth of a second class of storage requests with an expected ratio;
adjust a congestion signal associated with the first class of storage requests based on comparison of the actual ratio to the expected ratio to produce an adjusted congestion signal; and
transmit the adjusted congestion signal to at least one source of storage requests, the adjusted congestion signal being used for storage request fairness control.
19. The computer system of claim 18, wherein comparing the actual ratio of the current average bandwidth of the first class of storage requests to the current average bandwidth of the second class of storage requests with the expected ratio includes dividing the actual ratio by the expected ratio to derive a discounting ratio, and wherein the processor is further configured to set a congestion discount to a first value if the discounting ratio is greater than a first threshold, or set the congestion discount to a second value if the discounting ratio is less than a second threshold, wherein the second value is greater than the first value.
20. The computer system of claim 19, wherein the processor is further configured to set the congestion discount to an intermediate value between the first value and the second value using the discounting ratio if the discounting ratio is not greater than the first threshold and not less than a second threshold.
21. The computer system of claim 18, wherein the processor is further configured to:
record a timestamp when a processing of a current storage request has completed;
determine a slot index of the current storage request, wherein the slot index indicates one of multiple time slots;
calculate a slot index gap between the slot index of the current storage request and a slot index of a previous storage request of the same class of storage requests; and
derive an average bandwidth for the current storage request, the average bandwidth being the current average bandwidth of the first class of storage requests or the current average bandwidth of the second class of storage requests.
US15/947,313 2018-04-06 2018-04-06 Time-based congestion discounting for I/O fairness control Active 2039-01-08 US10965739B2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US15/947,313 US10965739B2 (en) 2018-04-06 2018-04-06 Time-based congestion discounting for I/O fairness control

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US15/947,313 US10965739B2 (en) 2018-04-06 2018-04-06 Time-based congestion discounting for I/O fairness control

Publications (2)

Publication Number Publication Date
US20190312925A1 true US20190312925A1 (en) 2019-10-10
US10965739B2 US10965739B2 (en) 2021-03-30

Family

ID=68096176

Family Applications (1)

Application Number Title Priority Date Filing Date
US15/947,313 Active 2039-01-08 US10965739B2 (en) 2018-04-06 2018-04-06 Time-based congestion discounting for I/O fairness control

Country Status (1)

Country Link
US (1) US10965739B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11403143B2 (en) * 2020-12-04 2022-08-02 Vmware, Inc. DRR-based two stages IO scheduling algorithm for storage system with dynamic bandwidth regulation

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6240446B1 (en) * 1998-10-14 2001-05-29 International Business Machines Corporation Multiplexing of multiple data packets for multiple input/output operations between multiple input/output devices and a channel subsystem having multiple channels
US6289383B1 (en) * 1998-11-30 2001-09-11 Hewlett-Packard Company System and method for managing data retrieval bandwidth
US7512940B2 (en) * 2001-03-29 2009-03-31 Microsoft Corporation Methods and apparatus for downloading and/or distributing information and/or software resources based on expected utility
US7228354B2 (en) * 2002-06-28 2007-06-05 International Business Machines Corporation Method for improving performance in a computer storage system by regulating resource requests from clients
US7739470B1 (en) * 2006-10-20 2010-06-15 Emc Corporation Limit algorithm using queue depth to control application performance
US8625635B2 (en) * 2010-04-26 2014-01-07 Cleversafe, Inc. Dispersed storage network frame protocol header
WO2012030271A1 (en) * 2010-09-03 2012-03-08 Telefonaktiebolaget L M Ericsson (Publ) Scheduling multiple users on a shared communication channel in a wireless communication system
CA2750345C (en) * 2011-08-24 2013-06-18 Guest Tek Interactive Entertainment Ltd. Method of allocating bandwidth between zones according to user load and bandwidth management system thereof
JP5954074B2 (en) * 2012-09-20 2016-07-20 富士通株式会社 Information processing method, information processing apparatus, and program.
US9454408B2 (en) * 2013-05-16 2016-09-27 International Business Machines Corporation Managing network utility of applications on cloud data centers
EP3025244A1 (en) * 2013-07-23 2016-06-01 Hewlett Packard Enterprise Development LP Work conserving bandwidth guarantees using priority
EP2833589A1 (en) * 2013-08-02 2015-02-04 Alcatel Lucent Intermediate node, an end node, and method for avoiding latency in a packet-switched network
JP6179321B2 (en) * 2013-09-27 2017-08-16 富士通株式会社 Storage management device, control method, and control program
JP6273966B2 (en) * 2014-03-27 2018-02-07 富士通株式会社 Storage management device, performance adjustment method, and performance adjustment program
US10110300B2 (en) * 2014-09-08 2018-10-23 Hughes Network Systems, Llc Bandwidth management across logical groupings of access points in a shared access broadband network
US9575794B2 (en) * 2014-09-30 2017-02-21 Nicira, Inc. Methods and systems for controller-based datacenter network sharing
US10158712B2 (en) * 2015-06-04 2018-12-18 Advanced Micro Devices, Inc. Source-side resource request network admission control
US10715442B2 (en) * 2016-08-23 2020-07-14 Netduma Software, LTD. Congestion control

Also Published As

Publication number Publication date
US10965739B2 (en) 2021-03-30

Similar Documents

Publication Publication Date Title
US20210075731A1 (en) Distributed policy-based provisioning and enforcement for quality of service
US10254991B2 (en) Storage area network based extended I/O metrics computation for deep insight into application performance
US10810143B2 (en) Distributed storage system and method for managing storage access bandwidth for multiple clients
JP6290462B2 (en) Coordinated admission control for network accessible block storage
EP2972746B1 (en) Storage unit selection for virtualized storage units
US9183016B2 (en) Adaptive task scheduling of Hadoop in a virtualized environment
WO2021126295A1 (en) Request throttling in distributed storage systems
US9509621B2 (en) Decentralized input/output resource management
CN104937584A (en) Providing optimized quality of service to prioritized virtual machines and applications based on quality of shared resources
US9807014B2 (en) Reactive throttling of heterogeneous migration sessions in a virtualized cloud environment
CN110018781B (en) Disk flow control method and device and electronic equipment
US20160275412A1 (en) System and method for reducing state space in reinforced learning by using decision tree classification
US20120290789A1 (en) Preferentially accelerating applications in a multi-tenant storage system via utility driven data caching
US10761726B2 (en) Resource fairness control in distributed storage systems using congestion data
US10965739B2 (en) Time-based congestion discounting for I/O fairness control
JP2023539212A (en) Storage level load balancing
US11436123B2 (en) Application execution path tracing for inline performance analysis
US20240111755A1 (en) Two-phase commit using reserved log sequence values
US10721181B1 (en) Network locality-based throttling for automated resource migration
US11048554B1 (en) Correlated volume placement in a distributed block storage service
Wen Improving Data Access Performance of Applications in IT Infrastructure
Zheng et al. CLIBE: Precise Cluster-Level I/O Bandwidth Enforcement in Distributed File System
CN116781707A (en) Load balancing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
FEPP Fee payment procedure

Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY

STPP Information on status: patent application and granting procedure in general

Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION

AS Assignment

Owner name: VMWARE, INC., CALIFORNIA

Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:XIANG, ENNING;KNAUFT, ERIC;XU, YIQI;AND OTHERS;SIGNING DATES FROM 20180925 TO 20180926;REEL/FRAME:047345/0185

STPP Information on status: patent application and granting procedure in general

Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER

STPP Information on status: patent application and granting procedure in general

Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS

STPP Information on status: patent application and granting procedure in general

Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED

STCF Information on status: patent grant

Free format text: PATENTED CASE