CN117480494A - Coordinated container scheduling for improved resource allocation in virtual computing environments - Google Patents

Coordinated container scheduling for improved resource allocation in virtual computing environments

Info

Publication number
CN117480494A
Authority
CN
China
Prior art keywords
resources
container
amount
virtual machine
processors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202180036233.5A
Other languages
Chinese (zh)
Inventor
Jeremy Warner Olmsted-Thompson (杰里米·华纳·奥姆斯特德-汤普森)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Google LLC
Original Assignee
Google LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Google LLC
Publication of CN117480494A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44 Arrangements for executing specific programs
    • G06F9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533 Hypervisors; Virtual machine monitors
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F2009/45583 Memory management, e.g. access or allocation
    • G06F2009/45595 Network integration; Enabling network access in virtual machine instances
    • G06F9/46 Multiprogramming arrangements
    • G06F9/48 Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806 Task transfer initiation or dispatching
    • G06F9/4843 Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061 Partitioning or combining of resources
    • G06F9/5077 Logical partitioning of resources; Management or configuration of virtualized resources
    • G06F2209/00 Indexing scheme relating to G06F9/00
    • G06F2209/50 Indexing scheme relating to G06F9/50
    • G06F2209/5014 Reservation

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The technique allocates available resources in a computing system through two-way communication between a hypervisor in the computing system and a container scheduler. The computing system that allocates resources includes one or more processors configured to receive a first scheduling request to initiate a first container on a first virtual machine having a set of resources. In response to the first scheduling request, a first amount of resources from the set of resources is allocated to the first container on the first virtual machine, the hypervisor in the host is notified of the first amount of resources allocated to the first container, and a second amount of resources from the set of resources is allocated to a second virtual machine in the host. A reduction in the available resources in the set of resources is determined. The container scheduler is notified by the hypervisor of the reduction in resources of the set of resources available on the first virtual machine.

Description

Coordinated container scheduling for improved resource allocation in virtual computing environments
Cross Reference to Related Applications
This application is a continuation of U.S. patent application Ser. No. 17/101,714, filed November 23, 2020, the disclosure of which is incorporated herein by reference.
Background
The containerized environment may be used to efficiently run applications on a distributed or cloud computing system. For example, various services of an application may be packaged into containers. Containers decouple applications from the underlying host infrastructure, thereby making it easier to deploy applications in different cloud or operating system (OS) environments. Containers may be logically grouped into container groups (pods). A container group, as referred to herein, is a set of one or more containers with shared storage/network resources, together with a specification of how the containers are to be run. A container group may be deployed on a cloud computing system, for example, on a cluster of nodes that are virtual machines ("VMs"). Deploying a container group produces a container group instance, which runs on a node of the cluster. A cluster may include one or more nodes running containers. The cluster control plane is a logical service running on the nodes of a cluster that manages the workloads and resources of the nodes according to various cloud and user-defined configurations and policies. The cluster control plane includes a plurality of software processes and a database storing the current state of the cluster. Clusters may be operated by a cloud provider, self-managed by an end user, or a hybrid combination thereof. For example, a cloud provider may have a cloud control plane that sets rules and policies for all clusters on the cloud, or that provides users with a simple way to perform management tasks on their clusters.
As more and more applications are executed in cloud computing systems, hardware and/or software in the cloud computing systems for supporting the applications are configured to be dynamically scalable to meet the needs of the applications at any given time. Each virtual machine running on a host computing device is allocated a portion of memory, such as random access memory, processing capacity, and/or other resources available on the host computing device. However, some virtual machines often remain idle for a relatively long time interval and only need to access a corresponding portion of memory for a short period of time. During such idle time intervals, resources allocated to those virtual machines, such as vCPU and memory, are typically not utilized. This unused resource capacity results in inefficient hardware utilization.
Disclosure of Invention
The present disclosure provides resource allocation management, such as an enhanced oversubscription mechanism, that can utilize resource capacity across one or more virtual machines in a cloud computing system with improved utilization. In one example, a method of allocating resources in a computing system includes: receiving, by one or more processors, a first scheduling request to launch a first container on a first virtual machine having a set of resources; allocating, by the one or more processors, a first amount of resources from the set of resources to the first container on the first virtual machine in response to the first scheduling request; notifying, by the one or more processors, a hypervisor in a host of the first amount of resources allocated to the first container; allocating, by the one or more processors, a second amount of resources from the set of resources to a second virtual machine in the host; determining, by the one or more processors, a reduction in the resources available in the set of resources; and notifying, by the one or more processors, the first virtual machine or a container scheduler of the reduction in the resources of the set of resources available on the first virtual machine.
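To make this flow concrete, the following Go sketch walks through the recited steps with made-up numbers; the types, function names, and values are illustrative assumptions and are not part of the claims or of any particular product API.

```go
package main

import "fmt"

// Resources is a hypothetical aggregate of vCPU and memory, used only for illustration.
type Resources struct {
	VCPU  float64
	MemGB float64
}

// Minus returns the component-wise difference of two resource amounts.
func (a Resources) Minus(b Resources) Resources {
	return Resources{VCPU: a.VCPU - b.VCPU, MemGB: a.MemGB - b.MemGB}
}

func main() {
	// The set of resources on the first virtual machine (the node).
	nodeTotal := Resources{VCPU: 8, MemGB: 64}

	// First scheduling request: allocate a first amount of resources to the first container.
	firstContainer := Resources{VCPU: 4, MemGB: 24}

	// Notify the hypervisor of the first amount; it can now see the unused remainder.
	unused := nodeTotal.Minus(firstContainer)
	fmt.Println("hypervisor notified; unused capacity on the node:", unused)

	// The hypervisor allocates a second amount from the unused capacity to a second VM...
	secondVM := Resources{VCPU: 2, MemGB: 16}

	// ...which reduces the resources available on the first virtual machine. The
	// hypervisor notifies the container scheduler of the reduced effective capacity.
	reducedEffective := nodeTotal.Minus(secondVM)
	fmt.Println("reduced effective capacity reported to the scheduler:", reducedEffective)
}
```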
In one example, the method further comprises: the one or more processors receive a second scheduling request to initiate a second container at the node, and allocate, by the one or more processors, a third amount of resources from the set of resources to the second container at the node in response to the second scheduling request.
In one example, the method further comprises: the hypervisor is notified by the one or more processors of the third amount of resources in the set of resources allocated to the second container, and a fourth amount of resources from the set of resources is allocated to a second virtual machine in the host by the one or more processors.
In one example, the method further comprises: determining, by the one or more processors, an accumulated amount of resources used from the set of resources; determining, by the one or more processors, whether the accumulated amount of resources occupies the entire amount of the set of resources on the host; and notifying, by the hypervisor, the container scheduler when the entire amount of the set of resources is consumed.
In one example, the method further comprises: receiving, by the one or more processors, at the container scheduler, a third scheduling request to schedule a third container on the node; and rejecting, by the one or more processors, the third scheduling request when the entire amount of the set of resources is consumed.
In one example, the container scheduler and the hypervisor are both controlled by a cloud service provider. The method also includes assigning, by the one or more processors, an upper bound for the set of resources on the node.
In one example, the method further includes notifying, by the one or more processors, the hypervisor of the upper limit of the set of resources on the node. Balloon drivers are utilized in the hosts to allocate resources.
In one example, the method further comprises: checking, by the one or more processors, the workload consumed in the first container and maintaining it below the first amount of resources requested. The container scheduler and the hypervisor are configured for bi-directional coordination or communication. The set of resources includes CPU and memory available on the host.
The present disclosure also provides a computing system that allocates resources. The computing system includes one or more processors configured to receive a first scheduling request to initiate a first container on a first virtual machine having a set of resources, allocate a first amount of resources from the set of resources to the first container on the first virtual machine in response to the first scheduling request, notify a hypervisor in a host of the first amount of resources allocated to the first container, allocate a second amount of resources from the set of resources to a second virtual machine in the host, determine an amount of reduction in resources available in the set of resources, and notify the first virtual machine or container scheduler of the amount of reduction in resources of the set of resources available on the first virtual machine.
In some examples, a second scheduling request is received to initiate a second container on the node. In response to the second scheduling request, a third amount of resources from the set of resources is allocated to a second container on the node.
In some examples, the hypervisor is notified of the third amount of resources in the set of resources allocated to the second container, and a fourth amount of resources from the set of resources is allocated to a second virtual machine in the host.
In some examples, a cumulative amount of resources used in the set of resources is determined. If the cumulative amount of resources is determined to occupy the total amount of the set of resources on the host, the hypervisor notifies the container scheduler when the total amount of the set of resources is consumed. Both the container scheduler and the hypervisor are controlled by the cloud service provider.
The present disclosure also provides a method of allocating resources in a computing system. The method includes coordinating, by one or more processors, between a container scheduler and a hypervisor in a host, and determining, by the one or more processors, allocation of an amount of unused resources in the host.
In some examples, both the container scheduler and the hypervisor are controlled by the cloud service provider. The method also includes, after allocating the amount of unused resources, notifying, by the one or more processors, the container scheduler of the reduced amount of resources.
Drawings
FIG. 1 depicts an example distributed system on which clusters may operate in accordance with aspects of the present disclosure.
Fig. 2 depicts a cluster utilized in the example distributed system of fig. 1 in accordance with an aspect of the present disclosure.
Fig. 3A-3D depict block diagrams of utilization of resource allocation of the node of fig. 2, in accordance with aspects of the present disclosure.
FIG. 4 depicts a timing diagram that illustrates exemplary coordination between a hypervisor and a container scheduler in a computing system in accordance with an aspect of the present invention.
Fig. 5 is an exemplary flow chart according to aspects of the present disclosure.
Detailed Description
The technology generally involves efficiently managing and allocating available resources in a virtualized computing environment established in a computing system, such as a cloud computing system. In some examples, where the resources of a virtual machine often remain idle for relatively long periods of time, unused resources may be efficiently allocated and assigned to other virtual machines or to other uses to increase utilization of the resources of the computing system. Such allocation or reclamation of unused resources may be accomplished through coordination between a hypervisor of the cloud computing system and a container scheduling system. The container scheduling system may include a container scheduler configured to operate or schedule containers running on nodes of the cloud computing system. Note that a node, as referred to herein, is a virtual machine registered with the container scheduling system that is scheduled to run container groups. The container scheduler may provide resource requirements, e.g., the resource expectations of containers scheduled in a container group running on a node, to the hypervisor in real time, so that the hypervisor is dynamically notified of the use of resources on the node. Once a node has unused resources, those unused resources may be reclaimed for other uses to improve hardware utilization. In one example, the unused resources of a node correspond to the total amount of resources allocated to the node minus the sum of the resource requirements of the container groups or containers scheduled on the node. Thus, coordination between the container scheduling system and the hypervisor in the cloud computing system allows the hypervisor to be dynamically informed, in real time, of the used capacity of the nodes, so that redundant or unused resources can be dynamically rearranged or allocated and hardware utilization can be increased. In some examples, the nodes are dedicated to running container groups and/or container workloads.
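The unused-resource relation described above can be captured in a few lines. The following Go sketch is purely illustrative; the function name, the vCPU units, and the example values are assumptions made for this example.

```go
package main

import "fmt"

// unusedResources computes the reclaimable capacity of a node as described above:
// the total amount of a resource allocated to the node minus the sum of the resource
// requirements of the container groups or containers scheduled on the node.
func unusedResources(nodeTotal float64, requests []float64) float64 {
	sum := 0.0
	for _, r := range requests {
		sum += r
	}
	return nodeTotal - sum
}

func main() {
	// A node allocated 8 vCPU with container groups requesting 2, 1.5, and 1 vCPU.
	fmt.Println(unusedResources(8, []float64{2, 1.5, 1})) // 3.5 vCPU reclaimable
}
```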
Fig. 1 depicts a functional diagram illustrating an exemplary distributed system 100 on which a cluster comprising a plurality of containers in container groups may operate. As shown, the system 100 may include a plurality of computing devices or computing systems, such as server computers 110,120,130,140 coupled to or in electrical communication with the network 190. For example, the server computers 110,120,130,140 may be part of a cloud computing system operated by a cloud service provider. The cloud service provider may also maintain one or more memories, such as memory 180 and memory 182. Further, as shown, the system 100 may include one or more client computing devices, such as a client computer 150 capable of communicating with the server computers 110,120,130,140 over a network 190.
The server computers 110,120,130,140 and the storage 180,182 may be maintained by a cloud service provider in one or more data centers. For example, as shown, server computers 110,120 and memory 180 may be located in data center 160, while server computers 130,140 and memory 182 may be located in another data center 170. The data centers 160,170 and/or the server computers 110,120,130,140 may be located a substantial distance from each other, such as in different cities, states, countries, continents, etc. In addition, within the data centers 160,170, one or more regions or zones may exist. For example, regions or zones may be logically partitioned based on any suitable attribute.
Clusters may operate on distributed system 100. For example, the clusters may be implemented by one or more processors in the data center, such as by processors 112,122 of server computers 110,120 or by processors 132 and 142 of server computers 130 and 140. Further, a storage system, such as a persistent disk ("PD"), for maintaining persistent and consistent records of the state of clusters may be implemented on a cloud computing system, such as in memory 180,182, or in data 118,128,138,148 of server computers 110,120,130, 140.
The server computers 110,120,130,140 may be similarly configured. For example, as shown, the server computer 110 may contain one or more processors 112, memory 114, and other components typically found in a general purpose computer. Memory 114 may store information accessible to processor 112, including instructions 116 that may be executed by processor 112. The memory may also include data 118 that may be retrieved, manipulated, or stored by the processor 112. Memory 114 may be one type of non-transitory computer-readable medium capable of storing information accessible to processor 112, such as a hard disk drive, a solid state drive, a tape drive, an optical memory, a memory card, ROM, RAM, DVD, CD-ROM, writeable and read-only memory. The processor 112 may be a well-known processor or other known type of processor. Alternatively, the processor 112 may be a dedicated controller, such as a GPU or ASIC, such as a TPU.
The instructions 116 may be a set of instructions, such as computing device code, that are executed directly by the processor 112, or a set of instructions, such as scripts, that are executed indirectly. In this regard, the terms "instructions," "steps" and "procedures" are used interchangeably herein. The instructions 116 may be stored in an object code format for direct processing by the processor 112, or other types of computer languages, including scripts or sets (collections) of individual source code modules that are interpreted or precompiled as needed. The functions, methods and routines of the instructions are explained in more detail in the foregoing examples and in the example methods below. The instructions 116 may include any of the example features described herein.
The data 118 may be retrieved, stored, or modified by the processor 112 according to the instructions 116. For example, although the system and method are not limited by a particular data structure, the data 118 may be stored in a computer register as a table having a plurality of different fields and records, in a relational or non-relational database, or in a JSON, YAML, proto, or XML document. The data 118 may also be formatted in a computer readable format such as, but not limited to, binary values, ASCII, or Unicode. In addition, the data 118 may include information sufficient to identify the relevant information, such as numbers, descriptive text, proprietary codes, pointers, references to data stored in other memory, including other network locations, or information used by a function to calculate the relevant data.
Although fig. 1 functionally illustrates the processor 112 and the memory 114 as being within the same block, the processor 112 and the memory 114 may include multiple processors and memories that may or may not be stored within the same physical housing. For example, some of the instructions 116 and data 118 may be stored on a removable CD-ROM, while others may be stored on a read-only computer chip. Some or all of the instructions and data may be stored in a location physically remote from the processor 112 but still accessible to the processor 112. Similarly, the processor 112 may include a collection of processors that may or may not operate in parallel. Each of the server computers 110,120,130,140 may include one or more internal clocks that provide timing information that may be used for time measurement of operations and programs executed by the server computers 110,120,130,140.
The server computers 110,120,130,140 may implement any of a variety of architectures and technologies including, but not limited to, direct Attached Storage (DAS), network Attached Storage (NAS), storage Area Network (SAN), fibre Channel (FC), fibre channel over ethernet (FCoE), hybrid architecture networks, and the like. In some cases, the server computer 110,120,130,140 may be a virtualized environment.
The server computers 110,120,130,140 and the client computer 150 may each be located at one node of the network 190 and capable of directly and indirectly communicating with other nodes of the network 190. For example, the server computers 110,120,130,140 may include a web server capable of communicating with the client computer 150 via the network 190, using the network 190 to transmit information to an application running on the client computer 150. The server computers 110,120,130,140 may also be computers in one or more load-balancing server farms that may exchange information with different nodes of the network 190 for the purpose of receiving, processing, and sending data to the client computers 150. Although only a few server computers 110,120,130,140, memories 180,182, and data centers 160,170 are shown in FIG. 1, it should be understood that a typical system may include a large number of connected server computers, a large number of memories, and/or a large number of data centers, each at a different node of the network 190.
Client computer 150 may also be configured similar to server computers 110,120,130,140, having a processor 152, memory 154, instructions 156, and data 158. The client computer 150 may have all of the components typically used with personal computing devices, such as a Central Processing Unit (CPU), memory (e.g., RAM and internal hard drives) storing data and instructions, input and/or output devices, sensors, clocks, and the like. Client computers 150 may comprise full-size personal computing devices, or they may comprise mobile computing devices capable of wirelessly exchanging data with servers over a network such as the internet. For example, client computer 150 may be a desktop or laptop computer, or a mobile phone or device such as a wireless-enabled PDA, tablet PC, or a netbook capable of obtaining information over the Internet or a wearable computing device, or the like.
Client computer 150 may include an application interface module 151. The application interface module 151 may be used to access one or more server computers, such as server computers 110,120,130,140, for example, available services. The application interface module 151 may include subroutines, data structures, object classes, and other types of software components for allowing servers and clients to communicate with each other. In one aspect, the application interface module 151 may be a software module operable in connection with several types of operating systems known in the art. The memory 154 may store data 158 accessed by the application interface module 151. Data 158 may also be stored on removable media such as a magnetic disk, magnetic tape, SD card, or CD-ROM, which may be connected to client computer 150.
Further, as shown in FIG. 1, client computer 150 may include one or more user inputs 153, such as a keyboard, mouse, mechanical actuator, soft actuator, touch screen, microphone, sensor, and/or other components. The client computer 150 may include one or more output devices 155, such as a user display, touch screen, one or more speakers, transducers or other audio outputs, a haptic interface, or other haptic feedback that provides non-visual and non-audible information to the user. Furthermore, although only one client computer 150 is depicted in FIG. 1, it should be appreciated that a typical system may serve a large number of client computers at different nodes of the network 190. For example, a server computer in system 100 may run the workload of an application on a large number of client computers.
As with memory 114, memories 180,182 may be any type of computerized memory capable of storing information accessible by one or more server computers 110,120,130,140 and client computers 150, such as hard drives, memory cards, ROM, RAM, DVD, CD-ROMs, writeable and read-only memories. In some cases, the memory 180,182 may include one or more persistent disks ("PDs"). Further, the memories 180,182 may comprise a distributed storage system in which data is stored on a plurality of different storage devices that may be physically located in the same or different geographic locations. The memories 180,182 may be connected to the computing device via a network 190 as shown in FIG. 1 and/or may be directly connected to any of the server computers 110,120,130,140 and the client computer 150.
The server computers 110,120,130,140 and client computer 150 are capable of direct and indirect communication, such as through a network 190. For example, using an Internet socket, the client computer 150 may connect to a service operating on a remote server computer 110,120,130,140 through an Internet protocol suite. The server computer 110,120,130,140 may establish a listening socket that may accept an originating connection for sending and receiving information. The network 190 and intermediate nodes may include various configurations and protocols including the Internet, World Wide Web, intranets, virtual private networks, wide area networks, local area networks, private networks using one or more corporate proprietary communication protocols, Ethernet, WiFi (e.g., 802.11b, g, n, or other such standards), and HTTP, as well as various combinations of the foregoing. Such communication may be accomplished through devices capable of transmitting data to and from other computers, such as modems (e.g., dial-up, cable or fiber optic) and wireless interfaces.
Fig. 2 is a functional diagram illustrating an exemplary cluster 200 running on a hardware host 210 that includes a software layer with a hypervisor 220. For example, a user, such as a developer, may design an application and use a client computer, such as client computer 150 of FIG. 1, to provide configuration data for the application. The container orchestration architecture of cluster 200 provided by the cloud computing system may be configured to encapsulate various services of an application into containers. The container orchestration architecture may be configured to allocate resources for the containers, load balance services provided by the containers, and scale the containers, e.g., by replication and deletion.
In one example shown in FIG. 2, one or more nodes 250a,250c may be configured in a cluster 200. Each node 250a,250c of the cluster 200 may run on a physical machine or a virtual machine. In one example described herein, the nodes 250a,250c are virtual machines (VMs) registered with a container scheduling system.
Cluster 200 may run on a distributed system such as system 100. For example, the nodes 250a,250c of the cluster 200 may run on one or more processors 112,122,134,144 in the data centers 160,170 shown in FIG. 1. The nodes 250a,250c may include containers 255a1,255a2,255b1,255b2,255c1,255c2 of computer code and program runtimes that form part of a user application. The containers 255a1,255a2,255b1,255b2,255c1,255c2 may be grouped into their respective container groups 252a,252b,252c. In one embodiment, the container groups 252a,252b,252c, including containers 255a1,255a2,255b1,255b2,255c1,255c2, are software instances that enable virtualization at the operating system level. Thus, with containerization, the kernel of the operating system managing the nodes 250a,250c is configured to provide multiple isolated user space instances. These instances appear to be the sole server from the perspective of an end user communicating with the containers in the container group. However, from the perspective of the operating system of the nodes 250a,250c on which the containers execute, the containers are user processes that are scheduled and dispatched by the operating system.
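As a rough data-model sketch of the grouping described above, the Go types below represent containers, their resource requests, and a container group; the type and field names are hypothetical and do not correspond to any specific orchestration API.

```go
package main

import "fmt"

// Container is an illustrative container with its resource request.
type Container struct {
	Name        string
	RequestVCPU float64
	RequestMem  float64 // GB
}

// ContainerGroup models a "pod": one or more containers with shared storage/network
// resources and a specification of how they are to be run.
type ContainerGroup struct {
	Name       string
	Containers []Container
}

// Request sums the resource requests of all containers in the group, which is the
// quantity the container scheduler can report to the hypervisor.
func (g ContainerGroup) Request() (vcpu, memGB float64) {
	for _, c := range g.Containers {
		vcpu += c.RequestVCPU
		memGB += c.RequestMem
	}
	return vcpu, memGB
}

func main() {
	group := ContainerGroup{
		Name: "252a",
		Containers: []Container{
			{Name: "255a1", RequestVCPU: 1, RequestMem: 2},
			{Name: "255a2", RequestVCPU: 0.5, RequestMem: 1},
		},
	}
	vcpu, mem := group.Request()
	fmt.Printf("container group %s requests %.1f vCPU and %.1f GB\n", group.Name, vcpu, mem)
}
```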
As shown in fig. 2, the container orchestration architecture may be configured as a cluster 200 comprising a plurality of nodes, e.g., node 1 250a and node 2 250b, that run on and share a common hardware platform 210 that serves as a hardware hosting system. The hardware platform 210 includes conventional computer hardware components such as one or more Central Processing Units (CPUs) 202, memory 204, one or more network interfaces 206, and persistent storage 208. The memory 204 may be, for example, random Access Memory (RAM). Persistent storage 208 may be any of a number of different types of persistent (e.g., non-volatile) storage devices such as magnetic disk drives, optical disk drives, solid State Drives (SSDs), and the like.
A virtualization software layer, hereinafter referred to as hypervisor 220, is installed on top of hardware platform 210. The hypervisor 220 allows for concurrent instantiation and execution of one or more nodes 250a,250c configured in the cluster 200. Communication between the nodes 250a,250c and the hypervisor 220 is facilitated by virtual machine coordinators 222a,222b. Each virtual machine coordinator 222a,222b is assigned to and monitors a respective node 250a,250b running on the cluster 200.
After instantiation, each node 250a,250c encapsulates a physical computing machine platform executing under the control of hypervisor 220. The virtual devices of a node are contained in virtual hardware platforms 223a,223b, which include, but are not limited to, one or more virtual CPUs (vCPUs), virtual random access memory (vRAM), i.e., virtual memory, virtual network interface adapters (vNICs), and virtual storage (vStorage). Thus, each node 250a,250c is a virtual machine (VM) registered with the container scheduling system.
Further as shown, in some examples, containers 255a1,255a2,255b1,255b2,255c1,255c2 are further organized into one or more groups of containers 252a,252b,252c. For example, as shown in fig. 2, nodes 250a,250c may include containers 255a1,255a2,255b1,255b2,255c1,255c2, wherein containers 255a1,255a2,255b1,255b2 are organized into container groups 252a,252b, respectively, in node 1 250a, and node 2 250c may include containers 255c1,255c2, wherein containers 255c1,255c2 are organized into container groups 252c. In some examples, containers 255a1,255a2,255b1,255b2,255c1,255c2 logically grouped into container groups 252a,252b,252c may then be deployed on a computing system, such as cluster 200 of nodes 250a,250c running on a single physical machine, e.g., hardware 210. In one example, containers 255a1,255a2,255b1,255b2,255c1,255c2 may be used to run more than one instance of a computing system on a single physical machine, e.g., hardware 210, where the VM and physical machine share a common hardware architecture.
The use of VMs is driven by the desire to consolidate many less capable physical machines onto a single more capable physical machine, typically reducing operating costs by multiplexing more virtual resources than are physically present, which is known as oversubscription. For example, a physical machine containing a single physical CPU may host multiple containers 255a1,255a2,255b1,255b2,255c1,255c2, each of which may be assigned a virtual CPU. During the clock cycles of the physical CPU 202, the CPU executes some number of instructions, divided among the virtual CPUs, such that the sum of all clock cycles consumed by the group of virtual CPUs is less than or equal to the clock cycle rate of the physical CPU 202. Thus, the time slices of the physical device 202 are divided, and oversubscription is achieved by having more than one virtual device per physical device 202. In one example, a single VM need not be aware that it is executing on a host with oversubscribed resources, as long as the sum of the resources actually used by the VMs executing on the host remains less than or equal to the physical resources available on the host. Techniques are disclosed herein that enable oversubscription to occur in a safe manner, reducing the possibility that a VM requests resources that the host does not physically have because of oversubscription. In this disclosure, such safe oversubscription is enabled at least in part by each VM or node notifying the hypervisor on the host machine of the cumulative resource reservations, i.e., the planned future resource usage, of the container workloads scheduled on that VM or node, and by one or more VM coordinators 222 in communication with the hypervisor, which monitor the current and planned future resource usage of the respective VMs and dynamically adjust, using balloon drivers, the virtual resources allocated to each VM on the host.
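A simple way to express the safety condition described above is to check that the sum of the resources actually reserved by the VMs on a host never exceeds the host's physical capacity, even though the resources presented to the VMs may. The Go sketch below assumes a single scalar resource (vCPU) for brevity; the names and values are illustrative only.

```go
package main

import "fmt"

// safeToAdd reports whether adding a new reservation keeps the host within its
// physical capacity, i.e., whether the oversubscription remains safe.
func safeToAdd(hostCapacity float64, reservations []float64, newReservation float64) bool {
	used := newReservation
	for _, r := range reservations {
		used += r
	}
	return used <= hostCapacity
}

func main() {
	// A host with 16 physical vCPUs and two VMs whose workloads reserve 6 and 5.5 vCPU.
	fmt.Println(safeToAdd(16, []float64{6, 5.5}, 4)) // true: 15.5 <= 16
	fmt.Println(safeToAdd(16, []float64{6, 5.5}, 5)) // false: 16.5 > 16
}
```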
Containers 255a1,255a2,255b1,255b2,255c1,255c2 and container groups 252a,252b,252c may have various workloads running on them; for example, the workloads may serve content or processes of a website or application. The container groups 252a,252b,252c may belong to a "service" that exposes the container groups to network traffic from users of the workload, such as users of an application or visitors to a website. Each container group 252a,252b,252c is managed by a respective container manager 260a,261a,260b. The container managers 260a,261a,260b may be configured to start, stop, and/or maintain containers and/or container groups based on instructions from the nodes. The container managers 260a,261a,260b may further include one or more load balancers that may be configured to distribute traffic, e.g., requests from services, to workloads running on the cluster 200. For example, container managers 260a,260b,260c may manage traffic allocated among the container groups 252a,252b,252c in nodes 250a,250c of cluster 200. In some examples, the container managers 260a,261a,260b are configured to schedule and dispatch multiple processes executing and accessing computer resources, such as vCPU or virtual memory, simultaneously using various algorithms. In some instances, the container managers 260a,261a,260b may dispatch a process to certain vCPUs that are less busy than others.
In one example, the container managers 260a,260b,260c may include balloon drivers 264a,264b,264c. The balloon drivers 264a,264b,264c may provide resource allocation techniques to facilitate resource allocation. For example, the balloon drivers 264a,264b,264c receive commands from hypervisor 220 regarding resource expectations, and then inflate or deflate the resources available within the VM to meet the target expectations. The balloon drivers 264a,264b,264c may communicate with the kernel schedulers 224a,224b to adjust available resources, such as the available memory or CPU for use by the containers.
After execution, the balloon drivers 264a,264b,264c may return to an idle state until triggered again by another timer event or by another command from the hypervisor 220 and/or VM coordinators 222a,222b. Through commands from the hypervisor 220 and/or VM coordinators 222a,222b, the balloon drivers obtain information from the kernel schedulers 224a,224b to adjust resources, e.g., the amount of vCPU or memory that can be used by the nodes 250a,250c. The balloon drivers 264a,264b,264c are configured to maintain a predetermined amount of resources, such as a predetermined amount of CPU or memory, activated for the nodes 250a,250c and the containers in the container groups.
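The inflate/deflate behaviour described above can be sketched as a single adjustment step. The Go code below is a simplified illustration under the assumption that the hypervisor communicates a scalar target; real balloon drivers operate on memory pages and CPU shares and are considerably more involved.

```go
package main

import "fmt"

// balloonAdjust moves a node's visible resources toward the target requested by the
// hypervisor. Inflating takes resources away from the node so the host can reuse
// them; deflating hands previously reclaimed resources back.
func balloonAdjust(visible, target float64) (newVisible, reclaimedByHost float64) {
	switch {
	case visible > target: // inflate: reclaim the surplus
		return target, visible - target
	case visible < target: // deflate: return resources to the node
		return target, 0
	default:
		return visible, 0
	}
}

func main() {
	// The hypervisor asks the node to see only 6 vCPU instead of 8.
	visible, reclaimed := balloonAdjust(8, 6)
	fmt.Printf("node now sees %.1f vCPU; %.1f vCPU reclaimed by the host\n", visible, reclaimed)
}
```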
The kernel schedulers 224a,224b are components of the hypervisor 220. The kernel schedulers 224a,224b are responsible for allocating the physical CPUs 202 among the various processes running on the cluster 200 at a given time, where the processes used herein are executing computer programs. For example, the kernel schedulers 224a,224b may determine what processes should run on the CPU 202 or memory 204, the order in which the processes should run, and so on.
In addition to managing access to the physical CPUs 202, in the embodiment described herein, the kernel schedulers 224a,224b are configured to determine a target vCPU size, which is a target number of vCPUs that a particular container or group of containers may use at a given point in time. The target vCPU size specified by the kernel schedulers 224a,224b is communicated to the balloon drivers 264a,264b,264c, as indicated by the directional line 290. The balloon drivers 264a,264b,264c may then utilize information and/or advice from the kernel schedulers 224a,224b to adjust the resources available to the container and/or container group.
For example, if the vCPUs running on a container are not fully utilized, the balloon drivers 264a,264b,264c may reduce the amount of resources available to the container, such as vCPU or memory, through communication or coordination between the container scheduler or node and the hypervisor. The kernel schedulers 224a,224b, as directed by the hypervisor 220, may then reclaim and control the unused resources for other uses.
The cluster control plane 270 may be used to manage the nodes 250a,250b in the cluster 200. For example, as shown, cluster control plane 270 may include a provisioning tool 274, a container scheduler 272, an API server 271, and the like. It should be noted that cluster 200 may include multiple cluster control planes for different processes. For example, cluster 200 may include multiple API servers and multiple container schedulers for different processes.
The cluster control plane 270 may be configured to perform management tasks for the clusters 200, including managing the clusters 200, managing nodes running within the clusters 200, provisioning nodes, migrating nodes from one cluster to another, and load balancing among clusters.
In one or more embodiments, cluster control plane 270 is configured to perform resource management on containers 255a1,255a2,255b1,255b2,255c1,255c2 in a virtualized environment. The cluster control plane 270 is also configured to deploy, update or remove instances of containers 255a1,255a2,255b1,255b2,255c1,255c2 on each node. By implementing the container on a node such as a virtual machine, response time may be improved because booting the container is typically faster than booting the VM. The container footprint is also smaller than the VM, thereby increasing density. Memory space may also be saved.
In one or more embodiments, the cluster control plane 270 can have a plurality of components configured to perform resource management on containers 255a1,255a2,255b1,255b2,255c1,255c2 in a virtualized environment. The cluster control plane 270 may create a virtual infrastructure by instantiating a packed group (or pool 291) of multiple nodes 250a,250b, e.g., virtual machines (VMs). The cluster control plane 270 may deploy, update, or remove instances of containers on each VM.
In one example, cluster control plane 270 may also include at least an API server 271, a container scheduler 272, and a provisioning tool 274 configured to perform different tasks. Note that other components not shown in FIG. 2 may also be included in the cluster control plane 270 to facilitate the operation and placement of containers in the nodes.
In one example, the API server 271 may be configured to receive requests, such as input API requests from user applications or from workloads running on the nodes 250a,250c, and manage the nodes 250a,250c to run workloads for processing these API requests. The API server 271 may be configured to route incoming requests to the appropriate server for proper work scheduling.
API server 271 may be configured to provide container scheduler 272 with the intent and status of cluster 200. Accordingly, container scheduler 272 may be configured to track and schedule the resources used by each container in the container groups based on information provided from API server 271, to ensure that the workload is not scheduled in excess of the available resources. To this end, container scheduler 272 may be provided with resource requirements, resource availability, and other user-provided constraints and policy directives, such as quality of service, affinity/anti-affinity requirements, data locality, and the like. As such, the role of container scheduler 272 may be to match resource provisioning to workload requirements.
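The matching step described above amounts to filtering the node pool by free capacity. The Go sketch below ignores the other constraints mentioned (quality of service, affinity/anti-affinity, data locality) and uses hypothetical names and values.

```go
package main

import "fmt"

// nodeStatus is an illustrative per-node view kept by the container scheduler.
type nodeStatus struct {
	Name     string
	FreeVCPU float64
	FreeMem  float64 // GB not yet reserved by scheduled containers
}

// pickNode returns the first node in the pool whose free CPU and memory cover the
// container's request, or false if no node has sufficient resources.
func pickNode(pool []nodeStatus, reqVCPU, reqMem float64) (string, bool) {
	for _, n := range pool {
		if n.FreeVCPU >= reqVCPU && n.FreeMem >= reqMem {
			return n.Name, true
		}
	}
	return "", false
}

func main() {
	pool := []nodeStatus{
		{Name: "node-1", FreeVCPU: 0.5, FreeMem: 2},
		{Name: "node-2", FreeVCPU: 3, FreeMem: 8},
	}
	if name, ok := pickNode(pool, 1.5, 4); ok {
		fmt.Println("container placed on", name) // node-2
	}
}
```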
The API server 271 may be configured to communicate with the nodes 250a,250 b. For example, API server 271 may be configured to ensure that configuration data matches configuration data of containers in nodes 250a,250b, such as containers 255a1,255a2,255b1,255b2,255c1,255c 2. For example, the API server 271 may be configured to communicate with the container managers 260a,260 b. The container manager 260a,260b may be configured to start, stop, and/or maintain containers based on instructions from the nodes 250a,250 c.
Provisioning tool 274 may also be used to manage provisioning, migration, and/or termination of containers or virtual machines on the virtualization infrastructure. Provisioning tool 274 may track virtual machine usage statistics on the virtualization infrastructure to identify maintenance tasks to be performed. For example, provisioning tool 274 may utilize the statistics to determine that a virtual machine, such as a node in a cluster, may advantageously be migrated from one data center to another. The decision may be based on access statistics associated with client accesses. The access statistics may include historical locations of accesses, recent locations, and trending locations. Some examples of access statistics include total access volume per zone per time period, traffic change per zone per time period, and other such statistics.
The container scheduler 272 in the cluster control plane 270 may manage virtualized resources such that a specified number of free CPU and memory resources are available within the pool 291 of nodes 250a,250c at any given time. When a user requests a container, container scheduler 272 may place the requested container, based on the user-specified container request, on a node 250a,250c within the specified pool 291 that has sufficient resources, e.g., CPU and memory. The container scheduler 272 may intelligently allocate the available resources across the pool 291 of nodes 250a,250c for initial placement of containers within the pool of nodes. Accordingly, container scheduler 272 may be configured to identify the CPU and memory currently available on each node 250a,250c in pool 291 to identify a VM for initial placement of a container. The container scheduler 272, the instances scheduled on nodes 250a,250b, and the nodes 250a,250b may be collectively referred to as a container scheduling system. After the initial placement of a container is complete, the container scheduler 272, the instances scheduled on nodes 250a,250b, or the nodes 250a,250b may begin to dynamically communicate or coordinate with the hypervisor 220 to inform the hypervisor of the consumed resources and of the unused resources available in the node. The unused available resources may then be further reclaimed or used for other purposes, as will be further described below with reference to FIGS. 3A-3D and 4.
In one example, container scheduler 272 may also be used to perform tasks, such as determining a container instance into which to launch an application, and causing the instance running in the determined container to launch the application. Container scheduler 272 manages the scheduling of containers in cluster 200, determining which containers are running on nodes at a particular time. In one example, a single container may be run at a time, or alternatively, multiple containers may be run simultaneously based on the number of physical processors and/or using processor cores in cluster 200. Each container may include one or more processors for executing the workload of the container.
In some examples, container scheduler 272 may schedule groups of containers based on the virtual storage volumes used by a particular container. Container scheduler 272 may check whether a container is using any persistent volumes or has unused resources free to be utilized. If such spare or unused resources exist, container scheduler 272 may identify them and notify hypervisor 220, as indicated by directional arrow 292, so that the hypervisor may reclaim or use such resources for a different arrangement, such as creating another virtual machine. For example, container scheduler 272 may communicate with the containers so that metrics of the instances running on the containers, such as processor utilization and memory utilization data, may be reported to container scheduler 272 in order to better utilize unused resources.
In one example, container scheduler 272 and/or instances scheduled on nodes 250a,250c may coordinate bi-directionally with hypervisor 220, such that container scheduler 272 and/or nodes 250a,250c may dynamically report actual resource usage to hypervisor 220, and hypervisor 220 may reclaim and reuse such unused resources for other arrangements when they become available.
As described above, oversubscription occurs when more virtual resources than physical resources are multiplexed, e.g., more than one virtual device per physical device. Traditionally, an operator of a virtualized computing environment, such as a cloud service provider, may provide its users, such as end users or customers, with access to physical resources and allow the users to execute their programs using the physical resources of the service provider. For example, each cluster 200 includes a hypervisor 220, or other virtualization component, that hosts one or more nodes, such as virtual machines. Each container in a node may be owned by a particular user and may use physical resources of the virtualized computing environment to execute a service or application. Users may request, access, and manage the virtual machines assigned to them through API server 271, container scheduler 272, and other management tools. Oversubscription presents a risk when two virtual machines attempt to use the entire available memory presented to them, i.e., made visible to the virtual machines, while insufficient memory is available from hardware 210 to support their operation.
Thus, through coordination between container scheduler 272, the nodes 250a,250b, and hypervisor 220, container scheduler 272 monitors the actual use of resources, e.g., CPUs and/or memory, in each container so that hypervisor 220 can have real-time information about the availability of unused resources. Thus, when the resources actually used by the containers in a node run below the resources allocated to the node, the container scheduler 272 or the node may communicate with the hypervisor 220 and coordinate such availability so that the hypervisor 220 may determine whether to suggest adding other resources or to allocate new incoming requests to other available resources, e.g., other nodes or other auxiliary resources, in order to avoid the risk of running into oversubscription. In this regard, the hypervisor may temporarily utilize these spare resources for other uses, such as scheduling another VM. Thus, the hypervisor may temporarily present nodes 250a,250b with the illusion of more resources than are physically available, so that unused resources in nodes 250a,250b may be reused as needed.
In contrast, a conventional bin-packing process, configured to pack containers of different sizes into units having a predetermined physical resource capacity, can at most utilize all of the physical resources in a unit, but no more, such that any underutilization of a VM inevitably results in unused hardware resources. Through coordination between container scheduler 272, the nodes 250a,250b, and hypervisor 220, container scheduler 272 monitors the actual usage of resources in each container so that hypervisor 220 can have real-time information about the availability of unused resources. Thus, unused resources can be further utilized as needed.
In some examples, further requirements/constraints may be established when scheduling workloads onto nodes to prevent the risk of running into oversubscription. For example, each workload scheduled on a node is required to set a resource request with an upper bound. Thus, a user is only allowed to specify a workload with a particular resource capacity, without exceeding the upper limit, and the workload is not allowed to consume resources beyond its request, such as the upper limit set for each node. Accordingly, a workload assigned to a node needs to specify the resource limits used on the node, such as the number of vCPUs and the memory requirement. Once a workload is scheduled or deleted, container scheduler 272 may immediately communicate and send the actual current usage to hypervisor 220, so that hypervisor 220 may shrink the resource limit to the reserved upper bound and reclaim excess resources through balloon drivers 264a,264b,264c for other uses, thereby improving hardware utilization. For example, when a workload requests resources of 5.5 vCPU and 32GB of memory on an instance on a node where the instance has been allocated 8 vCPU and 64GB of resources, then by utilizing bi-directional communication and coordination between hypervisor 220 and container scheduler 272 and/or the nodes, the unused resources of 2.5 vCPU and 32GB of memory may be safely reclaimed by hypervisor 220 for other configurations, such as creating additional virtual machines, to improve hardware utilization.
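The arithmetic of the 8 vCPU / 64GB example above is spelled out in the short Go snippet below; the variable names are illustrative only.

```go
package main

import "fmt"

func main() {
	// Resources allocated to the instance on the node.
	instanceVCPU, instanceMemGB := 8.0, 64.0
	// Upper-bounded resource request of the scheduled workload.
	requestVCPU, requestMemGB := 5.5, 32.0

	// Excess capacity the hypervisor may safely reclaim via the balloon driver.
	reclaimVCPU := instanceVCPU - requestVCPU    // 2.5 vCPU
	reclaimMemGB := instanceMemGB - requestMemGB // 32 GB

	fmt.Printf("reclaimable: %.1f vCPU and %.0f GB of memory\n", reclaimVCPU, reclaimMemGB)
}
```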
After hypervisor 220 reclaims the excess resources, a reduced resource value, or reduced available resource baseline, may be set, and hypervisor 220 may then notify container scheduler 272 or the node of the reduced effective resources/capacity. Thus, container scheduler 272 or the node knows that the available capacity has become limited, and the affected nodes can be updated with the new reduced capacity. New incoming workloads may then be set or scheduled against the new reduced capacity. Thus, by utilizing bi-directional communication and coordination between hypervisor 220 and container scheduler 272 and/or the nodes, the actual use of hardware resources may be dynamically adjusted while minimizing the risk of resource contention due to oversubscription.
Fig. 3A-3D depict block diagrams of resource allocation in the virtual machine of fig. 2, in accordance with aspects of the present disclosure. Resource allocation, such as vCPU and virtual memory allocation, may be performed by utilizing the balloon drivers 264a,264b,264c configured in the nodes 250a,250b. As shown in fig. 3A, each computing system may be allocated a resource capacity 300. An instance running on a node may have a node capacity 304. Each workload and/or instance scheduled in containers 255a1,255a2,255b1,255b2,255c1,255c2 in each container group 252a,252b,252c may consume a particular amount of the capacity set in node capacity 304. In the example shown in fig. 3A, some unused capacity 308 is available within the capacity 304 allocated to the node, and some unused capacity 306 is available from the host. A generic instance 302 may be set up on the host.
When container scheduler 272 schedules an additional container group 310 with container 255d in the node, additional capacity may be consumed in node capacity 304, as shown in fig. 3B, reducing the available capacity from unused capacity 308 to the reduced available capacity 312.
Thus, when the available resources in node capacity 304 change, container scheduler 272 may notify hypervisor 220 of the capacity change. Such resources, such as the reduced available capacity 312, may then be reserved using the balloon drivers. In the example shown in fig. 3C, after the additional container group 310 with container 255d is scheduled in the node, the unconsumed portion of the currently available resource capacity may include the reduced available capacity 312 and the unused capacity 306, as shown by dashed line 352.
The hypervisor 220 may then reclaim the reduced available capacity 312 for other uses. For example, in the embodiment shown in FIG. 3D, the hypervisor 220 may utilize the reduced available capacity 312 in combination with unused capacity 306 in the host to schedule another generic instance 350, such as another virtual machine, in the computing system. The capacity available in the node is then shrunk from the larger capacity 304 to the reduced effective capacity 380, with a portion of the capacity 312 being reallocated for other uses, such as for the generic instance 350 to create another virtual machine.
Fig. 4 depicts a timing diagram illustrating example communications and coordination between hypervisor 220 and container scheduler 272 and/or nodes for resource allocation. In this example, a node 402 is scheduled having 8v CPU resources and memory of 32 GB. When a first container is started on a node, as shown in communication path 401, the first container may have a capacity request requesting a first container instance of capacity 4vCPU and 12GB of memory. After allocating the first container with the requested capacity of the first container instance on node 402, node 402 may inform hypervisor 220 about the reserved or used capacity of the first container instance of the first container, as shown in communication path 402. The hypervisor 220 may then be notified of the amount of resource change available in the node and/or host. After completion of the notification, the VM coordinator 222 then schedules another VM, e.g., the first VM, with unused resources and provides feedback to the hypervisor 220 regarding the capacity used by the first VM, as shown in communication path 403. Note that the first VM scheduled herein may be any suitable type of VM, such as a VM running an operating system, a VM registered with a container scheduling system (or referred to as a node), and so forth.
When a second container is then started with a second container instance, the second container instance may request 1.5 vCPU and 4 GB of memory on the node, as shown by communication path 404. After the second container is allocated the requested capacity of the second container instance on the node 402, the node 402 notifies the hypervisor 220 of the accumulated resources, e.g., the total capacity of 5.5 vCPU and 16 GB of memory consumed by the first and second containers, as shown by communication path 405. The hypervisor 220 is thus notified of the change in the amount of resources in the node and/or host. After the notification is complete, the VM coordinator 222 may utilize or reclaim the unused resources to schedule another VM, such as a second VM, when needed. As described above, the second VM scheduled here may be any suitable type of VM, such as a VM running an operating system, a VM registered with the container scheduling system (also referred to as a node), and so forth.
After the second VM is scheduled, the updated capacity is fed back to the hypervisor 220, indicating that, based on the resources used by the first and second VMs, the reclaimable capacity has been fully consumed and no spare resources remain, as shown by communication path 406. In this example, as indicated at block 450, 2 vCPU and 16 GB of memory have been reclaimed and consumed for scheduling the first VM and the second VM, leaving minimal or no available resources on the node 402. Thus, the hypervisor 220 may be notified that the spare capacity of 2 vCPU and 16 GB of memory has been consumed and that no additional spare or unused resources are available at this point in time.
The hypervisor 220 may then notify the node 402 and/or the container scheduler 272 directly, collectively referred to as the container scheduling system, of a new available-capacity baseline, i.e., a reduced effective capacity, as shown by communication path 407. A new available-capacity baseline is then set on the node 402, e.g., a reduced effective capacity of 6 vCPU and 16 GB of memory, as shown in block 451.
Thus, after the communication and coordination between the container scheduler 272 and the hypervisor 220, the container scheduler 272 and/or the node, collectively referred to as the container scheduling system, is notified of the reduced effective capacity of 6 vCPU and 16 GB of memory, as shown in block 451, of which the first and second containers have already consumed a total of 5.5 vCPU and 16 GB of memory. Therefore, when the container scheduler 272 makes a further attempt to schedule a third container requiring 1 vCPU and 2 GB of memory, as indicated by dashed communication path 408, the attempt may be denied, as indicated by dashed communication path 409, because the reduced effective capacity leaves no spare capacity for additional containers. Scheduling the new container would require searching for another available instance or another node.
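To make the exchange of Fig. 4 concrete, the following sketch replays the numeric example above. It is an illustration only, assuming hypothetical names such as CoordinatedNode, notify_hypervisor and set_baseline; the real coordination would run between the node, the container scheduler 272 and the hypervisor 220 rather than inside one process.

# Compact replay of the Fig. 4 example; class and method names are
# illustrative assumptions, not interfaces defined by the disclosure.

class CoordinatedNode:
    def __init__(self, vcpu, mem_gb):
        self.baseline = [vcpu, mem_gb]   # effective capacity visible to the scheduler
        self.used = [0.0, 0.0]           # accumulated container consumption

    def schedule_container(self, vcpu, mem_gb):
        free = [self.baseline[0] - self.used[0], self.baseline[1] - self.used[1]]
        if vcpu > free[0] or mem_gb > free[1]:
            return False                 # path 409: attempt denied, no spare capacity
        self.used[0] += vcpu
        self.used[1] += mem_gb
        return True

    def notify_hypervisor(self):
        # Paths 402/405: report accumulated consumption so the hypervisor
        # can see how much of the node is actually unused.
        return self.used

    def set_baseline(self, vcpu, mem_gb):
        # Path 407: the hypervisor pushes down the reduced effective capacity.
        self.baseline = [vcpu, mem_gb]


node = CoordinatedNode(8, 32)
node.schedule_container(4, 12)       # path 401: first container, 4 vCPU / 12 GB
node.schedule_container(1.5, 4)      # path 404: second container, 1.5 vCPU / 4 GB
print(node.notify_hypervisor())      # [5.5, 16.0] reported to the hypervisor

# Hypervisor reclaims 2 vCPU / 16 GB for other VMs (block 450) and
# announces the new baseline of 6 vCPU / 16 GB (block 451).
node.set_baseline(6, 16)

print(node.schedule_container(1, 2))  # paths 408/409: third container rejected -> False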
Thus, by utilizing coordination and communication between the container scheduler and/or node and the hypervisor, the hypervisor can dynamically reclaim unused resources for other uses. The container scheduler is also promptly notified of such resource allocation, reducing the likelihood of actual oversubscription when an attempt to schedule additional containers on the VM would request more resources than the physical hardware can provide.
Fig. 5 depicts an exemplary flowchart 500 of operations that may be performed by one or more processors, e.g., the one or more processors 112, 122. For example, the processors 112, 122 may receive data and commands. Based on the received data or commands, the processors 112, 122 may make various determinations, as shown in the flowchart 500. The flowchart 500 depicts one example of coordination between the hypervisor 220 and the container scheduler 272 and/or nodes, executed by the one or more processors 112, 122.
Referring to Fig. 5, at block 502, a container scheduler receives a request to schedule a first container on a node, e.g., a first VM, in a computing system. The request may include a desired amount of resources, e.g., a first amount of resources, for scheduling the first container on the node.
At block 504, the first amount of resources is allocated on the node to create the first container.
At block 506, after the first container is scheduled, information about the first amount of resources allocated to the first container is sent to the hypervisor.
At block 508, the hypervisor may then determine the amount of unused resources remaining in the node.
At block 510, the unused resources may then be reclaimed by the hypervisor and allocated for other arrangements or purposes, e.g., creating another virtual machine such as a second VM. Note that the second VM scheduled here may be any suitable type of VM, such as a VM running an operating system, a VM registered with the container scheduling system (also referred to as a node), and so forth.
At block 512, after the unused resources are reclaimed for other uses, such as creating another VM, the hypervisor may communicate and coordinate with the container scheduler and/or node to reduce the effective capacity, i.e., the effective available resources, of the instance on the node.
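Blocks 502-512 can also be summarized as a single self-contained routine. The sketch below uses plain dictionaries in place of the real scheduler and hypervisor objects; run_flowchart_500 and its argument names are placeholders, not interfaces defined by the disclosure.

def run_flowchart_500(node_total, request):
    """One pass through blocks 502-512, using plain dicts in place of the
    real scheduler and hypervisor objects (names here are illustrative)."""
    node = {"total": dict(node_total), "used": {"vcpu": 0, "mem_gb": 0}}

    # Block 502: a request to schedule a first container arrives with a
    # desired first amount of resources.
    first_amount = request

    # Block 504: allocate the first amount of resources on the node.
    node["used"]["vcpu"] += first_amount["vcpu"]
    node["used"]["mem_gb"] += first_amount["mem_gb"]

    # Block 506: the hypervisor is informed of the first amount of resources.
    reported = dict(node["used"])

    # Block 508: the hypervisor determines the unused/remaining resources.
    unused = {k: node["total"][k] - reported[k] for k in node["total"]}

    # Block 510: the hypervisor reclaims the unused resources, e.g. to
    # back another virtual machine (the "second VM").
    second_vm = dict(unused)

    # Block 512: the effective capacity of the node instance is reduced
    # accordingly and communicated back to the container scheduler.
    node["effective"] = {k: node["total"][k] - second_vm[k] for k in node["total"]}
    return node


print(run_flowchart_500({"vcpu": 8, "mem_gb": 32}, {"vcpu": 4, "mem_gb": 12}))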
This technique is advantageous because it provides coordination and communication between the container scheduler in the cluster and the hypervisor on the host, so that the actual use of resources in the node can be dynamically allocated and scheduled for better resource management and hardware utilization. Further, to enable this coordination, the container scheduler instances in the node and the hypervisor may both be owned or controlled by the operator of the virtualized computing environment, such as a cloud service provider. The operator can thus monitor resource usage and allocate resources appropriately to improve utilization and oversubscription management.
Unless otherwise specified, the foregoing alternative examples are not mutually exclusive, but may be combined to achieve unique advantages. As these and other variations and combinations of the features discussed above can be utilized without departing from the subject matter defined by the claims, the foregoing description of the embodiments should be taken by way of illustration rather than by way of limitation of the subject matter defined by the claims. In addition, the provision of the examples described herein, as well as terms such as "for example" and "including," should not be construed as limiting the claimed subject matter to a particular example; rather, these examples are intended to illustrate only one of many possible embodiments. Furthermore, the same reference numbers in different drawings may identify the same or similar elements.

Claims (20)

1. A method of allocating resources in a computing system, comprising:
receiving, by the one or more processors, a first scheduling request to initiate a first container on a first virtual machine having a set of resources;
allocating, by the one or more processors, a first amount of resources from the set of resources to the first container on the first virtual machine in response to the first scheduling request;
notifying, by the one or more processors, a hypervisor in a host of the first amount of resources allocated to the first container;
allocating, by the one or more processors, a second amount of resources from the set of resources to a second virtual machine in the host;
determining, by the one or more processors, a reduction in resources available in the set of resources; and
notifying, by the one or more processors, the first virtual machine or a container scheduler of the reduction in the resources of the set of resources available on the first virtual machine.
2. The method as recited in claim 1, further comprising:
receiving, by the one or more processors, a second scheduling request to initiate a second container on the first virtual machine; and
allocating, by the one or more processors, a third amount of resources from the set of resources to the second container on the first virtual machine in response to the second scheduling request.
3. The method as recited in claim 2, further comprising:
notifying, by the one or more processors, the hypervisor of the third amount of resources from the set of resources allocated to the second container; and
allocating, by the one or more processors, a fourth amount of resources from the set of resources to a third virtual machine in the host.
4. A method according to claim 3, further comprising:
determining, by the one or more processors, a cumulative amount of resources used in the set of resources;
determining, by the one or more processors, whether the cumulative amount of resources occupies the entire amount of the set of resources on the host; and
notifying, by the one or more processors, the container scheduler or the first virtual machine when the entire amount of the set of resources is consumed.
5. The method of claim 1, wherein the first virtual machine is a virtual machine registered with a container scheduling system.
6. The method of claim 1, wherein the container scheduler and the hypervisor are both controlled by a cloud service provider.
7. The method of claim 1, wherein receiving the first scheduling request further comprises:
designating, by the one or more processors, an upper bound for the set of resources on the first virtual machine, and
notifying, by the one or more processors, the hypervisor of the upper bound of the set of resources on the first virtual machine.
8. The method of claim 1, wherein notifying the first virtual machine or the container scheduler of the reduction in resources further comprises:
notifying, by the hypervisor, the first virtual machine or the container scheduler of the reduced amount of the resources.
9. The method of claim 1, wherein allocating the first amount of resources further comprises:
allocating the resources using a balloon driver.
10. The method of claim 1, wherein receiving the first scheduling request for scheduling the first container further comprises:
checking, by the one or more processors, the workload consumed in the first container; and
maintaining, by the one or more processors, the workload below the requested first amount of resources.
11. The method of claim 1, wherein the container scheduler and the hypervisor are configured to communicate bi-directionally.
12. The method of claim 1, wherein the first container is started in a group of containers deployed in the first virtual machine.
13. A computing system for allocating resources, comprising:
one or more processors configured to:
receive a first scheduling request to initiate a first container on a first virtual machine having a set of resources;
allocate a first amount of resources from the set of resources to the first container on the first virtual machine in response to the first scheduling request;
notify a hypervisor in a host of the first amount of resources allocated to the first container;
allocate a second amount of resources from the set of resources to a second virtual machine in the host;
determine a reduction in resources available in the set of resources; and
notify the first virtual machine or a container scheduler of the reduction in the resources of the set of resources available on the first virtual machine.
14. The computing system of claim 13, wherein the one or more processors are further configured to:
receive a second scheduling request to initiate a second container on the first virtual machine; and
allocate a third amount of resources from the set of resources to the second container on the first virtual machine in response to the second scheduling request.
15. The computing system of claim 14, wherein the one or more processors are further configured to:
notify the hypervisor of the third amount of resources from the set of resources allocated to the second container; and
allocate a fourth amount of resources from the set of resources to a third virtual machine in the host.
16. The computing system of claim 15, wherein the one or more processors are further configured to:
determine a cumulative amount of resources used in the set of resources;
determine whether the cumulative amount of resources occupies the entire amount of the set of resources on the host; and
notify, by the hypervisor, the container scheduler when the entire amount of the set of resources is consumed.
17. The computing system of claim 13, wherein the container scheduler and the hypervisor are both controlled by a cloud service provider.
18. A method of allocating resources in a computing system, comprising:
coordinating, by the one or more processors, between a container scheduler and a hypervisor in a host; and
determining, by the one or more processors, an allocation of an amount of unused resources in the host.
19. The method of claim 18, wherein the container scheduler and the hypervisor are both controlled by a cloud service provider.
20. The method as recited in claim 18, further comprising:
after allocating the amount of unused resources, notifying, by the one or more processors, the container scheduler of the reduced amount of resources.
CN202180036233.5A 2020-11-23 2021-07-20 Coordinated container scheduling for improved resource allocation in virtual computing environments Pending CN117480494A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US17/101,714 US11740921B2 (en) 2020-11-23 2020-11-23 Coordinated container scheduling for improved resource allocation in virtual computing environment
US17/101,714 2020-11-23
PCT/US2021/042287 WO2022108631A1 (en) 2020-11-23 2021-07-20 Coordinated container scheduling for improved resource allocation in virtual computing environment

Publications (1)

Publication Number Publication Date
CN117480494A true CN117480494A (en) 2024-01-30

Family

ID=77301013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202180036233.5A Pending CN117480494A (en) 2020-11-23 2021-07-20 Coordinated container scheduling for improved resource allocation in virtual computing environments

Country Status (4)

Country Link
US (2) US11740921B2 (en)
EP (1) EP4248314A1 (en)
CN (1) CN117480494A (en)
WO (1) WO2022108631A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11797342B2 (en) * 2018-06-20 2023-10-24 Telefonaktiebolaget Lm Ericsson (Publ) Method and supporting node for supporting process scheduling in a cloud system
US20230401098A1 (en) * 2022-06-08 2023-12-14 International Business Machines Corporation Provisional resource scheduling in a cloud computing environment
JP2024002405A (en) * 2022-06-24 2024-01-11 富士通株式会社 Resource allocation program and resource allocation method
EP4307113A1 (en) * 2022-07-12 2024-01-17 Abb Schweiz Ag Method for automatically providing a time signal to containers or to virtual machines and system for executing software applications running in containers or virtual machines and related computer program product and computer-readable medium
CN115766473B (en) * 2022-10-28 2024-04-12 南方电网数字平台科技(广东)有限公司 Resource capacity planning method suitable for cloud platform operation

Family Cites Families (39)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7702843B1 (en) * 2006-04-27 2010-04-20 Vmware, Inc. Determining memory conditions in a virtual machine
US9535767B2 (en) * 2009-03-26 2017-01-03 Microsoft Technology Licensing, Llc Instantiating a virtual machine with a virtual non-uniform memory architecture
USRE48714E1 (en) * 2009-06-26 2021-08-31 Turbonomic, Inc. Managing application performance in virtualization systems
EP2449469B1 (en) * 2009-06-29 2019-04-03 Hewlett-Packard Enterprise Development LP Hypervisor-based management of local and remote virtual memory pages
US9183030B2 (en) * 2011-04-27 2015-11-10 Microsoft Technology Licensing, Llc Virtual processor allocation techniques
US8856784B2 (en) * 2011-06-14 2014-10-07 Vmware, Inc. Decentralized management of virtualized hosts
US20210349749A1 (en) * 2012-02-14 2021-11-11 Aloke Guha Systems and methods for dynamic provisioning of resources for virtualized
WO2013149343A1 (en) * 2012-04-03 2013-10-10 Gridcentric Inc. Method and system for memory oversubscription for virtual machines
US9304803B2 (en) * 2012-11-12 2016-04-05 Vmware, Inc. Cooperative application workload scheduling for a consolidated virtual environment
US9804798B2 (en) * 2012-12-14 2017-10-31 Vmware, Inc. Storing checkpoint file in high performance storage device for rapid virtual machine suspend and resume
US9430257B2 (en) * 2013-01-03 2016-08-30 Red Hat Israel, Inc. Scheduling virtual machines using user-defined rules
US9785460B2 (en) * 2013-05-03 2017-10-10 Vmware, Inc. Dynamic virtual machine sizing
US9817756B1 (en) * 2013-05-23 2017-11-14 Amazon Technologies, Inc. Managing memory in virtualized environments
US9779015B1 (en) * 2014-03-31 2017-10-03 Amazon Technologies, Inc. Oversubscribed storage extents with on-demand page allocation
US10481932B2 (en) * 2014-03-31 2019-11-19 Vmware, Inc. Auto-scaling virtual switches
US10298670B2 (en) * 2014-05-13 2019-05-21 Google Llc Real time cloud workload streaming
US9280392B1 (en) * 2014-10-02 2016-03-08 International Business Machines Corporation Resource substitution and reallocation in a virtual computing environment
US9953276B2 (en) * 2014-10-29 2018-04-24 Vmware, Inc. Method and system that measures and reports computational-resource usage in a data center
US9921885B2 (en) * 2015-06-19 2018-03-20 Vmware, Inc. Resource management for containers in a virtualized environment
US9766945B2 (en) * 2015-06-25 2017-09-19 Wmware, Inc. Virtual resource scheduling for containers with migration
US10298512B2 (en) * 2015-06-26 2019-05-21 Vmware, Inc. System and method for performing resource allocation for a host computer cluster
JP6374845B2 (en) 2015-08-07 2018-08-15 株式会社日立製作所 Computer system and container management method
US10572306B2 (en) * 2016-09-14 2020-02-25 Cloudera, Inc. Utilization-aware resource scheduling in a distributed computing cluster
US10884816B2 (en) * 2017-03-28 2021-01-05 International Business Machines Corporation Managing system resources in containers and virtual machines in a coexisting environment
CN109213428B (en) * 2017-06-30 2021-05-28 伊姆西Ip控股有限责任公司 Method and apparatus for managing storage system
US10795583B2 (en) * 2017-07-19 2020-10-06 Samsung Electronics Co., Ltd. Automatic data placement manager in multi-tier all-flash datacenter
WO2019024994A1 (en) * 2017-08-02 2019-02-07 Huawei Technologies Co., Ltd. System, method and computer program for virtual machine resource allocation
JP6969282B2 (en) * 2017-10-25 2021-11-24 富士通株式会社 Information processing equipment, information processing system and information processing method
US10831518B2 (en) * 2017-12-01 2020-11-10 International Business Machines Corporation Over-provisioning cloud resources using dependency maps and utilization triggers
KR102640232B1 (en) * 2018-06-11 2024-02-26 삼성전자주식회사 Method and apparatus for allocating resources in virtual environment
US11126461B2 (en) * 2018-07-12 2021-09-21 Vmware, Inc. Techniques for container scheduling in a virtual environment
CN109885377B (en) * 2018-11-23 2023-04-28 中国银联股份有限公司 Uniform resource scheduling coordinator, method for creating virtual machine and/or container by using uniform resource scheduling coordinator and uniform resource scheduling system
JP7081514B2 (en) * 2019-01-30 2022-06-07 日本電信電話株式会社 Autoscale type performance guarantee system and autoscale type performance guarantee method
US10915352B2 (en) * 2019-02-14 2021-02-09 Red Hat, Inc. Asserting initialization status of virtualized system
US11172358B2 (en) * 2019-04-30 2021-11-09 At&T Mobility Ii Llc Blockchain-based front-end orchestrator for user plane network functions of a 5G network
US11321141B2 (en) * 2019-09-20 2022-05-03 Dell Products L.P. Resource management for software containers using container profiles
US10970127B1 (en) * 2020-02-11 2021-04-06 Verizon Patent And Licensing Inc. Systems and methods for virtual machine resource optimization using machine learning techniques
US11403141B2 (en) * 2020-05-04 2022-08-02 Microsoft Technology Licensing, Llc Harvesting unused resources in a distributed computing system
US11194483B1 (en) * 2020-06-05 2021-12-07 Vmware, Inc. Enriching a storage provider with container orchestrator metadata in a virtualized computing system

Also Published As

Publication number Publication date
US11740921B2 (en) 2023-08-29
US20220164208A1 (en) 2022-05-26
WO2022108631A1 (en) 2022-05-27
US20230393879A1 (en) 2023-12-07
EP4248314A1 (en) 2023-09-27

Similar Documents

Publication Publication Date Title
US9977689B2 (en) Dynamic scaling of management infrastructure in virtual environments
US10659318B2 (en) Methods and apparatus related to management of unit-based virtual resources within a data center environment
Gu et al. Efficient memory disaggregation with infiniswap
US11740921B2 (en) Coordinated container scheduling for improved resource allocation in virtual computing environment
US10292044B2 (en) Apparatus for end-user transparent utilization of computational, storage, and network capacity of mobile devices, and associated methods
US9183016B2 (en) Adaptive task scheduling of Hadoop in a virtualized environment
US9304803B2 (en) Cooperative application workload scheduling for a consolidated virtual environment
JP5510556B2 (en) Method and system for managing virtual machine storage space and physical hosts
US10108460B2 (en) Method and system for integrated deployment planning for virtual appliances
CN110888743B (en) GPU resource using method, device and storage medium
CN107003713A (en) Event-driven re-optimization for the logical partition environment of electrical management
US9755986B1 (en) Techniques for tightly-integrating an enterprise storage array into a distributed virtualized computing environment
CN108804217A (en) A kind of resource scheduling device, resource scheduling system and resource regulating method
CN103034526A (en) Realization method and device of virtualized service
CN114996003A (en) Cloud service deployment method and device, electronic equipment and storage medium
CN116075809A (en) Automatic node exchange between compute nodes and infrastructure nodes in edge regions
US11809911B2 (en) Resuming workload execution in composed information handling system
CN112527451A (en) Management method, device, equipment and storage medium of container resource pool
Christodoulopoulos et al. Commodore: Fail safe container scheduling in Kubernetes
US20240160487A1 (en) Flexible gpu resource scheduling method in large-scale container operation environment
CN113760798A (en) RDMA device allocation method, computing device and storage medium
CN116932143A (en) Virtual machine quantity adjusting method and device, computer equipment and storage medium
CN118051341A (en) Computing power resource scheduling method, computing power resource scheduling device, terminal equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination