CN111796932A - GPU resource scheduling method

GPU resource scheduling method

Info

Publication number
CN111796932A
CN111796932A
Authority
CN
China
Prior art keywords
gpu
application
scheduling
gpus
cluster
Prior art date
Legal status
Pending
Application number
CN202010576793.7A
Other languages
Chinese (zh)
Inventor
徐山川
王滨
王臣汉
Current Assignee
Beijing Computing Tianjin Information Technology Co ltd
Original Assignee
Beijing Computing Tianjin Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Computing Tianjin Information Technology Co ltd filed Critical Beijing Computing Tianjin Information Technology Co ltd
Priority to CN202010576793.7A priority Critical patent/CN111796932A/en
Publication of CN111796932A publication Critical patent/CN111796932A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention relates to the technical field of communication applications and discloses a GPU resource scheduling method comprising the following steps: S1, collect basic information about the GPUs in the cluster and provide a GPU-usages interface, then proceed to step S2; S2, create a GPU application and send an application request to the Kubernetes scheduler, then proceed to step S3; S3, after receiving the application request, the Kubernetes scheduler traverses all GPU applications in the cluster, then proceeds to step S4; S4, through the GPU-usages interface, compute a GPU that satisfies the application's scheduling requirement, then proceed to step S5; S5, the GPU manager binds the designated GPU resources to the application according to the machine, recorded on the application, where the GPU is located. The method allows a single GPU to be shared among multiple applications according to percentages of GPU video memory and GPU computing power, which greatly improves the utilization efficiency of a single GPU and reduces the cost of GPU applications.

Description

GPU resource scheduling method
Technical Field
The present invention relates to the technical field of communication applications, and in particular to a GPU resource scheduling method.
Background
With the explosive growth of device performance and the gradual popularization of virtualization technology, dynamically allocating and flexibly scheduling the resources of many virtualized devices on existing physical hardware, and thereby improving resource utilization, has become an urgent need in users' daily work.
Managing enterprise server clusters with Kubernetes greatly reduces an enterprise's operation and maintenance costs and improves resource utilization, but Kubernetes currently manages mainly CPU, memory, storage, and similar hardware on each machine. Because more and more enterprises now use GPUs for machine-learning model training and online services, efficient management of GPU resources is increasingly important.
The prior art has the following defect: resources are allocated in units of whole physical GPU cards, so GPU resources cannot be shared among multiple applications. Even if a single application does not fully use the computing resources allocated to it, those exclusively held resources cannot be given to other applications, and the GPU resources therefore cannot be fully utilized.
Disclosure of Invention
The main object of the present invention is to provide a GPU resource scheduling method that solves the current problem that GPU resources exclusively allocated to a single application cannot be fully utilized.
To achieve the above object, the present invention provides the following technical solution:
A GPU resource scheduling method comprises the following steps:
S1, collect basic information about the GPUs in the cluster and provide a GPU-usages interface, then proceed to step S2;
S2, create a GPU application and send an application request to the Kubernetes scheduler, then proceed to step S3;
S3, after receiving the application request, the Kubernetes scheduler traverses all GPU applications in the cluster, then proceeds to step S4;
S4, through the GPU-usages interface, compute a GPU that satisfies the application's scheduling requirement, then proceed to step S5;
S5, the GPU manager binds the designated GPU resources to the application according to the machine, recorded on the application, where the GPU is located.
Further, in step S2, when the GPU application is created, the application provides the required video memory value and computing power value.
Further, in step S1, the collected basic GPU information includes the GPU model, video memory, and GPU cores.
Further, in step S4, if no GPU in the cluster meets the application's scheduling requirement, the process proceeds to step S6, isolation of GPU resources.
Further, S6 comprises steps S60 and S61. S60: if the video memory required by the application exceeds the preset value or is greater than the video memory of every GPU in the cluster, return a video-memory allocation failure. S61: wrap the execution threads and periodically check the program's core utilization of the GPU; if it exceeds the agreed core-utilization value, move the currently executing threads into the waiting state.
Further, in step S2, when the GPU application is created, the GPU model and the number of GPUs required by the application should also be provided.
Further, in step S4, the first GPU that meets the requirement is taken, and the name of the machine where that GPU is located and the GPU's number within the machine are marked on the application.
Further, in step S4, the machines having the required number of idle GPUs are found through the GPU-usages interface, the machine with the fewest idle GPUs among them is selected, and its name is added to the application.
Further, in step S5, the GPU manager allocates GPUs to the application by exhaustive search, completing the scheduling and binding of GPU resources.
Further, the method completes GPU resource scheduling for one GPU application or for multiple GPU applications.
Compared with the prior art, the invention provides the following technical effects:
1. A single GPU can be shared among multiple applications according to percentages of GPU video memory and GPU computing power, which greatly improves the utilization efficiency of a single GPU and reduces the cost of GPU applications.
2. Considering the topology between GPUs during scheduling maximizes the communication efficiency between the GPUs assigned to the same application and improves the application's performance on those GPUs.
3. When GPU applications are scheduled in a Kubernetes cluster, centralized resource allocation is supported: GPU applications are packed onto machines that already host more of them, which ensures that subsequent multi-card GPU applications can still be scheduled successfully into the cluster.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention and to enable other features, objects and advantages of the invention to be more fully apparent. The drawings and their description illustrate the invention by way of example and are not intended to limit the invention. In the drawings:
FIG. 1 is a general flow chart of a GPU resource scheduling method of the present invention;
FIG. 2 is a flow diagram comparing the prior-art default scheduling policy with the single-physical-GPU sharing of the present invention;
FIG. 3 is a schematic diagram of the topology of DGX1 in an embodiment of the invention;
FIG. 4 compares prior-art multi-GPU allocation, which ignores the topology among GPUs, with the topology-aware allocation of the present invention;
FIG. 5 is a flowchart comparing multi-GPU application scheduling under the prior-art default uniform scheduling policy and under the centralized scheduling policy of the present invention;
FIG. 6 is a diagram illustrating an example of a topology of a GPU in an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description, claims, and drawings of the present invention are used to distinguish between similar elements and not necessarily to describe a particular sequential or chronological order. It is to be understood that data so used may be interchanged under appropriate circumstances, so that the embodiments of the invention described herein can be practiced in orders other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such a process, method, article, or apparatus.
In the present invention, the terms "upper", "lower", "left", "right", "front", "rear", "top", "bottom", "inner", "outer", "center", "vertical", "horizontal", "lateral", "longitudinal", and the like indicate an orientation or positional relationship based on the orientation or positional relationship shown in the drawings. These terms are used primarily to better describe the invention and its embodiments and are not intended to limit the indicated devices, elements or components to a particular orientation or to be constructed and operated in a particular orientation.
Moreover, some of the above terms may be used to indicate other meanings besides the orientation or positional relationship, for example, the term "on" may also be used to indicate some kind of attachment or connection relationship in some cases. The specific meanings of these terms in the present invention can be understood by those skilled in the art as appropriate.
In addition, the term "plurality" means two or more.
It should be noted that the embodiments and features of the embodiments may be combined with each other without conflict. The present invention will be described in detail below with reference to the embodiments with reference to the attached drawings.
Example 1
As shown in fig. 1 and 2, for an application that needs only one GPU, the method supports allocating resources according to the required GPU video memory and number of cores, instead of allocating a complete GPU to the application. The default GPU resource manager cannot allocate according to the resources an application actually needs; it simply locks an entire GPU and assigns it to the requesting application.
A GPU resource scheduling method comprises the following steps:
S1, collect basic information about the GPUs in the cluster and provide a GPU-usages interface, then proceed to step S2. In step S1, the collected basic GPU information includes the GPU model, video memory, and GPU cores, which makes it convenient for the scheduler to obtain cluster GPU resource information.
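By way of illustration, step S1 might be realized on each node roughly as follows. This is a minimal sketch that assumes the nvidia-ml-py (pynvml) bindings; the GPUInfo record, its fields, and the gpu_usages helper are illustrative names rather than part of the method itself.

```python
# Sketch of step S1: collect per-GPU basic information on a node and
# expose it as "GPU-usages" records for the scheduler to query.
from dataclasses import dataclass, asdict

import pynvml  # nvidia-ml-py bindings (assumed available on GPU nodes)


@dataclass
class GPUInfo:
    node: str             # name of the machine the GPU lives on
    index: int            # GPU number within the machine
    model: str            # e.g. "Tesla T4"
    memory_total_mb: int  # total video memory
    memory_free_mb: int   # free video memory
    core_util_pct: int    # current core utilization, 0-100


def collect_gpu_info(node_name: str) -> list:
    """Gather model, video memory, and core usage for every local GPU."""
    pynvml.nvmlInit()
    try:
        gpus = []
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older pynvml versions return bytes
                name = name.decode()
            gpus.append(GPUInfo(
                node=node_name,
                index=i,
                model=name,
                memory_total_mb=mem.total // (1024 * 1024),
                memory_free_mb=mem.free // (1024 * 1024),
                core_util_pct=util.gpu,
            ))
        return gpus
    finally:
        pynvml.nvmlShutdown()


def gpu_usages(node_name: str) -> list:
    """The view a GPU-usages interface serves: plain dict records."""
    return [asdict(g) for g in collect_gpu_info(node_name)]
```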
S2, create a GPU application and send an application request to the Kubernetes scheduler, then proceed to step S3. In step S2, when the GPU application is created, the application provides the required video memory value and computing power value. Because the number of cores varies greatly between GPU models and is not known to application developers, the core requirement is expressed directly as a percentage of the cores. For example, a GPU application might request from the cluster: a GPU of model T4, with 4 GB of video memory and 25% of the cores.
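The request an application submits in step S2 might be recorded as follows; the field names are assumptions for illustration, and the multi-GPU form described in the later embodiments is included for completeness.

```python
# Sketch of the application's resource request from step S2.
from dataclasses import dataclass
from typing import Optional


@dataclass
class GPURequest:
    model: str                       # required GPU model, e.g. "T4"
    memory_mb: Optional[int] = None  # shared-GPU case: video memory needed
    core_pct: Optional[int] = None   # shared-GPU case: percentage of cores
    count: int = 1                   # multi-GPU case: whole GPUs needed


# The single-GPU example from the text: a T4, 4 GB memory, 25% of cores.
shared_req = GPURequest(model="T4", memory_mb=4096, core_pct=25)

# The multi-GPU example from the later embodiments: two whole T4 GPUs.
multi_req = GPURequest(model="T4", count=2)
```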
S3, after receiving the application request, the Kubernetes scheduler traverses all GPU applications in the cluster, then proceeds to step S4.
S4, through the GPU-usages interface, compute a GPU that satisfies the application's scheduling requirement, then proceed to step S5. In step S4, if no GPU in the cluster meets the application's scheduling requirement, the process proceeds to step S6, where GPU resources are isolated. Otherwise the first GPU that meets the requirement is taken, and the name of the machine where it is located and its number within that machine are marked on the application.
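The matching performed in step S4 might then look roughly like the following, assuming the GPU-usages records and GPURequest sketched above; the exact matching rules are inferred from the text.

```python
from typing import Optional


def find_gpu(usages: list, req: GPURequest) -> Optional[dict]:
    """Return the first GPU record that satisfies the request, or None."""
    for gpu in usages:
        if req.model not in gpu["model"]:
            continue  # wrong model
        if req.memory_mb is not None and gpu["memory_free_mb"] < req.memory_mb:
            continue  # not enough free video memory
        if req.core_pct is not None and (100 - gpu["core_util_pct"]) < req.core_pct:
            continue  # not enough idle core share
        # Mark the application with the machine name and GPU number,
        # as the method does with the first match.
        return {"node": gpu["node"], "gpu_index": gpu["index"]}
    return None  # no GPU satisfies the request: proceed to step S6
```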
S5, the GPU manager binds the designated GPU resources to the application according to the machine, recorded on the application, where the GPU is located.
Further, S6 comprises steps S60 and S61. S60: if the video memory required by the application exceeds the preset value or is greater than the video memory of every GPU in the cluster, return a video-memory allocation failure. S61: wrap the execution threads and periodically check the program's core utilization of the GPU; if it exceeds the agreed core-utilization value, move the currently executing threads into the waiting state. After the shared scheduling of a GPU is completed, the GPU manager allocates GPU video memory and GPU cores according to the GPU application; without a corresponding resource isolation mechanism, however, there is no guarantee that an application will not use more than its agreed GPU resources and prevent other applications from working normally.
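A minimal sketch of the isolation mechanism of step S6 follows, under the assumption that worker threads cooperate by waiting on a gate before launching GPU work; the thresholds, the sampling loop, and the gating scheme are all illustrative.

```python
# Sketch of step S6. S60 rejects impossible memory requests up front;
# S61 periodically samples the program's GPU core usage and parks
# executing threads that exceed the agreed share.
import threading
import time


def check_memory(req_mb: int, preset_limit_mb: int, cluster_max_mb: int) -> bool:
    """S60: fail fast if the requested video memory can never be granted."""
    return req_mb <= preset_limit_mb and req_mb <= cluster_max_mb


class CoreGovernor:
    """S61: wrap execution threads and throttle when usage exceeds quota."""

    def __init__(self, quota_pct: int, sample_core_util, period_s: float = 1.0):
        self.quota_pct = quota_pct
        self.sample_core_util = sample_core_util  # callable returning 0-100
        self.period_s = period_s
        self.gate = threading.Event()
        self.gate.set()  # gate open: threads may run

    def wait_turn(self) -> None:
        """Called by worker threads before each unit of GPU work."""
        self.gate.wait()

    def run(self) -> None:
        """Periodic check: over quota -> move running work to waiting."""
        while True:
            if self.sample_core_util() > self.quota_pct:
                self.gate.clear()  # executing threads become waiting threads
            else:
                self.gate.set()    # allow execution again
            time.sleep(self.period_s)
```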
Further, the method can schedule GPU resources for one GPU application or for multiple GPU applications.
Example 2
As shown in figs. 1, 3, 4, 5 and 6, for applications that require multiple GPUs, allocation is performed according to the GPU group with the highest communication efficiency. GPUs within a machine are connected in different ways, and the communication speed between GPUs differs accordingly. As shown in fig. 3, a DGX-1 machine contains 8 GPUs; GPU0 can connect directly to GPU1, GPU2, GPU3, and GPU4 over NVLink, with a communication bandwidth of up to 40 GB/s, whereas connections between GPU0 and GPUs 5, 6, and 7 must pass through a PCIe switch and QPI, which is far less efficient than NVLink. When multiple GPUs are allocated to one application, the connection structure between them, also called the GPU topology, should therefore be considered. The topology between GPUs can be obtained from the GPU driver, and the communication efficiency between GPUs follows from that topology. An example of a GPU topology is shown in fig. 6.
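Such a topology table might be encoded as follows; the relative link scores are assumptions standing in for what the driver reports (for example, via nvidia-smi topo -m), with NVLink ranked above a PCIe switch and QPI.

```python
# Toy encoding of a GPU topology like the one in fig. 3 / fig. 6.
NVLINK, PCIE, QPI = 3, 2, 1  # illustrative relative link scores

# Pairwise link quality for one 8-GPU machine; the entries shown follow
# the DGX-1 description above, and missing pairs default to QPI.
TOPOLOGY = {
    frozenset({0, 1}): NVLINK,
    frozenset({0, 2}): NVLINK,
    frozenset({0, 3}): NVLINK,
    frozenset({0, 4}): NVLINK,
    frozenset({0, 5}): QPI,
    frozenset({0, 6}): QPI,
    frozenset({0, 7}): QPI,
    # ... remaining pairs filled in from the driver's topology report
}


def pair_score(a: int, b: int) -> int:
    """Communication-efficiency score for a pair of GPUs."""
    return TOPOLOGY.get(frozenset({a, b}), QPI)
```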
The method also supports a centralized placement scheme for GPU applications. The default Kubernetes resource scheduling mode spreads resources evenly: for a given cluster, deployed applications are distributed across the nodes as uniformly as possible, which maximizes availability, since a problem on one machine does not affect the applications on the others. For multi-GPU applications, however, this scheme can leave GPU resources unusable. As shown in fig. 5, path a adopts the default uniform scheduling policy: GPU usage is spread evenly, and a newly arriving multi-GPU request cannot be scheduled. Path b adopts the centralized scheduling policy: GPU applications are packed onto already busy machines as far as possible, and the same multi-GPU request can still be scheduled.
The following is the complete process of one deployment of a multi-GPU application:
S1, collect basic information about the GPUs in the cluster and provide a GPU-usages interface, then proceed to step S2. In step S1, the collected basic GPU information includes the GPU model, video memory, and GPU cores, which makes it convenient for the scheduler to obtain cluster GPU resource information.
S2, create a GPU application and send an application request to the Kubernetes scheduler, then proceed to step S3. In step S2, when the GPU application is created, the application provides the required video memory value and computing power value. Because the number of cores varies greatly between GPU models and is not known to application developers, the core requirement is expressed directly as a percentage of the cores. For example, a GPU application might request from the cluster: a GPU of model T4, with 4 GB of video memory and 25% of the cores. When the GPU application is created, the GPU model and the number of GPUs required by the application should also be provided. If the application is a multi-GPU application, only the GPU model and the number of GPUs need to be provided; for example, a GPU application might request from the cluster: model T4, 2 GPUs.
S3, after receiving the application request, the Kubernetes scheduler traverses all GPU applications in the cluster, then proceeds to step S4.
S4, through the GPU-usages interface, compute a GPU that satisfies the application's scheduling requirement, then proceed to step S5. In step S4, if no GPU in the cluster meets the application's scheduling requirement, the process proceeds to step S6, where GPU resources are isolated. Otherwise the first GPU that meets the requirement is taken, and the name of the machine where it is located and its number within that machine are marked on the application.
The machines having the required number of idle GPUs are found through the GPU-usages interface, the machine with the fewest idle GPUs among them is selected, and its name is added to the application. For a multi-GPU application, the machines with the required number of idle GPUs must be found through the GPU-usages interface, and the machine with the fewest idle GPUs among them is selected and its name added to the application. For example, if the application requires two T4 GPUs and this step finds 3 idle T4 GPUs on machine 1 and 4 on machine 2, machine 1 is selected as the scheduling machine for the application, and its information is added to the application.
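The machine choice in this step might be sketched as follows; the data shape (idle-GPU counts per machine) is an assumption.

```python
from typing import Optional


def pick_machine(idle_by_machine: dict, needed: int) -> Optional[str]:
    """Centralized placement: among machines with enough idle GPUs of the
    requested model, pick the one with the fewest idle GPUs, keeping the
    larger free blocks for later multi-card applications."""
    candidates = {m: n for m, n in idle_by_machine.items() if n >= needed}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# The example from the text: machine 1 has 3 idle T4s, machine 2 has 4;
# a request for two T4s is placed on machine 1.
assert pick_machine({"machine-1": 3, "machine-2": 4}, needed=2) == "machine-1"
```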
S5, the GPU manager binds the designated GPU resources to the application according to the machine, recorded on the application, where the GPU is located. In step S5, the GPU manager allocates GPUs to the application by exhaustive search, completing the scheduling and binding of GPU resources.
Within the machine assigned to the application, the GPU manager uses exhaustive search to find the group of GPUs with the highest connection efficiency and allocates that group to the application, completing the scheduling and binding of GPU resources. For example, for an application requiring two V100 GPUs that has been allocated to a DGX-1 machine on which GPUs 0, 1, and 7 are idle, the combinations (GPU0, GPU1), (GPU0, GPU7), and (GPU1, GPU7) are enumerated, and (GPU0, GPU1) is selected as the finally bound pair.
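The exhaustive search might be sketched as follows, reusing pair_score from the topology sketch above; scoring a group by the sum of its pairwise link scores is an assumption about how "highest connection efficiency" is measured.

```python
from itertools import combinations


def best_group(idle: list, needed: int) -> tuple:
    """Enumerate every group of `needed` idle GPUs and keep the group
    whose pairwise links score highest."""
    def group_score(group) -> int:
        return sum(pair_score(a, b) for a, b in combinations(group, 2))

    return max(combinations(idle, needed), key=group_score)


# The text's example: GPUs 0, 1, and 7 idle, two needed; (0, 1) are
# NVLink peers, so (0, 1) beats (0, 7) and (1, 7).
print(best_group([0, 1, 7], 2))  # -> (0, 1)
```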
Further, S6 comprises steps S60 and S61. S60: if the video memory required by the application exceeds the preset value or is greater than the video memory of every GPU in the cluster, return a video-memory allocation failure. S61: wrap the execution threads and periodically check the program's core utilization of the GPU; if it exceeds the agreed core-utilization value, move the currently executing threads into the waiting state. After the shared scheduling of a GPU is completed, the GPU manager allocates GPU video memory and GPU cores according to the GPU application; without a corresponding resource isolation mechanism, however, there is no guarantee that an application will not use more than its agreed GPU resources and prevent other applications from working normally.
Further, the method completes GPU resource scheduling for one GPU application or for multiple GPU applications.
Example 3
As shown in figs. 1, 2, 3, 4, 5 and 6, for applications that require multiple GPUs, allocation is performed according to the GPU group with the highest communication efficiency. GPUs within a machine are connected in different ways, and the communication speed between GPUs differs accordingly. As shown in fig. 3, a DGX-1 machine contains 8 GPUs; GPU0 can connect directly to GPU1, GPU2, GPU3, and GPU4 over NVLink, with a communication bandwidth of up to 40 GB/s, whereas connections between GPU0 and GPUs 5, 6, and 7 must pass through a PCIe switch and QPI, which is far less efficient than NVLink. When multiple GPUs are allocated to one application, the connection structure between them, also called the GPU topology, should therefore be considered. The topology between GPUs can be obtained from the GPU driver, and the communication efficiency between GPUs follows from that topology. An example of a GPU topology is shown in fig. 6.
The method also supports a centralized placement scheme for GPU applications. The default Kubernetes resource scheduling mode spreads resources evenly: for a given cluster, deployed applications are distributed across the nodes as uniformly as possible, which maximizes availability, since a problem on one machine does not affect the applications on the others. For multi-GPU applications, however, this scheme can leave GPU resources unusable. As shown in fig. 5, path a adopts the default uniform scheduling policy: GPU usage is spread evenly, and a newly arriving multi-GPU request cannot be scheduled. Path b adopts the centralized scheduling policy: GPU applications are packed onto already busy machines as far as possible, and the same multi-GPU request can still be scheduled.
The following is the complete process of one deployment of a multi-GPU application:
S1, collect basic information about the GPUs in the cluster and provide a GPU-usages interface, then proceed to step S2. In step S1, the collected basic GPU information includes the GPU model, video memory, and GPU cores, which makes it convenient for the scheduler to obtain cluster GPU resource information.
S2, create a GPU application and send an application request to the Kubernetes scheduler, then proceed to step S3. In step S2, when the GPU application is created, the application provides the required video memory value and computing power value. Because the number of cores varies greatly between GPU models and is not known to application developers, the core requirement is expressed directly as a percentage of the cores. For example, a GPU application might request from the cluster: a GPU of model T4, with 4 GB of video memory and 25% of the cores. When the GPU application is created, the GPU model and the number of GPUs required by the application should also be provided. If the application is a multi-GPU application, only the GPU model and the number of GPUs need to be provided; for example, a GPU application might request from the cluster: model T4, 2 GPUs.
S3, after receiving the application request, the Kubernetes scheduler traverses all GPU applications in the cluster, then proceeds to step S4.
S4, through the GPU-usages interface, compute a GPU that satisfies the application's scheduling requirement, then proceed to step S5. In step S4, if no GPU in the cluster meets the application's scheduling requirement, the process proceeds to step S6, where GPU resources are isolated. Otherwise the first GPU that meets the requirement is taken, and the name of the machine where it is located and its number within that machine are marked on the application.
The machines having the required number of idle GPUs are found through the GPU-usages interface, the machine with the fewest idle GPUs among them is selected, and its name is added to the application. For a multi-GPU application, the machines with the required number of idle GPUs must be found through the GPU-usages interface, and the machine with the fewest idle GPUs among them is selected and its name added to the application. For example, if the application requires two T4 GPUs and this step finds 3 idle T4 GPUs on machine 1 and 4 on machine 2, machine 1 is selected as the scheduling machine for the application, and its information is added to the application.
S5, the GPU manager binds the designated GPU resources to the application according to the machine, recorded on the application, where the GPU is located. In step S5, the GPU manager allocates GPUs to the application by exhaustive search, completing the scheduling and binding of GPU resources.
Within the machine assigned to the application, the GPU manager uses exhaustive search to find the group of GPUs with the highest connection efficiency and allocates that group to the application, completing the scheduling and binding of GPU resources. For example, for an application requiring three V100 GPUs that has been allocated to a DGX-1 machine on which GPU0, GPU1, GPU3, GPU5, and GPU7 are idle, the combinations (GPU0, GPU1, GPU3), (GPU1, GPU3, GPU5), (GPU3, GPU5, GPU7), (GPU0, GPU1, GPU5), (GPU0, GPU1, GPU7), (GPU1, GPU3, GPU7), and so on are enumerated, and (GPU0, GPU1, GPU3) is selected as the finally bound group.
Further, S6 comprises steps S60 and S61. S60: if the video memory required by the application exceeds the preset value or is greater than the video memory of every GPU in the cluster, return a video-memory allocation failure. S61: wrap the execution threads and periodically check the program's core utilization of the GPU; if it exceeds the agreed core-utilization value, move the currently executing threads into the waiting state. After the shared scheduling of a GPU is completed, the GPU manager allocates GPU video memory and GPU cores according to the GPU application; without a corresponding resource isolation mechanism, however, there is no guarantee that an application will not use more than its agreed GPU resources and prevent other applications from working normally.
Further, the method completes GPU resource scheduling for one GPU application or for multiple GPU applications.
Compared with the prior art, the invention provides the following technical effects:
1. A single GPU can be shared among multiple applications according to percentages of GPU video memory and GPU computing power, which greatly improves the utilization efficiency of a single GPU and reduces the cost of GPU applications.
2. Considering the topology between GPUs during scheduling maximizes the communication efficiency between the GPUs assigned to the same application and improves the application's performance on those GPUs.
3. When GPU applications are scheduled in a Kubernetes cluster, centralized resource allocation is supported: GPU applications are packed onto machines that already host more of them, which ensures that subsequent multi-card GPU applications can still be scheduled successfully into the cluster.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention; those skilled in the art may make various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A GPU resource scheduling method is characterized by comprising the following steps:
S1, collecting basic information about the GPUs in the cluster and providing a GPU-usages interface, then proceeding to step S2;
S2, creating a GPU application and sending an application request to the Kubernetes scheduler, then proceeding to step S3;
S3, after receiving the application request, the Kubernetes scheduler traversing all GPU applications in the cluster, then proceeding to step S4;
S4, through the GPU-usages interface, computing a GPU that satisfies the application's scheduling requirement, then proceeding to step S5;
S5, the GPU manager binding the designated GPU resources to the application according to the machine, recorded on the application, where the GPU is located.
2. The GPU resource scheduling method as claimed in claim 1, wherein in step S2, when the GPU application is created, the application provides the required video memory value and computing power value.
3. The GPU resource scheduling method as claimed in claim 1 or 2, wherein in step S1, the collected basic GPU information includes the GPU model, video memory, and GPU cores.
4. The GPU resource scheduling method as claimed in claim 3, wherein in step S4, if no GPU in the cluster meets the application's scheduling requirement, the method proceeds to step S6, isolation of GPU resources.
5. The GPU resource scheduling method as claimed in claim 4, wherein S6 comprises steps S60 and S61: S60, if the video memory required by the application exceeds the preset value or is greater than the video memory of every GPU in the cluster, returning a video-memory allocation failure; S61, wrapping the execution threads and periodically checking the program's core utilization of the GPU, and if it exceeds the agreed core-utilization value, moving the currently executing threads into the waiting state.
6. The GPU resource scheduling method as claimed in claim 1, 2, 4 or 5, wherein in step S2, the GPU model and the number of GPUs required by the GPU application should be provided when the GPU application is created.
7. The GPU resource scheduling method as claimed in claim 6, wherein in step S4, the first GPU that meets the requirement is taken, and the name of the machine where the GPU is located and the GPU's number within the machine are marked on the application.
8. The GPU resource scheduling method as claimed in claim 1, 2, 4, 5 or 7, wherein in step S4, the machines having the required number of idle GPUs are found through the GPU-usages interface, and the machine with the fewest idle GPUs among them is selected and its name added to the application.
9. The GPU resource scheduling method as claimed in claim 1, 2, 4, 5 or 7, wherein in step S5, the GPU manager allocates GPUs to the application by exhaustive search, completing the scheduling and binding of GPU resources.
10. A method for scheduling GPU resources as claimed in any of claims 1 to 9, wherein the method performs scheduling of GPU resources for a GPU application or a plurality of GPU applications.
CN202010576793.7A 2020-06-22 2020-06-22 GPU resource scheduling method Pending CN111796932A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010576793.7A CN111796932A (en) 2020-06-22 2020-06-22 GPU resource scheduling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010576793.7A CN111796932A (en) 2020-06-22 2020-06-22 GPU resource scheduling method

Publications (1)

Publication Number Publication Date
CN111796932A true CN111796932A (en) 2020-10-20

Family

ID=72803890

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010576793.7A Pending CN111796932A (en) 2020-06-22 2020-06-22 GPU resource scheduling method

Country Status (1)

Country Link
CN (1) CN111796932A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109033001A (en) * 2018-07-17 2018-12-18 北京百度网讯科技有限公司 Method and apparatus for distributing GPU
CN110688218A (en) * 2019-09-05 2020-01-14 广东浪潮大数据研究有限公司 Resource scheduling method and device
CN111158879A (en) * 2019-12-31 2020-05-15 上海依图网络科技有限公司 System resource scheduling method, device, machine readable medium and system
CN111190718A (en) * 2020-01-07 2020-05-22 第四范式(北京)技术有限公司 Method, device and system for realizing task scheduling

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12028878B2 (en) 2020-11-12 2024-07-02 Samsung Electronics Co., Ltd. Method and apparatus for allocating GPU to software package
CN114500413A (en) * 2021-12-17 2022-05-13 阿里巴巴(中国)有限公司 Equipment connection method and device and equipment connection chip
CN114500413B (en) * 2021-12-17 2024-04-16 阿里巴巴(中国)有限公司 Device connection method and device, and device connection chip

Similar Documents

Publication Publication Date Title
CN107038069B (en) Dynamic label matching DLMS scheduling method under Hadoop platform
CN103870314B (en) Method and system for simultaneously operating different types of virtual machines by single node
CN102387173B (en) MapReduce system and method and device for scheduling tasks thereof
CN107222531B (en) Container cloud resource scheduling method
CN103534687B (en) Extensible centralized dynamic resource distribution in a clustered data grid
CN104881325A (en) Resource scheduling method and resource scheduling system
CN114356543A (en) Kubernetes-based multi-tenant machine learning task resource scheduling method
CN110221920B (en) Deployment method, device, storage medium and system
CN113946431B (en) Resource scheduling method, system, medium and computing device
CN110990154B (en) Big data application optimization method, device and storage medium
CN109783225B (en) Tenant priority management method and system of multi-tenant big data platform
CN104050043A (en) Share cache perception-based virtual machine scheduling method and device
CN114996018A (en) Resource scheduling method, node, system, device and medium for heterogeneous computing
CN114443263A (en) Video memory management method, device, equipment and system
CN106874115A (en) A kind of resources of virtual machine distribution method and distributed virtual machine resource scheduling system
JP2022539955A (en) Task scheduling method and apparatus
CN115134371A (en) Scheduling method, system, equipment and medium containing edge network computing resources
CN109471725A (en) Resource allocation methods, device and server
CN111796932A (en) GPU resource scheduling method
CN107992351B (en) Hardware resource allocation method and device and electronic equipment
CN114721818A (en) Kubernetes cluster-based GPU time-sharing method and system
JP2023543744A (en) Resource scheduling method, system, electronic device and computer readable storage medium
CN117608760A (en) Cloud application hybrid deployment method applied to Kubernetes
CN105187483B (en) Distribute the method and device of cloud computing resources
WO2017133421A1 (en) Method and device for sharing resources among multiple tenants

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination