CN117632447A - GPU resource usage method, GPU virtualization method, job scheduling device and cluster - Google Patents

GPU resource usage method, GPU virtualization method, job scheduling device and cluster

Info

Publication number
CN117632447A
CN117632447A CN202210950598.5A CN202210950598A
Authority
CN
China
Prior art keywords
gpu
virtual
video memory
memory
application
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210950598.5A
Other languages
Chinese (zh)
Inventor
李孟轩
张冠一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 4Paradigm Beijing Technology Co Ltd
Priority to CN202210950598.5A
Priority to PCT/CN2023/111673
Publication of CN117632447A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Stored Programmes (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

The disclosure relates to a GPU resource usage method, a GPU virtualization method, a job scheduling device and a cluster. A GPU is split into multiple virtual GPUs. For at least one virtual GPU, at least part of the host memory is allocated to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is larger than its on-board video memory. A video memory application request issued by an application or task for the virtual GPU is replaced with a video memory application request based on a unified address space, so that when the currently available on-board video memory of the virtual GPU is insufficient, at least part of the data in the on-board video memory can be swapped to the video memory swap area based on the unified address space. The swap area thus serves as virtual video memory and increases the available video memory of the virtual GPU, solving the problem that the computing power utilization and the video memory utilization of the GPU cannot be balanced at the same time; meanwhile, because the application's or task's video memory application request for the virtual GPU is replaced with a request based on the unified address space, the application or task can use the virtual video memory transparently, without any awareness of it.

Description

GPU resource usage method, GPU virtualization method, job scheduling device and cluster
Technical Field
The disclosure relates to the field of computers, and in particular to a GPU resource usage method, a GPU virtualization method, a job scheduling device and a cluster.
Background
Modern graphics processing units (Graphics Processing Units, GPUs) began as accelerators for Windows video games, but over the past 20 years they have evolved into enterprise server processors for high-performance computing and artificial intelligence applications.
GPUs now lead in performance for supercomputing, artificial intelligence training and inference, drug research, financial modeling and medical imaging. They are also applied to more mainstream tasks where CPUs are not fast enough, for example in GPU-driven relational databases. GPUs are better suited than CPUs for handling many of the computations required for artificial intelligence and machine learning in enterprise data centers and hyperscale networks. A CPU can handle such work, but takes much longer. GPUs can solve complex mathematical problems faster because they are designed to decompose them into separate tasks and process those tasks in parallel.
After an enterprise purchases a large number of GPU servers, improving GPU utilization is an important way for the enterprise to save on purchase costs. On the one hand, many AI applications or tasks are far from fully occupying the computing power or video memory of a GPU, yet each still needs to monopolize a GPU so that applications or tasks do not interfere with one another, which wastes GPU resources. On the other hand, K8S, an increasingly widely used container orchestration and cluster management tool, manages GPUs on an exclusive per-task basis, which also results in low utilization.
To improve GPU utilization, GPU virtualization techniques have been widely developed and applied.
GPU virtualization improves GPU utilization by splitting a GPU into multiple finer-grained virtual GPUs, each of which runs an application or task with smaller computing power and video memory consumption.
However, in some scenarios the utilization of GPU computing power and the utilization of video memory are not proportional. For example, in scenarios such as model inference A/B testing and notebook-based exploration, the computing power utilization is far lower than the video memory utilization. To use the computing power more fully, the GPU must be split into finer-grained virtual GPUs, but with a traditional GPU virtualization scheme the video memory of each virtual GPU is then too small to support a single application or task normally; if, instead, the GPU is split into coarser-grained virtual GPUs to meet the video memory requirement, the computing power of the GPU sits idle.
Therefore, a solution is needed that can improve the overall utilization of GPU resources and solve the problem that the computing power utilization and the video memory utilization of the GPU cannot be balanced at the same time.
Disclosure of Invention
The technical problem to be solved by the present disclosure is to provide a scheme capable of improving the overall utilization of GPU resources and solving the problem that the computing power utilization and the video memory utilization of the GPU cannot be balanced at the same time.
According to a first aspect of the present disclosure, there is provided a GPU resource usage method, including: splitting a GPU into a plurality of virtual GPUs; for at least one virtual GPU, allocating at least part of the host memory to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is larger than the on-board video memory of the virtual GPU; and replacing a video memory application request of an application or task for the virtual GPU with a video memory application request based on a unified address space, so that when the currently available on-board video memory of the virtual GPU is insufficient, at least part of the data in the on-board video memory can be swapped to the video memory swap area based on the unified address space.
Optionally, replacing the video memory application request of the application or task for the virtual GPU with a video memory application request based on a unified address space includes: intercepting, by means of a hijacking library, call requests of the application or task to the virtual GPU, and replacing the default interface used by the application or task to apply for video memory with a video memory application interface based on the unified address space.
Optionally, the method further comprises: obtaining virtualization information, the virtualization information including a maximum number of virtual GPUs and a virtual video memory size. Splitting the GPU into a plurality of virtual GPUs includes: splitting the GPU into a plurality of virtual GPUs on the condition that the number of virtual GPUs obtained by splitting is not greater than the maximum number of virtual GPUs. Allocating part of the memory of the device where the GPU is located to the virtual GPU as a video memory swap area includes: allocating host memory of the same size as the virtual video memory to the virtual GPU as the video memory swap area.
Optionally, the method further comprises: acquiring resource requirement information, the resource requirement information characterizing the GPU resources required by an application or task; and allocating a virtual GPU to the application or task according to the resource requirement information.
Optionally, the resource requirement information includes the GPU computing power ratio to be used and the video memory size to be used.
Optionally, the application or task runs in a container, and the method further comprises: setting environment variables of the container, the environment variables including a container identification, an identification of the virtual GPU mounted into the container, and an upper limit of the GPU resources the container can access; mounting the virtual GPU into the container based on the environment variables; and starting the container to run the application or task in the container.
Optionally, the method further comprises: monitoring the resource usage of the virtual GPU.
According to a second aspect of the present disclosure, there is provided a GPU virtualization method, comprising: splitting a GPU into a plurality of virtual GPUs; and, for at least one virtual GPU, allocating at least part of the host memory to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is larger than the on-board video memory of the virtual GPU.
According to a third aspect of the present disclosure, there is provided a job scheduling apparatus, including: a scheduler component, configured to schedule a container-class job onto one or more GPUs and split each GPU into one or more virtual GPUs, where each virtual GPU corresponds to one container in the container-class job, each container corresponds to one application or task, and the application or task runs in the container; for at least one virtual GPU, the scheduler component further allocates at least part of the host memory to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is larger than the on-board video memory of the virtual GPU; and a hijacking library, configured to intercept call requests of an application or task to the GPU, where, for an application or task that needs to use virtual video memory, the hijacking library further sets the interface used to apply for video memory to a video memory application interface based on a unified address space.
Optionally, the hijacking library is further configured to check whether the video memory requested by the application or task exceeds the video memory allocated to it, and to send the video memory application request to the driver only if it does not.
Optionally, the apparatus further comprises: a device component, configured to mount the hijacking library into the container and to set a pre-load library in the container so that the hijacking library is forcibly loaded before any process in the container starts.
Optionally, the apparatus further comprises: a mounting component, configured to obtain the virtual GPU identification of the virtual GPU by communicating with the scheduler component, and to mount the virtual GPU into the container according to the virtual GPU identification.
Optionally, the mounting component is further configured to obtain GPU resource information of the virtual GPU by communicating with the scheduler component, and to record the GPU resource information in the environment variables of the container.
Optionally, the apparatus further comprises: a monitoring component, configured to monitor and output the resource usage of the virtual GPU.
Optionally, the apparatus further comprises: a tagging component, configured to record, for each container in the container-class job, a container identification capable of uniquely identifying the container in the environment variables of the container.
According to a fourth aspect of the present disclosure, there is also provided a Kubernetes cluster, comprising: a plurality of GPU nodes, each GPU node comprising one or more GPUs; and a job scheduling device deployed on at least one GPU node, the job scheduling device being the job scheduling apparatus described in the third aspect of the present disclosure.
According to a fifth aspect of the present disclosure, there is also provided a GPU virtualization apparatus, including: a splitting module, configured to split a GPU into a plurality of virtual GPUs; and an allocation module, configured to allocate, for at least one virtual GPU, part of the memory of the host where the GPU is located to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is larger than the on-board video memory of the virtual GPU.
According to a sixth aspect of the present disclosure, there is provided a computing device comprising: a processor; and a memory having executable code stored thereon which, when executed by the processor, causes the processor to perform the method of the first aspect described above.
According to a seventh aspect of the present disclosure, there is provided a non-transitory machine-readable storage medium having executable code stored thereon which, when executed by a processor of an electronic device, causes the processor to perform the method of the first aspect described above.
Thus, by allocating host memory to a virtual GPU as a video memory swap area, the swap area can serve as virtual video memory and increase the available video memory of the virtual GPU. This avoids the problem that, when a GPU is split into finer-grained virtual GPUs in order to fully utilize its computing power, the on-board video memory of each virtual GPU is insufficient to support a single application or task, and thereby solves the problem that the computing power utilization and the video memory utilization of the GPU cannot be balanced at the same time. On this basis, the present disclosure enables an application or task to use the virtual video memory (i.e., the video memory swap area) without any awareness of it, by replacing the application's or task's video memory application request for the virtual GPU with a video memory application request based on the unified address space.
Drawings
The foregoing and other objects, features and advantages of the disclosure will be apparent from the following more particular descriptions of exemplary embodiments of the disclosure as illustrated in the accompanying drawings wherein like reference numbers generally represent like parts throughout exemplary embodiments of the disclosure.
Fig. 1 illustrates a schematic diagram of the present disclosure for virtualizing a GPU.
Fig. 2 illustrates a schematic configuration of a job scheduling apparatus according to one embodiment of the present disclosure.
Fig. 3 shows a schematic diagram of the hijacking library.
Fig. 4 shows an overall process flow diagram for a vGPU task.
Fig. 5 illustrates a schematic structure of a GPU virtualization device according to one embodiment of the present disclosure.
Fig. 6 illustrates a structural schematic diagram of a computing device according to one embodiment of the present disclosure.
Detailed Description
Preferred embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While the preferred embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Fig. 1 illustrates a schematic diagram of the present disclosure for virtualizing a GPU.
Referring to fig. 1, a GPU may be split into multiple virtual GPUs. Here, a GPU refers to a physical GPU card. The splitting granularity can be set flexibly as required. A virtual GPU may also be referred to as a vGPU.
For at least one virtual GPU, at least part of the host memory of the host (e.g., a server) where the GPU is located may be allocated to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is greater than its on-board video memory. The on-board video memory is the video memory integrated on the physical GPU card, i.e., the video memory provided by the physical GPU card itself. It may also be referred to as physical video memory.
For a virtual GPU that has been allocated a video memory swap area, when its on-board video memory is insufficient, part of the space in the on-board video memory can be released for the current program to use. The data in the released space is saved to the video memory swap area, and when that data is needed again it can be swapped from the swap area back into the on-board video memory.
Therefore, with the video memory swap area in place, the total available video memory of the virtual GPU is larger than its on-board video memory. In other words, the host memory serving as the swap area can be used as virtual video memory of the virtual GPU, achieving the effect of increasing the virtual GPU's available video memory. The total available video memory of the virtual GPU equals its on-board video memory plus the swap area allocated to it.
By using host memory as a video memory swap area, the available video memory can exceed the on-board video memory. Consequently, when a GPU is split into finer-grained virtual GPUs in order to fully utilize GPU resources, the problem that a single application or task cannot be supported because the on-board video memory of each resulting virtual GPU is small is avoided. This improves the overall utilization of GPU resources while solving the problem that the computing power utilization and the video memory utilization of the GPU cannot be balanced at the same time, and it helps customers set the splitting granularity of GPU virtualization more flexibly, so that GPU resource utilization is maximized.
A virtual GPU may be used exclusively by one application or task. An application or task that needs to use the video memory swap area is one whose maximum video memory usage during execution exceeds the on-board video memory of the virtual GPU.
So that the upper-layer application or task can use the video memory swap area without being aware of it, the disclosure further proposes that, for such an application or task, its video memory application request for the virtual GPU may be replaced with a video memory application request based on a unified address space.
The unified address space can be thought of as a single unified (virtual) address space into which both host memory and device memory (i.e., the on-board video memory of the GPU) are mapped. Within the unified address space, memory and video memory are no longer distinguished, which provides the basis for freely exchanging data between memory and video memory.
Therefore, by replacing the default video memory application request for the virtual GPU with a video memory application request based on the unified address space, at least part of the data in the on-board video memory can be swapped to the video memory swap area based on the unified address space when the currently available on-board video memory of the virtual GPU is insufficient.
The present disclosure can intercept the call requests of the application or task to the virtual GPU by means of a hijacking library, and replace the default interface the application or task uses to apply for video memory with a video memory application interface based on the unified address space, thereby replacing the application's or task's video memory application request for the virtual GPU with a request based on the unified address space.
The concept of a unified address space was introduced in CUDA 8 and subsequent CUDA versions. In the unified address space, device memory (video memory) and host memory are no longer distinguished; data exchange between device memory and host memory is performed automatically by the HMM component in the Linux kernel together with the Nvidia kernel module, so the user no longer needs to control it manually by calling cuMemcpyDtoH or cuMemcpyHtoD. This mechanism allows memory and video memory to be exchanged freely, so host memory can be used as a swap area for video memory.
The present disclosure may replace the default video memory application interface (cuMemAlloc) with a video memory application interface based on the unified address space (cuMemAllocManaged). When the video memory requested through the unified-address-space interface exceeds the currently available on-board video memory of the virtual GPU, system components (such as the kernel module part of the Nvidia driver and the HMM component in the Linux kernel) automatically swap at least part of the data stored in the on-board video memory to the video memory swap area, releasing on-board video memory to meet the video memory needs of the application or task.
Thus, by replacing the CUDA call chain, an ordinary video memory application is turned into a video memory application based on the unified address space, giving the application or task the ability to use virtual video memory while the whole process remains completely transparent to it.
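Purely as an illustrative sketch (and not as part of the claimed scheme as filed), such a call-chain replacement could be expressed in C using the public CUDA driver API as follows; the exported wrapper simply redirects an ordinary allocation to managed memory in the unified address space:

    /* Illustrative sketch, assuming a hijacking library preloaded ahead of the
     * real CUDA driver library: an ordinary video memory allocation is
     * redirected to managed memory in the unified address space, so the
     * driver/HMM can page data between on-board video memory and the
     * host-memory swap area when on-board memory runs short. */
    #include <stddef.h>
    #include <cuda.h>

    /* Exported under the same symbol as the real driver call, so the
     * application's cuMemAlloc calls land here first. */
    CUresult cuMemAlloc(CUdeviceptr *dptr, size_t bytesize)
    {
        /* CU_MEM_ATTACH_GLOBAL makes the managed allocation accessible from
         * any stream and from the host, enabling transparent migration. */
        return cuMemAllocManaged(dptr, bytesize, CU_MEM_ATTACH_GLOBAL);
    }

In a real hijacking library the wrapper would additionally account for the container's video memory quota before forwarding, as discussed later in connection with fig. 3.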
GPUs described in this disclosure may be deployed in a K8S cluster, i.e., a Kubernetes-based GPU cluster comprising a plurality of GPU nodes, where each GPU node may include one or more GPUs. A GPU node may be a GPU server. In a K8S cluster, an application or task may be deployed in a container (Container) that provides the environment needed to run it, and the virtual GPU the application or task needs may be mounted into the container for its exclusive use.
To allow the splitting granularity of GPU virtualization to be set flexibly, the present disclosure proposes that the virtualization information may be set by the user based on the actual situation of the cluster and of the upper-layer applications or tasks. The virtualization information may characterize the splitting granularity of the GPU and/or the size of the video memory swap area allocated to each virtual GPU.
For example, the virtualization information may include a maximum number of virtual GPUs and a virtual video memory size. The maximum number of virtual GPUs defines how many virtual GPUs one physical card can be split into at most. The virtual video memory size defines the size of the virtual video memory to be configured for a virtual GPU. It can be expressed through a "virtual video memory multiple", i.e., the virtual video memory as a multiple of the actual on-board video memory; for example, a virtual video memory multiple of 10 means the virtual video memory is 10 times the actual on-board video memory.
Therefore, when splitting the GPU into multiple virtual GPUs, the split may be performed on the condition that the number of virtual GPUs obtained is not greater than the maximum number of virtual GPUs; and when allocating at least part of the host memory to a virtual GPU as a video memory swap area, host memory of the same size as the virtual video memory may be allocated to the virtual GPU as its swap area.
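Purely as an illustrative sketch of the arithmetic implied above (the structure and function names are assumptions made for the example, not part of the disclosure), the swap area and total available video memory of a virtual GPU could be derived from the virtual video memory multiple as follows:

    /* Illustrative sketch: derive the swap-area size and total available
     * video memory of a vGPU from the "virtual video memory multiple".
     * All names here are assumptions made for the example. */
    #include <stddef.h>

    typedef struct {
        size_t onboard_bytes; /* on-board (physical) video memory of the vGPU */
        size_t swap_bytes;    /* host-memory swap area allocated to the vGPU  */
    } VgpuMemory;

    static VgpuMemory configure_vgpu_memory(size_t onboard_bytes,
                                            unsigned virtual_mem_multiple)
    {
        VgpuMemory m;
        m.onboard_bytes = onboard_bytes;
        /* Host memory of the same size as the virtual video memory is used as
         * the swap area, e.g. a multiple of 10 yields a swap area ten times
         * the on-board video memory. */
        m.swap_bytes = (size_t)virtual_mem_multiple * onboard_bytes;
        return m;
    }

    /* Total available video memory = onboard_bytes + swap_bytes. */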
To avoid interference between tasks, one virtual GPU can be allocated to one application or task, and the scheme limits each task's access to and use of the GPU based on the virtualization parameters set above, so that changing the running state of one application or task does not affect the applications or tasks on other virtual GPUs. When allocating a virtual GPU to an application or task, resource requirement information of the application or task may be obtained; this information characterizes the GPU resources the application or task requires. Since the GPU resources needed may differ across states during execution, the required GPU resources may refer to the maximum GPU resources the application or task needs over its whole run. Based on the resource requirement information, a virtual GPU capable of satisfying the required GPU resources can be selected.
The resource requirement information may include a computing power usage ratio and a video memory usage size. The computing power usage ratio represents the proportion of GPU computing power the application or task needs, in percent; for example, if it is set to 10, the application or task can be scheduled onto a virtual GPU with more than 10% of GPU computing power available. The video memory usage size may be expressed in MB; for example, if it is set to 1024, the application or task may occupy 1024 MB of GPU video memory.
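As an illustrative sketch only (the types and function below are assumptions for the example, not the scheduler's actual logic), the fit check implied by this resource requirement information could look like:

    /* Illustrative sketch of selecting a vGPU that satisfies the declared
     * resource requirements; all names are assumptions for the example. */
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {
        unsigned compute_percent; /* GPU computing power usage ratio, in % */
        size_t   memory_mb;       /* video memory usage size, in MB        */
    } ResourceDemand;

    typedef struct {
        unsigned free_compute_percent; /* computing power still available  */
        size_t   free_memory_mb;       /* video memory still available     */
    } VgpuCapacity;

    static bool vgpu_fits(const ResourceDemand *d, const VgpuCapacity *c)
    {
        /* Schedule onto a vGPU only if both its remaining computing power and
         * its remaining video memory cover the declared demand. */
        return d->compute_percent <= c->free_compute_percent &&
               d->memory_mb       <= c->free_memory_mb;
    }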
To make the container's use of the virtual GPU transparent, the present disclosure proposes that environment variables corresponding to the virtual GPU be set when the container is created. These may include, but are not limited to, a container identification, the identification of the virtual GPU mounted into the container, and the upper limit of GPU resources the container can access. The container identification may be a universally unique identifier (UUID), used by the scheduler plug-in mentioned below to identify the container. The identification of the virtual GPU may be the GPU device number accessible within the container, used for device mapping between the inside and the outside of the container. The upper limit of GPU resources the container can access may refer to the maximum GPU resources the virtual GPU can provide, and may include an upper limit on video memory and an upper limit on the computing power usage ratio. The virtual GPU can be mounted into the container based on these environment variables, and the container is then started so that the application or task runs in it.
The present disclosure may also monitor the resource usage of the virtual GPUs. Optionally, the resource usage may also be output (e.g., displayed visually) in real time.
This completes the description, with reference to fig. 1, of the basic flow of the GPU virtualization method and the GPU resource usage method of the present disclosure. The disclosure further provides a cloud-native job scheduling device, suitable for deployment on the nodes of a K8S cluster that require GPU virtualization, to schedule container-class jobs in a cloud-native manner.
A container-class job is a job submitted to the K8S cluster that contains vGPU resources, such as a pod, deployment or job. A container-class job may also be referred to as a vGPU task. A container-class job may involve one or more containers; in K8S, a Pod is the smallest unit of K8S management, and one Pod may contain multiple containers. One application or task may run in each container. The GPU resources required to run the application or task in a container are vGPU resources, i.e., one application or task may use one vGPU exclusively. The vGPU resources characterize the size of the vGPU each application or task requires, such as the GPU computing power ratio and the video memory size to be used.
Fig. 2 illustrates a schematic configuration of a job scheduling apparatus according to one embodiment of the present disclosure.
As shown in fig. 2, job scheduling device 200 may include a scheduler component 210 and a hijacking library 220. Optionally, job scheduling device 200 may further include one or more of a device component 230, a mounting component 240, a monitoring component 250 and a tagging component 260, shown in dashed boxes. The components may be packaged by means of a chart (e.g., a Helm chart).
Scheduler component 210 may schedule a container-class job onto one or more GPUs based on the job's demand for GPU resources, and split each GPU into one or more virtual GPUs, each virtual GPU corresponding to one container in the container-class job, each container corresponding to one application or task, with the application or task running in the container. For a GPU already in use, the scheduler component can split the GPU's remaining computing power and video memory to obtain a virtual GPU.
For at least one virtual GPU, the scheduler component 210 can also allocate at least part of the host memory to the virtual GPU as a video memory swap area, so that the available video memory of the virtual GPU is greater than its on-board video memory. Scheduler component 210 can split GPUs based on the vGPU resource information in the container-class job. The vGPU resource information may include the number of vGPUs and the maximum resources each vGPU needs to provide, such as an upper computing power limit and an upper video memory limit.
Scheduler component 210 may also be referred to as a scheduler plug-in. It may be a scheduler plug-in, denoted 4PD-vGPU-Scheduler, that improves on the Scheduler Extender in the K8S cluster in order to schedule all vGPU tasks.
Scheduler component 210 can take over and handle the scheduling of all vGPU tasks, orchestrate the GPU resources of the whole cluster, and distribute (i.e., schedule) tasks onto several GPUs of an appropriate GPU node. The original official K8S scheduler only supports allocating GPUs by number, and an allocated GPU is regarded as an exclusive resource that cannot be used by other applications or tasks. In contrast, the scheduler component 210 of the present disclosure lets a task specify the GPU resources it needs to use (e.g., video memory size and computing power usage ratio); by scheduling applications or tasks onto virtual GPUs that meet those requirements, a task can use only part of a GPU's video memory and computing power, allowing multiple tasks to share the resources of one GPU.
Hijacking library 220 is used to intercept application or task call requests to the GPU. For an application or task that needs to use virtual video memory, the hijacking library 220 also sets (e.g., replaces) the interface used to apply for video memory (e.g., cuMemAlloc) to a video memory application interface based on the unified address space (e.g., cuMemAllocManaged), so that, when the currently available on-board video memory of the virtual GPU is insufficient, at least part of the data in the on-board video memory can be swapped to the video memory swap area based on the unified address space. When the video memory requested through the unified-address-space interface is larger than the currently available on-board video memory, system components (such as the HMM component) automatically swap at least part of the data stored in the on-board video memory to the swap area, releasing on-board video memory.
The hijacking library 220 may be a modification of an existing Hooked CUDA Driver component (i.e., a CUDA hijacking library). Such hijacking libraries exist in the prior art, but they do not support multi-card splitting and cannot be used directly on K8S. The present disclosure adds to the hijacking library the ability to communicate with the scheduler component 210, so that the K8S cluster can dynamically control individual applications or tasks.
As described above, the concept of a unified address space was introduced in CUDA 8 and subsequent CUDA versions. In the unified address space, device memory (video memory) and host memory are no longer distinguished; all data exchange between device memory and host memory is performed automatically by the HMM component in the Linux kernel together with the Nvidia kernel module, and the user no longer needs to control it manually by calling cuMemcpyDtoH or cuMemcpyHtoD. This mechanism allows memory and video memory to be exchanged freely, so host memory can be used as a swap area for video memory.
The present disclosure creatively applies the unified address space and hijacking library techniques to the field of GPU multiplexing: the hijacking library 220 (i.e., the CUDA hijacking library) replaces the CUDA call chain so that ordinary video memory applications become unified-address-space applications, giving the task the ability to use host memory as a video memory swap area while remaining completely unaware of it.
The hijacking library 220 can hijack all call requests from the upper layer by intercepting symbol calls, process them, and forward them to the real CUDA execution library below. Fig. 3 shows a schematic diagram of the hijacking library.
As shown in fig. 3, the hijacking library sits between the driver layer (Nvidia GPU Driver) and the CUDA runtime (Cuda Runtime) layer, so it can intercept all requests sent by the CUDA runtime to the driver. On this basis, the hijacking library can check whether the video memory requested by the application or task exceeds the video memory allocated to it, and forward the video memory application request to the lower-layer driver, such as the GPU driver, only if it does not. For example, the hijacking library may keep per-container statistics of video memory and computing power, perform the corresponding validity check (the requested video memory must not exceed the allocated video memory size), and then pass the request down to the lower-layer driver.
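As an illustrative sketch only (the bookkeeping variables are assumptions for the example, and their values would be initialized from the container's limits), the per-container check described above could be layered onto the intercepted allocation call like this:

    /* Illustrative sketch: the intercepted allocation call checks the
     * container's video memory quota before forwarding to the real driver.
     * used_bytes and limit_bytes are assumptions for the example; limit_bytes
     * would be initialized from the container's environment (see below). */
    #include <stddef.h>
    #include <cuda.h>

    static size_t used_bytes;   /* video memory already granted to this container */
    static size_t limit_bytes;  /* per-container quota set by the scheduler       */

    CUresult cuMemAlloc(CUdeviceptr *dptr, size_t bytesize)
    {
        if (used_bytes + bytesize > limit_bytes)
            return CUDA_ERROR_OUT_OF_MEMORY;   /* reject over-quota requests */

        /* Forward as a unified-address-space allocation (resolves to the real
         * driver entry point, since this library does not override it). */
        CUresult r = cuMemAllocManaged(dptr, bytesize, CU_MEM_ATTACH_GLOBAL);
        if (r == CUDA_SUCCESS)
            used_bytes += bytesize;            /* account for the grant */
        return r;
    }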
The device component 230 is configured to mount (i.e., map) the hijacking library 220 into the container and to set a pre-load library in the container, so that the hijacking library is forcibly loaded before any process in the container starts.
The device component 230 may also be referred to as a device plug-in. It may be an improvement over the device plug-in (i.e., Device Plugin) in the K8S cluster; the modified device plug-in may be denoted 4PD-vGPU-Device-Plugin. The device component 230 is responsible for mapping the CUDA hijacking library (libvgpu.so) into the container and setting a preload file (/etc/ld.so.preload) in the container. The purpose of /etc/ld.so.preload is to force libvgpu.so to be loaded before any process starts, so that the user cannot bypass the vGPU and access the GPU directly. As a result, all CUDA calls inside the container are forwarded via the hijacking library.
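As an illustrative sketch only (the typedef, variable and constructor names are assumptions for the example), a preloaded hijacking library typically resolves the real driver symbols when it is loaded, so intercepted calls can still be forwarded downward:

    /* Illustrative sketch: because /etc/ld.so.preload loads libvgpu.so before
     * other libraries, its wrappers win symbol lookup; the real driver entry
     * points are then resolved with RTLD_NEXT so calls can be forwarded after
     * checking. Names here are assumptions for the example. */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <cuda.h>

    typedef CUresult (*cuInit_fn)(unsigned int);
    static cuInit_fn real_cuInit;

    __attribute__((constructor))
    static void init_hijack(void)
    {
        /* RTLD_NEXT skips this library and finds the next definition of the
         * symbol, i.e. the one in the real libcuda.so. */
        real_cuInit = (cuInit_fn)dlsym(RTLD_NEXT, "cuInit");
    }

    CUresult cuInit(unsigned int flags)
    {
        /* A fuller implementation would also read the container's limits from
         * its environment variables here before delegating. */
        return real_cuInit(flags);
    }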
The mounting component 240 can obtain the identification of the virtual GPU by communicating with the scheduler component 210. The identification of the virtual GPU identifies the device number of the virtual GPU, and the mounting component 240 can mount the virtual GPU into the container according to this identification, together with the corresponding driver libraries.
The mounting component 240 sits at the container level and may be a modification of nvidia-container-runtime; the modified nvidia-container-runtime may be denoted 4PD-nvidia-container-runtime. Compared with the ordinary nvidia-container-runtime, the mounting component 240 additionally communicates with the scheduler component 210, and based on that communication it can actually mount the vGPU assigned to the container by the scheduler component 210 into the container.
The mounting component 240 may also obtain the GPU resource information of the virtual GPU by communicating with the scheduler component 210 and record it in the environment variables of the container. The GPU resource information may refer to the upper limit of GPU resources the virtual GPU can provide, such as a video memory upper limit and a computing power upper limit. The GPU resource information in the container's environment variables is thus the GPU resources the container may access (i.e., use). The container can then be created and started based on the environment variables; the operations of creating and starting the container may be performed by the container runtime runc.
The monitoring component 250, which may be denoted 4PD-vGPU-monitor, is used to monitor the resource usage of the vGPUs, for example by monitoring a number of pre-configured metrics. The monitoring component 250 can also push metrics outward, facilitating real-time monitoring and visualization of the vGPU resources of the whole cluster.
The tagging component 260 can record, for each container in the container-class job, a container identification that uniquely identifies the container in the container's environment variables, facilitating identification by the scheduler component 210. A universally unique identifier (UUID) may be used as the container identification. The tagging component 260 may be a MutatingWebhook in K8S.
Fig. 4 shows an overall process flow diagram for a vGPU task.
The steps shown in fig. 4 may be performed by the corresponding components of the job scheduling apparatus of the present disclosure. Specifically, step S410 and step S420 may be performed by the tagging component; step S430, step S440 and step S470 may be performed by the scheduler component; step S450 may be performed by the device component; and step S460 may be performed by the mounting component.
Referring to fig. 4, at step S410, when a task is submitted, the tagging component may first check whether the vGPU task contains vGPU resources. If vGPU resources are detected, the processing of steps S420 to S460 is executed; by virtualizing the GPU, the application or task can use virtual video memory, and in particular can do so without being aware of it. If no vGPU resources are found, the submitted task does not need GPU virtualization, so step S470 may be performed, executing the default scheduling flow.
In step S420, a container UUID is added for each vGPU resource found, so that downstream components can identify the corresponding container. Each vGPU resource corresponds to one container, being the vGPU resource that container needs to use.
In step S430, GPU nodes in the cluster are filtered.
In step S440, the GPU nodes obtained from the filtering are scored.
Through filtering and scoring, the most suitable GPU node (e.g., the GPU node or nodes with the highest score) can be selected to execute the vGPU task. For example, the GPU nodes that support virtualization may be screened first; the remaining nodes are then scored according to, e.g., the GPU computing power currently remaining on each node and the splitting granularity it supports; the most suitable (i.e., highest-scoring) node is selected; and the GPUs in the selected node are split to obtain one or more vGPUs that meet the requirements.
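A minimal sketch of such a filter-and-score pass (the node fields, weights and function names are assumptions for the example, not the scoring actually used by the scheduler) might look like:

    /* Illustrative sketch of filtering and scoring GPU nodes; the fields,
     * weights and names are assumptions made for the example. */
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {
        bool     supports_virtualization; /* node can host vGPUs at all     */
        unsigned free_compute_percent;    /* remaining GPU computing power   */
        unsigned max_split_count;         /* splitting granularity supported */
    } GpuNode;

    static int score_node(const GpuNode *n)
    {
        if (!n->supports_virtualization)
            return -1;                        /* filtered out */
        /* Prefer nodes with more remaining computing power and finer
         * supported splitting granularity (the weights are arbitrary here). */
        return (int)(n->free_compute_percent * 2 + n->max_split_count);
    }

    static size_t pick_best_node(const GpuNode *nodes, size_t count)
    {
        size_t best = 0;
        int best_score = -1;
        for (size_t i = 0; i < count; ++i) {
            int s = score_node(&nodes[i]);
            if (s > best_score) { best_score = s; best = i; }
        }
        return best; /* index of the highest-scoring node */
    }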
In step S450, the hijacking library mount is added.
After the task has been submitted to the corresponding GPU node, the CUDA hijacking library libvgpu.so and the preload file /etc/ld.so.preload may be mounted.
In step S460, environment variables are set.
After step S450, the task may be submitted to the container layer. The nvidia-container-runtime of the container layer communicates with the vGPU scheduler, obtains the GPU serial number corresponding to the vGPU along with the corresponding available video memory and utilization upper limit, and fills in the following three environment variables: NVIDIA_VISIBLE_DEVICES, CUDA_DEVICE_MEMORY_LIMIT and CUDA_DEVICE_SM_LIMIT. The first environment variable controls which GPU devices are mounted into the container; the second and third control access to the GPU, and are respectively the upper limits on the video memory and the utilization the container can access.
Finally, according to these environment variables, nvidia-container-cli can be called to mount the specific GPU device and the corresponding driver libraries, which are handed over to runc to start the container.
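For illustration only, a hijacking library loaded inside such a container could pick up these per-container limits when it initializes; the parsing below assumes plain numeric values, and the names of the globals and the helper are assumptions for the example:

    /* Illustrative sketch: read the per-container limits from the environment
     * variables listed above when the hijacking library is loaded. */
    #include <stdlib.h>

    static unsigned long long memory_limit_bytes; /* from CUDA_DEVICE_MEMORY_LIMIT */
    static unsigned long      sm_limit_percent;   /* from CUDA_DEVICE_SM_LIMIT     */

    static void load_limits_from_env(void)
    {
        const char *mem = getenv("CUDA_DEVICE_MEMORY_LIMIT");
        const char *sm  = getenv("CUDA_DEVICE_SM_LIMIT");
        if (mem) memory_limit_bytes = strtoull(mem, NULL, 10);
        if (sm)  sm_limit_percent   = strtoul(sm, NULL, 10);
    }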
In summary, the present disclosure discusses a cloud-native GPU virtualization technique with virtual video memory capability, giving a complete flow and solution from both a product and a technical perspective to address the drawbacks described above. The scheme of the present disclosure can be implemented on top of the cloud-native K8S technology and can be adapted directly to cloud-native scenarios. Accordingly, the present disclosure may be implemented as a cloud-native GPU virtualization scheme that supports virtual video memory. The cloud-native GPU virtualization scheme realized on the basis of the job scheduling device mainly comprises a component deployment stage, an application or task creation stage, and an application or task running stage.
In the component deployment stage, the components of the job scheduling device are installed on the nodes of the K8S cluster that need GPU virtualization. During deployment, parameters such as the virtual video memory multiple and the maximum number of vGPUs need to be set; the user can set them according to the actual situation of the cluster and of the upper-layer applications or tasks. These two parameters are newly added by the present scheme, and their function is described above.
In the application or task creation stage, the user sets parameters such as the GPU computing power ratio to be used and the GPU video memory size to be used, based on how the application or task consumes GPU resources and on the computing power and video memory of the GPU cards in the cluster, flexibly setting the computing power and video memory each application or task requires. These two parameters are also newly added by the present disclosure.
In the application or task running stage, the application or task runs on a virtual GPU using the set parameters. Applications on different virtual GPUs do not interfere with each other, and changing the running state of one application or task does not influence the applications or tasks on other virtual GPUs.
The expected effect is that the sum of the video memory sizes used by the applications or tasks can exceed the total video memory of the actual physical GPU, while, in inference scenarios where the GPU computing power is not saturated, the performance of each application or task remains essentially consistent with that without GPU virtualization, with a loss within 10%. This is because inference tasks have strong locality, so memory and video memory are not swapped too frequently and no severe performance degradation is caused; the loss here refers to the increase in request processing time in the inference scenario.
The present disclosure may enable deployment of multiple AI applications (e.g., inference models) on one GPU by multiplexing the GPUs.
In the prior art, deploying multiple inference models on one GPU can be achieved with a multi-model loading approach (e.g., choosing Nvidia Triton Server, TorchServe or TF Serving). For example, Nvidia developed the Nvidia Triton Server inference engine for inference scenarios, which can load multiple inference models on one GPU. Likewise, many AI training frameworks have corresponding inference service engines that can load multiple models in one task and provide inference services for each model. The drawback of the multi-model loading approach is that each engine only fits specific model formats: TF Serving can only load TensorFlow models, TorchServe can only load PyTorch models, and even Nvidia Triton Server, which has the widest applicability, can only load TorchScript models for PyTorch, so not all application scenarios can be covered. The GPU multiplexing scheme of the present disclosure can cover all application scenarios.
In the prior art, deploying multiple inference models on one GPU can also be achieved by choosing another GPU virtualization scheme. However, most of the currently popular GPU virtualization schemes cannot be adapted to the cloud-native scenario of privately deployed GPUs: for example, the virtualization capability provided by Nvidia vGPU targets virtual machine scenarios, and the qGPU and cGPU schemes proposed by some enterprises target their own public cloud scenarios. Such solutions are difficult to adapt directly to the cloud-native scenario, whereas the present disclosure can be directly adapted to it and realized as a cloud-native GPU virtualization scheme.
The GPU virtualization method described above in connection with fig. 1 of the present disclosure may also be implemented as a GPU virtualization device.
Fig. 5 illustrates a schematic structure of a GPU virtualization device according to one embodiment of the present disclosure.
The functional units of the GPU virtualization device may be implemented by hardware, by software, or by a combination of hardware and software embodying the principles of the present disclosure. Those skilled in the art will appreciate that the functional units depicted in fig. 5 may be combined or divided into sub-units to implement the principles described above. Thus, the description herein supports any possible combination, division, or even further definition of the functional units described herein.
The functional units that the GPU virtualization device may have and the operations that each functional unit may perform are briefly described below, and details related thereto may be referred to the above related description, which is not repeated herein.
Referring to fig. 5, the GPU virtualization device 500 may include a splitting module 510 and an allocation module 520.
The splitting module 510 is configured to split the GPU into multiple virtual GPUs.
The allocation module 520 is configured to allocate, for at least one virtual GPU, a portion of the memory of the host in which the GPU is located as a memory swap area to the virtual GPU, so that the available memory of the virtual GPU is greater than the on-board memory of the virtual GPU.
The GPU virtualization apparatus 500 may further include a first obtaining module configured to obtain virtualization information, where the virtualization information includes a maximum number of virtual GPUs and a virtual memory size. The splitting module may split the GPU into a plurality of virtual GPUs on the condition that the number of virtual GPUs obtained by splitting is not greater than the maximum number of virtual GPUs. The allocation module may allocate host memory of the same size as the virtual memory as the memory swap area to the virtual GPU.
The GPU virtualization device 500 may further include a second acquisition module and an allocation module, where the second acquisition module is configured to acquire resource requirement information, and the resource requirement information is configured to characterize GPU resources required by an application or task. The allocation module is used for allocating the virtual GPU for the application or the task according to the resource demand information.
The GPU virtualization device 500 may further include a replacing module, configured to replace a video memory application request of an application or task for the virtual GPU with a video memory application request based on a unified address space, so that, in a case that the currently available on-board video memory of the virtual GPU is insufficient, at least part of data in the on-board video memory can be exchanged to the video memory exchange area based on the unified address space. The replacement module can intercept a call request of an application or a task to the virtual GPU by utilizing the hijacking library, and replace a default interface used by the application or the task for applying for the video memory with a video memory application interface based on a unified address space.
An application or task may run in the container. The GPU virtualization device 500 may also include a setup module, a mount module, and a launch module. The setting module is used for setting environment variables of the container, wherein the environment variables comprise container identification, identification of virtual GPU mounted in the container and upper limit of GPU resources which can be accessed by the container. The mounting module is used for mounting the virtual GPU into the container based on the environment variable. The launch module is used to launch a container to run an application or task in the container.
The GPU virtualization apparatus 500 may further include a monitoring module configured to monitor resource usage of the virtual GPU.
The disclosure may also be implemented as a Kubernetes cluster comprising a plurality of GPU nodes, each comprising one or more GPUs, and a job scheduling device deployed on at least one of the GPU nodes; the job scheduling device may be the job scheduling apparatus described above in connection with fig. 2.
FIG. 6 illustrates a schematic architecture of a computing device that may be used to implement the above-described GPU resource usage method or GPU virtualization method, according to one embodiment of the present disclosure.
Referring to fig. 6, a computing device 600 includes a memory 610 and a processor 620.
Processor 620 may be a multi-core processor or may include multiple processors. In some embodiments, processor 620 may include a general-purpose host processor and one or more special-purpose coprocessors, such as a graphics processing unit (GPU) or a digital signal processor (DSP). In some embodiments, processor 620 may be implemented using custom circuitry, for example an application-specific integrated circuit (ASIC) or a field programmable gate array (FPGA).
Memory 610 may include various types of storage units, such as system memory, read-only memory (ROM), and persistent storage. The ROM may store static data or instructions required by the processor 620 or other modules of the computer. The persistent storage may be a readable and writable storage device, i.e., a non-volatile memory device that does not lose stored instructions and data even after the computer is powered down. In some embodiments, a mass storage device (e.g., a magnetic or optical disk, or flash memory) is used as the persistent storage; in other embodiments, the persistent storage may be a removable storage device (e.g., a diskette or optical drive). The system memory may be a read-write memory device or a volatile read-write memory device, such as dynamic random access memory, and may store instructions and data needed by some or all of the processors at runtime. Furthermore, memory 610 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (DRAM, SRAM, SDRAM, flash memory, programmable read-only memory), magnetic disks and/or optical disks. In some implementations, memory 610 may include readable and/or writable removable storage devices such as compact discs (CDs), digital versatile discs (e.g., DVD-ROM, dual-layer DVD-ROM), read-only Blu-ray discs, ultra-density discs, flash memory cards (e.g., SD cards, mini SD cards, micro-SD cards, etc.), magnetic floppy disks, and the like. Computer-readable storage media do not include carrier waves or transient electronic signals transmitted wirelessly or over wires.
The memory 610 has stored thereon executable code that, when processed by the processor 620, causes the processor 620 to perform the GPU resource utilization method or GPU virtualization method described above.
The GPU resource usage method, the GPU virtualization method, the job scheduling apparatus and the cluster according to the present disclosure have been described in detail above with reference to the accompanying drawings.
Furthermore, the method according to the present disclosure may also be implemented as a computer program or computer program product comprising computer program code instructions for performing the above steps defined in the above method of the present disclosure.
Alternatively, the present disclosure may also be implemented as a non-transitory machine-readable storage medium (or computer-readable storage medium) having stored thereon executable code (or a computer program, or computer instruction code) that, when executed by a processor of an electronic device (or computing device, server, etc.), causes the processor to perform the steps of the above-described method according to the present disclosure.
Those skilled in the art will further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems and methods according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description of the embodiments of the present disclosure has been presented for purposes of illustration and description, and is not intended to be exhaustive or limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen in order to best explain the principles of the embodiments, their practical application, or improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (10)

1. A method for using GPU resources, comprising:
splitting a GPU into a plurality of virtual GPUs;
for at least one virtual GPU, allocating at least part of host memory to the virtual GPU as a video memory exchange area, so that the available video memory of the virtual GPU is larger than the on-board video memory of the virtual GPU; and
replacing a video memory application request of an application or task for the virtual GPU with a video memory application request based on a unified address space, so that, when the currently available on-board video memory of the virtual GPU is insufficient, at least part of the data in the on-board video memory can be exchanged to the video memory exchange area based on the unified address space.
2. The method of claim 1, wherein replacing the video memory application request of the application or task for the virtual GPU with the video memory application request based on the unified address space comprises:
intercepting, by using a hijacking library, a call request of the application or task to the virtual GPU, and replacing a default interface used by the application or task to apply for video memory with a video memory application interface based on the unified address space.
3. The method as recited in claim 1, further comprising:
obtaining virtualization information, wherein the virtualization information comprises a maximum number of virtual GPUs and a virtual video memory size,
wherein splitting the GPU into a plurality of virtual GPUs comprises: splitting the GPU into a plurality of virtual GPUs on the condition that the number of virtual GPUs obtained by the splitting is not larger than the maximum number of virtual GPUs,
and wherein allocating at least part of host memory to the virtual GPU as a video memory exchange area comprises: allocating host memory of the same size as the virtual video memory size to the virtual GPU as the video memory exchange area.
4. The method as recited in claim 1, further comprising:
acquiring resource demand information, wherein the resource demand information represents GPU resources required by an application or task; and
allocating a virtual GPU to the application or task according to the resource demand information.
5. A method for virtualizing a GPU, comprising:
splitting a GPU into a plurality of virtual GPUs; and
for at least one virtual GPU, allocating at least part of host memory to the virtual GPU as a video memory exchange area, so that the available video memory of the virtual GPU is larger than the on-board video memory of the virtual GPU.
6. A job scheduling device, comprising:
a scheduler component configured to schedule container-type jobs onto one or more GPUs and split the GPUs into one or more virtual GPUs, each virtual GPU corresponding to a container in the container-type jobs, and each container corresponding to an application or task running in the container, wherein, for at least one of the virtual GPUs, the scheduler component further allocates at least a portion of host memory to the virtual GPU as a video memory exchange area such that the available video memory of the virtual GPU is greater than the on-board video memory of the virtual GPU; and
a hijacking library configured to intercept a call request of an application or task to the GPU, wherein, for an application or task that needs to use virtual video memory, the hijacking library further sets the interface used for applying for video memory to a video memory application interface based on a unified address space.
7. A Kubernetes cluster comprising:
a plurality of GPU nodes, each GPU node comprising one or more GPUs; and
a job scheduling device deployed on at least one of the GPU nodes, the job scheduling device being the job scheduling device of claim 6.
8. A GPU virtualization apparatus, comprising:
a splitting module configured to split a GPU into a plurality of virtual GPUs; and
an allocation module configured to, for at least one virtual GPU, allocate part of the memory of the host where the GPU is located to the virtual GPU as a video memory exchange area, so that the available video memory of the virtual GPU is larger than the on-board video memory of the virtual GPU.
9. A computing device, comprising:
a processor; and
a memory having executable code stored thereon, which when executed by the processor causes the processor to perform the method of any of claims 1 to 5.
10. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the method of any of claims 1 to 5.
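Purely as an illustration of the exchange recited in claims 1 and 5, and forming no part of the claims, the hedged sketch below shows how data held in an allocation that lives in the unified address space can be migrated from on-board video memory to the host-memory exchange area and faulted back to the GPU on demand. The CUDA runtime calls are standard; the buffer size and program structure are assumptions made only for the example.

```cpp
// exchange_demo.cpp -- illustrative sketch only, not the implementation of this disclosure.
// Assumed build: g++ exchange_demo.cpp -o exchange_demo -lcudart (CUDA toolkit installed)
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 1ull << 30;  // 1 GiB treated as "video memory" for the demo
    float* data = nullptr;
    if (cudaMallocManaged(&data, bytes) != cudaSuccess) {
        std::fprintf(stderr, "unified address space allocation failed\n");
        return 1;
    }

    // Touch the allocation on the GPU so its pages become resident in on-board video memory.
    cudaMemset(data, 0, bytes);
    cudaDeviceSynchronize();

    // When on-board video memory is insufficient, migrate the pages to the host-memory
    // exchange area; cudaCpuDeviceId names the host as the migration destination.
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, /*stream=*/0);
    cudaDeviceSynchronize();

    // The pointer stays valid: the next GPU access pages the data back in on demand.
    std::puts("pages now reside in the host-memory exchange area");

    cudaFree(data);
    return 0;
}
```

The explicit prefetch above merely makes the exchange visible in a few lines; with unified-address-space allocations the same migration can also occur implicitly through demand paging when on-board video memory is exhausted.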
CN202210950598.5A 2022-08-09 2022-08-09 GPU resource using method, GPU virtualization method, job scheduling device and cluster Pending CN117632447A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210950598.5A CN117632447A (en) 2022-08-09 2022-08-09 GPU resource using method, GPU virtualization method, job scheduling device and cluster
PCT/CN2023/111673 WO2024032587A1 (en) 2022-08-09 2023-08-08 Gpu resource usage method, gpu virtualization method, and job scheduling apparatus and cluster

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210950598.5A CN117632447A (en) 2022-08-09 2022-08-09 GPU resource using method, GPU virtualization method, job scheduling device and cluster

Publications (1)

Publication Number Publication Date
CN117632447A true CN117632447A (en) 2024-03-01

Family

ID=89850888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210950598.5A Pending CN117632447A (en) 2022-08-09 2022-08-09 GPU resource using method, GPU virtualization method, job scheduling device and cluster

Country Status (2)

Country Link
CN (1) CN117632447A (en)
WO (1) WO2024032587A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118133574B (en) * 2024-05-06 2024-07-19 沐曦集成电路(上海)有限公司 SRAM (static random Access memory) generating system
CN118312333A (en) * 2024-06-07 2024-07-09 支付宝(杭州)信息技术有限公司 GPU (graphics processing unit) multi-stream concurrency-based video memory multiplexing method and device
CN118331751B (en) * 2024-06-14 2024-08-20 济南浪潮数据技术有限公司 Computing resource allocation method, computer program, device and medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160188251A1 (en) * 2014-07-15 2016-06-30 Nvidia Corporation Techniques for Creating a Notion of Privileged Data Access in a Unified Virtual Memory System
US11182309B2 (en) * 2019-11-04 2021-11-23 Nvidia Corporation Techniques for an efficient fabric attached memory
CN111223036B (en) * 2019-12-29 2023-11-03 广东浪潮大数据研究有限公司 GPU (graphics processing unit) virtualization sharing method and device, electronic equipment and storage medium
CN111913794B (en) * 2020-08-04 2024-08-09 北京百度网讯科技有限公司 Method, apparatus, electronic device and readable storage medium for sharing GPU

Also Published As

Publication number Publication date
WO2024032587A1 (en) 2024-02-15

Similar Documents

Publication Publication Date Title
CN117632447A (en) GPU resource using method, GPU virtualization method, job scheduling device and cluster
US10764202B2 (en) Container-based mobile code offloading support system in cloud environment and offloading method thereof
KR101137172B1 (en) System, method and program to manage memory of a virtual machine
US9965826B2 (en) Resource management
KR20060071307A (en) Systems and methods for exposing processor topology for virtual machines
GB2506684A (en) Migration of a virtual machine between hypervisors
WO2022121866A1 (en) Acceleration card-based service running method, apparatus, electronic device, and computer-readable storage medium
US9817754B2 (en) Flash memory management
CN114138423B (en) Virtualized construction system and method based on domestic GPU graphics card
KR101765725B1 (en) System and Method for connecting dynamic device on mass broadcasting Big Data Parallel Distributed Processing
CN111831411B (en) Task processing method and device, storage medium and electronic equipment
CN112433823A (en) Apparatus and method for dynamically virtualizing physical card
US11048557B2 (en) Methods and modules relating to allocation of host machines
CN116578416B (en) Signal-level simulation acceleration method based on GPU virtualization
US9317306B2 (en) Computer device and memory management method thereof
US11853798B2 (en) Disaggregated memory pool assignment
CN111562883B (en) Cache management system, method and device for solid state disk
CN117632516A (en) Resource allocation method and device and computer equipment
CN110447019B (en) Memory allocation manager and method for managing memory allocation performed thereby
CN113268356B (en) LINUX system-based multi-GPU board card bounding system, method and medium
CN115712485A (en) Storage load balancing method, system and storage medium based on IOPS and space utilization rate
CN115080242A (en) Method, device and medium for unified scheduling of PCI equipment resources
CN115129449A (en) Resource scheduling method and device for shared GPU
CN117270987A (en) Application starting method and device, electronic equipment and computer readable storage medium
CN113986685A (en) Exception handling method, computing device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination