CN115686805A - GPU resource sharing method and device, and GPU resource sharing scheduling method and device - Google Patents


Info

Publication number: CN115686805A
Application number: CN202110832237.6A
Authority: CN (China)
Prior art keywords: gpu, container, virtual sub, real, resources
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 薛磊, 阎姝含
Current assignee (also original assignee): Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110832237.6A

Abstract

The application provides a method and an apparatus for GPU resource sharing, a method and an apparatus for scheduling GPU resource sharing, a computing device, and a computer storage medium. The GPU resource sharing method comprises the following steps: receiving a request to create a container; calling a kernel module running in the kernel mode of an operating system to create a virtual sub-GPU, so that the virtual sub-GPU occupies a portion of the resources of a real GPU; adapting the virtual sub-GPU to the operating system, so that the virtual sub-GPU is available to the container; and mounting the virtual sub-GPU to the container, so that after the container is created the virtual sub-GPU processes the container's computation requests using the occupied portion of the real GPU's resources.

Description

GPU resource sharing method and device, and GPU resource sharing scheduling method and device
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for GPU (graphics processing unit) resource sharing, a method and an apparatus for scheduling GPU resource sharing, a computing device, and a computer storage medium.
Background
With the development of computer technology, the demand for GPU resource sharing keeps increasing; for example, in the field of cloud computing, the size of GPU clusters is expanding rapidly. As an example, in situations where a large amount of GPU resources must be planned and laid out separately according to the demands of multiple processes, GPU resource sharing can improve GPU utilization. Currently, conventional GPU resource sharing methods generally implement sharing by creating and configuring containers.
However, some conventional GPU resource sharing methods do not support creating containers, in particular containers whose resources can be restricted. Other methods support creating such containers but implement resource sharing only in user mode, where the user has an opportunity to bypass the user-mode limits, ultimately resulting in low security of GPU resource sharing. Moreover, because most existing GPU resource sharing methods are implemented in user mode, and the user-mode environment is relatively complex and updated frequently, the GPU resource sharing methods must also be updated along with that environment, which greatly increases the complexity of GPU resource sharing.
Disclosure of Invention
In view of the above, the present disclosure provides methods and apparatus for GPU resource sharing, methods and apparatus for scheduling GPU resource sharing, computing devices, and computer storage media, which desirably overcome some or all of the above-referenced disadvantages and possibly others.
According to a first aspect of the present disclosure, there is provided a method for GPU resource sharing, comprising: receiving a request to create a container; calling a kernel module running in a kernel mode of an operating system to create a virtual sub-GPU, so that the virtual sub-GPU occupies a portion of the resources of a real GPU; adapting the virtual sub-GPU to the operating system so that the virtual sub-GPU is available to the container; and mounting the virtual sub-GPU to the container, so that after the container is created the virtual sub-GPU processes the container's computation requests using the occupied portion of the real GPU's resources.
In some embodiments, invoking a kernel module running in a kernel state of an operating system to create a virtual sub-GPU, such that the virtual sub-GPU occupies a portion of the resources of a real GPU, comprises: in response to a container to be created requesting GPU resources, invoking the kernel module running in the kernel state of the operating system to create the virtual sub-GPU, such that the virtual sub-GPU occupies a portion of the resources of the real GPU.
In some embodiments, the method further comprises: receiving a resource parameter indicating an amount of a partial resource of the real GPU requested by the container to be created; and invoking a kernel module running in a kernel state of the operating system to create a virtual sub-GPU, so that the virtual sub-GPU occupies the partial resource of the real GPU, comprises: calling the kernel module running in the kernel state of the operating system to create the virtual sub-GPU; and sending the resource parameter to the kernel module, such that the kernel module, in response to determining that the amount of the requested partial resource does not exceed the amount of unoccupied resources of the real GPU, configures, based on the resource parameter, the amount of the partial resource of the real GPU occupied by the virtual sub-GPU.
In some embodiments, mounting the virtual sub-GPU to the container comprises: setting the device name of the virtual sub-GPU to have the form of the device name of a real GPU; and mounting the virtual sub-GPU to the container, so that the container can identify the virtual sub-GPU by the device name of the virtual sub-GPU.
In some embodiments, the resource parameter includes an amount of video memory of the real GPU requested by the container to be created, and sending the resource parameter to the kernel module, such that the kernel module configures the amount of the partial resource of the real GPU occupied by the virtual sub-GPU based on the resource parameter in response to determining that the amount of the requested partial resource does not exceed the amount of unoccupied resources of the real GPU, comprises: sending the resource parameter to the kernel module, such that the kernel module, in response to determining that the requested amount of video memory does not exceed the amount of unoccupied video memory of the real GPU, configures, based on the requested amount of video memory, the amount of real GPU video memory occupied by the virtual sub-GPU.
In some embodiments, the resource parameters include a computing power of the real GPU requested by the container to be created, and sending the resource parameters to the kernel module, such that the kernel module configures the amount of the partial resource of the real GPU occupied by the virtual sub-GPU based on the resource parameters in response to determining that the amount of the requested partial resource does not exceed the amount of unoccupied resources of the real GPU, comprises: sending the resource parameters to the kernel module, such that the kernel module, in response to determining that the requested computing power does not exceed the unoccupied computing power of the real GPU, configures, based on the requested computing power, the computing power of the real GPU occupied by the virtual sub-GPU through the number of real GPU time slices allocated to the virtual sub-GPU.
In some embodiments, adapting the virtual sub-GPU to the operating system such that the virtual sub-GPU is available to the container comprises: adding the virtual sub-GPU to a resource control group of the operating system, the resource control group being used to limit, control, and isolate the resources occupied by processes, so that the virtual sub-GPU is available to the container.
According to a second aspect of the present disclosure, there is provided a method for scheduling GPU resource sharing, comprising: determining node information of a current node, the node information indicating resources of a real GPU of the current node; sending the node information of the current node to a scheduler, the scheduler being configured to schedule, based on the node information, a container to be created and run on the current node; receiving a scheduling request, issued by the scheduler to the current node, related to creating the container; and in response to the container to be created requesting GPU resources, causing the GPU resource sharing method according to the first aspect of the present disclosure to be performed, so that the container to be created occupies a portion of the resources of the real GPU of the current node.
In some embodiments, the current node comprises a created container, and determining the node information of the current node comprises: determining the total amount of real GPU resources of the current node and the amount of real GPU resources occupied by the created container.
In some embodiments, the resources of the real GPU of the current node comprise a video memory of the real GPU of the current node.
According to a third aspect of the present disclosure, there is provided an apparatus for GPU resource sharing, comprising: a receiving module configured to receive a request to create a container; a creating module configured to invoke a kernel module running in a kernel state of an operating system to create a virtual sub-GPU such that the virtual sub-GPU occupies a portion of resources of a real GPU; an adaptation module configured to adapt a virtual sub-GPU to the operating system such that the virtual sub-GPU is available for the container; a mounting module configured to mount the virtual sub-GPU to the container such that the virtual sub-GPU processes computational requests of the container using a portion of occupied real GPU resources after the container is created.
According to a fourth aspect of the present disclosure, an apparatus for scheduling GPU resource sharing is provided, comprising: an information determination module configured to determine node information for a current node, the node information indicating resources of a real GPU of the current node; an information sending module configured to send node information of a current node to a scheduler, the scheduler being configured to schedule a container to be created and run on the current node based on the node information; the scheduling request module is configured to receive a scheduling request which is issued by a scheduler to a current node and is related to the creation of the container; a scheduling module configured to cause a GPU resource sharing method according to the first aspect of the present disclosure to be performed in response to a container to be created requesting resources of a GPU, so as to cause the container to be created to occupy a portion of resources of a real GPU of a current node.
According to a fifth aspect of the present disclosure, there is provided a computing device comprising: a memory configured to store computer-executable instructions; a processor configured to perform any of the methods described above when the computer-executable instructions are executed by the processor.
According to a sixth aspect of the present disclosure, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed, perform any of the methods described above.
The method and apparatus for GPU resource sharing, the method and apparatus for scheduling GPU resource sharing, the computing device, and the computer storage medium claimed by the present disclosure create a virtual sub-GPU by calling a kernel module running in the kernel mode of the operating system, thereby sharing GPU resources. Because the kernel module that performs the core function (namely, creating the virtual sub-GPU) runs in kernel mode, a user cannot bypass the limits imposed by the kernel module, which improves the security of GPU resource sharing. Moreover, the GPU resource sharing scheme of the present disclosure supports containers, and the kernel mode is updated far less frequently than the user mode, so the inconvenience caused by frequent updates of the user environment is avoided, the GPU driver does not need to be modified, and the complexity of using the method is reduced.
These and other advantages of the present disclosure will become apparent from and elucidated with reference to the embodiments described hereinafter.
Drawings
Embodiments of the present disclosure will now be described in more detail and with reference to the accompanying drawings, in which:
fig. 1 illustrates an exemplary application scenario in which a technical solution according to an embodiment of the present disclosure may be implemented;
FIG. 2 illustrates a schematic flow chart diagram of a method of GPU resource sharing in accordance with one embodiment of the present disclosure;
FIG. 3 illustrates a schematic flow chart diagram of a method of creating a virtual sub-GPU in accordance with one embodiment of the present disclosure;
FIG. 4 illustrates a schematic flow chart diagram of a method of mounting a virtual sub-GPU to a container in accordance with one embodiment of the present disclosure;
FIG. 5 illustrates a schematic flow chart of a method of scheduling GPU resource sharing;
FIG. 6 illustrates an exemplary specific principle architecture diagram of a method of GPU resource sharing according to one embodiment of the present disclosure;
FIG. 7 illustrates an exemplary detailed schematic framework diagram of a kernel module running in the kernel mode of an operating system of a method of GPU resource sharing according to one embodiment of the present disclosure;
FIG. 8 illustrates an exemplary schematic architecture diagram for a method of limiting the video memory of a virtual sub-GPU according to one embodiment of the present disclosure;
FIG. 9 illustrates an experimental effect diagram of a method of limiting video memory of a virtual sub-GPU according to one embodiment of the present disclosure;
FIG. 10 illustrates a specific schematic architecture diagram for configuring the computing power of the virtual sub-GPUs to occupy a real GPU, according to one embodiment of the present disclosure;
FIG. 11 illustrates an exemplary specific principle architecture diagram of a method of scheduling GPU resource sharing according to one embodiment of the present disclosure;
FIG. 12 illustrates an experimental effect graph of GPU resource utilization in a GPU resource sharing method according to one embodiment of the disclosure;
FIG. 13A illustrates a graph of performance versus experimental results for a large task exclusive GPU resource in a GPU resource sharing method according to one embodiment of the disclosure;
FIG. 13B illustrates a graph of performance versus experimental results for a multitask shared GPU resource in a GPU resource sharing method according to one embodiment of the disclosure;
FIG. 14 illustrates an experimental effect graph when a small task monopolizes and shares GPU resources in a GPU resource sharing method according to one embodiment of the disclosure;
FIG. 15 illustrates an exemplary block diagram of a GPU resource sharing device, according to one embodiment of the present disclosure;
FIG. 16 illustrates an exemplary block diagram of an apparatus for scheduling GPU resource sharing according to one embodiment of the present disclosure;
fig. 17 illustrates an example system that includes an example computing device that represents one or more systems and/or devices that may implement the various techniques described herein.
Detailed Description
The following description provides specific details of various embodiments of the disclosure so that those skilled in the art can fully understand and practice the various embodiments of the disclosure. It is understood that aspects of the disclosure may be practiced without some of these details. In some instances, well-known structures or functions are not shown or described in detail in this disclosure to avoid obscuring the description of the embodiments of the disclosure by these unnecessary descriptions. The terminology used in the present disclosure should be interpreted in its broadest reasonable manner, even though it is being used in conjunction with a particular embodiment of the present disclosure.
First, some terms referred to in the embodiments of the present application are explained to facilitate understanding by those skilled in the art.
GPU: a Graphics Processing Unit (GPU), also called a display core, a visual processor, and a display chip, is a microprocessor that is specially used for image and graphics related operations on a personal computer, a workstation, a game machine, and some mobile devices (e.g., a tablet computer, a smart phone, etc.). The GPU reduces the dependence of the graphics card on the CPU, and performs part of the work of the original CPU, and particularly, the core technologies adopted by the GPU in 3D graphics processing include hardware T & L (geometric transformation and illumination processing), cubic environment texture mapping and vertex mixing, texture compression and bump mapping, a dual-texture four-pixel 256-bit rendering engine, and the like, and the hardware T & L technology can be said to be a mark of the GPU. Manufacturers of GPUs have primarily NVIDIA and ATI.
Operating system: an operating system (OS) is a computer program that manages computer hardware and software resources. The operating system handles basic tasks such as managing and configuring memory, prioritizing system resources, controlling input and output devices, operating the network, and managing the file system. The operating system also provides an operator interface through which the user interacts with the system. Common operating systems currently include Microsoft Windows, macOS, Linux, Unix, and others.
CUDA: CUDA (Compute Unified Device Architecture) is a computing platform offered by the graphics card vendor NVIDIA. CUDA is a general-purpose parallel computing architecture introduced by NVIDIA that enables the GPU to solve complex computational problems. It contains the CUDA instruction set architecture (ISA) and the parallel computing engine inside the GPU. Developers can use the C language to write programs for the CUDA architecture, which run at very high performance on CUDA-capable processors. Starting with CUDA 3.0, C++ and Fortran are also supported.
Container: an abstraction of application-layer code and its dependencies. Multiple containers run on the same physical machine (also called the host) and share the same operating system (OS) kernel. A container is distinct from a virtual machine, which is an abstraction of a physical device and turns one host into multiple machines, each providing a complete operating system. Applications in a container do not need to go through a virtualization layer, whereas applications in a virtual machine go through various virtualization processes and require many context switches, so their performance is poorer than that of applications in a container.
docker: docker is an open source application container engine that allows developers to package their applications and dependencies into a portable image and then release it to any popular Linux or Windows machine for virtualization.
User equipment: an electronic device on which various applications can be installed and which can display the objects provided by the installed applications; it may be mobile or fixed. For example, a mobile phone, a tablet computer, various wearable devices, a vehicle-mounted device, a personal digital assistant (PDA), a point-of-sale (POS) terminal, a monitoring device in a subway station, or another electronic device capable of implementing the above functions.
Artificial Intelligence (AI) is the theory, methods, techniques, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Kubernetes: abbreviated as K8s, an open-source platform for automated container operations, including deployment, scheduling, and scaling across node clusters. A Kubernetes cluster is a group of nodes, which may be physical servers or virtual machines, on which the Kubernetes platform is installed.
Pod: the smallest API object in a Kubernetes project, composed of one or more containers and related configuration information. In Kubernetes the most basic management unit is the Pod rather than the container; a Pod is a layer of packaging that Kubernetes places around containers and consists of a group of one or more containers running on the same host. Pods are used because the recommended usage of a container is to run only one process in it, while an application is generally composed of multiple components. The essence of a container is a process, that is, a process in the future cloud computing system, and the container image is the 'exe' installation package of that system; correspondingly, Kubernetes can be understood as the operating system.
Kubelet: Kubernetes is a distributed cluster management system, and a worker must run on each node that executes specific service containers to manage the life cycle of those containers; this worker program is the kubelet. The kubelet is the primary node agent running on each node: each node starts a kubelet process to handle the tasks issued to the node by the master node and to manage the Pods and their containers according to the PodSpec description (a PodSpec is a YAML or JSON object describing a Pod). In general, the kubelet's main work is creating and destroying Pods, and it monitors the Pods it is responsible for, namely the Pods assigned to its node.
Scheduler: the scheduler allocates the resources in the system to each running application according to constraints such as capacity and queues (for example, each queue is allocated a certain amount of resources and executes at most a certain number of jobs). The scheduler performs resource allocation purely according to the resource requirements of the applications, and the unit of resource allocation is the container, which limits the amount of resources used by each task. The scheduler is not responsible for monitoring or tracking application state, nor for the restarts that may be required for various reasons (these are handled by the ApplicationMaster). In short, the scheduler allocates resources encapsulated in containers to applications based on the applications' resource requirements and the resource conditions of the cluster machines.

Cloud technology: a general term for the network, information, integration, management-platform, application, and other technologies applied under the cloud computing business model; it allows resources to be pooled and used on demand, flexibly and conveniently. Cloud computing technology will become an important backbone. The background services of technical network systems, such as video websites, picture websites, and many web portals, require large amounts of computing and storage resources. With the rapid development and adoption of the internet industry, each item may come to have its own identification mark that must be transmitted to a background system for logical processing; data of different levels are processed separately, and all kinds of industry data require strong system background support, which can only be realized through cloud computing.
Cloud computing: a mode of delivering and using IT infrastructure in which the required resources are obtained over the network in an on-demand, easily scalable manner; cloud computing in the broad sense refers to a mode of delivering and using services in which the required services are obtained over the network in an on-demand, easily scalable manner. Such services may be IT and software services, internet-related services, or other services. Cloud computing is a product of the development and fusion of traditional computer and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage, virtualization, and load balancing.
The technical solution provided by the present application relates to virtualization technology in cloud computing, and in particular to GPU resource sharing technology.
Fig. 1 illustrates an exemplary application scenario 100 in which technical solutions according to embodiments of the present disclosure may be implemented. As shown in fig. 1, the illustrated application scenario includes a GPU 110 and a device 120 communicatively coupled to the GPU 110; the GPU 110 serves a plurality of containers that share its resources, and the GPU 110 may be configured inside the device 120 or external to the device 120.
As an example, when the device 120 generates a computation demand, it sends a computation request to the GPU 110 in the device and waits for the GPU 110 to return the computation result. When a large number of GPU resources need to be planned and laid out according to the requirements of multiple processes, GPU resource sharing can improve GPU utilization. In conventional methods of GPU resource sharing, sharing is typically achieved by creating and configuring containers. As shown in fig. 1, sharing of the resources of the GPU 110 is achieved by creating one or more containers in the user mode of the operating system of the device 120, which improves the working efficiency of the GPU 110 and ultimately the user's experience with the device 120.
As an example, when the method of GPU resource sharing of the present disclosure is implemented in user mode: first, a request to create a container is received; then, a kernel module running in the kernel mode of the operating system is called to create a virtual sub-GPU, so that the virtual sub-GPU occupies a portion of the resources of the real GPU; next, the virtual sub-GPU is adapted to the operating system so that the virtual sub-GPU is available to the container; and finally, the virtual sub-GPU is mounted to the container, so that after the container is created the virtual sub-GPU processes the container's computation requests using the occupied portion of the real GPU's resources.
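A minimal user-space sketch of this four-step flow is given below, written in Go; the request fields and the helper names (createVirtualGPU, addToCgroupAllowList, mountIntoContainer) are illustrative placeholders under stated assumptions, not part of the actual implementation described in the embodiments:

package main

import (
    "errors"
    "fmt"
)

// CreateRequest is a hypothetical shape for a container-creation request that
// also carries the GPU resources the container asks for.
type CreateRequest struct {
    ContainerID string
    MemoryBytes uint64 // requested video memory of the real GPU
    TimeSlices  int    // requested compute share, expressed in time slices
}

// The three helpers below are stubs standing in for the real work described
// in the embodiments that follow (kernel-module call, cgroup update, mount).
func createVirtualGPU(mem uint64, slices int) (string, error) {
    if mem == 0 {
        return "", errors.New("no video memory requested")
    }
    return "/dev/fgpu_nv0", nil // device node name is illustrative
}

func addToCgroupAllowList(containerID, dev string) error { return nil }
func mountIntoContainer(containerID, dev string) error   { return nil }

// shareGPU walks the four steps of method 200: receive the request, create a
// virtual sub-GPU via the kernel module, adapt it to the OS, and mount it.
func shareGPU(req CreateRequest) error {
    dev, err := createVirtualGPU(req.MemoryBytes, req.TimeSlices) // step 220
    if err != nil {
        return fmt.Errorf("create virtual sub-GPU: %w", err)
    }
    if err := addToCgroupAllowList(req.ContainerID, dev); err != nil { // step 230
        return err
    }
    return mountIntoContainer(req.ContainerID, dev) // step 240
}

func main() {
    req := CreateRequest{ContainerID: "demo", MemoryBytes: 100 << 20, TimeSlices: 2}
    fmt.Println(shareGPU(req))
}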
The scenario described above is only one example in which the embodiments of the present disclosure may be implemented and is not limiting. For example, the GPU 110 may be a common graphics processor, such as any of the various GPU models offered by NVIDIA and ATI. The device 120 may be a terminal, a server, or the like that is communicatively coupled to the GPU. When the device 120 is a terminal, it may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, or the like, without limitation here. When the device 120 is a server, it may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
Fig. 2 illustrates a schematic flow chart diagram of a method 200 of GPU resource sharing according to one embodiment of the present disclosure. The method of GPU resource sharing may be implemented, for example, on device 120 as shown in fig. 1. As shown in fig. 2, the method 200 includes the following steps.
At step 210, a request to create a container is received. By way of example, the request to create a container may be received from the docker engine or from other components.
In step 220, a kernel module running in the kernel mode of the operating system is called to create a virtual sub-GPU, so that the virtual sub-GPU occupies a portion of the resources of the real GPU. By way of example, the operating system may be Linux, Unix, or another operating system, which is not limited here. When the operating system is Linux, the kernel mode of the operating system is used to control the computer's hardware resources and to provide the environment in which upper-layer application programs run; processes in user mode can execute only by relying on the resources provided by the kernel (including CPU resources, storage resources, I/O resources, and so on).
In some embodiments, when the kernel module running in the kernel mode of the operating system is called to create the virtual sub-GPU so that the virtual sub-GPU occupies part of the resources of the real GPU, the kernel module may be called in response to the container to be created requesting GPU resources. As an example, upon finding that the container to be created requests GPU resources, a kernel module running in the kernel mode of the Linux operating system is called to create the virtual sub-GPU.
At step 230, the virtual sub-GPU is adapted to the operating system so that the virtual sub-GPU is available to the container. In some embodiments, this is done by adding the virtual sub-GPU to a resource control group of the operating system, which is used to limit, control, and isolate the resources occupied by processes. As an example, when the operating system is Linux, the virtual sub-GPU device may be added to the device whitelist of the Linux cgroup. Linux cgroup stands for Linux Control Group, a feature of the Linux kernel used to limit, control, and isolate the resources (such as CPU, memory, and disk input/output) of a process group.
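The following Go sketch shows one way this cgroup adaptation step could be performed, assuming the cgroup v1 devices controller; the container cgroup path and the device major/minor numbers are illustrative assumptions, not values defined by this disclosure:

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// allowDevice adds a character device to a container's devices-cgroup
// allow-list, which is what makes the virtual sub-GPU usable from inside
// the container. Paths and device numbers here are illustrative.
func allowDevice(containerCgroup string, major, minor int) error {
    allowFile := filepath.Join("/sys/fs/cgroup/devices", containerCgroup, "devices.allow")
    // "c" = character device, "rwm" = read, write and mknod permissions.
    rule := fmt.Sprintf("c %d:%d rwm", major, minor)
    return os.WriteFile(allowFile, []byte(rule), 0o644)
}

func main() {
    // Example: allow a hypothetical /dev/fgpu_nv0 with major 245, minor 0.
    if err := allowDevice("docker/abc123", 245, 0); err != nil {
        fmt.Fprintln(os.Stderr, "cgroup update failed:", err)
    }
}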
At step 240, mount the virtual sub-GPU to the container, so that the virtual sub-GPU processes the computation request of the container using the occupied partial resource of the real GPU after the container is created. As an example, after the virtual sub-GPU is mounted to the container, the computation request of the process in the container is not sent to the real GPU but sent to the virtual sub-GPU, and after the computation request of the container is received by the virtual sub-GPU, the computation request of the container is processed by using a part of resources of the real GPU occupied by the virtual sub-GPU.
The method 200 implements sharing of GPU resources by invoking a kernel module running in the kernel mode of the operating system to create virtual sub-GPUs. By mounting a virtual sub-GPU onto the container to be created, the computing requests of the processes in the container are sent to the virtual sub-GPU, and the kernel module can then call on the real GPU's resources to process the computing requests received by the virtual sub-GPU. Because the kernel module that performs the core function (namely, creating the virtual sub-GPU) runs in kernel mode, a user cannot bypass the limits imposed by the kernel module, which improves the security of GPU resource sharing. Moreover, the GPU resource sharing scheme of the present disclosure supports containers, and the kernel mode is updated far less frequently, so the inconvenience caused by frequent updates of the user environment is avoided and the complexity of using the method is reduced.
In some embodiments, the method 200 of GPU resource sharing according to the embodiment of the present disclosure illustrated in fig. 2 further comprises: receiving a resource parameter indicating the amount of the partial resource of the real GPU requested by the container to be created. In this case, a schematic flow of calling a kernel module running in the kernel mode of the operating system to create a virtual sub-GPU, so that the virtual sub-GPU occupies part of the resources of the real GPU, may be as shown in fig. 3. As an example, the resource parameters may include the video memory and computing power of the real GPU requested by the container to be created, e.g., 1/4 of the total computing power of the real GPU and 100 MiB of video memory. FIG. 3 illustrates a schematic flow chart diagram of a method 300 of creating a virtual sub-GPU in accordance with one embodiment of the present disclosure, which includes: in step 310, calling a kernel module running in the kernel mode of the operating system to create a virtual sub-GPU; in step 320, sending the resource parameters to the kernel module, so that the kernel module, in response to determining that the amount of the requested partial resource does not exceed the amount of unoccupied resources of the real GPU, configures, based on the resource parameters, the amount of the partial resource of the real GPU occupied by the virtual sub-GPU.
In some embodiments, the resource parameters include the amount of video memory of the real GPU requested by the container to be created. In this case, sending the resource parameters to the kernel module causes the kernel module, in response to determining that the requested amount of video memory does not exceed the amount of unoccupied video memory of the real GPU, to configure, based on the requested amount of video memory, the amount of real GPU video memory occupied by the virtual sub-GPU.
As an example, if the requested amount exceeds the amount of unoccupied resources of the real GPU (for example, 400 MiB of video memory is requested while only 200 MiB of the real GPU is unoccupied), configuring 400 MiB of video memory for the virtual sub-GPU will be rejected. If the requested amount does not exceed the unoccupied amount (for example, 50 MiB is requested and 100 MiB is unoccupied), 50 MiB of video memory will be configured for the virtual sub-GPU based on the resource parameters.
In some embodiments, the resource parameters include the computing power of the real GPU requested by the container to be created. In this case, sending the resource parameters to the kernel module causes the kernel module, in response to determining that the requested computing power does not exceed the unoccupied computing power of the real GPU, to configure, based on the requested computing power, the computing power of the real GPU occupied by the virtual sub-GPU through the number of real GPU time slices allocated to the virtual sub-GPU.
As an example, if the requested computing power exceeds the unoccupied computing power of the real GPU (for example, 1/4 of the total computing power of the real GPU is requested while only 1/8 is unoccupied), configuring the virtual sub-GPU with 1/4 of the real GPU's total computing power will be rejected. If the requested computing power does not exceed the unoccupied computing power (for example, 1/16 of the total computing power is requested and 1/4 is unoccupied), the virtual sub-GPU is configured with 1/16 of the total computing power of the real GPU.
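The two admission checks above can be summarized in the following Go sketch; the struct fields, units, and numbers are illustrative assumptions rather than the kernel module's actual interface:

package main

import (
    "errors"
    "fmt"
)

// GPUState tracks what is still unoccupied on one real GPU; the fields and
// units are assumptions for illustration only.
type GPUState struct {
    FreeMemoryBytes uint64
    FreeTimeSlices  int // unoccupied compute, expressed in time slices
}

// admit applies the two checks described above: the requested video memory
// must not exceed the unoccupied video memory, and the requested compute
// (time slices) must not exceed the unoccupied compute of the real GPU.
func admit(g *GPUState, reqMem uint64, reqSlices int) error {
    if reqMem > g.FreeMemoryBytes {
        return errors.New("requested video memory exceeds unoccupied video memory")
    }
    if reqSlices > g.FreeTimeSlices {
        return errors.New("requested computing power exceeds unoccupied computing power")
    }
    g.FreeMemoryBytes -= reqMem
    g.FreeTimeSlices -= reqSlices
    return nil
}

func main() {
    gpu := &GPUState{FreeMemoryBytes: 100 << 20, FreeTimeSlices: 4}
    fmt.Println(admit(gpu, 50<<20, 1))  // accepted: <nil>
    fmt.Println(admit(gpu, 400<<20, 1)) // rejected: exceeds unoccupied video memory
}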
Therefore, the method 300 shown in fig. 3 can effectively limit the GPU resources occupied by the container, avoid overload caused by the fact that the total amount of the GPU resources applied by the container exceeds the real amount of the GPU resources, reduce the risk of error reporting of the process due to the overload of the GPU resources, and improve the safety and stability of process operation.
Fig. 4 illustrates a schematic flow chart diagram of a method 400 of mounting a virtual sub-GPU to a container in accordance with one embodiment of the present disclosure. As shown in fig. 4, in some embodiments, mounting the virtual sub-GPU to the container comprises: at step 410, setting the device name of the virtual sub-GPU to have the form of a real GPU device name; at step 420, mounting the virtual sub-GPU to the container so that the container can identify the virtual sub-GPU by its device name. As an example, if the device name of the real GPU is /dev/nvidia and the created virtual sub-GPU is /dev/fgpu_nv, the device name of the virtual sub-GPU may be modified or set to /dev/nvidiaX (X being a natural number), so that the container can identify the virtual sub-GPU as a real GPU by its device name. As an example, in step 420, if the operating system is Linux, the virtual sub-GPU may be mounted to the container by calling the libfgpu-container library. Here, fgpu-container-runtime-hook is a container runtime developed by the inventors that makes secondary modifications to runc (a lightweight tool for creating and running containers according to the OCI standard) and injects a custom prestart hook into all containers that specify the fgpu runtime, so that the containers support virtual GPU operation. It can be seen that the method 400 shown in fig. 4 enables the virtual sub-GPU to be mounted to the container, so that the container's computation requests are sent to the virtual sub-GPU rather than directly to the real GPU, which provides the basis for sharing GPU resources.
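A minimal Go sketch of this mounting step is given below, assuming the golang.org/x/sys/unix package; the device numbers, the container rootfs path, and the function name are illustrative assumptions, and the real libfgpu-container implementation is not reproduced here:

package main

import (
    "fmt"
    "path/filepath"

    "golang.org/x/sys/unix"
)

// exposeAsRealGPU creates a device node inside the container's root
// filesystem whose name has the form of a real GPU device (/dev/nvidiaX),
// but whose major/minor numbers point at the virtual sub-GPU created by the
// kernel module. All numbers and paths here are illustrative.
func exposeAsRealGPU(containerRootfs string, index, major, minor int) error {
    target := filepath.Join(containerRootfs, "dev", fmt.Sprintf("nvidia%d", index))
    mode := uint32(unix.S_IFCHR | 0o666) // character device, world read/write
    dev := unix.Mkdev(uint32(major), uint32(minor))
    return unix.Mknod(target, mode, int(dev))
}

func main() {
    // The container now sees /dev/nvidia0, which is really the virtual sub-GPU.
    if err := exposeAsRealGPU("/var/lib/docker/containers/abc/rootfs", 0, 245, 0); err != nil {
        fmt.Println("mknod failed:", err)
    }
}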
FIG. 5 illustrates a schematic flow chart diagram of a method 500 of scheduling GPU resource sharing. The method of scheduling GPU resource sharing may be implemented, for example, on device 120 as shown in fig. 1. As shown in fig. 5, the method 500 includes the following steps.
At step 510, node information of the current node is determined, the node information indicating the resources of the real GPU of the current node. As an example, the resources of the real GPU of the current node include the video memory and computing power of the real GPU of the current node; for example, if the current node has 2 real GPUs and each real GPU has 600 MiB of video memory, the amount of real GPU video memory of the current node is 1200 MiB.
In some embodiments, the current node includes a created container. In this case, when determining the node information of the current node, the total amount of real GPU resources of the current node and the amount of real GPU resources occupied by the created container may be determined. As an example, it is determined that the total video memory in the real GPU resources of the current node is 1000 MiB and the video memory occupied by the created container is 600 MiB.
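A small Go sketch of this bookkeeping, using the example numbers above; the struct and field names are illustrative assumptions rather than the actual reporting format:

package main

import "fmt"

// NodeGPUInfo is a hypothetical summary of the real GPU resources of one
// node, as it might be reported to the scheduler.
type NodeGPUInfo struct {
    TotalMemoryMiB    uint64 // total video memory of all real GPUs on the node
    OccupiedMemoryMiB uint64 // video memory already taken by created containers
}

// Available returns what a new container could still be given on this node.
func (n NodeGPUInfo) Available() uint64 {
    if n.OccupiedMemoryMiB > n.TotalMemoryMiB {
        return 0
    }
    return n.TotalMemoryMiB - n.OccupiedMemoryMiB
}

func main() {
    info := NodeGPUInfo{TotalMemoryMiB: 1000, OccupiedMemoryMiB: 600}
    fmt.Println(info.Available(), "MiB of video memory remain for new containers")
}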
In step 520, the node information of the current node is sent to a scheduler, which is configured to schedule, based on the node information, a container to be created and run on the current node. In some embodiments, the scheduler may be, for example, a Kubernetes scheduler, a GPU Admission scheduler, or the like.
In step 530, a scheduling request associated with creating a container, which is issued by the scheduler to the current node, is received. In some embodiments, the scheduling request issued by the scheduler to the current node in connection with creating the container includes the amount of video memory that will be occupied by the container to be created.
In step 540, in response to the container to be created requesting resources of the GPU, the GPU resource sharing method (e.g., method 200) described above is caused to be executed so as to cause the container to be created to occupy a portion of the resources of the real GPU of the current node.
In some embodiments, the resources of the real GPU of the current node comprise a video memory of the real GPU of the current node. As an example, if the video memory of the real GPU of the current node is 600MiB, the information will be included in the node information of the current node. Since the resources of the real GPU of the current node include the video memory of the real GPU of the current node, the container to be created occupies a portion of the video memory of the real GPU of the current node when the GPU resource sharing method (e.g., method 200) described above is executed.
With this method of scheduling GPU resource sharing, the video memory within the shared GPU resources can be scheduled easily and efficiently.
FIG. 6 illustrates an exemplary specific principle architecture diagram of a method of GPU resource sharing according to one embodiment of the present disclosure. In the embodiment shown in fig. 6, the GPU is of the Nvidia series, and the operating system is Linux.
As shown in fig. 6, GPU resource sharing is achieved through the docker toolkit and components developed on top of it. The components developed on top of the docker toolkit include fgpu-container-runtime-hook, which calls the prestart hook during GPU resource sharing, and libfgpu-container, which mounts the fGPU into the container. fgpu-container-runtime-hook is a container runtime developed by the inventors; it makes secondary modifications to runc (a lightweight tool for creating and running containers according to the OCI standard) and injects a custom prestart hook into all containers that specify the fgpu runtime, so that the containers support virtual GPU operation. libfgpu-container provides a library and a simple CLI program with which the fGPU can be used by Linux containers. The specific steps are as follows: dockerd is invoked first, and a request to create a container is then issued through docker-containerd and its shim. After runc receives the request to create the container, it calls fgpu-container-runtime-hook in step 601; then, in step 602, the prestart hook of fgpu-container-runtime-hook calls libfgpu-container. Next, in step 603, libfgpu-container requests the fGPU kernel module (a kernel module running in the kernel mode of the Linux operating system) to create a virtual sub-GPU and sets the parameters of the virtual sub-GPU, such as the video memory size and the number of time slices. At this point, if the requested parameters of the virtual sub-GPU exceed the amount of unoccupied resources of the real GPU, creation of the virtual sub-GPU is rejected. Then, in step 604, libfgpu-container adds the virtual sub-GPU device to the device whitelist in the Linux cgroup. In step 605, libfgpu-container sets the device name of the virtual sub-GPU to the form of a real GPU device name and mounts the virtual sub-GPU device into the container as if it were a real GPU device, so that after the container is created the virtual sub-GPU processes the container's computation requests using the occupied portion of the real GPU's resources. Finally, in step 606, runc finishes creating the container with the mounted virtual sub-GPU. A container with a mounted virtual sub-GPU can occupy real GPU resources according to its configuration, achieving sharing of the real GPU's resources.
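The hook-injection portion of steps 601 and 602 can be sketched in Go as an edit of an OCI runtime config.json; the hook binary path and argument list below are assumptions for illustration, not the actual fgpu-container-runtime-hook interface:

package main

import (
    "encoding/json"
    "fmt"
    "os"
)

// injectPrestartHook appends a custom prestart hook to the container's OCI
// runtime spec so that the hook (and, through it, libfgpu-container) runs
// before the container starts. The spec is kept as a generic map so the rest
// of the configuration is preserved untouched.
func injectPrestartHook(specPath string) error {
    raw, err := os.ReadFile(specPath)
    if err != nil {
        return err
    }
    var spec map[string]any
    if err := json.Unmarshal(raw, &spec); err != nil {
        return err
    }
    hooks, _ := spec["hooks"].(map[string]any)
    if hooks == nil {
        hooks = map[string]any{}
        spec["hooks"] = hooks
    }
    prestart, _ := hooks["prestart"].([]any)
    hooks["prestart"] = append(prestart, map[string]any{
        "path": "/usr/bin/fgpu-container-runtime-hook", // illustrative path
        "args": []string{"fgpu-container-runtime-hook", "prestart"},
    })
    out, err := json.MarshalIndent(spec, "", "  ")
    if err != nil {
        return err
    }
    return os.WriteFile(specPath, out, 0o644)
}

func main() {
    if err := injectPrestartHook("config.json"); err != nil {
        fmt.Fprintln(os.Stderr, err)
    }
}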
Fig. 7 illustrates an exemplary detailed schematic framework diagram of a kernel module running in the kernel mode of an operating system of a method of GPU resource sharing according to one embodiment of the present disclosure. In the embodiment shown in fig. 7, the GPU is of an Nvidia series, the operating system is Linux, and the kernel module is an fgpu kernel module.
As shown in fig. 7, the fGPU kernel module provides virtual GPU devices (fGPU devices) to the container, namely /dev/fgpu_ctl and /dev/fgpu_nv in the figure, and the parameters of the fGPU devices are configured through the procfs configuration interface. The number of fGPU devices is not limited here; an upper limit on the number of fGPU devices is typically defined by the fGPU kernel module. The fGPU devices are then renamed to the form of the real GPU's device names, /dev/nvidia_ctl and /dev/nvidia_nvX (X depends on the upper limit on the number of fGPU devices defined by the fGPU kernel module; for example, X takes a natural number between 0 and 9 when the upper limit is defined as 10), so that the user-mode container can identify the fGPU device as a real GPU device and send computation requests to it. Based on the parameters of the fGPU device, the fGPU kernel module can limit the amount of real GPU video memory and computing power occupied by the fGPU device, thereby controlling the virtual sub-GPU to occupy a portion of the real GPU's resources and achieving sharing of real GPU resources.
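A Go sketch of how user space might write such parameters through a procfs-style interface follows; the /proc paths and file names are assumptions for illustration, since the real interface is defined by the fGPU kernel module itself:

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// configureFGPU writes per-device parameters through a procfs-style
// configuration interface exported by the kernel module.
func configureFGPU(index int, memoryBytes uint64, timeSlices int) error {
    base := filepath.Join("/proc/fgpu", fmt.Sprintf("fgpu%d", index)) // hypothetical path
    if err := os.WriteFile(filepath.Join(base, "mem_limit"),
        []byte(fmt.Sprintf("%d", memoryBytes)), 0o644); err != nil {
        return err
    }
    return os.WriteFile(filepath.Join(base, "time_slices"),
        []byte(fmt.Sprintf("%d", timeSlices)), 0o644)
}

func main() {
    // Limit virtual sub-GPU 0 to 10 GB of video memory and 2 time slices.
    if err := configureFGPU(0, 10_000_000_000, 2); err != nil {
        fmt.Fprintln(os.Stderr, err)
    }
}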
In this embodiment, the process by which the fGPU kernel module limits the real GPU video memory occupied by the fGPU device may be performed as shown in fig. 8. Fig. 8 illustrates an exemplary specific schematic architecture diagram of a method of limiting the video memory of a virtual sub-GPU according to one embodiment of the present disclosure. As shown in fig. 8, when a process in a container requests video memory from the fGPU device mounted in the container, the video memory request is sent to the fGPU kernel module. After receiving the user process's video memory request, the fGPU kernel module determines whether the request exceeds the limit, for example, whether the video memory already used plus the requested video memory exceeds the video memory upper limit set for the container. If the limit would be exceeded, the fGPU kernel module rejects the video memory request; otherwise it allows the allocation and communicates with the real GPU through the Nvidia driver.
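The decision described above can be sketched as follows; this is a user-space Go illustration of the accounting, not the kernel module's actual code, and the numbers are illustrative:

package main

import (
    "errors"
    "fmt"
    "sync"
)

// memLimiter mirrors the per-container accounting described above: each
// allocation request is admitted only if used + requested stays within the
// container's configured video-memory limit.
type memLimiter struct {
    mu    sync.Mutex
    limit uint64
    used  uint64
}

var errOverLimit = errors.New("allocation would exceed the container's video memory limit")

func (m *memLimiter) allocate(n uint64) error {
    m.mu.Lock()
    defer m.mu.Unlock()
    if m.used+n > m.limit {
        return errOverLimit // reject; the request never reaches the real GPU
    }
    m.used += n // allow; forward to the real GPU through the driver
    return nil
}

func main() {
    lim := &memLimiter{limit: 10_000_000_000} // about 9536 MiB, as in the example below
    fmt.Println(lim.allocate(4 << 30))        // allowed
    fmt.Println(lim.allocate(8 << 30))        // rejected: 4 GiB + 8 GiB exceeds the limit
}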
As an example, fig. 9 illustrates an effect diagram of limiting the video memory of a virtual sub-GPU according to one embodiment of the present disclosure. As shown in fig. 9, the available video memory size of the fGPU device mounted in the container is set through TENCENT_GPU_MEM_CONTAINER, here to 10000000000 bytes. After the setting is completed, the amount of real GPU video memory actually available to the fGPU device is 9536 MiB, which is consistent with the configured available video memory size (10000000000 bytes) of the fGPU device. The experiment verifies that the method in the embodiments of the present disclosure can effectively limit the amount of real GPU video memory available to the virtual sub-GPU.
In this embodiment, the process by which the fGPU kernel module limits the real GPU computing power occupied by the fGPU device may proceed as shown in fig. 10. FIG. 10 illustrates a specific schematic architecture diagram for configuring the computing power of the virtual sub-GPUs occupying a real GPU, according to one embodiment of the disclosure. In the method illustrated in FIG. 10, a time-slice-based resource isolation algorithm may be used, where the number of time slices held by a container may be set through TENCENT_GPU_WEIGHT_CONTAINER. As shown in FIG. 10, the number of time slices held by container A is set to 2, and the time slices of containers B and C are set to 1. When containers A and C are running a CUDA process and container B is not, only the time slices of containers A and C participate in scheduling, and the time slice of container B is skipped; at this time, the ratio of real GPU time slices occupied by container A to container C is 2:1. Later, if container B runs a CUDA process, the time slice of container B also participates in scheduling. In this way, the virtual sub-GPUs mounted to the containers are limited in the real GPU time slices they occupy, the computing power of the real GPU occupied by each virtual sub-GPU can be effectively limited, and the computing requirements of the processes in each container are effectively coordinated and satisfied.
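A Go sketch of one round of this weighted time-slice scheduling follows, matching the example above (A holds 2 slices, B and C hold 1 each, B is idle); the struct fields are illustrative:

package main

import "fmt"

// container carries the time-slice weight configured for its virtual sub-GPU
// and whether it currently has CUDA work pending; both fields are illustrative.
type container struct {
    name   string
    slices int
    active bool
}

// schedule builds one scheduling round: each active container appears as many
// times as it holds time slices, and idle containers are skipped, matching the
// behaviour described above (B idle gives A:C = 2:1).
func schedule(cs []container) []string {
    var round []string
    for _, c := range cs {
        if !c.active {
            continue // idle containers do not consume real GPU time
        }
        for i := 0; i < c.slices; i++ {
            round = append(round, c.name)
        }
    }
    return round
}

func main() {
    cs := []container{{"A", 2, true}, {"B", 1, false}, {"C", 1, true}}
    fmt.Println(schedule(cs)) // [A A C]
}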
Fig. 11 illustrates an exemplary specific schematic architecture diagram of a method of scheduling GPU resource sharing according to one embodiment of the present disclosure. As shown in fig. 11, in one embodiment, the open-source GPU Admission from TKEStack may be adopted as the scheduler; GPU Admission is a GPU-sharing extension scheduler (scheduler extender) for the Kubernetes scheduler that supports the Scheduler Framework.
In the method for scheduling GPU resource sharing shown in fig. 11, the Kubernetes (K8s) scheduler schedules, through the GPU-sharing extended scheduler, a plurality of nodes (e.g., N1, N2, N3 in the figure) containing virtual sub-GPUs under a GPU Share Registry; each node contains one or more real GPUs whose resources can be used. Each node also contains the virtual sub-GPUs that have already been created (for example, GPU 0 and GPU 1 in N1 are virtual sub-GPUs that already occupy a portion of the resources of the real GPUs) and the Pods to which these virtual sub-GPUs are mounted (in a Kubernetes cluster, the Pod is the basis of all workload types and the smallest unit managed by K8s; it is a combination of one or more containers, and in this embodiment each Pod contains one container). The fGPU device plug-in (fGPU-device-plugin) on each node reports the node's fGPU resources, for example the total amount of video memory of all GPUs on the node, to the node's Kubelet. The Kubelet then reports this node information to the API Server, so that all schedulers can learn the node information of each node, including the amount of fGPU resources on each node, through the watch interface of the API Server. The API Server provides interfaces for creating, deleting, updating, querying, and watching all kinds of K8s resource objects (Pod, RC, Service, and the like) and is the data bus and data hub of the whole system. When a user issues a request to create a Pod (i.e., a request to create a container) to the API Server, the K8s scheduler learns of the new Pod creation request through the watch interface of the API Server and starts to schedule a node for this Pod. The K8s scheduler first checks whether the Pod requests fGPU resources; if it does, the K8s scheduler asks the GPU-sharing extended scheduler to schedule the Pod. After receiving the scheduling task, the GPU-sharing extended scheduler first determines the amount of unoccupied resources of the real GPUs on each node and, through a series of scheduling algorithms, decides to schedule the new Pod onto a certain real GPU of a certain node. Of course, if the resources on all nodes do not satisfy the request, the next scheduling cycle is entered, until the Pod can be scheduled. When scheduling succeeds, the GPU-sharing extended scheduler adds an entry in the Pod's annotation (a note in K8s, used for recording information such as the Docker image) to describe which GPU of which node this Pod is bound to. The GPU-sharing extended scheduler then returns the scheduling result to the K8s scheduler, and the K8s scheduler reports the scheduling result to the API Server. When the Kubelet finds, by watching the API Server, that a Pod has been scheduled onto its node, the Kubelet starts to create the Pod. If the Kubelet finds that the Pod requests fGPU resources, it sends a request to the fGPU-device-plugin. After receiving the request, the fGPU-device-plugin checks the Pod's annotation to determine which GPU this Pod needs to be bound to. The fGPU-device-plugin also sets the container's environment variables according to the Pod's annotation and the amount of resources the Pod requests. The Kubelet then starts creating a container with an fGPU device, and the process of creating such a container may use the method of the embodiment shown in fig. 6.
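The fit check performed by the GPU-sharing extended scheduler can be sketched as follows; the struct fields and the first-fit strategy are illustrative assumptions and do not reflect the actual GPU Admission API or scoring algorithms:

package main

import (
    "errors"
    "fmt"
)

// gpu and node summarize the cluster state the extended scheduler works from.
type gpu struct {
    freeMemMiB uint64
}

type node struct {
    name string
    gpus []gpu
}

// place picks the first node and real GPU whose unoccupied video memory can
// hold the Pod's request; a real scheduler would apply richer scoring, this
// only sketches the fit check described above.
func place(nodes []node, requestMiB uint64) (nodeName string, gpuIndex int, err error) {
    for _, n := range nodes {
        for i, g := range n.gpus {
            if g.freeMemMiB >= requestMiB {
                return n.name, i, nil
            }
        }
    }
    return "", -1, errors.New("no GPU fits the request; retry in the next scheduling cycle")
}

func main() {
    nodes := []node{
        {name: "N1", gpus: []gpu{{freeMemMiB: 200}, {freeMemMiB: 800}}},
        {name: "N2", gpus: []gpu{{freeMemMiB: 1000}}},
    }
    fmt.Println(place(nodes, 600)) // N1 1 <nil>, which would be recorded in the Pod annotation
}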
For the cluster administrator, the compute mode of a node may be set, for example, to a fair mode or an efficiency (best-effort) mode. When the node is set to the fair mode, the time slice is set longer, so that the computing performance of a container is more stable and is not affected by the computing tasks of other containers; under the same conditions, a container holding two time slices will then perform twice as well as a container holding one time slice. The performance difference between the fair mode and the efficiency mode is small when a container exclusively occupies the real GPU's resources, but when real GPU resources are shared the fair mode performs worse than the efficiency mode. When the node is set to the efficiency mode, the time slice is set shorter, and when real GPU resources are shared its performance is higher than in the fair mode, approaching that of bare metal (that is, a real GPU mounted directly to the container). In the efficiency mode, however, the performance of a container's tasks may be affected by the computing tasks of other containers occupying the same real GPU's resources.
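A small Go sketch of how the node's compute mode could map to a time-slice length; the concrete durations are placeholders and are not values given in this disclosure:

package main

import (
    "fmt"
    "time"
)

// sliceLength maps the node's compute mode to a time-slice length: the fair
// mode uses longer slices for stable per-container performance, the efficiency
// (best-effort) mode uses shorter slices for higher throughput when sharing.
func sliceLength(mode string) time.Duration {
    switch mode {
    case "fair":
        return 20 * time.Millisecond
    case "best-effort":
        return 2 * time.Millisecond
    default:
        return 10 * time.Millisecond
    }
}

func main() {
    fmt.Println(sliceLength("fair"), sliceLength("best-effort"))
}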
To verify the efficiency of the GPU resource sharing approach shown in this disclosure, we performed a series of experiments. The GPUs used in the experiments were all NVIDIA Tesla V100, and the workloads were all TensorFlow benchmarks (tf_cnn_benchmarks). In the experiments we select different data sets and models and adjust the batch size (batch_size) and XLA (Accelerated Linear Algebra) in order to comprehensively evaluate the technical effects of the present disclosure. The experimental procedure was as follows: python3 ./scripts/tf_cnn_benchmarks.py --data_name=A --model=B --num_batches=x --batch_size=y --display_every=z --xla.
Fig. 12 illustrates an experimental effect graph of GPU resource utilization in a GPU resource sharing method according to an embodiment of the present disclosure. As shown in fig. 12, when the node is in the efficiency mode, a single task that exclusively occupies the real GPU reaches only about 40% utilization of the real GPU resources, whereas when two such tasks share the real GPU, the utilization of the real GPU resources rises to about 80%. This indicates that, after the GPU resource sharing method of the present disclosure is used, the utilization of GPU resources is roughly doubled.
To further verify the beneficial effects of the method shown in the present disclosure, we adopted several typical GPU resource sharing schemes developed by well-known development teams as a control group, including cGPU, vGPU soft, vGPU hard, and vCUDA soft. To test the effectiveness of the method proposed by the present disclosure on different tasks, we performed experiments for large tasks and small tasks respectively. A large task refers to a task whose GPU resource utilization reaches about 90% when it exclusively occupies the GPU, and a small task refers to a task whose GPU resource utilization is below 50% when it exclusively occupies the GPU.
Fig. 13A and 13B respectively show graphs comparing the performance of the technical solution of one embodiment of the present disclosure with that of the control group when a large task exclusively occupies GPU resources and when a large task shares GPU resources. In the figures, bs denotes the batch size (batch_size, characterizing the amount of data in each batch), and XLA denotes Accelerated Linear Algebra (used to accelerate the training of neural network models). As can be seen from fig. 13A and 13B, when a large task is executed, whether it exclusively occupies GPU resources or shares GPU resources, the GPU resource sharing method (fGPU) proposed by the present disclosure is significantly better than the control group and even approaches the operating efficiency of bare metal.
FIG. 14 illustrates a graph comparing the performance of the technical solution of an embodiment of the present disclosure with that of the control group when a small task exclusively occupies GPU resources and when it shares GPU resources. We selected the resnet20 model from tf_cnn_benchmarks and the cifar10 data set, and performed multiple groups of experiments by adjusting the batch size and XLA. As can be seen from fig. 14, when a small task is executed, whether it exclusively occupies GPU resources or shares GPU resources, the GPU resource sharing method (fGPU) proposed by the present disclosure already approaches the operating efficiency of bare metal and is significantly better than the control group.
Fig. 15 illustrates an exemplary block diagram of an apparatus 1500 for GPU resource sharing according to an embodiment of the present disclosure. As shown in fig. 15, the apparatus 1500 for GPU resource sharing includes a receiving module 1510, a creating module 1520, an adapting module 1530, and a mounting module 1540.
The receiving module 1510 is configured to receive a request to create a container.
The creating module 1520 is configured to call a kernel module running in a kernel state of the operating system to create the virtual sub-GPU, so that the virtual sub-GPU occupies a part of resources of the real GPU.
The adaptation module 1530 is configured to adapt the virtual sub-GPU to the operating system such that the virtual sub-GPU is available for the container.
The mount module 1540 is configured to mount the virtual sub-GPU to the container, so that the virtual sub-GPU processes the computation request of the container using the occupied partial resource of the real GPU after the container is created.
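For illustration, the cooperation of the four modules of the apparatus 1500 can be sketched as follows. The class and method names (FGPUSharingApparatus, create_virtual_gpu, etc.) and the placeholder kernel-module callable are assumptions made for illustration; they are not the embodiment's actual driver interface.

```python
# Illustrative sketch of the receive -> create -> adapt -> mount flow of
# apparatus 1500; the kernel module is represented by a placeholder callable.
from dataclasses import dataclass


@dataclass
class VirtualSubGPU:
    device_name: str   # set to the form of a real GPU device name (hypothetical)
    mem_mib: int       # portion of the real GPU's video memory it occupies
    time_slices: int   # portion of the real GPU's computing power it occupies


class FGPUSharingApparatus:
    def __init__(self, kernel_module):
        # kernel_module stands in for the module running in the kernel mode of
        # the operating system (e.g. reached via an ioctl or /proc interface).
        self.kernel_module = kernel_module

    def handle_create_request(self, container_id: str, mem_mib: int, time_slices: int):
        # Receiving module: a request to create a container has arrived.
        vgpu = self.create_virtual_gpu(mem_mib, time_slices)   # creating module
        self.adapt_to_operating_system(vgpu)                   # adaptation module
        self.mount_to_container(vgpu, container_id)            # mounting module
        return vgpu

    def create_virtual_gpu(self, mem_mib: int, time_slices: int) -> VirtualSubGPU:
        # Ask the kernel module to create a virtual sub-GPU occupying part of a real GPU.
        return self.kernel_module("create", mem_mib=mem_mib, time_slices=time_slices)

    def adapt_to_operating_system(self, vgpu: VirtualSubGPU) -> None:
        # e.g. add the virtual sub-GPU to the operating system's resource control group.
        pass

    def mount_to_container(self, vgpu: VirtualSubGPU, container_id: str) -> None:
        # e.g. hand the device to the container runtime so the container can
        # identify the virtual sub-GPU by its device name.
        pass
```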
Fig. 16 illustrates an exemplary block diagram of an apparatus 1600 for scheduling GPU resource sharing according to an embodiment of the present disclosure. As shown in fig. 16, the apparatus 1600 for scheduling GPU resource sharing includes an information determining module 1610, an information sending module 1620, a scheduling request module 1630, and a scheduling module 1640.
The information determining module 1610 is configured to determine node information of the current node, the node information indicating resources of a real GPU of the current node.
The information sending module 1620 is configured to send node information of the current node to a scheduler, and the scheduler is configured to schedule a container to be created and run on the current node based on the node information.
The scheduling request module 1630 is configured to receive a scheduling request related to creating a container, which is issued by the scheduler to the current node.
The scheduling module 1640 is configured to, in response to the container to be created requesting resources of the GPU, cause the GPU resource sharing method 200 as shown in fig. 2 to be performed, so that the container to be created occupies a portion of the resources of the real GPU of the current node.
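A corresponding node-side sketch of the apparatus 1600 is given below. The entry points (report_node_info, on_schedule_request) and the scheduler transport are assumptions made for illustration only.

```python
# Illustrative sketch of apparatus 1600: report real-GPU resources of the node,
# then trigger the GPU resource sharing method when a scheduled container
# requests GPU resources.
from typing import Callable, Dict


class FGPUSchedulingApparatus:
    def __init__(self, node_name: str, gpu_mem_mib: Dict[int, int],
                 run_sharing_method: Callable[[str, int], None]):
        self.node_name = node_name
        self.gpu_mem_mib = gpu_mem_mib          # real GPU index -> total video memory
        self.run_sharing_method = run_sharing_method

    def report_node_info(self, send_to_scheduler: Callable[[dict], None]) -> None:
        # Information determining + information sending modules: describe the
        # real GPU resources of the current node and hand them to the scheduler.
        send_to_scheduler({
            "node": self.node_name,
            "fgpu_mem_mib": sum(self.gpu_mem_mib.values()),
        })

    def on_schedule_request(self, container_id: str, requested_mem_mib: int) -> None:
        # Scheduling request + scheduling modules: when the scheduler places a
        # container that requests GPU resources on this node, perform the GPU
        # resource sharing method so the container occupies part of a real GPU.
        self.run_sharing_method(container_id, requested_mem_mib)
```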
Fig. 17 illustrates an example system 1700 that includes an example computing device 1710 that represents one or more systems and/or devices that can implement the various techniques described herein. The computing device 1710 may be, for example, a server of a service provider, a device associated with a server, a system on a chip, and/or any other suitable computing device or computing system. The apparatus 1500 for GPU resource sharing described above with reference to fig. 15 and the apparatus 1600 for scheduling GPU resource sharing described with reference to fig. 16 may each take the form of a computing device 1710. Alternatively, the apparatus 1500 for GPU resource sharing may be implemented as a computer program in the form of an application 1716.
The example computing device 1710 as illustrated includes a processing system 1711, one or more computer-readable media 1712, and one or more I/O interfaces 1713 communicatively coupled to each other. Although not shown, the computing device 1710 may also include a system bus or other data and command transfer system that couples the various components to one another. A system bus can include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. Various other examples are also contemplated, such as control and data lines.
The processing system 1711 is representative of functionality to perform one or more operations using hardware. Thus, the processing system 1711 is illustrated as including hardware elements 1714 that may be configured as processors, functional blocks, and the like. This may include implementation in hardware as an application specific integrated circuit or other logic device formed using one or more semiconductors. The hardware elements 1714 are not limited by the materials from which they are formed or the processing mechanisms employed therein. For example, a processor may be comprised of semiconductor(s) and/or transistors (e.g., electronic Integrated Circuits (ICs)). In such a context, processor-executable instructions may be electronically-executable instructions.
The computer-readable medium 1712 is illustrated as including a memory/storage 1715. Memory/storage 1715 represents memory/storage capacity associated with one or more computer-readable media. The memory/storage 1715 may include volatile media (such as Random Access Memory (RAM)) and/or nonvolatile media (such as Read Only Memory (ROM), flash memory, optical disks, magnetic disks, and so forth). The memory/storage 1715 may include fixed media (e.g., RAM, ROM, a fixed hard drive, etc.) as well as removable media (e.g., flash memory, a removable hard drive, an optical disk, and so forth). The computer-readable medium 1712 may be configured in a variety of other ways, which are further described below.
One or more I/O interfaces 1713 represent functionality that allows a user to enter commands and information to computing device 1710 using various input devices and optionally also allows information to be presented to the user and/or other components or devices using various output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone (e.g., for voice input), a scanner, touch functionality (e.g., capacitive or other sensors configured to detect physical touch), a camera (e.g., motion that does not involve touch may be detected as gestures using visible or invisible wavelengths such as infrared frequencies), and so forth. Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, a haptic response device, and so forth. Accordingly, the computing device 1710 may be configured in various ways to support user interaction, as described further below.
Computing device 1710 also includes application 1716. The application 1716 may be, for example, a software instance of the apparatus 1500 for GPU resource sharing or the apparatus 1600 for scheduling GPU resource sharing, and implement the techniques described herein in combination with other elements in the computing device 1710.
Various techniques may be described herein in the general context of software, hardware elements, or program modules. Generally, these modules include routines, programs, objects, elements, components, data structures, and so forth that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and "component" as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
An implementation of the described modules and techniques may be stored on or transmitted across some form of computer readable media. Computer readable media can include a variety of media that can be accessed by computing device 1710. By way of example, and not limitation, computer-readable media may comprise "computer-readable storage media" and "computer-readable signal media".
"computer-readable storage medium" refers to a medium and/or device, and/or a tangible storage apparatus, capable of persistently storing information, as opposed to mere signal transmission, carrier wave, or signal per se. Accordingly, computer-readable storage media refers to non-signal bearing media. Computer-readable storage media include hardware such as volatile and nonvolatile, removable and non-removable media and/or storage devices implemented in a method or technology suitable for storage of information such as computer-readable instructions, data structures, program modules, logic elements/circuits or other data. Examples of computer readable storage media may include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital Versatile Disks (DVD) or other optical storage, hard disks, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other storage, tangible media, or an article of manufacture suitable for storing the desired information and which may be accessed by a computer.
"computer-readable signal medium" refers to a signal-bearing medium configured to transmit instructions to the hardware of the computing device 1710, such as via a network. Signal media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave, data signal or other transport mechanism. Signal media also includes any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
As previously mentioned, the hardware element 1714 and the computer-readable medium 1712 represent instructions, modules, programmable device logic, and/or fixed device logic implemented in hardware that, in some embodiments, may be used to implement at least some aspects of the techniques described herein. The hardware elements may include integrated circuits or systems-on-chips, Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), Complex Programmable Logic Devices (CPLDs), and other implementations in silicon or components of other hardware devices. In this context, a hardware element may serve as a processing device to perform program tasks defined by instructions, modules, and/or logic embodied by the hardware element, as well as a hardware device to store instructions for execution, such as the computer-readable storage medium described previously.
Combinations of the foregoing may also be used to implement the various techniques and modules described herein. Thus, software, hardware, or program modules and other program modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage medium and/or by one or more hardware elements 1714. The computing device 1710 may be configured to implement particular instructions and/or functions corresponding to software and/or hardware modules. Thus, a module executable by the computing device 1710 as software may be implemented at least partially in hardware, for example, using a computer-readable storage medium of the processing system and/or the hardware elements 1714. The instructions and/or functions may be executable/operable by one or more articles of manufacture (e.g., one or more computing devices 1710 and/or processing systems 1711) to implement the techniques, modules, and examples described herein.
In various embodiments, the computing device 1710 may assume a variety of different configurations. For example, the computing device 1710 may be implemented as a computer-class device including a personal computer, a desktop computer, a multi-screen computer, a laptop computer, a netbook, and so forth. The computing device 1710 may also be implemented as a mobile-device-class device including a mobile phone, a portable music player, a portable gaming device, a tablet computer, a multi-screen computer, or the like. The computing device 1710 may also be implemented as a television-class device that includes devices having or connected to a generally larger screen in a casual viewing environment; these devices include televisions, set-top boxes, game consoles, and the like.
The techniques described herein may be supported by these various configurations of computing device 1710 and are not limited to the specific examples of techniques described herein. Functionality may also be implemented in whole or in part on the "cloud" 1720 using a distributed system, such as through platform 1722 as described below.
Cloud 1720 includes and/or is representative of a platform 1722 for resources 1724. Platform 1722 abstracts underlying functionality of hardware (e.g., servers) and software resources of cloud 1720. The resources 1724 may include applications and/or data that may be used when executing computer processes on servers remote from the computing device 1710. The resources 1724 may also include services provided over the internet and/or over a subscriber network such as a cellular or Wi-Fi network.
Platform 1722 may abstract resources and functionality to connect the computing device 1710 with other computing devices. The platform 1722 may also be used to abstract the scaling of resources to provide a corresponding level of scale for the demand encountered for the resources 1724 implemented via the platform 1722. Thus, in interconnected device embodiments, implementation of the functions described herein may be distributed throughout the system 1700. For example, the functions may be implemented in part on the computing device 1710 and in part through the platform 1722 that abstracts the functionality of the cloud 1720.
A computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The processor of the computing device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computing device to perform the GPU resource sharing method provided in the various alternative implementations described above.
It should be understood that embodiments of the disclosure have been described with reference to different functional units for clarity. However, it will be apparent that the functionality of each functional unit may be implemented in a single unit, in a plurality of units or as part of other functional units without departing from the disclosure. For example, functionality illustrated to be performed by a single unit may be performed by a plurality of different units. Thus, references to specific functional units are only to be seen as references to suitable units for providing the described functionality rather than indicative of a strict logical or physical structure or organization. Thus, the present disclosure may be implemented in a single unit or may be physically and functionally distributed between different units and circuits.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various devices, elements, components or sections, these devices, elements, components or sections should not be limited by these terms. These terms are only used to distinguish one device, element, component or section from another device, element, component or section.
Although the present disclosure has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the disclosure is limited only by the following claims. Additionally, although individual features may be included in different claims, these may possibly advantageously be combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. The order of features in the claims does not imply any specific order in which the features must be worked. Furthermore, in the claims, the word "comprising" does not exclude other elements, and the terms "a" or "an" do not exclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.

Claims (14)

1. A method of GPU resource sharing, comprising:
receiving a request to create a container;
calling a kernel module running in a kernel mode of an operating system to create a virtual sub-GPU (graphics processing unit), so that the virtual sub-GPU occupies partial resources of a real GPU;
adapting a virtual sub-GPU to the operating system such that a virtual sub-GPU is available for the container;
mounting the virtual sub-GPU to the container, so that the virtual sub-GPU utilizes occupied partial resources of the real GPU to process the calculation request of the container after the container is created.
2. The method of claim 1, wherein invoking a kernel module running in a kernel state of an operating system to create a virtual sub-GPU, such that the virtual sub-GPU occupies a portion of resources of a real GPU comprises:
responding to a container to be created to request GPU resources, calling a kernel module running in a kernel mode of an operating system to create a virtual sub-GPU, and enabling the virtual sub-GPU to occupy partial resources of a real GPU.
3. The method of claim 1 or 2, further comprising:
receiving a resource parameter indicating an amount of a partial resource of the real GPU requested by the container to be created, an
The method for creating the virtual sub-GPU by calling the kernel module running in the kernel mode of the operating system so that the virtual sub-GPU occupies partial resources of the real GPU comprises the following steps:
calling a kernel module running in a kernel mode of the operating system to create a virtual sub GPU;
sending a resource parameter to the kernel module, such that the kernel module configures, based on the resource parameter, an amount of the partial resource of the real GPU occupied by the virtual sub-GPU in response to determining that the amount of the requested partial resource does not exceed an amount of unoccupied resources of the real GPU.
4. The method of claim 1, wherein mounting the virtual sub-GPU to the container comprises:
setting the device name of the virtual sub-GPU to be in the form of the device name of a real GPU;
mounting the virtual sub-GPU to the container so that the container can identify the virtual sub-GPU through the device name of the virtual sub-GPU.
5. The method of claim 3, wherein the resource parameters include a video memory amount of the real GPU requested by the container to be created, and
wherein sending resource parameters to the kernel module such that the kernel module configures, based on the resource parameters, an amount of the partial resources of the real GPU that are occupied by the virtual sub-GPUs in response to determining that the amount of the requested partial resources does not exceed an amount of unoccupied resources of the real GPU comprises:
sending resource parameters to the kernel module, so that the kernel module configures, in response to determining that the requested video memory amount does not exceed an unoccupied video memory amount of a real GPU, an amount of the virtual sub-GPU occupying video memory of the real GPU based on the requested video memory amount.
6. The method of claim 3, wherein the resource parameters include the computing power of the real GPU requested by the container to be created, and
wherein sending resource parameters to the kernel module such that the kernel module configures, based on the resource parameters, an amount of the partial resources of the real GPU that are occupied by the virtual sub-GPUs in response to determining that the amount of the requested partial resources does not exceed an amount of unoccupied resources of the real GPU comprises:
sending the resource parameters to the kernel module, so that the kernel module, in response to determining that the requested computing power does not exceed an unoccupied computing power of the real GPU, configures the computing power of the real GPU occupied by the virtual sub-GPU through the number of time slices of the real GPU allocated to the virtual sub-GPU, based on the requested computing power.
7. The method of claim 1, wherein adapting the virtual sub-GPU to the operating system such that the virtual sub-GPU is available to the container comprises:
adding the virtual sub-GPU to a resource control group of the operating system, wherein the resource control group of the operating system is used for limiting, controlling and separating resources occupied by the process, so that the virtual sub-GPU is available for the container.
8. A method of scheduling GPU resource sharing, comprising:
determining node information of a current node, wherein the node information indicates resources of a real GPU of the current node;
sending the node information of the current node to a scheduler, wherein the scheduler is used for scheduling a container to be created and operated on the current node based on the node information;
receiving a scheduling request which is issued by the scheduler to the current node and is related to creating a container;
in response to a container to be created requesting resources of a GPU, causing the GPU resource sharing method of claim 1 to be performed so as to cause the container to be created to occupy a portion of the resources of the real GPU of the current node.
9. The method of claim 8, wherein the current node includes a created container, and wherein determining node information for the current node comprises:
and determining the total amount of the real GPU resources of the current node and the amount of the real GPU resources occupied by the created container.
10. The method of claim 8, wherein the resources of the real GPU of the current node comprise a video memory of the real GPU of the current node.
11. An apparatus for GPU resource sharing, comprising:
a receiving module configured to receive a request to create a container;
a creating module configured to invoke a kernel module running in a kernel state of an operating system to create a virtual sub-GPU such that the virtual sub-GPU occupies a portion of resources of a real GPU;
an adaptation module configured to adapt a virtual sub-GPU to the operating system such that the virtual sub-GPU is available for the container;
a mounting module configured to mount the virtual sub-GPU to the container such that the virtual sub-GPU processes computational requests of the container using a portion of occupied real GPU resources after the container is created.
12. An apparatus to schedule GPU resource sharing, comprising:
an information determination module configured to determine node information for a current node, the node information indicating resources of a real GPU of the current node;
an information sending module configured to send node information of a current node to a scheduler, the scheduler being configured to schedule a container to be created and run on the current node based on the node information;
the scheduling request module is configured to receive a scheduling request which is issued by a scheduler to a current node and is related to the creation of the container;
a scheduling module configured to cause the GPU resource sharing method of claim 1 to be performed in response to a container to be created requesting resources of a GPU, so as to cause the container to be created to occupy a portion of resources of a real GPU of a current node.
13. A computing device, comprising:
a memory configured to store computer-executable instructions;
a processor configured to perform the method of any one of claims 1-10 when the computer-executable instructions are executed by the processor.
14. A computer-readable storage medium storing computer-executable instructions that, when executed, perform the method of any one of claims 1-10.
CN202110832237.6A 2021-07-22 2021-07-22 GPU resource sharing method and device, and GPU resource sharing scheduling method and device Pending CN115686805A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110832237.6A CN115686805A (en) 2021-07-22 2021-07-22 GPU resource sharing method and device, and GPU resource sharing scheduling method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110832237.6A CN115686805A (en) 2021-07-22 2021-07-22 GPU resource sharing method and device, and GPU resource sharing scheduling method and device

Publications (1)

Publication Number Publication Date
CN115686805A true CN115686805A (en) 2023-02-03

Family

ID=85044952

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110832237.6A Pending CN115686805A (en) 2021-07-22 2021-07-22 GPU resource sharing method and device, and GPU resource sharing scheduling method and device

Country Status (1)

Country Link
CN (1) CN115686805A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116028233A (en) * 2023-03-29 2023-04-28 北京大数据先进技术研究院 Digital object organization and sharing method and device of AI computing resource
CN116248414A (en) * 2023-05-09 2023-06-09 杭州海康威视数字技术股份有限公司 Method and device for realizing password acceleration based on virtualized hardware and electronic equipment
CN116248414B (en) * 2023-05-09 2023-07-25 杭州海康威视数字技术股份有限公司 Method and device for realizing password acceleration based on virtualized hardware and electronic equipment
CN117573418A (en) * 2024-01-15 2024-02-20 北京趋动智能科技有限公司 Processing method, system, medium and equipment for video memory access exception
CN117573418B (en) * 2024-01-15 2024-04-23 北京趋动智能科技有限公司 Processing method, system, medium and equipment for video memory access exception

Similar Documents

Publication Publication Date Title
CN110704186B (en) Computing resource allocation method and device based on hybrid distribution architecture and storage medium
CN108885571B (en) Input of batch processing machine learning model
CN110888743B (en) GPU resource using method, device and storage medium
CN106933669B (en) Apparatus and method for data processing
US8819683B2 (en) Scalable distributed compute based on business rules
CN109408205B (en) Task scheduling method and device based on hadoop cluster
US11948014B2 (en) Multi-tenant control plane management on computing platform
CN109117252B (en) Method and system for task processing based on container and container cluster management system
CN106663021A (en) Intelligent gpu scheduling in a virtualization environment
US9591094B2 (en) Caching of machine images
CN112104723B (en) Multi-cluster data processing system and method
CN108509280B (en) Distributed computing cluster locality scheduling method based on push model
US11831410B2 (en) Intelligent serverless function scaling
CN115686805A (en) GPU resource sharing method and device, and GPU resource sharing scheduling method and device
CN111552550A (en) Task scheduling method, device and medium based on GPU (graphics processing Unit) resources
US10037225B2 (en) Method and system for scheduling computing
US10235223B2 (en) High-performance computing framework for cloud computing environments
CN114356587B (en) Calculation power task cross-region scheduling method, system and equipment
CN114968567A (en) Method, apparatus and medium for allocating computing resources of a compute node
US10802874B1 (en) Cloud agnostic task scheduler
CN108829516B (en) Resource virtualization scheduling method for graphic processor
US10572412B1 (en) Interruptible computing instance prioritization
CN114253671A (en) GPU pooling method of Android container
CN109617954B (en) Method and device for creating cloud host
CN111858234A (en) Task execution method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40081315

Country of ref document: HK