CN114625482A - Equipment management method and device

Equipment management method and device

Info

Publication number
CN114625482A
CN114625482A
Authority
CN
China
Prior art keywords
container
gpu
vgpu
gpus
api
Prior art date
Legal status
Pending
Application number
CN202210294026.6A
Other languages
Chinese (zh)
Inventor
安仲奇
董建波
唐小川
张正俣
Current Assignee
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date
Filing date
Publication date
Application filed by Alibaba China Co Ltd
Priority to CN202210294026.6A
Publication of CN114625482A
Legal status: Pending

Classifications

    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F9/5016 Allocation of resources to service a request, the resource being the memory
    • G06F9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F2009/4557 Distribution of virtual machine instances; migration and load balancing
    • G06F2009/45575 Starting, stopping, suspending or resuming virtual machine instances
    • G06F2009/45583 Memory management, e.g. access or allocation
    • G06F2009/45587 Isolation or security of virtual machine instances

Abstract

The embodiment of the application provides a device management method and device. The method comprises the following steps: mounting N GPUs on each container in a plurality of containers, wherein a preset link exists between the N GPUs, and N is an integer greater than 1; virtualizing the GPUs which can be called by each container to obtain one or more vGPU instances corresponding to each container; and providing the virtualized vGPU instances for the corresponding containers to use. By mounting the N GPUs on each container and virtualizing only the GPUs that each container can call, GPU isolation between the containers is achieved while the preset links between the GPUs are not blocked, so that the GPUs are still allowed to communicate over the preset links.

Description

Equipment management method and device
Technical Field
The present application relates to the field of computers, and more particularly, to a device management method and apparatus.
Background
With the continuous development of computer technology, more and more artificial intelligence (AI) deep training tasks are deployed and run in the form of containers, and these tasks rely heavily on graphics processing units (GPUs).
Currently, when mounting GPUs for containers, a system or a user configures the GPUs that each container is allowed to call. The software responsible for managing the GPUs, namely the GPU runtime, then mounts the corresponding GPUs for each container according to the configuration of the system or the user. Each container can use only the GPUs mounted in it, which ensures isolation between containers.
In some scenarios, such as distributed training, the same tenant may use multiple containers to perform the same task in order to improve execution efficiency. Because data may need to be shared among the containers, high-speed data transmission between them is required, and this can be realized with a GPU high-speed interconnection technology whose communication bandwidth is much higher than that of an ordinary network. However, the containers are isolated from each other, which means that the high-speed interconnection cannot be used between GPUs belonging to different containers. Therefore, data transmission between containers is currently implemented mainly by means of shared memory or network transmission. This, however, affects the overall performance. For example, with shared memory the data has to pass through the system main memory and is copied multiple times, so the communication efficiency is low, the communication performance is poor, the execution efficiency is limited, and the scalability of training is limited.
Disclosure of Invention
The application provides a device management method and device, aiming at improving communication speed and execution efficiency while realizing container isolation.
In a first aspect, the present application provides a device management method, including: mounting N GPUs on each container in a plurality of containers, wherein a preset link exists between the N GPUs, and N is an integer greater than 1; virtualizing the GPUs which can be called by each container to obtain one or more vGPU instances corresponding to each container; and providing the virtualized vGPU instances for the corresponding containers to use.
In a second aspect, the present application provides an apparatus for device management, the apparatus comprising: a control module and a virtualization module; the control module is used for mounting N GPUs on each container in a plurality of containers, wherein a preset link exists between the N GPUs, and N is an integer greater than 1; the virtualization module is used for virtualizing the GPUs which can be called by each container to obtain one or more vGPU instances corresponding to each container; and the control module is further used for providing the virtualized vGPU instances for the corresponding containers to use.
It should be understood that the respective modules may implement the respective functions by executing the computer program.
In a third aspect, the present application provides a device management apparatus, which includes a processor configured to execute program code to cause the apparatus to implement the method in the first aspect.
In a fourth aspect, the present application provides a chip, where the chip includes at least one processor, and is configured to implement the functions related to the first aspect, such as virtualizing a GPU.
In a fifth aspect, the present application provides a computing device comprising: a processor, a memory and a computer program stored on the memory and executable on the processor, the processor implementing the method of the first aspect when executing the computer program.
In a sixth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to carry out the method of the first aspect described above.
In a seventh aspect, the present application provides a computer program product comprising a computer program that, when executed, performs the method of the first aspect.
Based on the above scheme, on one hand, a virtualized vGPU instance is provided to each container, through virtualization technology, according to the GPUs that the container can call, so that each container can access only the GPUs it is allowed to call, and GPU isolation between containers is ensured; on the other hand, the N GPUs are mounted on every container, so the preset links between GPUs mounted on different containers are not disabled when the containers are started, that is, the preset links between the GPUs are not blocked, and the GPUs are therefore allowed to communicate with each other over the high-speed interconnection. Because the communication efficiency of shared memory or network transmission is far lower than that of the preset link, the communication speed can be greatly improved, the execution efficiency is improved, and good communication performance is ensured.
Drawings
FIG. 1 is a schematic diagram of inter-GPU communication provided by embodiments of the present application;
fig. 2 is a schematic flow chart of a device management method provided in an embodiment of the present application;
FIG. 3 is another schematic diagram of inter-GPU communication provided by embodiments of the present application;
fig. 4 is a schematic block diagram of a device management apparatus provided in an embodiment of the present application;
fig. 5 is another schematic block diagram of a device management apparatus provided in an embodiment of the present application.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings.
The technical scheme provided by the application can be applied to the fields of artificial intelligence (AI) and deep learning (DL). AI is a branch of technical science that studies and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence. DL is a research direction in the field of machine learning that was introduced to bring machine learning closer to the original goal of AI.
Fig. 1 is a schematic diagram of communications between GPUs according to an embodiment of the present disclosure.
In the communication scenario shown in fig. 1, two containers are enabled, namely container 0 and container 1. Each container contains a plurality of work processes (workers) and a corresponding communication library. Container 0 comprises work process 0, work process 1, and work process 2, and container 1 comprises work process 3 and work process 4. It should be understood that multiple work processes may form a collective process group and complete the same task in parallel. One or more GPUs are mounted on each container. As shown in fig. 1, container 0 has GPU0, GPU1, and GPU2 mounted on it, and container 1 has GPU3 and GPU4 mounted on it. Each GPU can be provided for the corresponding work process to use. For example, GPU0 may be provided for work process 0, GPU1 for work process 1, GPU2 for work process 2, GPU3 for work process 3, and GPU4 for work process 4. A work process in a container uses a GPU by calling it, and the work processes using the GPUs may need to frequently perform collective communication.
It should be appreciated that a container is a collection of processes that is isolated from the other resources of the system and has its own independent view of those resources.
It should also be understood that mounting is the process of exposing certain GPUs of a host to a container so that the container can access and use those GPUs.
It should also be understood that collective communication is communication among a set of processes. Collective communication is distinguished from point-to-point communication in that it requires all of the processes within a particular group to participate simultaneously, and it may be one-to-many, many-to-one, or many-to-many. The communication library referred to in the present application is a communication library for collective communication. For the related content of collective communication and communication libraries, reference is made to the prior art, and the detailed description thereof is omitted here.
Currently, when mounting GPUs for container 0 and container 1, a system or a user configures the GPUs that container 0 and container 1 can use; for example, GPU0, GPU1, and GPU2 are configured as the GPUs that container 0 can use, and GPU3 and GPU4 are configured as the GPUs that container 1 can use. The GPU runtime then mounts GPU0 to GPU2 for container 0 and GPU3 and GPU4 for container 1. Container 0 may then access GPU0 to GPU2, and container 1 may access GPU3 and GPU4. Since container 0 and container 1 are isolated from each other, it can be guaranteed at the GPU runtime level that container 0 cannot access GPU3 and GPU4 inside container 1, and container 1 cannot access GPU0 to GPU2 inside container 0.
However, researchers have found that, because container 0 and container 1 are isolated from each other, GPU0, GPU1, and GPU2 in container 0 cannot communicate at high speed with GPU3 and GPU4 in container 1 using the GPU high-speed interconnect technology; that is, high-speed communication between GPUs in different containers cannot be achieved. Data transfer between containers is therefore currently performed by means of shared memory, network transmission, and the like; for example, GPU1 and GPU4 exchange data through shared memory. However, this affects the overall performance. For example, with shared memory the data has to pass through the system main memory and is copied multiple times, so the communication efficiency is low, the communication performance is poor, and the scalability of training is limited.
In view of this, the present application provides a device management method. On the one hand, a virtualized vGPU instance is provided to each container, through virtualization technology, according to the GPUs that the container can call, so that each container can access only the GPUs it is allowed to call, and GPU isolation between containers is ensured. On the other hand, the N GPUs are mounted on every container, so the preset links between GPUs mounted on different containers are not disabled when the containers are started; that is, the preset links between the GPUs are not blocked, and the GPUs are allowed to communicate with one another over the high-speed interconnection. If the preset link is designed as a high-speed communication link for communication between GPUs, then, because the communication efficiency of shared memory or network transmission is far lower than that of the high-speed communication link, the communication efficiency can be greatly improved, the execution efficiency can be improved, and the overall performance can be improved.
The device management method provided by the embodiment of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a schematic flowchart of a device management method provided in an embodiment of the present application. The method 200 shown in fig. 2 may be applied to a CPU or to system software on a CPU. The system software may include an operating system and a resource scheduling system, where the resource scheduling system may be used for resource scheduling for the GPU.
The method 200 shown in fig. 2 may include steps 201 through 203. The following describes each step in the method 200 shown in fig. 2 in detail, taking the application of the method 200 to a CPU as an example.
Step 201, mounting N GPUs on each container of a plurality of containers, where a preset link exists between the N GPUs, and N is an integer greater than 1.
Wherein the preset link is available for communication between the GPUs. In the embodiment of the present application, the preset link may be designed as a high-speed communication link for inter-GPU communication, and the transmission bandwidth of the high-speed communication link is much higher than that of an ordinary network, so that a higher communication speed can be provided. By way of example and not limitation, the high-speed communication link may be NVLink from NVIDIA, or may be a high-speed communication link for inter-GPU communication provided by another GPU vendor; the present application is not limited in this respect.
When the CPU starts the containers, the N GPUs may be mounted on each container, where the N GPUs are all the GPUs configured for the plurality of containers. The plurality of containers may be, for example, multiple containers used by the same tenant to perform the same task. Since all the GPUs are mounted on each of the containers, starting a container does not invalidate the high-speed communication links between the different mounted GPUs; because the high-speed communication links between the GPUs are not blocked, they can be used for data transmission between the GPUs.
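By way of illustration only, the following Python sketch contrasts the prior per-container mounting with the mounting described here; the container and GPU names are assumptions made for this example and are not taken from the application itself:

```python
ALL_GPUS = ["GPU0", "GPU1", "GPU2", "GPU3"]  # all N GPUs configured for the task

# Prior approach: each container mounts only its own subset, so the preset
# links between GPUs belonging to different containers are disabled at start-up.
prior_mounting = {"container0": ["GPU0", "GPU1", "GPU2"], "container1": ["GPU3"]}

# This application: every container mounts all N GPUs, so no preset link is
# blocked and the GPUs can still communicate over those links.
proposed_mounting = {name: list(ALL_GPUS) for name in ("container0", "container1")}
```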
As shown in fig. 3, the CPU has started two containers, container 0 and container 1, with GPU0 to GPU3 mounted on both container 0 and container 1. For container 0 or container 1, there is no isolation between GPU0, GPU1, GPU2, and GPU3, and a high-speed communication link exists between the GPUs. Therefore, high-speed interconnection between GPU0, GPU1, GPU2, and GPU3 is possible.
Alternatively, the GPU mounted to each of the plurality of containers may be configured by the system or configured by the user.
In particular, a mountable GPU may be configured for a container by a system or a user. The system may be a resource scheduling system on the CPU, which may be used to configure a mountable GPU for the container. It should be understood that the resource scheduling system may also be provided on the system software of the CPU.
An exemplary process for configuring mountable GPUs for containers is given below.
For example, a resource scheduling system or a user may uniformly configure the mountable GPUs for the plurality of containers. For instance, container 0 and container 1 may each be configured by the resource scheduling system on the CPU to mount GPU0, GPU1, GPU2, and GPU3. Thus, the CPU mounts GPU0 to GPU3 for both container 0 and container 1 according to this configuration.
Step 202, virtualizing the GPUs that can be called by each container, and obtaining one or more vGPU instances corresponding to each container.
The GPUs that each container can call are the GPUs that the container can actually use. For example, container 0 in fig. 3 may call GPU0, GPU1, and GPU2, and container 1 may call GPU3. The GPUs that each container can call may be configured through a mapping relationship. The mapping relationship between the GPUs and a container may be configured by a resource scheduling system or manually by a user. The mapping relationship may be configured separately for each container, in which case the mapping relationship configured for a container indicates the GPUs that this container can call; or the mapping relationship may be configured uniformly for all containers, in which case it indicates the GPUs that each container of the plurality of containers can call.
When the mapping relationship is configured separately for each container, the resource scheduling system or the user may generate one mapping relationship for each container. For example, the resource scheduling system generates a mapping #1 for the GPUs that container 0 can call, where the mapping #1 indicates that the GPUs that container 0 can call are GPU0, GPU1, and GPU 2; the resource scheduling system also generates a mapping #2 for the GPU that container 1 can invoke, which mapping #2 indicates that the GPU that container 1 can invoke is GPU 3. It should be understood that mapping #1 and mapping #2 are specific examples of the mapping, respectively. When the mapping relationships are individually configured for each container, the mapping relationships configured for the containers are different from each other.
When all containers are configured uniformly, the resource scheduling system or the user generates one mapping relationship for all containers. For example, the mapping relationship indicates that the GPUs that container 0 can call are GPU0, GPU1, and GPU2, and that the GPU that container 1 can call is GPU3. In other words, this mapping relationship is the union of mapping relationship #1 and mapping relationship #2.
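By way of illustration only, the following Python sketch shows the two configuration forms of the mapping relationship; the dictionary layout and the names are assumptions made for clarity and are not part of the application:

```python
# Per-container configuration: each container receives only its own mapping.
mapping_1 = {"container0": ["GPU0", "GPU1", "GPU2"]}   # mapping relationship #1
mapping_2 = {"container1": ["GPU3"]}                   # mapping relationship #2

# Unified configuration: a single mapping covers all containers
# (the union of mapping relationship #1 and mapping relationship #2).
unified_mapping = {
    "container0": ["GPU0", "GPU1", "GPU2"],
    "container1": ["GPU3"],
}

def callable_gpus(mapping, container):
    """Return the GPUs the given container is allowed to call."""
    return mapping.get(container, [])

assert callable_gpus(unified_mapping, "container0") == ["GPU0", "GPU1", "GPU2"]
```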
After knowing the GPU that each container can call according to the mapping relation, the CPU can determine the GPU that can be called from the N GPUs mounted in each container, and virtualize the GPU that can be called.
It should be understood that GPU virtualization refers to the packaging of a single GPU device into several logical vGPU instances for concurrent use by different work processes.
Optionally, each container of the plurality of containers includes one or more work processes, each work process being provided with one or more vGPU instances.
When a GPU is virtualized, whether it is virtualized into one vGPU instance or multiple vGPU instances can be determined according to the GPU's specifications. When the specifications of the GPU are high and it can meet the requirements of multiple work processes simultaneously, the GPU can be virtualized into multiple vGPU instances; when the specifications of the GPU are low and it cannot meet the requirements of multiple work processes simultaneously, the GPU can be virtualized into a single vGPU instance.
For example, in fig. 3, the CPU knows from the mapping relationship that the GPUs that can be called by container 0 are GPU0, GPU1, and GPU2, and that the GPU that can be called by container 1 is GPU3. Although GPU0 to GPU3 are mounted in both container 0 and container 1, for container 0 the CPU virtualizes GPU0 to GPU2 but not GPU3; for container 1, the CPU virtualizes GPU3 but not GPU0 to GPU2. Assuming that the specifications of GPU0, GPU1, and GPU2 are all low and cannot meet the requirements of multiple work processes simultaneously, GPU0, GPU1, and GPU2 can be virtualized into vGPU instance-0, vGPU instance-1, and vGPU instance-2, respectively; assuming that the specifications of GPU3 are high and can meet the requirements of multiple work processes simultaneously, GPU3 can be virtualized into vGPU instance-3 and vGPU instance-4. Thus, work process 0 in container 0 can be provided with vGPU instance-0, work process 1 with vGPU instance-1, and work process 2 with vGPU instance-2, while work process 3 in container 1 can be provided with vGPU instance-3 and work process 4 with vGPU instance-4.
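By way of illustration only, the following Python sketch models this virtualization decision and the assignment of vGPU instances to work processes; the function, the names, and the "high/low" specification flag are assumptions made for this example:

```python
def virtualize(callable_gpus, gpu_specs, work_processes):
    """Virtualize only the callable GPUs and hand the vGPU instances to work processes."""
    instances = []
    for gpu in callable_gpus:
        # A high-spec GPU can serve several work processes at once, so it is
        # split into more than one vGPU instance; a low-spec GPU yields one.
        n_instances = 2 if gpu_specs[gpu] == "high" else 1
        instances += [f"vGPU-instance-of-{gpu}-{i}" for i in range(n_instances)]
    # Assign one vGPU instance to each work process, in order.
    return dict(zip(work_processes, instances))

# Container 1: GPU3 is high-spec, so it is virtualized into two instances,
# which work process 3 and work process 4 then multiplex.
assignment = virtualize(["GPU3"], {"GPU3": "high"}, ["work process 3", "work process 4"])
# {'work process 3': 'vGPU-instance-of-GPU3-0', 'work process 4': 'vGPU-instance-of-GPU3-1'}
```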
It can be seen that, although the GPUs are not isolated from one another at the container level, the containers' calls to GPU resources are still isolated from each other through virtualization.
The vGPU instances can be obtained by the vGPU runtime virtualizing each GPU based on the mapping relationship. The vGPU runtime can be understood as software for virtualizing the GPUs and managing the vGPU instances obtained through virtualization.
One possible implementation is to inject a vGPU runtime into each container, which is used to virtualize the callable GPUs.
For example, when the containers are started, the CPU may inject vGPU runtimes into each container, and the CPU may virtualize the GPUs that can be invoked by each container into one or more vGPU instances by invoking the vGPU runtimes in each container according to the mapping relationship. In particular implementations, the CPU may inject vGPU runtime into each container by mounting the host volume.
Optionally, the method further comprises: and providing the mapping relation to the vGPU runtime of each container, wherein the mapping relation is used for indicating the GPU which can be called by each container.
For example, after the resource scheduling system in the CPU generates the mapping relationship, the mapping relationship may be provided to the vGPU runtime, and the vGPU runtime virtualizes the GPUs that the container can call according to the mapping relationship. The resource scheduling system can provide the mapping relationship to the vGPU runtime in the form of a configuration file, an environment variable, a command-line parameter, or the like.
As described above, the mapping relationship may be configured individually for each container, or uniformly for all containers. When each container is configured individually, the resource scheduling system provides each vGPU runtime only with the GPUs that can be called by the container to which that vGPU runtime belongs, that is, the contents of the mapping relationships obtained by the vGPU runtimes differ from each other. When all containers are configured uniformly, the resource scheduling system provides each vGPU runtime with the GPUs that every container can call, that is, the content of the mapping relationship obtained by each vGPU runtime is the same, and each vGPU runtime looks up in the mapping relationship the GPUs that can be called by the container to which it belongs.
It should be understood that the mapping may also be provided by the user.
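By way of illustration only, the following Python sketch shows how a vGPU runtime might read such a mapping relationship from an environment variable, one of the forms mentioned above; the variable name VGPU_CALLABLE_MAPPING and the JSON encoding are assumptions made for this example and are not part of the application:

```python
import json
import os

def load_callable_gpus(container_name):
    """Return the list of GPUs the container this runtime belongs to may call."""
    raw = os.environ.get("VGPU_CALLABLE_MAPPING", "{}")  # hypothetical variable name
    mapping = json.loads(raw)
    # Under the unified configuration the mapping holds every container, so the
    # runtime looks up the entry of its own container; under the per-container
    # configuration the mapping holds only that one entry.
    return mapping.get(container_name, [])

os.environ["VGPU_CALLABLE_MAPPING"] = json.dumps(
    {"container0": ["GPU0", "GPU1", "GPU2"], "container1": ["GPU3"]}
)
print(load_callable_gpus("container0"))  # ['GPU0', 'GPU1', 'GPU2']
```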
And step 203, providing the virtualized vGPU instance for a corresponding container to use.
After the CPU virtualizes the GPU that can be called by the container by calling the vGPU runtime, the use of the vGPU instance by the container can be realized in the following manner. It should be understood that the use of a vGPU instance by a container may specifically be the use of a vGPU instance by a worker process in the container.
It should be appreciated that since the vGPU instance is virtualized by the GPU, the use of the vGPU instance corresponding to the GPU by the work processes in the container is also equivalent to the use of the GPU by the work processes in the container.
Optionally, step 203 may specifically include: hijacking calls to the first application program interface API and providing a second API through the vGPU runtime injected into each container.
The first API is an API provided by a GPU vendor, and may specifically be a GPU user-mode API or a GPU kernel-driver API. The second API is an API provided by the vGPU runtime that has the same name, signature, and general appearance as the API provided by the GPU vendor, and is used to call the vGPU instances in each container.
It should be understood that hijacking a call to an API may be understood as modifying the entry of the original API so that it jumps to another API. In the embodiment of the application, by hijacking the call to the first API and providing the second API, the call to the first API is made to jump to the second API. Specifically, when a work process in a container makes a call to the first API, the vGPU runtime may block the call to the first API by the work process and provide the second API to the work process.
Specifically, when a work process in the container needs to use the GPU, it usually makes a call to the API provided by the vendor, that is, the first API. At this time, the CPU controls the vGPU runtime in the container to block the call of the work process to the first API, and the vGPU runtime provides the work process with an API for calling the vGPU instance in the container, that is, the second API. Because the appearance of the first API is completely consistent with that of the second API, the work process in the container is induced to call the second API, and through the call to the second API the corresponding vGPU instance can be used. When the vGPU runtime hijacks the API provided by the vendor, the hijacking may be performed at the level of the GPU user-mode API or at the level of the GPU kernel-driver API.
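By way of illustration only, the following Python sketch expresses the hijacking idea; the class and function names stand in for the vendor's first API and the runtime's second API and are assumptions made for this example, not the application's or any vendor's actual interfaces:

```python
class VendorAPI:
    """Stand-in for the vendor's GPU library (the first API)."""
    def alloc(self, size):
        return f"raw GPU memory of {size} bytes"

class VGPURuntime:
    def __init__(self, vgpu_instance):
        self.vgpu_instance = vgpu_instance

    def hijack(self, vendor_api):
        original_alloc = vendor_api.alloc            # keep the first API around
        def second_api_alloc(size):                  # same name/signature as the first API
            # route the request to this container's own vGPU instance
            return f"{self.vgpu_instance}: " + original_alloc(size)
        vendor_api.alloc = second_api_alloc          # modify the entry so calls jump here

api = VendorAPI()
VGPURuntime("vGPU instance-0").hijack(api)
print(api.alloc(1024))  # the work process believes it is calling the first API
```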
Optionally, the vGPU runtime provides functionality to inject mapping relationships or to modify GPU runtime environment variables.
The GPU runtime may be understood as software for managing the GPUs. Environment variables are typically parameters in an operating system that specify the operating system's operating environment. The environment variables involved in the embodiments of the present application are originally used to describe the GPUs mounted on each container, and may be, for example, "CUDA_VISIBLE_DEVICES", "HIP_VISIBLE_DEVICES", and the like. It can be understood that, in the present embodiment, the GPUs mounted on each container are the N GPUs described above. The environment variable is modified so that it presents only the GPUs that each container can call.
For example, the vGPU runtime may inject the mapping relationship inside the second API, and the work processes in the container may then use the corresponding vGPU instances by calling the second API.
For example, the vGPU runtime of container 0 injects into the second API a mapping relationship indicating that container 0 can call GPU0 to GPU2, so work process 0 can use vGPU instance-0 through a call to the second API, work process 1 can use vGPU instance-1, and work process 2 can use vGPU instance-2. That is, through calls to the second API, work process 0 may use GPU0, work process 1 may use GPU1, and work process 2 may use GPU2. Meanwhile, the vGPU runtime of container 1 injects into the second API a mapping relationship indicating that container 1 can call GPU3, so work process 3 may use vGPU instance-3 through a call to the second API, and work process 4 may use vGPU instance-4. That is, work process 3 and work process 4 multiplex GPU3.
For another example, each container has all of the configured GPUs mounted on it, and the vGPU runtime virtualizes only the GPUs that the container is allowed to use. Since all of the configured GPUs are mounted in the container, the container can access all of the mounted GPUs, and for a non-virtualized GPU the container could bypass the second API and access it directly. For example, container 0 has GPU0 to GPU3 mounted, and only GPU0 to GPU2 are virtualized, but container 0 can still access GPU3 and could, by illegitimately bypassing the second API, use GPU3, which should be used by container 1. Likewise, container 1 also has GPU0 to GPU3 mounted, and only GPU3 is virtualized, but container 1 can still access GPU0 to GPU2 and could, by illegitimately bypassing the second API, use GPU0 to GPU2, which should be used by container 0. If this occurs, the isolation between container 0 and container 1 cannot be ensured.
Thus, to avoid the above, the vGPU runtime may provide the functionality of modifying GPU runtime environment variables, changing the non-virtualized GPUs from "accessible" to "inaccessible" for the container. For example, the GPU runtime environment variable describing the GPUs mounted on container 0 is modified to make GPU3 "inaccessible", so that container 0 cannot use GPU3. Similarly, the GPU runtime environment variable describing the GPUs mounted on container 1 is modified to make GPU0 to GPU2 "inaccessible", so that container 1 cannot use GPU0 to GPU2.
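By way of illustration only, the following Python sketch shows how such an environment-variable modification could look; CUDA_VISIBLE_DEVICES is NVIDIA's variable for this purpose, while the function name and the index values are assumptions made for this example (in practice each container's own vGPU runtime would apply this inside its own environment):

```python
import os

def restrict_visible_gpus(callable_gpu_indices):
    """Narrow the GPU runtime environment variable so the container sees only
    the GPUs it is allowed to call; the other mounted GPUs become "inaccessible"."""
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in callable_gpu_indices)

restrict_visible_gpus([0, 1, 2])  # for container 0: GPU3 becomes inaccessible
restrict_visible_gpus([3])        # for container 1: GPU0 to GPU2 become inaccessible
```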
It can be seen that, although container 0 and container 1 both mount GPU0 to GPU3, because the vGPU runtime guarantees isolation between the containers, container 0 and container 1 can only use the vGPU instances corresponding to the GPUs that they are respectively allowed to call: container 0 cannot use GPU3 in container 1, and container 1 cannot use GPU0 to GPU2 in container 0. Thus, isolation between container 0 and container 1 is guaranteed at the level of the vGPU runtime.
It is noted that when the specifications of a GPU are high, the GPU may be virtualized into multiple vGPU instances for use by the container. Deadlock may then occur when the entire control logic of the communication library is driven by the GPU. Specifically, when the communication library is driven by the GPU and the GPU resources are scarce or the GPU utilization is high, that is, when the GPU has almost no available resources, a deadlock is very likely to occur if multiple work processes in the container multiplex one GPU and interdependence exists between these work processes.
It should be understood that interdependence may mean that one work process (denoted, for example, as work process a) needs to wait for a signal from another work process (denoted as work process b) to continue running on the GPU, while work process b can only be scheduled after work process a releases the GPU's resources. Work process b can send the signal to work process a only after it has been scheduled and runs on the GPU, while the GPU's long-resident persistent-kernel mode dictates that work process a releases its resources only after completing its task. Therefore, work process a waits for the signal from work process b, work process b waits for work process a to release resources, the two wait for each other, and a deadlock occurs. The long-resident persistent-kernel mode specifically means that a work process running on the GPU releases its resources only after it finishes executing.
It should also be understood that deadlock does not occur when the GPU is provisioned with available resources.
For example, as shown in fig. 3, when the vGPU runtime in container 1 virtualizes GPU3 into vGPU instance-3 and vGPU instance-4, which are provided for work process 3 and work process 4 respectively, then work process 3 and work process 4 essentially both use GPU3. Assume that work process 3 runs on GPU3 first. When work process 3 runs to a certain node, it requires a signal from work process 4 to continue running. However, if the resources of GPU3 are fully occupied at this time, work process 4 cannot run on GPU3. Since work process 4 cannot run, it cannot give the signal that would let work process 3 continue, so work process 3 is left waiting. For work process 4, it can run on GPU3 only if GPU3 releases the resources occupied by work process 3. However, while work process 3 is waiting for work process 4, GPU3 cannot release these resources, and work process 4 also keeps waiting. Therefore, work process 3 and work process 4 wait for each other, and a deadlock occurs.
Therefore, in order to avoid a phenomenon that deadlock may occur when multiple work processes multiplex one GPU, the method 200 may further include:
and scheduling the work process based on the control logic in the communication library, so that the work process responds to the scheduling of the CPU and calls the resources in the CPU to perform calculation.
Specifically, the control logic of the communication library is offloaded from the GPU to the CPU, and the operating system of the CPU is responsible for scheduling the work processes based on the control logic of the communication library. Because the operating system runs on the CPU, it can ensure that the possibility of resource exhaustion is very low, so interdependent work processes can be scheduled onto the CPU for computation. On the other hand, unlike the GPU's long-resident persistent-kernel mode, the communication logic controlled by the CPU does not carry complex dependencies such as a work process waiting on external conditions; a work process on the GPU can simply finish running whenever the GPU has available resources, so the situation in which GPU resources are held while waiting for another unscheduled work process does not occur, and the deadlock problem does not arise. Moreover, when resources are exhausted, the CPU can schedule interdependent work processes in turn. Therefore, the deadlock caused by interdependent work processes waiting for each other can be avoided.
In one implementation, all work processes are handed over to the CPU for processing. That is, either dependent or independent work processes can be scheduled by the CPU.
For interdependent work processes, the CPU can adopt an alternating scheduling mechanism. For example, if work process 3 and work process 4 in container 1 are interdependent, the operating system of the CPU may first schedule work process 3; when work process 3 runs to a certain node, it needs a signal from work process 4 to continue running, and if the CPU's resources are currently occupied, the CPU may schedule work process 3 out and schedule work process 4 in so that work process 4 runs. At this point, the scheduled work process 4 can give work process 3 the signal to continue running, and the CPU may then schedule work process 4 out and schedule work process 3 back in so that work process 3 continues. By cycling in this way, work process 3 and work process 4 are scheduled to run in turn, which avoids the problem of one work process occupying GPU resources without releasing them, and thus avoids deadlock.
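By way of illustration only, the following Python sketch models the alternating scheduling of two interdependent work processes with generators: whenever a work process must wait for its peer's signal it yields, the scheduler schedules it out and schedules the peer in, so neither holds resources while waiting. The structure is an assumption made for clarity, not the application's actual scheduler:

```python
def alternate_schedule(worker_a, worker_b):
    """Run two interdependent work processes in turn until both finish."""
    pending = [worker_a, worker_b]
    while pending:
        current = pending.pop(0)          # schedule one work process in
        try:
            next(current)                 # run until it needs the other's signal
            pending.append(current)       # schedule it out; give the other a turn
        except StopIteration:
            pass                          # this work process has finished

def worker(name, steps):
    for step in range(steps):
        print(f"{name} runs step {step}, then waits for the peer's signal")
        yield                             # point at which the peer's signal is needed

alternate_schedule(worker("work process 3", 2), worker("work process 4", 2))
```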
For mutually independent work processes, the CPU can schedule them according to resource occupancy. For example, if work process 3 and work process 4 in container 1 are independent of each other, the operating system of the CPU may first schedule work process 3 and let it run. If the CPU has remaining resources, work process 4 can then also be scheduled. If the CPU has no remaining resources, the CPU may either schedule work process 3 out and schedule work process 4 in, or schedule work process 4 after work process 3 has finished running and released its resources.
In another implementation, the interdependent work processes are handed to the CPU for processing, and the independent work processes are handed to the GPU for processing. The manner in which the CPU processes the interdependent work processes is the same as in the previous implementation and is not repeated here. As for the GPU processing the independent work processes, since the GPU does not have the capability of scheduling work processes in turn, it waits for the previous work process to finish running and release its resources before letting the next work process run.
It should also be understood that the control logic is determined by the CPU, but the present application does not limit the specific executor of the data transmission, which may be a direct memory access (DMA) engine, a network card, a GPU program, or the CPU itself.
Based on the above scheme, on one hand, a virtualized vGPU instance is provided to each container, through virtualization technology, according to the GPUs that the container can call, so that each container can access only the GPUs it is allowed to call, and GPU isolation between containers is ensured. On the other hand, the N GPUs are mounted on every container, so the preset links between GPUs mounted on different containers are not disabled when the containers are started; that is, the preset links between the GPUs are not blocked, and the GPUs are allowed to communicate with one another over the preset links. Since the preset link between the GPUs can be designed as a high-speed communication link, the GPUs are allowed to communicate using the high-speed communication link. Because the communication efficiency of shared memory or network transmission is far lower than that of a high-speed link, the communication speed can be greatly improved, the execution efficiency is improved, and good communication performance is ensured. In addition, the control logic of the collective communication library is offloaded from the GPU to the CPU, which avoids the deadlock that may occur when multiple work processes multiplex the same GPU.
The method provided by the embodiment of the present application is described in detail above with reference to fig. 2 to 3. Hereinafter, the apparatus provided in the embodiment of the present application will be described in detail with reference to fig. 4 to 5.
Fig. 4 is a schematic block diagram of an apparatus provided by an embodiment of the present application. As shown in fig. 4, the apparatus 400 may include: a control module 410 and a virtualization module 420. The modules in the apparatus 400 can be used to implement the corresponding flow of the CPU in the method 200 shown in fig. 2. For example, the control module 410 may be used to perform steps 201 and 203 in the method 200, and the virtualization module 420 may be used to perform step 202 in the method 200.
Specifically, the control module 410 may be configured to mount N GPUs on each of a plurality of containers, where a preset link exists between the N GPUs, and N is an integer greater than 1; virtualization module 420 may be configured to virtualize GPUs that each container can call, resulting in one or more vGPU instances corresponding to each container; control module 410 is further configured to provide the virtualized vGPU instance to the corresponding container for use.
Optionally, the control module 410 may be further configured to inject a vGPU runtime into each container, and the vGPU runtime injected into each container is configured to virtualize the GPU that can be invoked.
Optionally, the control module 410 may be specifically configured to hijack a call to the first application program interface API through the vGPU runtime injected into each container, and provide a second API, where the first API is a GPU user mode API or a GPU kernel driver API provided by a GPU vendor, and the second API is used to call the vGPU instance in each container.
Optionally, the vGPU runtime provides functionality to inject mapping relationships or to modify GPU runtime environment variables.
Optionally, the control module 410 may be further configured to provide a mapping relationship to the vGPU runtime of each container, where the mapping relationship is used to indicate the GPUs that each container can invoke.
Optionally, the GPU mounted to each container of the plurality of containers is configured by the system or configured by the user.
Optionally, each container of the plurality of containers includes one or more work processes, each work process being provided with one or more vGPU instances.
Optionally, the control module 410 may be further configured to schedule a work process based on control logic in the communication library, so that the work process invokes a resource in the CPU to perform a computation in response to the scheduling of the CPU.
It should be understood that the division of the modules in the embodiments of the present application is illustrative, and is only one logical function division, and there may be other division manners in actual implementation. In addition, functional modules in the embodiments of the present application may be integrated into one processor, may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
Fig. 5 is another schematic block diagram of an apparatus provided by an embodiment of the present application. The apparatus 500 may be used to implement the functions of the CPU in the method 200 described above. The apparatus 500 may be a system-on-a-chip. In the embodiment of the present application, the chip system may be composed of a chip, and may also include a chip and other discrete devices.
As shown in fig. 5, the apparatus 500 may include at least one processor 510 for implementing the functions of the CPU in the method 200 provided by the embodiment of the present application.
Illustratively, when the apparatus 500 is used to implement the functions of the CPU in the method 200 provided in the embodiment of the present application, the processor 510 may be configured to mount N GPUs on each of a plurality of containers, where a preset link exists between the N GPUs, and N is an integer greater than 1; virtualizing the GPU which can be called by each container to obtain one or more vGPU instances corresponding to each container; and providing the virtualized vGPU instance for the corresponding container to use. For details, reference is made to the detailed description in the method example, which is not repeated herein.
The apparatus 500 may also include at least one memory 520 for storing program instructions and/or data. The memory 520 is coupled to the processor 510. The coupling in the embodiments of the present application is an indirect coupling or communication connection between devices, units or modules, and may be in an electrical, mechanical or other form, which is used for information interaction between the devices, units or modules. The processor 510 may operate in conjunction with the memory 520. Processor 510 may execute program instructions stored in memory 520. At least one of the at least one memory may be included in the processor.
The apparatus 500 may also include a communication interface 530 for communicating with other devices over a transmission medium, such that the apparatus 500 may communicate with other devices. The communication interface 530 may be, for example, a transceiver, an interface, a bus, a circuit, or a device capable of performing a transceiving function. Processor 510 may utilize communication interface 530 to send and receive data and/or information and to implement the methods performed by the CPU in the corresponding embodiment of fig. 2.
The specific connection medium between the processor 510, the memory 520 and the communication interface 530 is not limited in the embodiments of the present application. In fig. 5, the processor 510, the memory 520, and the communication interface 530 are connected by a bus. The bus lines are shown in fig. 5 as thick lines, and the connection between other components is merely illustrative and not intended to be limiting. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 5, but that does not indicate only one bus or one type of bus.
It should be understood that the processor in the embodiments of the present application may be an integrated circuit chip having signal processing capability. In implementation, the steps of the above method embodiments may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software. The processor may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software modules may be located in ram, flash, rom, prom, or eprom, registers, etc. as is well known in the art. The storage medium is located in a memory, and a processor reads information in the memory and combines hardware thereof to complete the steps of the method.
It will also be appreciated that the memory in the embodiments of the present application can be either volatile memory or nonvolatile memory, or can include both volatile and nonvolatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory can be a random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM). It should be noted that the memory of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The present application further provides a chip, where the chip includes at least one processor, and is configured to implement the functions related to the CPU in the embodiment shown in fig. 2.
In one possible design, the chip further includes a memory for storing program instructions and data, the memory being located within the processor or external to the processor.
The present application further provides a computing device, the computing device comprising: a processor, a memory, and a computer program stored on the memory and executable on the processor, the processor implementing the method of the embodiment shown in fig. 2 when executing the computer program.
The present application also provides a computer-readable storage medium having stored thereon a computer program (also referred to as code, or instructions). When executed, the computer program causes the computer to perform the method of the embodiment shown in fig. 2.
The present application also provides a computer program product comprising a computer program which, when executed, implements the method of the embodiment shown in fig. 2.
As used in this specification, the terms "unit," "module," and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution.
Those of ordinary skill in the art will appreciate that the various illustrative logical blocks and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application. In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, device, and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative; the division of the units is only one type of logical functional division, and other divisions may be adopted in practice, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in an electrical, mechanical, or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
In the above embodiments, the functions of the functional units may be fully or partially implemented by software, hardware, firmware, or any combination thereof. When implemented in software, the functions may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions (programs). The procedures or functions according to the embodiments of the present application are wholly or partially produced when the computer program instructions are loaded and executed on a computer. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another over a wired (e.g., coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless (e.g., infrared, radio, or microwave) network. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., digital video discs (DVDs)), or semiconductor media (e.g., solid state drives (SSDs)).
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the part thereof that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc.
The above description covers only specific embodiments of the present application, but the scope of the present application is not limited thereto. Any changes or substitutions that a person skilled in the art could readily conceive of within the technical scope disclosed in the present application shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A method for device management, the method comprising:
mounting N graphics processing units (GPUs) on each container in a plurality of containers, wherein a preset link exists between the N GPUs, and N is an integer greater than 1;
virtualizing the GPU that can be called by each container to obtain one or more virtual GPU (vGPU) instances corresponding to each container; and
providing the virtualized vGPU instances for the corresponding containers to use.
2. The method of claim 1, wherein the method further comprises:
injecting a vGPU runtime into each container, wherein the vGPU runtime injected into each container is used to virtualize the GPU that can be called.
3. The method of claim 2, wherein the providing virtualized vGPU instances to corresponding containers for use comprises:
hijacking a call to a first application program interface (API) through the vGPU runtime injected into each container, and providing a second API, wherein the first API is a GPU user-mode API or a GPU kernel driver API provided by a GPU manufacturer, and the second API is used to call the vGPU instances in each container.
4. The method of claim 2 or 3, wherein the vGPU runtime provides a function of injecting a mapping relation or a function of modifying GPU runtime environment variables.
5. The method of claim 2 or 3, wherein the method further comprises:
providing a mapping relation to the vGPU runtime of each container, wherein the mapping relation is used to indicate the GPU that can be called by each container.
6. The method of claim 1, wherein the GPU mounted for each container in the plurality of containers is configured by a system or by a user.
7. The method of claim 1, wherein each container of the plurality of containers comprises one or more worker processes, each worker process being provided with one or more vGPU instances.
8. The method of claim 7, applied to a central processing unit (CPU), the method further comprising:
scheduling a worker process based on control logic in a communication library, so that the worker process responds to the scheduling of the CPU and calls resources in the CPU to perform computation.
9. A device management apparatus, comprising means for performing the method of any of claims 1 to 8.
10. A device management apparatus, comprising a processor configured to execute program code to cause the apparatus to implement the method of any one of claims 1 to 8.
11. A chip, comprising: at least one processor configured to implement the functions involved in the method of any one of claims 1 to 8.
12. A computing device, comprising: a processor, a memory, and a computer program stored on the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the method of any one of claims 1 to 8.
13. A computer program product, characterized in that it comprises a computer program which, when executed, implements the method according to any one of claims 1 to 8.
14. A computer-readable storage medium, in which a computer program is stored which, when executed, implements the method of any one of claims 1 to 8.
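For orientation only, the following Python sketch illustrates one way the flow claimed above could be arranged in code: a group of linked GPUs is mounted into several containers, each callable GPU is carved into vGPU instances with a memory quota, and an injected vGPU runtime intercepts a vendor allocation call (the first API) and exposes a quota-aware replacement (the second API). All names here (VGPUInstance, VGPURuntime, mount_and_virtualize, fake_vendor_alloc) are hypothetical illustrations and are not taken from the application; this is a hedged sketch, not the patented implementation.

# Hedged sketch of the method of claims 1 to 5; hypothetical names throughout.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class VGPUInstance:
    # One virtual slice of a physical GPU, tracked by its remaining memory quota.
    physical_gpu: int
    memory_quota_mib: int

@dataclass
class Container:
    container_id: str
    mounted_gpus: List[int] = field(default_factory=list)   # the N linked GPUs (N > 1)
    vgpus: List[VGPUInstance] = field(default_factory=list)

class VGPURuntime:
    # Injected per container; stands in for hijacking the vendor GPU API (the
    # first API) and exposing a second API bound to the container's own vGPUs.
    def __init__(self, container: Container, vendor_api: Dict[str, Callable]):
        self.container = container
        self.vendor_api = vendor_api  # e.g. {"alloc": <real vendor allocation call>}

    def alloc(self, size_mib: int) -> str:
        # Second API: serve the request from a vGPU of this container that still
        # has quota, then forward to the intercepted vendor call on its physical GPU.
        for vgpu in self.container.vgpus:
            if vgpu.memory_quota_mib >= size_mib:
                vgpu.memory_quota_mib -= size_mib
                return self.vendor_api["alloc"](vgpu.physical_gpu, size_mib)
        raise MemoryError("no vGPU instance with enough remaining quota")

def mount_and_virtualize(containers: List[Container], gpu_ids: List[int],
                         slices_per_gpu: int, gpu_memory_mib: int) -> None:
    # Mount the same group of linked GPUs on every container, then virtualize
    # each callable GPU into vGPU instances for that container; the per-container
    # mounted_gpus/vgpus state loosely plays the role of the mapping relation.
    for c in containers:
        c.mounted_gpus = list(gpu_ids)
        for gpu in c.mounted_gpus:
            c.vgpus.append(VGPUInstance(gpu, gpu_memory_mib // slices_per_gpu))

if __name__ == "__main__":
    def fake_vendor_alloc(gpu: int, size: int) -> str:
        return f"allocated {size} MiB on physical GPU {gpu}"

    tenants = [Container("c0"), Container("c1")]
    mount_and_virtualize(tenants, gpu_ids=[0, 1], slices_per_gpu=2, gpu_memory_mib=16384)
    runtime = VGPURuntime(tenants[0], {"alloc": fake_vendor_alloc})
    print(runtime.alloc(4096))  # served by the first vGPU slice with enough quota

In a real deployment the interception would take place inside the GPU user-mode or kernel driver API rather than in Python; the sketch only shows the data flow of mounting, virtualizing, and redirecting calls to per-container vGPU instances.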
CN202210294026.6A 2022-03-23 2022-03-23 Equipment management method and device Pending CN114625482A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210294026.6A CN114625482A (en) 2022-03-23 2022-03-23 Equipment management method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210294026.6A CN114625482A (en) 2022-03-23 2022-03-23 Equipment management method and device

Publications (1)

Publication Number Publication Date
CN114625482A true CN114625482A (en) 2022-06-14

Family

ID=81905156

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210294026.6A Pending CN114625482A (en) 2022-03-23 2022-03-23 Equipment management method and device

Country Status (1)

Country Link
CN (1) CN114625482A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116578413A (en) * 2023-04-26 2023-08-11 中国人民解放军92942部队 Signal-level simulation model clouding method based on cloud+end architecture
CN116578413B (en) * 2023-04-26 2024-04-12 中国人民解放军92942部队 Signal-level simulation model clouding method based on cloud+end architecture
CN117234741A (en) * 2023-11-14 2023-12-15 苏州元脑智能科技有限公司 Resource management and scheduling method and device, electronic equipment and storage medium
CN117234741B (en) * 2023-11-14 2024-02-20 苏州元脑智能科技有限公司 Resource management and scheduling method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
US7913226B2 (en) Interposing a virtual machine monitor and devirtualizing computer hardware at runtime
US10275288B2 (en) Virtualization manager for reconfigurable hardware accelerators
CN109144688B (en) Method and device for task scheduling on heterogeneous multi-core reconfigurable computing platform
JP5323897B2 (en) Method, apparatus and system for bi-directional communication between a virtual machine monitor and an ACPI compliant guest operating system
US7421533B2 (en) Method to manage memory in a platform with virtual machines
CN105830026B (en) Apparatus and method for scheduling graphics processing unit workload from virtual machines
US9176713B2 (en) Method, apparatus and program storage device that provides a user mode device interface
US20050216920A1 (en) Use of a virtual machine to emulate a hardware device
CN114625482A (en) Equipment management method and device
EP3161621B1 (en) Cloud firmware
JP2006024214A (en) System and method of carrying out soft partitioning of computer system
US9131031B2 (en) Virtual computer system, virtual computer management program, and MAC address management method
US20170277573A1 (en) Multifunction option virtualization for single root i/o virtualization
US9063805B2 (en) Method and system for enabling access to functionality provided by resources outside of an operating system environment
US20200264941A1 (en) Dynamically loadable unikernel binaries
CN114035842A (en) Firmware configuration method, computing system configuration method, computing device and equipment
US11803643B2 (en) Boot code load system
US11003618B1 (en) Out-of-band interconnect control and isolation
KR102176298B1 (en) Method and device for use to access the container
WO2018103372A1 (en) Driver management method and host machine
US20190370019A1 (en) Conflict resolution for strong symbols
CN114281529A (en) Distributed virtualized client operating system scheduling optimization method, system and terminal
US20150277978A1 (en) Network processor for managing a packet processing acceleration logic circuitry in a networking device
Goossens et al. Run-time middleware to support real-time system scenarios
US10922149B2 (en) System comprising a plurality of virtualization systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination