CN110597635A

CN110597635A - Method and device for distributing graphics processing resources, computer equipment and storage medium

Info

Publication number: CN110597635A
Application number: CN201910868303.8A
Authority: CN
Inventors: 查冲
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2019-09-12
Filing date: 2019-09-12
Publication date: 2019-12-20
Anticipated expiration: 2039-09-12
Also published as: CN110597635B

Abstract

The application relates to a method, an apparatus, a computer device, and a storage medium for allocating graphics processing resources. The method comprises: when a graphics processing resource mounted on a first object is in an idle state and meets an unmount condition, unmounting the graphics processing resource from the first object; allocating the unmounted graphics processing resource to a second object requesting graphics processing resources; when a graphics processing resource request sent by the first object is received, acquiring a graphics processing resource that is not currently mounted and that satisfies the request; and mounting the acquired graphics processing resource on the first object. The scheme of the application can improve the utilization rate of graphics processing resources.

Description

Method and device for distributing graphics processing resources, computer equipment and storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and an apparatus for allocating graphics processing resources, a computer device, and a storage medium.

Background

With the rapid development of science and technology, various advanced technologies are emerging continuously. Graphics processing units (GPUs) are becoming increasingly popular because of their computing power, and are used for computation in a variety of scenarios, for example artificial intelligence model training.

In the conventional method, GPU resources are allocated to fixed objects. A GPU assigned to a fixed object is always occupied exclusively by that object, whether the GPU is idle or in use. This results in low utilization of GPU resources.

Disclosure of Invention

Accordingly, it is desirable to provide a method, an apparatus, a computer device and a storage medium for allocating graphics processing resources, which solve the problem of relatively low utilization of graphics processing resources in the conventional method.

A method of graphics processing resource allocation, the method comprising:

when a graphics processing resource mounted on a first object is in an idle state and meets an unmount condition, unmounting the graphics processing resource from the first object;

allocating the unmounted graphics processing resource to a second object requesting graphics processing resources;

when a graphics processing resource request sent by the first object is received, acquiring a graphics processing resource that is not currently mounted and that satisfies the graphics processing resource request;

and mounting the acquired graphics processing resource on the first object.

In one embodiment, the method further comprises:

querying the graphics processing resources which are mounted on the first object and are in an idle state;

detecting the duration for which the graphics processing resource has been in the idle state;

and when the duration exceeds a duration threshold, determining that the graphics processing resource meets the unmount condition.

In one embodiment, the obtaining graphics processing resources that are not currently mounted and that satisfy the graphics processing resource request comprises:

searching for the graphics processing resources which are configured on the local machine and are not mounted;

and screening the graphics processing resources meeting the graphics processing resource request from the searched graphics processing resources.

In one embodiment, the first object and the second object are different containers for model training;

the obtaining the graphics processing resource which is not mounted currently and meets the graphics processing resource request further comprises:

when the graphics processing resources found on the local machine cannot satisfy the graphics processing resource request, screening containers carrying an over-sale mark from a container management center, the container management center comprising containers created for the graphics processing resources of the machines in the cluster;

determining a target container from the screened containers, wherein the graphics processing resources mounted in the target container satisfy the graphics processing resource request;

and releasing the graphics processing resources mounted in the target container.

In one embodiment, the determining the target container from the screened containers comprises:

acquiring the expected model training completion time corresponding to each screened container;

and selecting a target container that satisfies the graphics processing resource request in order of expected model training completion time, from farthest from the current time to nearest to it.

In one embodiment, the determining the target container from the screened containers further comprises:

for containers whose expected model training completion times are equally far from the current time, acquiring the resource utilization rate corresponding to each of those containers;

and screening the target container from those containers in order of resource utilization rate, from low to high.
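The two ordering rules above can be read as a single sort with a tie-break. The following is a minimal sketch under stated assumptions, not the patented implementation: the `Candidate` record, its field names, and `pick_target` are hypothetical, and "equally far from the current time" is interpreted as candidates having identical expected completion times.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    container_id: str
    expected_completion: float  # hypothetical: expected training completion time (epoch seconds)
    utilization: float          # hypothetical: resource utilization rate in [0, 1]
    card_count: int             # number of GPU cards mounted in the container


def pick_target(candidates, needed_cards):
    """Return the container to release, or None if no candidate can satisfy the request."""
    eligible = [c for c in candidates if c.card_count >= needed_cards]
    if not eligible:
        return None
    # Farthest expected completion first; ties broken by lowest utilization.
    eligible.sort(key=lambda c: (-c.expected_completion, c.utilization))
    return eligible[0]
```

Sorting on the pair `(-expected_completion, utilization)` keeps both rules in one place. A different tie tolerance, for example treating completion times within the same minute as equal, would need the first key to be rounded beforehand.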

In one embodiment, the method further comprises:

marking a breakpoint for the model training in the target container;

and when an unmounted graphics processing resource exists, mounting the unmounted graphics processing resource on the target container again, so that the target container continues the model training from the breakpoint mark.
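As an illustration of breakpoint marking and resumption, the sketch below treats the breakpoint as the index of the next training step to run. The function `train`, its parameters, and the step-based notion of a breakpoint are assumptions made for this example; the source does not specify how the breakpoint is recorded.

```python
def train(total_steps, start_step, gpu_available):
    """Run steps while a GPU is mounted; return the breakpoint (next step to run)."""
    step = start_step
    while step < total_steps and gpu_available(step):
        step += 1  # stand-in for one real training step
    return step


# The GPU is released when the container reaches step 3, then remounted later.
breakpoint_mark = train(10, 0, lambda s: s < 3)
finished_at = train(10, breakpoint_mark, lambda s: True)
```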

In one embodiment, the graphics processing resource is a graphics processing resource card; the method further comprises the following steps:

when the local machine is restarted, acquiring a registration file corresponding to the local machine, wherein the registration file records the correspondence between the serial number of each graphics processing resource card configured in the local machine and the identifier of the container on which that graphics processing resource card is mounted;

and according to the correspondence, restoring the mounting of each graphics processing resource card configured in the local machine on the container represented by the identifier corresponding to the serial number of that card.
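A registration file of this kind could be as simple as a serialized mapping from card serial number to container identifier. The sketch below assumes a JSON file and a caller-supplied `mount_fn` callback; the file format, function names, and callback are hypothetical and are not taken from the source.

```python
import json


def save_registration(path, serial_to_container):
    """Persist the card serial number -> container identifier mapping."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(serial_to_container, f)


def restore_mounts(path, mount_fn):
    """After a restart, re-mount every recorded card on its recorded container."""
    with open(path, encoding="utf-8") as f:
        serial_to_container = json.load(f)
    for serial, container_id in serial_to_container.items():
        mount_fn(serial, container_id)  # hypothetical hook that performs the actual mount
    return serial_to_container
```

For the mapping to be useful after a restart, the file would have to be rewritten whenever a card is mounted or unmounted, so that it always reflects the latest state.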

In one embodiment, the method further comprises:

when a container carrying an over-sale mark exists in the container management center and the first object fails to mount the graphics processing resource, generating monitoring alarm information;

and sending the monitoring alarm information.

In one embodiment, the graphics processing resource is a graphics processing resource card; the graphics processing resource request comprises the number of graphics processing resource cards;

the acquiring the currently unmounted graphics processing resource which meets the graphics processing resource request comprises:

acquiring the graphics processing resource cards which are not mounted currently and meet the quantity;

the loading the acquired graphics processing resources to the first object comprises:

and mounting the acquired graphics processing resource card on the first object.

In one embodiment, the first object and the second object are containers used for artificial intelligence training of a game character model; the game character model is used for representing characters in a game scene.

A graphics processing resource allocation apparatus, the apparatus comprising:

an unmounting module, configured to unmount a graphics processing resource from a first object when the graphics processing resource mounted on the first object is in an idle state and meets an unmount condition;

a resource allocation module, configured to allocate the unmounted graphics processing resource to a second object requesting graphics processing resources;

a resource acquisition module, configured to acquire, when a graphics processing resource request sent by the first object is received, a graphics processing resource that is not currently mounted and that satisfies the graphics processing resource request;

wherein the resource allocation module is further configured to mount the acquired graphics processing resource on the first object.

A computer device comprising a memory and a processor, the memory having stored therein a computer program that, when executed by the processor, causes the processor to perform the steps of:

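when a graphics processing resource mounted on a first object is in an idle state and meets an unmount condition, unmounting the graphics processing resource from the first object;

allocating the unmounted graphics processing resource to a second object requesting graphics processing resources;

when a graphics processing resource request sent by the first object is received, acquiring a graphics processing resource that is not currently mounted and that satisfies the graphics processing resource request;

and mounting the acquired graphics processing resource on the first object.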

A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, causes the processor to perform the steps of:

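when a graphics processing resource mounted on a first object is in an idle state and meets an unmount condition, unmounting the graphics processing resource from the first object;

allocating the unmounted graphics processing resource to a second object requesting graphics processing resources;

when a graphics processing resource request sent by the first object is received, acquiring a graphics processing resource that is not currently mounted and that satisfies the graphics processing resource request;

and mounting the acquired graphics processing resource on the first object.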

According to the graphics processing resource allocation method, apparatus, computer device, and storage medium described above, when a graphics processing resource mounted on a first object is in an idle state and meets the unmount condition, the graphics processing resource is unmounted from the first object, and the unmounted graphics processing resource is allocated to a second object requesting graphics processing resources. That is, when a graphics processing resource is idle, it is unmounted and allocated to another object for use. When the first object requests graphics processing resources again, a graphics processing resource that is not currently mounted and that satisfies the request is mounted on the first object. The graphics processing resources of the first object are thus restored: mounting is resumed when the resources are needed again, which ensures that the first object can use graphics processing resources normally. Mounting a resource while it is in use, and unmounting it while it is idle so that it can be allocated to other objects, realizes dynamic mounting and improves the utilization rate of graphics processing resources compared with the conventional method, in which graphics processing resources are always occupied exclusively by fixed objects.

Drawings

FIG. 1 is a diagram illustrating an application scenario of a method for allocating graphics processing resources in one embodiment;

FIG. 2 is a flow diagram that illustrates a method for allocating graphics processing resources, according to one embodiment;

FIG. 3 is a schematic diagram illustrating a method for allocating graphics processing resources in one embodiment;

FIG. 4 is a diagram illustrating an exemplary scenario for implementing a method for allocating graphics processing resources according to an embodiment;

FIG. 5 is a simplified diagram of a container management center recording information in one embodiment;

FIG. 6 is a simplified flowchart diagram of a method for allocating graphics processing resources, according to one embodiment;

FIG. 7 is a diagram illustrating a design structure for information used to request graphics processing resources, according to one embodiment;

FIG. 8 is a timing diagram of a graphics processing resource allocation method in one embodiment;

FIG. 9 is a block diagram of an apparatus for graphics processing resource allocation in one embodiment;

FIG. 10 is a block diagram of an apparatus for allocating graphics processing resources in another embodiment;

FIG. 11 is a block diagram of a computer device in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

FIG. 1 is a diagram illustrating an application scenario of a method for allocating graphics processing resources in one embodiment. Referring to FIG. 1, the application scenario includes a first object 110, a second object 120, and a computer device 130. Both the first object 110 and the second object 120 are objects requesting graphics processing resources. The computer device 130 is a machine that provides graphics processing resources. The first object 110, the second object 120, and the computer device 130 may each be implemented as a separate server or as a server cluster composed of a plurality of physical servers.

In one embodiment, computer device 130 may be a single machine in a cluster.

When a graphics processing resource mounted on the first object 110 is in an idle state and the unmount condition is satisfied, the computer device 130 may unmount the graphics processing resource from the first object 110. The computer device 130 may allocate the unmounted graphics processing resource to the second object 120 requesting graphics processing resources. When receiving a graphics processing resource request sent by the first object 110, the computer device 130 may acquire a graphics processing resource that is not currently mounted and that satisfies the request, and mount the acquired graphics processing resource on the first object 110.

FIG. 2 is a flowchart illustrating a method for allocating graphics processing resources according to an embodiment. The method in this embodiment may be applied to a computer device, and is mainly illustrated using the computer device 130 in FIG. 1. Referring to FIG. 2, the method specifically includes the following steps:

S202, when the graphics processing resource mounted on the first object is in an idle state and meets the unmount condition, the graphics processing resource is unmounted from the first object.

An object here refers to an entity that needs a mounted graphics processing resource to complete corresponding processing. It should be noted that the terms "first object" and "second object" in the embodiments of the present application are used only to distinguish different objects.

In one embodiment, the first object may comprise an object for model training. The model training can be basic machine learning model training or artificial intelligence model training.

The first object may also be another kind of object that requires graphics processing resources, and is not limited to an object that performs model training.

The graphics processing resource, namely the GPU (Graphics Processing Unit) resource, converts and drives the display information required by the computer system, provides line-scanning signals to the display, and controls the display so that it displays correctly. It is an important component connecting the display and the PC motherboard, and one of the important resources for human-machine interaction.

In one embodiment, the graphics processing resources may be graphics processing resource cards, i.e., GPU cards. That is, graphics processing resources are carried and provided in the form of resource cards. Accordingly, step S202 may be: when a graphics processing resource card mounted on the first object is in an idle state and the unmount condition is satisfied, the graphics processing resource card is unmounted from the first object.

An idle state refers to a state in which graphics processing resources are unused. It will be appreciated that a graphics processing resource mounted on the first object, even when in the idle state, is still occupied exclusively by the first object and cannot be used by other objects.

It will be appreciated that in some processes there may be periods of time when graphics processing resources are not required. For example, when graphics processing resources are used for model training in artificial intelligence learning, they are idle during data-layer preparation such as data cleaning or network transmission.

Specifically, the computer device may detect, by polling, the usage states of the graphics processing resources mounted on the first object, the states including an in-use state and an idle state. The computer device may screen out the graphics processing resources of the first object that are in the idle state, determine whether those resources satisfy the unmount condition, and unmount them from the first object when the unmount condition is satisfied.

It will be appreciated that the un-mounted graphics processing resource is no longer occupied by the first object, and that the graphics processing resource that is not mounted can be allocated for use by objects other than the first object.

The unmount condition refers to a condition for unmounting a graphics processing resource in the idle state from the first object. It will be appreciated that unmounting the graphics processing resource from the first object means that the first object no longer occupies the part of its graphics processing resources that is idle and satisfies the unmount condition.

In one embodiment, the unmount condition may include that the duration for which a graphics processing resource mounted on the first object has been in an idle state exceeds a time threshold.

In one embodiment, the method further comprises: querying the graphics processing resources which are mounted on the first object and are in an idle state; detecting the duration for which the graphics processing resource has been in the idle state; and when the duration exceeds the duration threshold, determining that the graphics processing resource meets the unmount condition.

The duration is the length of time for which the resource has been continuously in the idle state. For example, if GPU resource A is detected to have been idle continuously for 30 s, then 30 s is the idle duration of GPU resource A.

Specifically, the computer device obtains the duration for which the graphics processing resource has been in the idle state, compares the duration with a preset time threshold, and determines that the idle graphics processing resource meets the unmount condition when the duration is greater than or equal to the preset time threshold. An unmount (umount) operation may then be performed on the graphics processing resource.
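The polling and threshold comparison described above can be sketched as a small tracker that remembers when each resource was first seen idle. This is a minimal illustration under assumptions: the `IdleTracker` class, its method names, and the injectable `clock` are hypothetical, and how the in-use or idle state is actually sampled from the hardware is outside the sketch.

```python
import time


class IdleTracker:
    """Tracks continuous idle time per GPU and reports when the unmount condition is met."""

    def __init__(self, threshold_s, clock=time.monotonic):
        self.threshold_s = threshold_s
        self.clock = clock
        self._idle_since = {}  # gpu id -> time at which the current idle period began

    def observe(self, gpu_id, in_use):
        """Record one polling sample; return True if the GPU may be unmounted."""
        if in_use:
            # Any use resets the continuous idle period.
            self._idle_since.pop(gpu_id, None)
            return False
        start = self._idle_since.setdefault(gpu_id, self.clock())
        # "Greater than or equal to the preset time threshold", as in the text above.
        return self.clock() - start >= self.threshold_s
```

Passing the clock in as a parameter keeps the threshold logic testable without real waiting; in a deployment the default monotonic clock would be used and `observe` would be called once per polling interval.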

In other embodiments, the unmount condition may also include the first object having finished using the graphics processing resources, which is equivalent to the first object no longer needing them. For example, if the first object is an object for artificial intelligence model training, then once the first object completes the training it has finished using the graphics processing resource; the graphics processing resource then satisfies the unmount condition and is unmounted.

It is understood that the graphics processing resource mounted on the first object is a graphics processing resource that can be over-sold. Over-selling refers to unmounting a graphics processing resource from the object on which it is mounted and then allocating it again for mounting. Graphics processing resources that can be over-sold can be unmounted from the objects on which they are mounted and reallocated to be mounted on other objects. Thus, graphics processing resources unmounted from the first object may be allocated to a second object requesting graphics processing resources.

S204, the unmounted graphics processing resource is allocated to a second object requesting graphics processing resources.

Wherein the second object is an object that requests graphics processing resources and is different from the first object.

It will be appreciated that the un-mounted graphics processing resource is no longer occupied by the first object and can therefore be used to allocate to other objects requiring graphics processing resources.

Specifically, the second object may send a graphics processing resource request to the computer device. Upon receiving the request, the computer device may allocate the unmounted graphics processing resource to the second object, so that this part of the graphics processing resources is put to reasonable use.

In one embodiment, the computer device may re-mount the un-mounted graphics processing resource to the second object requesting the graphics processing resource. In this way, the second object can occupy the portion of the graphics processing resource, thereby implementing the corresponding processing using the portion of the graphics processing resource.

S206, when receiving the graphics processing resource request sent by the first object, obtaining the graphics processing resource which is not mounted currently and satisfies the graphics processing resource request.

It will be appreciated that the first object may subsequently require graphics processing resources. For example, after the preparation stage of the data layer, such as data cleaning or network transmission, the first object may need the graphics processing resource for subsequent processing. Therefore, the first object may send a graphics processing resource request to the computer device to request graphics processing resources again.

When receiving a graphics processing resource request sent by a first object, the computer device may select a graphics processing resource that satisfies the graphics processing resource request from among the unmounted graphics processing resources.

It is understood that a graphics processing resource that is not currently mounted refers to a graphics processing resource that is not mounted for use by an object.

It should be noted that the computer device may directly search the local machine for an unmounted graphics processing resource that satisfies the graphics processing resource request. The computer device may also obtain, directly from the container management center, a graphics processing resource that is not currently mounted and that satisfies the request. Alternatively, the computer device may search the local machine first and, when the unmounted graphics processing resources on the local machine cannot satisfy the request, obtain a graphics processing resource that is not currently mounted and that satisfies the request from a container in the container management center.
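The third option, searching the local machine first and falling back to the container management center, can be sketched as follows. The function `acquire_cards` and the `center_lookup` callback are hypothetical names; the callback stands in for whatever network call the container management center exposes, which the source does not describe.

```python
def acquire_cards(count, local_free, center_lookup):
    """Return (source, cards); source is "local", "center", or None."""
    if len(local_free) >= count:
        # Enough unmounted cards on the local machine: no network round trip needed.
        return "local", local_free[:count]
    cards = center_lookup(count)  # hypothetical call to the container management center
    if cards:
        return "center", cards
    return None, []
```

For example, `acquire_cards(1, ["g0", "g1"], lookup)` is served locally, while a request for three cards on a machine with one free card is passed to `lookup`.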

It is to be understood that the graphics processing resource unmounted from the first object in step S202 is a graphics processing resource provided locally by the computer device, i.e., the computer device provides the graphics processing resource to the first object. Therefore, when the first object requires graphics processing resources again, the computer device still performs the allocation of graphics processing resources.

A container is created for the graphics processing resources of each machine in the cluster and is used for storing the corresponding graphics processing resources. The container management center includes a plurality of containers.

S208, the acquired graphics processing resource is mounted on the first object.

Specifically, the computer device may perform mount operation processing on the acquired graphics processing resource, and further mount the acquired graphics processing resource on the first object, so that the first object may perform data processing according to the graphics processing resource mounted again.
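Steps S202 to S208 can be traced end to end with a toy single-machine allocator. This is a minimal sketch under assumptions, not the patented implementation: the `GpuAllocator` class, the card ids, and the object names are hypothetical, and the second object finishing its work before the first object asks again is an assumption made to keep the example to two cards.

```python
class GpuAllocator:
    """Toy single-machine allocator; card ids and object names are hypothetical."""

    def __init__(self, card_ids):
        self.free = list(card_ids)  # cards not mounted on any object
        self.mounted = {}           # card id -> object currently holding it

    def mount(self, obj, count=1):
        """S206/S208: take `count` unmounted cards and mount them on `obj`."""
        if count > len(self.free):
            return []               # the request cannot be met locally
        cards = [self.free.pop(0) for _ in range(count)]
        for card in cards:
            self.mounted[card] = obj
        return cards

    def unmount(self, card):
        """S202: release a card so that it can be allocated again."""
        del self.mounted[card]
        self.free.append(card)


alloc = GpuAllocator(["gpu0", "gpu1"])
first_cards = alloc.mount("first", 2)    # the first object holds both cards
alloc.unmount("gpu1")                    # S202: gpu1 has been idle past the threshold
second_cards = alloc.mount("second", 1)  # S204: the second object receives gpu1
alloc.unmount("gpu1")                    # assumption: the second object has finished
again = alloc.mount("first", 1)          # S206/S208: the first object gets a card back
```

The card returned to the first object is whichever card is unmounted at that moment; it need not be the card that was taken away.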

In the above method for allocating graphics processing resources, when a graphics processing resource mounted on a first object is in an idle state and meets the unmount condition, the graphics processing resource is unmounted from the first object, and the unmounted graphics processing resource is allocated to a second object requesting graphics processing resources. That is, when a graphics processing resource is idle, it is unmounted and allocated to another object for use. When the first object requests graphics processing resources again, a graphics processing resource that is not currently mounted and that satisfies the request is mounted on the first object. Graphics processing resources are thus returned to the first object, which ensures that the first object can use them normally; that is, mounting takes place when graphics processing resources are needed. Mounting a resource while it is in use, and unmounting it while it is idle so that it can be allocated to other objects, realizes dynamic mounting and improves the utilization rate of graphics processing resources compared with the conventional method, in which graphics processing resources are always occupied exclusively by fixed objects.

In addition, because graphics processing resources can be unmounted and allocated to other objects for use, the graphics processing resources are effectively over-sold. The price of graphics processing resources for a single object can therefore be reduced, which saves cost while still meeting normal demand for graphics processing resources.

Moreover, with this method, even though graphics processing resources are unmounted and reallocated, the first object's normal use of graphics processing resources is not affected. The experience of exclusive use is preserved, and the transition to dynamic mounting is seamless.

Finally, idle graphics processing resources are released and allocated to other objects that need them, so that objects with higher graphics processing resource requirements can be served and processing that demands high computing power can be supported (for example, artificial intelligence training that requires substantial GPU computing power).

In one embodiment, the acquiring the graphics processing resource that is not currently mounted and satisfies the graphics processing resource request in step S206 includes: searching for the graphics processing resources which are configured on the local machine and are not mounted; and screening the graphics processing resources meeting the graphics processing resource request from the searched graphics processing resources.

The local machine is the computer device itself. It is understood that graphics processing resources are configured on the computer device. An unmounted graphics processing resource refers to a graphics processing resource that is not mounted on any object.

Upon receiving a graphics processing resource request sent by the first object, the computer device may look up unmounted graphics processing resources among the graphics processing resources configured on the local machine. Further, the computer device may screen, from the resources found, the graphics processing resources that satisfy the graphics processing resource request.

In one embodiment, the graphics processing resource is a graphics processing resource card. The computer device is configured with a preset number of graphics processing resource cards. When the first object, from which a card has been unmounted, sends a graphics processing resource request to request graphics processing resources again, the computer device may search its own graphics processing resource cards for an unmounted card and, in response to the request, allocate the unmounted card it finds to the first object, so that a graphics processing resource card is mounted on the first object again.

In one embodiment, when the graphics processing resource found in the local computer cannot meet the graphics processing resource request, the graphics processing resource which is not currently mounted and meets the graphics processing resource request is obtained from a container in the container management center.

In the above embodiment, the graphics processing resource configured on the local device and not mounted is preferentially searched, so that compared with obtaining the graphics processing resource through network communication with the container management center, the efficiency of obtaining the graphics processing resource is improved, and the network communication resource is also saved.

In one embodiment, the first object and the second object are different containers for model training. Acquiring the graphics processing resource which is not currently mounted and satisfies the graphics processing resource request further comprises: when the graphics processing resources found on the local machine cannot satisfy the graphics processing resource request, screening containers carrying an over-sale mark from a container management center; determining a target container from the screened containers, wherein the graphics processing resources mounted in the target container satisfy the graphics processing resource request; and releasing the graphics processing resources mounted in the target container.

A container is created for the graphics processing resources of each machine in the cluster. It can be appreciated that when graphics processing resources are mounted in a container, model training can be performed using the mounted graphics processing resources. The target container is the container whose mounted graphics processing resources are to be released.

In one embodiment, the first object and the second object may be different containers for artificial intelligence model training.

The container management center is used for managing containers, namely the containers created for the graphics processing resources of the machines in the cluster.

It should be noted that one container may store one or more graphics processing resources. When the graphics processing resources are graphics processing resource cards, then one container may correspond to one or more graphics processing resource cards.

It is understood that a container is an aggregation of services: a lightweight, portable, self-contained software package that enables an application to run in the same manner anywhere, thereby achieving virtualization.

In one embodiment, the computer device may create the container using Docker. Docker is an open-source application container engine that allows developers to package their applications and dependencies into a portable container.

The over-sale mark is used for marking a container on which graphics processing resources capable of being over-sold are mounted. That is, graphics processing resources in a container carrying an over-sale mark can be over-sold. It will be appreciated that a container is registered with the container management center before it is created, and is marked for over-sale at that point.

Over-selling refers to the process of unmounting graphics processing resources from the object on which they are mounted and allocating them again. Graphics processing resources that can be over-sold can be unmounted from the object on which they are mounted and reused for resource allocation.

It should be noted that the container management center includes both containers carrying the over-sale mark and containers not carrying the over-sale mark. Graphics processing resources mounted in containers that do not carry an over-sale mark cannot be over-sold, i.e., cannot be unmounted from the object on which they are mounted and allocated again.

It will be appreciated that, since the graphics processing resources mounted on the first object may be unmounted, when the first object is a container for model training, the first object is a container carrying the over-sale mark. The term "detach" in the embodiments of the present application means to release the mount, i.e., to unmount.

Specifically, the container management center holds container information, which records the related information of each container. When the graphics processing resources found on the local machine cannot meet the graphics processing resource request, the computer device may send a message to the container management center to request graphics processing resources from it. The container management center can screen the containers carrying the over-sale mark according to the container information. Further, the container management center can determine a target container from the screened containers carrying the over-sale mark, where the graphics processing resources mounted in the target container satisfy the graphics processing resource request.

It is understood that there may be one or more target containers, and when there are multiple target containers, the graphics processing resources mounted by the multiple target containers are aggregated to satisfy the graphics processing resource request.
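The screening and aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the `Container` record, its field names, and the function `pick_target_containers` are hypothetical, and containers are taken in list order because no ranking is involved at this point.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Container:
    # Hypothetical record kept by the container management center.
    container_id: str
    oversale: bool  # True if the container carries the over-sale mark
    gpu_serials: List[str] = field(default_factory=list)  # cards mounted in the container


def pick_target_containers(containers: List[Container], cards_needed: int) -> List[Container]:
    """Return over-sale containers whose mounted cards together cover the request.

    An empty list is returned when the over-sale containers cannot cover it.
    """
    # Only containers carrying the over-sale mark (and holding cards) are candidates.
    candidates = [c for c in containers if c.oversale and c.gpu_serials]
    targets: List[Container] = []
    covered = 0
    for c in candidates:
        if covered >= cards_needed:
            break
        targets.append(c)
        covered += len(c.gpu_serials)
    return targets if covered >= cards_needed else []
```

With one over-sale container holding one card and another holding two, a request for two cards selects both containers, which matches the case where several target containers jointly satisfy the request.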

Further, the container management center may release the graphics processing resources in the target container. It can be understood that a released graphics processing resource is an unmounted graphics processing resource. The computer device may allocate the released graphics processing resource to the first object and mount it.

In one embodiment, the container management center may destroy the target container, thereby freeing up graphics processing resources in the target container.

FIG. 3 is a schematic diagram illustrating a method for allocating graphics processing resources in one embodiment. It is understood that FIG. 3 takes as an example the case in which the first object and the second object are containers for artificial intelligence model training (i.e., AI training containers, where AI is the English abbreviation of Artificial Intelligence) and the graphics processing resource is a GPU card (graphics processing resource card). Referring to FIG. 3, the mounting and unmounting processes are performed for the GPU cards of a single computer device. The single computer device is provided with 8 GPU cards and is the host machine of those GPU cards. After the host machine is initialized, all 8 GPU cards in the host machine are in an unmounted state and are uniformly managed by the kernel-mode part of a management and control program. The management and control program is divided into two layers, a user mode and a kernel mode: the kernel mode manages the GPU cards, and the user mode is mainly used for communicating with the container management center so that over-sold GPUs can be returned. When the AI training container 302 (corresponding to the first object) needs to use a GPU card, it applies to the management and control program for a GPU, i.e., sends a graphics processing resource request. The management and control program may allocate a GPU card in the unmounted state to the AI training container 302 and mount the allocated GPU card on the AI training container 302.

The management and control program may use a polling policy to determine whether a GPU card is idle. When polling finds that the GPU card mounted on the AI training container 302 is in an idle state and the idle time exceeds a configured time threshold, the management and control program executes an unmount (umount) operation; the unmounted GPU card is then held by the management and control program and may be reallocated when a request for a GPU card is received. Assuming that the AI training container 304 (corresponding to the second object) requests a GPU card from the management and control program, the GPU card unmounted from the AI training container 302 may be allocated to and mounted on the AI training container 304. When the AI training container 302, whose GPU card has been unmounted, again requires GPU card resources, the management and control program first looks for an unmounted GPU card that it holds. If such a GPU card exists, it is allocated to the AI training container 302 in response to the request. If there is no unmounted GPU card, or the existing unmounted GPU cards cannot meet the request, the management and control program may send a message to the container management center to apply for a GPU. The container management center can screen the containers carrying the over-sale mark from the multiple containers and then determine, from those containers, a target container that meets the GPU card requirement. The GPU card is then returned to the management and control program by destroying the target container. The returned GPU card is also in an unmounted state. The management and control program may then allocate the returned GPU card to the AI training container 302 and mount it. It should be noted that the AI training container 302 and the AI training container 304 run different model training services.
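The polling, unmounting, and reallocation behaviour of FIG. 3 can be modelled with a small in-memory sketch. Everything named here (`GpuManager`, `report_usage`, `poll`, the container labels) is hypothetical; timestamps are passed in explicitly so the example is deterministic, and actual mounting of devices into containers is not modelled.

```python
class GpuManager:
    """Hypothetical model of the management and control program's bookkeeping."""

    def __init__(self, serials, idle_threshold_s):
        self.unmounted = list(serials)  # all cards start unmounted after host initialization
        self.mounted = {}               # serial -> container id
        self.idle_since = {}            # serial -> time at which the card became idle
        self.idle_threshold_s = idle_threshold_s

    def request(self, container_id, count):
        """Mount `count` unmounted cards on the container, or return None if too few exist."""
        if count > len(self.unmounted):
            return None  # the caller would then ask the container management center
        granted = [self.unmounted.pop(0) for _ in range(count)]
        for serial in granted:
            self.mounted[serial] = container_id
        return granted

    def report_usage(self, serial, busy, now):
        """Record whether a card is busy; remember when an idle period started."""
        if busy:
            self.idle_since.pop(serial, None)
        else:
            self.idle_since.setdefault(serial, now)

    def poll(self, now):
        """Unmount every mounted card whose idle time exceeds the threshold."""
        released = []
        for serial, since in list(self.idle_since.items()):
            if serial in self.mounted and now - since > self.idle_threshold_s:
                del self.mounted[serial]
                del self.idle_since[serial]
                self.unmounted.append(serial)
                released.append(serial)
        return released
```

In this sketch a card that has been idle for longer than the threshold is moved back to the unmounted pool by `poll`, after which a request from a second container can be served from that pool, while a later request from the first container returns `None` and would trigger the fallback to the container management center.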

Note that the AI training containers in FIG. 3 are of the same nature as the containers in the container management center. The AI training containers are drawn separately from the container management center for illustration purposes only, to show the principle of the graphics processing resource allocation method. In essence, each container in the container management center may mount graphics processing resources for AI model training, and different containers may be used to train different AI models.

In the above embodiment, when the graphics processing resources found on the local machine cannot satisfy the graphics processing resource request, graphics processing resources may be released from a container carrying the over-sale mark, so that graphics processing resources are returned to the first object.

In one embodiment, the method further comprises: carrying out breakpoint marking on model training in the target container; and when the unmounted graphic processing resource exists, the unmounted graphic processing resource is mounted to the target container again, so that the target container continues model training according to the breakpoint mark.

Specifically, after the graphics processing resources mounted on the target container are released, the model training performed in the target container is interrupted because there is no graphics processing resource, and at this time, breakpoint marking may be performed on the model training in the target container to mark the stage of the model training in the target container. When the unmounted graphics processing resources appear again, the unmounted graphics processing resources can be mounted to the target container again, so that the target container continues to perform model training according to the breakpoint markers.

It can be understood that unmounted graphics processing resources may become available again when mounted graphics processing resources are in an idle state, satisfy the unmount condition, and are unmounted.
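The breakpoint marking can be pictured with the following sketch. It assumes, purely for illustration, that training progress is counted in steps and that the breakpoint mark is the index of the last completed step; `TrainingJob` and its attributes are hypothetical names, not part of the described method.

```python
class TrainingJob:
    """Hypothetical training job that resumes from a breakpoint mark."""

    def __init__(self, total_steps):
        self.total_steps = total_steps
        self.breakpoint_step = 0  # breakpoint mark: number of steps already completed
        self.has_gpu = False      # True while a graphics processing resource is mounted

    def run(self, max_steps):
        """Train for up to `max_steps` steps; return how many steps were executed."""
        if not self.has_gpu:
            return 0  # training is interrupted while no resource is mounted
        start = self.breakpoint_step
        end = min(self.total_steps, start + max_steps)
        self.breakpoint_step = end  # update the mark so a later run resumes here
        return end - start

    @property
    def done(self):
        return self.breakpoint_step >= self.total_steps
```

When the resource is released, `run` does nothing and the mark is preserved; once a resource is mounted again, training continues from the marked step rather than from the beginning.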

In one embodiment, the first object and the second object are both containers for artificial intelligence training of a game character model; the game role model is used for representing roles in a man-machine battle game scene.

For example, a hero model in the game Honor of Kings is a game character model.

FIG. 4 is a diagram illustrating an example scenario for allocating graphics processing resources according to an embodiment. Referring to FIG. 4, reinforcement training for Honor of Kings is described as an example. Honor of Kings involves man-machine battles, so reinforcement training is performed on hero battle data to train robot hero models. However, training a hero model is computationally intensive, relies mainly on the computing power of graphics processing resources (GPUs), and requires GPU computing power on a large scale. Therefore, according to the method provided by the application, unmounted graphics processing resources can be used for this reinforcement training in over-sale GPU containers (containers carrying the over-sale mark), so that the demand of the reinforcement training for GPU resources is met. Breakpoint calculation logic is added for the over-sale GPU containers, and when an over-sale GPU container needs GPU resources again, the GPU card is returned.

In the above embodiment, by adding the breakpoint mark, the model training in the target container can be continued when unmounted graphics processing resources become available again. That is, through breakpoint calculation and dynamic mounting, graphics processing resources are utilized efficiently, so that high demand for graphics processing resources can be met.

In one embodiment, determining the target container from the screened containers comprises: acquiring the expected model training completion time corresponding to each screened container; and selecting target containers that satisfy the graphics processing resource request in order of the expected model training completion time, from farthest to nearest relative to the current time.

The expected model training completion time is the time at which model training is predicted to be completed.

Specifically, after the container management center screens the containers carrying the over-sale mark, the expected model training completion time corresponding to each screened container can be determined according to the collected container information. Further, the container management center may select target containers that satisfy the graphics processing resource request in order of the expected model training completion time, from farthest to nearest relative to the current time.

It is to be understood that selecting target containers in order of the expected model training completion time, from farthest to nearest relative to the current time, means that the farther a container's expected model training completion time is from the current time, the more preferentially that container is selected as a target container. For example, if 10 containers carrying the over-sale mark are screened and 5 graphics processing resource cards are required, the 5 containers whose expected model training completion times are farthest from the current time are selected as target containers. The farther the expected model training completion time is from the current time, the longer the model training in the container still needs before it completes, so that container can be selected as a target container. Conversely, the nearer the expected model training completion time is to the current time, the sooner the model training in the container will complete, so that container can be left to finish its model training first without releasing its mounted graphics processing resources.
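The farthest-first ordering in the example above can be written as a single sort. The sketch below assumes each screened container mounts exactly one card, so that five cards correspond to five containers; the dictionary keys and the function name `farthest_first` are hypothetical.

```python
from datetime import datetime, timedelta


def farthest_first(containers, now):
    """Order containers so that the one expected to finish latest comes first."""
    return sorted(containers, key=lambda c: c["expected_completion"] - now, reverse=True)


# Ten over-sale containers expected to finish 1..10 hours from now; five cards are requested.
now = datetime(2019, 9, 12, 12, 0)
screened = [
    {"container_id": f"c{h}", "expected_completion": now + timedelta(hours=h)}
    for h in range(1, 11)
]
targets = farthest_first(screened, now)[:5]
```

Here `targets` holds the containers finishing in 10, 9, 8, 7 and 6 hours, while the containers that will finish within 5 hours keep their cards.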

In the above embodiment, target containers satisfying the graphics processing resource request are selected in order of the expected model training completion time, from farthest to nearest relative to the current time, so that graphics processing resources can be released without affecting containers that are about to complete model training. Therefore, while graphics processing resources are returned to the first object, the model training of other containers is affected as little as possible, and the accuracy of the return is improved.

In one embodiment, the method further comprises: for containers whose expected model training completion times are equally distant from the current time, acquiring the resource usage corresponding to each container; and screening target containers from these containers in order of resource usage from low to high.

Equally distant means that the distances between the expected model training completion times and the current time are the same, or that the difference between those distances is within a preset range.

Resource usage refers to the usage of resources in a container.

In one embodiment, the resource usage may include at least one of graphics processing resource usage, central processor resource usage, and video memory usage in the container.

It will be appreciated that when the resource usage consists of a single usage rate, the target containers may be screened from the containers in order of that usage rate from low to high. That is, the lower the usage rate, the more preferentially the container is selected as a target container.

In one embodiment, when the resource usage includes the usage of graphics processing resources, the usage of central processing unit resources, and the usage of video memory, screening the target containers from the containers in order of resource usage from low to high includes: screening target containers from the containers in order of graphics processing resource usage from low to high; when the graphics processing resource usages are the same, screening target containers in order of the containers' central processing unit resource usage from low to high; and when the central processing unit resource usages are the same, screening target containers in order of the containers' video memory usage from low to high.

It can be understood that the container with low resource utilization rate is preferentially selected as the target container, so that the influence on the data processing currently performed in the container can be reduced as much as possible. Because for a container with a larger resource utilization rate, if the graphics processing resource in the container is released, the current data processing in the container will be influenced more, whereas for a container with a smaller resource utilization rate, if the graphics processing resource in the container is released, the influence is relatively smaller.
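The three-level ordering (graphics processing resource usage, then central processing unit usage, then video memory usage) maps naturally onto a tuple sort key. This is a sketch with hypothetical field names; it treats usages as "the same" only when they are exactly equal, which is a simplification.

```python
def lowest_usage_first(containers):
    """Order containers by GPU usage, then CPU usage, then video memory usage, all ascending."""
    return sorted(
        containers,
        key=lambda c: (c["gpu_usage"], c["cpu_usage"], c["vram_usage"]),
    )


# Containers whose expected completion times are equally distant from the current time.
tied = [
    {"container_id": "a", "gpu_usage": 0.40, "cpu_usage": 0.10, "vram_usage": 0.30},
    {"container_id": "b", "gpu_usage": 0.20, "cpu_usage": 0.50, "vram_usage": 0.30},
    {"container_id": "c", "gpu_usage": 0.20, "cpu_usage": 0.50, "vram_usage": 0.10},
    {"container_id": "d", "gpu_usage": 0.20, "cpu_usage": 0.30, "vram_usage": 0.90},
]
order = [c["container_id"] for c in lowest_usage_first(tied)]
```

Container `d` comes first because it ties on GPU usage with `b` and `c` but has the lowest CPU usage; `c` precedes `b` only on video memory usage; `a` is last because its GPU usage is highest.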

FIG. 5 is a simplified diagram of the information recorded by a container management center in one embodiment. Referring to FIG. 5, the container management center records information such as a container id, a tag indicating whether the container id corresponds to an over-sale service, GPU usage, CPU usage, and video memory usage. The estimated training time is the time required for model training using the graphics processing resources in the container. It will be appreciated that the estimated training time can be used to determine the expected model training completion time. For example, if the current time is 12:00 and the estimated training time is 2 hours, the expected model training completion time is 14:00.
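One way to picture such a record, and the arithmetic in the example, is the sketch below. The class name, field names, and the use of hours as the unit are assumptions made for illustration; the source does not specify how the fields are stored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ContainerRecord:
    # Hypothetical layout of one row of container information.
    container_id: str
    oversale: bool                  # whether the container corresponds to an over-sale service
    gpu_usage: float
    cpu_usage: float
    vram_usage: float
    estimated_training_hours: float

    def expected_completion(self, now: datetime) -> datetime:
        """Expected model training completion time = current time + estimated training time."""
        return now + timedelta(hours=self.estimated_training_hours)


record = ContainerRecord("ctr-1", True, 0.2, 0.1, 0.3, estimated_training_hours=2)
completion = record.expected_completion(datetime(2019, 9, 12, 12, 0))
```

With a current time of 12:00 and an estimated training time of 2 hours, `completion` is 14:00 on the same day.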

In the above embodiment, the target containers are screened from the containers in order of the resource usage corresponding to each container, from low to high, so that graphics processing resources can be released with as little impact as possible on the data processing currently performed in the containers. Therefore, while graphics processing resources are returned to the first object, the data processing of other containers is affected as little as possible, and the accuracy of the return is improved.

In one embodiment, the graphics processing resource is a graphics processing resource card. The method further comprises: when the local machine is restarted, acquiring a registration file corresponding to the local machine, wherein the registration file records the correspondence between the serial number of each graphics processing resource card configured in the local machine and the identifier of the container on which that card is mounted; and, according to the correspondence, restoring each graphics processing resource card configured in the local machine to being mounted on the container represented by the identifier corresponding to its serial number.

It is understood that a registration file is set in the local machine. The registration file records a correspondence table between the serial numbers of the graphics processing resource cards configured in the local machine and the identifiers of the containers on which those cards are mounted. The correspondence table may be maintained by a management and control program in the computer device.

When the local machine is restarted, the record of which graphics processing resource cards are mounted on which containers in the container management center may become disordered, i.e., the container management center no longer knows which graphics processing resource card each container should mount. This affects the accuracy of subsequently determining, in the container management center, target containers for releasing graphics processing resources.

Therefore, when the local machine is restarted, the computer device can, according to the correspondence table recorded in the registration file between the serial numbers of the graphics processing resource cards and the identifiers of the containers on which they are mounted, restore each graphics processing resource card configured in the local machine to being mounted on the container represented by the identifier corresponding to its serial number.
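The restart recovery can be sketched as writing and replaying a small table. The JSON format, the file name, the sample serial numbers, and the `mount` callback are all assumptions for illustration; the source only states that the registration file records the correspondence between card serial numbers and container identifiers.

```python
import json
import os
import tempfile


def save_registry(path, mapping):
    """Persist the serial-number -> container-identifier table."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(mapping, f)


def restore_mounts(path, mount):
    """After a restart, re-mount each card on the container recorded for its serial number."""
    with open(path, encoding="utf-8") as f:
        mapping = json.load(f)
    for serial, container_id in mapping.items():
        mount(serial, container_id)  # hypothetical callback that performs the mount
    return mapping


# Usage: write a table, then replay it as a restart would.
restored = {}
with tempfile.TemporaryDirectory() as tmp:
    registry_path = os.path.join(tmp, "gpu_registry.json")
    save_registry(registry_path, {"GPU-0001": "container-302", "GPU-0002": "container-304"})
    restore_mounts(registry_path, lambda serial, cid: restored.__setitem__(serial, cid))
```

After the replay, `restored` contains the same card-to-container pairs that were registered before the restart.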

It will be appreciated that the cause of a restart of the local machine may include a restart after a failure or an actively initiated restart. This is not limited herein.

In one embodiment, the method further comprises: when a container carrying an over-sale mark exists in the container management center and the first object fails to mount the graphics processing resource, generating monitoring alarm information; and sending the monitoring alarm information.

It can be understood that the container management center may have a container carrying an over-sale mark, that is, graphics processing resources capable of being over-sold exist, and yet the first object, after being unmounted, fails to re-mount graphics processing resources because the return is not triggered, the return fails, or the allocation of graphics processing resources to the first object fails. This indicates that an abnormality has occurred, so monitoring alarm information can be generated and sent, and abnormality diagnosis can then be performed, thereby achieving security monitoring.

FIG. 6 is a simplified flowchart of a method for allocating graphics processing resources, according to one embodiment. Referring to FIG. 6, when the first object, after being unmounted, applies to the management and control program for GPU resources again, the management and control program may determine whether the local machine has an unmounted GPU card. If so, it registers the serial number of the unmounted GPU card in a file (i.e., writes it into the registration file) and allocates the GPU card to the first object for mounting again. If the local machine has no unmounted GPU card, the management and control program may determine whether an over-sold container (i.e., a container carrying the over-sale mark) exists in the container management center. If so, the container is destroyed, the GPU card in the container is released and returned, and the serial number of the returned GPU card is registered in the file. The GPU card is then allocated to the first object for mounting again. If, in the GPU return step, the return is not triggered or fails, or a failure occurs in the file registration step or the GPU card allocation step, so that the first object fails to mount a GPU card again after being unmounted, a monitoring alarm can be raised.

In the above embodiment, when the local machine is restarted, the graphics processing resource cards configured in the local machine are restored to being mounted on their corresponding containers according to the registration file, so that the state before the restart is recovered. In this way, a target container for releasing graphics processing resources can subsequently be determined more accurately in the container management center.
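The order of checks in FIG. 6 can be summarised in a short sketch for a single-card request. All names are hypothetical, plain dictionaries and lists stand in for the registration file, the container management center, and the alarm channel, and only the case "an over-sale container existed but no card could be mounted" raises an alarm.

```python
def remount_one_card(container_id, local_unmounted, oversale_containers, registry, alarms):
    """Try to give `container_id` one GPU card, following the FIG. 6 order of checks.

    local_unmounted:     list of serial numbers of unmounted cards on the host
    oversale_containers: dict of over-sale container id -> list of mounted serial numbers
    registry:            dict standing in for the registration file (serial -> container id)
    alarms:              list collecting monitoring alarm messages
    """
    had_oversale = bool(oversale_containers)
    serial = None
    if local_unmounted:
        serial = local_unmounted.pop(0)
    elif oversale_containers:
        # Destroying an over-sale container returns its cards in the unmounted state.
        _, returned = oversale_containers.popitem()
        local_unmounted.extend(returned)
        if local_unmounted:
            serial = local_unmounted.pop(0)
    if serial is None:
        if had_oversale:
            alarms.append(
                f"over-sale container existed but {container_id} could not re-mount a GPU card"
            )
        return None
    registry[serial] = container_id  # register the serial number, then mount
    return serial
```

The function returns the serial number of the card that was mounted, or `None` when nothing could be mounted; in the latter case the caller can inspect `alarms` to see whether an abnormality was reported.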

In one embodiment, the graphics processing resource is a graphics processing resource card; the graphics processing resource request includes a number of graphics processing resource cards. Step S206 includes: and acquiring the graphics processing resource cards which are not mounted currently and meet the quantity. Step S208 includes: and mounting the acquired graphics processing resource card on the first object.

The number of graphics processing resource cards refers to the number of graphics processing resource cards requested in the graphics processing resource request.

In particular, the computer device may filter graphics processing resource cards that satisfy the requested number from among the currently unmounted graphics processing resource cards. And further mounting the acquired graphics processing resource card on the first object.

In one embodiment, step S206 includes: the management and control program sends information for requesting the graphics processing resources to the container management center so as to instruct the container management center to release the graphics processing resource cards meeting the quantity of the graphics processing resource cards carried in the information from the container.

In the above embodiments, the number of graphics processing resource cards required is specified, and the method is not limited to a specific container for releasing the graphics processing resource cards, thereby realizing flexible dynamic allocation.

In one embodiment, the information for requesting graphics processing resources includes an identification of the specified target container and a number of graphics processing resource cards. The container management center may search for a target container corresponding to the specified identification of the target container and release graphics processing resource cards satisfying the number according to the searched target container.

In one embodiment, the information for requesting graphics processing resources may further include the serial number of a designated graphics processing resource card to be unmounted. The container management center may search for the target container corresponding to the identifier of the designated target container and release graphics processing resource cards satisfying the number according to the serial number of the designated graphics processing resource card to be unmounted.

FIG. 7 is a diagram illustrating a design structure of information used to request graphics processing resources, in one embodiment. Referring to FIG. 7, dockerid indicates the identifier of the designated container, gpu_num indicates the number of graphics processing resource cards requested, and gpu_serial indicates the serial number of the designated GPU card to be unmounted.
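A request carrying the three fields named in FIG. 7 could be represented as below. Only the field names `dockerid`, `gpu_num`, and `gpu_serial` come from the text; the class name, the choice of JSON as a wire format, the optional and list typing, and the sample values are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class GpuRequest:
    gpu_num: int                                          # number of cards requested
    dockerid: Optional[str] = None                        # identifier of a designated container, if any
    gpu_serial: List[str] = field(default_factory=list)   # serial numbers of designated cards, if any

    def to_json(self) -> str:
        """Serialize the request; JSON is an assumed format, not specified by the source."""
        return json.dumps(asdict(self), sort_keys=True)


# A request for two cards from a designated container, without naming specific cards.
message = GpuRequest(gpu_num=2, dockerid="abc123").to_json()
decoded = json.loads(message)
```

Leaving `dockerid` and `gpu_serial` unset corresponds to the embodiment in which only the number of cards is specified and the container management center chooses the target containers itself.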

In the embodiment, when the graphics processing resource is requested to the container management center, the serial number of the target container or the graphics processing resource card can be specified, so that the pertinence and the accuracy of the graphics processing resource request are improved.

Fig. 8 is a timing diagram illustrating an exemplary method for allocating graphics processing resources, which includes the following steps:

1) a first container for carrying out a first AI (artificial intelligence) training service applies for a GPU card (namely a graphic processing resource card) from a management and control program arranged on a master computer;

2) the control program mounts the GPU card to a first container;

3) the management and control program can query the GPU card that is mounted on the first container and is in an idle state; detect the duration for which the GPU card has been in the idle state; and, when the duration exceeds a duration threshold, determine that the GPU card meets the unmount condition;

4) the management and control program can unmount the GPU card from the first container;

5) a second container for performing a second AI training service requests a GPU card from the management and control program;

6) the management and control program may allocate the unmounted GPU card to the second container;

7) when the first container has resource requirements again, sending a GPU card request to the control program;

8) the control program preferentially searches the GPU card which is configured on the mother machine and is not mounted;

9) when the host machine does not have GPU cards meeting the request, the management and control program can send information to the container management center to request a specified number of GPU cards;

10) the container management center can screen the containers carrying the over-sale mark; acquire the expected model training completion time corresponding to each screened container; select target containers that together provide the specified number of GPU cards in order of the expected model training completion time, from farthest to nearest relative to the current time; for containers whose expected model training completion times are equally distant from the current time, acquire the resource usage corresponding to each container; and screen target containers from these containers in order of resource usage from low to high;

11) the container management center can release the GPU card mounted in the target container;

12) the management and control program may mount the acquired GPU card on the first container again.

In one embodiment, when a container carrying an over-sale mark exists in the container management center but the first container fails to mount the GPU card, monitoring alarm information is generated and sent.

In one embodiment, the container management center may perform breakpoint marking on the model training in the target container; when an unmounted GPU card exists, the unmounted GPU card is mounted on the target container again, so that the target container continues model training according to the breakpoint mark.

In one embodiment, when the local machine is restarted, a registration file corresponding to the local machine is obtained, the registration file recording the correspondence between the serial number of each GPU card configured on the local machine and the identifier of the container on which that GPU card is mounted; and, according to the correspondence, each GPU card configured in the local machine is restored to being mounted on the container represented by the identifier corresponding to its serial number.

As shown in FIG. 9, in one embodiment, a graphics processing resource allocation apparatus 900 is provided, the apparatus 900 comprising: an unmounting module 902, a resource allocation module 904, and a resource acquisition module 906, wherein:

an unmounting module 902, configured to unmount, when a graphics processing resource mounted on a first object is in an idle state and an unmount condition is satisfied, the graphics processing resource from the first object.

A resource allocation module 904, configured to allocate the graphics processing resource after being unmounted to a second object requesting the graphics processing resource.

A resource obtaining module 906, configured to, when receiving a graphics processing resource request sent by the first object, obtain a graphics processing resource that is not currently mounted and meets the graphics processing resource request.

The resource allocation module 904 is further configured to mount the acquired graphics processing resource on the first object.

In one embodiment, the unmounting module 902 is further configured to query the graphics processing resource that is mounted on the first object and is in an idle state; detect the duration for which the graphics processing resource has been in the idle state; and, when the duration exceeds a duration threshold, determine that the graphics processing resource meets the unmount condition.

In one embodiment, the resource obtaining module 906 is further configured to find graphics processing resources configured locally and not mounted; and screening the graphics processing resources meeting the graphics processing resource request from the searched graphics processing resources.

In one embodiment, the first object and the second object are different containers for model training; the resource obtaining module 906 is further configured to, when the graphics processing resources found on the local machine cannot meet the graphics processing resource request, screen containers carrying an over-sale mark from a container management center, the container management center comprising containers created for the graphics processing resources of the machines in the cluster; determine a target container from the screened containers, wherein the graphics processing resources mounted in the target container meet the graphics processing resource request; and release the graphics processing resources mounted in the target container.

In one embodiment, the resource obtaining module 906 is further configured to obtain the expected model training completion time corresponding to each screened container; and select target containers that satisfy the graphics processing resource request in order of the expected model training completion time, from farthest to nearest relative to the current time.

In one embodiment, the resource obtaining module 906 is further configured to, for containers whose expected model training completion times are equally distant from the current time, obtain the resource usage corresponding to each container; and screen target containers from these containers in order of resource usage from low to high.

In one embodiment, the resource allocation module 904 is further configured to mark breakpoints for model training in the target container; and when the unmounted graphic processing resource exists, the unmounted graphic processing resource is mounted to the target container again, so that the target container continues model training according to the breakpoint mark.

In one embodiment, the graphics processing resource is a graphics processing resource card; the device also includes:

a recovery module 908, configured to obtain a registration file corresponding to the local machine when the local machine is restarted, the registration file recording the correspondence between the serial number of each graphics processing resource card configured in the local machine and the identifier of the container on which that card is mounted; and, according to the correspondence, restore each graphics processing resource card configured in the local machine to being mounted on the container represented by the identifier corresponding to its serial number.

As shown in fig. 10, in one embodiment, the apparatus 900 further comprises: a recovery module 908 and an alert module 910, wherein:

an alarm module 910, configured to generate monitoring alarm information when a container carrying a reselling mark exists in the container management center but mounting the graphics processing resource on the first object fails, and to send the monitoring alarm information.
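The alarm condition can be expressed as a small predicate. In the sketch below, the `AlarmInfo` record, the function name, and the `send` callback are hypothetical; the application does not specify the content or the transport of the monitoring alarm information.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class AlarmInfo:
    """Hypothetical payload for the monitoring alarm information."""
    requester_id: str
    marked_container_ids: List[str]
    message: str


def alarm_on_failed_mount(requester_id: str,
                          mount_succeeded: bool,
                          marked_container_ids: Iterable[str],
                          send: Callable[[AlarmInfo], None]) -> Optional[AlarmInfo]:
    """Send an alarm only when marked containers exist but mounting still failed."""
    marked = list(marked_container_ids)
    if mount_succeeded or not marked:
        return None  # either nothing went wrong or there was nothing to reclaim
    info = AlarmInfo(
        requester_id=requester_id,
        marked_container_ids=marked,
        message=(f"{requester_id} failed to mount a graphics processing resource "
                 f"although {len(marked)} container(s) carry the reselling mark"),
    )
    send(info)
    return info
```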

In one embodiment, the graphics processing resource is a graphics processing resource card, and the graphics processing resource request includes a number of graphics processing resource cards. The resource obtaining module 906 is further configured to obtain that number of graphics processing resource cards that are not currently mounted; the resource allocation module 904 is further configured to mount the acquired graphics processing resource cards on the first object.

In one embodiment, the first object and the second object are both containers for artificial intelligence training of a game character model; the game character model is used for representing characters in a game scene.

FIG. 11 is a diagram showing an internal configuration of a computer device according to an embodiment. Referring to FIG. 11, the computer device may be the computer device 130 of FIG. 1. The computer device includes a processor, a memory, and a network interface connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device may store an operating system and a computer program. The computer program, when executed, causes the processor to perform a method of graphics processing resource allocation. The processor of the computer device provides computing and control capability and supports the operation of the whole computer device. The internal memory may also store a computer program that, when executed by the processor, causes the processor to perform a method of graphics processing resource allocation. The network interface of the computer device is used for network communication.

Those skilled in the art will appreciate that the architecture shown in fig. 11 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

In one embodiment, the graphics processing resource allocation apparatus provided in the present application may be implemented in the form of a computer program executable on a computer device as shown in fig. 11, and a non-volatile storage medium of the computer device may store the program modules constituting the graphics processing resource allocation apparatus, such as the unmount module 902, the resource allocation module 904, and the resource obtaining module 906 shown in fig. 9. The computer program formed by these program modules causes the computer device to execute the steps of the graphics processing resource allocation method of the embodiments of the present application described in this specification. For example, through the unmount module 902 of the graphics processing resource allocation apparatus 900 shown in fig. 9, the computer device may unmount the graphics processing resource from the first object when the graphics processing resource mounted on the first object is in an idle state and satisfies the release condition. Through the resource allocation module 904, the computer device may allocate the unmounted graphics processing resource to a second object requesting graphics processing resources. Through the resource obtaining module 906, the computer device may, upon receiving a graphics processing resource request sent by the first object, obtain a graphics processing resource that is not currently mounted and that satisfies the graphics processing resource request. The computer device may then mount the acquired graphics processing resource on the first object through the resource allocation module 904.
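The division of work among the three program modules might look like the following sketch. It is illustrative only: the class and method names are hypothetical, the release condition is assumed to be an idle-time threshold, and cards are modeled as serial-number strings held in a single free pool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Obj:
    """Hypothetical first or second object, for example a training container."""
    name: str
    cards: List[str] = field(default_factory=list)
    idle_seconds: float = 0.0


class Allocator:
    def __init__(self, idle_threshold: float) -> None:
        self.idle_threshold = idle_threshold  # assumed form of the release condition
        self.free_cards: List[str] = []       # cards that are not currently mounted

    def unmount_if_idle(self, obj: Obj) -> bool:
        """Role of the unmount module 902: reclaim cards from an idle object."""
        if obj.cards and obj.idle_seconds >= self.idle_threshold:
            self.free_cards.extend(obj.cards)
            obj.cards = []
            return True
        return False

    def acquire(self, count: int) -> List[str]:
        """Role of the resource obtaining module 906: take `count` unmounted cards."""
        if len(self.free_cards) < count:
            return []
        taken = self.free_cards[:count]
        self.free_cards = self.free_cards[count:]
        return taken

    def mount(self, obj: Obj, cards: List[str]) -> None:
        """Role of the resource allocation module 904: mount cards on an object."""
        obj.cards.extend(cards)
        obj.idle_seconds = 0.0
```

A typical sequence would call `unmount_if_idle` on the first object, `acquire` and `mount` for the second object, and then `acquire` and `mount` again when the first object later sends its own request.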

In one embodiment, a computer device is provided, comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the steps of the graphics processing resource allocation method described above. The steps of the graphics processing resource allocation method herein may be steps in the graphics processing resource allocation methods of the various embodiments described above.

In one embodiment, a computer readable storage medium is provided, storing a computer program that, when executed by a processor, causes the processor to perform the steps of the graphics processing resource allocation method described above. The steps of the graphics processing resource allocation method herein may be steps in the graphics processing resource allocation methods of the various embodiments described above.

It should be noted that "first" and "second" in the embodiments of the present application are used only for distinction, and are not used for limitation in terms of size, order, dependency, and the like.

It should be understood that the steps in the embodiments of the present application are not necessarily performed in the order indicated by the step numbers. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and described, and may be performed in other orders. Moreover, at least some of the steps in the various embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium and which, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above-mentioned embodiments express only several embodiments of the present invention, and although their description is relatively specific and detailed, it should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method of graphics processing resource allocation, the method comprising:

unmounting the graphics processing resource from the first object;

and mounting the acquired graphics processing resource on the first object.

2. The method of claim 1, further comprising:

detecting the duration of the graphics processing resource in an idle state;

3. The method of claim 1, wherein obtaining graphics processing resources that are not currently mounted and that satisfy the graphics processing resource request comprises:

4. The method of claim 3, wherein the first object and the second object are different containers for model training;

5. The method of claim 4, wherein the identifying a target container from the screened containers comprises:

6. The method of claim 5, wherein said identifying a target container from the screened containers further comprises:

acquiring the resource utilization rate corresponding to the container;

7. The method of claim 4, further comprising:

carrying out breakpoint marking on model training in the target container;

8. The method of claim 4, wherein the graphics processing resource is a graphics processing resource card; the method further comprises the following steps:

9. The method of claim 4, further comprising:

and sending the monitoring alarm information.

10. The method of claim 1, wherein the graphics processing resource is a graphics processing resource card; the graphics processing resource request comprises the number of graphics processing resource cards;

11. The method of any one of claims 1 to 10, wherein the first object and the second object are both containers for artificial intelligence training of a game character model; the game character model is used for representing characters in a game scene.

12. An apparatus for graphics processing resource allocation, the apparatus comprising:

13. The apparatus of claim 12, wherein the resource obtaining module is further configured to find an unmounted graphics processing resource configured locally; and screening the graphics processing resources meeting the graphics processing resource request from the searched graphics processing resources.

14. A computer device comprising a memory and a processor, the memory having stored therein a computer program which, when executed by the processor, causes the processor to carry out the steps of the method of any one of claims 1 to 11.

15. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 11.