CN109634731A - GPU resource group scheduling implementation method and device based on an AI cloud - Google Patents

GPU resource group scheduling implementation method and device based on an AI cloud Download PDF

Info

Publication number
CN109634731A
CN109634731A
Authority
CN
China
Prior art keywords
gpu
host
type
grouping
business
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811442185.6A
Other languages
Chinese (zh)
Inventor
房体盈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhengzhou Yunhai Information Technology Co Ltd
Original Assignee
Zhengzhou Yunhai Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhengzhou Yunhai Information Technology Co Ltd filed Critical Zhengzhou Yunhai Information Technology Co Ltd
Priority to CN201811442185.6A priority Critical patent/CN109634731A/en
Publication of CN109634731A publication Critical patent/CN109634731A/en

Classifications

    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F9/45558 Hypervisor-specific management and integration aspects
    • G06F9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F9/5077 Logical partitioning of resources; management or configuration of virtualized resources
    • G06F2009/45562 Creating, deleting, cloning virtual machine instances
    • G06F2209/484 Precedence
    • G06F2209/5021 Priority

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The embodiments of the invention disclose a GPU resource group scheduling implementation method and device based on an AI cloud. The method comprises: creating multiple groups according to GPU type; assigning each host to the corresponding group according to the resource configuration of the host; and classifying the types of business that the GPU type of each group can execute, so that when a pending business is obtained, the required GPU type is determined according to the business type of the business and a host whose GPU is in an idle state is obtained from the corresponding group according to the required GPU type. Through this example scheme, resources are managed more conveniently and overall resource utilization is improved, so that tasks that need high-performance GPU cards obtain high-performance GPU resources preferentially and tasks are distributed more evenly, which both raises GPU resource utilization and improves the training efficiency of algorithm engineers.

Description

GPU resource group scheduling implementation method and device based on an AI cloud
Technical field
Embodiments of the present invention relate to AI clouds, and in particular to a GPU resource group scheduling implementation method and device based on an AI cloud.
Background technique
In the artificial intelligence (AI) era, algorithm engineers need to run a large number of deep learning tasks. A docker container is usually used as the training environment, and expensive GPU cards are used to significantly improve training speed. A large number of AI servers equipped with GPU cards (docker container hosts) are managed as a cluster by k8s (short for Kubernetes, an open-source container cluster management system), and the AI cloud platform in turn manages the docker container cluster uniformly through k8s. When an algorithm engineer needs GPU resources, GPU resources must be allocated; how to allocate them so that tasks that need high-performance GPU cards obtain GPU cards preferentially is a problem that must be solved.
Summary of the invention
Embodiments of the present invention provide a GPU resource group scheduling implementation method and device based on an AI cloud, which can manage resources more conveniently and improve overall resource utilization, so that tasks that need high-performance GPU cards obtain high-performance GPU resources preferentially and tasks are distributed more evenly, which both raises GPU resource utilization and improves the training efficiency of algorithm engineers.
To achieve the purpose of the embodiments of the present invention, an embodiment of the present invention provides a graphics processor (GPU) resource group scheduling implementation method based on an artificial intelligence (AI) cloud. The method may include:
creating multiple groups according to GPU type;
assigning each host to the corresponding group according to the resource configuration of the host;
classifying the types of business that the GPU type of each group can execute, so that when a pending business is obtained, the required GPU type is determined according to the business type of the business, and a host whose GPU is in an idle state is obtained from the corresponding group according to the required GPU type.
In an exemplary embodiment of the present invention, the resource configuration may include the configured GPU type; the GPU type includes P100, V100, or no GPU.
In an exemplary embodiment of the present invention, assigning each host to the corresponding group according to its resource configuration may include: according to the GPU type configured on each host, calling the management interface of the open-source container cluster management system k8s to stamp a corresponding label on each host.
In an exemplary embodiment of the present invention, obtaining a host whose GPU is idle from the corresponding group according to the required GPU type may include: k8s determining, according to the labels on the hosts, all hosts in the group corresponding to the required GPU type, and obtaining from those hosts a host whose GPU is in an idle state.
In an exemplary embodiment of the present invention, the method may further include: when a pending business is obtained, configuring the container corresponding to the pending business, and after obtaining the host whose GPU is idle, placing the container on that host.
In an exemplary embodiment of the present invention, the method may further include: after the pending business has been executed, automatically destroying the container.
An embodiment of the present invention also provides a graphics processor (GPU) resource group scheduling implementation device based on an artificial intelligence (AI) cloud. The device may include: a grouping module, a distribution module, a classification module, and an acquisition module.
The grouping module is configured to create multiple groups according to GPU type.
The distribution module is configured to assign each host to the corresponding group according to the resource configuration of the host.
The classification module is configured to classify the types of business that the GPU type of each group can execute, so that when the acquisition module obtains a pending business, the required GPU type is determined according to the business type of the business, and a host whose GPU is idle is obtained from the corresponding group according to the required GPU type.
In an exemplary embodiment of the present invention, the resource configuration may include the configured GPU type; the GPU type may include P100, V100, or no GPU.
In an exemplary embodiment of the present invention, the distribution module assigning each host to the corresponding group according to its resource configuration may include: according to the GPU type configured on each host, calling the management interface of the open-source container cluster management system k8s to stamp a corresponding label on each host.
In an exemplary embodiment of the present invention, the acquisition module obtaining a host whose GPU is idle from the corresponding group according to the required GPU type may include: k8s determining, according to the labels on the hosts, all hosts in the group corresponding to the required GPU type, and obtaining from those hosts a host whose GPU is in an idle state.
In an exemplary embodiment of the present invention, the device may further include a configuration module.
The configuration module is configured to configure the container corresponding to the pending business when the acquisition module obtains a pending business, and, after the acquisition module obtains the host whose GPU is idle, to place the container on that host.
In an exemplary embodiment of the present invention, the configuration module may also be configured to automatically destroy the container after the pending business has been executed.
The embodiments of the present invention include: creating multiple groups according to GPU type; assigning each host to the corresponding group according to the resource configuration of the host; and classifying the types of business that the GPU type of each group can execute, so that when a pending business is obtained, the required GPU type is determined according to the business type of the business and a host whose GPU is in an idle state is obtained from the corresponding group according to the required GPU type. Through this example scheme, resources are managed more conveniently and overall resource utilization is improved, so that tasks that need high-performance GPU cards obtain high-performance GPU resources preferentially and tasks are distributed more evenly, which both raises GPU resource utilization and improves the training efficiency of algorithm engineers.
Other features and advantages of the embodiments of the present invention will be set forth in the following description and will in part become apparent from the description, or be understood by practicing the invention. The objectives and other advantages of the embodiments of the present invention can be realized and obtained through the structures particularly pointed out in the specification, claims, and drawings.
Detailed description of the invention
The drawings are provided for a further understanding of the technical solutions of the embodiments of the present invention and constitute a part of the specification; together with the embodiments of the application, they serve to explain the technical solutions of the embodiments and do not limit them.
Fig. 1 is a flow chart of the AI-cloud-based GPU resource group scheduling implementation method of an embodiment of the present invention;
Fig. 2 is a block diagram of the AI-cloud-based GPU resource group scheduling implementation device of an embodiment of the present invention.
Specific embodiment
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the drawings. It should be noted that, provided there is no conflict, the embodiments in this application and the features in the embodiments may be combined with each other in any way.
The steps shown in the flow chart of the drawings may be executed in a computer system, such as a set of computer-executable instructions. Also, although a logical order is shown in the flow chart, in some cases the steps shown or described may be executed in an order different from the one here.
To achieve the purpose of the embodiments of the present invention, an embodiment of the present invention provides a graphics processor (GPU) resource group scheduling implementation method based on an artificial intelligence (AI) cloud. As shown in Fig. 1, the method may include steps S101-S103:
S101: create multiple groups according to GPU type.
In an exemplary embodiment of the present invention, the GPU type may include P100, V100, or no GPU.
In an exemplary embodiment of the present invention, a preset group-creation device may first create multiple groups carrying GPU labels (by model, e.g. P100, V100, or none). When creating these groups, the group-creation device can create them reasonably according to the actual resource situation; if there is no GPU, "none" may be selected.
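The group-creation step can be sketched as follows; the function and the GPU-type strings are illustrative assumptions, since the patent does not specify any data structures:

```python
# Sketch of S101: create one scheduling group per GPU type.
# The type strings are illustrative; "none" groups hosts without a GPU.
GPU_TYPES = ["P100", "V100", "none"]

def create_groups(gpu_types):
    """Return an empty host list for every GPU type present in the cluster."""
    return {gpu_type: [] for gpu_type in gpu_types}

groups = create_groups(GPU_TYPES)
print(groups)  # {'P100': [], 'V100': [], 'none': []}
```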
S102: assign each host to the corresponding group according to the resource configuration of the host.
In an exemplary embodiment of the present invention, the resource configuration may include the configured GPU type, i.e. whether the host is configured with a GPU. If it is not, the host can be assigned to the no-GPU group; if a GPU is configured, the concrete GPU type of the host (e.g. P100 or V100) can be further determined, and the host can be assigned to the group corresponding to that GPU type.
In an exemplary embodiment of the present invention, a host-grouping device can assign each host to its group (labelling the host at the same time, so that k8s can schedule by host label).
In an exemplary embodiment of the present invention, assigning each host to the corresponding group according to its resource configuration may include: according to the GPU type configured on each host, calling the management interface of the open-source container cluster management system k8s to stamp a corresponding label on each host.
In an exemplary embodiment of the present invention, the host-grouping device assigns hosts to groups; the grouping process is exactly the labelling process, i.e. labels are stamped on hosts by calling the k8s management interface, and k8s can then schedule by host label.
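A minimal sketch of this labelling step follows. The label key `gpu-type` is an assumption, as is the `kubectl` call mentioned in the comment; the patent only says that the k8s management interface is called:

```python
# Sketch of S102: derive a node label from each host's GPU configuration
# and assign the host to the matching group. In a real cluster the label
# would be applied through the k8s API (e.g. `kubectl label node <host> gpu-type=P100`);
# here the grouping logic is shown as plain Python.
def label_for_host(host_config):
    """host_config is a dict like {"name": "node1", "gpu": "P100"}; "gpu" may be None."""
    gpu = host_config.get("gpu")
    return {"gpu-type": gpu if gpu else "none"}

def assign_hosts(groups, host_configs):
    for host in host_configs:
        group = label_for_host(host)["gpu-type"]
        groups.setdefault(group, []).append(host["name"])
    return groups

groups = assign_hosts({}, [
    {"name": "node1", "gpu": "P100"},
    {"name": "node2", "gpu": "V100"},
    {"name": "node3", "gpu": None},
])
print(groups)  # {'P100': ['node1'], 'V100': ['node2'], 'none': ['node3']}
```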
S103: classify the types of business that the GPU type of each group can execute, so that when a pending business is obtained, the required GPU type is determined according to the business type of the business, and a host whose GPU is idle is obtained from the corresponding group according to the required GPU type.
In an exemplary embodiment of the present invention, a group-application device can apply resource groups to particular business: the P100 and V100 groups can be used for deep learning business, while the no-GPU group can be used for business that does not need a GPU; for example, the no-GPU group can be used to create visualization business (visualization business does not need GPU resources).
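This classification amounts to a mapping from business type to required GPU group; the business names below are illustrative, since the patent only distinguishes deep-learning business (needs a GPU) from visualization business (needs none):

```python
# Sketch of the business-type classification: which group serves which business.
# The business names are illustrative assumptions.
BUSINESS_TO_GPU_TYPE = {
    "deep-learning-p100": "P100",
    "deep-learning-v100": "V100",
    "visualization": "none",  # visualization business needs no GPU resources
}

def required_gpu_type(business):
    """Determine the required GPU type from the business type (S103)."""
    return BUSINESS_TO_GPU_TYPE[business]

print(required_gpu_type("visualization"))  # none
```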
In an exemplary embodiment of the present invention, obtaining a host whose GPU is idle from the corresponding group according to the required GPU type may include: k8s determining, according to the labels on the hosts, all hosts in the group corresponding to the required GPU type, and obtaining from those hosts a host whose GPU is in an idle state.
In an exemplary embodiment of the present invention, the method may further include: when a pending business is obtained, configuring the container corresponding to the pending business, and after obtaining the host whose GPU is idle, placing the container on that host.
In an exemplary embodiment of the present invention, the host-grouping device groups host resources reasonably, and the resource-scheduling device determines the particular requirements on the host according to the needs of the business, uses them as screening conditions to select a suitable host, and then creates the container.
In an exemplary embodiment of the present invention, when a user submits a deep learning task that needs a GPU of model P100, the resource management device configures the container size and specifies that the container be placed on a host with the P100 label; k8s filters out, through the scheduling device, all hosts carrying the P100 label, then chooses an idle GPU according to host load and allocates it to the user.
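The host selection described above — filter the hosts carrying the required label, keep those with a free GPU, then pick one by load — can be sketched as follows; the `idle_gpus` and `load` fields are assumptions, since the patent does not define a load metric:

```python
# Sketch of S103's host selection: among hosts in the group for the required
# GPU type, keep those with at least one idle GPU and choose the least loaded.
# The host-record fields are illustrative assumptions.
def pick_host(hosts, gpu_type):
    candidates = [
        h for h in hosts
        if h["gpu_type"] == gpu_type and h["idle_gpus"] > 0
    ]
    if not candidates:
        return None  # no free GPU of this type; the task would have to wait
    return min(candidates, key=lambda h: h["load"])["name"]

hosts = [
    {"name": "node1", "gpu_type": "P100", "idle_gpus": 0, "load": 0.2},
    {"name": "node2", "gpu_type": "P100", "idle_gpus": 2, "load": 0.7},
    {"name": "node3", "gpu_type": "P100", "idle_gpus": 1, "load": 0.3},
]
print(pick_host(hosts, "P100"))  # node3
```

Note that node1, although least loaded, is skipped because all of its GPUs are busy; this matches the patent's requirement that the selected host's GPU be in an idle state.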
In an exemplary embodiment of the present invention, the method may further include: after the pending business has been executed, automatically destroying the container to save host resources.
In an exemplary embodiment of the present invention, by designing a reasonable resource grouping and application strategy and grouping resources according to GPU type (or model), business that does not need a GPU cannot occupy the resources (CPU, memory, GPU) of hosts that have GPUs. This prevents a host equipped with GPUs from being unable to create GPU containers because its CPU or memory is exhausted, which would leave its GPU resources idle and wasted; GPU utilization is thereby improved.
An embodiment of the present invention also provides a graphics processor (GPU) resource group scheduling implementation device 1 based on an artificial intelligence (AI) cloud. It should be noted that any of the method embodiments above can be applied in this device embodiment and is not repeated here. As shown in Fig. 2, the device may include: a grouping module 11, a distribution module 12, a classification module 13, and an acquisition module 14.
The grouping module 11 is configured to create multiple groups according to GPU type.
The distribution module 12 is configured to assign each host to the corresponding group according to the resource configuration of the host.
The classification module 13 is configured to classify the types of business that the GPU type of each group can execute, so that when the acquisition module 14 obtains a pending business, the required GPU type is determined according to the business type of the business, and a host whose GPU is idle is obtained from the corresponding group according to the required GPU type.
In an exemplary embodiment of the present invention, the resource configuration may include the configured GPU type; the GPU type may include P100, V100, or no GPU.
In an exemplary embodiment of the present invention, the distribution module 12 assigning each host to the corresponding group according to its resource configuration may include: according to the GPU type configured on each host, calling the management interface of the open-source container cluster management system k8s to stamp a corresponding label on each host.
In an exemplary embodiment of the present invention, the acquisition module 14 obtaining a host whose GPU is idle from the corresponding group according to the required GPU type may include: k8s determining, according to the labels on the hosts, all hosts in the group corresponding to the required GPU type, and obtaining from those hosts a host whose GPU is in an idle state.
In an exemplary embodiment of the present invention, the device may further include a configuration module 15.
The configuration module 15 is configured to configure the container corresponding to the pending business when the acquisition module obtains a pending business, and, after the acquisition module obtains the host whose GPU is idle, to place the container on that host.
In an exemplary embodiment of the present invention, the configuration module 15 may also be configured to automatically destroy the container after the pending business has been executed.
The embodiments of the present invention include: creating multiple groups according to GPU type; assigning each host to the corresponding group according to the resource configuration of the host; and classifying the types of business that the GPU type of each group can execute, so that when a pending business is obtained, the required GPU type is determined according to the business type of the business and a host whose GPU is in an idle state is obtained from the corresponding group according to the required GPU type. Through this example scheme, resources are managed more conveniently and overall resource utilization is improved, so that tasks that need high-performance GPU cards obtain high-performance GPU resources preferentially and tasks are distributed more evenly, which both raises GPU resource utilization and improves the training efficiency of algorithm engineers.
Those skilled in the art will appreciate that all or some of the steps in the methods disclosed above, and the functional modules/units in the systems and devices, may be implemented as software, firmware, hardware, or appropriate combinations thereof. In a hardware embodiment, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be executed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage medium includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for the storage of information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile discs (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. Furthermore, as is well known to those of ordinary skill in the art, communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Claims (10)

1. A graphics processor (GPU) resource group scheduling implementation method based on an artificial intelligence (AI) cloud, characterized in that the method comprises:
creating multiple groups according to GPU type;
assigning each host to the corresponding group according to the resource configuration of the host;
classifying the types of business that the GPU type of each group can execute, so that when a pending business is obtained, the required GPU type is determined according to the business type of the business, and a host whose GPU is in an idle state is obtained from the corresponding group according to the required GPU type.
2. The AI-cloud-based GPU resource group scheduling implementation method according to claim 1, characterized in that the resource configuration comprises the configured GPU type, and the GPU type comprises P100, V100, or no GPU.
3. The AI-cloud-based GPU resource group scheduling implementation method according to claim 1 or 2, characterized in that assigning each host to the corresponding group according to the resource configuration of the host comprises: according to the GPU type configured on each host, calling the management interface of the open-source container cluster management system k8s to stamp a corresponding label on each host.
4. The AI-cloud-based GPU resource group scheduling implementation method according to claim 3, characterized in that obtaining a host whose GPU is in an idle state from the corresponding group according to the required GPU type comprises: k8s determining, according to the labels on the hosts, all hosts in the group corresponding to the required GPU type, and obtaining from those hosts a host whose GPU is in an idle state.
5. The AI-cloud-based GPU resource group scheduling implementation method according to claim 1, characterized in that the method further comprises: when a pending business is obtained, configuring the container corresponding to the pending business, and after obtaining the host whose GPU is in an idle state, placing the container on that host.
6. The AI-cloud-based GPU resource group scheduling implementation method according to claim 5, characterized in that the method further comprises: after the pending business has been executed, automatically destroying the container.
7. A graphics processor (GPU) resource group scheduling implementation device based on an artificial intelligence (AI) cloud, characterized in that the device comprises: a grouping module, a distribution module, a classification module, and an acquisition module;
the grouping module is configured to create multiple groups according to GPU type;
the distribution module is configured to assign each host to the corresponding group according to the resource configuration of the host;
the classification module is configured to classify the types of business that the GPU type of each group can execute, so that when the acquisition module obtains a pending business, the required GPU type is determined according to the business type of the business, and a host whose GPU is in an idle state is obtained from the corresponding group according to the required GPU type.
8. The AI-cloud-based GPU resource group scheduling implementation device according to claim 7, characterized in that the resource configuration comprises the configured GPU type, and the GPU type comprises P100, V100, or no GPU.
9. The AI-cloud-based GPU resource group scheduling implementation device according to claim 7 or 8, characterized in that the distribution module assigning each host to the corresponding group according to the resource configuration of the host comprises: according to the GPU type configured on each host, calling the management interface of the open-source container cluster management system k8s to stamp a corresponding label on each host.
10. The AI-cloud-based GPU resource group scheduling implementation device according to claim 9, characterized in that the acquisition module obtaining a host whose GPU is in an idle state from the corresponding group according to the required GPU type comprises: k8s determining, according to the labels on the hosts, all hosts in the group corresponding to the required GPU type, and obtaining from those hosts a host whose GPU is in an idle state.
CN201811442185.6A 2018-11-29 2018-11-29 GPU resource group scheduling implementation method and apparatus based on an AI cloud Pending CN109634731A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811442185.6A CN109634731A (en) 2018-11-29 2018-11-29 GPU resource group scheduling implementation method and apparatus based on an AI cloud

Publications (1)

Publication Number Publication Date
CN109634731A true CN109634731A (en) 2019-04-16

Family

ID=66069818

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811442185.6A Pending CN109634731A (en) GPU resource group scheduling implementation method and apparatus based on an AI cloud

Country Status (1)

Country Link
CN (1) CN109634731A (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107135257A (en) * 2017-04-28 2017-09-05 NetPosa Technologies, Ltd. Task distribution method in a node cluster, node and system

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111930525A (en) * 2020-10-10 2020-11-13 北京世纪好未来教育科技有限公司 GPU resource use method, electronic device and computer readable medium
CN111930525B (en) * 2020-10-10 2021-02-02 北京世纪好未来教育科技有限公司 GPU resource use method, electronic device and computer readable medium

Similar Documents

Publication Publication Date Title
CN106656867B (en) A kind of dynamic SDN configuration method of the application perception based on virtual network
CN109684065B (en) Resource scheduling method, device and system
CN104965757B (en) Method, virtual machine (vm) migration managing device and the system of live migration of virtual machine
CN105512083B (en) Method for managing resource, apparatus and system based on YARN
WO2019055871A1 (en) Systems and methods for a policy-driven orchestration of deployment of distributed applications
CN110413412A (en) A kind of method and apparatus based on GPU cluster resource allocation
US20240111586A1 (en) Multi-policy intelligent scheduling method and apparatus oriented to heterogeneous computing power
CN105589731B (en) A kind of virtual machine migration method and device
CN109471725A (en) Resource allocation methods, device and server
CN109240825A (en) Elastic method for scheduling task, device, equipment and computer readable storage medium
CN110389843A (en) A kind of business scheduling method, device, equipment and readable storage medium storing program for executing
CN109445923A (en) A kind of method and device of micro services task schedule
CN113674131A (en) Hardware accelerator equipment management method and device, electronic equipment and storage medium
CN115048216A (en) Resource management scheduling method, device and equipment for artificial intelligence cluster
CN114374609A (en) Deep learning operation running method and system based on RDMA (remote direct memory Access) equipment
CN114943885A (en) Synchronous cache acceleration method and system based on training task
CN109634731A (en) A kind of GPU resource packet scheduling implementation method and device based on AI cloud
CN113835897A (en) Method for allocating and using GPU resources on distributed computing cluster Kubernets
CN108804202A (en) A kind of enterprise-level container mirror image management method and system
CN108170508A (en) Batch creates method, apparatus, equipment and its storage medium of virtual machine
CN111400021B (en) Deep learning method, device and system
CN109788061B (en) Computing task deployment method and device
CN109660575B (en) Method and device for realizing NFV service deployment
CN107025126A (en) A kind of resource regulating method, NFVO and system
WO2017125161A1 (en) Resource allocation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190416