CN109992407B - YARN cluster GPU resource scheduling method, device and medium - Google Patents

YARN cluster GPU resource scheduling method, device and medium

Info

Publication number
CN109992407B
CN109992407B (application number CN201810001081.5A)
Authority
CN
China
Prior art keywords
gpu
node
task
resources
resource
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810001081.5A
Other languages
Chinese (zh)
Other versions
CN109992407A (en)
Inventor
丛鹏宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Communications Ltd Research Institute filed Critical China Mobile Communications Group Co Ltd
Priority to CN201810001081.5A priority Critical patent/CN109992407B/en
Publication of CN109992407A publication Critical patent/CN109992407A/en
Application granted granted Critical
Publication of CN109992407B publication Critical patent/CN109992407B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5072Grid computing

Abstract

The invention discloses a YARN cluster GPU resource scheduling method, device and medium, which are used for reducing the complexity of GPU resource scheduling while realizing GPU resource scheduling in a YARN cluster. In the YARN cluster GPU resource scheduling method, GPU labels are added to the GPU nodes contained in the YARN cluster. The method comprises the following steps: receiving a task that needs GPU resources to be scheduled, wherein the task carries the number of GPU resources needed to complete the task; determining the available GPU nodes in the YARN cluster according to the GPU labels and querying the number of remaining GPU resources of each available GPU node; determining a GPU resource scheduling strategy according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node; and scheduling GPU resources for the task according to the determined resource scheduling strategy.

Description

YARN cluster GPU resource scheduling method, device and medium
Technical Field
The invention relates to the technical field of big data processing, in particular to a YARN cluster GPU resource scheduling method, device and medium.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
YARN (Yet Another Resource Negotiator) is a cluster resource management system used on the Hadoop platform, and supports the management and scheduling of CPU and memory resources. Owing to its strong computing power, the GPU (Graphics Processing Unit) is widely used in machine learning and greatly accelerates the training of models. Conventional YARN does not support the scheduling of GPU resources; to support it, there are currently two main schemes, as follows.
The first scheme is to directly modify the source code of YARN and extend its scheduling function to support the scheduling of GPU resources. The advantage of this scheme is better compatibility. However, due to the complexity of the YARN source code and some of its limitations, the scheme is difficult to implement, bugs are difficult to debug, the development period is long, and the time and labor costs are high. The second scheme is to implement an independent GPU resource management system that is solely responsible for the management and scheduling of GPU resources. The advantage of this scheme is that it is not constrained by YARN and is flexible. However, since a complete system needs to be developed independently, the development cost is high, more operation and maintenance cost is introduced, and the complexity of the whole system increases.
Disclosure of Invention
The embodiment of the invention provides a YARN cluster GPU resource scheduling method, device and medium, which are used for reducing the complexity of GPU resource scheduling while GPU resource scheduling is realized in a YARN cluster.
In a first aspect, a method for scheduling YARN cluster GPU resources is provided, in which GPU labels are added to the graphics processing unit (GPU) nodes included in a Yet Another Resource Negotiator (YARN) cluster;
the method comprises the following steps:
receiving a task needing to schedule GPU resources, wherein the task carries the number of GPU resources needed for completing the task;
determining available GPU nodes in the YARN cluster according to the GPU labels and inquiring the residual GPU resource quantity of each available GPU node;
determining a GPU resource scheduling strategy according to the number of the needed GPU resources and the number of the residual GPU resources of each available GPU node;
and scheduling GPU resources for the tasks according to the determined resource scheduling strategy.
Optionally, determining a GPU resource scheduling strategy according to the number of GPU resources needed to complete the task and the number of remaining GPU resources of each available GPU node specifically includes:
if it is determined, according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, that there is at least one GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, determining that the GPU resource scheduling strategy is to select any GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, and to allocate the task to that GPU node for centralized processing;
and if it is determined, according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, that the number of remaining GPU resources of every GPU node is smaller than the number of GPU resources needed to complete the task, determining that the GPU resource scheduling strategy is to randomly distribute the task to a plurality of GPU nodes for processing.
Optionally, if it is determined that the GPU resource scheduling policy is to select any GPU node whose remaining GPU resource amount is not less than the GPU resource amount required to complete the task, and allocate the task to the GPU node in a centralized manner for processing, after the GPU resource is scheduled for the task according to the determined resource scheduling policy, the method further includes:
modifying the selected GPU node label into a preset label; and
after the task is completed and submitted, the method further comprises the following steps:
and modifying the selected GPU node label as an initial label.
Optionally, before receiving a task that requires scheduling of GPU resources, the number of available GPU resources of each GPU node included in the YARN cluster is set according to the following method:
for each GPU node, according to the number of available GPU resources of the GPU node, setting the CPU virtual core value of the YARN parameter corresponding to the node to an integer multiple of the number of available GPU resources of the GPU node.
In a second aspect, a YARN cluster GPU resource scheduling apparatus is provided, in which GPU tags are added to the graphics processing unit (GPU) nodes included in a Yet Another Resource Negotiator (YARN) cluster; and
the apparatus comprises:
a receiving unit, configured to receive a task that needs GPU resources to be scheduled, wherein the task carries the number of GPU resources needed to complete the task;
a first determining unit, configured to determine available GPU nodes in the YARN cluster according to the GPU tag and query the remaining GPU resource amount of each available GPU node;
a second determining unit, configured to determine a GPU resource scheduling policy according to the number of GPU resources needed to complete the task and the number of remaining GPU resources of each available GPU node;
and the resource scheduling unit is used for scheduling GPU resources for the tasks according to the determined resource scheduling strategy.
Optionally, the second determining unit is specifically configured to: if it is determined, according to the number of GPU resources needed to complete the task and the number of remaining GPU resources of each available GPU node, that there is at least one GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, determine that the GPU resource scheduling policy is to select any GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, and to allocate the task to that GPU node for centralized processing; and if it is determined, according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, that the number of remaining GPU resources of every GPU node is smaller than the number of GPU resources needed to complete the task, determine that the GPU resource scheduling policy is to randomly distribute the task to a plurality of GPU nodes for processing.
Optionally, the apparatus further comprises:
a tag modifying unit, configured to, if the second determining unit determines that the GPU resource scheduling policy is to select any GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task and to allocate the task to that GPU node for centralized processing, modify the tag of the selected GPU node to a preset tag after the resource scheduling unit schedules the GPU resources for the task according to the determined resource scheduling policy; and, after the task is submitted, modify the tag of the selected GPU node back to the initial tag.
Optionally, the apparatus further comprises:
a resource setting unit, configured to set, before the receiving unit receives a task that needs GPU resources to be scheduled, the number of available GPU resources of each GPU node included in the YARN cluster according to the following method: for each GPU node, according to the number of available GPU resources of the GPU node, setting the CPU virtual core value of the YARN parameter corresponding to the node to an integer multiple of the number of available GPU resources of the GPU node.
In a third aspect, a computing device is provided, comprising at least one processing unit and at least one memory unit, wherein the memory unit stores a computer program that, when executed by the processing unit, causes the processing unit to perform the steps of any of the above methods.
In a fourth aspect, there is provided a computer readable medium storing a computer program executable by a computing device, the program, when run on the computing device, causing the computing device to perform the steps of any of the methods described above.
In the YARN cluster GPU resource scheduling method, device and medium provided by the embodiments of the invention, GPU labels are added to the GPU nodes contained in the YARN cluster. When GPU resources need to be scheduled for a received task, the available GPU nodes in the YARN cluster are determined according to the GPU labels, the number of remaining GPU resources of each available GPU node is queried, a GPU resource scheduling strategy is determined according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, and GPU resources are then scheduled on the basis of that strategy, so that GPU resource scheduling is realized in the YARN cluster while the complexity of GPU resource scheduling is reduced.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic diagram of adding GPU labels to GPU nodes included in a YARN cluster according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating an implementation flow of a YARN cluster GPU resource scheduling method in an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a YARN cluster GPU resource scheduling device in an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a computing device according to an embodiment of the invention.
Detailed Description
In order to realize GPU resource scheduling in the YARN cluster and reduce the complexity of GPU resource scheduling, the embodiment of the invention provides a YARN cluster GPU resource scheduling method, device and medium.
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are merely for illustrating and explaining the present invention, and are not intended to limit the present invention, and that the embodiments and features of the embodiments in the present invention may be combined with each other without conflict.
In the embodiment of the invention, based on the label function of the Hadoop YARN cluster, a CPU label is added to each CPU node in the YARN cluster, and a GPU label is added to each GPU node in the YARN cluster, as shown in fig. 1.
In specific implementation, since YARN does not support directly acquiring the available GPU resources of a GPU node, in the embodiment of the present invention the available GPU resources of a GPU node may be obtained in the following manner: for each GPU node, according to the number of available GPU resources of the GPU node, the CPU virtual core value of the YARN parameter corresponding to the node is set to an integer multiple of the number of available GPU resources of the GPU node.
For example, assuming that the number of GPU resources of a certain GPU node is x, the number of CPU virtual cores (vcores) in the YARN parameter of the node is set to an integer multiple of x, and may, without loss of generality, be set to x vcores. The number of GPU resources needed to complete a received task is then the number of vcores required for the task: however many vcores the task occupies, it occupies that many GPUs. Therefore, the number of available GPU resources of a GPU node can be obtained by querying the number of available vcores of the node.
It should be noted that the GPU resources involved in the embodiments of the present invention may be, but are not limited to, the number of cards of the GPU node.
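As a brief illustration (the parameter name yarn.nodemanager.resource.cpu-vcores and the GPU count of 4 are assumptions used only for this example; the specification does not fix either), a GPU node with 4 GPU cards could expose its GPU capacity through the node-level vcore setting:
yarn.nodemanager.resource.cpu-vcores=4 (the node's vcore count is set equal to its 4 GPU cards)
A task that needs 2 GPUs then simply requests containers with 2 vcores on the gpu-labeled nodes, and thereby occupies 2 GPU cards.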
As shown in fig. 2, which is a schematic diagram of an implementation flow of the YARN cluster GPU resource scheduling method provided by the embodiment of the present invention, the method may include the following steps:
s21, receiving a task needing to schedule GPU resources, wherein the task carries the number of GPU resources needed for completing the task.
S22, determining the available GPU nodes in the YARN cluster according to the GPU labels and inquiring the residual GPU resource quantity of each available GPU node.
And S23, determining a GPU resource scheduling strategy according to the number of the needed GPU resources and the number of the residual GPU resources of each available GPU node.
And S24, scheduling GPU resources for the task according to the determined resource scheduling strategy.
In specific implementation, in step S23, the GPU resource scheduling strategy may be determined in the following manner: if it is determined, according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, that there is at least one GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, the GPU resource scheduling strategy is determined to be selecting any GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task and allocating the task to that GPU node for centralized processing; and if it is determined, according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, that the number of remaining GPU resources of every GPU node is smaller than the number of GPU resources needed to complete the task, the GPU resource scheduling strategy is determined to be randomly distributing the task to a plurality of GPU nodes for processing.
In specific implementation, after a task requiring GPU resources is received, the available GPU nodes included in the YARN cluster are determined, and the number of remaining GPU resources of each available GPU node is queried. Given the number y of GPU resources required to complete the task, all available GPU nodes are sorted by their number of remaining GPU resources. Assuming that there are N available GPU nodes, the sorted remaining amounts are x1 ≥ x2 ≥ x3 ≥ ... ≥ xN. If there exists an i such that xi ≥ y > xi+1, the GPU resource scheduling strategy is determined to be the centralized scheduling strategy, that is, any GPU node whose number of remaining GPU resources is not less than the number of GPU resources required by the task is selected to process the task; in specific implementation, all workers of the task are allocated to that single GPU node so that the task is completed in a centralized manner. The tag of the selected GPU node is modified to a preset tag; for example, it may be modified to GPU_XXX, where XXX may be a random number, to indicate that the GPU node is not currently available. After the task is submitted, the tag of the selected GPU node is modified back to the initial tag, namely the initial GPU label.
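As an illustrative example (the numbers are chosen purely for explanation and do not appear in the specification): suppose y = 2 and there are N = 3 available GPU nodes whose remaining GPU resources sort as x1 = 4, x2 = 3, x3 = 1. Since x2 ≥ y > x3, the centralized strategy is chosen, and the task may be placed entirely on any node with at least 2 remaining GPUs (the node with 4 or the node with 3). If instead y = 5, no single node has enough remaining GPU resources, so the task is randomly distributed across several GPU nodes.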
Based on this, in the YARN cluster GPU resource scheduling method provided in the embodiment of the present invention, if it is determined that the GPU resource scheduling policy is to select any GPU node whose remaining GPU resource amount is not less than the GPU resource amount required to complete the task, and allocate the task to the GPU node in a centralized manner for processing, after the GPU resource is scheduled for the task according to the determined resource scheduling policy, the method further includes:
modifying the selected GPU node label into a preset label; and
after the task is completed and submitted, the method further comprises the following steps:
and modifying the selected GPU node label as an initial label.
In specific implementation, when the task on a GPU node is completed or fails to execute, YARN automatically releases the occupied vcores, and a corresponding number of GPU cards are released at the same time.
In order that the invention may be well understood, the following description is illustrative of the practice of the invention with reference to specific examples.
In specific implementation, a number of deep learning frameworks, such as Caffe and TensorFlow, need to be applied directly on a Hadoop-based big data platform.
Taking the open-source project TensorFlowOnSpark (TFoS) as an example, this tool enables distributed execution and management of the mainstream deep learning framework TensorFlow on a cluster. However, when a task is submitted to a Hadoop cluster using Spark, GPU resources cannot be scheduled. According to the embodiment of the present invention, running TFoS on a Hadoop platform can be implemented with the following specific steps:
1. and starting a label scheduling strategy of the YARN.
yarn.node-labels.enabled=true。
2. Set the number of vcores of each GPU node, and limit the number of vcores that each task can use (a consolidated illustrative configuration for these settings is sketched after step 5).
a) yarn.
b) yarn.scheduler.maximum-allocation-vcores=N
3. Label the GPU nodes and the CPU nodes.
a) yarn rmadmin -addToClusterNodeLabels cpu,gpu
b) yarn rmadmin -replaceLabelsOnNode "GPUserver_xx=gpu CPUserver=cpu"
c) yarn rmadmin -replaceLabelsOnNode "GPUserver_xx=gpu*** CPUserver=cpu" (for centralized scheduling, a random number *** is used to ensure that the gpu label is unique).
4. Use the capacity scheduler in YARN, and set the access rights of a specified queue (e.g., queue qgpu) to the GPU nodes.
a) yarn.scheduler.capacity.root.qgpu.accessible-node-labels=cpu,gpu
b) yarn.scheduler.capacity.root.qgpu.accessible-node-labels.cpu.capacity=100
c) yarn.scheduler.capacity.root.qgpu.accessible-node-labels.gpu.capacity=100
5. Submit the TFoS task, specifying the queue name and the labels used by the AM and the executors.
a) --queue qgpu
b) --conf spark.yarn.am.nodeLabelExpression="cpu"
c) --conf spark.yarn.executor.nodeLabelExpression="gpu"
For the centralized scheduling policy, the special tag defined above needs to be specified instead:
--conf spark.yarn.executor.nodeLabelExpression="gpu***"
After successful submission, the tag is changed back with:
yarn rmadmin -replaceLabelsOnNode "GPUserver_xx=gpu"
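For readability, the steps above can be pieced together into one illustrative end-to-end sketch. It rests on several assumptions that are not fixed by this specification: the truncated parameter in step 2 a) is taken to be the standard yarn.nodemanager.resource.cpu-vcores setting, the GPU node is assumed to be named GPUserver_01 with 4 GPU cards, the CPU node CPUserver_01, and tfos_job.py is a placeholder for the submitted TFoS script.
yarn.node-labels.enabled=true
yarn.nodemanager.resource.cpu-vcores=4 (on the GPU node: 4 vcores standing for 4 GPU cards; assumed parameter name)
yarn.scheduler.maximum-allocation-vcores=4
yarn.scheduler.capacity.root.qgpu.accessible-node-labels=cpu,gpu
yarn.scheduler.capacity.root.qgpu.accessible-node-labels.cpu.capacity=100
yarn.scheduler.capacity.root.qgpu.accessible-node-labels.gpu.capacity=100
yarn rmadmin -addToClusterNodeLabels cpu,gpu
yarn rmadmin -replaceLabelsOnNode "GPUserver_01=gpu CPUserver_01=cpu"
spark-submit --master yarn --queue qgpu --conf spark.yarn.am.nodeLabelExpression="cpu" --conf spark.yarn.executor.nodeLabelExpression="gpu" tfos_job.py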
In the YARN cluster GPU resource scheduling method provided by the embodiment of the invention, GPU labels are added to the GPU nodes contained in the YARN cluster. When GPU resources need to be scheduled for a received task, the available GPU nodes in the YARN cluster are determined according to the GPU labels, the number of remaining GPU resources of each available GPU node is queried, a GPU resource scheduling strategy is determined according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, and GPU resources are then scheduled on the basis of that strategy, so that GPU resource scheduling is realized in the YARN cluster while the complexity of GPU resource scheduling is reduced.
The YARN cluster GPU resource scheduling method provided by the embodiment of the invention has low complexity, is simple to implement, makes full use of the existing functions of YARN, requires little development work, saves time and research-and-development cost, and does not increase operation and maintenance cost.
Based on the same inventive concept, the embodiment of the invention further provides a YARN cluster GPU resource scheduling device. Because the principle by which the device solves the problem is similar to that of the YARN cluster GPU resource scheduling method, the implementation of the device can refer to the implementation of the method, and repeated parts are not described again.
Fig. 3 is a schematic structural diagram of the YARN cluster GPU resource scheduling device provided in the embodiment of the present invention. The device includes:
a receiving unit 31, configured to receive a task that needs to schedule GPU resources, where the task carries a quantity of GPU resources needed to complete the task;
a first determining unit 32, configured to determine available GPU nodes in the YARN cluster according to the GPU tag and query the remaining GPU resource amount of each available GPU node;
a second determining unit 33, configured to determine a GPU resource scheduling policy according to the number of GPU resources needed to complete the task and the number of remaining GPU resources of each available GPU node;
and the resource scheduling unit 34 is configured to schedule GPU resources for the task according to the determined resource scheduling policy.
Optionally, the second determining unit is specifically configured to: if it is determined, according to the number of GPU resources needed to complete the task and the number of remaining GPU resources of each available GPU node, that there is at least one GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, determine that the GPU resource scheduling policy is to select any GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, and to allocate the task to that GPU node for centralized processing; and if it is determined, according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, that the number of remaining GPU resources of every GPU node is smaller than the number of GPU resources needed to complete the task, determine that the GPU resource scheduling policy is to randomly distribute the task to a plurality of GPU nodes for processing.
Optionally, the apparatus further comprises:
a tag modifying unit, configured to, if the second determining unit determines that the GPU resource scheduling policy is to select any GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task and to allocate the task to that GPU node for centralized processing, modify the tag of the selected GPU node to a preset tag after the resource scheduling unit schedules the GPU resources for the task according to the determined resource scheduling policy; and, after the task is submitted, modify the tag of the selected GPU node back to the initial tag.
Optionally, the apparatus further comprises:
a resource setting unit, configured to set, before the receiving unit receives a task that needs GPU resources to be scheduled, the number of available GPU resources of each GPU node included in the YARN cluster according to the following method: for each GPU node, according to the number of available GPU resources of the GPU node, setting the CPU virtual core value of the YARN parameter corresponding to the node to an integer multiple of the number of available GPU resources of the GPU node.
For convenience of description, the above parts are separately described as modules (or units) according to functional division. Of course, the functionality of the various modules (or units) may be implemented in the same or in multiple pieces of software or hardware in practicing the invention.
Having described the YARN cluster GPU resource scheduling method and apparatus of an exemplary embodiment of the present invention, a computing apparatus according to another exemplary embodiment of the present invention is described next.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," module "or" system.
In some possible embodiments, a computing device according to the present invention may comprise at least one processing unit, and at least one memory unit. Wherein the storage unit stores program code, which, when executed by the processing unit, causes the processing unit to perform the steps of the YARN cluster GPU resource scheduling method according to various exemplary embodiments of the present invention described above in this specification. For example, the processing unit may execute step S21 shown in fig. 2, receive a task requiring GPU resource scheduling, where the task carries the number of GPU resources required to complete the task, and step S22, determine available GPU nodes in the YARN cluster according to the GPU tag and query the remaining number of GPU resources of each available GPU node, step S23, determine a GPU resource scheduling policy according to the number of GPU resources required to complete and the remaining number of GPU resources of each available GPU node, and step S24, schedule GPU resources for the task according to the determined resource scheduling policy.
The computing device 40 according to this embodiment of the invention is described below with reference to fig. 4. The computing device 40 shown in fig. 4 is only an example and should not bring any limitations to the functionality or scope of use of embodiments of the present invention.
As shown in fig. 4, the computing apparatus 40 is embodied in the form of a general purpose computing device. Components of computing device 40 may include, but are not limited to: the at least one processing unit 41, the at least one memory unit 42, and a bus 43 connecting the various system components (including the memory unit 42 and the processing unit 41).
Bus 43 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, or a local bus using any of a variety of bus architectures.
The storage unit 42 may include readable media in the form of volatile memory, such as Random Access Memory (RAM)421 and/or cache memory 422, and may further include Read Only Memory (ROM) 423.
The storage unit 42 may also include a program/utility 425 having a set (at least one) of program modules 424, such program modules 424 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Computing device 40 may also communicate with one or more external devices 44 (e.g., keyboard, pointing device, etc.), with one or more devices that enable a user to interact with computing device 40, and/or with any devices (e.g., router, modem, etc.) that enable computing device 40 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 45. Also, computing device 40 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) through network adapter 46. As shown, the network adapter 46 communicates with other modules for the computing device 40 over the bus 43. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computing device 40, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
In some possible embodiments, the various aspects of the YARN cluster GPU resource scheduling method provided by the present invention may also be implemented in the form of a program product, which includes program code for causing a computer device to perform the steps of the YARN cluster GPU resource scheduling method according to the various exemplary embodiments of the present invention described above in this specification when the program product is run on the computer device. For example, the computer device may perform step S21 shown in fig. 2, receiving a task that needs GPU resources to be scheduled, the task carrying the number of GPU resources needed to complete it; step S22, determining the available GPU nodes in the YARN cluster according to the GPU tags and querying the number of remaining GPU resources of each available GPU node; step S23, determining a GPU resource scheduling policy according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node; and step S24, scheduling GPU resources for the task according to the determined resource scheduling strategy.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product for YARN cluster GPU resource scheduling of embodiments of the present invention may employ a portable compact disk read only memory (CD-ROM) and include program code, and may be run on a computing device. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device over any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., over the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the units described above may be embodied in one unit, according to embodiments of the invention. Conversely, the features and functions of one unit described above may be further divided into embodiments by a plurality of units.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (8)

1. A YARN cluster GPU resource scheduling method, characterized in that GPU labels are added to the graphics processing unit (GPU) nodes contained in a Yet Another Resource Negotiator (YARN) cluster;
the method comprises the following steps:
receiving a task needing to schedule GPU resources, wherein the task carries the number of GPU resources needed for completing the task;
determining available GPU nodes in the YARN cluster according to the GPU labels and inquiring the residual GPU resource quantity of each available GPU node;
determining a GPU resource scheduling strategy according to the number of the needed GPU resources and the number of the residual GPU resources of each available GPU node;
scheduling GPU resources for the tasks according to the determined resource scheduling strategy;
wherein determining a GPU resource scheduling strategy according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node specifically comprises:
if it is determined, according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, that there is at least one GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, determining that the GPU resource scheduling strategy is to select any GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, and to allocate the task to that GPU node for centralized processing;
and if it is determined, according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, that the number of remaining GPU resources of every GPU node is smaller than the number of GPU resources needed to complete the task, determining that the GPU resource scheduling strategy is to randomly distribute the task to a plurality of GPU nodes for processing.
2. The method according to claim 1, wherein if it is determined that the GPU resource scheduling policy is to select any GPU node having a remaining GPU resource amount not less than the GPU resource amount required to complete the task, and allocate the task collectively to the GPU node for processing, after scheduling the GPU resource for the task according to the determined resource scheduling policy, the method further comprises:
modifying the selected GPU node label into a preset label; and
after the task is completed and submitted, the method further comprises the following steps:
and modifying the selected GPU node label as an initial label.
3. The method of claim 1 or 2 wherein prior to receiving a task requiring scheduling of GPU resources, the number of available GPU resources for each GPU node included in the YARN cluster is set as follows:
for each GPU node, according to the number of available GPU resources of the GPU node, setting the CPU virtual core value of the YARN parameter corresponding to the node to an integer multiple of the number of available GPU resources of the GPU node.
4. A YARN cluster GPU resource scheduling apparatus, characterized in that GPU labels are added to the graphics processing unit (GPU) nodes contained in a Yet Another Resource Negotiator (YARN) cluster; and
the apparatus comprises:
a receiving unit, configured to receive a task that needs GPU resources to be scheduled, wherein the task carries the number of GPU resources needed to complete the task;
a first determining unit, configured to determine available GPU nodes in the YARN cluster according to the GPU tag and query the remaining GPU resource amount of each available GPU node;
a second determining unit, configured to determine a GPU resource scheduling policy according to the number of GPU resources needed to complete the task and the number of remaining GPU resources of each available GPU node;
the resource scheduling unit is used for scheduling GPU resources for the tasks according to the determined resource scheduling strategy;
the second determining unit is specifically configured to: if it is determined, according to the number of GPU resources needed to complete the task and the number of remaining GPU resources of each available GPU node, that there is at least one GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, determine that the GPU resource scheduling policy is to select any GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, and to allocate the task to that GPU node for centralized processing; and if it is determined, according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, that the number of remaining GPU resources of every GPU node is smaller than the number of GPU resources needed to complete the task, determine that the GPU resource scheduling policy is to randomly distribute the task to a plurality of GPU nodes for processing.
5. The apparatus of claim 4, further comprising:
a tag modifying unit, configured to, if the second determining unit determines that the GPU resource scheduling policy is to select any GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task and to allocate the task to that GPU node for centralized processing, modify the tag of the selected GPU node to a preset tag after the resource scheduling unit schedules the GPU resources for the task according to the determined resource scheduling policy; and, after the task is submitted, modify the tag of the selected GPU node back to the initial tag.
6. The apparatus of claim 4 or 5, further comprising:
a resource setting unit, configured to set, before the receiving unit receives a task that needs GPU resources to be scheduled, the number of available GPU resources of each GPU node included in the YARN cluster according to the following method: for each GPU node, according to the number of available GPU resources of the GPU node, setting the CPU virtual core value of the YARN parameter corresponding to the node to an integer multiple of the number of available GPU resources of the GPU node.
7. A computing device comprising at least one processing unit and at least one memory unit, wherein the memory unit stores a computer program that, when executed by the processing unit, causes the processing unit to perform the steps of the method of any of claims 1 to 3.
8. A computer-readable medium, in which a computer program is stored which is executable by a computing device, the program, when run on the computing device, causing the computing device to perform the steps of the method of any one of claims 1 to 3.
CN201810001081.5A 2018-01-02 2018-01-02 YARN cluster GPU resource scheduling method, device and medium Active CN109992407B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810001081.5A CN109992407B (en) 2018-01-02 2018-01-02 YARN cluster GPU resource scheduling method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810001081.5A CN109992407B (en) 2018-01-02 2018-01-02 YARN cluster GPU resource scheduling method, device and medium

Publications (2)

Publication Number Publication Date
CN109992407A CN109992407A (en) 2019-07-09
CN109992407B (en) 2020-11-20

Family

ID=67128224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810001081.5A Active CN109992407B (en) 2018-01-02 2018-01-02 YARN cluster GPU resource scheduling method, device and medium

Country Status (1)

Country Link
CN (1) CN109992407B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413412B (en) * 2019-07-19 2022-03-25 苏州浪潮智能科技有限公司 GPU (graphics processing Unit) cluster resource allocation method and device
KR20210020570A (en) 2019-08-16 2021-02-24 삼성전자주식회사 Electronic apparatus and method for controlling thereof
CN110704186B (en) * 2019-09-25 2022-05-24 国家计算机网络与信息安全管理中心 Computing resource allocation method and device based on hybrid distribution architecture and storage medium
CN110941481A (en) * 2019-10-22 2020-03-31 华为技术有限公司 Resource scheduling method, device and system
CN111176846B (en) * 2019-12-30 2023-06-13 云知声智能科技股份有限公司 Task allocation method and device
CN111190718A (en) * 2020-01-07 2020-05-22 第四范式(北京)技术有限公司 Method, device and system for realizing task scheduling
CN113344311A (en) * 2020-03-03 2021-09-03 北京国双科技有限公司 Task execution method and device, storage medium, processor and electronic equipment
CN113535332A (en) * 2021-08-11 2021-10-22 北京字节跳动网络技术有限公司 Cluster resource scheduling method and device, computer equipment and storage medium
CN113961327A (en) * 2021-10-27 2022-01-21 北京科杰科技有限公司 Resource scheduling management method for large-scale Hadoop cluster

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677467A (en) * 2015-12-31 2016-06-15 中国科学院深圳先进技术研究院 Yarn resource scheduler based on quantified labels
CN106959891A (en) * 2017-03-30 2017-07-18 山东超越数控电子有限公司 A kind of cluster management method and system for realizing GPU scheduling

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9940169B2 (en) * 2015-07-23 2018-04-10 Pearson Education, Inc. Real-time partitioned processing streaming

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677467A (en) * 2015-12-31 2016-06-15 中国科学院深圳先进技术研究院 Yarn resource scheduler based on quantified labels
CN106959891A (en) * 2017-03-30 2017-07-18 山东超越数控电子有限公司 A kind of cluster management method and system for realizing GPU scheduling

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于Hadoop集群的大规模分布式深度学习";Cyprien Noel等;《https://www.csdn.net/article/2015-10-01/2825840》;20151003;第1-3页 *

Also Published As

Publication number Publication date
CN109992407A (en) 2019-07-09

Similar Documents

Publication Publication Date Title
CN109992407B (en) YARN cluster GPU resource scheduling method, device and medium
CN108537543B (en) Parallel processing method, device, equipment and storage medium for blockchain data
US9396028B2 (en) Scheduling workloads and making provision decisions of computer resources in a computing environment
CN109144710B (en) Resource scheduling method, device and computer readable storage medium
CN107943577B (en) Method and device for scheduling tasks
US9501318B2 (en) Scheduling and execution of tasks based on resource availability
KR101893982B1 (en) Distributed processing system, scheduler node and scheduling method of distributed processing system, and apparatus for generating program thereof
CN110389816B (en) Method, apparatus and computer readable medium for resource scheduling
US8595735B2 (en) Holistic task scheduling for distributed computing
US10977076B2 (en) Method and apparatus for processing a heterogeneous cluster-oriented task
US20140344813A1 (en) Scheduling homogeneous and heterogeneous workloads with runtime elasticity in a parallel processing environment
US11507419B2 (en) Method,electronic device and computer program product for scheduling computer resources in a task processing environment
US20090282413A1 (en) Scalable Scheduling of Tasks in Heterogeneous Systems
CN109257399B (en) Cloud platform application program management method, management platform and storage medium
CN111209077A (en) Deep learning framework design method
WO2023116067A1 (en) Power service decomposition method and system for 5g cloud-edge-end collaboration
CN113157379A (en) Cluster node resource scheduling method and device
US9471387B2 (en) Scheduling in job execution
CN115827250A (en) Data storage method, device and equipment
CN113658351B (en) Method and device for producing product, electronic equipment and storage medium
CN113204425B (en) Method, device, electronic equipment and storage medium for process management internal thread
CN114490062A (en) Local disk scheduling method and device, electronic equipment and storage medium
EP4123449A1 (en) Resource scheduling method and related device
CN114138488A (en) Cloud-native implementation method and system based on elastic high-performance computing
CN106933646B (en) Method and device for creating virtual machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant