CN109992407B - YARN cluster GPU resource scheduling method, device and medium - Google Patents

YARN cluster GPU resource scheduling method, device and medium

Info

Publication number
CN109992407B
CN109992407B (application number CN201810001081.5A)
Authority
CN
China
Prior art keywords
gpu
node
task
resources
resource
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810001081.5A
Other languages
Chinese (zh)
Other versions
CN109992407A (en)
Inventor
丛鹏宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Communications Ltd Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Communications Ltd Research Institute filed Critical China Mobile Communications Group Co Ltd
Priority to CN201810001081.5A priority Critical patent/CN109992407B/en
Publication of CN109992407A publication Critical patent/CN109992407A/en
Application granted granted Critical
Publication of CN109992407B publication Critical patent/CN109992407B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5072Grid computing

Abstract

The invention discloses a YARN cluster GPU resource scheduling method, device and medium, which are used for reducing the complexity of GPU resource scheduling while realizing GPU resource scheduling in a YARN cluster. In the YARN cluster GPU resource scheduling method, GPU labels are added to the GPU nodes contained in the YARN cluster. The method comprises the following steps: receiving a task that needs GPU resources to be scheduled, wherein the task carries the number of GPU resources needed to complete the task; determining the available GPU nodes in the YARN cluster according to the GPU labels and querying the number of remaining GPU resources of each available GPU node; determining a GPU resource scheduling strategy according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node; and scheduling GPU resources for the task according to the determined resource scheduling strategy.

Description

YARN cluster GPU resource scheduling method, device and medium
Technical Field
The invention relates to the technical field of big data processing, in particular to a YARN cluster GPU resource scheduling method, device and medium.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
YARN (Yet Another Resource Negotiator) is a cluster resource management system used on the Hadoop platform, and supports the management and scheduling of CPU and memory resources. Owing to its strong computing power, the GPU (Graphics Processing Unit) is widely used in machine learning and greatly accelerates the training of models. Conventional YARN does not support the scheduling of GPU resources; to support it, there are currently two main schemes, as follows.
The first scheme is to directly modify the source code of YARN and extend its scheduling function to support the scheduling of GPU resources. The advantage of this scheme is better compatibility. However, due to the complexity of the YARN source code and some of its limitations, the scheme is difficult to implement, bugs are difficult to debug, the development period is long, and the time and labor costs are high. The second scheme is to implement an independent GPU resource management system that is solely responsible for the management and scheduling of GPU resources. The advantage of this scheme is that it is not constrained by YARN and is flexible. However, since a complete system needs to be developed independently, the development cost is high, more operation and maintenance cost is introduced, and the complexity of the whole system increases.
Disclosure of Invention
The embodiment of the invention provides a YARN cluster GPU resource scheduling method, device and medium, which are used for reducing the complexity of GPU resource scheduling while GPU resource scheduling is realized in a YARN cluster.
In a first aspect, a method for scheduling YARN cluster GPU resources is provided, in which GPU labels are added to the graphics processing unit (GPU) nodes included in a Yet Another Resource Negotiator (YARN) cluster;
the method comprises the following steps:
receiving a task needing to schedule GPU resources, wherein the task carries the number of GPU resources needed for completing the task;
determining available GPU nodes in the YARN cluster according to the GPU labels and inquiring the residual GPU resource quantity of each available GPU node;
determining a GPU resource scheduling strategy according to the number of the needed GPU resources and the number of the residual GPU resources of each available GPU node;
and scheduling GPU resources for the tasks according to the determined resource scheduling strategy.
Optionally, determining a GPU resource scheduling strategy according to the number of GPU resources needed to complete the task and the number of remaining GPU resources of each available GPU node specifically includes:
if it is determined, according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, that there is at least one GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, determining that the GPU resource scheduling strategy is to select any GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, and to allocate the task to that GPU node for centralized processing;
and if it is determined, according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, that the number of remaining GPU resources of every GPU node is smaller than the number of GPU resources needed to complete the task, determining that the GPU resource scheduling strategy is to randomly distribute the task to a plurality of GPU nodes for processing.
Optionally, if it is determined that the GPU resource scheduling policy is to select any GPU node whose remaining GPU resource amount is not less than the GPU resource amount required to complete the task, and allocate the task to the GPU node in a centralized manner for processing, after the GPU resource is scheduled for the task according to the determined resource scheduling policy, the method further includes:
modifying the selected GPU node label into a preset label; and
after the task is completed and submitted, the method further comprises the following steps:
and modifying the selected GPU node label as an initial label.
Optionally, before receiving a task that requires scheduling of GPU resources, the number of available GPU resources of each GPU node included in the YARN cluster is set according to the following method:
for each GPU node, according to the number of available GPU resources of the GPU node, setting the CPU virtual core value of the YARN parameter corresponding to the node to an integer multiple of the number of available GPU resources of the GPU node.
In a second aspect, a YARN cluster GPU resource scheduling apparatus is provided, in which GPU tags are added to the graphics processing unit (GPU) nodes included in a Yet Another Resource Negotiator (YARN) cluster; and
the apparatus comprises:
a receiving unit, configured to receive a task that needs GPU resources to be scheduled, wherein the task carries the number of GPU resources needed to complete the task;
a first determining unit, configured to determine available GPU nodes in the YARN cluster according to the GPU tag and query the remaining GPU resource amount of each available GPU node;
a second determining unit, configured to determine a GPU resource scheduling policy according to the number of GPU resources needed to complete the task and the number of remaining GPU resources of each available GPU node;
and the resource scheduling unit is used for scheduling GPU resources for the tasks according to the determined resource scheduling strategy.
Optionally, the second determining unit is specifically configured to: if it is determined, according to the number of GPU resources needed to complete the task and the number of remaining GPU resources of each available GPU node, that there is at least one GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, determine that the GPU resource scheduling policy is to select any GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, and to allocate the task to that GPU node for centralized processing; and if it is determined, according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, that the number of remaining GPU resources of every GPU node is smaller than the number of GPU resources needed to complete the task, determine that the GPU resource scheduling policy is to randomly distribute the task to a plurality of GPU nodes for processing.
Optionally, the apparatus further comprises:
a tag modifying unit, configured to, if the second determining unit determines that the GPU resource scheduling policy is to select any GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task and to allocate the task to that GPU node for centralized processing, modify the tag of the selected GPU node to a preset tag after the resource scheduling unit schedules the GPU resources for the task according to the determined resource scheduling policy; and, after the task is submitted, modify the tag of the selected GPU node back to the initial tag.
Optionally, the apparatus further comprises:
a resource setting unit, configured to set, before the receiving unit receives a task that needs GPU resources to be scheduled, the number of available GPU resources of each GPU node included in the YARN cluster according to the following method: for each GPU node, according to the number of available GPU resources of the GPU node, setting the CPU virtual core value of the YARN parameter corresponding to the node to an integer multiple of the number of available GPU resources of the GPU node.
In a third aspect, a computing device is provided, comprising at least one processing unit and at least one memory unit, wherein the memory unit stores a computer program that, when executed by the processing unit, causes the processing unit to perform the steps of any of the above methods.
In a fourth aspect, there is provided a computer readable medium storing a computer program executable by a computing device, the program, when run on the computing device, causing the computing device to perform the steps of any of the methods described above.
In the YARN cluster GPU resource scheduling method, device and medium provided by the embodiments of the invention, GPU labels are added to the GPU nodes contained in the YARN cluster. When GPU resources need to be scheduled for a received task, the available GPU nodes in the YARN cluster are determined according to the GPU labels, the number of remaining GPU resources of each available GPU node is queried, a GPU resource scheduling strategy is determined according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, and GPU resources are then scheduled on the basis of that strategy, so that GPU resource scheduling is realized in the YARN cluster while the complexity of GPU resource scheduling is reduced.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention and not to limit the invention. In the drawings:
FIG. 1 is a schematic diagram of adding GPU labels to GPU nodes included in a YARN cluster according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating an implementation flow of a YARN cluster GPU resource scheduling method in an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a YARN cluster GPU resource scheduling device in an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a computing device according to an embodiment of the invention.
Detailed Description
In order to realize GPU resource scheduling in the YARN cluster and reduce the complexity of GPU resource scheduling, the embodiment of the invention provides a YARN cluster GPU resource scheduling method, device and medium.
The preferred embodiments of the present invention will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are merely for illustrating and explaining the present invention, and are not intended to limit the present invention, and that the embodiments and features of the embodiments in the present invention may be combined with each other without conflict.
In the embodiment of the invention, based on the label function of the Hadoop YARN cluster, a CPU label is added to each CPU node in the YARN cluster, and a GPU label is added to each GPU node in the YARN cluster, as shown in fig. 1.
In specific implementation, since YARN does not support directly acquiring the available GPU resources of a GPU node, in the embodiment of the present invention the available GPU resources of a GPU node may be obtained in the following manner: for each GPU node, according to the number of available GPU resources of the GPU node, the CPU virtual core value of the YARN parameter corresponding to the node is set to an integer multiple of the number of available GPU resources of the GPU node.
For example, assuming that the number of GPU resources of a certain GPU node is x, the number of CPU virtual cores (vcores) in the YARN parameter of the node is set to an integer multiple of x, and may, without loss of generality, be set to x vcores. The number of GPU resources needed to complete a received task is then the number of vcores required for the task: however many vcores the task occupies, it occupies that many GPUs. Therefore, the number of available GPU resources of a GPU node can be obtained by querying the number of available vcores of the node.
It should be noted that the GPU resources involved in the embodiments of the present invention may be, but are not limited to, the number of cards of the GPU node.
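As a brief illustration (the parameter name yarn.nodemanager.resource.cpu-vcores and the GPU count of 4 are assumptions used only for this example; the specification does not fix either), a GPU node with 4 GPU cards could expose its GPU capacity through the node-level vcore setting:
yarn.nodemanager.resource.cpu-vcores=4 (the node's vcore count is set equal to its 4 GPU cards)
A task that needs 2 GPUs then simply requests containers with 2 vcores on the gpu-labeled nodes, and thereby occupies 2 GPU cards.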
As shown in fig. 2, which is a schematic diagram of an implementation flow of the YARN cluster GPU resource scheduling method provided by the embodiment of the present invention, the method may include the following steps:
s21, receiving a task needing to schedule GPU resources, wherein the task carries the number of GPU resources needed for completing the task.
S22, determining the available GPU nodes in the YARN cluster according to the GPU labels and inquiring the residual GPU resource quantity of each available GPU node.
And S23, determining a GPU resource scheduling strategy according to the number of the needed GPU resources and the number of the residual GPU resources of each available GPU node.
And S24, scheduling GPU resources for the task according to the determined resource scheduling strategy.
In specific implementation, in step S23, the GPU resource scheduling strategy may be determined in the following manner: if it is determined, according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, that there is at least one GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, the GPU resource scheduling strategy is determined to be selecting any GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task and allocating the task to that GPU node for centralized processing; and if it is determined, according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, that the number of remaining GPU resources of every GPU node is smaller than the number of GPU resources needed to complete the task, the GPU resource scheduling strategy is determined to be randomly distributing the task to a plurality of GPU nodes for processing.
In specific implementation, after a task requiring GPU resources is received, the available GPU nodes included in the YARN cluster are determined, and the number of remaining GPU resources of each available GPU node is queried. Given the number y of GPU resources required to complete the task, all available GPU nodes are sorted by their number of remaining GPU resources. Assuming that there are N available GPU nodes, the sorted remaining amounts are x1 ≥ x2 ≥ x3 ≥ ... ≥ xN. If there exists an i such that xi ≥ y > xi+1, the GPU resource scheduling strategy is determined to be the centralized scheduling strategy, that is, any GPU node whose number of remaining GPU resources is not less than the number of GPU resources required by the task is selected to process the task; in specific implementation, all workers of the task are allocated to that single GPU node so that the task is completed in a centralized manner. The tag of the selected GPU node is modified to a preset tag; for example, it may be modified to GPU_XXX, where XXX may be a random number, to indicate that the GPU node is not currently available. After the task is submitted, the tag of the selected GPU node is modified back to the initial tag, namely the initial GPU label.
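As an illustrative example (the numbers are chosen purely for explanation and do not appear in the specification): suppose y = 2 and there are N = 3 available GPU nodes whose remaining GPU resources sort as x1 = 4, x2 = 3, x3 = 1. Since x2 ≥ y > x3, the centralized strategy is chosen, and the task may be placed entirely on any node with at least 2 remaining GPUs (the node with 4 or the node with 3). If instead y = 5, no single node has enough remaining GPU resources, so the task is randomly distributed across several GPU nodes.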
Based on this, in the YARN cluster GPU resource scheduling method provided in the embodiment of the present invention, if it is determined that the GPU resource scheduling policy is to select any GPU node whose remaining GPU resource amount is not less than the GPU resource amount required to complete the task, and allocate the task to the GPU node in a centralized manner for processing, after the GPU resource is scheduled for the task according to the determined resource scheduling policy, the method further includes:
modifying the selected GPU node label into a preset label; and
after the task is completed and submitted, the method further comprises the following steps:
and modifying the selected GPU node label as an initial label.
In specific implementation, when the task on a GPU node is completed or fails to execute, YARN automatically releases the occupied vcores, and a corresponding number of GPU cards are released at the same time.
In order that the invention may be well understood, the following description is illustrative of the practice of the invention with reference to specific examples.
In specific implementation, a number of deep learning frameworks, such as Caffe and TensorFlow, need to be applied directly on a Hadoop-based big data platform.
Taking the open-source project TensorFlowOnSpark (TFoS) as an example, this tool enables distributed execution and management of the mainstream deep learning framework TensorFlow on a cluster. However, when a task is submitted to a Hadoop cluster using Spark, GPU resources cannot be scheduled. According to the embodiment of the present invention, running TFoS on a Hadoop platform can be implemented with the following specific steps:
1. and starting a label scheduling strategy of the YARN.
yarn.node-labels.enabled=true。
2. Set the number of vcores of each GPU node, and limit the number of vcores that each task can use (a consolidated illustrative configuration for these settings is sketched after step 5).
a) yarn.
b) yarn.scheduler.maximum-allocation-vcores=N
3. Label the GPU nodes and the CPU nodes.
a) yarn rmadmin -addToClusterNodeLabels cpu,gpu
b) yarn rmadmin -replaceLabelsOnNode "GPUserver_xx=gpu CPUserver=cpu"
c) yarn rmadmin -replaceLabelsOnNode "GPUserver_xx=gpu*** CPUserver=cpu" (for centralized scheduling, a random number *** is used to ensure that the gpu label is unique).
4. Use the capacity scheduler in YARN, and set the access rights of a specified queue (e.g., queue qgpu) to the GPU nodes.
a) yarn.scheduler.capacity.root.qgpu.accessible-node-labels=cpu,gpu
b) yarn.scheduler.capacity.root.qgpu.accessible-node-labels.cpu.capacity=100
c) yarn.scheduler.capacity.root.qgpu.accessible-node-labels.gpu.capacity=100
5. Submit the TFoS task, specifying the queue name and the labels used by the AM and the executors.
a) --queue qgpu
b) --conf spark.yarn.am.nodeLabelExpression="cpu"
c) --conf spark.yarn.executor.nodeLabelExpression="gpu"
For the centralized scheduling policy, the special tag defined above needs to be specified instead:
--conf spark.yarn.executor.nodeLabelExpression="gpu***"
After successful submission, the tag is changed back with:
yarn rmadmin -replaceLabelsOnNode "GPUserver_xx=gpu"
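For readability, the steps above can be pieced together into one illustrative end-to-end sketch. It rests on several assumptions that are not fixed by this specification: the truncated parameter in step 2 a) is taken to be the standard yarn.nodemanager.resource.cpu-vcores setting, the GPU node is assumed to be named GPUserver_01 with 4 GPU cards, the CPU node CPUserver_01, and tfos_job.py is a placeholder for the submitted TFoS script.
yarn.node-labels.enabled=true
yarn.nodemanager.resource.cpu-vcores=4 (on the GPU node: 4 vcores standing for 4 GPU cards; assumed parameter name)
yarn.scheduler.maximum-allocation-vcores=4
yarn.scheduler.capacity.root.qgpu.accessible-node-labels=cpu,gpu
yarn.scheduler.capacity.root.qgpu.accessible-node-labels.cpu.capacity=100
yarn.scheduler.capacity.root.qgpu.accessible-node-labels.gpu.capacity=100
yarn rmadmin -addToClusterNodeLabels cpu,gpu
yarn rmadmin -replaceLabelsOnNode "GPUserver_01=gpu CPUserver_01=cpu"
spark-submit --master yarn --queue qgpu --conf spark.yarn.am.nodeLabelExpression="cpu" --conf spark.yarn.executor.nodeLabelExpression="gpu" tfos_job.py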
In the YARN cluster GPU resource scheduling method provided by the embodiment of the invention, GPU labels are added to the GPU nodes contained in the YARN cluster. When GPU resources need to be scheduled for a received task, the available GPU nodes in the YARN cluster are determined according to the GPU labels, the number of remaining GPU resources of each available GPU node is queried, a GPU resource scheduling strategy is determined according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, and GPU resources are then scheduled on the basis of that strategy, so that GPU resource scheduling is realized in the YARN cluster while the complexity of GPU resource scheduling is reduced.
The YARN cluster GPU resource scheduling method provided by the embodiment of the invention has low complexity, is simple to implement, makes full use of the existing functions of YARN, requires little development work, saves time and research-and-development cost, and does not increase operation and maintenance cost.
Based on the same inventive concept, the embodiment of the invention further provides a YARN cluster GPU resource scheduling device. Because the principle by which the device solves the problem is similar to that of the YARN cluster GPU resource scheduling method, the implementation of the device can refer to the implementation of the method, and repeated parts are not described again.
Fig. 3 is a schematic structural diagram of the YARN cluster GPU resource scheduling device provided in the embodiment of the present invention. The device includes:
a receiving unit 31, configured to receive a task that needs to schedule GPU resources, where the task carries a quantity of GPU resources needed to complete the task;
a first determining unit 32, configured to determine available GPU nodes in the YARN cluster according to the GPU tag and query the remaining GPU resource amount of each available GPU node;
a second determining unit 33, configured to determine a GPU resource scheduling policy according to the number of GPU resources needed to complete the task and the number of remaining GPU resources of each available GPU node;
and the resource scheduling unit 34 is configured to schedule GPU resources for the task according to the determined resource scheduling policy.
Optionally, the second determining unit is specifically configured to: if it is determined, according to the number of GPU resources needed to complete the task and the number of remaining GPU resources of each available GPU node, that there is at least one GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, determine that the GPU resource scheduling policy is to select any GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, and to allocate the task to that GPU node for centralized processing; and if it is determined, according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, that the number of remaining GPU resources of every GPU node is smaller than the number of GPU resources needed to complete the task, determine that the GPU resource scheduling policy is to randomly distribute the task to a plurality of GPU nodes for processing.
Optionally, the apparatus further comprises:
a tag modifying unit, configured to, if the second determining unit determines that the GPU resource scheduling policy is to select any GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task and to allocate the task to that GPU node for centralized processing, modify the tag of the selected GPU node to a preset tag after the resource scheduling unit schedules the GPU resources for the task according to the determined resource scheduling policy; and, after the task is submitted, modify the tag of the selected GPU node back to the initial tag.
Optionally, the apparatus further comprises:
a resource setting unit, configured to set, before the receiving unit receives a task that needs GPU resources to be scheduled, the number of available GPU resources of each GPU node included in the YARN cluster according to the following method: for each GPU node, according to the number of available GPU resources of the GPU node, setting the CPU virtual core value of the YARN parameter corresponding to the node to an integer multiple of the number of available GPU resources of the GPU node.
For convenience of description, the above parts are separately described as modules (or units) according to functional division. Of course, the functionality of the various modules (or units) may be implemented in the same or in multiple pieces of software or hardware in practicing the invention.
Having described the YARN cluster GPU resource scheduling method and apparatus of an exemplary embodiment of the present invention, a computing apparatus according to another exemplary embodiment of the present invention is described next.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or program product. Thus, various aspects of the invention may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," module "or" system.
In some possible embodiments, a computing device according to the present invention may comprise at least one processing unit, and at least one memory unit. Wherein the storage unit stores program code, which, when executed by the processing unit, causes the processing unit to perform the steps of the YARN cluster GPU resource scheduling method according to various exemplary embodiments of the present invention described above in this specification. For example, the processing unit may execute step S21 shown in fig. 2, receive a task requiring GPU resource scheduling, where the task carries the number of GPU resources required to complete the task, and step S22, determine available GPU nodes in the YARN cluster according to the GPU tag and query the remaining number of GPU resources of each available GPU node, step S23, determine a GPU resource scheduling policy according to the number of GPU resources required to complete and the remaining number of GPU resources of each available GPU node, and step S24, schedule GPU resources for the task according to the determined resource scheduling policy.
The computing device 40 according to this embodiment of the invention is described below with reference to fig. 4. The computing device 40 shown in fig. 4 is only an example and should not bring any limitations to the functionality or scope of use of embodiments of the present invention.
As shown in fig. 4, the computing apparatus 40 is embodied in the form of a general purpose computing device. Components of computing device 40 may include, but are not limited to: the at least one processing unit 41, the at least one memory unit 42, and a bus 43 connecting the various system components (including the memory unit 42 and the processing unit 41).
Bus 43 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a processor, or a local bus using any of a variety of bus architectures.
The storage unit 42 may include readable media in the form of volatile memory, such as Random Access Memory (RAM)421 and/or cache memory 422, and may further include Read Only Memory (ROM) 423.
The storage unit 42 may also include a program/utility 425 having a set (at least one) of program modules 424, such program modules 424 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
Computing device 40 may also communicate with one or more external devices 44 (e.g., keyboard, pointing device, etc.), with one or more devices that enable a user to interact with computing device 40, and/or with any devices (e.g., router, modem, etc.) that enable computing device 40 to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface 45. Also, computing device 40 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) through network adapter 46. As shown, the network adapter 46 communicates with other modules for the computing device 40 over the bus 43. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with computing device 40, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
In some possible embodiments, the various aspects of the YARN cluster GPU resource scheduling method provided by the present invention may also be implemented in the form of a program product, which includes program code for causing a computer device to perform the steps of the YARN cluster GPU resource scheduling method according to the various exemplary embodiments of the present invention described above in this specification when the program product is run on the computer device. For example, the computer device may perform step S21 shown in fig. 2, receiving a task that needs GPU resources to be scheduled, the task carrying the number of GPU resources needed to complete it; step S22, determining the available GPU nodes in the YARN cluster according to the GPU tags and querying the number of remaining GPU resources of each available GPU node; step S23, determining a GPU resource scheduling policy according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node; and step S24, scheduling GPU resources for the task according to the determined resource scheduling strategy.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product for YARN cluster GPU resource scheduling of embodiments of the present invention may employ a portable compact disk read only memory (CD-ROM) and include program code, and may be run on a computing device. However, the program product of the present invention is not limited in this regard and, in the present document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device over any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., over the internet using an internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the units described above may be embodied in one unit, according to embodiments of the invention. Conversely, the features and functions of one unit described above may be further divided into embodiments by a plurality of units.
Moreover, while the operations of the method of the invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (8)

1. A YARN cluster GPU resource scheduling method, characterized in that GPU labels are added to the graphics processing unit (GPU) nodes contained in a Yet Another Resource Negotiator (YARN) cluster;
the method comprises the following steps:
receiving a task needing to schedule GPU resources, wherein the task carries the number of GPU resources needed for completing the task;
determining available GPU nodes in the YARN cluster according to the GPU labels and inquiring the residual GPU resource quantity of each available GPU node;
determining a GPU resource scheduling strategy according to the number of the needed GPU resources and the number of the residual GPU resources of each available GPU node;
scheduling GPU resources for the tasks according to the determined resource scheduling strategy;
wherein determining a GPU resource scheduling strategy according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node specifically comprises:
if it is determined, according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, that there is at least one GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, determining that the GPU resource scheduling strategy is to select any GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, and to allocate the task to that GPU node for centralized processing;
and if it is determined, according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, that the number of remaining GPU resources of every GPU node is smaller than the number of GPU resources needed to complete the task, determining that the GPU resource scheduling strategy is to randomly distribute the task to a plurality of GPU nodes for processing.
2. The method according to claim 1, wherein if it is determined that the GPU resource scheduling policy is to select any GPU node having a remaining GPU resource amount not less than the GPU resource amount required to complete the task, and allocate the task collectively to the GPU node for processing, after scheduling the GPU resource for the task according to the determined resource scheduling policy, the method further comprises:
modifying the selected GPU node label into a preset label; and
after the task is completed and submitted, the method further comprises the following steps:
and modifying the selected GPU node label as an initial label.
3. The method of claim 1 or 2 wherein prior to receiving a task requiring scheduling of GPU resources, the number of available GPU resources for each GPU node included in the YARN cluster is set as follows:
for each GPU node, according to the number of available GPU resources of the GPU node, setting the CPU virtual core value of the YARN parameter corresponding to the node to an integer multiple of the number of available GPU resources of the GPU node.
4. A YARN cluster GPU resource scheduling apparatus, characterized in that GPU labels are added to the graphics processing unit (GPU) nodes contained in a Yet Another Resource Negotiator (YARN) cluster; and
the apparatus comprises:
a receiving unit, configured to receive a task that needs GPU resources to be scheduled, wherein the task carries the number of GPU resources needed to complete the task;
a first determining unit, configured to determine available GPU nodes in the YARN cluster according to the GPU tag and query the remaining GPU resource amount of each available GPU node;
a second determining unit, configured to determine a GPU resource scheduling policy according to the number of GPU resources needed to complete the task and the number of remaining GPU resources of each available GPU node;
the resource scheduling unit is used for scheduling GPU resources for the tasks according to the determined resource scheduling strategy;
the second determining unit is specifically configured to: if it is determined, according to the number of GPU resources needed to complete the task and the number of remaining GPU resources of each available GPU node, that there is at least one GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, determine that the GPU resource scheduling policy is to select any GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task, and to allocate the task to that GPU node for centralized processing; and if it is determined, according to the number of GPU resources needed and the number of remaining GPU resources of each available GPU node, that the number of remaining GPU resources of every GPU node is smaller than the number of GPU resources needed to complete the task, determine that the GPU resource scheduling policy is to randomly distribute the task to a plurality of GPU nodes for processing.
5. The apparatus of claim 4, further comprising:
a tag modifying unit, configured to, if the second determining unit determines that the GPU resource scheduling policy is to select any GPU node whose number of remaining GPU resources is not less than the number of GPU resources needed to complete the task and to allocate the task to that GPU node for centralized processing, modify the tag of the selected GPU node to a preset tag after the resource scheduling unit schedules the GPU resources for the task according to the determined resource scheduling policy; and, after the task is submitted, modify the tag of the selected GPU node back to the initial tag.
6. The apparatus of claim 4 or 5, further comprising:
a resource setting unit, configured to set, before the receiving unit receives a task that needs GPU resources to be scheduled, the number of available GPU resources of each GPU node included in the YARN cluster according to the following method: for each GPU node, according to the number of available GPU resources of the GPU node, setting the CPU virtual core value of the YARN parameter corresponding to the node to an integer multiple of the number of available GPU resources of the GPU node.
7. A computing device comprising at least one processing unit and at least one memory unit, wherein the memory unit stores a computer program that, when executed by the processing unit, causes the processing unit to perform the steps of the method of any of claims 1 to 3.
8. A computer-readable medium, in which a computer program is stored which is executable by a computing device, the program, when run on the computing device, causing the computing device to perform the steps of the method of any one of claims 1 to 3.
CN201810001081.5A 2018-01-02 2018-01-02 YARN cluster GPU resource scheduling method, device and medium Active CN109992407B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810001081.5A CN109992407B (en) 2018-01-02 2018-01-02 YARN cluster GPU resource scheduling method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810001081.5A CN109992407B (en) 2018-01-02 2018-01-02 YARN cluster GPU resource scheduling method, device and medium

Publications (2)

Publication Number Publication Date
CN109992407A CN109992407A (en) 2019-07-09
CN109992407B (en) 2020-11-20

Family

ID=67128224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810001081.5A Active CN109992407B (en) 2018-01-02 2018-01-02 YARN cluster GPU resource scheduling method, device and medium

Country Status (1)

Country Link
CN (1) CN109992407B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413412B (en) * 2019-07-19 2022-03-25 苏州浪潮智能科技有限公司 GPU (graphics processing Unit) cluster resource allocation method and device
KR20210020570A (en) 2019-08-16 2021-02-24 삼성전자주식회사 Electronic apparatus and method for controlling thereof
CN110704186B (en) * 2019-09-25 2022-05-24 国家计算机网络与信息安全管理中心 Computing resource allocation method and device based on hybrid distribution architecture and storage medium
CN110941481A (en) * 2019-10-22 2020-03-31 华为技术有限公司 Resource scheduling method, device and system
CN111176846B (en) * 2019-12-30 2023-06-13 云知声智能科技股份有限公司 Task allocation method and device
CN111190718A (en) * 2020-01-07 2020-05-22 第四范式(北京)技术有限公司 Method, device and system for realizing task scheduling
CN113344311A (en) * 2020-03-03 2021-09-03 北京国双科技有限公司 Task execution method and device, storage medium, processor and electronic equipment
CN113535332A (en) * 2021-08-11 2021-10-22 北京字节跳动网络技术有限公司 Cluster resource scheduling method and device, computer equipment and storage medium
CN113961327A (en) * 2021-10-27 2022-01-21 北京科杰科技有限公司 Resource scheduling management method for large-scale Hadoop cluster

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677467A (en) * 2015-12-31 2016-06-15 中国科学院深圳先进技术研究院 Yarn resource scheduler based on quantified labels
CN106959891A (en) * 2017-03-30 2017-07-18 山东超越数控电子有限公司 A kind of cluster management method and system for realizing GPU scheduling

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9940169B2 (en) * 2015-07-23 2018-04-10 Pearson Education, Inc. Real-time partitioned processing streaming

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677467A (en) * 2015-12-31 2016-06-15 中国科学院深圳先进技术研究院 Yarn resource scheduler based on quantified labels
CN106959891A (en) * 2017-03-30 2017-07-18 山东超越数控电子有限公司 A kind of cluster management method and system for realizing GPU scheduling

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"基于Hadoop集群的大规模分布式深度学习";Cyprien Noel等;《https://www.csdn.net/article/2015-10-01/2825840》;20151003;第1-3页 *

Also Published As

Publication number Publication date
CN109992407A (en) 2019-07-09

Similar Documents

Publication Publication Date Title
CN109992407B (en) YARN cluster GPU resource scheduling method, device and medium
CN108537543B (en) Parallel processing method, device, equipment and storage medium for blockchain data
US9396028B2 (en) Scheduling workloads and making provision decisions of computer resources in a computing environment
CN109144710B (en) Resource scheduling method, device and computer readable storage medium
CN107943577B (en) Method and device for scheduling tasks
US9501318B2 (en) Scheduling and execution of tasks based on resource availability
KR101893982B1 (en) Distributed processing system, scheduler node and scheduling method of distributed processing system, and apparatus for generating program thereof
CN110389816B (en) Method, apparatus and computer readable medium for resource scheduling
US8595735B2 (en) Holistic task scheduling for distributed computing
US10977076B2 (en) Method and apparatus for processing a heterogeneous cluster-oriented task
US20140344813A1 (en) Scheduling homogeneous and heterogeneous workloads with runtime elasticity in a parallel processing environment
US11507419B2 (en) Method,electronic device and computer program product for scheduling computer resources in a task processing environment
US20090282413A1 (en) Scalable Scheduling of Tasks in Heterogeneous Systems
CN109257399B (en) Cloud platform application program management method, management platform and storage medium
CN111209077A (en) Deep learning framework design method
WO2023116067A1 (en) Power service decomposition method and system for 5g cloud-edge-end collaboration
CN113157379A (en) Cluster node resource scheduling method and device
US9471387B2 (en) Scheduling in job execution
CN115827250A (en) Data storage method, device and equipment
CN113658351B (en) Method and device for producing product, electronic equipment and storage medium
CN113204425B (en) Method, device, electronic equipment and storage medium for process management internal thread
CN114490062A (en) Local disk scheduling method and device, electronic equipment and storage medium
EP4123449A1 (en) Resource scheduling method and related device
CN114138488A (en) Cloud-native implementation method and system based on elastic high-performance computing
CN106933646B (en) Method and device for creating virtual machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant