CN112612600A - Resource scheduling method and device based on DCU and computer equipment - Google Patents

Resource scheduling method and device based on DCU and computer equipment

Info

Publication number
CN112612600A
CN112612600A (application CN202011381447.XA)
Authority
CN
China
Prior art keywords
job
resource
dcu
execution
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011381447.XA
Other languages
Chinese (zh)
Inventor
王建敏
原帅
吕灼恒
南亚
苏垚
余彬
于洁
郭珂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Shuguang Nanjing Computing Technology Co ltd
Original Assignee
Dawning Information Industry Beijing Co Ltd
Dawning Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dawning Information Industry Beijing Co Ltd, Dawning Information Industry Co Ltd filed Critical Dawning Information Industry Beijing Co Ltd
Priority to CN202011381447.XA priority Critical patent/CN112612600A/en
Publication of CN112612600A publication Critical patent/CN112612600A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5038Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a resource scheduling method, apparatus, and computer device based on a DCU (deep learning accelerator). The DCU-based resource scheduling method comprises the following steps: receiving a plurality of deep learning application job tasks, wherein each job task comprises resource demand information and a job task type; acquiring available-resource usage state information in the DCU development environment of the deep learning accelerator; determining the execution order and execution nodes of the job tasks according to the resource demand information, the job task types, and the available-resource usage state information; and scheduling the job tasks to the corresponding execution nodes in that order, so that the execution nodes execute job tasks of the corresponding types. The DCU-based resource scheduling method, apparatus, and computer device support resource scheduling for the DCU and are flexible in application and rich in function.

Description

Resource scheduling method and device based on DCU and computer equipment
Technical Field
The invention relates to the technical field of servers, and in particular to a resource scheduling method and device based on a DCU (deep learning accelerator) and to computer equipment.
Background
The main mechanisms of current mainstream container orchestration platforms comprise four major parts: application deployment, planning, updating, and maintenance. Container orchestration provides mutual isolation among containers: each container has its own file system, processes in different containers cannot affect each other, and computing resources can be partitioned. Combined with Docker container technology, the scheduling system can encapsulate deep learning frameworks such as Caffe and TensorFlow and improve the efficiency of deep learning applications. The scheduling system platform also provides functions such as training-task submission and resource-state monitoring, realizing the scheduling and allocation of resources.
At present, traditional scheduling systems only support training with deep learning frameworks such as Caffe and TensorFlow accelerated by NVIDIA Graphics Processing Units (GPUs), and do not support other accelerator types such as the DCU (deep learning accelerator); their extensibility is poor and their function is limited.
Disclosure of Invention
The object of the present invention is to solve, at least to some extent, one of the above technical problems.
Therefore, the first objective of the present invention is to provide a resource scheduling method based on DCU, which supports resource scheduling of DCU, and has flexible application and rich functions.
The second objective of the present invention is to provide a resource scheduling apparatus based on DCU.
A third object of the invention is to propose a computer device.
A fourth object of the invention is to propose a non-transitory computer-readable storage medium.
In order to achieve the above object, an embodiment of a first aspect of the present invention provides a DCU-based resource scheduling method, where the method includes:
receiving a plurality of deep learning application job tasks, wherein the job tasks comprise resource demand information and job task types;
acquiring available resource use state information under a DCU development environment of a deep learning accelerator;
determining an execution sequence and an execution node of the job task according to the resource demand information, the job task type and the available resource use state information;
and scheduling the job tasks to corresponding execution nodes according to the execution sequence so that the execution nodes execute the job tasks of the corresponding types.
Optionally, determining an execution sequence and an execution node of the job task according to the resource demand information, the job task type, and the available resource usage state information includes:
the job task also comprises a user job priority, and the job task type has a corresponding scheduling type priority;
determining an actual job priority corresponding to the job task according to the user job priority, the resource demand information and the scheduling type priority;
determining the execution sequence of the job tasks according to the actual job priority; and
and determining the execution node with the same type as the job task.
By determining the actual job priority corresponding to the job task according to the user job priority, the resource demand information, and the scheduling type priority, the step of determining the execution order of job tasks is refined and optimal resource allocation is ensured.
Optionally, when the execution node executes the job task, the method further includes:
and creating resource information corresponding to the job task, and recording the resource information to a resource record table.
Recording the resource information in a resource record table allows resources to be counted and facilitates resource management.
Optionally, the method further comprises:
receiving a cancel command for canceling the job task in the process of executing the job task by the execution node;
determining resource use information of the job task according to the cancel command;
and updating a resource record table according to the resource use information.
Cancelling the job task releases the resources it occupied and updates the resource statistics.
Optionally, the method further comprises:
judging whether the execution of the job task is completed;
and if the execution is finished, saving the job task into a historical task table.
Saving job tasks in the historical task table makes it convenient to trace them back later.
Optionally, the deep learning application includes one or more of Caffe, TensorFlow, PyTorch, and Keras.
Various deep learning applications are supported, giving good compatibility.
Optionally, the resource requirement information includes one or more of the number of CPUs, the memory size, and the number of DCUs.
Various kinds of resource information are supported, so the resource requirements can be captured more accurately.
The resource scheduling method based on the DCU supports the resource scheduling of the DCU, and is flexible in application and rich in functions.
In order to achieve the above object, a second aspect of the present invention provides a resource scheduling apparatus based on DCU, including:
a receiving module, configured to receive a plurality of deep learning application job tasks, wherein the job tasks comprise resource demand information and job task types;
the acquisition module is used for acquiring available resource use state information under the DCU development environment of the deep learning accelerator;
the determining module is used for determining the execution sequence and the execution node of the job task according to the resource demand information, the job task type and the available resource use state information;
and the scheduling module is used for scheduling the job tasks to corresponding execution nodes according to the execution sequence so as to enable the execution nodes to execute the job tasks of the corresponding types.
Optionally, the determining module is configured to:
the job task also comprises a user job priority, and the job task type has a corresponding scheduling type priority;
determining an actual job priority corresponding to the job task according to the user job priority, the resource demand information and the scheduling type priority;
determining the execution sequence of the job tasks according to the actual job priority; and
and determining the execution node with the same type as the job task.
Optionally, the apparatus further comprises a creating module, configured to:
and when the execution node executes the job task, creating resource information corresponding to the job task, and recording the resource information to a resource record table.
Optionally, the apparatus further includes an update module, configured to:
receiving a cancel command for canceling the job task in the process of executing the job task by the execution node;
determining resource use information of the job task according to the cancel command;
and updating a resource record table according to the resource use information.
Optionally, the apparatus further comprises a storage module, configured to:
judging whether the execution of the job task is completed;
and if the execution is finished, saving the job task into a historical task table.
Optionally, the deep learning application includes one or more of Caffe, TensorFlow, PyTorch, and Keras.
Optionally, the resource requirement information includes one or more of a CPU number, a memory size, and a DCU number.
The resource scheduling device based on the DCU supports the resource scheduling of the DCU, and is flexible in application and rich in functions.
In order to achieve the above object, an embodiment of a third aspect of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor executes the computer program to implement the DCU-based resource scheduling method according to the embodiment of the first aspect.
In order to achieve the above object, an embodiment of a fourth aspect of the present invention further provides a non-transitory computer-readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, implements the DCU-based resource scheduling method according to the embodiment of the first aspect.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, provide a further understanding of the invention, illustrate exemplary embodiments of the invention, and together with the description serve to explain the invention without limiting it. In the drawings:
FIG. 1 is a flow chart of a DCU-based resource scheduling method according to an embodiment of the present invention;
FIG. 2 is a flowchart of a DCU-based resource scheduling method according to another embodiment of the present invention;
FIG. 3 is a flowchart of a DCU-based resource scheduling method according to another embodiment of the present invention;
FIG. 4 is a flowchart of a DCU-based resource scheduling method according to a further embodiment of the present invention;
FIG. 5 is a diagram illustrating a DCU-based deep learning application architecture according to an embodiment of the present invention;
FIG. 6 is a flowchart of a DCU-based resource scheduling method according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of a DCU-based resource scheduling apparatus according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a DCU-based resource scheduling apparatus according to another embodiment of the present invention;
fig. 9 is a schematic structural diagram of a DCU-based resource scheduling apparatus according to another embodiment of the present invention;
fig. 10 is a schematic structural diagram of a DCU-based resource scheduling apparatus according to still another embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and the features of the embodiments may be combined with each other in the absence of conflict. The present invention will be described in detail below with reference to the embodiments and the accompanying drawings.
The present invention is described in further detail below with reference to specific examples, which are not to be construed as limiting the scope of the invention as claimed.
The following describes a resource scheduling method, device and computer equipment based on DCU according to an embodiment of the present invention with reference to the accompanying drawings.
Fig. 1 is a flowchart of a DCU-based resource scheduling method according to an embodiment of the present invention, as shown in fig. 1, the method includes the following steps:
s1, receiving a plurality of deep learning application job tasks.
In this embodiment, the scheduling system is combined with the DCU, and the scheduling system can receive job tasks of different types of deep learning applications, such as Caffe, TensorFlow, PyTorch, and Keras. A job task may include resource requirement information and a job task type. The resource requirement information includes, but is not limited to, the number of CPUs, the memory size, and the number of DCUs.
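The job-task record described above can be sketched as a small data type. The field names (`framework`, `num_dcus`, and so on) are illustrative assumptions, not the patent's actual data model:

```python
from dataclasses import dataclass

# Hypothetical job-task record for the fields named in S1:
# a job task type, a deep learning framework, and resource demands.
@dataclass
class JobTask:
    framework: str       # e.g. "Caffe", "TensorFlow", "PyTorch", "Keras"
    job_type: str        # scheduling type, e.g. "DCU" or "GPU"
    user_priority: int   # user-assigned job priority
    num_cpus: int        # resource demand: number of CPUs
    memory_mb: int       # resource demand: memory size
    num_dcus: int        # resource demand: number of DCUs

task = JobTask(framework="TensorFlow", job_type="DCU",
               user_priority=5, num_cpus=8, memory_mb=16384, num_dcus=2)
```

A real scheduler would carry more state (owner, submit time, status), but these are the fields the text names.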
And S2, acquiring the available resource use state information under the DCU development environment of the deep learning accelerator.
The scheduling system can dynamically acquire the actual use condition of resources in real time through the acquisition plug-in, such as the resource occupation condition of each node, so as to determine which nodes have available resources to provide for job tasks.
Before that, the components and installation packages required by the underlying DCU/GPU can be pre-installed and deployed, completing the configuration of the DCU/GPU development environment.
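The resource acquisition of S2 can be sketched as follows. The node names and the per-node resource dictionary layout are assumptions for illustration; the patent only says the collection plug-in determines which nodes have available resources:

```python
# Hypothetical collection plug-in: poll each node's totals and current usage,
# and report the free resources of every node that still has any capacity.
def collect_available_resources(nodes):
    """Return {node name: free resources} for nodes with capacity left."""
    available = {}
    for name, node in nodes.items():
        free = {
            "cpus": node["total_cpus"] - node["used_cpus"],
            "memory_mb": node["total_memory_mb"] - node["used_memory_mb"],
            "dcus": node["total_dcus"] - node["used_dcus"],
        }
        if any(v > 0 for v in free.values()):
            available[name] = free
    return available

cluster = {
    "node1": {"total_cpus": 32, "used_cpus": 30,
              "total_memory_mb": 65536, "used_memory_mb": 60000,
              "total_dcus": 4, "used_dcus": 4},
    "node2": {"total_cpus": 32, "used_cpus": 8,
              "total_memory_mb": 65536, "used_memory_mb": 8192,
              "total_dcus": 4, "used_dcus": 1},
}
avail = collect_available_resources(cluster)
```

In the patent's design these figures are pushed periodically to each deep learning framework rather than pulled on demand.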
And S3, determining the execution sequence and the execution node of the job task according to the resource demand information, the job task type and the available resource use state information.
The job task also comprises a user job priority, and the job task type has a corresponding scheduling type priority. Therefore, the scheduling system can determine the actual job priority corresponding to the job task according to the user job priority, the resource demand information and the scheduling type priority, so that the execution sequence of the received job task is determined according to the actual job priority.
Further, an execution node of the same type as the job task may also be determined.
For example, the scheduling system may employ a priority plug-in to determine the priority of job tasks.
Wherein the priority plug-in supports both pre-selected policies and preferred policies.
The pre-selection policy is a mandatory rule: if no node satisfies the job task's request, the job task is suspended until a node can satisfy the scheduling condition. The pre-selection stage filters out the nodes that do not satisfy the configured pre-selection policies; the remaining nodes become candidates, i.e. the input of the preferred stage. The pre-selection algorithm is executed to screen the nodes in the queue. The screening conditions may include: ports do not conflict; CPU and memory resource QoS requirements (if any) are met; the mounted volume type (if any) matches; the node selector rules match; the hard affinity rules match; the node's state (condition) is normal; the taint-toleration hard rule holds; and so on.
The preferred policy works as follows: a score is computed for each candidate node according to the configured preferred policies, and the candidates are ranked by score; the node with the highest score wins, and the Pod is bound to that node. The preferred algorithm is executed to score the nodes remaining in the queue, with each criterion carrying its own weight: the balance of overall CPU and memory resources, whether the required image already exists on the node, whether a port of the same resource is already scheduled, the node-affinity soft rules, the taint-toleration soft rule, and so on.
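The two-phase pre-selection/preferred flow described above can be sketched as follows. The filter conditions and the scoring weights are simplified assumptions, not the actual plug-in logic:

```python
# Hypothetical two-phase scheduler: a hard pre-selection filter,
# then a weighted preferred scoring pass over the survivors.
def preselect(nodes, demand):
    """Hard filter: keep only nodes that can satisfy the job's demand."""
    return [n for n in nodes
            if n["free_cpus"] >= demand["cpus"]
            and n["free_memory_mb"] >= demand["memory_mb"]
            and n["free_dcus"] >= demand["dcus"]]

def prefer(candidates, demand, w_balance=1.0, w_image=0.5):
    """Soft scoring: the highest-scoring candidate wins the binding."""
    def score(n):
        # Reward even CPU/memory utilization after placement.
        cpu_left = (n["free_cpus"] - demand["cpus"]) / n["total_cpus"]
        mem_left = (n["free_memory_mb"] - demand["memory_mb"]) / n["total_memory_mb"]
        balance = 1.0 - abs(cpu_left - mem_left)
        # Reward nodes that already hold the required image.
        image_bonus = 1.0 if n["has_image"] else 0.0
        return w_balance * balance + w_image * image_bonus
    return max(candidates, key=score)

demand = {"cpus": 4, "memory_mb": 4096, "dcus": 1}
nodes = [
    {"name": "A", "free_cpus": 8, "total_cpus": 16,
     "free_memory_mb": 8192, "total_memory_mb": 16384,
     "free_dcus": 2, "has_image": True},
    {"name": "B", "free_cpus": 16, "total_cpus": 16,
     "free_memory_mb": 4096, "total_memory_mb": 16384,
     "free_dcus": 1, "has_image": False},
]
candidates = preselect(nodes, demand)
best = prefer(candidates, demand)
```

If `preselect` returns an empty list, the job would be suspended and queued, as the pre-selection policy above requires.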
And S4, scheduling the job tasks to the corresponding execution nodes according to the execution sequence so that the execution nodes execute the job tasks of the corresponding types.
According to the DCU-based resource scheduling method of the embodiment of the invention, deep learning application job tasks are received; the execution order and the execution nodes of the job tasks are determined according to the resource demand information, the job task types, and the available-resource usage state information; and the job tasks are then scheduled to the corresponding execution nodes in that order, so that the execution nodes execute job tasks of the corresponding types. Resource scheduling for the DCU is thereby supported, with flexible application and rich functions.
In another embodiment of the present invention, as shown in fig. 2, the method further comprises:
S5, when the execution node executes the job task, creating resource information corresponding to the job task and recording the resource information in a resource record table.
After the job task is scheduled successfully, a unique label can be generated from the user's UUID to correspond to the job task, and the resources occupied by the task are recorded, generating a resource record table: for example, how many CPUs, how much memory, and how many DCUs the job task occupies. That is, the resource record table records the resource usage of each node executing the corresponding job task, and counting the used resources facilitates the management of the scheduling system.
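A minimal sketch of the resource record table and the UUID-derived label, assuming an in-memory list of records and a "user UUID plus random suffix" label scheme (both illustrative; the patent does not specify the table layout):

```python
import uuid

# Hypothetical resource record table: one entry per scheduled job task.
resource_records = []

def record_job_resources(user_uuid, job_type, cpus, memory_mb, dcus):
    """Create a unique label for a scheduled job and record its resources."""
    label = f"{user_uuid}-{uuid.uuid4().hex[:8]}"
    resource_records.append({
        "label": label, "type": job_type,
        "cpus": cpus, "memory_mb": memory_mb, "dcus": dcus,
        "state": "running",
    })
    return label

label = record_job_resources("user-1234", "DCU", cpus=8, memory_mb=16384, dcus=2)

# Statistics for management: total DCUs in use by one user.
used_dcus = sum(r["dcus"] for r in resource_records
                if r["label"].startswith("user-1234"))
```

Because the label embeds the user's UUID, per-user resource statistics reduce to filtering the table by label prefix, which matches the ID-tag usage described later in the embodiment.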
In yet another embodiment of the present invention, as shown in fig. 3, the method further comprises:
s6, during the execution of the job task by the executing node, receives a cancel command to cancel the job task.
S7, according to the cancel command, the resource use information of the job task is determined.
And S8, updating the resource record table according to the resource use information.
While a job task is executing, the user can stop or delete it through the web page or the command line. The scheduling system acquires the detailed information corresponding to the job task, such as the CPU indices, the memory size, and the DCU indices, and updates this information into the resource record table; it then cancels the task, stops the container through a docker command, deletes the task information, releases the resources occupied by the job task, and notifies the scheduling system that the task is finished. Updating the resource record table specifically means writing the released resource information into the table and refreshing its statistics, so that the scheduling system always sees the latest state when it queries resource information.
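The cancel flow of S6 to S8 might look like the following sketch. The container and label names are hypothetical, and the docker command is only assembled here, not executed:

```python
# Hypothetical cancel flow: look up the job's record, build the docker
# stop command, mark the job cancelled, and release its resources.
def cancel_job(records, label):
    rec = next(r for r in records if r["label"] == label)
    docker_cmd = ["docker", "stop", rec["container"]]  # would run via subprocess
    rec["state"] = "cancelled"
    released = {"cpus": rec["cpus"], "memory_mb": rec["memory_mb"],
                "dcus": rec["dcus"]}
    # Zero out the record so the statistics reflect the freed resources.
    rec.update(cpus=0, memory_mb=0, dcus=0)
    return docker_cmd, released

records = [{"label": "user-1234-ab12", "container": "job-ab12",
            "cpus": 8, "memory_mb": 16384, "dcus": 2, "state": "running"}]
cmd, freed = cancel_job(records, "user-1234-ab12")
```

In a real system the `docker stop` would be followed by deleting the task information and notifying the scheduler, as the text describes.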
In yet another embodiment of the present invention, as shown in fig. 4, the method further comprises:
s9, judging whether the execution of the job task is completed.
S10, if the execution is finished, saving the job task into the historical task table.
After a task is completed, it can be stored in the historical task table of the scheduling system. Resource usage and resource utilization reports can then be generated by user name, task type, predefined time, or user-defined time period, enabling multi-dimensional backtracking of the resource usage of historical tasks.
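The multi-dimensional report over the historical task table could be sketched as follows, with illustrative field names for the history rows (the patent names the query dimensions, not the schema):

```python
from datetime import datetime, timedelta

# Hypothetical report over the historical task table: filter by user name,
# task type, or a time window, then total the consumed resources.
def usage_report(history, user=None, task_type=None, since=None):
    rows = [h for h in history
            if (user is None or h["user"] == user)
            and (task_type is None or h["type"] == task_type)
            and (since is None or h["finished_at"] >= since)]
    return {
        "jobs": len(rows),
        "cpu_hours": sum(r["cpus"] * r["hours"] for r in rows),
        "dcu_hours": sum(r["dcus"] * r["hours"] for r in rows),
    }

now = datetime(2020, 11, 30)
history = [
    {"user": "alice", "type": "DCU", "cpus": 8, "dcus": 2, "hours": 3,
     "finished_at": now - timedelta(days=1)},
    {"user": "bob", "type": "GPU", "cpus": 4, "dcus": 0, "hours": 5,
     "finished_at": now - timedelta(days=10)},
]
report = usage_report(history, task_type="DCU", since=now - timedelta(days=7))
```

Each keyword argument corresponds to one of the backtracking dimensions named in the text (user name, task type, time period).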
The following is a detailed description of a specific embodiment.
At present, existing scheduling systems train with deep learning frameworks such as Caffe and TensorFlow accelerated by GPUs; they cannot be directly integrated with a DCU to run deep learning applications, do not support combining the DCU with docker, cannot start a development environment with a DCU, and cannot manage resources such as DCUs.
In order to solve the above problems, this embodiment provides a resource scheduling method for deep learning applications based on the DCU: the scheduling system is integrated with the DCU, the DCU is combined with docker, a development environment with the DCU is started, and resources (nodes capable of executing job tasks, and the available resources on those nodes) are automatically discovered by the collection plug-in and pushed to the scheduling system for management. When a job task is created, a unique ID tag is generated from the user's UUID to distinguish it from other users' job tasks. If the user creates job tasks again, the previously generated ID tag can be reused, and the resource usage of different users' job tasks can be counted separately through the ID tags.
Specifically, to combine the DCU, used as a deep learning accelerator, with docker, the DCU driver must be installed on the docker host, the container image must contain the DCU development kit, and the DCU development environment is provided when the container starts. The scheduling system of this embodiment is compatible with two accelerator types, DCU and NVIDIA. The deep learning application architecture of the DCU, shown in fig. 5, comprises the deep learning applications, such as Caffe, TensorFlow, PyTorch, and Keras; a service layer, the Webapp; a scheduling management layer, namely the open-source container-orchestration platform Kubernetes and the job scheduling system SLURM; and the underlying development environment, i.e. the DCU/GPU development environment.
First, the components and installation packages required by the underlying DCU/GPU are automatically installed and deployed, completing the DCU/GPU development environment configuration. Second, different types of deep learning application job tasks are submitted through a web page or the command line; the submitted job task information is saved to a Json file under a specified directory, and the resource information requested by a job task comprises the number of CPUs, the memory size, and the number of DCUs. The collection plug-in of the scheduling system automatically discovers node resources such as DCUs/GPUs, dynamically recalculates from the actual resource usage, regularly pushes the idle resources and the statistics of current resource usage to each deep learning framework, and writes the statistics into a configuration file. Finally, resource information such as DCU/GPU is scheduled and managed through the pre-processing and post-processing scripts of the scheduling system, the resource usage of user tasks is counted, and the resource statistics of historical tasks can be traced back in multiple dimensions.
The specific implementation method can be as shown in fig. 6.
S601, submitting a job task through a web page or a command line and generating a Json file.
Different types of deep learning application job tasks are submitted via a web page or the command line. The submitted task information is saved to a Json file under a specified directory. The resources required in the task information include the number of CPUs, the memory size, and the number of DCUs.
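S601 can be sketched as follows. The Json field names and the spool directory are assumptions; the patent only specifies that the CPU count, memory size, and DCU count are in the file:

```python
import json
import os
import tempfile

# Hypothetical serialization of a submitted job task to a Json file
# in a directory the scheduling system watches.
job = {
    "framework": "TensorFlow",
    "type": "DCU",
    "user_priority": 5,
    "resources": {"cpus": 8, "memory_mb": 16384, "dcus": 2},
}

spool_dir = tempfile.mkdtemp()          # stand-in for the specified directory
path = os.path.join(spool_dir, "job-0001.json")
with open(path, "w") as f:
    json.dump(job, f, indent=2)

# The scheduler side would read the file back when picking up the job.
with open(path) as f:
    loaded = json.load(f)
```

Using one file per job under a watched directory is one simple way to decouple the web/CLI front end from the scheduler, consistent with the flow in fig. 6.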
S602, the dispatching system dynamically counts resources.
The scheduling system can dynamically count the actual usage of resources with the collection plug-in, and regularly pushes the idle resources, the current resource usage, and so on to each deep learning framework: Caffe, TensorFlow, PyTorch, and Keras.
And S603, calculating the job priority by the scheduling system priority plug-in.
The actual priority of the user's job is calculated by the priority plug-in from the user job priority, the actual job resource demand, and the scheduling type priority, and the task submitted by the user is scheduled to the optimal node of the corresponding scheduling type. The scheduling type priority is the scheduling priority corresponding to the job task type; for example, if a job task is of the DCU type, its priority is higher than that of the GPU type. The actual priority determines the order in which job tasks are actually executed.
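One way S603's priority combination could look is sketched below. The rule that the DCU type outranks the GPU type follows the text; the weights and the exact formula are assumptions, since the patent does not specify one:

```python
# Hypothetical actual-priority formula: scheduling-type priority dominates,
# user priority refines it, and larger resource demands are slightly penalized
# so small jobs can backfill sooner.
TYPE_PRIORITY = {"DCU": 2, "GPU": 1}   # DCU type outranks GPU type

def actual_priority(user_priority, job_type, demand, w_user=10, w_type=100):
    demand_penalty = demand["cpus"] + demand["dcus"] * 4
    return (w_type * TYPE_PRIORITY[job_type]
            + w_user * user_priority
            - demand_penalty)

jobs = [
    ("a", actual_priority(5, "GPU", {"cpus": 4, "dcus": 1})),
    ("b", actual_priority(3, "DCU", {"cpus": 8, "dcus": 2})),
]
# Higher actual priority executes first.
execution_order = [name for name, _ in sorted(jobs, key=lambda j: -j[1])]
```

With these weights, job "b" runs first despite its lower user priority, because its DCU scheduling type dominates the score.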
S604, judging whether the resources are matched.
Whether the available resources of the optimal node can be matched is judged according to the resource demand in the task.
S605, if no node can be matched, the job task is queued to wait for available resources.
S606, if a node can be matched, the resource allocation succeeds.
S607, the job is successfully scheduled, and a unique label is generated.
The tag may determine the uniqueness of the job task. If the user wants to trace back the historical state information of a certain job task, the user can inquire through the label.
S608, creating resources and generating a resource record table.
The scheduling system finds the preprocessing script of the corresponding type according to the task type (such as DCU or GPU) and runs it on the management node of the scheduling system to obtain the detailed information of the job task (the number of CPUs, the memory size, and the number of DCUs). Based on this information, the relevant resources are created by launching a docker command. All the acquired task details are recorded in the resource record table of the scheduling system.
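A hedged sketch of the docker command such a preprocessing script might assemble. DCUs are ROCm-class accelerators, so `/dev/kfd` and `/dev/dri` are plausible device nodes to expose, but the exact flags and the image name are assumptions; the command is built here, not run:

```python
# Hypothetical preprocessing step: translate a job's resource demand
# into a docker run command that exposes the DCU devices to the container.
def build_docker_cmd(job):
    cmd = ["docker", "run", "-d",
           "--cpus", str(job["cpus"]),
           "--memory", f'{job["memory_mb"]}m']
    if job["dcus"] > 0:
        # Assumed DCU device nodes (ROCm-style); a real script would map
        # the specific devices the scheduler assigned to this job.
        cmd += ["--device=/dev/kfd", "--device=/dev/dri"]
    cmd.append(job["image"])
    return cmd

cmd = build_docker_cmd({"cpus": 8, "memory_mb": 16384, "dcus": 2,
                        "image": "tensorflow-dcu:latest"})
```

This matches the requirement stated earlier that the image contain the DCU development kit and that the container start with the DCU development environment available.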
S609, judging whether the job state is updated.
The user can stop or delete the task through the web page or the command line; when this happens, the resource state changes and needs to be updated.
The scheduling system calls a post-processing script (a script after the execution of the job task) corresponding to the task type according to the task type, runs the post-processing script at a management node of the scheduling system to acquire detailed information of the task, and completely records and updates the acquired detailed information of the task into a resource record table.
S610, if there is an update, the resource statistics record is updated.
That is, all the task detail information is recorded and updated in the resource record table.
S611, if there is no update, the job is finished and the resources are released.
In addition, after the user's task is completed, it can be saved in the historical task table of the scheduling system. Resource usage and resource utilization reports can then be generated by user name, task type, predefined time, or user-defined time period, achieving multi-dimensional backtracking of the resource usage of historical tasks.
In order to implement the above embodiments, the present invention further provides a resource scheduling apparatus based on DCU.
Fig. 7 is a schematic structural diagram of a DCU-based resource scheduling apparatus according to an embodiment of the present invention.
As shown in fig. 7, the apparatus includes a receiving module 71, an obtaining module 72, a determining module 73, and a scheduling module 74.
A receiving module 71, configured to receive a plurality of deep learning application job tasks, where a job task includes resource requirement information and a job task type.
Deep learning applications include one or more of Caffe, TensorFlow, PyTorch, and Keras.
The resource requirement information comprises one or more of the number of CPUs, the size of the memory and the number of DCUs.
And the obtaining module 72 is configured to obtain available resource usage state information in a DCU development environment of the deep learning accelerator.
And the determining module 73 is used for determining the execution sequence and the execution node of the job task according to the resource demand information, the job task type and the available resource use state information.
Optionally, the determining module 73 is configured to:
the job task also comprises a user job priority, and the job task type has a corresponding scheduling type priority;
determining the actual job priority corresponding to the job task according to the user job priority, the resource demand information and the scheduling type priority;
determining the execution sequence of the job tasks according to the actual job priority; and
and determining the execution nodes whose type matches the job task type.
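The priority combination described above (user job priority, scheduling-type priority, and resource demand) could be sketched as follows. The weights, the type-priority table, and the demand heuristic are illustrative assumptions, not the patented implementation:

```python
# Hypothetical scheduling-type priorities; values are assumptions.
TYPE_PRIORITY = {"tensorflow": 3, "pytorch": 3, "caffe": 2, "keras": 1}

def actual_priority(user_priority, job_type, demand, w_user=10, w_type=5):
    """Combine user priority, scheduling-type priority, and resource
    demand into one actual job priority (higher = scheduled earlier)."""
    type_prio = TYPE_PRIORITY.get(job_type, 0)
    # Illustrative heuristic: lightly penalize large requests so that
    # small jobs are not starved behind large ones.
    demand_cost = demand.get("cpus", 0) + demand.get("dcus", 0) * 4
    return w_user * user_priority + w_type * type_prio - demand_cost

def order_jobs(jobs):
    """Return the jobs sorted by descending actual priority,
    i.e. the execution order of S605 in the method."""
    return sorted(
        jobs,
        key=lambda j: actual_priority(j["user_priority"], j["type"], j["demand"]),
        reverse=True,
    )
```

A weighted sum is only one way to merge the three inputs; the patent leaves the exact combination function open.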
And the scheduling module 74 is configured to schedule the job tasks to the corresponding execution nodes according to the execution sequence, so that the execution nodes execute the job tasks of the corresponding types.
In another embodiment of the present invention, as shown in FIG. 8, the apparatus further comprises a creation module 75.
The creating module 75 is configured to create resource information corresponding to the job task when the execution node executes the job task, and record the resource information in the resource record table.
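A minimal sketch of what the creating module does when an execution node starts a job might look like this; the record fields are hypothetical:

```python
import time

def record_resources(resource_table, job_id, demand, node):
    """Create the resource-record entry for a job when its execution
    node starts running it, and register it in the record table."""
    entry = {
        "node": node,
        "cpus": demand.get("cpus", 0),
        "memory_mb": demand.get("memory_mb", 0),
        "dcus": demand.get("dcus", 0),
        "start_time": time.time(),
        "status": "running",
    }
    resource_table[job_id] = entry
    return entry
```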
In yet another embodiment of the present invention, as shown in FIG. 9, the apparatus further comprises an update module 76.
An update module 76 for:
receiving a cancel command for canceling the job task in the process of executing the job task by the execution node;
determining resource use information of the job task according to the cancel command;
and updating the resource record table according to the resource use information.
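The cancel-handling steps of the update module could be sketched as follows, assuming a running record with `start_time` and `dcus` fields already exists; the `dcu_seconds` usage metric and all field names are illustrative assumptions:

```python
import time

def cancel_job(resource_table, job_id):
    """Handle a cancel command: mark the record cancelled and compute
    the resources actually consumed (elapsed time x allocated DCUs)."""
    rec = resource_table.get(job_id)
    if rec is None or rec["status"] != "running":
        return None  # nothing to cancel
    rec["status"] = "cancelled"
    rec["end_time"] = time.time()
    rec["dcu_seconds"] = (rec["end_time"] - rec["start_time"]) * rec["dcus"]
    return rec
```

Updating the record on cancellation, rather than discarding it, is what lets later accounting reflect the resources a cancelled job actually held.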
In yet another embodiment of the present invention, as shown in FIG. 10, the apparatus further comprises a storage module 77.
A storage module 77, configured to determine whether execution of the job task is completed, and if so, to save the job task to the historical task table.
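The completion check and the move into the historical task table could be sketched as follows, with hypothetical field names:

```python
def finalize_job(resource_table, history_table, job_id):
    """If the job has finished executing, move its record into the
    historical task table and release it from the live record table."""
    rec = resource_table.get(job_id)
    if rec is None or rec.get("status") != "finished":
        return False
    history_table.append({"job_id": job_id, **rec})
    del resource_table[job_id]  # resources released
    return True
```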
It should be understood that the resource scheduling apparatus based on DCU in this embodiment is consistent with the description of the resource scheduling method based on DCU in the embodiment of the first aspect, and is not described herein again.
The DCU-based resource scheduling apparatus receives a plurality of deep learning application job tasks, determines the execution order and execution nodes of the job tasks according to the resource requirement information, job task type, and available-resource usage state information, and then schedules the job tasks to the corresponding execution nodes in that order so that the execution nodes run the job tasks of the corresponding types. This realizes resource scheduling that supports the DCU, with flexible application and rich functionality.
In order to implement the above embodiments, the present invention further provides a computer device.
The computer device includes a memory, a processor, and a computer program stored on the memory and executable on the processor; when the processor executes the computer program, the DCU-based resource scheduling method of the embodiment of the first aspect is implemented.
In order to implement the above embodiments, the present invention also provides a non-transitory computer-readable storage medium.
The non-transitory computer-readable storage medium has stored thereon a computer program, which when executed by a processor implements a DCU-based resource scheduling method as an embodiment of the first aspect.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It should be noted that in the description of the present specification, reference to the description of the term "one embodiment", "some embodiments", "an example", "a specific example", or "some examples", etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.

Claims (10)

1. A resource scheduling method based on DCU is characterized by comprising the following steps:
receiving a plurality of deep learning application job tasks, wherein the job tasks comprise resource demand information and job task types;
acquiring available resource use state information under a DCU development environment of a deep learning accelerator;
determining an execution sequence and an execution node of the job task according to the resource demand information, the job task type and the available resource use state information;
and scheduling the job tasks to corresponding execution nodes according to the execution sequence so that the execution nodes execute the job tasks of the corresponding types.
2. The method of claim 1, wherein determining an execution order and execution nodes for the job tasks based on the resource demand information, the job task types, and the available resource usage status information comprises:
the job task also comprises a user job priority, and the job task type has a corresponding scheduling type priority;
determining an actual job priority corresponding to the job task according to the user job priority, the resource demand information and the scheduling type priority;
determining the execution sequence of the job tasks according to the actual job priority; and
and determining the execution node with the same type as the job task.
3. The method of claim 1, wherein while the executing node executes the job task, further comprising:
and creating resource information corresponding to the job task, and recording the resource information to a resource record table.
4. The method of claim 1, further comprising:
receiving a cancel command for canceling the job task in the process of executing the job task by the execution node;
determining resource use information of the job task according to the cancel command;
and updating a resource record table according to the resource use information.
5. The method of claim 1, further comprising:
judging whether the execution of the job task is completed;
and if the execution is finished, saving the job task into a historical task table.
6. The method of claim 1, wherein the deep learning application comprises one or more of Caffe, TensorFlow, PyTorch, and Keras.
7. The method of claim 1, wherein the resource requirement information includes one or more of a number of CPUs, a memory size, a number of DCUs.
8. A resource scheduling apparatus based on DCU, comprising:
the system comprises a receiving module, a processing module and a processing module, wherein the receiving module is used for receiving a plurality of deep learning application job tasks, and the job tasks comprise resource demand information and job task types;
the acquisition module is used for acquiring available resource use state information under the DCU development environment of the deep learning accelerator;
the determining module is used for determining the execution sequence and the execution node of the job task according to the resource demand information, the job task type and the available resource use state information;
and the scheduling module is used for scheduling the job tasks to corresponding execution nodes according to the execution sequence so as to enable the execution nodes to execute the job tasks of the corresponding types.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the DCU-based resource scheduling method according to any one of claims 1 to 7.
10. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the DCU-based resource scheduling method according to any one of claims 1 to 7.
CN202011381447.XA 2020-12-01 2020-12-01 Resource scheduling method and device based on DCU and computer equipment Pending CN112612600A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011381447.XA CN112612600A (en) 2020-12-01 2020-12-01 Resource scheduling method and device based on DCU and computer equipment


Publications (1)

Publication Number Publication Date
CN112612600A true CN112612600A (en) 2021-04-06

Family

ID=75229814

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011381447.XA Pending CN112612600A (en) 2020-12-01 2020-12-01 Resource scheduling method and device based on DCU and computer equipment

Country Status (1)

Country Link
CN (1) CN112612600A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115102851A (en) * 2022-08-26 2022-09-23 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Fusion platform for HPC and AI fusion calculation and resource management method thereof
CN115865693A (en) * 2022-11-14 2023-03-28 华南理工大学 Kubernetes scheduling method and system for edge computing

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102567086A (en) * 2010-12-30 2012-07-11 中国移动通信集团公司 Task scheduling method, equipment and system
CN111552550A (en) * 2020-04-26 2020-08-18 星环信息科技(上海)有限公司 Task scheduling method, device and medium based on GPU (graphics processing Unit) resources
CN111611087A (en) * 2020-06-30 2020-09-01 中国人民解放军国防科技大学 Resource scheduling method, device and system



Similar Documents

Publication Publication Date Title
US20210406079A1 (en) Persistent Non-Homogeneous Worker Pools
CN112256423B (en) System, apparatus and process for dynamic tenant architecture adjustment in a distributed resource management system
CN102835068B (en) Method and apparatus for managing reallocation of system resources
US9342364B2 (en) Workflow managed composite applications
US9465663B2 (en) Allocating resources in a compute farm to increase resource utilization by using a priority-based allocation layer to allocate job slots to projects
CN113110938B (en) Resource allocation method and device, computer equipment and storage medium
WO2023045467A1 (en) Container cpu resource scheduling and isolation method and apparatus, and storage medium and electronic device
CN111682973B (en) Method and system for arranging edge cloud
US7721289B2 (en) System and method for dynamic allocation of computers in response to requests
WO2010066547A2 (en) Shared resource service provisioning using a virtual machine manager
US20090013321A1 (en) Managing virtual computers
CN110838939B (en) Scheduling method based on lightweight container and edge Internet of things management platform
US20080263561A1 (en) Information processing apparatus, computer and resource allocation method
CN111464659A (en) Node scheduling method, node pre-selection processing method, device, equipment and medium
CN112052068A (en) Method and device for binding CPU (central processing unit) of Kubernetes container platform
CN101110932A (en) Image processing device
CN112612600A (en) Resource scheduling method and device based on DCU and computer equipment
EP4177751A1 (en) Resource scheduling method, resource scheduling system, and device
CN110287022A (en) A kind of scheduling node selection method, device, storage medium and server
US8555286B2 (en) Method, system, and apparatus for establishing a software configurable computing environment
CN108616424A (en) A kind of resource regulating method, computer equipment and system
CN114327881A (en) Task scheduling method and device
CN114490062A (en) Local disk scheduling method and device, electronic equipment and storage medium
CN112114958A (en) Resource isolation method, distributed platform, computer device, and storage medium
CN114816272B (en) Magnetic disk management system under Kubernetes environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220425

Address after: No. 33, Qiuyun Road, Qiaolin street, Pukou District, Nanjing City, Jiangsu Province, 211805

Applicant after: Zhongke Shuguang (Nanjing) Computing Technology Co.,Ltd.

Address before: 100193 No.36 Zhongguancun Software Park, No.8 Dongbeiwang West Road, Haidian District, Beijing

Applicant before: Dawning Information Industry (Beijing) Co.,Ltd.

Applicant before: DAWNING INFORMATION INDUSTRY Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20210406
