CN114237869B - Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment - Google Patents


Info

Publication number
CN114237869B
Authority
CN
China
Prior art keywords
resource
task
resource node
preset
task queue
Prior art date
Legal status
Active
Application number
CN202111362677.6A
Other languages
Chinese (zh)
Other versions
CN114237869A (en)
Inventor
刘逊韵
张拥军
管延霞
徐新海
刘运韬
李渊
Current Assignee
National Defense Technology Innovation Institute PLA Academy of Military Science
Original Assignee
National Defense Technology Innovation Institute PLA Academy of Military Science
Priority date
Filing date
Publication date
Application filed by National Defense Technology Innovation Institute PLA Academy of Military Science
Priority to CN202111362677.6A
Publication of CN114237869A
Application granted
Publication of CN114237869B

Classifications

    • G06F9/5038 Allocation of resources, e.g. of the central processing unit [CPU], to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals, considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
    • G06F9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues

Abstract

The invention provides a Ray double-layer scheduling method and device based on reinforcement learning, and electronic equipment. The Ray double-layer scheduling method based on reinforcement learning comprises the following steps: acquiring a cluster task queue, resource node cluster information and resource node cluster task queue information, and determining a target decision action based on a preset Ray double-layer scheduling model, wherein the preset Ray double-layer scheduling model determines the target decision action after performing reinforcement learning based on the resource node cluster information and the resource node cluster task queue information; and scheduling the tasks to be scheduled in the cluster task queue to the correspondingly allocated resource nodes based on the target decision action. The method of the invention enables the target decision action to be determined through autonomous learning, so that the determined target decision action is more reasonable and accurate, and the resource utilization rate of each resource node is effectively improved.

Description

Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment
Technical Field
The invention relates to the technical field of computers, in particular to a Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment.
Background
At present, with the diversified development of games, the resource consumption of computers is rapidly increasing while the overall computing power remains limited. Therefore, how to improve the performance of computers and their overall computing power has become a popular research direction.
In the related art, a static resource scheduling method is used to schedule resources in a computer: when a scheduling task is received, the task to be scheduled is submitted to a local scheduler; if the local resources can meet the resource requirement of the task, resource scheduling is completed directly; otherwise, a computer meeting the resource requirement of the task is randomly selected to execute the scheduling.
However, since the global scheduling operation in the related art is determined by a random scheduling decision, this randomness may reduce the resource utilization and overall computing power of the computer executing the task, resulting in low resource scheduling efficiency.
Disclosure of Invention
The invention provides a Ray double-layer scheduling method and device based on reinforcement learning, and electronic equipment, which are used to overcome the defect in the related art that resource scheduling efficiency is low because the global scheduling operation is determined randomly, and to achieve the purpose of improving resource scheduling efficiency by performing dynamic scheduling based on Ray and reinforcement learning.
The invention provides a Ray double-layer scheduling method based on reinforcement learning, which comprises the following steps:
acquiring a cluster task queue, wherein the cluster task queue comprises a queue representing the dependency relationship among tasks in each task of a target application;
acquiring resource node cluster information, wherein the resource node cluster information comprises the respective quantity and resource types of used resource nodes and unused resource nodes in each resource node and the resource utilization rate of the used resource nodes;
acquiring resource node cluster task queue information, wherein the resource node cluster task queue information comprises the total number of tasks in a cluster task queue, task waiting time, task resource requirements and task estimated running time;
determining a target decision action based on a preset Ray double-layer scheduling model; the preset Ray double-layer scheduling model comprises a target decision action determined after reinforcement learning is carried out on the basis of the resource node cluster information and the resource node cluster task queue information;
and scheduling the tasks to be scheduled in the cluster task queue to the corresponding distributed resource nodes based on the target decision action.
According to the invention, before the step of determining the target decision action based on the preset Ray double-layer scheduling model, the method further comprises the following steps:
acquiring the resource demand of a task to be scheduled in the cluster task queue, the current task queue length of a local resource node and the current resource use condition of the local resource node;
and calling a preset Ray double-layer scheduling model according to the resource demand, the current task queue length and the current resource use condition.
According to the reinforcement learning-based Ray double-layer scheduling method provided by the invention, the preset Ray double-layer scheduling model comprises a preset local scheduling submodel and a preset global scheduling submodel, and the step of calling the preset Ray double-layer scheduling model according to the resource demand, the current task queue length and the current resource use condition comprises the following steps:
judging whether the length of the current task queue reaches a preset queue length threshold value or not and judging whether the local resource node can meet the resource demand or not based on the current resource use condition;
if it is determined that the length of the current task queue does not reach the preset queue length threshold value and the local resource node meets the resource demand, scheduling by using the preset local scheduling submodel;
and if the length of the current task queue reaches a preset queue length threshold value or the local resource node cannot meet the resource demand, calling the preset global scheduling submodel to determine the target decision-making action after reinforcement learning is performed based on the resource node cluster information and the resource node cluster task queue information.
According to the Ray double-layer scheduling method based on reinforcement learning provided by the invention, after the step of scheduling the tasks to be scheduled in the cluster task queue to the corresponding allocated resource nodes based on the target decision action, the method further comprises the following steps:
acquiring a time structure body, wherein the time structure body comprises a generation time, a starting execution time and a waiting time of a task to be executed in each resource node;
respectively updating the resource node cluster information and the resource node cluster task queue information according to the time structure body to obtain new resource node cluster information and new resource node cluster task queue information;
re-executing, according to the new resource node cluster information and the new resource node cluster task queue information, the step of determining a target decision action based on the preset Ray double-layer scheduling model;
ending the scheduling process when all tasks in the task graph of the target application have been executed; the task graph comprises a graph generated according to the dependency relationships among the tasks.
According to the invention, the method for determining the target decision action based on the preset Ray double-layer scheduling model comprises the following steps:
determining a decision action space, wherein the decision action space comprises a space formed by all decision actions related to operations corresponding to target task instructions, and the target task instructions comprise instructions for generating each task;
and determining a target decision action from the decision action space based on the resource node cluster information, the resource node cluster task queue information and a preset reward function in the preset Ray double-layer scheduling model.
According to the reinforcement learning-based Ray double-layer scheduling method provided by the invention, the step of determining a target decision action from the decision action space based on the resource node cluster information, the resource node cluster task queue information and the preset reward function in the preset Ray double-layer scheduling model comprises the following steps:
performing quality judgment on the decision action in the decision action space by using the resource node cluster information and the resource node cluster task queue information and combining a preset reward function in the preset Ray double-layer scheduling model to obtain a quality judgment result after reinforcement learning;
if the quality judgment result after reinforcement learning reaches a preset reinforcement learning termination condition, determining the decision action as the target decision action;
if the quality judgment result after reinforcement learning does not reach the preset reinforcement learning termination condition, selecting a new decision action from the decision action space, and then returning to the step of performing quality judgment on the decision action in the decision action space by using the resource node cluster information and the resource node cluster task queue information in combination with the preset reward function, until the target decision action is obtained.
The invention also provides a Ray double-layer scheduling device based on reinforcement learning, which comprises:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a cluster task queue, and the cluster task queue comprises a queue for representing the dependency relationship among tasks in each task of a target application;
a second obtaining module, configured to obtain resource node cluster information, where the resource node cluster information includes respective numbers and resource types of used resource nodes and unused resource nodes in each resource node, and a resource utilization rate of the used resource nodes;
the third acquisition module is used for acquiring resource node cluster task queue information, wherein the resource node cluster task queue information comprises the total number of tasks in the cluster task queue, task waiting time, task resource requirements and task estimated running time;
the determining module is used for determining a target decision action based on a preset Ray double-layer scheduling model; the preset Ray double-layer scheduling model comprises a target decision action determined after reinforcement learning is carried out on the basis of the resource node cluster information and the resource node cluster task queue information;
and the scheduling module is used for scheduling the tasks to be scheduled in the cluster task queue to the corresponding allocated resource nodes based on the target decision action.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of any one of the above-mentioned reinforcement learning-based Ray double-layer scheduling methods.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the reinforcement learning based Ray double-layer scheduling method as described in any of the above.
The present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of any of the reinforcement learning-based Ray double-layer scheduling methods described above.
According to the reinforcement learning-based Ray double-layer scheduling method of the invention, the purpose of quickly and efficiently scheduling the tasks to be scheduled in the cluster task queue is achieved by determining a target decision action based on the acquired cluster task queue, the resource node cluster information, the resource node cluster task queue information and a preset Ray double-layer scheduling model. Because the cluster task queue comprises a queue representing the dependency relationships among the tasks of the target application, the resource node cluster task queue information comprises the total number of tasks in the cluster task queue, the task waiting time, the task resource requirements and the task estimated running time, and the resource node cluster information comprises the respective numbers and resource types of used and unused resource nodes and the resource utilization rate of the used resource nodes, the dynamic use condition of resources in each resource node and the dynamic scheduling condition of each task can be taken into account when the target decision action is determined by the reinforcement learning scheduling method in the preset Ray double-layer scheduling model. The target decision action can thus be determined through autonomous learning, so that the determined target decision action is more reasonable and accurate, and the resource utilization rate of each resource node is effectively improved.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a Ray double-layer scheduling method based on reinforcement learning according to the present invention;
FIG. 2 is a schematic diagram of a task graph of a target application provided by the present invention;
FIG. 3 is a schematic structural diagram of a Ray double-layer scheduling apparatus based on reinforcement learning according to the present invention;
fig. 4 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
At present, with the diversified development of games, the resource consumption of computers is rapidly increasing while the overall computing power remains limited. Therefore, how to improve the performance of computers and their overall computing power has become a popular research direction.
Reinforcement learning is a general term for a class of machine learning algorithms that mainly study how an agent takes actions based on reward feedback from the current environment so as to maximize the expected benefit. With deepening research on machine learning algorithms, reinforcement learning has made major breakthroughs in many fields such as decision-making games. However, application problems are becoming increasingly complex and higher accuracy is pursued, so the resource demand increases and a large amount of data needs to be collected, which raises the difficulty of model training and leads to a rapid increase in the consumption of computing resources. Moreover, the progress of computer process technology has slowed, the performance of a single computer improves only gradually and cannot meet the performance requirements of large-scale reinforcement learning model training. In order to further improve the overall computing power of the system, a horizontal resource expansion mode needs to be adopted, that is, the rapidly growing performance requirements of reinforcement learning training are met through distributed computing on a distributed cluster.
As a distributed system, how to effectively manage resources of a large-scale data center is one of key technologies affecting performance of a distributed reinforcement learning system. In the prior art, a static resource scheduling method is used for scheduling resources in a computer, when a scheduling task is received, the task to be scheduled is submitted to a local scheduler, if the local resources can meet the resource requirement of the task, the resource scheduling is directly completed, otherwise, a computer meeting the resource requirement of the task is randomly selected to execute the scheduling. That is, the traditional static resource scheduling management mode lacks sufficient flexibility, and a user needs to predict the amount of resources required in training according to previous experience to make a resource management decision, so that the real-time resource requirement of a task cannot be well met, and the labor cost in the aspect of operation and maintenance is increased.
In fact, the actual consumption of resources in the training process of distributed reinforcement learning is affected by many factors, such as the target of the training task and the specific type and parameters of the adopted algorithm, so it is difficult to reasonably predict the resource consumption of the actual model training process and allocate resources before the system runs. With the development of the field of artificial intelligence, different learning models have been proposed for solving the cluster resource scheduling problem, for example the DeepRM model and the Decima model. The DeepRM model models the task information in resource scheduling as image information, extracts features from the image information through a convolutional neural network, and iteratively updates the network parameters through a reinforcement learning method to form the final decision model. The Decima model uses a scalable graph neural network to extract features of the scheduling tasks in cluster scheduling, so as to handle tasks of arbitrary size with dependency relationships and solve the resource scheduling problem of dependent tasks. Such models, together with a double-layer deep reinforcement learning model, jointly realize the management of resources.
However, the existing resource management mechanisms are not suitable for a distributed reinforcement learning framework whose resource demand changes dynamically in real time during training. Taking advantage of the fact that reinforcement learning is suitable for solving sequential decision problems, the invention combines a reinforcement learning method with the resource scheduling problem in a distributed reinforcement learning framework, and at the same time achieves the goal of optimizing resource scheduling through the design of a reward function. Autonomous learning of the scheduling strategy is realized, and the obtained scheduling strategy can effectively improve resource utilization and reduce task response time.
Based on the above problems, the invention provides a Ray double-layer scheduling method based on reinforcement learning, which can be applied to a scene in which a plurality of resource nodes are horizontally expanded, wherein each resource node may be internally provided with a task generator, an environment feedback device and a resource scheduler, and all the resource nodes form a cluster; and the execution subject of the reinforcement learning-based Ray double-layer scheduling method may be the resource scheduler in any resource node in the cluster. Alternatively, the resource node may be an electronic device such as a Personal Computer (PC), a portable device, a notebook computer, a smart phone, a tablet computer or a portable wearable device. The present invention does not limit the specific form of the resource node.
It should be noted that the execution subject of the following method embodiments may be part or all of the resource node. The following method embodiments take the execution subject as a resource scheduler in a local resource node as an example for explanation, where the local resource node is any one of all resource nodes.
Fig. 1 is a schematic flow diagram of a Ray double-layer scheduling method based on reinforcement learning, as shown in fig. 1, the Ray double-layer scheduling method based on reinforcement learning includes the following steps:
step 110, obtaining a cluster task queue, where the cluster task queue includes a queue representing a dependency relationship between tasks in each task of the target application.
The target application can be a game-type application, and the target task instruction can be an instruction generated when the target application receives a starting operation such as a click or touch from a user; the dependency relationship between tasks can indicate that a task in the task graph has a preceding task and/or a subsequent task; and each task entering the cluster task queue may be a task to be scheduled.
Specifically, an agent may or may not be disposed in the local resource node. When an agent is disposed in the local resource node, the target task instruction of the target application may be received by the agent and then forwarded to the task generator; when no agent is disposed, the target task instruction of the target application may be received directly by the task generator in the local resource node. Further, when the task generator performs the task decomposition operation on the target application according to the received target task instruction, it may simultaneously generate the resource demand of each decomposed task, so as to obtain each task of the target application and the resource demand amount of each task. The task generator then generates a task graph according to the dependency relationships among the tasks and, while generating the task graph, sends the tasks to the cluster in sequence according to those dependency relationships, so as to obtain the cluster task queue; the obtained cluster task queue may be sent to the resource scheduler, and each task of the target application, the resource demand of each task and the task graph are sent to the environment feedback device.
It should be noted that when the task generator in the local resource node obtains each task of the target application and the resource demand of each task, it may check the resource demand of each task: if the resource demand of a certain task exceeds the maximum cluster resource supply, the task is stored in the task queue and prompt information is sent to the user; on the contrary, if the resource demand of every task of the target application does not exceed the maximum cluster resource supply, a task graph is further generated according to the dependency relationships among the tasks. The maximum cluster resource supply represents the resource amount of the resource node with the largest resource amount among all the resource nodes. The resource demand of each task may refer to the resources of the resource node required when the corresponding task is executed; for example, a resource demand of <CPU:1, GPU:1> indicates that running the task requires 1 Central Processing Unit (CPU) and 1 Graphics Processing Unit (GPU). In addition, the order in which the task generator sends tasks to the cluster and the formation of the cluster task queue are determined according to the dependency relationships among the tasks.
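As an illustration only, the following minimal Python sketch shows how a task generator might compare a task's resource demand with the maximum cluster resource supply before admitting it to the task graph; the names ResourceDemand, max_cluster_supply and the field layout are assumptions made for illustration, not the patent's implementation.

from dataclasses import dataclass

@dataclass
class ResourceDemand:
    cpu: int  # number of CPUs the task needs, e.g. <CPU:1, GPU:1>
    gpu: int  # number of GPUs the task needs

def admit_task(demand: ResourceDemand, max_cluster_supply: ResourceDemand) -> bool:
    """Return True if the demand fits the largest resource node; otherwise the
    task is held back and the user is notified."""
    ok = demand.cpu <= max_cluster_supply.cpu and demand.gpu <= max_cluster_supply.gpu
    if not ok:
        print(f"task demand {demand} exceeds the maximum cluster resource supply")
    return ok

# usage: a task demanding <CPU:1, GPU:1> against a node offering <CPU:8, GPU:2>
print(admit_task(ResourceDemand(cpu=1, gpu=1), ResourceDemand(cpu=8, gpu=2)))  # True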
In addition, for the generated task graph, the task generator updates the state of each task in the task graph according to a preset time interval, where the preset time interval may be set in advance manually, for example, 10 milliseconds or 5 milliseconds, and the task generator updates the state of each task in the task graph in each preset time interval. For example, if a task is still in a to-be-scheduled state in the current time interval, yellow may be used to indicate that the current state of the task is still in a ready-to-execute state to be scheduled or not completely executed; if another task is in the executed state in the current time interval, the color of the task can be updated to green, so that the task is in the executed state; the status of the corresponding task in the current time interval can also be directly characterized by ready and finished (finished) letter identifiers.
Step 120, resource node cluster information is obtained, where the resource node cluster information includes the respective numbers and resource types of used resource nodes and unused resource nodes in each resource node, and the resource utilization rate of the used resource nodes.
Specifically, the resource scheduler in the local resource node may obtain the resource node cluster information through the environment feedback device, that is, the environment feedback device may obtain the resource node cluster information first, and the resource node cluster information may include the number of used resource nodes and the resource type information in all the resource nodes, the number of idle resource nodes and the resource type information, and the resource utilization rate of the used resource nodes. For example, the number of used resource nodes in 2 resource nodes is 1, the resource type information includes the CPU in the resource node, the number of unused resource nodes is 1, the resource type information includes the GPU in the resource node, and the resource utilization rate is further determined according to the resource usage of 1 used resource node. Then, the environment feedback device sends the obtained resource node cluster information to the resource scheduler.
Step 130, acquiring resource node cluster task queue information, wherein the resource node cluster task queue information includes the total number of tasks in the cluster task queue, task waiting time, task resource requirements and task estimated running time.
Specifically, the resource scheduler in the local resource node may obtain the resource node cluster task queue information through the environment feedback device. That is, the environment feedback device may receive each task of the target application, the resource demand amount of each task and the task graph from the task generator while obtaining the cluster task queue from the cluster, determine, based on the cluster task queue, the resource node cluster task queue information including the total number of tasks in the cluster task queue, the task waiting time, the task resource requirements and the task estimated running time, and then send the resource node cluster task queue information to the resource scheduler, so that the resource node can obtain the resource node cluster task queue information through the environment feedback device. Further, the environment feedback device may also reflect the overall operation state of the cluster based on the acquired resource node cluster information and resource node cluster task queue information, that is, the overall operation state of the cluster represents the resource node cluster information and the resource node cluster task queue information, and send the overall operation state of the cluster to the resource node.
It should be noted that step 120 and step 130 may be executed simultaneously, or step 120 and step 130 may be executed first, or step 130 and step 120 may be executed first. Step 120 and step 130 are preferably performed simultaneously.
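To make the two state inputs of steps 120 and 130 concrete, the following hypothetical Python sketch groups the fields named above into two simple structures; the field names and types are illustrative assumptions rather than the patent's data layout.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ResourceNodeClusterInfo:
    used_node_count: int                 # number of used resource nodes
    unused_node_count: int               # number of unused (idle) resource nodes
    used_node_types: List[str]           # resource types of used nodes, e.g. ["CPU"]
    unused_node_types: List[str]         # resource types of idle nodes, e.g. ["GPU"]
    used_node_utilization: List[float]   # resource utilization of each used node

@dataclass
class ClusterTaskQueueInfo:
    total_tasks: int                     # total number of tasks in the cluster task queue
    wait_times: List[float]              # per-task waiting time
    resource_demands: List[Dict[str, int]]  # per-task resource demand, e.g. {"CPU": 1, "GPU": 1}
    estimated_runtimes: List[float]      # per-task estimated running time

# Example matching the description: 2 nodes, one used CPU node at 60% utilization, one idle GPU node.
state = ResourceNodeClusterInfo(used_node_count=1, unused_node_count=1,
                                used_node_types=["CPU"], unused_node_types=["GPU"],
                                used_node_utilization=[0.6])
queue = ClusterTaskQueueInfo(total_tasks=4, wait_times=[0.2, 0.1, 0.0, 0.0],
                             resource_demands=[{"CPU": 1, "GPU": 1}] * 4,
                             estimated_runtimes=[1.0, 2.0, 1.5, 0.5])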
Step 140, determining a target decision action based on a preset Ray double-layer scheduling model; the preset Ray double-layer scheduling model comprises the step of determining the target decision action after performing reinforcement learning based on the resource node cluster information and the resource node cluster task queue information.
Specifically, the local resource node can further determine a target decision action based on the received overall running state of the cluster, the resource node cluster information and the resource node cluster task queue information by using the preset Ray double-layer scheduling model. The preset Ray double-layer scheduling model does not merely adopt the double-layer resource scheduling method of the existing Ray framework, but improves on it; that is, the preset Ray double-layer scheduling model is formed by combining reinforcement learning with the double-layer scheduling in the existing Ray framework, so that the determined target decision action is more reasonable and accurate.
It should be noted that the existing Ray framework refers to a new distributed framework developed by the UC Berkeley project group, and the present invention is an improvement on the basis of the existing Ray framework, that is, the existing Ray framework is combined with reinforcement learning to solve the problem of cluster resource scheduling.
And 150, scheduling the tasks to be scheduled in the cluster task queue to the corresponding allocated resource nodes based on the target decision action.
The target decision action may be used to indicate that the task to be scheduled is allocated to the corresponding resource node, and the task to be scheduled may be a task in the cluster task queue.
Specifically, when the resource scheduler in the local resource node determines the target decision action based on the preset Ray double-layer scheduling model, the scheduling operation for the task to be scheduled can be executed based on the target decision action. Moreover, the tasks to be scheduled may also be determined according to the dependency relationships between the tasks; for example, for the task graph shown in Fig. 2, the tasks to be scheduled are the four mutually independent tasks Task11, Task12, Task13 and Task14, which have no preceding tasks, and when one of these four tasks is completed, its subsequent task may become ready, that is, ready to be scheduled.
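The readiness rule just described can be illustrated with a small Python sketch; the dictionary encoding of the task graph is an assumption chosen to resemble Fig. 2.

from typing import Dict, List, Set

def ready_tasks(predecessors: Dict[str, List[str]], finished: Set[str]) -> List[str]:
    """A task is ready when it has not finished and all of its preceding tasks have finished."""
    return [task for task, preds in predecessors.items()
            if task not in finished and all(p in finished for p in preds)]

# Example resembling Fig. 2: Task11..Task14 have no predecessors, Task21 depends on Task12.
graph = {"Task11": [], "Task12": [], "Task13": [], "Task14": [], "Task21": ["Task12"]}
print(ready_tasks(graph, finished=set()))        # ['Task11', 'Task12', 'Task13', 'Task14']
print(ready_tasks(graph, finished={"Task12"}))   # ['Task11', 'Task13', 'Task14', 'Task21']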
According to the reinforcement learning-based Ray double-layer scheduling method of the invention, the purpose of quickly and efficiently scheduling the tasks to be scheduled in the cluster task queue is achieved by determining a target decision action based on the acquired cluster task queue, the resource node cluster information, the resource node cluster task queue information and the preset Ray double-layer scheduling model. Because the cluster task queue comprises a queue representing the dependency relationships among the tasks of the target application, the resource node cluster task queue information comprises the total number of tasks in the cluster task queue, the task waiting time, the task resource requirements and the task estimated running time, and the resource node cluster information comprises the respective numbers and resource types of used and unused resource nodes and the resource utilization rate of the used resource nodes, the dynamic use condition of resources in each resource node and the dynamic scheduling condition of each task can be taken into account when the target decision action is determined by the reinforcement learning scheduling method in the preset Ray double-layer scheduling model. The target decision action can thus be determined through autonomous learning, so that the determined target decision action is more reasonable and accurate, and the resource utilization rate of each resource node is effectively improved.
Optionally, before performing step 140, the method further includes:
acquiring the resource demand of a task to be scheduled in the cluster task queue, the current task queue length of a local resource node and the current resource use condition of the local resource node; and calling a preset Ray double-layer scheduling model according to the resource demand, the current task queue length and the current resource use condition.
Specifically, in order to ensure flexibility and high efficiency of scheduling, a resource scheduler in a local resource node may first obtain, for received resource node cluster information and resource node cluster task queue information, a resource demand of a task to be scheduled in a cluster task queue, a current task queue length of the local resource node, and a current resource usage of the local resource node, so as to determine whether to invoke a preset Ray double-layer scheduling model or determine which scheduling method is used when invoking the preset Ray double-layer scheduling model, in combination with the resource demand of the task to be scheduled, the current task queue length of the local resource node, and the current resource usage of the local resource node, thereby achieving a purpose of flexibly using the preset Ray double-layer scheduling model.
According to the Ray double-layer scheduling method based on reinforcement learning, the mode of calling the preset Ray double-layer scheduling model is determined according to the resource demand of the task to be scheduled in the cluster task queue, the current task queue length of the local resource node and the current resource using condition of the local resource node, the purpose of improving the using efficiency and using flexibility of the preset Ray double-layer scheduling model is achieved, and a powerful basis is provided for the follow-up determination of target decision-making actions.
Optionally, when the preset Ray double-layer scheduling model includes a preset local scheduling submodel and a preset global scheduling submodel, the calling the preset Ray double-layer scheduling model according to the resource demand, the current task queue length, and the current resource usage, includes:
judging whether the length of the current task queue reaches a preset queue length threshold value and judging, based on the current resource use condition, whether the local resource node can meet the resource demand; if the length of the current task queue does not reach the preset queue length threshold value and the local resource node meets the resource demand, scheduling by using the preset local scheduling submodel; and if the length of the current task queue reaches the preset queue length threshold value or the local resource node cannot meet the resource demand, calling the preset global scheduling submodel to determine the target decision action after reinforcement learning is performed based on the resource node cluster information and the resource node cluster task queue information.
Specifically, the resource scheduler in the local resource node compares the current task queue length of the local resource node with the preset queue length threshold, analyzes the current resource usage of the local resource node against the resource demand of the task to be scheduled in the cluster task queue, and judges whether the current queue length reaches the preset queue length threshold and whether the local resource node can meet the resource demand of the task to be scheduled. When it is determined that the current queue length does not reach the preset queue length threshold and the local resource node can meet the resource demand of the task to be scheduled, the local resource node can be considered capable of running the task to be scheduled; at this time, the preset local scheduling submodel can be used to perform local scheduling directly, that is, scheduling is performed directly on the resource node where the resource scheduler is located. On the contrary, when it is determined that the current queue length reaches the preset queue length threshold or the local resource node cannot meet the resource demand of the task to be scheduled, the local resource node may be considered unable to run the task to be scheduled; at this time, the preset global scheduling submodel may be used, so that the target decision action is determined after reinforcement learning is performed based on the resource node cluster information and the resource node cluster task queue information.
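A minimal Python sketch of this two-layer dispatch rule follows, assuming simple numeric resource counts; the function name and parameters are illustrative assumptions, and the "global" branch stands in for a call to the reinforcement learning submodel.

def choose_scheduler(queue_len: int, queue_threshold: int,
                     free_cpu: int, free_gpu: int,
                     demand_cpu: int, demand_gpu: int) -> str:
    """Local scheduling only when the local queue is below the threshold AND
    the local node can satisfy the task's resource demand; otherwise global."""
    local_ok = (queue_len < queue_threshold
                and free_cpu >= demand_cpu
                and free_gpu >= demand_gpu)
    return "local" if local_ok else "global"

# The queue is short and the node has enough resources -> schedule locally.
print(choose_scheduler(queue_len=2, queue_threshold=5,
                       free_cpu=4, free_gpu=1, demand_cpu=1, demand_gpu=1))  # local
# The queue has reached the threshold -> fall back to the global RL submodel.
print(choose_scheduler(queue_len=5, queue_threshold=5,
                       free_cpu=4, free_gpu=1, demand_cpu=1, demand_gpu=1))  # global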
It should be noted that, for the preset local scheduling submodel and the preset global scheduling submodel included in the preset Ray double-layer scheduling model, the preset local scheduling submodel may be determined based on the existing double-layer resource scheduling method in the Ray framework, that is, it may be designed based on the local scheduling algorithm of the existing double-layer resource scheduling method, while the preset global scheduling submodel may be designed by replacing the central scheduling algorithm in the existing double-layer resource scheduling method with a reinforcement learning algorithm, so that the target decision action can be obtained through autonomous learning in the actual scheduling process. In this way, the preset Ray double-layer scheduling model can combine the advantages of both double-layer resource scheduling and reinforcement learning resource scheduling.
According to the Ray double-layer scheduling method based on reinforcement learning, the purpose of determining whether the task to be scheduled is directly scheduled locally or is scheduled by using a global scheduling method containing reinforcement learning is achieved by comparing the current task queue length of the local resource node with the preset queue length threshold and analyzing the current resource use condition of the local resource node and the resource demand of the task to be scheduled in the cluster task queue, not only can local scheduling be achieved, but also the reinforcement learning can be combined to perform global scheduling, and the flexibility and reliability of resource scheduling are greatly improved.
Optionally, after step 150, the method further comprises:
acquiring a time structure body, wherein the time structure body comprises the generation time, the starting execution time and the waiting time of a task to be executed in each resource node; respectively updating the resource node cluster information and the resource node cluster task queue information according to the time structure body to obtain new resource node cluster information and new resource node cluster task queue information; re-executing, according to the new resource node cluster information and the new resource node cluster task queue information, the step of determining a target decision action based on the preset Ray double-layer scheduling model; and ending the scheduling process when all tasks in the task graph of the target application have been executed; the task graph comprises a graph generated according to the dependency relationships among the tasks.
For the tasks to be scheduled, which are scheduled to the corresponding resource nodes, to be queued in the waiting queue Q of the corresponding resource node, that is, each task to be scheduled is updated to a task to be executed when being scheduled to the corresponding resource node, and is queued in the queue of the tasks that have not been executed in the resource node in sequence; the generation time of the task to be executed in each resource node refers to the time of submitting each task to be executed, namely the time of generating the task graph; the starting execution time of the task to be executed refers to the time that the task to be executed is actually executed or starts to run by the scheduled resource node; the waiting time of the task to be executed refers to the time between the starting time when the task to be executed is queued and the time when the task to be executed is started to run.
Specifically, the local resource scheduler may obtain the time structure through a preset structure generation program, where the obtained time structure includes a generation time, a start execution time, and a waiting time of a task to be executed in each resource node, and the preset structure generation program includes:
(The preset structure generation program is reproduced in the original publication as program-listing images.)
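Because the program listing itself is only available as images in the original, the following hypothetical Python sketch of such a time structure is given instead; it records the generation time and starting execution time of a task to be executed and derives the waiting time, and all field names are assumptions.

from dataclasses import dataclass, field
import time

@dataclass
class TaskTimeRecord:
    generate_time: float = field(default_factory=time.time)  # time the task was generated (submitted with the task graph)
    start_time: float = 0.0                                   # time the task actually starts running on its node

    @property
    def wait_time(self) -> float:
        """Waiting time: from entering the node's waiting queue until execution starts."""
        return max(0.0, self.start_time - self.generate_time)

record = TaskTimeRecord()
record.start_time = record.generate_time + 0.5    # e.g. the task started 0.5 s after submission
print(f"waiting time: {record.wait_time:.1f} s")  # waiting time: 0.5 s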
Then, the resource node cluster information and the resource node cluster task queue information are respectively updated according to the time structure body. The updating process of the resource node cluster task queue information comprises the following steps: based on the time structure body, when it is determined that completed tasks exist in the cluster task queue, new tasks to be scheduled can be determined from the tasks that have a dependency relationship with the completed tasks, and the executed tasks and the tasks to be scheduled in the cluster task queue are updated, thereby obtaining a new cluster task queue; new resource node cluster task queue information is then determined according to the total number of tasks, the task waiting time, the task resource requirements and the task estimated running time of each task in the new cluster task queue. That is, each time a scheduling decision is executed, the resource node cluster task queue information is updated accordingly.
In addition, the updating process of the resource node cluster information comprises the following steps: according to the at least one resource node that currently receives a task to be scheduled, the respective numbers and resource types of used resource nodes and unused resource nodes among all the resource nodes are updated, and the resource utilization rate of the used resource nodes is updated at the same time, so as to obtain new resource node cluster information. Alternatively, new resource node cluster information may be acquired from a cluster information monitoring module configured in the resource node, according to how the tasks to be scheduled have been scheduled to the corresponding resource nodes and how they run after being scheduled.
Further, when determining the new resource node cluster information and the new resource node cluster task queue information, returning to step 140 to perform the scheduling again.
It should be noted that, in order to reduce the waiting time of the tasks to be executed in the resource nodes, the resource scheduler in the local resource node may first obtain the time structures of the tasks, so as to timely and accurately determine the completed tasks based on the time changes of each task to be executed in the time structure, and then further update the cluster task queue, so as to determine new tasks to be scheduled from the obtained new cluster task queue. For example, in the task graph shown in Fig. 2, when Task11, Task12, Task13 and Task14 are all tasks to be scheduled and their scheduling is completed, new tasks to be scheduled may be re-determined based on the remaining tasks Task21, Task22, Task23 and Task31, so that each task in the task graph can be scheduled rapidly while reducing task latency as much as possible and improving resource node utilization.
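Putting the update and re-scheduling steps together, the overall control flow can be pictured as the schematic, self-contained Python loop below; the immediate "execution" of ready tasks merely stands in for the actual dispatch via the preset Ray double-layer scheduling model, and the graph encoding is an assumption.

from typing import Dict, List, Set

def schedule_task_graph(predecessors: Dict[str, List[str]]) -> List[str]:
    """Schematic main loop: repeatedly pick ready tasks (steps 140-150), mark them
    finished, update the state, and re-evaluate, until every task in the task
    graph has been executed."""
    finished: Set[str] = set()
    execution_order: List[str] = []
    while len(finished) < len(predecessors):
        ready = [t for t, preds in predecessors.items()
                 if t not in finished and all(p in finished for p in preds)]
        # In the patent, each iteration would call the preset Ray double-layer
        # scheduling model here to decide where the ready tasks run; the sketch
        # simply "executes" them immediately.
        for task in ready:
            execution_order.append(task)
            finished.add(task)
    return execution_order

graph = {"Task11": [], "Task12": [], "Task21": ["Task12"], "Task31": ["Task21"]}
print(schedule_task_graph(graph))  # ['Task11', 'Task12', 'Task21', 'Task31']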
According to the reinforcement learning-based Ray double-layer scheduling method provided by the invention, by acquiring the time structure body comprising the generation time, starting execution time and waiting time of the tasks to be executed in each resource node, and updating the resource node cluster information and the resource node cluster task queue information accordingly, the purpose of rescheduling the other unscheduled tasks in the task graph is achieved, so that each task in the task graph can be scheduled quickly and reasonably, providing a guarantee for subsequently determining the target decision action.
Alternatively, step 140 may be implemented by the following process:
determining a decision action space, wherein the decision action space comprises a space formed by all decision actions related to operations corresponding to target task instructions, and the target task instructions comprise instructions for generating the tasks; and determining a target decision action from the decision action space based on the resource node cluster information, the resource node cluster task queue information and a preset reward function in the preset Ray double-layer scheduling model.
Specifically, assuming that the number of tasks to be scheduled in the task graph is m and the total number of resource nodes in the cluster is n, the decision action space may be expressed as

A = { (a_1, a_2, …, a_m) | a_k ∈ {1, 2, …, n}, k = 1, …, m }

where a_k denotes the resource node to which the k-th task to be scheduled is assigned.
Moreover, the size of the decision action space grows exponentially with the total number of tasks in the task graph (n^m for m tasks and n nodes). In order to effectively reduce the decision action space, its size may be reduced based on the data dependencies between tasks. For example, in the task graph shown in Fig. 2, Task11 and Task12 are independent of each other and may be executed at the same time, and such simultaneous execution further reduces the decision action space; Task12 and Task21 have a data dependency, that is, Task21 can only be scheduled after Task12 has been executed, so it can be seen that the data dependencies between tasks also limit which tasks can currently be executed.
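A hypothetical sketch of how data dependencies shrink the effective decision action space: only actions that assign currently schedulable (dependency-free) tasks to nodes are enumerated. The encoding of an action as a (task, node) pair and all names are assumptions made for illustration.

from itertools import product
from typing import Dict, List, Set, Tuple

def feasible_actions(predecessors: Dict[str, List[str]], finished: Set[str],
                     nodes: List[str]) -> List[Tuple[str, str]]:
    """Enumerate (task, node) decision actions, keeping only tasks whose
    preceding tasks have finished; dependent tasks are masked out."""
    schedulable = [t for t, preds in predecessors.items()
                   if t not in finished and all(p in finished for p in preds)]
    return list(product(schedulable, nodes))

graph = {"Task11": [], "Task12": [], "Task21": ["Task12"]}
# Task21 is masked until Task12 finishes, so only 2 tasks x 2 nodes = 4 actions remain.
print(feasible_actions(graph, finished=set(), nodes=["node1", "node2"]))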
Then, by combining the resource node cluster information, the resource node cluster task queue information and a preset reward function in a preset Ray double-layer scheduling model, a target decision-making action is determined from a decision-making action space, wherein the preset reward function is shown as the following formula:
reward = α·R − β·T    (1)

R = E(R_i), i ∈ I    (2)

R_i = CPU_use_i / CPU_all_i    (3)

T_j^{w-t} = T_j^{finish-submit} / T_j^{finish-start} = (T_j^{finish} − T_j^{submit}) / (T_j^{finish} − T_j^{start})    (4)

T = E(T_j^{w-t}), j ∈ J    (5)

In equations (1) to (5), reward represents the preset reward function, and α and β represent two different normalization factors; R represents the resource node utilization rate, namely the average expectation of the resource utilization rate over the used resource nodes; T represents the average weighted turnaround time of all tasks in the task graph; E(R_i) represents the average expectation of the resource utilization rate R_i of the i-th used resource node; CPU_use_i indicates the number of CPU resources used in the i-th used resource node, and CPU_all_i indicates the total number of CPU resources contained in the i-th used resource node; I indicates the number of used resource nodes; T_j^{w-t} represents the weighted turnaround time of the j-th task; T_j^{finish-submit} represents the time taken by the j-th task from task submission to task completion; T_j^{finish-start} represents the time taken by the j-th task from the start of execution to task completion; T_j^{finish} indicates the task completion time of the j-th task, T_j^{submit} indicates the task submission time of the j-th task, and T_j^{start} indicates the task start execution time of the j-th task; and J represents the number of tasks in the task graph.
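Following the formulas as written above (reward = α·R − β·T, with R the mean node utilization and T the mean weighted turnaround time), a minimal Python sketch of the reward computation could look as follows; the exact functional form, the α and β values and the field names are assumptions made for illustration only.

from typing import List

def node_utilization(cpu_use: List[int], cpu_all: List[int]) -> float:
    """R: average resource utilization over the used resource nodes (equations (2)-(3))."""
    return sum(u / a for u, a in zip(cpu_use, cpu_all)) / len(cpu_use)

def weighted_turnaround(submit: List[float], start: List[float], finish: List[float]) -> float:
    """T: average weighted turnaround time over all tasks (equations (4)-(5))."""
    per_task = [(f - sub) / (f - st) for sub, st, f in zip(submit, start, finish)]
    return sum(per_task) / len(per_task)

def reward(cpu_use, cpu_all, submit, start, finish, alpha=1.0, beta=0.1) -> float:
    """reward = alpha * R - beta * T (equation (1)); alpha and beta are normalization factors."""
    return alpha * node_utilization(cpu_use, cpu_all) - beta * weighted_turnaround(submit, start, finish)

# Two used nodes at 50% and 100% CPU utilization; two tasks, both with weighted turnaround 2.
print(reward(cpu_use=[1, 2], cpu_all=[2, 2],
             submit=[0.0, 0.0], start=[1.0, 2.0], finish=[2.0, 4.0]))  # 0.55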
According to the reinforcement learning-based Ray double-layer scheduling method provided by the invention, a decision action space is first determined, and a target decision action is then determined from the decision action space by combining the resource node cluster information, the resource node cluster task queue information and the preset reward function in the preset Ray double-layer scheduling model. Since the decision action space is the space formed by all decision actions related to the operation corresponding to the target task instruction, the target decision action determined from it by combining the resource node cluster information, the resource node cluster task queue information and the preset reward function has higher rationality and reliability, which further ensures the resource scheduling efficiency.
Optionally, the determining a target decision action from the decision action space based on the resource node cluster information, the resource node cluster task queue information, and a preset reward function includes:
performing quality judgment on the decision action in the decision action space by using the resource node cluster information and the resource node cluster task queue information in combination with the preset reward function, to obtain a quality judgment result after reinforcement learning; if the quality judgment result after reinforcement learning reaches the preset reinforcement learning termination condition, determining the decision action as the target decision action; if the quality judgment result after reinforcement learning does not reach the preset reinforcement learning termination condition, selecting a new decision action from the decision action space, and then returning to the step of performing quality judgment on the decision action in the decision action space by using the resource node cluster information and the resource node cluster task queue information in combination with the preset reward function, until the target decision action is obtained.
Specifically, a resource scheduler in a local resource node uses resource node cluster information and resource node cluster task queue information, performs a quality judgment on a decision action in a decision action space in combination with a preset reward function to obtain a quality judgment result after reinforcement learning, can compare the quality judgment result after reinforcement learning with a preset optimization target at the moment, and determines an action decision corresponding to the reinforcement learning as a target decision action if the quality judgment result after reinforcement learning reaches the preset optimization target; otherwise, if the result of the goodness judgment after reinforcement learning does not reach the preset optimization target, a new decision action is selected from the decision action space again and the goodness judgment is executed. And determining a target decision action until the result of the quality judgment after reinforcement learning meets a preset optimization target.
Or judging whether the accumulated times of the current reinforcement learning reaches the preset learning times, if not, performing reinforcement learning again aiming at the new decision-making action in the decision-making action space until the reinforcement learning is finished when the preset learning times is reached, and determining the corresponding decision-making action as the target decision-making action when the reinforcement learning is finished.
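The evaluate-until-termination control flow described above can be sketched schematically in Python; this is not the patent's reinforcement learning algorithm itself, and the candidate encoding, the toy scoring function and the fixed evaluation budget (one of the two termination variants mentioned) are assumptions.

import random
from typing import Callable, List, Tuple

def select_target_action(candidates: List[Tuple[str, str]],
                         score: Callable[[Tuple[str, str]], float],
                         max_evaluations: int = 100) -> Tuple[str, str]:
    """Perform quality judgment on randomly drawn decision actions with a reward-based
    score; stop when the preset number of learning iterations is reached and return
    the best action seen, which serves as the target decision action."""
    best_action, best_score = candidates[0], float("-inf")
    for _ in range(max_evaluations):
        action = random.choice(candidates)
        value = score(action)
        if value > best_score:
            best_action, best_score = action, value
    return best_action

actions = [("Task11", "node1"), ("Task11", "node2"), ("Task12", "node1")]
# Toy score: prefer node2 (e.g. because it is idle); a real system would use the learned value.
print(select_target_action(actions, score=lambda a: 1.0 if a[1] == "node2" else 0.0))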
According to the reinforcement learning-based Ray double-layer scheduling method provided by the invention, the purpose of determining the target decision action is achieved by using the resource node cluster information and the resource node cluster task queue information, in combination with the preset reward function, to perform reinforcement learning and quality judgment on the decision actions in the decision action space, thereby improving the rationality and reliability of the target decision action.
The invention provides a reinforcement learning-based Ray double-layer scheduling device, and the following reinforcement learning-based Ray double-layer scheduling device and the above reinforcement learning-based Ray double-layer scheduling method can be referred to correspondingly.
Fig. 3 illustrates a reinforcement learning based Ray double-layer scheduling apparatus, as shown in fig. 3, the reinforcement learning based Ray double-layer scheduling apparatus 300 includes: a first obtaining module 310, configured to obtain a cluster task queue, where the cluster task queue includes a queue representing a dependency relationship between tasks in each task of a target application; a second obtaining module 320, configured to obtain resource node cluster information, where the resource node cluster information includes respective numbers and resource types of used resource nodes and unused resource nodes in each resource node, and a resource utilization rate of the used resource nodes; a third obtaining module 330, configured to obtain resource node cluster task queue information, where the resource node cluster task queue information includes a total number of tasks in the cluster task queue, task waiting time, task resource requirements, and task estimated running time; a determining module 340, configured to determine a target decision action based on a preset Ray double-layer scheduling model; the preset Ray double-layer scheduling model comprises a target decision action determined after reinforcement learning is carried out on the basis of the resource node cluster information and the resource node cluster task queue information; and a scheduling module 350, configured to schedule the task to be scheduled in the cluster task queue to the corresponding allocated resource node based on the target decision action.
Optionally, the reinforcement learning-based Ray double-layer scheduling device further includes a calling module, which may be configured to obtain the resource demand of a task to be scheduled in the cluster task queue, the current task queue length of a local resource node, and the current resource usage of the local resource node; and call the preset Ray double-layer scheduling model according to the resource demand, the current task queue length and the current resource usage.
Optionally, the calling module may be further configured to judge whether the current task queue length reaches a preset queue length threshold and judge, based on the current resource usage, whether the local resource node can meet the resource demand; if the current task queue length does not reach the preset queue length threshold and the local resource node can meet the resource demand, perform scheduling by using the preset local scheduling submodel; and if the current task queue length reaches the preset queue length threshold or the local resource node cannot meet the resource demand, call the preset global scheduling submodel, so that the target decision action is determined after reinforcement learning is performed based on the resource node cluster information and the resource node cluster task queue information.
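A hedged sketch of the calling module's two-layer dispatch rule follows; the threshold value of 10 and the dictionary-based resource representation are assumptions rather than values given by the patent:

def choose_scheduler(task_demand, local_queue_len, local_free_resources,
                     queue_len_threshold=10):
    # Stay on the preset local scheduling submodel only while the local queue is
    # below the preset threshold and every requested resource can be satisfied
    # locally; otherwise escalate to the preset global scheduling submodel.
    queue_ok = local_queue_len < queue_len_threshold
    resources_ok = all(local_free_resources.get(res, 0) >= amount
                       for res, amount in task_demand.items())
    return "local" if (queue_ok and resources_ok) else "global"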
Optionally, the reinforcement learning-based Ray double-layer scheduling device further includes a processing module, configured to obtain a time structure body, where the time structure body includes the generation time, the start execution time, and the waiting time of a task to be executed in each resource node; update the resource node cluster information and the resource node cluster task queue information respectively according to the time structure body to obtain new resource node cluster information and new resource node cluster task queue information; re-execute, according to the new resource node cluster information and the new resource node cluster task queue information, the step of determining a target decision action based on the preset Ray double-layer scheduling model; and end the scheduling process when all tasks in the task graph of the target application have been executed, where the task graph includes a graph generated according to the dependency relationship among the tasks.
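The time structure body and the update loop described above might look like the sketch below, where the helper callables collect_timings and refresh_state and the task_graph object are assumed, not specified by the patent:

from dataclasses import dataclass

@dataclass
class TaskTiming:                 # the "time structure body"
    generation_time: float        # when the task to be executed was generated
    start_time: float             # when it started executing on its resource node
    waiting_time: float           # how long it waited in the node's task queue

def run_until_done(task_graph, apparatus, collect_timings, refresh_state):
    # Re-run the scheduling model on refreshed state until every task in the
    # target application's task graph has been executed.
    while not task_graph.all_done():
        timings = collect_timings()                          # per-node TaskTiming records
        cluster_info, queue_info = refresh_state(timings)    # new cluster / queue information
        action = apparatus.determine_action(cluster_info, queue_info)
        apparatus.dispatch(task_graph.next_ready_task(), action)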
Optionally, the determining module 340 may be specifically configured to determine a decision action space, where the decision action space includes a space formed by all decision actions related to the operation corresponding to a target task instruction, and the target task instruction includes an instruction for instructing generation of each task; and determine a target decision action from the decision action space based on the resource node cluster information, the resource node cluster task queue information and the preset reward function in the preset Ray double-layer scheduling model.
Optionally, the determining module 340 may be further configured to perform, by using the resource node cluster information and the resource node cluster task queue information in combination with the preset reward function in the preset Ray double-layer scheduling model, a quality judgment on a decision action in the decision action space to obtain a quality judgment result after reinforcement learning; if the quality judgment result after reinforcement learning reaches a preset reinforcement learning termination condition, determine the decision action as the target decision action; and if the quality judgment result after reinforcement learning does not reach the preset reinforcement learning termination condition, select a new decision action from the decision action space and return to the step of performing a quality judgment on the decision action in the decision action space by using the resource node cluster information and the resource node cluster task queue information in combination with the preset reward function, until the target decision action is obtained.
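The patent does not disclose the concrete form of the preset reward function; purely as a hypothetical example, a reward that trades off expected task waiting time against resource utilization could drive the quality judgment:

def example_reward(action, cluster_info, queue_info,
                   wait_weight=1.0, util_weight=0.5):
    # Hypothetical reward: penalize the waiting time expected on the chosen node
    # and reward the utilization gained by placing the task there. queue_info
    # could additionally weight the reward by overall backlog; it is ignored in
    # this toy example. The node and action fields are illustrative assumptions.
    node = cluster_info[action.node_id]
    expected_wait = node.queue_length * node.mean_task_runtime
    utilization_gain = min(1.0, node.utilization + action.task_resource_share)
    return util_weight * utilization_gain - wait_weight * expected_wait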
Fig. 4 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 4, the electronic device may include: a processor 410, a communication interface 420, a memory 430 and a communication bus 440, wherein the processor 410, the communication interface 420 and the memory 430 communicate with each other via the communication bus 440. The processor 410 may invoke logic instructions in the memory 430 to perform the reinforcement learning-based Ray double-layer scheduling method, the method comprising:
acquiring a cluster task queue, wherein the cluster task queue comprises a queue representing the dependency relationship among tasks in each task of a target application; acquiring resource node cluster information, wherein the resource node cluster information comprises the respective quantity and resource types of used resource nodes and unused resource nodes in each resource node and the resource utilization rate of the used resource nodes; acquiring resource node cluster task queue information, wherein the resource node cluster task queue information comprises the total number of tasks in a cluster task queue, task waiting time, task resource requirements and task estimated running time; determining a target decision action based on a preset Ray double-layer scheduling model; the preset Ray double-layer scheduling model comprises a target decision action determined after reinforcement learning is carried out on the basis of the resource node cluster information and the resource node cluster task queue information; and scheduling the tasks to be scheduled in the cluster task queue to the corresponding distributed resource nodes based on the target decision action.
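Assuming the illustrative apparatus class sketched earlier, one pass through the five steps executed by the processor might read as follows (task_queue.pop_next() is an assumed helper):

def schedule_once(apparatus):
    task_queue = apparatus.acquire_cluster_task_queue()             # step 1: cluster task queue
    cluster_info = apparatus.acquire_cluster_info()                 # step 2: resource node cluster information
    queue_info = apparatus.acquire_queue_info()                     # step 3: cluster task queue information
    action = apparatus.determine_action(cluster_info, queue_info)   # step 4: target decision action
    apparatus.dispatch(task_queue.pop_next(), action)               # step 5: schedule the task to its node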
In addition, the logic instructions in the memory 430 may be implemented in the form of software functional units and, when sold or used as independent products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention, or the part thereof that contributes to the prior art, may in essence be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
In another aspect, the present invention further provides a computer program product. The computer program product includes a computer program, which can be stored on a non-transitory computer-readable storage medium; when the computer program is executed by a processor, the computer is able to execute the reinforcement learning-based Ray double-layer scheduling method provided by the above methods, the method comprising:
acquiring a cluster task queue, wherein the cluster task queue comprises a queue representing the dependency relationship among tasks in each task of a target application; acquiring resource node cluster information, wherein the resource node cluster information comprises the respective quantity and resource types of used resource nodes and unused resource nodes in each resource node and the resource utilization rate of the used resource nodes; acquiring resource node cluster task queue information, wherein the resource node cluster task queue information comprises the total number of tasks in a cluster task queue, task waiting time, task resource requirements and task estimated running time; determining a target decision action based on a preset Ray double-layer scheduling model; the preset Ray double-layer scheduling model comprises a target decision action determined after reinforcement learning is carried out on the basis of the resource node cluster information and the resource node cluster task queue information; and scheduling the tasks to be scheduled in the cluster task queue to the corresponding distributed resource nodes based on the target decision action.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the reinforcement learning-based Ray double-layer scheduling method provided by the above methods, the method comprising:
acquiring a cluster task queue, wherein the cluster task queue comprises a queue representing the dependency relationship among tasks in each task of a target application; acquiring resource node cluster information, wherein the resource node cluster information comprises the respective quantity and resource types of used resource nodes and unused resource nodes in each resource node and the resource utilization rate of the used resource nodes; acquiring resource node cluster task queue information, wherein the resource node cluster task queue information comprises the total number of tasks in a cluster task queue, task waiting time, task resource requirements and task estimated running time; determining a target decision action based on a preset Ray double-layer scheduling model; the preset Ray double-layer scheduling model comprises a target decision action determined after reinforcement learning is carried out on the basis of the resource node cluster information and the resource node cluster task queue information; and scheduling the tasks to be scheduled in the cluster task queue to the corresponding distributed resource nodes based on the target decision action.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A Ray double-layer scheduling method based on reinforcement learning is characterized by comprising the following steps:
acquiring a cluster task queue, wherein the cluster task queue comprises a queue representing the dependency relationship among tasks in each task of a target application;
acquiring resource node cluster information, wherein the resource node cluster information comprises the respective quantity and resource types of used resource nodes and unused resource nodes in each resource node and the resource utilization rate of the used resource nodes;
acquiring resource node cluster task queue information, wherein the resource node cluster task queue information comprises the total number of tasks in a cluster task queue, task waiting time, task resource requirements and task estimated running time;
determining a target decision action based on a preset Ray double-layer scheduling model; the preset Ray double-layer scheduling model comprises a target decision action which is determined after reinforcement learning is carried out on the basis of the resource node cluster information, the resource node cluster task queue information, the decision action space and a preset reward function;
and scheduling the tasks to be scheduled in the cluster task queue to the corresponding distributed resource nodes based on the target decision action.
2. The reinforcement learning-based Ray double-layer scheduling method according to claim 1, wherein before the step of determining a target decision action based on a preset Ray double-layer scheduling model, the method further comprises:
acquiring the resource demand of a task to be scheduled in the cluster task queue, the current task queue length of a local resource node and the current resource use condition of the local resource node;
and calling a preset Ray double-layer scheduling model according to the resource demand, the current task queue length and the current resource use condition.
3. The reinforcement learning-based Ray double-layer scheduling method according to claim 2, wherein the preset Ray double-layer scheduling model comprises a preset local scheduling submodel and a preset global scheduling submodel, and the invoking of the preset Ray double-layer scheduling model according to the resource demand, the current task queue length and the current resource usage comprises:
judging whether the length of the current task queue reaches a preset queue length threshold value or not and judging whether the local resource node can meet the resource demand or not based on the current resource use condition;
if the length of the current task queue does not reach the preset queue length threshold and the local resource node meets the resource demand, scheduling by using the preset local scheduling submodel;
and if the length of the current task queue reaches the preset queue length threshold or the local resource node cannot meet the resource demand, calling the preset global scheduling submodel to determine the target decision action after reinforcement learning is performed based on the resource node cluster information and the resource node cluster task queue information.
4. The reinforcement learning-based Ray double-layer scheduling method according to claim 1, wherein after the step of scheduling the tasks to be scheduled in the cluster task queue to the corresponding allocated resource nodes based on the target decision-making action, the method further comprises:
acquiring a time structure body, wherein the time structure body comprises a generation time, a starting execution time and a waiting time of a task to be executed in each resource node;
respectively updating the resource node cluster information and the resource node cluster task queue information according to the time structure body to obtain new resource node cluster information and new resource node cluster task queue information;
re-executing, according to the new resource node cluster information and the new resource node cluster task queue information, the step of determining a target decision action based on the preset Ray double-layer scheduling model;
and ending the scheduling process when all tasks in the task graph of the target application have been executed, wherein the task graph comprises a graph generated according to the dependency relationship among the tasks.
5. The reinforcement learning-based Ray double-layer scheduling method according to claim 1, wherein the determining a target decision action based on a preset Ray double-layer scheduling model comprises:
determining a decision action space, wherein the decision action space comprises a space formed by all decision actions related to operations corresponding to target task instructions, and the target task instructions comprise instructions for generating each task;
and determining a target decision action from the decision action space based on the resource node cluster information, the resource node cluster task queue information and a preset reward function in the preset Ray double-layer scheduling model.
6. The reinforcement learning-based Ray double-layer scheduling method according to claim 5, wherein the determining a target decision action from the decision action space based on the resource node cluster information, the resource node cluster task queue information, and a preset reward function in the preset Ray double-layer scheduling model comprises:
performing quality judgment on the decision action in the decision action space by using the resource node cluster information and the resource node cluster task queue information and combining a preset reward function in the preset Ray double-layer scheduling model to obtain a quality judgment result after reinforcement learning;
if the quality judgment result after reinforcement learning reaches a preset reinforcement learning termination condition, determining the decision action as the target decision action;
if the quality judgment result after reinforcement learning does not reach the preset reinforcement learning termination condition, selecting a new decision action from the decision action space, and then returning to the step of performing a quality judgment on the decision action in the decision action space by using the resource node cluster information and the resource node cluster task queue information in combination with the preset reward function, until the target decision action is obtained.
7. A Ray double-layer scheduling device based on reinforcement learning is characterized by comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for acquiring a cluster task queue, and the cluster task queue comprises a queue for representing the dependency relationship among tasks in each task of a target application;
a second obtaining module, configured to obtain resource node cluster information, where the resource node cluster information includes respective numbers and resource types of used resource nodes and unused resource nodes in each resource node, and a resource utilization rate of the used resource nodes;
the third acquisition module is used for acquiring resource node cluster task queue information, wherein the resource node cluster task queue information comprises the total number of tasks in the cluster task queue, task waiting time, task resource requirements and task estimated running time;
the determining module is used for determining a target decision action based on a preset Ray double-layer scheduling model; the preset Ray double-layer scheduling model comprises a target decision action which is determined after reinforcement learning is carried out on the basis of the resource node cluster information, the resource node cluster task queue information, the decision action space and a preset reward function;
and the scheduling module is used for scheduling the tasks to be scheduled in the cluster task queue to the corresponding allocated resource nodes based on the target decision action.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps of the reinforcement learning based Ray dual-layer scheduling method according to any one of claims 1 to 6.
9. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor implements the steps of the reinforcement learning based Ray dual-layer scheduling method according to any one of claims 1 to 6.
CN202111362677.6A 2021-11-17 2021-11-17 Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment Active CN114237869B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111362677.6A CN114237869B (en) 2021-11-17 2021-11-17 Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111362677.6A CN114237869B (en) 2021-11-17 2021-11-17 Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment

Publications (2)

Publication Number Publication Date
CN114237869A CN114237869A (en) 2022-03-25
CN114237869B true CN114237869B (en) 2022-09-16

Family

ID=80749803

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111362677.6A Active CN114237869B (en) 2021-11-17 2021-11-17 Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment

Country Status (1)

Country Link
CN (1) CN114237869B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115080248B (en) * 2022-08-19 2023-01-10 中兴通讯股份有限公司 Scheduling optimization method for scheduling device, and storage medium
CN115858143B (en) * 2022-10-31 2023-10-27 中国人民解放军战略支援部队航天工程大学 Ray frame-based software measurement and control system
CN116069512B (en) * 2023-03-23 2023-08-04 之江实验室 Serverless efficient resource allocation method and system based on reinforcement learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109656702A (en) * 2018-12-20 2019-04-19 西安电子科技大学 A kind of across data center network method for scheduling task based on intensified learning
CN110928689A (en) * 2019-12-05 2020-03-27 中国人民解放军军事科学院国防科技创新研究院 Self-adaptive resource management method and device for distributed reinforcement learning training
CN111722928A (en) * 2020-06-12 2020-09-29 北京字节跳动网络技术有限公司 Resource scheduling method and device, electronic equipment and storage medium
CN113157422A (en) * 2021-04-29 2021-07-23 清华大学 Cloud data center cluster resource scheduling method and device based on deep reinforcement learning

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104317650B (en) * 2014-10-10 2018-05-01 北京工业大学 A kind of job scheduling method towards Map/Reduce type mass data processing platforms
CN107045456B (en) * 2016-02-05 2020-03-10 华为技术有限公司 Resource allocation method and resource manager
US10620993B2 (en) * 2017-02-27 2020-04-14 International Business Machines Corporation Automated generation of scheduling algorithms based on task relevance assessment


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Optimization strategy based on deep reinforcement learning for home energy management; Yuankun Liu; IEEE; 2020-04-06; full text *
Cooperative dependent multi-task grid cluster scheduling based on reinforcement learning and ant colony algorithm; Zhang Xinhua; Journal of Changsha University; 2016-03-15 (No. 02); full text *

Also Published As

Publication number Publication date
CN114237869A (en) 2022-03-25

Similar Documents

Publication Publication Date Title
CN114237869B (en) Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment
Wang et al. Computation offloading in multi-access edge computing using a deep sequential model based on reinforcement learning
CN107888669B (en) Deep learning neural network-based large-scale resource scheduling system and method
CN109947567B (en) Multi-agent reinforcement learning scheduling method and system and electronic equipment
Tuli et al. COSCO: Container orchestration using co-simulation and gradient based optimization for fog computing environments
Goudarzi et al. A distributed deep reinforcement learning technique for application placement in edge and fog computing environments
Guo et al. Cloud resource scheduling with deep reinforcement learning and imitation learning
CN115248728B (en) Distributed training task scheduling method, system and device for intelligent computing
US9239734B2 (en) Scheduling method and system, computing grid, and corresponding computer-program product
CN112416585B (en) Deep learning-oriented GPU resource management and intelligent scheduling method
WO2022068663A1 (en) Memory allocation method, related device, and computer readable storage medium
CN111274036A (en) Deep learning task scheduling method based on speed prediction
US20240111586A1 (en) Multi-policy intelligent scheduling method and apparatus oriented to heterogeneous computing power
CN111209077A (en) Deep learning framework design method
CN114895773B (en) Energy consumption optimization method, system and device for heterogeneous multi-core processor and storage medium
Ye et al. A new approach for resource scheduling with deep reinforcement learning
CN114610474A (en) Multi-strategy job scheduling method and system in heterogeneous supercomputing environment
Hu et al. Learning workflow scheduling on multi-resource clusters
CN117331668A (en) Job scheduling method, device, equipment and storage medium
Tang et al. Collaborative cloud-edge-end task offloading with task dependency based on deep reinforcement learning
CN114675975B (en) Job scheduling method, device and equipment based on reinforcement learning
CN112698911B (en) Cloud job scheduling method based on deep reinforcement learning
CN114489966A (en) Job scheduling method and device
CN113821313A (en) Task scheduling method and device and electronic equipment
CN113238841A (en) Task scheduling method based on cloud computing technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant