CN114489966A - Job scheduling method and device

Job scheduling method and device

Info

Publication number
CN114489966A
CN114489966A
Authority
CN
China
Prior art keywords
job
task queue
vector
scheduling
network
Prior art date
Legal status
Pending
Application number
CN202111598410.7A
Other languages
Chinese (zh)
Inventor
李斌
彭竞
沈鸿
孙元昊
Current Assignee
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202111598410.7A priority Critical patent/CN114489966A/en
Publication of CN114489966A publication Critical patent/CN114489966A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/48Program initiating; Program switching, e.g. by interrupt
    • G06F9/4806Task transfer initiation or dispatching
    • G06F9/4843Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
    • G06F9/4881Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues


Abstract

The embodiment of the invention provides a job scheduling method and a job scheduling device, where the job scheduling method comprises: acquiring job information of a task queue; scheduling the task queue according to the job information by using a policy network to obtain a first vector, and scheduling the task queue according to the job information by using an expert policy to obtain a second vector, where the first vector indicates the scheduling priority of each job in the task queue determined by the policy network and the second vector indicates the scheduling priority of each job in the task queue determined by the expert policy; evaluating the first vector and the second vector through a discriminator function to obtain a first estimation value; and updating the policy network according to the first estimation value, and scheduling the jobs in the task queue according to the first vector.

Description

Job scheduling method and device
Technical Field
The embodiment of the invention relates to the technical field of high-performance computing, in particular to a job scheduling method and device.
Background
In recent years, deep learning has made tremendous progress in many different fields, such as computer vision, image recognition, natural language processing, and recommendation algorithms. To improve the accuracy of training results, model sizes keep growing and the amount of training data keeps expanding. As the training computation continues to grow, distributed training methods have to be used to keep training time within an acceptable range.
High-performance computing clusters are characterized by high performance, a high performance-to-cost ratio, convenient expansion, and suitability for parallel tasks, which makes them well suited to large-scale distributed training, physical simulation, and parallel computing. Some large internet companies have established their own Graphics Processing Unit (GPU) or Tensor Processing Unit (TPU) clusters, with corresponding task scheduling and cluster management methods. Existing mainstream scheduling methods require a task to declare the amount of resources it needs at submission time, and a running task cannot be accelerated when new resources become available. This can leave cluster resources underutilized. Since these resources are very expensive, designing a scheduler that uses them efficiently is very important.
Disclosure of Invention
The embodiment of the invention provides a job scheduling method and device, which are used for improving the resource utilization efficiency of a cluster.
In a first aspect, an embodiment of the present invention provides a job scheduling method, where the method includes:
acquiring job information of a task queue; the task queue comprises at least one job;
scheduling the task queue according to the job information by using a policy network to obtain a first vector, and scheduling the task queue according to the job information by using an expert policy to obtain a second vector; the first vector indicates the scheduling priority of each job in the task queue determined by the policy network, and the second vector indicates the scheduling priority of each job in the task queue determined by the expert policy;
evaluating the first vector and the second vector through a discriminator function to obtain a first estimation value, wherein the first estimation value indicates the probability that the first vector is an expert trajectory generated by the expert policy;
and updating the policy network according to the first estimation value, and scheduling the jobs in the task queue according to the first vector.
In one possible implementation, the job information includes the following information:
the submission time of each job in the task queue; the job duration of each job in the task queue; the number of central processing units occupied by each job in the task queue; the number of graphics processors occupied by each job in the task queue; the size of the memory occupied by each job in the task queue; and an identification of each job in the task queue.
In one possible implementation, updating the policy network according to the first estimation value includes:
inputting the first estimation value as a reward value into a value function network corresponding to the policy network to obtain an advantage function corresponding to the policy network;
and updating the policy network according to the advantage function.
In one possible implementation, scheduling the jobs in the task queue according to the first vector includes:
scheduling the jobs in the task queue into a running queue in descending order of priority according to the first vector, until the remaining resources cannot satisfy running the highest-priority job among the jobs remaining in the task queue.
In one possible implementation, the jobs in the task queue are scheduled again every preset time interval or whenever a job in the task queue completes.
In a second aspect, an embodiment of the present invention provides a job scheduling apparatus, including:
an acquisition unit configured to acquire job information of a task queue; the task queue comprises at least one job;
the processing unit is configured to schedule the task queue according to the job information by using a policy network to obtain a first vector, and schedule the task queue according to the job information by using an expert policy to obtain a second vector, where the first vector indicates the scheduling priority of each job in the task queue determined by the policy network and the second vector indicates the scheduling priority of each job in the task queue determined by the expert policy; evaluate the first vector and the second vector through a discriminator function to obtain a first estimation value, the first estimation value indicating the probability that the first vector was generated by the expert policy; and update the policy network according to the first estimation value and schedule the jobs in the task queue according to the first vector.
In one possible implementation, the job information includes the following information:
the submission time of each job in the task queue; the job duration of each job in the task queue; the number of central processing units occupied by each job in the task queue; the number of graphics processors occupied by each job in the task queue; the size of the memory occupied by each job in the task queue; and an identification of each job in the task queue.
In one possible implementation, the processing unit is specifically configured to:
inputting the first estimation value as a reward value into a value function network corresponding to the policy network to obtain an advantage function corresponding to the policy network;
and updating the policy network according to the advantage function.
In a third aspect, an embodiment of the present invention provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the method provided in the first aspect is implemented.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where the storage medium stores computer instructions, and the computer instructions, when executed by a processor, perform the method provided in the first aspect.
In a fifth aspect, embodiments of the present invention also provide a computer program product including computer-executable instructions for causing a computer to perform the method as provided in the first aspect.
Through the technical solutions in one or more of the above embodiments of the present invention, the embodiments of the present invention have at least the following technical effects:
in the embodiments provided by the invention, scheduling the task queue with the policy network according to the job information saves a large amount of time otherwise spent tuning and optimizing the cluster priority function. In addition, the embodiments of the invention can adapt to the distribution of job sets, avoiding the degraded scheduling caused by changes in the job-set distribution, and generate the priority function by combining cluster characteristics with job-set characteristics, so the embodiments can automatically adapt to the expansion of cluster resources.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and other drawings can be obtained by those skilled in the art based on these drawings without inventive effort.
Fig. 1 is a schematic diagram of task queue scheduling according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a network architecture according to an embodiment of the present invention;
fig. 3 is a flowchart illustrating a job scheduling method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a job scheduling apparatus according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more clearly understood, the present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Reinforcement Learning (RL), also known as evaluative learning, is one of the paradigms and methodologies of machine learning, used to describe and solve the problem of an agent learning a strategy that maximizes return or achieves a specific goal while interacting with its environment (Sutton R S, Barto A G. Reinforcement Learning: An Introduction. Cambridge, MA: The MIT Press, 1998). A reinforcement learning task can be represented as a Markov decision process (MDP). A finite-state MDP is denoted M, a four-tuple M = <S, A, T, R>, where S is a state set, A is an action set, T: S × A × S → [0, 1] is the transition probability, and R: S × A × S → R is a reward function. For example, in a maze game, the agent walks randomly around the maze; at time t it is in state s_t (a position in the maze) and selects action a_t (one of up, down, left, and right), which takes it with probability T(s_t, a_t, s_{t+1}) to the next state s_{t+1}, after which it obtains a reward value R(s_t, a_t, s_{t+1}); the reward is 1 when the exit is found and 0 in all other cases.
Therefore, the goal of reinforcement learning is to learn an optimal policy π: S → P(A), where P(A) is a probability distribution over the action set A. Finding the optimal policy π means maximizing the accumulated reward

E[ Σ_{t≥0} γ^t R(s_t, a_t, s_{t+1}) ]

where γ ∈ (0, 1) is the discount factor.
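As a brief illustration (a sketch under assumed names, not part of the patent), the accumulated reward of a finite episode follows directly from this definition; for the maze example, the reward sequence is all zeros until the exit is found:

```python
# Minimal sketch of the discounted accumulated reward; the function name and
# the example episode are illustrative assumptions, not from the patent.
def discounted_return(rewards, gamma=0.99):
    """Compute the sum over t of gamma^t * r_t for one episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A five-step maze episode in which the exit is found on the last step.
print(discounted_return([0, 0, 0, 0, 1]))  # 0.99**4 ~= 0.9606
```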
Reinforcement learning is an artificial intelligence framework that learns knowledge and skills through interaction with the environment. In RL, the agent must collect the reward signals emitted by the environment to identify its learning goal, but in some real-world tasks the reward function of the environment is difficult to define. Inverse reinforcement learning is a typical method for designing such a reward signal.
Currently, in the job scheduling process, jobs arrive continuously over time. As shown in fig. 1, a cluster generally performs job scheduling at regular intervals by maintaining an internal ordered job queue and collecting specific information about the jobs. For example, a high-performance cluster maintains a job queue of length 128; each submitted job is first put into a waiting state, and if there is a free slot in the job queue, the job is pulled into the scheduling queue according to the rules during the next round of job scheduling.
In fig. 1, jobs are placed in a task queue, and a job may be regarded as a task. When the task queue has a vacancy, a job enters the task queue. In the embodiment of the invention, the jobs in the task queue can be sorted by priority to obtain a sorted task queue. The jobs in the task queue can then be run: when cluster resources are idle, jobs are started in order of priority.
A general job queue is scheduled with heuristics such as first-in first-out or smallest-job-first. But jobs differ in both the time dimension and the space dimension: the duration a job needs to run can be regarded as its size in the time dimension, and the nodes it occupies can be understood as its size in the space dimension. A scheduling method that only considers arrival time may force small jobs that could finish quickly to wait a long time; simply prioritizing small jobs may starve large jobs as small jobs keep arriving. Scheduling only by the time dimension gives short jobs that occupy large amounts of spatial resources a higher priority, hurting the operating efficiency of the whole cluster; scheduling only by the space dimension lets long jobs occupy cluster resources for a long time, which also reduces cluster efficiency.
Therefore, to find a good scheduling policy, the following problems need to be solved:
1. how to generate an appropriate scheduling policy based on the job set characteristics.
In each scheduling round, a mainstream cluster resource manager such as Slurm traverses each job in the task queue and schedules according to rules. But a rule-based method can only extract the job information in the current task queue; it cannot discover the overall characteristics of the jobs over a period of time to perform flexible and reasonable scheduling. That is, fixed scheduling rules can only serve job sets satisfying certain characteristics, and a large amount of time is spent on exploratory attempts. Slurm is an open-source, highly scalable cluster management tool and job scheduling system for Linux clusters of all sizes.
2. How to adapt to different job set characteristics.
Within a cluster, the characteristics of the job set are not constant: as submitting users change, submission times change, and cluster resources change, the characteristics of the job set change correspondingly in ways that are difficult to model and confirm. Therefore, making the scheduling policy adapt to changes in the characteristics of the cluster job set is also an important research direction.
Therefore, the embodiment of the invention provides a job scheduling method based on inverse reinforcement learning, which can explore the characteristics of the cluster job set and adaptively generate a corresponding scheduling method from those characteristics, thereby improving the resource utilization efficiency of the cluster and reducing the average waiting-time ratio of user jobs.
At present, research on solving the cluster job scheduling problem with reinforcement learning mostly obtains a fixed job-priority function adapted to the distribution of the job sets it is fed, using a neural network as an approximate representation of that function. However, reinforcement learning alone has a single learning objective, and the learned policy must change correspondingly under different job distributions, so the embodiment of the invention uses an inverse reinforcement learning method to solve the problem.
The design of the high-performance cluster job scheduling based on the inverse reinforcement learning provided by the embodiment of the invention is based on the following two facts:
1. the learned trajectory must be complete, because only after a complete scheduling process can the scheduling quality be judged by the average waiting-time ratio and used for learning, which undoubtedly greatly increases the difficulty of reinforcement learning. Here a trajectory is a sequence of states and actions in the real environment.
2. Generally, a scheduling policy obtained by reinforcement learning only produces a good scheduling effect on job sets whose distribution matches the one the policy was trained on. After the job set is replaced, the learned policy no longer fully matches it, so the scheduling effect degrades accordingly.
In the embodiment of the invention, the reverse reinforcement learning general process is as follows:
step one, generating a scheduling strategy as an initial strategy;
step two, learning a reward function by comparing the difference between expert interaction samples and the agent's own interaction samples; specifically, in the training process of inverse reinforcement learning, expert policies verified to perform well on the training job set are used to generate expert scheduling trajectories. The reinforcement learning policy network acts as a generator to produce its own scheduling trajectory, a discriminator function judges which of the two trajectories is the expert trajectory and which was generated by the policy network, and the adversarial idea is used to train the policy network so that its trajectories approach those generated by the expert policy.
Thirdly, reinforcement learning is carried out by utilizing an excitation function, and the level of a scheduling strategy is improved;
step four, if the two policies differ little, learning can stop; otherwise return to step two. A minimal sketch of this loop is given after this list.
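The following sketch outlines steps one through four. All names (`rollout`, `env`, the `score` and `update` methods) are assumptions introduced for illustration; the patent does not prescribe this interface:

```python
from math import log

# Sketch of the adversarial inverse-RL loop (steps one to four above).
# `rollout` collects one scheduling trajectory; all names are hypothetical.
def train(policy_net, value_net, discriminator, expert_policy, env, rounds=1000):
    for _ in range(rounds):
        # Step one/three: the policy network (generator) produces its own trajectory.
        policy_traj = rollout(policy_net, env)
        # The expert policy produces an expert scheduling trajectory.
        expert_traj = rollout(expert_policy, env)

        # Step two: the discriminator learns to tell the trajectories apart.
        p_expert = discriminator.score(expert_traj)   # P(real expert judged expert)
        p_general = discriminator.score(policy_traj)  # P(generated judged expert)
        discriminator.update(-log(p_expert) - log(1.0 - p_general))

        # Step three: reinforcement learning with the discriminator score as reward.
        policy_net.update(policy_traj, reward=p_general, value_net=value_net)

        # Step four: stop once the discriminator cannot distinguish the two
        # (both probabilities tend to 0.5); otherwise return to step two.
        if abs(p_expert - 0.5) < 0.01 and abs(p_general - 0.5) < 0.01:
            break
```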
Fig. 2 is a schematic diagram of a network architecture according to an embodiment of the present invention, showing how the computing resources of a high-performance cluster are modeled. In the embodiment of the invention, a cluster can be regarded as consisting of multiple computing nodes; the nodes share storage, and each node can have various computing resources: CPU computing resources, GPU computing resources, and memory. All nodes are interconnected with at least 100 Gb/s Enhanced Data Rate (EDR) InfiniBand, so the communication overhead between nodes can be approximately ignored.
A computing node in the embodiment of the invention can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks, and big data and artificial intelligence platforms.
Fig. 3 is a schematic flowchart of a job scheduling method according to an embodiment of the present invention, which includes:
s301: acquiring job information of a task queue; the task queue includes at least one job.
In the embodiment of the present invention, during the training of the reinforcement learning module, a job sample, that is, all jobs involved in one interaction with the cluster, may be obtained by sampling from a job data set, where the job information includes the following information:
the submission time of each job in the task queue; the job duration of each job in the task queue; the number of central processing units occupied by each job in the task queue; the number of graphics processors occupied by each job in the task queue; the size of the memory occupied by each job in the task queue; an identification of each job in the task queue; and an identification of the submitting user of each job in the task queue.
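A record holding these fields might look as follows (a sketch; the field names are assumptions introduced here, not the patent's):

```python
from dataclasses import dataclass

# Sketch of one job record with the fields listed above; names are assumed.
@dataclass
class Job:
    submit_time: float  # submission time
    duration: float     # job duration
    num_cpus: int       # number of central processing units occupied
    num_gpus: int       # number of graphics processors occupied
    memory: float       # size of the memory occupied
    job_id: int         # identification of the job
    user_id: int        # identification of the submitting user
```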
In the embodiment of the invention, the reinforcement learning and inverse reinforcement learning network parameters also need to be initialized, including the policy network parameters, the value function network parameters, and the discriminator function parameters.
The policy network is an artificial neural network that is not yet fully trained and can be used for job scheduling; the specific training process is not limited here and is not described further.
In the embodiment of the invention, information such as the parameters corresponding to the expert policy can be acquired. The expert policy can be a fully trained artificial neural network that can be used for job scheduling and obtains good scheduling results.
In the embodiment of the invention, during interaction with the environment, scheduling can be performed every 15 minutes or whenever a job in the task queue completes.
S302: scheduling the task queue according to the job information by using the policy network to obtain a first vector, and scheduling the task queue according to the job information by using the expert policy to obtain a second vector.
The first vector indicates the scheduling priority of each job in the task queue as determined by the policy network, and the second vector indicates the scheduling priority of each job in the task queue as determined by the expert policy.
In the embodiment of the invention, during scheduling the policy network outputs priorities according to the job information in the task queue; that is, the priority is calculated using only the characteristics of each job, omitting any computation over the queue order by a general neural network and ensuring that the final priority output is independent of the order of jobs in the queue.
For example, suppose the task queue contains 128 jobs. All jobs in the task queue can be represented by a (128, 5)-dimensional matrix built from the job duration of each job, the number of central processing units occupied by each job, the number of graphics processors occupied by each job, the size of the memory occupied by each job, and the identifier of each job. When scheduling, the idle-resource information of each cluster node is collected into an (n + m)-dimensional vector, i.e., (number of nodes with 1 idle CPU, number of nodes with 2 idle CPUs, …, number of nodes with all CPUs idle, number of nodes with 1 idle GPU, number of nodes with 2 idle GPUs, …, number of nodes with all GPUs idle), where n is the maximum number of CPU cores and m is the maximum number of GPU cards. This vector is appended to the end of each row of the job information matrix to generate a (128, 5 + m + n)-dimensional matrix S, which is input to the policy network. After passing through the policy network, a 128-dimensional vector A, the first vector, is output:

A = [x_1, x_2, x_3, …, x_128]

where x_i represents the priority of the corresponding job in the task queue. (The normalization formula for x_i appears only as an image in the original and is not reproduced here.)
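The construction of the matrix S can be sketched as follows (an assumption about shapes and names consistent with the description above, not the patent's code):

```python
import numpy as np

def build_state(job_matrix, idle_cpu_counts, idle_gpu_counts):
    """job_matrix: (128, 5) array of per-job features.
    idle_cpu_counts: length-n vector; entry k-1 = number of nodes with k idle CPUs.
    idle_gpu_counts: length-m vector; entry k-1 = number of nodes with k idle GPUs."""
    idle = np.concatenate([idle_cpu_counts, idle_gpu_counts])  # (n + m,)
    tiled = np.tile(idle, (job_matrix.shape[0], 1))            # (128, n + m)
    return np.hstack([job_matrix, tiled])                      # (128, 5 + n + m)
```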
For the problem of high-performance cluster job scheduling, the quality of the intermediate scheduling process is difficult to define; only the final scheduling result can be evaluated and scored, with feedback learning performed according to a set reward function. The embodiment of the invention therefore uses inverse reinforcement learning instead of manually setting the reward function. During the training of inverse reinforcement learning, expert policies verified to perform well on the training job set are used to generate expert scheduling trajectories. Accordingly, the matrix S may also be scheduled by the expert policy to generate the corresponding second vector A_expert.
The specific generation process of the first vector and the second vector is not limited in this embodiment of the present invention.
S303: evaluating the first vector and the second vector by a discriminator function to obtain a first estimation value, the first estimation value indicating the probability that the first vector was generated by the expert policy.
The discriminator function, which may also be called a discriminator, is commonly used in reinforcement learning; the invention does not limit its specific form.
In the embodiment of the present invention, the first vector, the second vector, the current task queue, and the reward function are used as inputs of the value function network, which outputs the advantage function, i.e., a measure of how advantageous the current action is in the current state. The embodiment of the invention updates the value function network with a temporal-difference method; the specific formulas can be expressed as follows:
δ = r + γV(t+1) − V(t)    (1)

Loss = Σ δ²    (2)
Formula (1) defines the advantage function δ, where r is the reward value, γ is the discount factor, and V is the value function; these are common quantities in reinforcement learning and are not described in detail. Formula (2) defines the loss function for the value function network update.
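Formulas (1) and (2) can be sketched as follows (assumed array shapes; `values` holds V(t) for each step plus the final state):

```python
import numpy as np

def td_advantage_and_loss(rewards, values, gamma=0.99):
    """rewards: length-T array of reward values; values: length-(T+1) array of V(t)."""
    rewards, values = np.asarray(rewards), np.asarray(values)
    delta = rewards + gamma * values[1:] - values[:-1]  # formula (1)
    loss = np.sum(delta ** 2)                           # formula (2)
    return delta, loss
```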
In the embodiment of the invention, the reinforcement learning policy network is used as a generator to produce its own scheduling trajectory (i.e., the first vector), a discriminator function is used to judge which of the two trajectories is the expert trajectory and which was generated by the policy network, and the adversarial idea is used so that the characteristics of the trajectories generated by the policy network approach those of the trajectories generated by the expert policy; that is, the discriminator function replaces the reward function of reinforcement learning.
The specific target loss (Loss) function can be as follows:

L = −log(P_expert) − log(1 − P_general)    (3)

where the second estimation value P_expert represents the probability that the discriminator function considers the real expert trajectory to be an expert trajectory, i.e., the probability that the second vector was generated by the expert policy; the first estimation value P_general represents the probability that the discriminator function considers the policy-generated trajectory to be an expert trajectory, i.e., the probability that the first vector was generated by the expert policy.
In the embodiment of the invention, the training goal is that the discriminator function cannot distinguish the trajectory generated by the policy network from a real expert trajectory, i.e., both probability values tend to 0.5. Finally, the discriminator function's score of the policy network's trajectory is used as the reward function r.
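A sketch of formula (3) follows; at the training goal, where both probabilities are 0.5, the loss equals 2 log 2 (matching the worked example later in this description):

```python
from math import log

def discriminator_loss(p_expert, p_general):
    """p_expert: probability the real expert trajectory is judged expert;
    p_general: probability the policy-generated trajectory is judged expert."""
    return -log(p_expert) - log(1.0 - p_general)  # formula (3)

print(discriminator_loss(0.5, 0.5))  # 2*log(2) ~= 1.3863
```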
S304: updating the policy network according to the first estimation value, and scheduling the jobs in the task queue according to the first vector.
In the embodiment of the present invention, the first estimation value may be input as a reward value into the value function network corresponding to the policy network to obtain the advantage function corresponding to the policy network, and the policy network is updated according to the advantage function.
For example, the policy network is updated using the advantage function and a resampling (importance sampling) method. The resampling method can reweight results sampled under a different policy, avoiding the heavy time cost of a large amount of fresh sampling. The specific formulas are as follows:
α = π(a|s) / π_old(a|s)    (4)

Loss = −E[ min( α·δ, clip(α, 1 − ε, 1 + ε)·δ ) ]    (5)

∇Loss = −E[ ∇ min( α·δ, clip(α, 1 − ε, 1 + ε)·δ ) ]    (6)

(Formulas (4)-(6) appear only as images in the original; they are reconstructed here from the surrounding description.) In formula (4), π and π_old respectively represent the probability of taking action a in state s under the new and old policies. Formula (5) defines the loss function used for the policy network update, where clip means that if α lies in the interval (1 − ε, 1 + ε) its value is kept unchanged; otherwise a value greater than 1 + ε is taken as 1 + ε, and a value less than 1 − ε is taken as 1 − ε. The value of ε is determined according to the actual situation; this keeps the magnitude of the policy change within a controllable range. Formula (6) is the gradient calculation corresponding to formula (5), and min() denotes taking the minimum.
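Formulas (4) and (5) can be sketched as follows (a PPO-style clipped loss under assumed names; per-step probabilities and advantages are passed as arrays):

```python
import numpy as np

def clipped_policy_loss(pi_new, pi_old, delta, eps=0.2):
    """pi_new, pi_old: per-step probabilities of the taken actions under the
    new and old policies; delta: per-step advantage values; eps: clip range."""
    alpha = pi_new / pi_old                                       # formula (4)
    clipped = np.clip(alpha, 1.0 - eps, 1.0 + eps)
    return -np.mean(np.minimum(alpha * delta, clipped * delta))   # formula (5)
```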
With this method, the characteristics of trajectories generated by expert policies on different job sets are learned, so the reward function can judge the intermediate scheduling process and adapt to the distributions of different job sets. In addition, the quality of intermediate scheduling states can be learned, which improves the convergence speed when training the reinforcement learning model.
In the embodiment of the present invention, the following method may be adopted to schedule the job:
scheduling the jobs in the task queue into a running queue in descending order of priority according to the first vector, until the remaining resources cannot satisfy running the highest-priority job among the jobs remaining in the task queue.
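A sketch of this scheduling step (reusing the hypothetical `Job` record from above and assumed free-resource counters):

```python
def schedule(task_queue, priorities, free_cpus, free_gpus, free_mem):
    """Move jobs into the running queue in descending priority until the
    remaining resources cannot run the highest-priority remaining job."""
    running = []
    order = sorted(range(len(task_queue)), key=lambda i: -priorities[i])
    for i in order:
        job = task_queue[i]
        if (job.num_cpus > free_cpus or job.num_gpus > free_gpus
                or job.memory > free_mem):
            break  # highest-priority remaining job no longer fits
        free_cpus -= job.num_cpus
        free_gpus -= job.num_gpus
        free_mem -= job.memory
        running.append(job)
    return running
```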
In the embodiment of the invention, the jobs in the task queue can be scheduled again every preset time interval or whenever a job in the task queue completes.
The foregoing process is described below by way of a specific example.
The embodiment of the invention implements the inverse reinforcement learning scheduling method. In the online learning setting, the embodiment of the invention trains and learns on collected historical cluster job information. During training, a job is randomly selected as the starting job, the 255 jobs after it are selected, and these 256 jobs form a training trajectory that is used as the training input.
After training begins, jobs are scheduled at 15-minute intervals or when a job completes. During training, all jobs whose arrival time is later than the current training time are considered invisible; a mask function, i.e., a 256-dimensional vector with one dimension per job, marks visible jobs as 1 and invisible jobs as 0.
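The mask can be sketched as follows (assumed names; jobs whose submission time is later than the current training time are invisible):

```python
import numpy as np

def visibility_mask(submit_times, now):
    """submit_times: length-256 array of job submission times.
    Returns a 256-dimensional vector: 1 for visible jobs, 0 for invisible."""
    return (np.asarray(submit_times) <= now).astype(np.float32)
```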
All jobs in the task queue may be represented in a (128, 5) matrix combining the job duration of each job, the number of central processing units occupied by each job, the number of graphics processors occupied by each job, the size of the memory occupied by each job, and the identifier of each job. When scheduling, the idle-resource information of each cluster node forms an (n + m)-dimensional vector, i.e., (number of nodes with 1 idle CPU, number of nodes with 2 idle CPUs, …, number of nodes with all CPUs idle, number of nodes with 1 idle GPU, number of nodes with 2 idle GPUs, …, number of nodes with all GPUs idle), where n is the maximum number of CPU cores and m is the maximum number of GPU cards. This vector is appended to the end of each row of the job information matrix to generate the (128, 5 + m + n)-dimensional matrix S that is input to the policy network.
After passing through the policy network, a 128-dimensional vector A, the first vector, is output:

A = [x_1, x_2, x_3, …, x_128]

where x_i represents the priority of the corresponding job in the task queue (the normalization formula again appears only as an image in the original).
Jobs are moved into the running queue according to their priority: the higher the priority, the earlier a job is moved in, until the remaining resources can no longer satisfy the highest-priority job remaining in the task queue. Meanwhile, the state matrix S′ is recorded while waiting for the next scheduling round to start.
Similarly, the matrix S is scheduled by the expert policy to generate the corresponding second vector A_expert. For example, if the current expert training emphasizes job size, the jobs are prioritized according to the resources they occupy: the fewer resources a job occupies, the greater its priority. The vectors A and A_expert are evaluated using the discriminator function to obtain P_general and P_expert, and the Loss value is obtained using formula (3); for example, if P_general = 0.5 and P_expert = 0.5, then Loss = 2 log 2. The discriminator function is updated, and P_general is used as the reward value.
The matrix S, the matrix S′, the vector A, and the reward value P_general are input to the value function network to obtain the advantage function output δ, and after the scheduling of one trajectory is complete, the value function network is updated using formulas (7) and (8).
δ = P_general + γV(S′) − V(S)    (7)

Loss = Σ δ²    (8)
The policy network is then updated with the output advantage function δ using formulas (9), (10), and (11), completing one round of scheduling and learning:
α = π(a|s) / π_old(a|s)    (9)

Loss = −E[ min( α·δ, clip(α, 1 − ε, 1 + ε)·δ ) ]    (10)

∇Loss = −E[ ∇ min( α·δ, clip(α, 1 − ε, 1 + ε)·δ ) ]    (11)

(Formulas (9)-(11) take the same form as formulas (4)-(6), with the discriminator score P_general serving as the reward.)
as can be seen from the foregoing description, the following advantages exist in the embodiments of the present invention:
A new scheduling round is performed on the running cluster whenever a job finishes or 15 minutes have passed since the last scheduling. The embodiment of the invention considers the task queue information and the cluster resource information at the same time: during training, all jobs that have not yet arrived are considered invisible, using a mask function, i.e., a 256-dimensional vector with one dimension per job, in which visible jobs are marked 1 and invisible jobs 0.
The embodiment of the invention can represent all the jobs in the task queue in a (128, 5) matrix. When scheduling, the idle-resource information of each cluster node forms an (n + m)-dimensional vector, i.e., [number of nodes with 1 idle CPU, number of nodes with 2 idle CPUs, …, number of nodes with all CPUs idle, number of nodes with 1 idle GPU, number of nodes with 2 idle GPUs, …, number of nodes with all GPUs idle], where n is the maximum number of CPU cores and m is the maximum number of GPU cards. This vector is appended to the end of each row of the job information matrix to generate the (128, 5 + m + n)-dimensional matrix S input to the policy network.
The embodiment of the invention can also judge the intermediate scheduling process: the matrix is scheduled by the expert policy to generate the corresponding action vector A_expert; the vectors A and A_expert are evaluated using the discriminator function to obtain P_general and P_expert; the discriminator function is updated with the Loss value, and P_general is used as the reward value, i.e., the cluster scheduling optimization objective.
Compared with existing scheduling implementations for high-performance cluster jobs, the embodiment of the invention can save a large amount of time otherwise spent tuning and optimizing the cluster priority function. In addition, the embodiment of the invention can adapt to the distribution of job sets, avoiding the degraded scheduling caused by changes in the job-set distribution, and generates the priority function by combining cluster characteristics with job-set characteristics, so the embodiment can automatically adapt to the expansion of cluster resources.
The various embodiments described herein may be implemented as stand-alone solutions or combined in accordance with inherent logic and are intended to fall within the scope of the present application.
Based on the same technical concept, an embodiment of the present invention provides a job scheduling apparatus, as shown in fig. 4, the apparatus 400 includes: an acquisition unit 401 and a processing unit 402. The division of the modules in the embodiment of the present application is schematic, and is only a logic function division, and there may be another division manner in actual implementation. In addition, functional modules in the embodiments of the present application may be integrated into one processor, may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
An acquisition unit configured to acquire job information of a task queue; the task queue comprises at least one job;
the processing unit is configured to schedule the task queue according to the job information by using the policy network to obtain a first vector, and schedule the task queue according to the job information by using the expert policy to obtain a second vector, where the first vector indicates the scheduling priority of each job in the task queue determined by the policy network and the second vector indicates the scheduling priority of each job in the task queue determined by the expert policy; evaluate the first vector and the second vector through the discriminator function to obtain a first estimation value, the first estimation value indicating the probability that the first vector is an expert trajectory generated by the expert policy; and update the policy network according to the first estimation value and schedule the jobs in the task queue according to the first vector.
In one possible implementation, the job information includes the following information:
the submission time of each job in the task queue; the job duration of each job in the task queue; the number of central processing units occupied by each job in the task queue; the number of graphics processors occupied by each job in the task queue; the size of the memory occupied by each job in the task queue; and an identification of each job in the task queue.
In one possible implementation, the processing unit is specifically configured to:
inputting the first estimation value as a reward value into a value function network corresponding to the policy network to obtain an advantage function corresponding to the policy network;
and updating the policy network according to the advantage function.
Based on the same technical concept, an embodiment of the present invention provides a computer device, which may be a terminal or a server. As shown in fig. 5, the device 500 includes at least one processor 501 and a memory 502 connected to the at least one processor. The embodiment of the present invention does not limit the specific connection medium between the processor 501 and the memory 502; in fig. 5 they are connected through a bus as an example. The bus may be divided into an address bus, a data bus, a control bus, etc.
In the embodiment of the present invention, the memory 502 stores instructions executable by the at least one processor 501, and by executing the instructions stored in the memory 502, the at least one processor 501 can perform the steps included in the method shown in fig. 3.
The processor 501 is the control center of the computer device; it can connect the various parts of the computer device using various interfaces and lines, and it runs or executes the instructions stored in the memory 502 and calls the data stored in the memory 502. Optionally, the processor 501 may include one or more processing units, and may integrate an application processor, which mainly handles the operating system, user interface, application programs, and the like, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may also not be integrated into the processor 501. In some embodiments, the processor 501 and the memory 502 may be implemented on the same chip, or in some embodiments they may be implemented separately on separate chips.
The processor 501 may be a general-purpose processor, such as a Central Processing Unit (CPU), a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof, configured to implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in the processor.
The memory 502, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 502 may include at least one type of storage medium, for example a flash memory, a hard disk, a multimedia card, card-type memory, Random Access Memory (RAM), Static Random Access Memory (SRAM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), magnetic memory, a magnetic disk, an optical disk, and so on. The memory 502 may also be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 502 in embodiments of the present invention may also be circuitry or any other device capable of performing a storage function, for storing program instructions and/or data.
Based on the same inventive concept, embodiments of the present invention provide a computer-readable storage medium storing a computer program executable by a computer device; when the program runs on the computer device, the computer device is caused to perform the steps of the job scheduling method described above.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the invention. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A job scheduling method, comprising:
acquiring job information of a task queue; the task queue comprises at least one job;
scheduling the task queue according to the job information by using a policy network to obtain a first vector, and scheduling the task queue according to the job information by using an expert policy to obtain a second vector; the first vector indicates the scheduling priority of each job in the task queue determined by the policy network, and the second vector indicates the scheduling priority of each job in the task queue determined by the expert policy;
evaluating the first vector and the second vector through a discriminator function to obtain a first estimation value, wherein the first estimation value indicates the probability that the first vector was generated by the expert policy;
and updating the policy network according to the first estimation value, and scheduling the jobs in the task queue according to the first vector.
2. The method of claim 1, wherein the job information comprises the following information:
the submission time of each job in the task queue; the job duration of each job in the task queue; the number of central processing units occupied by each job in the task queue; the number of graphics processors occupied by each job in the task queue; the size of the memory occupied by each job in the task queue; and an identification of each job in the task queue.
3. The method of claim 1 or 2, wherein said updating the policy network according to the first estimation value comprises:
inputting the first estimation value as a reward value into a value function network corresponding to the policy network to obtain an advantage function corresponding to the policy network;
and updating the policy network according to the advantage function.
4. The method of claim 1 or 2, wherein said scheduling jobs in the task queue according to the first vector comprises:
scheduling the jobs in the task queue into a running queue in descending order of priority according to the first vector, until the remaining resources cannot satisfy running the highest-priority job among the jobs remaining in the task queue.
5. The method according to claim 1 or 2, wherein the jobs in the task queue are scheduled again every preset time interval or whenever a job in the task queue completes.
6. A job scheduling apparatus comprising:
an acquisition unit configured to acquire job information of a task queue; the task queue comprises at least one job;
the processing unit is configured to schedule the task queue according to the job information by using a policy network to obtain a first vector, and schedule the task queue according to the job information by using an expert policy to obtain a second vector; the first vector indicates the scheduling priority of each job in the task queue determined by the policy network, and the second vector indicates the scheduling priority of each job in the task queue determined by the expert policy; evaluate the first vector and the second vector through a discriminator function to obtain a first estimation value, wherein the first estimation value indicates the probability that the first vector is an expert trajectory generated by the expert policy; and update the policy network according to the first estimation value, and schedule the jobs in the task queue according to the first vector.
7. The apparatus of claim 6, wherein the job information comprises the following information:
the submission time of each job in the task queue; the job duration of each job in the task queue; the number of central processing units occupied by each job in the task queue; the number of graphics processors occupied by each job in the task queue; the size of the memory occupied by each job in the task queue; and an identification of each job in the task queue.
8. The apparatus according to claim 6 or 7, wherein the processing unit is specifically configured to:
inputting the first estimation value as a reward value into a value function network corresponding to the policy network to obtain an advantage function corresponding to the policy network;
and updating the policy network according to the advantage function.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the job scheduling method according to any one of claims 1-5 when executing the computer program.
10. A computer-readable storage medium, characterized in that the storage medium stores computer instructions which, when executed by a processor, implement the method of any of claims 1 to 5.
CN202111598410.7A 2021-12-24 2021-12-24 Job scheduling method and device Pending CN114489966A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111598410.7A CN114489966A (en) 2021-12-24 2021-12-24 Job scheduling method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111598410.7A CN114489966A (en) 2021-12-24 2021-12-24 Job scheduling method and device

Publications (1)

Publication Number Publication Date
CN114489966A 2022-05-13

Family

ID=81495721

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111598410.7A Pending CN114489966A (en) 2021-12-24 2021-12-24 Job scheduling method and device

Country Status (1)

Country Link
CN (1) CN114489966A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114675975A (en) * 2022-05-24 2022-06-28 新华三人工智能科技有限公司 Job scheduling method, device and equipment based on reinforcement learning


Similar Documents

Publication Publication Date Title
Tuli et al. COSCO: Container orchestration using co-simulation and gradient based optimization for fog computing environments
Tassel et al. A reinforcement learning environment for job-shop scheduling
Guo et al. Cloud resource scheduling with deep reinforcement learning and imitation learning
Hu et al. Spear: Optimized dependency-aware task scheduling with deep reinforcement learning
CN111507768B (en) Potential user determination method and related device
Santosa et al. Cat swarm optimization for clustering
CN112416585B (en) Deep learning-oriented GPU resource management and intelligent scheduling method
WO2022068663A1 (en) Memory allocation method, related device, and computer readable storage medium
CN114237869B (en) Ray double-layer scheduling method and device based on reinforcement learning and electronic equipment
CN114638167B (en) High-performance cluster resource fair allocation method based on multi-agent reinforcement learning
Aoun et al. Hidden markov model classifier for the adaptive particle swarm optimization
CN113902131B (en) Updating method of node model for resisting discrimination propagation in federal learning
CN114675975B (en) Job scheduling method, device and equipment based on reinforcement learning
CN114546608A (en) Task scheduling method based on edge calculation
CN113641481A (en) FPGA task scheduling optimization method and system adopting DQN
CN116932198A (en) Resource scheduling method, device, electronic equipment and readable storage medium
CN116451585A (en) Adaptive real-time learning task scheduling method based on target detection model
CN113608855B (en) Reinforced learning method for placing service function chains in edge calculation
CN114489966A (en) Job scheduling method and device
CN115016938A (en) Calculation graph automatic partitioning method based on reinforcement learning
CN116820730B (en) Task scheduling method, device and storage medium of multi-engine computing system
CN117539612A (en) AI training platform task scheduling method and system based on chaotic sparrow algorithm
US11514359B2 (en) Distributed machine learning device, distributed machine learning method, and distributed machine learning recording medium
CN115794405A (en) Dynamic resource allocation method of big data processing framework based on SSA-XGboost algorithm
Funika et al. Evaluating the use of policy gradient optimization approach for automatic cloud resource provisioning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination