CN117032964A

CN117032964A - Automatic scheduling method and device, system, equipment and storage medium for jobs

Info

Publication number: CN117032964A
Application number: CN202310995952.0A
Authority: CN
Inventors: 邓玲; 杨振东; 杨志芬
Original assignee: China United Network Communications Group Co Ltd
Current assignee: China United Network Communications Group Co Ltd
Priority date: 2023-08-08
Filing date: 2023-08-08
Publication date: 2023-11-10

Abstract

The application discloses an automatic job scheduling method, device, system, equipment and storage medium, relates to the technical field of computers, and solves the problem that existing job scheduling systems are inefficient. The method comprises the following steps: the computing network portal acquires the job information of the job to be executed with the highest priority in the first-level job queue, and sends the computing power resource requirement, the scheduling policy requirement and the used application name in the job information to the computing network brain. The computing network brain receives the computing power resource requirement, the scheduling policy requirement and the used application name of the job to be executed sent by the computing network portal, determines a target computing cluster according to them, and sends the cluster information of the target computing cluster to the computing network portal. The computing network portal receives the cluster information of the target computing cluster sent by the computing network brain and sends the job information to the scheduler corresponding to the target computing cluster, so that the scheduler forwards the job information to the target computing cluster.

Description

Automatic scheduling method and device, system, equipment and storage medium for jobs

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a system, a device, and a storage medium for automatic scheduling of jobs.

Background

The industry currently uses three general indicators to measure the performance of a job scheduling system: first, job throughput, that is, the number of jobs completed per unit time; second, the utilization of computing resources; third, the fairness of job scheduling.

As more computing clusters are built and put into production, the resource utilization of the clusters becomes unbalanced and shows tidal effects across different time periods, and a need for collaborative scheduling among computing clusters begins to appear. In a large computing system formed by a plurality of computing clusters, that is, when the computing power of the clusters is interconnected, the original job scheduling method dispatches a job submitted by a user directly to one computing cluster for execution. If the computing resources of that cluster are insufficient, the job is queued until the idle computing resources of the corresponding queue of the cluster meet the job's running requirements, and only then can the job run on the cluster. While the job is queued, even if the computing resources of other computing clusters are idle, the queued job cannot be rescheduled to the queues of those clusters, which reduces the efficiency of the whole job scheduling system.

Disclosure of Invention

The application provides an automatic job scheduling method, an automatic job scheduling device, an automatic job scheduling system, automatic job scheduling equipment and a storage medium, which are used for solving the problem of low efficiency of a conventional job scheduling system.

In order to achieve the above purpose, the application adopts the following technical scheme:

In a first aspect, the present application provides an automatic job scheduling method applied to a computing network brain, where the computing network brain is deployed in an automatic job scheduling system. The automatic job scheduling system further comprises: a computing network portal, at least one scheduler, and at least one computing cluster. The computing network brain is communicatively connected to the computing network portal and to the at least one scheduler. The computing network portal is communicatively connected to the at least one scheduler. The at least one scheduler is in one-to-one correspondence with the at least one computing cluster, and each scheduler is communicatively connected to its corresponding computing cluster. The automatic job scheduling method comprises the following steps: the computing network brain receives the computing power resource requirement, the scheduling policy requirement and the used application name of a job to be executed, which are sent by the computing network portal. The computing power resource requirement, the scheduling policy requirement and the used application name are acquired by the computing network portal from the job information of the job to be executed with the highest priority in the first-level job queue. The computing network brain determines a target computing cluster according to the computing power resource requirement, the scheduling policy requirement and the used application name. The computing network brain sends the cluster information of the target computing cluster to the computing network portal, so that the computing network portal sends the job information to the scheduler corresponding to the target computing cluster, and the scheduler forwards the job information to the target computing cluster.

In this automatic job scheduling method, after receiving the computing power resource requirement, the scheduling policy requirement and the used application name of the job to be executed sent by the computing network portal, the computing network brain determines the target computing cluster according to them and sends the cluster information of the target computing cluster to the computing network portal, so that the computing network portal sends the job information to the scheduler corresponding to the target computing cluster, which forwards it to the target computing cluster. This avoids the situation in which a job to be executed, once accepted, is submitted directly to a computing cluster whose computing resources are busy and waits in a queue there, while other computing clusters with idle computing resources do not execute it. As a result, computing resource utilization is more balanced among the computing clusters and job queuing time is reduced.

In one possible implementation, the computing network brain determining a target computing cluster according to the computing power resource requirement, the scheduling policy requirement and the used application name comprises: the computing network brain determines at least one candidate computing cluster according to the computing power resource requirement and the used application name. If the number of candidate computing clusters is zero, the computing network brain sends a waiting instruction to the computing network portal. The waiting instruction instructs the computing network portal to store the job information in the first-level job queue. If the number of candidate computing clusters is one, the computing network brain determines that computing cluster as the target computing cluster. If the number of candidate computing clusters is two or more, the computing network brain determines the target computing cluster according to the scheduling policy requirement.
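The three-way branch described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the `Cluster` fields, the `WAIT` marker, and the function names are hypothetical, and the computing power resource requirement is reduced to a single node count for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    # Hypothetical snapshot of one computing cluster's state.
    name: str
    free_nodes: int
    applications: frozenset = frozenset()


WAIT = "WAIT"  # hypothetical marker standing in for the waiting instruction


def candidate_clusters(clusters, nodes_needed, app_name):
    """Clusters that meet the resource requirement and offer the application."""
    return [c for c in clusters
            if c.free_nodes >= nodes_needed and app_name in c.applications]


def determine_target(clusters, nodes_needed, app_name, pick_by_policy):
    """Return the target cluster, or WAIT when no cluster qualifies."""
    candidates = candidate_clusters(clusters, nodes_needed, app_name)
    if not candidates:
        return WAIT                    # portal keeps the job in the first-level queue
    if len(candidates) == 1:
        return candidates[0]           # exactly one match: it is the target
    return pick_by_policy(candidates)  # two or more: the scheduling policy decides
```

Passing the policy in as a callable keeps the filtering step independent of how ties among several candidates are resolved.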

In one possible implementation, the scheduling policy includes: idle computing resources, lowest computing resource utilization, or fewest queued jobs. The computing network brain determining a target computing cluster according to the scheduling policy requirement comprises: the computing network brain matches the scheduling policy requirement against the cluster information of the computing clusters to determine the target computing cluster. The cluster information is acquired from the schedulers and stored by the computing network brain.
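One way to picture the matching step is to treat each scheduling policy as a ranking key over the stored cluster information. The sketch below assumes hypothetical policy names and `ClusterInfo` fields, and it interprets the first policy as "the cluster with the most idle nodes"; the source does not define these details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterInfo:
    # Hypothetical cluster information stored by the brain for each scheduler.
    name: str
    idle_nodes: int
    utilization: float  # fraction of computing resources in use, 0.0 to 1.0
    queued_jobs: int


# Hypothetical policy names mapped to the key used to rank candidate clusters.
POLICIES = {
    "most_idle": lambda c: -c.idle_nodes,       # more idle nodes ranks first
    "least_utilized": lambda c: c.utilization,  # lower utilization ranks first
    "fewest_queued": lambda c: c.queued_jobs,   # shorter queue ranks first
}


def pick_by_policy(candidates, policy):
    """Return the candidate cluster that best matches the requested policy."""
    return min(candidates, key=POLICIES[policy])
```

Because the brain works from stored information, the choice is only as fresh as the last update received from each scheduler.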

In a second aspect, the present application provides an automatic job scheduling method applied to a computing network portal. The computing network portal is deployed in an automatic job scheduling system. The automatic job scheduling method comprises: the computing network portal acquires the job information of the job to be executed with the highest priority in the first-level job queue. The job information includes: the computing power resource requirement, the scheduling policy requirement, and the used application name. The computing network portal sends the computing power resource requirement, the scheduling policy requirement and the used application name to the computing network brain, so that the computing network brain determines a target computing cluster according to them. The computing network portal receives the cluster information of the target computing cluster sent by the computing network brain. The computing network portal sends the job information to the scheduler corresponding to the target computing cluster, so that the scheduler forwards the job information to the target computing cluster.

In the automatic job scheduling method provided by the application, the computing network portal acquires the job information of the job to be executed with the highest priority in the first-level job queue, sends the computing power resource requirement, the scheduling policy requirement and the used application name in the job information to the computing network brain, receives the cluster information of the target computing cluster sent by the computing network brain, and then sends the job information of the job to be executed to the target computing cluster. The user does not need to select a computing cluster and queue manually and only needs to submit the job and the scheduling policy to the computing network portal, so the underlying computing clusters and queue resources are hidden from and transparent to the user. At the same time, the computing network portal enables the computing network brain to determine the target computing cluster queue of the job to be executed according to the resource usage of the computing cluster queues. In a scenario where the computing power of multiple computing clusters is interconnected, this balances the load of scheduled jobs across the different computing clusters and reduces the time jobs spend queuing on the computing clusters.

In one possible implementation, the computing network portal sending the job information to the scheduler corresponding to the target computing cluster, so that the scheduler forwards the job information to the target computing cluster, comprises: the computing network portal sends the job information to the scheduler of the target computing cluster. The scheduler submits the job information to a second-level job queue of the target computing cluster. The target computing cluster processes the job to be executed with the highest priority in the second-level job queue.

In a possible implementation manner, the job automation scheduling method provided by the application further comprises the following steps: the computing portal receives job information of a job to be executed. The computing portal submits the job to be executed to a first level job queue.
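The first-level job queue maintained by the portal behaves like a priority queue. The sketch below is one possible shape for it, assuming that a larger number means a higher priority and that jobs of equal priority are served in submission order; neither convention is stated in the source, and the class and method names are hypothetical.

```python
import heapq
import itertools


class FirstLevelQueue:
    """Priority queue sketch: highest priority first, FIFO among equal priorities."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker that preserves submission order

    def submit(self, job_info, priority):
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._order), job_info))

    def pop_highest(self):
        """Remove and return the job information with the highest priority."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

The counter also prevents Python from ever comparing two job-information objects directly when priorities are equal.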

In a third aspect, the present application provides an automatic job scheduling device applied to a computing network brain, where the computing network brain is deployed in an automatic job scheduling system. The automatic job scheduling system further comprises: a computing network portal, at least one scheduler, and at least one computing cluster. The computing network brain is communicatively connected to the computing network portal and to the at least one scheduler. The computing network portal is communicatively connected to the at least one scheduler. The at least one scheduler is in one-to-one correspondence with the at least one computing cluster, and each scheduler is communicatively connected to its corresponding computing cluster. The automatic job scheduling device comprises: a receiving module, configured to receive the computing power resource requirement, the scheduling policy requirement and the used application name of a job to be executed, which are sent by the computing network portal. The computing power resource requirement, the scheduling policy requirement and the used application name are acquired by the computing network portal from the job information of the job to be executed with the highest priority in the first-level job queue. A determining module, configured to determine a target computing cluster according to the computing power resource requirement, the scheduling policy requirement and the used application name. A sending module, configured to send the cluster information of the target computing cluster to the computing network portal, so that the computing network portal sends the job information to the scheduler corresponding to the target computing cluster, and the scheduler forwards the job information to the target computing cluster.

In one possible implementation, in the automatic job scheduling device provided by the application, the determining module is specifically configured to determine at least one candidate computing cluster according to the computing power resource requirement and the used application name. If the number of candidate computing clusters is zero, a waiting instruction is sent to the computing network portal. The waiting instruction instructs the computing network portal to store the job information in the first-level job queue. If the number of candidate computing clusters is one, that computing cluster is determined as the target computing cluster. If the number of candidate computing clusters is two or more, the target computing cluster is determined according to the scheduling policy requirement.

In one possible implementation, the scheduling policy includes: idle computing resources, lowest computing resource utilization, or fewest queued jobs. The determining module is specifically configured to match the scheduling policy requirement against the cluster information of the computing clusters and determine the target computing cluster. The cluster information is acquired from the schedulers and stored by the computing network brain.

In a fourth aspect, the application also provides an automatic job scheduling device applied to a computing network portal. The computing network portal is deployed in an automatic job scheduling system. The automatic job scheduling device comprises: an acquisition module, a sending module and a receiving module.

The acquisition module is used for acquiring the job information of the job to be executed with the highest priority in the first-stage job queue. The job information includes: computing power resource requirements, scheduling policy requirements, application names used.

The sending module is used for sending the computing power resource requirement, the scheduling policy requirement and the used application name to the computing network brain, so that the computing network brain determines a target computing cluster according to the computing power resource requirement, the scheduling policy requirement and the used application name.

The receiving module is used for receiving the cluster information of the target computing cluster sent by the computing network brain.

The sending module is also used for sending the job information to the scheduler corresponding to the target computing cluster, so that the scheduler forwards the job information to the target computing cluster.

In a possible implementation manner, the sending module is specifically configured to send the job information to a scheduler of the target computing cluster.

The automatic job scheduling device provided by the application further comprises: a submitting module and a processing module.

The submitting module is used for submitting the job information to a second-level job queue of the target computing cluster.

The processing module is used for processing the job to be executed with the highest priority in the second-level job queue.

In a possible implementation manner, the receiving module is specifically configured to receive job information of a job to be executed. The submitting module is also used for submitting the job to be executed to the first-level job queue.

In another aspect, the application also provides an automatic job scheduling system, comprising: a computing network portal, a computing network brain, at least one scheduler, and at least one computing cluster. The computing network portal is communicatively connected to the computing network brain. The computing network portal is communicatively connected to the at least one scheduler. The computing network brain is communicatively connected to the at least one scheduler. The at least one scheduler is in one-to-one correspondence with the at least one computing cluster, and each scheduler is communicatively connected to its corresponding computing cluster. The computing network brain is configured to perform the method of the first aspect or any possible implementation of the first aspect. The computing network portal is configured to perform the automatic job scheduling method of the second aspect or any possible implementation of the second aspect. The scheduler is used for sending the job information to the computing cluster corresponding to the scheduler after receiving the job information sent by the computing network portal. The computing cluster is used for executing the job to be executed according to the job information.

In a fifth aspect, the present application provides automatic job scheduling equipment having the function of implementing the method of the first or second aspect described above. The function may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the function described above.

In a sixth aspect, the present application provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the automatic job scheduling method of the first aspect, the second aspect, or any one of their possible implementations.

For the technical effects of any design of the third to sixth aspects, reference may be made to the technical effects of the corresponding designs of the first and second aspects, which are not repeated here.

For a detailed description of the third to sixth aspects of the present application and their various implementations, reference may be made to the detailed description of the first aspect or the second aspect and their various implementations; likewise, for the advantages of the third and fourth aspects and their various implementations, reference may be made to the analysis of the advantages of the first aspect or the second aspect and their various implementations, which is not repeated here.

These and other aspects of the application will be more readily apparent from the following description.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic diagram of a configuration of an automatic job scheduling system according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of an automatic job scheduling device according to an embodiment of the present application;

FIG. 3 is a schematic flow chart of an automatic job scheduling method according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of an automatic job scheduling device according to an embodiment of the present application;

FIG. 5 is another schematic diagram of an automatic job scheduling device according to an embodiment of the present application;

FIG. 6 is a schematic diagram of another configuration of an automatic job scheduling device according to an embodiment of the present application;

FIG. 7 is another schematic structural diagram of an automatic job scheduling device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application. In the description of the present application, unless otherwise specified, "/" means that the related objects are in an "or" relationship; for example, A/B may mean A or B. The term "and/or" merely describes an association between related objects and indicates that three relationships may exist; for example, A and/or B may indicate three cases: A alone, both A and B, and B alone, where A and B may be singular or plural. Also, in the description of the present application, unless otherwise indicated, "a plurality" means two or more. "At least one of the following" or a similar expression means any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may be single or plural. In addition, to describe the technical solutions of the embodiments clearly, the words "first", "second", etc. are used to distinguish identical or similar items having substantially the same function and effect. Those skilled in the art will appreciate that the words "first", "second", and the like do not limit quantity or order of execution, and do not necessarily indicate that the items differ. Meanwhile, in the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as examples, illustrations or explanations. Any embodiment or design described as "exemplary" or "such as" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion that may be readily understood.

In addition, the network architecture and the service scenario described in the embodiments of the present application are for more clearly describing the technical solution of the embodiments of the present application, and do not constitute a limitation on the technical solution provided by the embodiments of the present application, and as a person of ordinary skill in the art can know, with evolution of the network architecture and appearance of a new service scenario, the technical solution provided by the embodiments of the present application is also applicable to similar technical problems.

For ease of understanding, related art terms related to the present application will be explained first.

Job starvation refers to the phenomenon in which some processes or threads cannot obtain the resources or execution time they need because of unfair resource allocation or an unreasonable scheduling policy. For example, a job has a high demand for computing resources at runtime, and because the existing computing resources cannot meet its running requirements, the job is passed over by the scheduler indefinitely and is never scheduled to run.

A job to be executed refers to a collection of program instances that need to be executed to complete a particular computing task. Jobs run as a set of processes, containers, or other entities on computing nodes. The job submitted by the user may include the following: computing power resource requirements (e.g., the number of computing nodes required to run the job, the number of CPU cores required on each node, the number of GPU cards required on each node, and the disk space required to run the job), the name and version number of the application used, scheduling policy requirements, job name, job parameters, job input files, job submission time, job priority, estimated job run time (job execution time), etc.
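The job information listed above maps naturally onto a small record type. The sketch below is illustrative only: the class names, field names, and units are hypothetical, and only a subset of the listed items is modelled. The `scheduling_request` method returns the three items that the portal forwards to the computing network brain.

```python
from dataclasses import dataclass


@dataclass
class ResourceRequirement:
    # Hypothetical encoding of the computing power resource requirement.
    nodes: int
    cpu_cores_per_node: int
    gpus_per_node: int = 0
    disk_gb: int = 0


@dataclass
class JobInfo:
    job_name: str
    application: str            # name of the application used
    resources: ResourceRequirement
    scheduling_policy: str      # scheduling policy requirement
    priority: int = 0
    version: str = ""
    estimated_runtime_s: int = 0

    def scheduling_request(self):
        """Only the three items that the portal sends to the brain."""
        return {
            "resources": self.resources,
            "scheduling_policy": self.scheduling_policy,
            "application": self.application,
        }
```

The full `JobInfo` stays with the portal until a target cluster is known, at which point it is sent to that cluster's scheduler.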

A job queue holds jobs before they are scheduled: before a job is scheduled by the scheduler, it is placed in a job queue, where it waits for a suitable time to be scheduled and run on the computing cluster managed by the scheduler. After a scheduler receives a job execution request, if the computing resources of the computing cluster queue managed by the scheduler are insufficient, the job remains in the queuing state until the available computing resources of that queue meet its requirements; only then can the job run on the corresponding queue of the computing cluster. Once a job has been allocated the required computing resources and has started running, it cannot be interrupted or migrated; if it is interrupted, it must be restarted. Within the same job queue, each job has a unique job number (job ID) and a corresponding job priority.

A computing cluster is a computer system composed of a group of mutually independent computers interconnected by a high-speed communication network. Each cluster node (i.e., each computer in the computing cluster) is an independent server running its own processes. These processes can communicate with each other, cooperatively provide applications, system resources, and data to users, and be managed as a single system.

A computing cluster may include multiple queues (also called partitions). A queue is a logical group of computing nodes of the same type that perform the same class of computing tasks. Under a globally unified scheduling policy, the queues are isolated from each other, and each queue schedules the computing resources within it independently. Different queues carry out computing tasks of different types and purposes: some queues perform AI model training and testing, some are responsible for AI inference, some are configured with CPUs only, some with both CPUs and GPUs, and some with large-capacity memory. Computing nodes configured with different types of GPU cards may also form different queues.
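The relationship between a cluster and its queues can be illustrated with a small data model in which each partition tracks and allocates its own nodes. The class names, fields, and node-type labels below are hypothetical and are not taken from any particular scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Hypothetical queue (partition): nodes of one type, scheduled independently."""
    name: str
    node_type: str       # e.g. "cpu", "cpu+gpu", "large-memory"
    total_nodes: int
    busy_nodes: int = 0

    def try_allocate(self, nodes: int) -> bool:
        """Reserve nodes from this partition only; other partitions are untouched."""
        if self.total_nodes - self.busy_nodes < nodes:
            return False
        self.busy_nodes += nodes
        return True


@dataclass
class ComputeCluster:
    name: str
    partitions: dict = field(default_factory=dict)

    def add_partition(self, partition: Partition) -> None:
        self.partitions[partition.name] = partition

    def idle_nodes(self, partition_name: str) -> int:
        p = self.partitions[partition_name]
        return p.total_nodes - p.busy_nodes
```

Because allocation is per partition, a job that needs GPU nodes can remain queued even while a CPU-only partition of the same cluster sits idle.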

A scheduler is system software that manages a computing cluster; typically, one scheduler manages one computing cluster. The scheduler is mainly responsible for resource management and job scheduling for each queue in the computing cluster, and allocates computing resources to jobs. The scheduler's scripts are verified and optimized through repeated testing, so that the scheduler can be adapted to different applications and achieve maximum efficiency. Schedulers can currently schedule at the level of individual GPU cards and CPU cores; that is, at the finest granularity, a specific GPU card or CPU core can be scheduled to execute a given job.

With the rapid development of computing power applications and technology, supercomputing centers and intelligent computing centers are being built in many places to meet the rapidly growing demand for computing power. As computing power centers with multiple computing clusters are built in parallel, the demand for computing power scheduling across computing clusters will keep growing as computing power scheduling systems develop. At present, however, when a plurality of computing clusters form one computing system, that is, when the computing power of the clusters is interconnected, the original job scheduling method dispatches a job to be executed submitted by a user directly to one computing cluster. If the resources of that cluster are insufficient, the job is queued until the idle computing resources of the corresponding queue meet its running requirements, and only then is it executed. While the job is queued, even if the idle computing resources of other computing clusters could meet its running requirements, the queued job cannot be rescheduled to the queues of those clusters to wait for execution, which lowers the efficiency of the whole job scheduling system.

Based on the above, the application provides a method, a device, a system, equipment and a storage medium for automatic job scheduling, and the method is applied to an automatic job scheduling system. The automatic job scheduling system comprises: a computing network brain, a computing network portal, at least one scheduler, and at least one computing cluster. The automatic job scheduling method comprises the following steps: the computing network portal acquires the job information of the job to be executed with the highest priority in the first-level job queue, where the job information comprises: the computing power resource requirement, the scheduling policy requirement, and the used application name. The computing network portal sends the computing power resource requirement, the scheduling policy requirement and the used application name to the computing network brain. After receiving them, the computing network brain determines a target computing cluster according to the computing power resource requirement, the scheduling policy requirement and the used application name. The computing network brain sends the cluster information of the target computing cluster to the computing network portal. After receiving the cluster information of the target computing cluster, the computing network portal sends the job information to the scheduler corresponding to the target computing cluster, so that the scheduler forwards the job information to the target computing cluster.

According to the application, the job queues are divided into two levels: the computing network portal maintains and manages the first-level job queue, and each scheduler maintains and manages a second-level job queue. This two-level scheduling method avoids the situation in which a job to be executed is submitted directly to a computing cluster that cannot meet its running requirements and waits in a queue there, while other computing clusters that could meet those requirements do not execute it. Computing resources are therefore used in a more balanced way among the computing clusters, and the queuing time of jobs to be executed is reduced.
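The two-level flow can be simulated in a few lines. The sketch below is a simplified model under stated assumptions: all names are hypothetical, the resource requirement is a node count, only a "fewest queued jobs" policy is modelled, and nodes are reserved at submission time rather than when the job starts running.

```python
import heapq
from collections import deque


class Scheduler:
    """Hypothetical per-cluster scheduler holding the second-level queue."""

    def __init__(self, cluster_name, free_nodes, applications):
        self.cluster_name = cluster_name
        self.free_nodes = free_nodes
        self.applications = set(applications)
        self.second_level = deque()

    def submit(self, job):
        self.second_level.append(job)
        self.free_nodes -= job["nodes"]  # simplification: reserve at submission


def brain_select(schedulers, job):
    """Brain: filter by resources and application, then apply the policy."""
    ok = [s for s in schedulers
          if s.free_nodes >= job["nodes"] and job["app"] in s.applications]
    if not ok:
        return None
    # Only one hypothetical policy is modelled here: fewest queued jobs.
    return min(ok, key=lambda s: len(s.second_level))


def portal_dispatch(first_level, schedulers):
    """Portal: try to place the highest-priority job of the first-level queue once."""
    if not first_level:
        return None
    neg_priority, seq, job = heapq.heappop(first_level)
    target = brain_select(schedulers, job)
    if target is None:
        # Waiting instruction: the job goes back into the first-level queue.
        heapq.heappush(first_level, (neg_priority, seq, job))
        return None
    target.submit(job)
    return target.cluster_name
```

A job that no cluster can currently run stays in the first-level queue instead of being pinned to one cluster's queue, which is the behaviour the two-level design is meant to provide.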

The following describes in detail the implementation of the embodiment of the present application with reference to the drawings.

In one aspect, the solution provided by the present application may be applied to an automatic job scheduling system 100 illustrated in fig. 1, where the system includes: a computing network brain 101, a computing network portal 102, at least one scheduler 103, at least one computing cluster 104.

Wherein the computing network brain 101 is in communication with the computing network portal 102. The computing network brain 101 is communicatively coupled to at least one scheduler 103. The computing portal 102 is in communication with at least one scheduler 103. The scheduler 103 is communicatively coupled to the computing clusters 104. The schedulers 103 are in one-to-one correspondence with the computing clusters 104.

Illustratively, the communication between the computing network brain 101 and the computing network portal 102 is via a control message link. The computing network brain 101 is connected with the scheduler 103 through a control message link. The computing portal 102 and the scheduler 103 are connected by a data transmission link. The scheduler 103 is communicatively coupled to the computing clusters 104.

The computing network brain 101 is configured to receive a computing power resource requirement, a scheduling policy requirement and a used application program name of a job to be executed sent by the computing network portal 102, determine a computing cluster 104 according to the computing power resource requirement, the scheduling policy requirement and the used application program name, and send cluster information of the computing cluster 104 to the computing network portal 102.

The computing network portal 102 is used to maintain and manage the first-stage job queue. Specifically, in the job automation scheduling method provided by the present application, the computing network portal 102 is configured to acquire the job information of the highest-priority job to be executed in the first-stage job queue, and to send the computing power resource requirement, the scheduling policy requirement, and the used application program name in that job information to the computing network brain 101. It then receives the cluster information of the target computing cluster sent by the computing network brain 101, and finally sends the job information to the scheduler 103 corresponding to the computing cluster 104.

Illustratively, the computing network portal 102 is configured to, after receiving the cluster information of the computing cluster 104 submitted by the computing network brain 101, send, through an API interface of a data transmission link between the computing network portal and the target computing cluster, job information of a job to be executed submitted by a user to a scheduler corresponding to the target computing cluster.

The scheduler 103 is configured to receive job information of a job to be executed sent by the computing portal 102, and submit the job information of the job to be executed to a computing cluster 104 corresponding to the scheduler 103.

Illustratively, the scheduler 103 sends the job to be executed to the corresponding file directory of the target queue of the target computing cluster via the SFTP interface of the data transmission link with the target computing cluster.

The computing cluster 104 is configured to execute a job to be executed submitted by the scheduler 103 after receiving job information of the job to be executed.

It should be noted that, the job automation scheduling system 100 illustrated in fig. 1 is merely an illustration of the application scenario of the present application, and is not limited to the application scenario of the present application.

On the other hand, the job automation scheduling method provided by the present application may also be applied to the job automation scheduling apparatus 200 illustrated in fig. 2, where the job automation scheduling apparatus 200 includes: processor 201, memory 202.

The processor 201 may be a central processing unit (central processing unit, CPU), and the processor 201 may also be other general purpose processors, digital signal processors (digital signal processor, DSP), application specific integrated circuits (application specific integrated circuit, ASIC), field-programmable gate arrays (field-programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or a combination thereof.

The memory 202 may be a volatile memory (volatile memory), such as a random-access memory (RAM); or a non-volatile memory (non-volatile memory), such as a read-only memory (ROM), a flash memory (flash memory), a hard disk drive (HDD), or a solid-state drive (SSD); or a combination of the above types of memory. The memory 202 is used for storing application programs, configuration files, data information, or other content with which the methods of the present application can be implemented.

Processor 201 performs the job automation scheduling method provided by the present application by running or executing software programs and/or modules stored in memory 202 and invoking data stored in memory 202.

It should be noted that the above-mentioned automatic job scheduling device illustrated in fig. 2 is merely an illustration of an application scenario of the present application, and is not a limitation of the application scenario of the present application.

On the other hand, the present application provides a job automation scheduling method that can be applied to the above job automation scheduling system. The method is described in detail below through the interaction between the computing network portal and the computing network brain. As shown in fig. 3, the job automation scheduling method disclosed by the present application may include the following steps:

S301, the computing network portal acquires the job information of the highest-priority job to be executed in the first-stage job queue.

The job information includes: the computing power resource requirement, the scheduling policy requirement, and the name of the application program used. The job information may further include a job name, job parameters, a job input file, a job priority, a permitted execution duration, and the like; the present application does not limit the content of the job information.

Specifically, the computing network portal receives the job information of a job to be executed submitted by a user and submits the job to the first-stage job queue. The computing network portal then acquires the job information of the highest-priority job to be executed in the first-stage job queue.
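As an illustration of S301, the first-stage job queue can be sketched as a priority queue held by the computing network portal. This is a minimal sketch under stated assumptions, not the implementation of the application: the class and field names (`JobInfo`, `FirstStageQueue`, `priority`, and so on) are hypothetical, a larger `priority` value is assumed to mean a higher priority, and jobs of equal priority are assumed to be served in submission order.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobInfo:
    """Job information for S301; all field names are illustrative assumptions."""
    name: str
    priority: int                      # assumption: larger value = higher priority
    resource_requirements: dict        # computing power resource requirement
    scheduling_policy: Optional[str]   # None when the user specifies no policy
    application: str                   # name of the application program used


class FirstStageQueue:
    """First-stage job queue maintained by the computing network portal."""

    def __init__(self) -> None:
        self._heap = []
        # Monotonic counter: breaks ties so equal-priority jobs stay FIFO
        # and JobInfo objects themselves are never compared.
        self._counter = itertools.count()

    def submit(self, job: JobInfo) -> None:
        # heapq implements a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-job.priority, next(self._counter), job))

    def pop_highest_priority(self) -> Optional[JobInfo]:
        """Return the highest-priority job, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

A heap keeps both submission and retrieval at logarithmic cost, which matters only if the first-stage queue grows large; a sorted list would serve equally well for small queues.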

S302, the computing network portal sends computing power resource requirements, scheduling policy requirements and used application program names to the computing network brain, so that the computing network brain determines a target computing cluster according to the computing power resource requirements, the scheduling policy requirements and the used application program names.

Specifically, the computing portal sends the computing power resource requirement, the scheduling policy requirement and the used application program name in the job information of the job to be executed to the computing brain through a control message link between the computing portal and the computing brain.
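The point of S302 is that only three items of the job information travel over the control message link, while the remaining fields (input files, parameters, and so on) stay with the computing network portal. The following sketch shows one way to build such a message. The JSON encoding, the dictionary keys, and the names `SchedulingRequest` and `build_scheduling_request` are assumptions made for illustration; the application does not specify a message format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class SchedulingRequest:
    """The three items sent to the computing network brain (names are hypothetical)."""
    resource_requirements: dict
    scheduling_policy: Optional[str]
    application: str


def build_scheduling_request(job_info: dict) -> str:
    """Extract only the fields the computing network brain needs and serialize them."""
    request = SchedulingRequest(
        resource_requirements=job_info["resource_requirements"],
        # The policy is optional: a job may be submitted without one.
        scheduling_policy=job_info.get("scheduling_policy"),
        application=job_info["application"],
    )
    return json.dumps(asdict(request), sort_keys=True)
```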

S303, the computing network brain receives computing power resource requirements, scheduling policy requirements and used application program names of the to-be-executed jobs sent by the computing network portal.

The computing power resource requirement, the scheduling policy requirement, and the used application program name are obtained by the computing network portal from the job information of the highest-priority job to be executed in the first-stage job queue.

Specifically, the computing network brain receives computing power resource requirements, scheduling policy requirements and application names of the to-be-executed jobs sent by the computing network portal through a control message link between the computing network portal and the computing network brain.

S304, the computing network brain determines a target computing cluster according to the computing power resource requirement, the scheduling policy requirement and the used application program name.

Specifically, the computing network brain matches the computing power resource requirements, the scheduling policy requirements and the application program names used for the job to be executed with the application program list deployed in each queue in the computing clusters stored in the computing network brain, and determines the target computing cluster.

Illustratively, the process of acquiring the application manifest deployed for each queue in the computing cluster stored in the computing network brain may be: the computing network brain collects and gathers the job running state, resource use condition and application program related information of each queue in each computing cluster by means of regular automatic inquiry (such as inquiring each computing cluster once every 30 seconds) and manual configuration (such as manually registering the application program names deployed on each computing cluster queue by maintainers).
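The combination of periodic automatic inquiry and manual configuration described above can be sketched as a merge of two data sources. In this hedged sketch, `poll_cluster` stands in for whatever query the computing network brain issues to a cluster, and `registered_apps` stands in for the manually registered application names; both, like the record keys, are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

QueueKey = Tuple[str, str]  # (cluster name, queue name)

POLL_INTERVAL_SECONDS = 30  # the example interval mentioned in the text


def refresh_inventory(
    poll_cluster: Callable[[str], Dict[str, dict]],
    clusters: List[str],
    registered_apps: Dict[QueueKey, Set[str]],
) -> Dict[QueueKey, dict]:
    """Merge polled queue status with manually registered application names."""
    inventory: Dict[QueueKey, dict] = {}
    for cluster in clusters:
        # poll_cluster returns {queue name: status record} for one cluster.
        for queue, status in poll_cluster(cluster).items():
            record = dict(status)  # copy so the polled data is not mutated
            # Queues with no manual registration get an empty application set.
            record["applications"] = set(registered_apps.get((cluster, queue), set()))
            inventory[(cluster, queue)] = record
    return inventory
```

A caller would invoke `refresh_inventory` once every `POLL_INTERVAL_SECONDS`, for example from a timer thread, and replace the stored inventory with the returned one.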

For example, Table 1 shows the application list deployed for each queue in the computing clusters. The job running status, resource usage, and deployed-application information of each queue may include: the status (UP or DOWN) of the queue, the applications deployed on the queue, the number of running jobs, the number of queued jobs, the total and available numbers of computing nodes, the total and available numbers of CPU cores, the number of available GPU cards, the amount of available memory, the disk capacity, the available disk space, and so on. The computing network brain maintains the application list deployed for each queue in each computing cluster, so that jobs submitted by users can be dispatched to a queue on which the corresponding application is deployed.

Table 1: List of applications deployed for each queue in the computing clusters

Further, after receiving the computing power resource requirement, the scheduling policy requirement, and the application program name of the job to be executed from the computing network portal, the computing network brain matches the computing power resource requirement and the application program name against Table 1 to determine the candidate computing clusters, and then determines the target computing cluster from the candidates. If the number of candidate computing clusters is zero, the computing network brain sends a waiting instruction to the computing network portal; the waiting instruction instructs the computing network portal to keep the job information in the first-stage job queue. If the number of candidate computing clusters is one, the computing network brain determines that computing cluster as the target computing cluster. If the number of candidate computing clusters is two or more, the computing network brain determines the target computing cluster according to the scheduling policy requirement.
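The three-way decision above (zero, one, or several candidates) can be sketched as follows. This is an illustrative sketch only: the record fields of `QueueStatus` and `Requirements` are hypothetical stand-ins for the columns of Table 1, candidates are modeled as (cluster, queue) entries, returning `None` represents sending the waiting instruction, and `choose` is a placeholder for whatever policy-based selection is applied when several candidates remain.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class QueueStatus:
    """One row of the status table; field names are assumptions."""
    cluster: str
    queue: str
    up: bool
    applications: Set[str]
    free_nodes: int
    free_cpu_cores: int
    free_gpus: int
    free_memory_gb: int
    free_disk_gb: int


@dataclass
class Requirements:
    """Computing power resource requirement of one job."""
    nodes: int
    cpu_cores: int
    gpus: int
    memory_gb: int
    disk_gb: int


def find_candidates(table: List[QueueStatus], req: Requirements, application: str) -> List[QueueStatus]:
    """Keep queues that are UP, have the application deployed, and have enough free resources."""
    return [
        q for q in table
        if q.up
        and application in q.applications
        and q.free_nodes >= req.nodes
        and q.free_cpu_cores >= req.cpu_cores
        and q.free_gpus >= req.gpus
        and q.free_memory_gb >= req.memory_gb
        and q.free_disk_gb >= req.disk_gb
    ]


def determine_target(
    table: List[QueueStatus],
    req: Requirements,
    application: str,
    choose: Callable[[List[QueueStatus]], QueueStatus],
) -> Optional[QueueStatus]:
    """Return the target entry, or None when a waiting instruction should be sent."""
    candidates = find_candidates(table, req, application)
    if not candidates:
        return None               # zero candidates: job stays in the first-stage queue
    if len(candidates) == 1:
        return candidates[0]      # exactly one candidate: it is the target
    return choose(candidates)     # two or more: apply the scheduling policy
```

Separating `find_candidates` from `choose` mirrors the text: resource and application matching is a hard filter, while the scheduling policy only ranks the entries that pass it.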

For example, the computing network brain receives a job to be executed, JOB1, submitted by the computing network portal. The computing power resource requirement of JOB1 is 1 computing node, a 4-core CPU, 2 GPU cards, 128 GB of memory, and 1 TB of disk space, and the application program it uses is d. The computing network brain matches the computing power resource requirement and the used application program name of JOB1 against Table 1 stored in the computing network brain and determines that application d is currently deployed on queues Q4 and Q5, but that the computing power resources currently available on queues Q4 and Q5 cannot meet the requirement of JOB1. The number of target computing clusters is therefore zero. The computing network brain sends a waiting instruction to the computing network portal, instructing it to keep the job information of JOB1 queued in the first-stage job queue, and a timed query mechanism is started. Once computing power resources on queue Q4 or queue Q5 of computing cluster B are released and meet the requirement of JOB1, JOB1 is submitted to that queue for execution.
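The timed query mechanism in the JOB1 example can be sketched as a bounded polling loop. The text does not say which component runs the timer, how often it fires, or whether it ever gives up, so everything here is an assumption: `query_brain` is a hypothetical callable that returns a target identifier or `None`, and `max_attempts` is added only so the sketch terminates.

```python
import time
from typing import Callable, Optional


def wait_for_target(
    query_brain: Callable[[], Optional[str]],
    interval_seconds: float,
    max_attempts: int,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Re-query until a target is available or the attempts are exhausted.

    The job is assumed to remain in the first-stage job queue while this runs.
    """
    for attempt in range(max_attempts):
        target = query_brain()
        if target is not None:
            return target  # resources were released and now meet the requirement
        if attempt < max_attempts - 1:
            sleep(interval_seconds)  # wait before the next timed query
    return None
```

The `sleep` parameter is injectable so the loop can be exercised without real delays.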

As another example, the computing network brain receives a job to be executed, JOB2, submitted by the computing network portal. The computing power resource requirement of JOB2 is 1 computing node, a 2-core CPU, 128 GB of memory, and 1 TB of disk space, and the application program it uses is c. The computing network brain matches the computing power resource requirement and the used application program name of JOB2 against Table 1 stored in the computing network brain and determines that application c is currently deployed on queues Q1, Q4, Q5, and Q6. Because JOB2 does not specify a scheduling policy requirement, the computing network brain selects queue Q1 of computing cluster A as the target according to the principle of most available CPU computing resources.
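The policy-based choice among several candidates can be sketched as a lookup of ranking functions. The three policy names below paraphrase the policies listed in the claims (most idle resources, lowest utilization, fewest queued jobs), and the default follows the JOB2 example (most available CPU resources when no policy is given). The identifiers, the use of CPU cores as the only resource measure, and the sample numbers in the usage note are assumptions for illustration and are not taken from Table 1.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Candidate:
    """A queue that already passed resource and application matching."""
    cluster: str
    queue: str
    free_cpu_cores: int
    total_cpu_cores: int   # assumed to be greater than zero
    queued_jobs: int


# Each policy maps a candidate to a score; the lowest score wins.
POLICIES: Dict[str, Callable[[Candidate], float]] = {
    "most_idle": lambda c: -c.free_cpu_cores,
    "lowest_utilization": lambda c: 1 - c.free_cpu_cores / c.total_cpu_cores,
    "fewest_queued": lambda c: c.queued_jobs,
}

DEFAULT_POLICY = "most_idle"  # used when the job specifies no policy


def choose_target(candidates: List[Candidate], policy: Optional[str] = None) -> Candidate:
    """Pick the best candidate under the given policy (KeyError on unknown policy)."""
    score = POLICIES[policy or DEFAULT_POLICY]
    return min(candidates, key=score)
```

With hypothetical candidates `Candidate("A", "Q1", 32, 64, 5)`, `Candidate("B", "Q4", 16, 16, 0)`, and `Candidate("C", "Q6", 8, 64, 2)`, the default policy selects Q1, while `"lowest_utilization"` and `"fewest_queued"` both select Q4.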

S305, the computing network brain sends the cluster information of the target computing cluster to the computing network portal, so that the computing network portal sends the job information to the scheduler corresponding to the target computing cluster, and the scheduler forwards the job information to the target computing cluster.

The cluster information of the computing cluster may be number information of the computing cluster or number information of a scheduler corresponding to the computing cluster.

S306, the computing network portal receives the cluster information of the target computing cluster sent by the computing network brain.

S307, the computing portal sends the job information to the scheduler corresponding to the target computing cluster, so that the target scheduler forwards the job information to the target computing cluster.

Specifically, the computing portal sends the job information of the job to be executed to a scheduler corresponding to the target computing cluster, and then the job information is forwarded to the target computing cluster by the target scheduler. After the computing cluster receives the job information, the computing cluster dispatches the corresponding computing power resources according to the computing power resource requirements, the dispatching strategy requirements and the used application program names in the job information so as to execute the job to be executed.
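Steps S306 and S307 rely on the one-to-one correspondence between schedulers and computing clusters: the cluster information returned by the computing network brain is enough to locate the scheduler. The sketch below models this with a dictionary keyed by cluster identifier and gives each scheduler a simple second-stage queue from which the highest-priority job is forwarded. The class `Scheduler`, the function `dispatch`, and the `priority` key are hypothetical; the actual transport (API or SFTP) is deliberately left out.

```python
from typing import Dict, List, Optional


class Scheduler:
    """Scheduler attached to exactly one computing cluster."""

    def __init__(self, cluster_id: str) -> None:
        self.cluster_id = cluster_id
        self.second_stage_queue: List[dict] = []

    def receive(self, job_info: dict) -> None:
        """Accept job information sent by the computing network portal."""
        self.second_stage_queue.append(job_info)

    def forward_next(self) -> Optional[dict]:
        """Remove and return the highest-priority job, or None if the queue is empty."""
        if not self.second_stage_queue:
            return None
        # max() returns the first maximal element, so ties stay in arrival order.
        job = max(self.second_stage_queue, key=lambda j: j["priority"])
        self.second_stage_queue.remove(job)
        return job


def dispatch(job_info: dict, cluster_id: str, schedulers: Dict[str, Scheduler]) -> Scheduler:
    """Send job information to the scheduler that corresponds to the target cluster."""
    scheduler = schedulers[cluster_id]  # one scheduler per cluster
    scheduler.receive(job_info)
    return scheduler
```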

The above description has been presented with respect to the solution provided by the embodiment of the present application mainly from the point of view of the working principle of the device. It is to be appreciated that the computing device, in order to implement the functionality described above, includes corresponding hardware structures and/or software modules that perform the various functions. Those of skill in the art will readily appreciate that the various illustrative algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The embodiment of the application can divide the functional modules of the computing device according to the method example, for example, each functional module can be divided corresponding to each function, or two or more functions can be integrated in one processing module. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present application, the division of the modules is schematic, which is merely a logic function division, and other division manners may be implemented in actual implementation.

Fig. 4 shows a possible composition diagram of a job automation scheduler in the above-described embodiment in the case of dividing the respective functional modules with the respective functions. As shown in fig. 4, the job automation scheduling device 400 may include: a receiving module 401, a determining module 402, and a transmitting module 403.

Wherein, the receiving module 401 is configured to support the job automation scheduling device 400 to execute S303 of the job automation scheduling method shown in fig. 3.

A determining module 402, configured to support the job automation scheduling device 400 to execute S304 of the job automation scheduling method shown in fig. 3.

A sending module 403, configured to support the job automation scheduling device 400 to execute S305 of the job automation scheduling method shown in fig. 3.

It should be noted that, all relevant contents of each step related to the above method embodiment may be cited to the functional description of the corresponding functional module, which is not described herein.

The job automation scheduling device 400 provided by the embodiment of the present application is used for executing the job automation scheduling method, so that the same effects as those of the job automation scheduling method can be achieved.

Fig. 5 shows another possible composition diagram of a job automation scheduler in accordance with the above embodiment in the case of dividing the respective functional modules with the respective functions. As shown in fig. 5, the job automation scheduling device 500 may include: an acquisition module 501, a transmission module 502 and a reception module 503.

Wherein, the obtaining module 501 is configured to support the job automation scheduling device 500 to execute S301 of the job automation scheduling method shown in fig. 3.

A sending module 502, configured to support the job automation scheduling device 500 to execute S302 or S307 of the job automation scheduling method shown in fig. 3.

A receiving module 503, configured to support the job automation scheduling device 500 to execute S306 of the job automation scheduling method shown in fig. 3.

Further, as shown in fig. 6, the job automation scheduling device 500 provided in the embodiment of the present application may further include: a submit module 504, a process module 505.

Wherein, the submitting module 504 is configured to support the job automation scheduling device 500 to execute the step of sending job information to the scheduler of the target computing cluster, or the step of submitting a job to be executed to the first-stage job queue, in the job automation scheduling method shown in fig. 3.

A processing module 505, configured to support the job automation scheduling device 500 to perform the step of submitting job information to the second-stage job queue of the target computing cluster in the job automation scheduling method shown in fig. 3.

The job automation scheduling device 500 provided in the embodiment of the present application is used for executing the job automation scheduling method, so that the same effects as those of the job automation scheduling method can be achieved.

On the other hand, the embodiment of the present application further provides a job automation scheduling device, as shown in fig. 7, where the job automation scheduling device 700 may include a memory 701, a processor 702, and a transceiver 703, where the memory 701 and the processor 702 may be connected by a bus or a network or other manners, and in fig. 7, the connection is exemplified by a bus.

The processor 702 may be a central processing unit (central processing unit, CPU). The processor 702 may also be other general purpose processors, digital signal processors (digital signal processor, DSP), application specific integrated circuits (application specific integrated circuit, ASIC), field programmable gate arrays (field programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or combinations thereof.

The processor 702 is configured to perform the job automation scheduling method provided by the present application.

The memory 701 may be a volatile memory (volatile memory), such as a random-access memory (RAM); or a non-volatile memory (non-volatile memory), such as a read-only memory (ROM), a flash memory (flash memory), a hard disk drive (HDD), or a solid-state drive (SSD); or a combination of the above types of memory. The memory 701 is used for storing application code, configuration files, data information, or other content with which the methods of the present application can be implemented.

The memory 701 serves as a non-transitory computer readable storage medium that may be used to store non-transitory software programs, non-transitory computer executable programs, and modules. The processor 702 executes various functional applications of the processor and data processing by running non-transitory software programs, instructions, and modules stored in the memory 701.

Memory 701 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data created by the processor 702, etc. In addition, memory 701 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 701 may optionally include memory remotely located relative to processor 702, such remote memory being connectable to processor 702 through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The memory 701 is used for storing the job information of jobs to be executed and the cluster information of computing clusters that are used in implementing the method of the present application.

The transceiver 703 is used for information interaction of the job automation scheduling device 700 with other devices.

The one or more modules are stored in the memory 701 and when executed by the processor 702 perform the functions of a client in the job automation scheduling method in the embodiment shown in fig. 3.

The embodiment of the present application also provides a computer-readable storage medium storing instructions which, when executed, perform the job automation scheduling method and the related steps of the above method embodiments.

It will be apparent to those skilled in the art from this description that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules to perform all or part of the functions described above.

In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another apparatus, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.

The units described as separate parts may or may not be physically separate, and the parts displayed as units may be one physical unit or a plurality of physical units, may be located in one place, or may be distributed in a plurality of different places. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application, or a contributing part or all or part of the technical solution, may be embodied in the form of a software product, where the software product is stored in a storage medium, and includes several instructions for causing a device (may be a single-chip microcomputer, a chip or the like) or a processor (processor) to execute all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read Only Memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

The foregoing is merely illustrative of specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A job automation scheduling method, applied to a computing network brain, the computing network brain being deployed in a job automation scheduling system; the job automation scheduling system further includes: a computing network portal, at least one scheduler, and at least one computing cluster; the computing network brain is in communication connection with the computing network portal; the computing network brain is in communication connection with the at least one scheduler; the computing network portal is in communication connection with the at least one scheduler; the at least one scheduler is in one-to-one correspondence with the at least one computing cluster; the scheduler is in communication connection with the computing cluster; the method is characterized by comprising the following steps:

the computing network brain receives the computing power resource requirement, the scheduling policy requirement, and the used application program name of a job to be executed, sent by the computing network portal; the computing power resource requirement, the scheduling policy requirement, and the used application program name are obtained by the computing network portal from the job information of the highest-priority job to be executed in the first-stage job queue;

the computing network brain determines a target computing cluster according to the computing power resource requirement, the scheduling policy requirement, and the used application program name;

the computing network brain sends the cluster information of the target computing cluster to the computing network portal, so that the computing network portal sends the job information to the scheduler corresponding to the target computing cluster, and the scheduler forwards the job information to the target computing cluster.

2. The method of claim 1, wherein the computing network brain determining a target computing cluster from the computing power resource requirements, the scheduling policy requirements, and the application names used, comprises:

the computing network brain determines at least one computing cluster according to the computing power resource requirement and the used application program name;

if the number of the at least one computing cluster is zero, the computing network brain sends a waiting instruction to the computing network portal; the waiting instruction is used for indicating the computing portal to store the job information to the first-stage queue;

if the number of the at least one computing cluster is one, the computing network brain determines that the computing cluster is the target cluster;

And if the number of the at least one computing cluster is more than or equal to two, the computing network brain determines the target computing cluster according to the scheduling policy requirement.

3. The method of claim 2, wherein the scheduling policy comprises: the most idle computing resources, the lowest computing resource utilization, or the fewest queued jobs; and the computing network brain determining the target computing cluster according to the scheduling policy requirement comprises:

the computing network brain matches the scheduling policy requirement with cluster information of a computing cluster to determine the target computing cluster; the cluster information is acquired and stored by the computing network brain from the scheduler.

4. A job automation scheduling method, applied to a computing network portal; the computing network portal is deployed in a job automation scheduling system; the method is characterized by comprising the following steps:

the computing network portal acquires the job information of the job to be executed with the highest priority in the first-stage job queue; the job information includes: computing power resource requirements, scheduling policy requirements, and application names used;

the computing network portal sends the computing power resource requirement, the scheduling policy requirement and the used application program name to the computing network brain so that the computing network brain determines a target computing cluster according to the computing power resource requirement, the scheduling policy requirement and the used application program name;

The computing network portal receives the cluster information of the target computing cluster sent by the computing network brain;

and the computing network portal sends the job information to a scheduler corresponding to the target computing cluster, so that the target scheduler forwards the job information to the target computing cluster.

5. The method of claim 4, wherein the computing portal sending the job information to a scheduler corresponding to the target computing cluster to cause the target scheduler to forward the job information to the target computing cluster, comprising:

the computing network portal sends the job information to a dispatcher of the target computing cluster;

the scheduler submits the job information to a second job queue of the target computing cluster;

and the target computing cluster processes the job to be executed with the highest priority in the second job queue.
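Both the first-stage job queue and the second job queue are served in priority order. A minimal sketch of such a queue is shown below; the class name `JobQueue`, the convention that a larger number means higher priority, and the first-in-first-out tie-breaking are assumptions, since the claims do not specify them.

```python
import heapq
import itertools
from typing import Any, List, Tuple


class JobQueue:
    """Priority queue of jobs; larger priority values are served first (assumed convention)."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        self._counter = itertools.count()  # submission order, used to break ties

    def submit(self, job_info: Any, priority: int) -> None:
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), job_info))

    def pop_highest_priority(self) -> Any:
        if not self._heap:
            raise IndexError("job queue is empty")
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

The counter guarantees that two jobs with equal priority are never compared directly, so the job information can be any object.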

6. The method according to claim 4, wherein the method further comprises:

the computing network portal receives the job information of the job to be executed;

the computing network portal submits the job to be executed to the first-stage job queue.

7. A job automation scheduling device, applied to a computing network brain, the computing network brain being deployed in a job automation scheduling system; the job automation scheduling system further includes: a computing network portal, at least one scheduler, and at least one computing cluster; the computing network brain is in communication connection with the computing network portal; the computing network brain is in communication connection with the at least one scheduler; the computing network portal is in communication connection with the at least one scheduler; the at least one scheduler is in one-to-one correspondence with the at least one computing cluster; the scheduler is in communication connection with the computing cluster; the job automation scheduling device is characterized by comprising:

the receiving module is used for receiving the computing power resource requirement, the scheduling policy requirement and the used application program name of the job to be executed, which are sent by the computing network portal; the computing power resource requirement, the scheduling policy requirement and the used application program name are obtained by the computing network portal from the job information of the job to be executed with the highest priority in the first-stage job queue;

the determining module is used for determining a target computing cluster according to the computing power resource requirement, the scheduling policy requirement and the used application program name;

and the sending module is used for sending the cluster information of the target computing cluster to the computing network portal, so that the computing network portal sends the job information to a scheduler corresponding to the target computing cluster, and the scheduler forwards the job information to the target computing cluster.

8. The apparatus of claim 7, wherein

the determining module is specifically configured to determine at least one computing cluster according to the computing power resource requirement and the used application program name; if the number of the at least one computing cluster is zero, send a waiting instruction to the computing network portal, the waiting instruction being used for instructing the computing network portal to store the job information in the first-stage job queue; if the number of the at least one computing cluster is one, determine that this computing cluster is the target computing cluster; and if the number of the at least one computing cluster is greater than or equal to two, determine the target computing cluster according to the scheduling policy requirement.
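The three-way branch performed by the determining module can be sketched as below. The `Cluster` fields, the use of free core count as the measure of computing power, the `"wait"` and `"target"` result tags, and the `pick_by_policy` callback are assumptions introduced for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set, Tuple


@dataclass
class Cluster:
    """Hypothetical cluster record; computing power is modeled as free cores."""
    name: str
    free_cores: int
    applications: Set[str] = field(default_factory=set)


def determine_target(
    required_cores: int,
    application: str,
    clusters: List[Cluster],
    pick_by_policy: Callable[[List[Cluster]], Cluster],
) -> Tuple[str, Optional[Cluster]]:
    """Filter by resources and application, then branch on the candidate count."""
    candidates = [
        c for c in clusters
        if c.free_cores >= required_cores and application in c.applications
    ]
    if not candidates:
        # Zero candidates: the portal is told to keep the job in the first-stage queue.
        return ("wait", None)
    if len(candidates) == 1:
        # Exactly one candidate: it is the target, no policy needed.
        return ("target", candidates[0])
    # Two or more candidates: the scheduling policy requirement decides.
    return ("target", pick_by_policy(candidates))
```

The policy is injected as a callback so that the filtering step and the policy step stay separate, mirroring the order in which the claim applies them.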

9. The apparatus of claim 8, wherein the scheduling policy comprises: the computing resources are idle, or the computing resource utilization rate is the lowest, or the number of queued jobs is the smallest;

the determining module is specifically configured to match the scheduling policy requirement with cluster information of a computing cluster, and determine the target computing cluster; the cluster information is acquired and stored by the computing network brain from the scheduler.

10. A job automation scheduling device, applied to a computing network portal; the computing network portal is deployed in a job automation scheduling system; the job automation scheduling device is characterized by comprising:

the acquisition module is used for acquiring the job information of the job to be executed with the highest priority in the first-stage job queue; the job information includes: a computing power resource requirement, a scheduling policy requirement, and a used application program name;

the sending module is used for sending the computing power resource requirement, the scheduling policy requirement and the used application program name to the computing network brain so that the computing network brain determines a target computing cluster according to the computing power resource requirement, the scheduling policy requirement and the used application program name;

the receiving module is used for receiving the cluster information of the target computing cluster sent by the computing network brain;

the sending module is further configured to send the job information to a scheduler corresponding to the target computing cluster, so that the scheduler forwards the job information to the target computing cluster.

11. The apparatus of claim 10, wherein

the sending module is specifically configured to send the job information to a scheduler of the target computing cluster;

the job automation scheduling device further includes:

the submitting module is used for submitting the job information to a second job queue of the target computing cluster.

12. The apparatus of claim 10, wherein

the receiving module is specifically used for receiving the job information of the job to be executed;

and the submitting module is also used for submitting the job to be executed to the first-stage job queue.

13. A job automation scheduling system, comprising: a computing network portal, a computing network brain, at least one scheduler, and at least one computing cluster; the computing network portal is in communication connection with the computing network brain; the computing network portal is in communication connection with the at least one scheduler; the computing network brain is in communication connection with the at least one scheduler; the at least one scheduler is in one-to-one correspondence with the at least one computing cluster; the scheduler is in communication connection with the computing cluster; the computing network brain is used for performing the job automation scheduling method of any one of claims 1-3; the computing network portal is used for performing the job automation scheduling method of any one of claims 4-6; the scheduler is used for sending the job information to the computing cluster corresponding to the scheduler after receiving the job information sent by the computing network portal; the computing cluster is used for executing the job to be executed according to the job information.

14. An automatic job scheduling apparatus, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the job automation scheduling method of any one of claims 1-3 or any one of claims 4-6.

15. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the job automation scheduling method of any one of claims 1-3 or any one of claims 4-6.