CN110795241B

CN110795241B - Job scheduling management method, scheduling center and system

Info

Publication number: CN110795241B
Application number: CN201910994082.9A
Authority: CN
Inventors: 毛登峰; 杨昆; 陈健
Original assignee: Beijing Paratera Technology Co ltd
Current assignee: Beijing Paratera Technology Co ltd
Priority date: 2019-10-18
Filing date: 2019-10-18
Publication date: 2022-07-19
Anticipated expiration: 2039-10-18
Also published as: CN110795241A

Abstract

The invention discloses a job scheduling management method, which is suitable for being executed in a scheduling center, wherein the scheduling center is respectively connected with a client and a computing cluster, the computing cluster is provided with at least one computing node, each computing node is provided with at least one GPU display card, and the method comprises the following steps: receiving a job submitting instruction sent by a client, wherein the instruction comprises the calculation requirement of the submitted job, and specifically comprises the number m of calculation nodes and the number n of GPU video cards of each calculation node; acquiring current idle resources of the computing cluster, and selecting computing resources matched with the computing requirements from the current idle resources; and distributing the matched computing resources to the operation, and acquiring and correlating and storing the operation identification of the operation, the distributed operation computing node identification and the GPU video card serial number used for running the operation in the operation computing node before running the operation so as to be convenient for subsequent inquiry and use. The invention also discloses a corresponding dispatching center and a system.

Description

Job scheduling management method, scheduling center and system

Technical Field

The invention relates to the technical field of internet and computers, in particular to a job scheduling management method, a scheduling center and a job scheduling management system.

Background

Because a single computer provides limited computing power, when a computing task with a large computing requirement needs to be processed, a cluster is usually used for computing. A cluster is a supercomputer consisting of multiple computers interconnected by a high-speed network. The GPU is also a kind of computing resource, and currently, computing operations related to artificial intelligence and machine learning are generally performed by using the GPU. Usually, each node on the cluster configured with GPU resources is installed with multiple GPU graphics cards, for example, 8 GPU graphics cards or more, so the total number of GPU graphics cards in the cluster is very large. After a user submits a job, the job scheduling system allocates the resources in the idle resources according to the resources required by the job, the allocation is random, and finally, the nodes for running the job and the GPU cards are random.

However, both cluster administrators and job submitters have a need to know explicitly on which GPU card of which node a job runs. A cluster administrator usually monitors the use status of various resources inside a cluster, so that when abnormal use of resources occurs, specific hardware resources can be quickly located according to jobs. The job submitter usually needs to know the specific condition of the resource usage and determine whether the job is abnormal. Or when the operation is abnormally debugged, the actual performance of the operation of the resource needs to be checked. Accordingly, there is a need to provide a method that can quickly locate computing resources for running a job.

Disclosure of Invention

To this end, the present invention provides a job scheduling management method, a scheduling center and a system in an attempt to solve or at least alleviate at least one of the problems presented above.

According to an aspect of the present invention, there is provided a job scheduling management method, adapted to be executed in a scheduling center, where the scheduling center is respectively connected to a client and a computing cluster, and the computing cluster has at least one computing node, and each computing node has at least one GPU graphics card, the method includes: receiving a job submitting instruction sent by a client, wherein the job submitting instruction comprises calculation requirements of submitted jobs, and the calculation requirements comprise the number m of calculation nodes and the number n of GPU video cards of each calculation node; acquiring current idle resources of the computing cluster, and selecting computing resources matched with the computing requirements from the current idle resources, wherein the computing resources are provided with m computing nodes, and the number of idle GPU video cards of each computing node is more than or equal to n; and distributing the matched computing resources to the operation, and acquiring and correlating and storing the operation identification of the operation, the distributed operation computing node identification and the GPU video card serial number used for running the operation in the operation computing node before running the operation so as to be convenient for subsequent inquiry and use.

Alternatively, in the JOB scheduling management method according to the present invention, the JOB identification is obtained by querying the environment variable SLURM _ JOB _ ID of the JOB computing node.

Alternatively, in the job scheduling management method according to the present invention, the job computing node identification is obtained by querying a hostname command of the job computing node.

Optionally, in the job scheduling management method according to the present invention, after the GPU is correctly installed, the compute node may generate a corresponding device file, where the device file includes a serial number of a graphics card of each GPU.

Optionally, in the job scheduling management method according to the present invention, the GPU video card serial number of each job compute node is obtained by querying the environment variable CUDA _ VISIBLE _ deviceof the job compute node.

Optionally, in the job scheduling management method according to the present invention, the scheduling center is further connected to a performance monitoring center, the performance monitoring center is configured to monitor a current idle resource of the computing cluster in real time, and the step of obtaining the current idle resource of the computing cluster includes: and sending an idle resource query request to the performance monitoring center, and receiving the current idle resource of the computing cluster returned by the performance monitoring center.

Optionally, in the job scheduling management method according to the present invention, the method further includes: the job is run in the allocated computing resources and the allocated computing resources are released after the job run ends.

According to another aspect of the present invention, there is provided a scheduling center for executing a job scheduling management method, the scheduling center being connected to a client and a computing cluster, respectively, the computing cluster having at least one computing node, each computing node having at least one GPU graphics card, the apparatus comprising: the instruction receiving module is suitable for receiving a job submitting instruction sent by a client, wherein the job submitting instruction comprises calculation requirements of submitted jobs, and the calculation requirements comprise the number m of calculation nodes and the number n of GPU video cards of each calculation node; the resource allocation module is suitable for acquiring current idle resources of the computing cluster and selecting computing resources matched with the computing requirements from the current idle resources, wherein the computing resources are provided with m computing nodes, and the number of idle GPU display cards of each computing node is more than or equal to n; and the resource recording module is suitable for distributing the matched computing resources to the operation, and acquiring and correlating and storing the operation identification of the operation, the distributed operation computing node identification and the GPU video card serial number used for running the operation in the operation computing node before the operation is run so as to be convenient for subsequent inquiry and use.

Optionally, in the dispatching center according to the present invention, the JOB identification is obtained by querying an environment variable SLURM _ JOB _ ID of the JOB computing node; the job computing node identification is obtained by inquiring a hostname command of the job computing node; the GPU graphics card serial number of each job compute node is obtained by querying the environment variable CUDA _ VISIBLE _ DEVICES of that job compute node.

According to still another aspect of the present invention, there is also provided a job scheduling management system including: the dispatch center as described above; the client is suitable for responding to a job submitting request of a user and sending a job submitting instruction to the scheduling center, wherein the job submitting instruction comprises a calculation requirement; and the computing cluster is provided with at least one computing node, and each computing node is provided with at least one GPU display card and is suitable for processing the jobs submitted by the client.

Optionally, in the job scheduling management system according to the present invention, the job scheduling management system further includes: and the performance monitoring center is suitable for monitoring the current idle resources of the computing cluster in real time and responding to the idle resource query request of the scheduling center to send the current idle resources of the computing cluster to the scheduling center.

According to yet another aspect of the present invention, there is also provided a computing device comprising: at least one processor; and at least one memory including computer program instructions; the at least one memory and the computer program instructions are configured to, with the at least one processor, cause a computing device to perform a job scheduling management method as described above.

According to yet another aspect of the present invention, there is also provided a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform the job scheduling management method as described above.

According to the technical scheme of the invention, a user submits a job operation command and a calculation requirement required by job operation to a scheduling center, the scheduling center distributes in idle resources of a calculation cluster according to the calculation requirement required by the job, and the calculation nodes meeting the calculation requirement and the idle GPU display cards thereof are distributed to the job. Before the operation of the operation, the scheduling center acquires the operation identification of the operation, the distributed operation calculation node identification and the GPU video card serial number used for operating the operation in the operation calculation node, and stores the acquired content in a correlation manner. Therefore, the node for running the operation and the serial number of the corresponding GPU display card can be directly inquired and obtained by using the operation identification subsequently, and the GPU position for running the operation can be quickly found out. On the basis, monitoring of the operation GPU load or field debugging of operation abnormity can be achieved.

Drawings

To the accomplishment of the foregoing and related ends, certain illustrative aspects are described herein in connection with the following description and the annexed drawings, which are indicative of various ways in which the principles disclosed herein may be practiced, and all aspects and equivalents thereof are intended to be within the scope of the claimed subject matter. The above and other objects, features and advantages of the present disclosure will become more apparent from the following detailed description read in conjunction with the accompanying drawings. Throughout this disclosure, like reference numerals generally refer to like parts or elements.

FIG. 1 shows a schematic diagram of a job scheduling management system 100 according to one embodiment of the present invention;

FIG. 2 illustrates a block diagram of a computing device 200, according to one embodiment of the invention;

FIG. 3 illustrates a flow diagram of a job schedule management method 300 according to one embodiment of the present invention;

FIG. 4 shows a schematic diagram of a job completion execution flow, according to one embodiment of the invention; and

fig. 5 shows a schematic diagram of a scheduling center 500 adapted to perform a job scheduling management method according to one embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

FIG. 1 shows a schematic diagram of a job scheduling management system 100 according to one embodiment of the present invention. As shown in FIG. 1, the system 100 may include one or more clients 110, a dispatch center 120, and at least one computing cluster 130. Wherein, the dispatch center 120 is communicatively connected with the client 110 and the computing cluster 130, respectively. It should be understood that the job scheduling management system 100 shown in fig. 1 is only exemplary, and in a specific implementation, there may be different numbers of clients 110, scheduling centers 120 and computing centers 130, and the present invention does not limit the number and the arrangement of these devices.

The client 110 sends a job submission instruction to the dispatch center 120 in response to a request from a user to submit a job. Dispatch center 120 may be deployed in multiple geographic locations, respectively, which may be implemented as a single computing device or as a cluster. Thus, the client 110 may send job submission instructions to a dispatch center 120 coupled thereto. The job submission instruction includes the computational requirements of the submitted job. The computational requirements may include, for example, the number of compute nodes and the number of GPU graphics cards for each compute node executing the job, etc.

The dispatch center 120 may monitor resource usage of the computing cluster 130 in real time, such as monitoring currently idle resources of the computing cluster 130. The dispatch center 120, upon receiving the job submission instruction, extracts the computational requirements in the instruction. And then, acquiring current idle resources of the computing cluster, and selecting proper computing resources from the current idle resources to allocate to the job. If one job application uses 2 compute nodes, each node uses 1 GPU graphics card. After the job is submitted, the scheduling center finds out 2 computing nodes with 1 idle GPU video card meeting the conditions in all the current idle resources so as to distribute the job for use.

According to one implementation, the system 100 may also be provided with a dedicated performance monitoring center 140 for monitoring resource usage of the computing cluster 130 in real time. For example, the current idle nodes of compute cluster 130, the idle GPU number of idle nodes, the job processing rate in the cluster, the number of nodes CPU occupied by a job, memory, network, storage, hardware resource configuration information, node performance data, node application job data, node process data, and function level data may be monitored. Alternatively, information such as the list information of the computing nodes that process the job, the number of CPU cores of the computing nodes, the physical configuration information of the nodes, whether or not the computing nodes are exclusively used when the job is executed, and the like may also be recorded. In this way, the scheduling center 120 may obtain resource usage of the computing cluster 130, such as obtaining currently idle resources, from the performance monitoring function center 140.

Considering that different computing clusters 130 may be disposed in different geographic locations, the performance monitoring center 140 may monitor the resource usage of all computing clusters 130 as a whole as a general center. Of course, a performance monitoring center 140 may be provided for each computing cluster 130 to monitor the corresponding computing cluster 110. At this time, the scheduling center 130 obtains the resource usage of the corresponding computing cluster 130 from each performance monitoring center 140.

According to an embodiment of the present invention, the various components in the job scheduling management system 100 described above may communicate over one or more networks, such as a Local Area Network (LAN) or a Wide Area Network (WAN), such as the internet. Each compute node, the scheduling center, the performance monitoring center, and the client in the compute cluster may be implemented by the computing device 200 as described below.

FIG. 2 shows a schematic diagram of a computing device 200, according to one embodiment of the invention. As shown in FIG. 3, in a basic configuration 202, a computing device 200 typically includes a system memory 206 and one or more processors 204. A memory bus 208 may be used for communication between the processor 204 and the system memory 206.

Depending on the desired configuration, the processor 204 may be any type of processing, including but not limited to: a microprocessor (μ P), a microcontroller (μ C), a Digital Signal Processor (DSP), or any combination thereof. The processor 304 may include one or more levels of cache, such as a level one cache 210 and a level two cache 212, a processor core 214, and registers 216. Example processor cores 214 may include Arithmetic Logic Units (ALUs), Floating Point Units (FPUs), digital signal processing cores (DSP cores), or any combination thereof. The example memory controller 218 may be used with the processor 204, or in some implementations the memory controller 218 may be an internal part of the processor 204.

Depending on the desired configuration, system memory 206 may be any type of memory, including but not limited to: volatile memory (such as RAM), non-volatile memory (such as ROM, flash memory, etc.), or any combination thereof. System memory 206 may include an operating system 220, one or more applications 222, and program data 224. In some implementations, the application 222 can be arranged to execute instructions on the operating system with the program data 224 by the one or more processors 204. The program data 224 includes instructions, and in the computing device 200 according to the present invention, the program data 224 contains instructions for performing the job scheduling management method 300.

Computing device 200 may also include an interface bus 240 that facilitates communication from various interface devices (e.g., output devices 242, peripheral interfaces 244, and communication devices 246) to the basic configuration 202 via the bus/interface controller 230. The example output device 242 includes a graphics processing unit 248 and an audio processing unit 250. They may be configured to facilitate communication with various external devices, such as a display or speakers, via one or more a/V ports 252. Example peripheral interfaces 244 can include a serial interface controller 254 and a parallel interface controller 256, which can be configured to facilitate communications with external devices such as input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner, etc.) via one or more I/O ports 258. An example communication device 246 may include a network controller 360, which may be arranged to facilitate communications with one or more other computing devices 362 over a network communication link via one or more communication ports 264.

A network communication link may be one example of a communication medium. Communication media may typically be embodied by computer readable instructions, data structures, program modules, in a modulated data signal, such as a carrier wave or other transport mechanism, and may include any information delivery media. A "modulated data signal" may be a signal that has one or more of its data set or its changes made in such a manner as to encode information in the signal. By way of non-limiting example, communication media may include wired media such as a wired network or private-wired network, and various wireless media such as acoustic, Radio Frequency (RF), microwave, Infrared (IR), or other wireless media. The term computer readable media as used herein may include both storage media and communication media.

Computing device 200 may be implemented as a server, such as a database server, an application server, a WEB server, and the like, or as a personal computer including desktop and notebook computer configurations. In an embodiment according to the invention, the computing device 200 is configured to execute a job scheduling management method 300 according to the invention.

As previously mentioned, there is a need to provide a method that can quickly establish a relationship between a computing job and allocated GPU resources. In one implementation, the GPU on which the job actually runs can be found by way of process identification PID:

first, a NODE identification NODE of job execution is determined from the job identification JID. After the job is submitted, the dispatch center will give the corresponding job identification (JOBID: 107505). Through the job identification, the node of 'gk 31' where the job runs can be queried by the squeue command. At this point, the running node is determined, but the GPU card on the node that was used to run the job is not yet determined.

Thereafter, all process identification PIDs of the job are acquired from the job identification JID, resulting in a list of PIDs of the JID (which may be referred to as a first list). Specifically, all process identification PIDs to job 107505 are queried via the 'scontrol lists' command, as shown in FIG. 2. However, a job may usually have a plurality of process information, which may also be dynamically changed, may have a process terminated and exited, and may have a new process started. Therefore, the output results will be different when the 'scontrol lists' query command is run at different times.

PID	JOBID
		-1	107505
300	107505
		323	107505
329	107505
		555	107505

And finally, acquiring process identification PIDs of all GPUs on the NODE NODE of the running job to obtain a PID list (which can be called as a second list) on the GPU. Specifically, on the NODE running the job, the 'nvidia-smi' command is run to display all GPU information on the NODE and a running process identification PID list on the GPU.

GPU	PID
		0	555
1	555
		2	555
3	555
		……	……

In this way, by comparing the PID list of the job with the PID list on the GPU, the GPU number used by the job can be determined. If the PID of a certain GPU appears in the job list, the GPU is indicated to be used by the job. If a PID of the job 107505 in the first list is 555 and GPUs corresponding to the PID555 in the second list are GPUs 0-3, it can be determined that the GPUs used by the job 107505 are GPUs 0-3 on the gk31 node.

However, the PID used in this implementation is dynamic information, and the timing of obtaining the PID list may affect the result of the query. Even if the intensive query mode is adopted, the accurate list can not be obtained. And the method is complex, has too many calculation links, and is not suitable for being used in a scene of simultaneous operation of a large number of concurrent operations. Therefore, in another implementation manner, the present invention provides another job scheduling management scheme for more quickly and accurately locating the GPU location corresponding to the job.

FIG. 3 illustrates a flow diagram of a job scheduling management method 300, suitable for execution in the scheduling center 120 as described above, according to one embodiment of the present invention. The job scheduling management method 300 will be described below in conjunction with the system 100 described in FIG. 1.

As shown in fig. 3, the method begins at step S310. In step S310, a job submission instruction sent by the client is received, where the job submission instruction includes a calculation requirement of the submitted job, and the calculation requirement includes a calculation node number m and a GPU graphics card number n of each calculation node.

According to one embodiment, the computing requirements may further include one or more of a floating point arithmetic unit, floating point arithmetic capability, CPU dominant frequency, CPU socket, CPU core, CPU hyper-threading, memory capacity, memory dominant frequency, file system, storage medium, storage interface, network type, network rate, network bandwidth, network latency, and the like.

Further, a job profile may be set for each job, which includes the computational requirements to execute the job. Before submitting a job, a user can select one or more performance indexes and write the performance indexes into a job configuration file. In this way, the scheduling center 120 can obtain the calculation requirement to execute the job by reading the job configuration file.

Subsequently, in step S320, current idle resources of the computing cluster are obtained, and a computing resource matching the computing requirement is selected from the current idle resources, where the computing resource has m computing nodes, and the number of idle GPU video cards of each computing node is greater than or equal to n.

As described above, the scheduling center may monitor the resource usage of the computing cluster by itself, or may be monitored by the performance monitoring center. Therefore, the scheduling center may send an idle resource query request to the performance monitoring center and receive the current idle resources of the computing cluster returned by the performance monitoring center.

The principle of the dispatching center is that the resource matched with the job requirement is selected from the idle resources, and then the selected resource is distributed to the job. Assume that the demand for a job is 2 compute nodes, each requiring 2 GPU cards. If the current idle resources are more, the scheduling center randomly selects the resources meeting the conditions to be allocated to the operation. For example, if there are A, B, C nodes that are completely idle, the scheduling system may assign the job two GPUs with GPU numbers 0, 1 on node A and two GPUs with GPU numbers 0, 1 on node B. Other similar combinations of nodes are of course possible. In practice, the resource usage may be too tight, for example, node a may only have GPU cards with serial numbers 2 and 7 idle, node B may only have GPU cards with serial number 4 idle, and node C may only have GPU cards with serial numbers 1 and 6 idle. At this time, the node B does not meet the job calculation requirement, so the node B does not participate in resource allocation, and the resources finally allocated to the job are the GPU cards with serial numbers 2 and 7 on the node a and the GPU cards with serial numbers 1 and 6 on the node C.

According to one embodiment, when there are multiple computing clusters, the scheduling center preferentially selects the cluster which is closest to the computing cluster and whose current free resources can meet the computing requirements. If the current idle resource of the closest computing cluster cannot meet the computing requirement of the user, the cluster with the next closest distance is selected, and so on. Or, the dispatching center can preferentially select the cluster with the best computing performance and the current idle resources meeting the computing requirements. The computing performance is related to the hardware and software device status of each node of the computing cluster. Similarly, if the current idle resource of the computing cluster with the best computing performance cannot meet the computing requirement of the user, the cluster with the second best computing performance is selected. Or, the dispatching center preferentially selects the cluster with the lowest computation cost and the current idle resource capable of meeting the computation requirement. The calculation cost is related to the electricity charge, the machine room hosting charge, the maintenance charge, the operation charge, the network operator charge and the like of the area where the cluster is located, and can be obtained by carrying out weighted average on the charges. If the current idle resource of the computing cluster with the lowest computing cost cannot meet the computing requirement of the user, the computing cluster with the lowest computing cost is selected and computed as the current low cluster, and so on.

Further, after the computing cluster is determined, if the number of currently idle computing nodes in the computing cluster is more than the number m of computing nodes required by the user, the scheduling center preferentially selects m computing nodes with better computing performance from the idle nodes, and the computing performance is related to the states of software and hardware devices of the nodes.

Subsequently, in step S330, the matched computing resource is allocated to the job, and before the job is executed, the job identifier storing the job, the allocated job computing node identifier, and the GPU video card serial number for executing the job in the job computing node are obtained and associated for subsequent query.

According to one embodiment of the present invention, as shown in fig. 4, during the life cycle of a job, some programs are executed at specific times to perform initialization or cleaning work, or to record the job, etc. For example, the Prolog process is executed after the computing resources are allocated, and the Epilog program is executed after the job computation is completed. Prolog and Epilog are programs reserved by the job scheduler that a user can customize for additional operations, with the Prolog program being located before the job starts running and the Epilog program being located after the job has completed execution. The method is to write a corresponding program at the position of a Prolog program to acquire a job identifier, a job computing node identifier and a GPU video card serial number distributed to the job.

The job identifier JOBID is a mark capable of uniquely identifying a job, and different job scheduling systems have own job identifiers. There are various methods of acquiring the identification, and for example, it may be acquired by an environment variable, may be acquired by output information of a job submission command, or may be acquired by some query command. The present invention is not limited to a specific job identifier obtaining method, and all methods capable of obtaining a job identifier are within the scope of the present invention. For the LSF job scheduling system, it may be obtained by the environment variable LSB _ JOBID. According to one embodiment, the JOB identification JOBID is obtained by querying the environment variable SLURM _ JOB _ ID of the JOB compute node. Exemplary code to obtain a job compute NODE identification NODE is as follows:

[scta019@login1 sandbox]$hostname

Login1

[scta019@login1 sandbox]$cat./get_hostname.sh

#！/bin/sh

echo＂This job is running on host＇$(hostname)＇＂

[scta019@login1 sandbox]$

[scta019@login sandbox]$sbatch–p ai–N 1./get_hostname.sh

Submitted batch job 108835

[scta019@login sandbox]$

[scta019@login sandbox]$cat slurm-108835.out

This job is running on host:＇gk31＇

[scta019@login1 sandbox]$

the job computing NODE identifier NODE is a mark capable of uniquely identifying the host name of the computing NODE, and various methods for acquiring the job computing NODE identifier exist, and all the methods capable of acquiring the job identifier are within the protection scope of the invention. According to one embodiment, the job compute NODE identification NODE may be obtained by querying a host name hostname command of the job compute NODE, or may be a unique identification of the compute NODE host by a network card MAC.

The GPU display card serial number GPU DEVICE INDEX of each job compute node is obtained by querying the environment variable CUDA _ VISIBLE _ DEVICEs of the job compute node. For the physical serial numbers of the GPU display cards, after the GPU is correctly installed on the compute nodes, corresponding device files are generated in the/dev directory, where the device files include the serial number of each GPU display card, and each compute node has a set of display card serial numbers, which may be numbered from 0 or 1, and the present invention is not limited to this. The device file is similar to "/dev/nvidia 0"/dev/nvidia 1 "/dev/nvidia 2" and so on, and the corresponding GPU numbers are 0, 1, 2.

After the scheduling center allocates the GPU resources to the job, the CUDA _ VISIBLE _ deviceenvironment variable of the job compute node is set to the corresponding GPU graphics serial number. The physical number of the GPU device may therefore be obtained by the environment variable CUDA _ VISIBLE _ DEVICES.

For example, in the previous example, the computational resources allocated for the job are the GPU cards with sequence numbers 2 and 7 on the A node and the GPU cards with sequence numbers 1 and 6 on the C node. Then the CUDA _ VISIBLE _ deviceenvironment variables at the a node and C node, respectively, would be set to:

and (2) node A: CUDA _ VISIBLE _ devicej 2, 7

And C, node C: CUDA _ VISIBLE _ DEVICES ═ 1, 6

In this way, by querying A, C the CUDA _ VISIBLE _ deviceenvironment variable values of the two nodes, respectively, the GPU graphics serial number used for running the job in the job compute node can be determined. At the moment, the mapping relation between the operation identification and the operation GPU can be established quickly and accurately, based on the mapping relation, cluster managers can monitor and troubleshoot the operation GPU load, and operation submitters can also debug and trace on the spot.

According to an embodiment of the invention, the method 300 further comprises the steps of: the job is run in the allocated computing resources and the allocated computing resources are released after the job run is finished. The complete flow of job execution is shown in fig. 4, after the job is submitted, the scheduling center performs calculation resource allocation, and then executes the Prolog program to obtain the job identifier JOBID, the job calculation NODE identifier NODE and the GPU DEVICE INDEX. Thereafter, the job is run on the allocated computing resources and after the job run is completed, an Epilog program is executed, which, like the Prolog program, can retrieve job messages from the environment variables. And finally, releasing the computing resources allocated to the operation to complete the whole operation computing process.

Fig. 5 is a block diagram illustrating a configuration of a scheduling center 500 for executing a job scheduling management method according to an embodiment of the present invention, where the scheduling center is connected to a client and a computing cluster, and the computing cluster has at least one computing node, and each computing node has at least one GPU graphics card.

As shown in fig. 5, the dispatch center 500 includes an instruction receiving module 510, a resource allocating module 520, and a resource recording module 530.

The instruction receiving module 510 receives a job submission instruction sent by a client, where the job submission instruction includes calculation requirements of a submitted job, and the calculation requirements include the number m of calculation nodes and the number n of GPU graphics cards of each calculation node. The instruction receiving module 510 may perform processing corresponding to the processing described above in step S310, and the detailed description thereof will not be repeated.

The resource allocation module 520 obtains current idle resources of the computing cluster, and selects a computing resource matched with the computing requirement from the current idle resources, where the computing resource has m computing nodes, and the number of idle GPU video cards of each computing node is greater than or equal to n. According to one embodiment, the resource allocation module 520 may send an idle resource query request to the performance monitoring center and receive the current idle resource of the computing cluster returned by the performance monitoring center. The resource allocation module 520 may perform the processing corresponding to the processing described above in step S320, and the detailed description thereof is omitted here.

The resource recording module 530 allocates the matched computing resource to the job, and acquires and associates and stores the job identifier of the job, the allocated job computing node identifier, and the serial number of the GPU video card for running the job in the job computing node before running the job, so as to facilitate subsequent query and use. The resource recording module 530 may perform processing corresponding to the processing described above in step S330, and a detailed description thereof will not be repeated here.

According to the technical scheme of the invention, the physical position of the GPU resource used by the job is quickly found in the cluster through the job identifier based on the Prolog mechanism of the job scheduler, and the mapping relation between the job identifier and the GPU running the job is established. Based on the mapping relation, a cluster manager can monitor and troubleshoot the operation GPU load, and an operation submitter can debug and trace on site.

A11, the job scheduling management system of a10, further comprising: and the performance monitoring center is suitable for monitoring the current idle resources of the computing cluster in real time and responding to the idle resource query request of the scheduling center to send the current idle resources of the computing cluster to the scheduling center. A12, a computing device, comprising: at least one processor; and at least one memory including computer program instructions; the at least one memory and the computer program instructions are configured to, with the at least one processor, cause the computing device to perform the method of any of claims 1-6. A13, a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods in claims 1-6.

The various techniques described herein may be implemented in connection with hardware or software or, alternatively, with a combination of both. Thus, the methods and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in tangible media, such as floppy diskettes, CD-ROMs, hard drives, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine, such as a computer, the machine becomes an apparatus for practicing the invention.

In the case of program code execution on programmable computers, the computing device will generally include a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. Wherein the memory is configured to store program code; the processor is configured to execute the job scheduling management method of the present invention according to instructions in the program code stored in the memory.

By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer readable media includes both computer storage media and communication media. Computer storage media store information such as computer readable instructions, data structures, program modules or other data. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. Combinations of any of the above are also included within the scope of computer readable media.

In the description provided herein, algorithms and displays are not inherently related to any particular computer, virtual system, or other apparatus. Various general purpose systems may also be used with examples of this invention. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.

Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method should not be interpreted as reflecting an intention that: that the invention as claimed requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this invention.

Those skilled in the art will appreciate that the modules or units or components of the devices in the examples disclosed herein may be arranged in a device as described in this embodiment or alternatively may be located in one or more devices different from the devices in this example. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.

Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components in the embodiments may be combined into one module or unit or component, and furthermore, may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

Additionally, some of the embodiments are described herein as a method or combination of method elements that can be implemented by a processor of a computer system or by other means of performing the described functions. A processor with the necessary instructions for carrying out the method or the method elements thus forms a device for carrying out the method or the method elements. Further, the elements of the apparatus embodiments described herein are examples of the following apparatus: the apparatus is used to implement the functions performed by the elements for the purpose of carrying out the invention.

As used herein, unless otherwise specified the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object, merely indicate that different instances of like objects are being referred to, and are not intended to imply that the objects so described must be in a given sequence, either temporally, spatially, in ranking, or in any other manner.

While the invention has been described with respect to a limited number of embodiments, those skilled in the art, having benefit of this description, will appreciate that other embodiments can be devised which do not depart from the scope of the invention as described herein. Furthermore, it should be noted that the language used in the specification has been principally selected for readability and instructional purposes, and may not have been selected to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The present invention has been disclosed with respect to the scope of the invention, which is to be considered as illustrative and not restrictive, and the scope of the invention is defined by the appended claims.

Claims

1. A job scheduling management method is suitable for being executed in a scheduling center, the scheduling center is respectively connected with a client and a computing cluster, the computing cluster is provided with at least one computing node, each computing node is provided with at least one GPU display card, and the method comprises the following steps:

receiving a job submitting instruction sent by a client, wherein the job submitting instruction comprises calculation requirements of submitted jobs, and the calculation requirements comprise the number m of calculation nodes and the number n of GPU video cards of each calculation node;

acquiring current idle resources of the computing cluster, and selecting computing resources matched with the computing requirements from the current idle resources, wherein the computing resources are provided with m computing nodes, and the number of idle GPU display cards of each computing node is more than or equal to n; and

distributing the matched computing resources to the operation, and acquiring and associating and storing an operation identifier of the operation, an identifier of the distributed operation computing node and a serial number of a GPU (graphics processing unit) video card used for running the operation in the operation computing node before running the operation so as to be convenient for subsequent inquiry and use;

the scheduling center is also connected with a performance monitoring center, the scheduling center is suitable for monitoring the node performance data of the computing cluster, and when the scheduling center selects the computing resource matched with the computing requirement from the current resources, the scheduling center selects the computing resource with the best computing performance and the current idle resource capable of meeting the computing requirement;

the scheduling center is also suitable for giving out corresponding operation identification after the operation is submitted;

the method for inquiring the display card identifier corresponding to the CPU corresponding to the job according to the job identifier comprises the following steps:

querying a node where the job runs through the job identifier;

acquiring a process identifier of a job according to the job identifier to obtain a first list, wherein the job comprises one or more processes, and the first list comprises all the process identifiers of the job;

acquiring process identifiers of GPU (graphics processing Unit) display cards on nodes running the operation to obtain a second list, wherein the second list comprises the process identifiers of all the GPU display cards in the nodes;

and determining the GPU display card used by the operation according to the comparison of the first list and the second list.

2. The method of claim 1, wherein the JOB identification is obtained by querying an environment variable SLURM JOB ID of a JOB compute node.

3. The method of claim 1 wherein the job compute node identification is obtained by querying a hostname command of the job compute node.

4. The method as claimed in any one of claims 1 to 3, wherein the compute node, after correctly installing the GPUs, generates a corresponding device file containing the graphics card serial number of each GPU.

5. The method of claim 4 wherein the GPU graphics card serial number for each job compute node is obtained by querying the environment variable CUDA VISIBLE DEVICES of that job compute node.

6. The method of claim 5, the performance monitoring center further configured to monitor currently idle resources of the computing cluster in real time, the obtaining the currently idle resources of the computing cluster including:

and sending an idle resource query request to the performance monitoring center, and receiving the current idle resources of the computing cluster returned by the performance monitoring center.

7. The method of claim 6, further comprising the steps of:

the job is run in the allocated computing resources and the allocated computing resources are released after the job run is finished.

8. A scheduling center for executing a job scheduling management method, the scheduling center being connected to a client and a computing cluster, respectively, the computing cluster having at least one computing node, each computing node having at least one GPU graphics card, the scheduling center comprising:

the instruction receiving module is suitable for receiving a job submitting instruction sent by a client, wherein the job submitting instruction comprises calculation requirements of submitted jobs, and the calculation requirements comprise the number m of calculation nodes and the number n of GPU video cards of each calculation node;

the resource allocation module is suitable for acquiring current idle resources of the computing cluster and selecting computing resources matched with the computing requirements from the current idle resources, wherein the computing resources are provided with m computing nodes, and the number of idle GPU display cards of each computing node is more than or equal to n; and

the resource recording module is suitable for distributing the matched computing resources to the operation, and acquiring and correlating and storing the operation identification of the operation, the distributed operation computing node identification and the GPU video card serial number used for running the operation in the operation computing node before the operation is run so as to be convenient for subsequent inquiry and use;

the resource allocation module selects the computing resource with the best computing performance and the current idle resource capable of meeting the computing requirement when selecting the computing resource matched with the computing requirement from the current resources;

the instruction receiving module is also suitable for giving out a corresponding operation identifier after the operation is submitted;

inquiring the node where the job runs through the job identifier;

9. The dispatch center of claim 8, wherein,

the JOB identification is obtained by querying an environment variable SLURM _ JOB _ ID of a JOB computing node;

the job computing node identification is obtained by querying a hostname command of the job computing node;

the GPU graphics card serial number of each job compute node is obtained by querying the environment variable CUDA _ VISIBLE _ DEVICES of that job compute node.

10. A job scheduling management system comprising:

the dispatch center of any one of claims 8 or 9;

the client is suitable for responding to a job submitting request of a user and sending a job submitting instruction to the scheduling center, wherein the job submitting instruction comprises a calculation requirement; and

and the computing cluster is provided with at least one computing node, and each computing node is provided with at least one GPU display card and is suitable for processing the jobs submitted by the client.

11. The job scheduling management system according to claim 10, further comprising:

and the performance monitoring center is suitable for monitoring the current idle resources of the computing cluster in real time and responding to the idle resource query request of the scheduling center to send the current idle resources of the computing cluster to the scheduling center.

12. A computing device, comprising:

at least one processor; and

at least one memory including computer program instructions;

the at least one memory and the computer program instructions are configured to, with the at least one processor, cause the computing device to perform the method of any of claims 1-6.

13. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a computing device, cause the computing device to perform any of the methods of claims 1-6.