WO2023169408A1 - Resource scheduling method, apparatus, and related device - Google Patents

Resource scheduling method, apparatus, and related device

Info

Publication number
WO2023169408A1
Authority
WO
WIPO (PCT)
Prior art keywords
computing nodes
computing
job
switches
scheduler
Prior art date
Application number
PCT/CN2023/080047
Other languages
French (fr)
Chinese (zh)
Inventor
申鹏
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2023169408A1 publication Critical patent/WO2023169408A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources to service a request
    • G06F 9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/5083 Techniques for rebalancing the load in a distributed system

Abstract

A resource scheduling method applied to an HPC cluster comprising a scheduler, a plurality of computing nodes, and a plurality of switches, wherein the plurality of computing nodes are communicatively connected through the plurality of switches. When performing resource scheduling, the scheduler obtains a job to be processed; the scheduler determines a plurality of first computing nodes from among the plurality of computing nodes according to the topology of the HPC cluster, wherein the total number of switches traversed for data transmission among the determined first computing nodes is less than a threshold; the scheduler then notifies the plurality of first computing nodes to execute the job. Because the scheduler limits the number of switches spanned by the communication connections among the selected first computing nodes, the communication cost incurred when the first computing nodes execute the job is lower, and the network transmission latency of data communication among the first computing nodes is also reduced.

Description

Resource scheduling method, apparatus, and related device
This application claims priority to the Chinese patent application filed with the China Patent Office on March 8, 2022, with application number 202210227649.1 and the title "Resource scheduling method, apparatus, and related device", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the field of resource management technology, and in particular, to a resource scheduling method, apparatus, and related device.
Background
A high performance computing (HPC) cluster is a computer cluster in which multiple computer systems are interconnected through various interconnection technologies to perform high-speed computing; HPC clusters are widely applied in fields such as weather forecasting, computational simulation, and image processing.
Typically, an HPC cluster is configured with a scheduler. For a job submitted to the HPC cluster, the scheduler can randomly select multiple computing nodes from the cluster and assign them to the job, so that the randomly selected computing nodes cooperatively execute the job through data communication with one another. Different computing nodes are communicatively connected through one or more switches.
However, this resource scheduling approach easily leads to high communication costs among the multiple computing nodes and low execution efficiency when the HPC cluster executes a job. Therefore, how to reduce the communication cost incurred when multiple computing nodes execute a job and how to improve job-execution efficiency have become important problems that urgently need to be solved.
Summary
This application provides a resource scheduling method, apparatus, scheduler, computing device, computer-readable storage medium, and computer program product, to reduce the communication cost incurred when multiple computing nodes execute a job and to improve job-execution efficiency.
In a first aspect, this application provides a resource scheduling method applied to an HPC cluster that includes a scheduler, multiple computing nodes, and multiple switches, the computing nodes being communicatively connected through the switches. When performing resource scheduling, the scheduler first obtains one or more jobs to be processed; for each job, the scheduler determines multiple first computing nodes from the multiple computing nodes according to the topology of the HPC cluster, where the total number of switches traversed for data transmission among the determined first computing nodes is less than a threshold, and the scheduler then notifies the first computing nodes to execute the job.
Because the scheduler, when selecting the multiple first computing nodes that execute each job, limits the number of switches spanned by the communication connections among those nodes, the communication cost incurred when they execute the job is lower, since communication does not have to cross a large number of switches; the network transmission latency of data communication among the first computing nodes is also reduced, which improves the efficiency with which the first computing nodes execute the job.
There may be one or more jobs to be processed; when there are multiple jobs, the scheduler can schedule resources for each of them according to the resource scheduling method introduced in the first aspect.
In a possible implementation, the networking between the multiple computing nodes and the multiple switches in the HPC cluster may be a two-layer fat-tree network structure. Specifically, the multiple switches in the HPC cluster include aggregation switches and edge switches; the computing nodes in the HPC cluster access the cluster through the edge switches, and the edge switches are coupled to the aggregation switches.
In a possible implementation, when determining the multiple first computing nodes from the computing nodes, the scheduler may traverse the topology of the HPC cluster and determine multiple computing nodes that access the same edge switch as the first computing nodes. In this way, the determined first computing nodes all communicate through the same edge switch, which not only keeps the communication cost incurred when they execute the job low, but also keeps the network transmission latency of data communication among them low, thereby improving the efficiency with which the first computing nodes execute the job.
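As an illustration of this implementation, a minimal Python sketch of picking first computing nodes that all attach to one edge switch is given below; the node and switch names, and the `edge_of` mapping, are assumptions for illustration, not data structures taken from the application.

```python
from collections import defaultdict

def pick_same_edge_nodes(edge_of, free_nodes, needed):
    """Group free computing nodes by the edge switch they attach to; if any
    single edge switch has enough free nodes, return `needed` nodes that all
    share that switch (so their traffic never leaves that edge switch)."""
    by_edge = defaultdict(list)
    for node in free_nodes:
        by_edge[edge_of[node]].append(node)
    for switch, nodes in by_edge.items():
        if len(nodes) >= needed:
            return nodes[:needed]
    return None  # no single edge switch can satisfy the request

# Hypothetical topology: node -> edge switch it is cabled to.
edge_of = {"n1": "e1", "n2": "e1", "n3": "e2", "n4": "e1"}
print(pick_same_edge_nodes(edge_of, ["n1", "n2", "n3", "n4"], 3))
# → ['n1', 'n2', 'n4']
```

When no edge switch has enough free nodes, the function returns `None`, which is where the multi-switch fallbacks of the later implementations would take over.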
In a possible implementation, when determining the multiple first computing nodes, the scheduler may traverse the topology of the HPC cluster to determine the number of computing nodes accessing each edge switch; when the number of computing nodes accessing each edge switch is less than a first quantity threshold, the scheduler determines multiple computing nodes accessing multiple edge switches as the first computing nodes, and these first computing nodes are communicatively connected through at least one aggregation switch. In this way, when the number of computing nodes accessing any single edge switch is small, the scheduler can select computing nodes under multiple edge switches as the first computing nodes; while meeting the job's demand for computing nodes, constraining the number of edge switches and aggregation switches traversed for data transmission among the first computing nodes reduces, as much as possible, the communication cost of executing the job and the network transmission latency of data communication among the first computing nodes.
In a possible implementation, the multiple switches in the HPC cluster further include core switches, and different aggregation switches are coupled through a core switch. In this case, when determining the multiple first computing nodes, the scheduler may traverse the topology of the HPC cluster to determine the number of computing nodes connected to each aggregation switch through edge switches; when, for every aggregation switch, this number is less than a second quantity threshold, the scheduler determines multiple computing nodes that are communicatively connected through at least one core switch as the first computing nodes. In this way, when the number of computing nodes connected to any single aggregation switch is small, the scheduler can select computing nodes under multiple aggregation switches as the first computing nodes; while meeting the job's demand for computing nodes, constraining the number of edge switches, aggregation switches, and core switches traversed for data transmission among the first computing nodes reduces, as much as possible, the communication cost of executing the job and the network transmission latency of data communication among the first computing nodes.
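The tiered preference described across these implementations (one edge switch, then edge switches under one aggregation switch, then spanning aggregation switches via a core switch) can be sketched as follows; the `(aggregation_switch, free_nodes)` representation and all names are illustrative assumptions, not the application's data model.

```python
from collections import defaultdict

def select_first_nodes(edges, needed):
    """Tiered, topology-aware selection sketch.
    `edges` is a list of (aggregation_switch, free_nodes) pairs, one per
    edge switch. Tier 1: a single edge switch suffices. Tier 2: combine
    edge switches hanging off the same aggregation switch. Tier 3: span
    aggregation switches (traffic then crosses a core switch)."""
    for _, nodes in edges:                           # tier 1
        if len(nodes) >= needed:
            return nodes[:needed]
    by_agg = defaultdict(list)                       # tier 2
    for agg, nodes in edges:
        by_agg[agg].extend(nodes)
    for nodes in by_agg.values():
        if len(nodes) >= needed:
            return nodes[:needed]
    pool = [n for _, nodes in edges for n in nodes]  # tier 3
    return pool[:needed] if len(pool) >= needed else None

edges = [("a1", ["n1", "n2"]), ("a1", ["n3"]), ("a2", ["n4"])]
print(select_first_nodes(edges, 3))
# → ['n1', 'n2', 'n3']  (tier 2: both edge switches sit under a1)
```

Each tier only widens the search when the previous one cannot satisfy the request, which matches the intent of keeping the number of traversed switches as small as possible.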
In a possible implementation, the multiple first computing nodes determined by the scheduler include a head computing node and at least one agent computing node. When having the first computing nodes execute the job, the scheduler may send an execution instruction to the head computing node to notify it to execute the job; the at least one agent computing node is notified to execute the job by the head computing node, and the first computing nodes remaining after the head computing node and the at least one agent computing node are notified to execute the job by the agent computing nodes. In this way, the scheduler only needs to send an execution instruction to a single head computing node to trigger the multiple first computing nodes to execute the job; moreover, when there are multiple agent computing nodes, they can notify the remaining first computing nodes simultaneously, which increases the concurrency of notification and thus improves the efficiency with which the first computing nodes execute the job.
In a possible implementation, the at least one agent computing node includes a first agent computing node and a second agent computing node. The first agent computing node is notified to execute the job by the head computing node, and the second agent computing node is notified by the first agent computing node; the second agent computing node is further used to notify the remaining first computing nodes (that is, the first computing nodes other than the head computing node, the first agent computing node, and the second agent computing node) to execute the job. In this way, agent computing nodes organized in at least two layers can further improve the efficiency of notifying the first computing nodes to start executing the job.
In a possible implementation, the total number of the head computing node and the at least one agent computing node is less than the number of the remaining first computing nodes. In this way, a small number of computing nodes can notify a larger number of computing nodes in parallel, which effectively increases the concurrency of notifying the first computing nodes to start executing the job and thus improves the efficiency with which the multiple first computing nodes start executing it.
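The head/agent notification structure above can be sketched as a small tree builder; picking the first node as head, the next few as agents, and the round-robin split are illustrative choices of this sketch, and it assumes at least one agent (`n_agents >= 1`) when there are leaf nodes to notify.

```python
def build_launch_tree(first_nodes, n_agents):
    """Sketch of the head/agent notification structure: the first node
    acts as head, the next `n_agents` nodes act as agents, and the
    remaining nodes are divided round-robin among the agents so the
    agents can notify them in parallel."""
    head, rest = first_nodes[0], first_nodes[1:]
    agents, leaves = rest[:n_agents], rest[n_agents:]
    fanout = {agent: [] for agent in agents}
    for i, leaf in enumerate(leaves):
        fanout[agents[i % len(agents)]].append(leaf)
    return head, fanout

head, fanout = build_launch_tree(["n1", "n2", "n3", "n4", "n5", "n6", "n7"], 2)
print(head, fanout)
# → n1 {'n2': ['n4', 'n6'], 'n3': ['n5', 'n7']}
```

With 2 agents fanning out to 4 leaves, the scheduler sends one instruction while four notifications proceed in two parallel streams, which is the concurrency gain the implementation describes.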
In a possible implementation, the networking of the multiple computing nodes and the multiple switches includes one of the following networks: a two-layer fat-tree network, a three-layer fat-tree network, a two-dimensional mesh network, a three-dimensional mesh network, a two-dimensional torus network, or a three-dimensional torus network. In network structures of any of these networking modes, the approach above can be used to reduce the communication cost of notifying the multiple computing nodes that execute a job and to improve the efficiency with which the multiple computing nodes execute it, thereby increasing the flexibility of implementing the solution.
In a possible implementation, the scheduler may further obtain first connection information between the computing nodes and the switches in the HPC cluster and second connection information among the multiple switches, and generate the topology of the HPC cluster according to the first connection information and the second connection information, so that the scheduler can subsequently schedule computing nodes for jobs to be processed based on that topology.
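A minimal sketch of assembling the topology from the two kinds of connection information follows; representing each piece of connection information as a list of link pairs, and the topology as an undirected adjacency map, are assumptions of this sketch.

```python
def build_topology(first_connections, second_connections):
    """Build an undirected adjacency map of the cluster from
    node-to-switch links (first connection information) and
    switch-to-switch links (second connection information)."""
    topo = {}
    for a, b in list(first_connections) + list(second_connections):
        topo.setdefault(a, set()).add(b)
        topo.setdefault(b, set()).add(a)
    return topo

topo = build_topology([("n1", "e1"), ("n2", "e1")], [("e1", "a1")])
print(sorted(topo["e1"]))
# → ['a1', 'n1', 'n2']
```

The resulting map is the structure a scheduler could traverse when counting the switches between candidate computing nodes.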
Optionally, the topology of the HPC cluster may also be generated in other ways; for example, it may be generated by technical personnel and configured in the scheduler. This embodiment does not limit this.
In a second aspect, this application further provides a resource scheduling apparatus, which includes modules for executing the resource scheduling method in the first aspect or any possible implementation of the first aspect.
In a third aspect, this application further provides a scheduler, including a processor and a memory. The memory is used to store computer instructions, and the processor is used to execute, according to the computer instructions stored in the memory, the resource scheduling method in the first aspect or any implementation of the first aspect. It should be noted that the memory may be integrated into the processor or be independent of the processor. The computing device may further include a bus, through which the processor is connected to the memory. The memory may include readable memory and random access memory.
In a fourth aspect, this application further provides a computing device that includes a scheduler, the scheduler including a processor and a memory; the memory is used to store computer instructions, and the processor is used to execute, according to the computer instructions stored in the memory, the resource scheduling method in the first aspect or any implementation of the first aspect.
In a fifth aspect, this application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the operational steps of the method described in the first aspect or any implementation of the first aspect.
In a sixth aspect, this application provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the operational steps of the method described in the first aspect or any implementation of the first aspect.
On the basis of the implementations provided in the above aspects, this application can be further combined to provide more implementations.
Description of the drawings
Figure 1 is a schematic architectural diagram of an exemplary HPC cluster provided by this application;
Figure 2 is a schematic diagram of an exemplary computing cluster 101 based on a two-layer fat-tree network structure provided by this application;
Figure 3 is a schematic diagram of an exemplary computing cluster 101 based on a three-layer fat-tree network structure provided by this application;
Figure 4 is a schematic flowchart of a resource scheduling method provided by this application;
Figure 5a is a schematic diagram of the topology corresponding to the two-layer fat-tree networking structure provided by this application;
Figure 5b is a schematic diagram of the topology corresponding to the three-layer fat-tree networking structure provided by this application;
Figure 6 is a schematic structural diagram of a resource scheduling apparatus provided by this application;
Figure 7 is a schematic diagram of the hardware structure of a computing device provided by this application.
Detailed description
The technical solutions in this application are described below with reference to the drawings in the embodiments of this application.
Refer to Figure 1, which is a schematic diagram of an HPC cluster provided by this application. As shown in Figure 1, the HPC cluster 100 includes a computing cluster 101 and a scheduler 102. The computing cluster 101 includes multiple computing nodes, which are communicatively connected through multiple switches. The scheduler 102 may be configured with a queue used to temporarily store one or more jobs submitted to the HPC cluster, such as data aggregation computing jobs. For each job in the queue, the scheduler 102 can schedule a certain number of computing nodes from the computing cluster 101 so that the scheduled computing nodes execute the job.
For example, the computing nodes in the computing cluster 101 can be implemented by devices with computing capabilities, such as servers. Specifically, a device with computing capabilities may be provided with one or more processors, implemented, for example, with a central processing unit (CPU), an application-specific integrated circuit (ASIC), or a programmable logic device (PLD); the PLD may be a complex programmable logic device (CPLD), an FPGA, generic array logic (GAL), or any combination thereof.
The scheduler 102 can be implemented in hardware or software. When implemented in hardware, the scheduler 102 may be, for example, a server or another device with data processing capabilities. When implemented in software, the scheduler 102 may be, for example, an application running on a computing device.
It is worth noting that the HPC cluster 100 may further be configured with nodes having other functions. For example, as shown in Figure 1, the HPC cluster may also include a management node 103, which is used to manage the computing nodes in the computing cluster 101. Moreover, the architecture of the HPC cluster 100 shown in Figure 1 is only an example; this application can also be applied to other applicable HPC cluster architectures.
In practical applications, the multiple computing nodes and multiple switches in the HPC cluster 100 can be networked in various ways to construct the computing cluster 101. In one example, the networking between the computing nodes and the switches may be a two-layer fat-tree network structure, as shown in Figure 2 (Figure 2 uses 8 switches and n computing nodes as an example). In a two-layer fat-tree network, the switches fall into two categories: edge switches and aggregation switches. Each computing node accesses the HPC cluster 100 through an edge switch, and the edge switches are coupled to the aggregation switches, so that different computing nodes connected to the same edge switch can communicate through that edge switch; in Figure 2, computing node 1 and computing node 2 can communicate through edge switch 1. Different edge switches can communicate through the aggregation switches; in Figure 2, edge switch 1 and edge switch 2 can communicate through aggregation switch 1, or through aggregation switch 2 or aggregation switch 3.
In another example, the networking between the computing nodes and the switches may be a three-layer fat-tree network structure, as shown in Figure 3 (Figure 3 uses 12 switches and n computing nodes as an example). On top of the two-layer fat-tree network, the switches in a three-layer fat-tree network include not only edge switches and aggregation switches but also core switches; a core switch is coupled to the aggregation switches and is used to implement data communication between different aggregation switches. In Figure 3, aggregation switch 2 and aggregation switch 3 can communicate through core switch 1, or through core switch 2.
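In such fat-tree structures, the number of switches traversed between two computing nodes can be estimated with a shortest-path search over the topology graph (computing nodes as leaves, switches as interior vertices). This is an illustrative sketch, not an algorithm taken from the application; the adjacency-map representation and node names are assumptions.

```python
from collections import deque

def switches_between(topo, src, dst):
    """Breadth-first search over an undirected adjacency map
    (vertex -> set of neighbours). Every interior vertex on a
    node-to-node path is a switch, so the number of switches traversed
    equals the shortest path length minus one."""
    queue, seen = deque([(src, 0)]), {src}
    while queue:
        vertex, dist = queue.popleft()
        if vertex == dst:
            return dist - 1
        for nxt in topo.get(vertex, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # unreachable

# Two-layer fat-tree fragment: n1, n2 under e1; n3 under e2; e1, e2 under a1.
topo = {"n1": {"e1"}, "n2": {"e1"}, "n3": {"e2"},
        "e1": {"n1", "n2", "a1"}, "e2": {"n3", "a1"}, "a1": {"e1", "e2"}}
print(switches_between(topo, "n1", "n2"))  # → 1 (edge switch only)
print(switches_between(topo, "n1", "n3"))  # → 3 (edge, aggregation, edge)
```

A scheduler enforcing the threshold from the first aspect could sum this count over all pairs of candidate first computing nodes and reject candidate sets whose total is too large.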
当然,在其它示例中,多个计算节点与多个交换机之间的组网方式也可以是二维网格(2D mesh)网络、三维网格(3D mesh)网络、二维环网(2D torus)、三维环网(3D torus)等,本申请实施例对此并不进行限定。 Of course, in other examples, the networking method between multiple computing nodes and multiple switches may also be a two-dimensional mesh (2D mesh) network, a three-dimensional mesh (3D mesh) network, or a two-dimensional torus network (2D torus). ), 3D torus, etc., the embodiments of the present application are not limited to this.
调度器102在为队列中的作业分配计算资源时,可以从计算集群101中随机选取多个计算节点,并指示该多个计算节点执行该作业。具体实现时,调度器102可以向其中一个计算节点发送启动执行作业的指令,并由该计算节点触发其余计算节点启动执行该作业。但是,该计算节点与其余各个计算节点之间可能通过多个交换机实现通信连接,这使得该计算节点在触发其余计算节点启动执行该作业时,需要跨交换机进行通信,导致多个计算节点执行作业所产生的通信成本较高,而且,计算节点之间跨交换机通信的网络传输时延较高,从而导致多个计算节点执行作业的效率较低。When allocating computing resources to a job in the queue, the scheduler 102 may randomly select multiple computing nodes from the computing cluster 101 and instruct the multiple computing nodes to execute the job. During specific implementation, the scheduler 102 can send an instruction to start executing the job to one of the computing nodes, and the computing node triggers the other computing nodes to start executing the job. However, communication connections between the computing node and other computing nodes may be realized through multiple switches, which requires the computing node to communicate across switches when triggering other computing nodes to start executing the job, resulting in multiple computing nodes executing the job. The resulting communication cost is high, and the network transmission delay for cross-switch communication between computing nodes is high, resulting in low efficiency in executing jobs on multiple computing nodes.
基于此,本申请实施例提供了一种资源调度方法,调度器102在为待处理的作业调度计算资源时,根据HPC集群100的拓扑结构,从计算集群101中确定出一个计算节点集合,计算节点集合包括至少一个计算节点,计算节点集合中计算节点用于执行待处理的作业,并通知计算节点集合中的计算节点执行该作业。为了便于描述,可以将执行上述作业的计算节点称为第一计算节点,也即计算节点集合中包括至少一个第一计算节点。其中,该多个第一计算节点之间数据传输所经过的交换机的总数量小于阈值。由于调度器102在选取执行作业的多个第一计算节点时,限制了该多个第一计算节点之间实现通信连接时所跨越的交换机的数量,这使得该多个第一计算节点执行作业所产生的通信成本较低,即可以不用跨越较多数量的交换机进行通信,而且,多个第一计算节点之间进行数据通信时所产生的网络传输时延也可以得到降低,以此可以提高第一计算节点执行作业的效率。Based on this, embodiments of the present application provide a resource scheduling method. When scheduling computing resources for a job to be processed, the scheduler 102 determines a set of computing nodes from the computing cluster 101 according to the topology of the HPC cluster 100, and calculates The node set includes at least one computing node, and the computing nodes in the computing node set are used to execute the pending job, and notify the computing nodes in the computing node set to execute the job. For convenience of description, the computing node that performs the above job may be called a first computing node, that is, the computing node set includes at least one first computing node. Wherein, the total number of switches through which data transmission between the plurality of first computing nodes passes is less than a threshold. When the scheduler 102 selects a plurality of first computing nodes to execute a job, it limits the number of switches across the communication connections between the plurality of first computing nodes, which causes the plurality of first computing nodes to execute the job. The resulting communication cost is low, that is, there is no need to communicate across a large number of switches, and the network transmission delay generated during data communication between multiple first computing nodes can also be reduced, which can improve The efficiency of the first computing node in executing the job.
下面,结合附图进一步介绍本申请提供的调度器102为作业调度计算资源的过程。Next, the process of scheduling computing resources for jobs by the scheduler 102 provided by the present application will be further introduced with reference to the accompanying drawings.
Referring to Figure 4, Figure 4 is a schematic flowchart of a resource scheduling method provided by an embodiment of this application. The method can be applied to the HPC cluster 100 shown in Figure 1, or to HPC clusters of other architectures. In this embodiment, the method is described by way of example as being applied to the HPC cluster 100 and executed by the scheduler 102 in the HPC cluster 100. As shown in Figure 4, the method may specifically include the following steps.
S401: The scheduler 102 obtains a job to be processed.
In this embodiment, for a job submitted to the HPC cluster 100, the scheduler 102 may allocate the computing resources for processing the job.
In a possible implementation, the HPC cluster 100 may provide an external interactive interface, which may be presented to a user through a client. The client can then generate a corresponding job based on the user's operations on the interface and submit it to the scheduler 102, for example, a job submitted through the message passing interface (MPI). When the number of jobs submitted to the scheduler 102 is small, the scheduler 102 can directly allocate computing resources to each submitted job, so that the job is executed using the allocated resources. When the number of submitted jobs is large, jobs may be temporarily stored in a queue in the scheduler 102. After finishing scheduling computing resources for the current job, the scheduler 102 can take the next job out of the queue, for example in the order in which jobs were submitted to the scheduler 102 or according to job priority, and schedule the corresponding computing resources for the job taken from the queue.
S402: The scheduler 102 determines a plurality of first computing nodes from the multiple computing nodes included in the computing cluster 101 according to the topology of the HPC cluster, where the total number of switches through which data transmission between the plurality of first computing nodes passes is less than a threshold.
In this embodiment, when allocating computing resources to a job to be processed, the scheduler 102 can preferentially select computing nodes whose network locations are relatively concentrated as the first computing nodes for processing the job, thereby minimizing the communication cost incurred when the plurality of first computing nodes process the job.
In a specific implementation, before scheduling computing resources for a job, the scheduler 102 can read the topology of the HPC cluster locally, or receive the topology sent by the management node 103. The topology indicates the connection relationships between the computing nodes in the computing cluster 101 and the switches. The topology obtained by the scheduler 102 includes the internet protocol (IP) addresses of the computing nodes and the identifiers of the switches. For example, when a two-layer fat-tree network structure is used for networking, as shown in Figure 5a, the topology can take the form "compute node IP address: [edge switch name]"; when a three-layer fat-tree network structure is used, as shown in Figure 5b, the topology can take the form "compute node IP address: [edge switch name] [aggregation switch name, aggregation switch name, ...]", where an edge switch can interconnect with one or more core switches.
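As an illustration of the two record formats above, the topology could be held in memory as a mapping from compute-node IP address to the list of switch names on the node's uplink path. The dictionary layout, IP addresses, and switch names below are hypothetical examples, not part of this application:

```python
# Hypothetical in-memory form of the topology records described above.
# Two-layer fat tree: each node maps to its edge switch only.
topology_two_layer = {
    "10.0.0.1": ["edge1"],
    "10.0.0.2": ["edge1"],
    "10.0.1.1": ["edge2"],
}

# Three-layer fat tree: each node maps to its edge switch followed by
# the switches that edge switch is attached to at the next layer.
topology_three_layer = {
    "10.0.0.1": ["edge1", "agg1", "agg2"],
    "10.0.1.1": ["edge2", "agg1"],
}

def edge_switch_of(topology, node_ip):
    """Return the edge switch a compute node accesses the cluster through."""
    return topology[node_ip][0]
```

In both layouts the first entry of each list is the edge switch, so a scheduler can group nodes by edge switch without caring which fat-tree depth is in use.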
In practice, the scheduler 102 or the management node 103 can collect information about the computing nodes and switches in the computing cluster 101 and generate the topology of the HPC cluster. Taking topology generation by the scheduler 102 as an example, the scheduler 102 can obtain first connection information, which indicates the connection relationships between the computing nodes and the switches in the HPC cluster 100, and second connection information, which indicates the connection relationships between different switches, and then generate the topology of the HPC cluster from the first and second connection information. In practice, a user may also manually generate the topology based on the interconnection relationships between the computing nodes and the switches and configure it in the scheduler 102. This embodiment does not limit the specific manner in which the scheduler 102 obtains the topology.
The scheduler 102 can then traverse the topology of the HPC cluster and determine a plurality of first computing nodes from the computing cluster 101, where the total number of switches through which data transmission between the plurality of first computing nodes passes is less than a threshold. The threshold may be a fixed value, for example preconfigured in the scheduler 102 by an administrator; alternatively, the threshold may be determined in real time by the scheduler 102 according to the job to be processed, and the thresholds for different jobs may differ. For example, for job 1 the threshold limiting the number of switches may be a value a, while for job 2 the threshold may be a value b. Likewise, the number of first computing nodes the scheduler 102 allocates to a job may be a fixed value, or may be determined by the scheduler 102 according to the job; for example, the scheduler 102 may allocate 32 first computing nodes to job 3 and 64 first computing nodes to job 4. The scheduler 102 allocates the determined plurality of first computing nodes to the job so that they can subsequently execute it.
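The threshold check described above can be sketched as a switch count over the topology mapping: in a two-layer fat tree, nodes under one edge switch communicate through that single switch, while nodes spread over several edge switches additionally cross an aggregation switch. The one-aggregation-switch assumption and the function names below are simplifications for illustration, not the claimed algorithm:

```python
def total_switches(topology, node_ips):
    """Count the switches used by pairwise traffic within a node set,
    assuming a two-layer fat tree in which any two edge switches are
    joined by a single aggregation switch."""
    edges = {topology[ip][0] for ip in node_ips}
    if len(edges) <= 1:
        return len(edges)  # all traffic stays on one edge switch
    return len(edges) + 1  # the edge switches plus one aggregation switch

def fits_threshold(topology, node_ips, threshold):
    """True if the candidate node set satisfies the switch-count limit."""
    return total_switches(topology, node_ips) < threshold
```

A scheduler can apply `fits_threshold` to each candidate set while traversing the topology and keep only sets below the per-job threshold.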
By way of example, this embodiment provides the following scheduling strategies for determining the first computing nodes.
In the first scheduling strategy, the scheduler 102 can determine the number of computing nodes connected to each edge switch by traversing the topology of the HPC cluster. When the number of computing nodes connected to some edge switch reaches a first quantity threshold, the scheduler 102 can allocate multiple computing nodes connected to that edge switch to the job as the first computing nodes for processing the job. In this case, the total number of switches through which data transmission between the first computing nodes passes is 1; that is, different first computing nodes can communicate through this edge switch alone.
In a further implementation, some computing nodes may be heavily loaded, or different jobs may have different requirements for computing nodes; for example, some jobs have strict latency requirements for job processing while others do not. Based on this, when traversing the topology, the scheduler 102 can take into account information such as the load on each computing node and the job requirements to determine the available computing nodes capable of processing the job (for example, lightly loaded nodes, or nodes with low processing latency), and then allocate multiple available computing nodes connected to the same edge switch to the job as the first computing nodes.
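The first strategy, including the availability filtering just described, can be sketched as a scan over edge switches for one whose pool of available nodes already covers the request. The helper name and the pre-filtered `available` list are illustrative assumptions:

```python
from collections import defaultdict

def pick_single_edge_switch(topology, available, wanted):
    """First strategy: group the available nodes by edge switch and
    return a full allocation from one switch if any switch has enough
    of them; otherwise return None so another strategy can be tried.
    `available` is assumed to already exclude overloaded or otherwise
    unsuitable nodes."""
    by_edge = defaultdict(list)
    for ip in available:
        by_edge[topology[ip][0]].append(ip)
    for edge, nodes in by_edge.items():
        if len(nodes) >= wanted:
            return nodes[:wanted]  # all traffic crosses this one switch
    return None
```

Returning `None` rather than a partial allocation lets the caller fall back to the second strategy when no single edge switch suffices.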
In the second scheduling strategy, the scheduler 102 can traverse the topology of the HPC cluster to determine the number of computing nodes connected to each edge switch. When the number of computing nodes connected to every edge switch is less than the first quantity threshold, the scheduler 102 can allocate computing nodes under multiple edge switches to the job. In this case, to reduce the communication cost between the computing nodes allocated to the job, the scheduler 102 can limit the number of switches (for example, aggregation switches) through which data communication between different edge switches passes, and allocate to the job, as the first computing nodes, the computing nodes under the group of edge switches whose mutual data communication passes through the fewest switches. For example, suppose the first quantity threshold is 80 and, by traversing the topology, the scheduler 102 determines that 64 computing nodes are connected to edge switch 1, 32 computing nodes are connected to edge switch 2, and 32 computing nodes are connected to edge switch 3, where data transmission between edge switch 1 and edge switch 2 passes through one aggregation switch, and data transmission between edge switch 1 and edge switch 3 passes through two aggregation switches. Then, when allocating 80 computing nodes to the job, the scheduler 102 can allocate the 64 computing nodes connected to edge switch 1 and 16 of the nodes connected to edge switch 2 to the job as the first computing nodes. In this way, subsequent data transmission between the 80 first computing nodes passes only through edge switch 1, or only through edge switch 2, or through edge switch 1, one aggregation switch, and edge switch 2, so that the communication cost between the 80 first computing nodes allocated to the job is minimized.
Further, when multiple candidate groups of edge switches involve the same number of aggregation switches for data transmission, the scheduler 102 can also take the physical locations of the edge switches into account and preferentially allocate to the job the computing nodes under edge switches that are physically close to each other, to further reduce the communication cost between the first computing nodes. Alternatively, the scheduler 102 can take the load of the edge switches into account and preferentially allocate the computing nodes under lightly loaded edge switches to the job as the first computing nodes, so as to balance the load of the edge switches in the network. This embodiment does not limit this.
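The worked example of the second strategy (64 nodes under edge switch 1 topped up with 16 under edge switch 2) amounts to filling the request from the best-populated edge switch first and then from the switches reachable through the fewest aggregation switches. The sketch below assumes a pairwise aggregation-hop table is available and ignores the physical-location and load tie-breakers; it is an illustration, not the claimed algorithm:

```python
def pick_across_edge_switches(by_edge, agg_hops, wanted):
    """Second strategy: start from the edge switch with the most
    available nodes, then add nodes from other switches ordered by the
    number of aggregation switches between them and the first switch.
    by_edge: {edge switch name: [node ids under that switch]}
    agg_hops: {(edge_a, edge_b): aggregation switches between them}"""
    base = max(by_edge, key=lambda e: len(by_edge[e]))
    picked = list(by_edge[base][:wanted])
    others = sorted(
        (e for e in by_edge if e != base),
        key=lambda e: agg_hops.get((base, e), agg_hops.get((e, base), 0)),
    )
    for edge in others:
        if len(picked) >= wanted:
            break
        picked.extend(by_edge[edge][: wanted - len(picked)])
    return picked if len(picked) >= wanted else None
```

With the figures from the example (64/32/32 nodes, one hop to edge switch 2 and two to edge switch 3), the sketch reproduces the 64 + 16 allocation.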
In the third scheduling strategy, the switches in the computing cluster 101 further include core switches; for example, the computing cluster 101 may be networked using a three-layer fat-tree network structure. The scheduler 102 can traverse the topology of the HPC cluster to determine the computing nodes connected to each aggregation switch through the edge switches. When the number of computing nodes connected to every aggregation switch is less than a second quantity threshold, the scheduler 102 can determine, as the first computing nodes allocated to the job to be processed, multiple computing nodes whose communication connections pass through at least one core-layer switch, thereby limiting the number of switches traversed during data transmission between different aggregation switches and reducing the communication cost between the plurality of first computing nodes.
In practice, the scheduler 102 can apply all three scheduling strategies together. For example, it can first execute the first scheduling strategy; when no edge switch has a number of connected computing nodes reaching the first quantity threshold, the scheduler 102 can execute the second scheduling strategy; and when no aggregation switch has a number of connected computing nodes reaching the second quantity threshold, the scheduler 102 can further execute the third scheduling strategy, thereby minimizing the communication cost between the different computing nodes that process the job. Alternatively, the scheduler 102 may execute only some of the three scheduling strategies, or execute other applicable scheduling strategies, which is not limited in this embodiment.
S403: The scheduler 102 notifies the plurality of first computing nodes to execute the job.
After determining the plurality of first computing nodes allocated to the job to be processed, the scheduler 102 can notify them to execute the job.
By way of example, the scheduler 102 can notify the first computing nodes one by one to trigger each of them to start executing the job. Alternatively, the scheduler 102 can notify a head node among the plurality of first computing nodes to start executing the job, where the head node is one of the plurality of first computing nodes, and the head node notifies the remaining first computing nodes to start executing the job. In a specific implementation, the scheduler 102 can send the head node an execution instruction for the job, such as an mpirun command; the execution instruction notifies the head node to execute the job. The head node parses the execution instruction to obtain the IP addresses of the remaining first computing nodes carried in it, and can then notify the remaining first computing nodes one by one, according to their IP addresses, to start executing the job. The scheduler 102 can randomly select one of the plurality of first computing nodes as the head node and instruct it to notify the remaining first computing nodes to start executing the job. Alternatively, based on the network distance or communication delay between each first computing node and the scheduler 102, the scheduler 102 can determine, as the head node, the first computing node with the smallest network distance or the shortest communication delay to the scheduler 102, and instruct it to notify the remaining first computing nodes to start executing the job. In practice, the scheduler 102 may also determine the head node from the plurality of first computing nodes according to other rules, which is not limited in this embodiment.
In practice, the scheduler 102 may allocate a large number of first computing nodes to a job, for example 1000 first computing nodes; in that case, having the head node notify the remaining first computing nodes one by one to start executing the job may take a long time. Therefore, in other possible implementations, the head node can use a tree-structured notification scheme to increase the concurrency of notifying the first computing nodes to start executing the job, thereby reducing notification time and improving notification efficiency. Taking the case where the head node uses a two-layer tree to start the remaining first computing nodes as an example, after receiving the execution instruction, the head node can determine at least one agent computing node among the remaining first computing nodes; for example, the execution instruction may carry the IP addresses of the agent computing nodes. Then, based on the network topology of the HPC cluster and the IP addresses of the agent computing nodes, the head node can notify each agent computing node through the corresponding switches to start executing the job; the head node and the agent computing nodes can then each notify different first computing nodes to start executing the job. By having the head node and the agent computing nodes notify the remaining first computing nodes in parallel, notification efficiency is improved and job execution is accelerated. In practice, the total number of head and agent computing nodes can be smaller than the number of remaining first computing nodes among the plurality of first computing nodes.
For example, in the two-layer fat-tree network shown in Figure 2, suppose the first computing nodes allocated to the job to be processed include computing node 1 through computing node a and computing node a+1 through computing node b. The scheduler 102 can first send an execution instruction to computing node 1 to notify it to start executing the job; computing node 1 is then the head node described above. The execution instruction can designate computing node a+1 as an agent computing node and notify it to start executing the job, so that computing node 1 can trigger computing node a+1, acting as the agent computing node, to notify the remaining computing nodes processing the job to start execution. In this way, computing node 1 can notify computing node 2 through computing node a one by one, and the agent computing node (computing node a+1) can notify computing node a+2 through computing node b one by one, thereby improving notification efficiency.
The agent computing nodes can be computing nodes under different edge switches. Having a computing node under each edge switch act as the agent computing node that notifies the remaining first computing nodes under that edge switch can further reduce the communication cost of notifying the plurality of first computing nodes to start executing the job. For example, for computing node a+2 through computing node b, compared with having computing node 1 notify these nodes through edge switch 1, aggregation switch 1, and edge switch 2, having the agent computing node notify them through edge switch 2 can effectively reduce the communication cost incurred when notifying computing node a+2 through computing node b to start executing the job, and can also further improve notification efficiency.
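The two-layer fan-out above (head node 1 notifies nodes 2 through a, agent node a+1 notifies nodes a+2 through b) can be sketched as a planning step that splits the remaining nodes into contiguous chunks, one per notifier. The function name and the plan dictionary are invented for illustration:

```python
def plan_two_layer_notify(node_ids, agent_ids):
    """Assign every non-head, non-agent node to a notifier: the head
    (node_ids[0]) and each agent each take one contiguous chunk of the
    remaining nodes, so all chunks can be notified in parallel.
    Returns {notifier id: [node ids it notifies]}."""
    head = node_ids[0]
    rest = [n for n in node_ids[1:] if n not in set(agent_ids)]
    notifiers = [head] + list(agent_ids)
    chunk = -(-len(rest) // len(notifiers))  # ceiling division
    plan = {}
    for i, notifier in enumerate(notifiers):
        plan[notifier] = rest[i * chunk : (i + 1) * chunk]
    return plan
```

With nodes 1 through 8 and node 5 as the agent, the plan reproduces the pattern in the example: node 1 notifies 2, 3, 4 while node 5 notifies 6, 7, 8 in parallel.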
It is worth noting that the above example uses a two-layer tree structure to notify the plurality of first computing nodes to start executing the job. When the number of first computing nodes is large (for example, 1000 or more), the scheduler 102 can also use a tree structure with three or more layers to notify the plurality of first computing nodes to start executing the job. Taking a three-layer tree as an example, the scheduler 102 can send an execution instruction to the head node to notify it to start executing the job. The head node then parses the IP addresses of the first agent computing nodes and the second agent computing nodes from the execution instruction, notifies each first agent computing node, according to its IP address, to start executing the job, and instructs each first agent computing node to notify the second agent computing nodes to start executing the job. Finally, after receiving the notification from a first agent computing node, each second agent computing node can notify the remaining first computing nodes (that is, the first computing nodes other than the head node and the agent computing nodes) in parallel to start executing the job. By way of example, different first agent computing nodes can be connected to different edge switches, a second agent computing node can be a first computing node connected to the same edge switch as its first agent computing node, and the number of second agent computing nodes can be greater than the number of first agent computing nodes. In this way, multiple second agent computing nodes can, through the edge switches they are connected to, notify the remaining first computing nodes under those edge switches to start executing the job.
It is worth noting that when the number of first computing nodes the scheduler 102 needs to allocate to a job is large, the scheduler 102 can determine the plurality of first computing nodes in the manner described above to minimize the communication cost between them. For some jobs in practice, however, the number of first computing nodes the scheduler 102 allocates to the job to be processed is small (for example, 5 first computing nodes); in that case, the scheduler 102 can also randomly select multiple first computing nodes from the computing cluster 101 to process the job. Alternatively, the scheduler 102 can be configured with two resource scheduling modes, where a job submitted to the HPC cluster 100 can carry an identifier of the resource scheduling mode. When the identifier indicates resource scheduling mode 1, the scheduler 102 can use the above process to allocate the plurality of first computing nodes to the job to be processed; when the identifier indicates resource scheduling mode 2, the scheduler 102 can randomly select multiple computing nodes from the computing cluster 101 as the first computing nodes for processing the job.
In actual application scenarios, the scheduler 102 can schedule different resources for multiple different jobs. For example, for job 1 submitted to the HPC cluster 100, the scheduler 102 can determine, from the computing cluster 101, a plurality of first computing nodes for processing job 1. For job 2 submitted to the HPC cluster 100, the scheduler 102 can determine, from the computing cluster 101, a plurality of second computing nodes for processing job 2, where the manner in which the scheduler 102 determines the second computing nodes for job 2 is similar to the manner described above for determining the first computing nodes for job 1; see the foregoing process description for details. Further, when a job has finished executing in the HPC cluster 100, the HPC cluster 100 can return an execution result to the client. The execution result can indicate that the job executed successfully, executed with errors, and so on, and can also include other job-related information, such as intermediate data generated by the HPC cluster 100 during job execution.
In this embodiment, because the scheduler 102, when selecting the plurality of first computing nodes to execute the job, limits the number of switches spanned by the communication connections between them, the communication cost incurred when the first computing nodes execute the job is low: communication does not need to cross a large number of switches. In addition, the network transmission delay incurred during data communication between the first computing nodes is reduced, thereby improving the efficiency with which the first computing nodes execute the job.
As a possible embodiment, there may be one or more jobs to be processed in the method of Figure 4; that is, the scheduler 102 can perform the resource allocation and scheduling process for single jobs one by one, or perform the resource allocation and scheduling process for multiple jobs in batches. In a specific implementation, the scheduler can execute the jobs one by one in the chronological order in which they were obtained. Optionally, the scheduler can also execute the jobs one by one according to the priority of the service associated with each job or according to the degree to which the resources required by each job can be satisfied, where the degree of satisfaction indicates how well the idle resources in the system match the resources required by the job: when the idle resources in the system are greater than or equal to the resources required by the job, it can be determined that the scheduler can execute the scheduling process for the job; when the idle resources in the system are less than the resources required by the job, it can be determined that the scheduler cannot execute the scheduling process for the job, since the system's idle resources cannot meet the resource requirements of the job scheduling process. Optionally, the scheduler can also divide the multiple jobs to be processed into at least one group and execute a batch process for each group, for example separately executing the scheduling process of each job in each group according to the method shown in Figure 4.
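The satisfaction test described above (idle resources versus resources required by a job) reduces to a per-resource-type comparison, and the ordering by priority and satisfaction can be expressed as a sort key. The dictionary layout and function names are illustrative assumptions:

```python
def can_schedule(idle, required):
    """A job is schedulable only if the idle resources meet or exceed
    its requirement for every resource type it asks for."""
    return all(idle.get(res, 0) >= amount for res, amount in required.items())

def order_jobs(jobs, idle):
    """Sketch of one possible ordering: jobs whose requirements the
    idle resources satisfy come first, higher priority first within
    each group. jobs: [(name, priority, required resources)]"""
    return sorted(jobs, key=lambda j: (not can_schedule(idle, j[2]), -j[1]))
```

Jobs that cannot currently be satisfied sort to the back, so the scheduler can skip them until resources free up rather than blocking the queue.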
It is worth noting that other reasonable combinations of steps that a person skilled in the art can conceive based on the above description also fall within the protection scope of this application. In addition, a person skilled in the art should understand that the embodiments described in this specification are preferred embodiments, and the actions involved are not necessarily required by this application.
The resource scheduling method provided by this application is described in detail above with reference to Figures 1 to 5. The resource scheduling apparatus and computing device provided by this application are described below with reference to Figures 6 to 7.
Figure 6 is a schematic structural diagram of a resource scheduling apparatus provided by this application. The resource scheduling apparatus 600 is applied to the scheduler in an HPC cluster; the HPC cluster further includes multiple computing nodes and multiple switches, and the multiple computing nodes are communicatively connected through the multiple switches. As shown in Figure 6, the resource scheduling apparatus 600 may include:
an acquisition module 601, configured to acquire a job to be processed;
a determining module 602, configured to determine, according to the topology of the HPC cluster, multiple first computing nodes from the multiple computing nodes, where the total number of switches traversed by data transmission between the multiple first computing nodes is less than a threshold; and
a notification module 603, configured to notify the multiple first computing nodes to execute the job.
In a possible implementation, the multiple switches include aggregation switches and edge switches; the multiple computing nodes access the HPC cluster through the edge switches, and the edge switches are coupled to the aggregation switches.
In a possible implementation, the determining module 602 is configured to traverse the topology of the HPC cluster and determine multiple computing nodes attached to the same edge switch as the multiple first computing nodes.
In a possible implementation, the determining module 602 is configured to:
traverse the topology of the HPC cluster and determine the number of computing nodes attached to each edge switch; and
when the number of computing nodes attached to every edge switch is less than a first quantity threshold, determine multiple computing nodes attached to multiple edge switches as the multiple first computing nodes, the multiple first computing nodes being communicatively connected through at least one aggregation switch.
In a possible implementation, the multiple switches further include a core switch coupled to the aggregation switches, and the determining module 602 is configured to:
traverse the topology of the HPC cluster and determine the number of computing nodes connected to each aggregation switch through the edge switches; and
when the number of computing nodes connected to every aggregation switch through the edge switches is less than a second quantity threshold, determine multiple computing nodes communicatively connected through at least one core-layer switch as the multiple first computing nodes.
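The implementations above form a hierarchy of fallbacks: prefer first computing nodes that share a single edge switch, then nodes under a single aggregation switch, and finally nodes reachable only across the core layer. For illustration only (the `topology` shape and function name are hypothetical, not part of the disclosure), a minimal sketch of that selection order — here a level is skipped when no single switch at that level can supply the job's demand, which plays the role of the first and second quantity thresholds:

```python
def select_nodes(topology, needed):
    """Pick `needed` free nodes with the fewest switch hops between them.
    `topology` maps aggregation switch -> edge switch -> list of free nodes
    (a simplified two-layer view of the cluster)."""
    # Level 1: all nodes attached to one edge switch.
    for edges in topology.values():
        for nodes in edges.values():
            if len(nodes) >= needed:
                return nodes[:needed]
    # Level 2: nodes under one aggregation switch, spanning edge switches.
    for edges in topology.values():
        pooled = [n for nodes in edges.values() for n in nodes]
        if len(pooled) >= needed:
            return pooled[:needed]
    # Level 3: fall back to nodes communicating via the core layer.
    pooled = [n for edges in topology.values()
              for nodes in edges.values() for n in nodes]
    return pooled[:needed] if len(pooled) >= needed else None
```

Each level bounds the total number of switches traversed between the selected nodes, which is the threshold condition the determining module enforces.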
In a possible implementation, the multiple first computing nodes include a head computing node and at least one agent computing node. The notification module 603 is configured to send an execution instruction to the head computing node, the execution instruction being used to notify the head computing node to execute the job; the at least one agent computing node is notified by the head computing node to execute the job; and the first computing nodes remaining among the multiple first computing nodes, other than the head computing node and the at least one agent computing node, are notified by the at least one agent computing node to execute the job.
In a possible implementation, the at least one agent computing node includes a first agent computing node and a second agent computing node. The first agent computing node is notified by the head computing node to execute the job, the second agent computing node is notified by the first agent computing node to execute the job, and the second agent computing node is further configured to notify the remaining first computing nodes among the multiple first computing nodes to execute the job.
In a possible implementation, the total number of the head computing node and the at least one agent computing node is less than the number of the remaining first computing nodes among the multiple first computing nodes.
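The relayed notification in these implementations can be pictured as a shallow tree: the scheduler sends one execution instruction to the head computing node, the head notifies agent computing nodes, and the agents notify the remaining first computing nodes. A minimal sketch under assumed names (`fanout` is a hypothetical knob, not a parameter from the disclosure):

```python
from collections import deque

def build_notification_tree(nodes, fanout=2):
    """nodes[0] is the head computing node; each already-notified node
    relays the job to up to `fanout` not-yet-notified nodes, so the
    scheduler itself sends only a single message."""
    children = {n: [] for n in nodes}
    pending = deque([nodes[0]])
    i = 1
    while pending and i < len(nodes):
        parent = pending.popleft()
        for _ in range(fanout):
            if i >= len(nodes):
                break
            children[parent].append(nodes[i])
            pending.append(nodes[i])
            i += 1
    return children
```

With seven nodes and `fanout=2`, the head and two agents relay to the four remaining nodes, consistent with the constraint that the head plus agents are fewer than the remaining first computing nodes.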
In a possible implementation, the networking mode of the multiple computing nodes and the multiple switches is one of the following networks:
a two-layer fat-tree network, a three-layer fat-tree network, a two-dimensional mesh network, a three-dimensional mesh network, a two-dimensional ring network, or a three-dimensional ring network.
In a possible implementation, the acquisition module is further configured to acquire first connection information between the computing nodes and the switches in the HPC cluster and second connection information among the multiple switches.
The resource scheduling apparatus 600 further includes:
a generating module 604, configured to generate the topology of the HPC cluster according to the first connection information and the second connection information.
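As a rough illustration of the generating module's task (the function and variable names are hypothetical), the topology can be assembled as an adjacency map from the two kinds of connection information:

```python
def build_topology(node_links, switch_links):
    """Build an undirected adjacency map of the cluster from
    node<->switch links (first connection information) and
    switch<->switch links (second connection information)."""
    topo = {}
    def add(a, b):
        topo.setdefault(a, set()).add(b)
        topo.setdefault(b, set()).add(a)
    for node, switch in node_links:
        add(node, switch)
    for s1, s2 in switch_links:
        add(s1, s2)
    return topo
```

The determining module can then traverse this map to count the switches between any candidate set of first computing nodes.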
It should be understood that the resource scheduling apparatus 600 of the embodiments of this application may be implemented by a CPU, by an application-specific integrated circuit (ASIC), or by a programmable logic device (PLD); the PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. When the resource scheduling methods shown in Figures 1 to 5 are implemented in software, the resource scheduling apparatus 600 and its modules may also be software modules.
The resource scheduling apparatus 600 according to the embodiments of this application may correspond to performing the methods described in the embodiments of this application, and the above and other operations and/or functions of the modules of the resource scheduling apparatus 600 serve to implement the corresponding procedures of the methods in Figure 4; for brevity, they are not repeated here.
In this embodiment, when the scheduler selects the multiple first computing nodes that will execute the job, it limits the number of switches spanned by the communication connections among those nodes. As a result, the communication cost incurred when the first computing nodes execute the job is low, since communication does not have to cross a large number of switches; moreover, the network transmission latency of data communication among the multiple first computing nodes is also reduced, which improves the efficiency with which the first computing nodes execute the job.
Figure 7 is a schematic diagram of a computing device 800 provided by this application. As shown in the figure, the computing device 800 includes a scheduler 700, and the scheduler 700 includes a processor 701, a memory 702, a communication interface 703, and a bus 704. The processor 701, the memory 702, and the communication interface 703 communicate over the bus 704; communication may also be implemented by other means such as wireless transmission. In addition to a data bus, the bus 704 may include a power bus, a control bus, a status signal bus, and the like; for clarity, the various buses are all labeled as bus 704 in the figure.
The processor 701 may be a CPU, or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
The memory 702 may include read-only memory and random access memory, and provides computer instructions and data to the processor 701. The memory 702 may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), used as an external cache. By way of example rather than limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The communication interface 703 is used to communicate with other devices or apparatuses connected to the scheduler 700.
Further, the computing device 800 may also include a communication interface 705 and a bus 706, where the scheduler 700 and the communication interface 705 communicate over the bus 706; communication may likewise be implemented by other means such as wireless transmission. Like the bus 704, the bus 706 may include a power bus, a control bus, and a status signal bus in addition to a data bus; for clarity, the various buses are all labeled as bus 706 in the figure.
The communication interface 705 is used to communicate with other devices connected to the computing device 800, for example with the computing nodes connected to the computing device 800.
It should be understood that the computing device 800 according to the embodiments of this application may correspond to the resource scheduling apparatus 600 in the embodiments of this application, and may correspond to performing the method shown in Figure 4 of the embodiments of this application; the above and other operations and/or functions implemented by the computing device 800 serve to implement the corresponding procedures of the methods in Figure 4 and, for brevity, are not repeated here.
As a possible implementation, this application further provides a scheduler whose structure is that of the scheduler 700 shown in Figure 7, configured to implement the corresponding procedures of the method shown in Figure 4; for brevity, details are not repeated here.
As another possible implementation, this application further provides a chip, composed of electronic components, which can be used to implement the operation steps of the method described in Figure 4 above.
As another possible implementation, this application further provides a chip that may further include a processor, the processor being configured to implement the operation steps of the method described in Figure 4 above.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the foregoing embodiments may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, over coaxial cable, optical fiber, or a digital subscriber line (DSL)) or wirelessly (for example, over infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium; the semiconductor medium may be a solid-state drive (SSD).
The foregoing are merely specific embodiments of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and such modifications or substitutions shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (14)

  1. A resource scheduling method, wherein the method is applied to a high-performance computing (HPC) cluster, the HPC cluster comprises a scheduler, multiple computing nodes, and multiple switches, the multiple computing nodes are communicatively connected through the multiple switches, and the method is performed by the scheduler and comprises:
    acquiring a job to be processed;
    determining, according to a topology of the HPC cluster, multiple first computing nodes from the multiple computing nodes, wherein the total number of switches traversed by data transmission between the multiple first computing nodes is less than a threshold; and
    notifying the multiple first computing nodes to execute the job.
  2. The method according to claim 1, wherein the multiple switches comprise aggregation switches and edge switches, the multiple computing nodes access the HPC cluster through the edge switches, and the edge switches are coupled to the aggregation switches.
  3. The method according to claim 2, wherein the determining, according to the topology of the HPC cluster, multiple first computing nodes from the multiple computing nodes comprises:
    traversing the topology of the HPC cluster, and determining multiple computing nodes attached to the same edge switch as the multiple first computing nodes.
  4. The method according to claim 2, wherein the determining, according to the topology of the HPC cluster, multiple first computing nodes from the multiple computing nodes comprises:
    traversing the topology of the HPC cluster, and determining the number of computing nodes attached to each edge switch; and
    when the number of computing nodes attached to every edge switch is less than a first quantity threshold, determining multiple computing nodes attached to multiple edge switches as the multiple first computing nodes, the multiple first computing nodes being communicatively connected through at least one aggregation switch.
  5. The method according to claim 2, wherein the multiple switches further comprise a core switch, the aggregation switches are coupled to the core switch, and the determining, according to the topology of the HPC cluster, multiple first computing nodes from the multiple computing nodes comprises:
    traversing the topology of the HPC cluster, and determining the number of computing nodes connected to each aggregation switch through the edge switches; and
    when the number of computing nodes connected to every aggregation switch through the edge switches is less than a second quantity threshold, determining multiple computing nodes communicatively connected through at least one core-layer switch as the multiple first computing nodes.
  6. The method according to any one of claims 1 to 5, wherein the multiple first computing nodes comprise a head computing node and at least one agent computing node, and the notifying the multiple first computing nodes to execute the job comprises:
    sending an execution instruction to the head computing node, the execution instruction being used to notify the head computing node to execute the job, wherein the at least one agent computing node is notified by the head computing node to execute the job, and the first computing nodes remaining among the multiple first computing nodes other than the head computing node and the at least one agent computing node are notified by the at least one agent computing node to execute the job.
  7. The method according to claim 6, wherein the at least one agent computing node comprises a first agent computing node and a second agent computing node, the first agent computing node is notified by the head computing node to execute the job, the second agent computing node is notified by the first agent computing node to execute the job, and the second agent computing node is further configured to notify the remaining first computing nodes among the multiple first computing nodes to execute the job.
  8. The method according to claim 6 or 7, wherein the total number of the head computing node and the at least one agent computing node is less than the number of the remaining first computing nodes among the multiple first computing nodes.
  9. The method according to any one of claims 1 to 8, wherein the networking mode of the multiple computing nodes and the multiple switches is one of the following networks:
    a two-layer fat-tree network, a three-layer fat-tree network, a two-dimensional mesh network, a three-dimensional mesh network, a two-dimensional ring network, or a three-dimensional ring network.
  10. The method according to any one of claims 1 to 9, wherein the method further comprises:
    acquiring first connection information between the computing nodes and the switches in the HPC cluster and second connection information among the multiple switches; and
    generating the topology of the HPC cluster according to the first connection information and the second connection information.
  11. A resource scheduling apparatus, wherein the resource scheduling apparatus is applied to a scheduler in a high-performance computing (HPC) cluster, the HPC cluster further comprises multiple computing nodes and multiple switches, the multiple computing nodes are communicatively connected through the multiple switches, and the resource scheduling apparatus comprises:
    an acquisition module, configured to acquire a job to be processed;
    a determining module, configured to determine, according to a topology of the HPC cluster, multiple first computing nodes from the multiple computing nodes, wherein the total number of switches traversed by data transmission between the multiple first computing nodes is less than a threshold; and
    a notification module, configured to notify the multiple first computing nodes to execute the job.
  12. A scheduler, comprising a processor and a memory, wherein the memory is configured to store computer instructions, and the processor is configured to perform, according to the computer instructions, the operation steps of the method according to any one of claims 1 to 10.
  13. A computing device, comprising a scheduler, wherein the scheduler comprises a processor and a memory; the memory is configured to store computer instructions, and the processor is configured to perform, according to the computer instructions, the operation steps of the method according to any one of claims 1 to 10.
  14. A computer-readable storage medium, comprising instructions that, when run on a computing device, cause the computing device to perform the operation steps of the method according to any one of claims 1 to 10.
PCT/CN2023/080047 2022-03-08 2023-03-07 Resource scheduling method, apparatus, and related device WO2023169408A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210227649.1 2022-03-08
CN202210227649.1A CN116775258A (en) 2022-03-08 2022-03-08 Resource scheduling method and device and related equipment

Publications (1)

Publication Number Publication Date
WO2023169408A1 (en)




Also Published As

Publication number Publication date
CN116775258A (en) 2023-09-19


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23765986

Country of ref document: EP

Kind code of ref document: A1