WO2023169408A1 - Resource scheduling method, apparatus, and related device - Google Patents

Resource scheduling method, apparatus, and related device

Info

Publication number
WO2023169408A1
Authority
WO
WIPO (PCT)
Prior art keywords
computing nodes
computing
job
switches
scheduler
Prior art date
Application number
PCT/CN2023/080047
Other languages
French (fr)
Chinese (zh)
Inventor
申鹏
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Publication of WO2023169408A1 publication Critical patent/WO2023169408A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources to service a request
    • G06F 9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F 9/5083 Techniques for rebalancing the load in a distributed system

Abstract

A resource scheduling method applied to an HPC cluster comprising a scheduler, a plurality of computing nodes, and a plurality of switches, wherein the plurality of computing nodes are communicatively connected through the plurality of switches. When performing resource scheduling, the scheduler obtains a job to be processed; the scheduler determines a plurality of first computing nodes from among the plurality of computing nodes according to the topology of the HPC cluster, wherein the total number of switches traversed for data transmission among the determined first computing nodes is less than a threshold; the scheduler then notifies the plurality of first computing nodes to execute the job. Because the scheduler limits the number of switches spanned by the communication connections among the selected first computing nodes, the communication cost incurred when the first computing nodes execute the job is lower, and the network transmission latency of data communication among the first computing nodes is also reduced.

Description

Resource scheduling method, apparatus, and related device
This application claims priority to the Chinese patent application filed with the China Patent Office on March 8, 2022, with application number 202210227649.1 and the title "Resource scheduling method, apparatus, and related device", the entire content of which is incorporated herein by reference.
Technical field
This application relates to the field of resource management technology, and in particular, to a resource scheduling method, apparatus, and related device.
Background
A high performance computing (HPC) cluster is a computer cluster in which multiple computer systems are interconnected through various interconnection technologies to perform high-speed computing; HPC clusters are widely applied in fields such as weather forecasting, computational simulation, and image processing.
Typically, an HPC cluster is configured with a scheduler. For a job submitted to the HPC cluster, the scheduler can randomly select multiple computing nodes from the cluster and assign them to the job, so that the randomly selected computing nodes cooperatively execute the job through data communication with one another. Different computing nodes are communicatively connected through one or more switches.
However, this resource scheduling approach easily leads to high communication costs among the multiple computing nodes and low execution efficiency when the HPC cluster executes a job. Therefore, how to reduce the communication cost incurred when multiple computing nodes execute a job and how to improve job-execution efficiency have become important problems that urgently need to be solved.
Summary
This application provides a resource scheduling method, apparatus, scheduler, computing device, computer-readable storage medium, and computer program product, to reduce the communication cost incurred when multiple computing nodes execute a job and to improve job-execution efficiency.
In a first aspect, this application provides a resource scheduling method applied to an HPC cluster that includes a scheduler, multiple computing nodes, and multiple switches, the computing nodes being communicatively connected through the switches. When performing resource scheduling, the scheduler first obtains one or more jobs to be processed; for each job, the scheduler determines multiple first computing nodes from the multiple computing nodes according to the topology of the HPC cluster, where the total number of switches traversed for data transmission among the determined first computing nodes is less than a threshold, and the scheduler then notifies the first computing nodes to execute the job.
Because the scheduler, when selecting the multiple first computing nodes that execute each job, limits the number of switches spanned by the communication connections among those nodes, the communication cost incurred when they execute the job is lower, since communication does not have to cross a large number of switches; the network transmission latency of data communication among the first computing nodes is also reduced, which improves the efficiency with which the first computing nodes execute the job.
There may be one or more jobs to be processed; when there are multiple jobs, the scheduler can schedule resources for each of them according to the resource scheduling method introduced in the first aspect.
In a possible implementation, the networking between the multiple computing nodes and the multiple switches in the HPC cluster may be a two-layer fat-tree network structure. Specifically, the multiple switches in the HPC cluster include aggregation switches and edge switches; the computing nodes in the HPC cluster access the cluster through the edge switches, and the edge switches are coupled to the aggregation switches.
In a possible implementation, when determining the multiple first computing nodes from the computing nodes, the scheduler may traverse the topology of the HPC cluster and determine multiple computing nodes that access the same edge switch as the first computing nodes. In this way, the determined first computing nodes all communicate through the same edge switch, which not only keeps the communication cost incurred when they execute the job low, but also keeps the network transmission latency of data communication among them low, thereby improving the efficiency with which the first computing nodes execute the job.
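As an illustration of this implementation, a minimal Python sketch of picking first computing nodes that all attach to one edge switch is given below; the node and switch names, and the `edge_of` mapping, are assumptions for illustration, not data structures taken from the application.

```python
from collections import defaultdict

def pick_same_edge_nodes(edge_of, free_nodes, needed):
    """Group free computing nodes by the edge switch they attach to; if any
    single edge switch has enough free nodes, return `needed` nodes that all
    share that switch (so their traffic never leaves that edge switch)."""
    by_edge = defaultdict(list)
    for node in free_nodes:
        by_edge[edge_of[node]].append(node)
    for switch, nodes in by_edge.items():
        if len(nodes) >= needed:
            return nodes[:needed]
    return None  # no single edge switch can satisfy the request

# Hypothetical topology: node -> edge switch it is cabled to.
edge_of = {"n1": "e1", "n2": "e1", "n3": "e2", "n4": "e1"}
print(pick_same_edge_nodes(edge_of, ["n1", "n2", "n3", "n4"], 3))
# → ['n1', 'n2', 'n4']
```

When no edge switch has enough free nodes, the function returns `None`, which is where the multi-switch fallbacks of the later implementations would take over.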
In a possible implementation, when determining the multiple first computing nodes, the scheduler may traverse the topology of the HPC cluster to determine the number of computing nodes accessing each edge switch; when the number of computing nodes accessing each edge switch is less than a first quantity threshold, the scheduler determines multiple computing nodes accessing multiple edge switches as the first computing nodes, and these first computing nodes are communicatively connected through at least one aggregation switch. In this way, when the number of computing nodes accessing any single edge switch is small, the scheduler can select computing nodes under multiple edge switches as the first computing nodes; while meeting the job's demand for computing nodes, constraining the number of edge switches and aggregation switches traversed for data transmission among the first computing nodes reduces, as much as possible, the communication cost of executing the job and the network transmission latency of data communication among the first computing nodes.
In a possible implementation, the multiple switches in the HPC cluster further include core switches, and different aggregation switches are coupled through a core switch. In this case, when determining the multiple first computing nodes, the scheduler may traverse the topology of the HPC cluster to determine the number of computing nodes connected to each aggregation switch through edge switches; when, for every aggregation switch, this number is less than a second quantity threshold, the scheduler determines multiple computing nodes that are communicatively connected through at least one core switch as the first computing nodes. In this way, when the number of computing nodes connected to any single aggregation switch is small, the scheduler can select computing nodes under multiple aggregation switches as the first computing nodes; while meeting the job's demand for computing nodes, constraining the number of edge switches, aggregation switches, and core switches traversed for data transmission among the first computing nodes reduces, as much as possible, the communication cost of executing the job and the network transmission latency of data communication among the first computing nodes.
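The tiered preference described across these implementations (one edge switch, then edge switches under one aggregation switch, then spanning aggregation switches via a core switch) can be sketched as follows; the `(aggregation_switch, free_nodes)` representation and all names are illustrative assumptions, not the application's data model.

```python
from collections import defaultdict

def select_first_nodes(edges, needed):
    """Tiered, topology-aware selection sketch.
    `edges` is a list of (aggregation_switch, free_nodes) pairs, one per
    edge switch. Tier 1: a single edge switch suffices. Tier 2: combine
    edge switches hanging off the same aggregation switch. Tier 3: span
    aggregation switches (traffic then crosses a core switch)."""
    for _, nodes in edges:                           # tier 1
        if len(nodes) >= needed:
            return nodes[:needed]
    by_agg = defaultdict(list)                       # tier 2
    for agg, nodes in edges:
        by_agg[agg].extend(nodes)
    for nodes in by_agg.values():
        if len(nodes) >= needed:
            return nodes[:needed]
    pool = [n for _, nodes in edges for n in nodes]  # tier 3
    return pool[:needed] if len(pool) >= needed else None

edges = [("a1", ["n1", "n2"]), ("a1", ["n3"]), ("a2", ["n4"])]
print(select_first_nodes(edges, 3))
# → ['n1', 'n2', 'n3']  (tier 2: both edge switches sit under a1)
```

Each tier only widens the search when the previous one cannot satisfy the request, which matches the intent of keeping the number of traversed switches as small as possible.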
In a possible implementation, the multiple first computing nodes determined by the scheduler include a head computing node and at least one agent computing node. When having the first computing nodes execute the job, the scheduler may send an execution instruction to the head computing node to notify it to execute the job; the at least one agent computing node is notified to execute the job by the head computing node, and the first computing nodes remaining after the head computing node and the at least one agent computing node are notified to execute the job by the agent computing nodes. In this way, the scheduler only needs to send an execution instruction to a single head computing node to trigger the multiple first computing nodes to execute the job; moreover, when there are multiple agent computing nodes, they can notify the remaining first computing nodes simultaneously, which increases the concurrency of notification and thus improves the efficiency with which the first computing nodes execute the job.
In a possible implementation, the at least one agent computing node includes a first agent computing node and a second agent computing node. The first agent computing node is notified to execute the job by the head computing node, and the second agent computing node is notified by the first agent computing node; the second agent computing node is further used to notify the remaining first computing nodes (that is, the first computing nodes other than the head computing node, the first agent computing node, and the second agent computing node) to execute the job. In this way, agent computing nodes organized in at least two layers can further improve the efficiency of notifying the first computing nodes to start executing the job.
In a possible implementation, the total number of the head computing node and the at least one agent computing node is less than the number of the remaining first computing nodes. In this way, a small number of computing nodes can notify a larger number of computing nodes in parallel, which effectively increases the concurrency of notifying the first computing nodes to start executing the job and thus improves the efficiency with which the multiple first computing nodes start executing it.
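The head/agent notification structure above can be sketched as a small tree builder; picking the first node as head, the next few as agents, and the round-robin split are illustrative choices of this sketch, and it assumes at least one agent (`n_agents >= 1`) when there are leaf nodes to notify.

```python
def build_launch_tree(first_nodes, n_agents):
    """Sketch of the head/agent notification structure: the first node
    acts as head, the next `n_agents` nodes act as agents, and the
    remaining nodes are divided round-robin among the agents so the
    agents can notify them in parallel."""
    head, rest = first_nodes[0], first_nodes[1:]
    agents, leaves = rest[:n_agents], rest[n_agents:]
    fanout = {agent: [] for agent in agents}
    for i, leaf in enumerate(leaves):
        fanout[agents[i % len(agents)]].append(leaf)
    return head, fanout

head, fanout = build_launch_tree(["n1", "n2", "n3", "n4", "n5", "n6", "n7"], 2)
print(head, fanout)
# → n1 {'n2': ['n4', 'n6'], 'n3': ['n5', 'n7']}
```

With 2 agents fanning out to 4 leaves, the scheduler sends one instruction while four notifications proceed in two parallel streams, which is the concurrency gain the implementation describes.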
In a possible implementation, the networking of the multiple computing nodes and the multiple switches includes one of the following networks: a two-layer fat-tree network, a three-layer fat-tree network, a two-dimensional mesh network, a three-dimensional mesh network, a two-dimensional torus network, or a three-dimensional torus network. In network structures of any of these networking modes, the approach above can be used to reduce the communication cost of notifying the multiple computing nodes that execute a job and to improve the efficiency with which the multiple computing nodes execute it, thereby increasing the flexibility of implementing the solution.
In a possible implementation, the scheduler may further obtain first connection information between the computing nodes and the switches in the HPC cluster and second connection information among the multiple switches, and generate the topology of the HPC cluster according to the first connection information and the second connection information, so that the scheduler can subsequently schedule computing nodes for jobs to be processed based on that topology.
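A minimal sketch of assembling the topology from the two kinds of connection information follows; representing each piece of connection information as a list of link pairs, and the topology as an undirected adjacency map, are assumptions of this sketch.

```python
def build_topology(first_connections, second_connections):
    """Build an undirected adjacency map of the cluster from
    node-to-switch links (first connection information) and
    switch-to-switch links (second connection information)."""
    topo = {}
    for a, b in list(first_connections) + list(second_connections):
        topo.setdefault(a, set()).add(b)
        topo.setdefault(b, set()).add(a)
    return topo

topo = build_topology([("n1", "e1"), ("n2", "e1")], [("e1", "a1")])
print(sorted(topo["e1"]))
# → ['a1', 'n1', 'n2']
```

The resulting map is the structure a scheduler could traverse when counting the switches between candidate computing nodes.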
Optionally, the topology of the HPC cluster may also be generated in other ways; for example, it may be generated by technical personnel and configured in the scheduler. This embodiment does not limit this.
In a second aspect, this application further provides a resource scheduling apparatus, which includes modules for executing the resource scheduling method in the first aspect or any possible implementation of the first aspect.
In a third aspect, this application further provides a scheduler, including a processor and a memory. The memory is used to store computer instructions, and the processor is used to execute, according to the computer instructions stored in the memory, the resource scheduling method in the first aspect or any implementation of the first aspect. It should be noted that the memory may be integrated into the processor or be independent of the processor. The computing device may further include a bus, through which the processor is connected to the memory. The memory may include readable memory and random access memory.
In a fourth aspect, this application further provides a computing device that includes a scheduler, the scheduler including a processor and a memory; the memory is used to store computer instructions, and the processor is used to execute, according to the computer instructions stored in the memory, the resource scheduling method in the first aspect or any implementation of the first aspect.
In a fifth aspect, this application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to perform the operational steps of the method described in the first aspect or any implementation of the first aspect.
In a sixth aspect, this application provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the operational steps of the method described in the first aspect or any implementation of the first aspect.
On the basis of the implementations provided in the above aspects, this application can be further combined to provide more implementations.
Description of the drawings
Figure 1 is a schematic architectural diagram of an exemplary HPC cluster provided by this application;
Figure 2 is a schematic diagram of an exemplary computing cluster 101 based on a two-layer fat-tree network structure provided by this application;
Figure 3 is a schematic diagram of an exemplary computing cluster 101 based on a three-layer fat-tree network structure provided by this application;
Figure 4 is a schematic flowchart of a resource scheduling method provided by this application;
Figure 5a is a schematic diagram of the topology corresponding to the two-layer fat-tree networking structure provided by this application;
Figure 5b is a schematic diagram of the topology corresponding to the three-layer fat-tree networking structure provided by this application;
Figure 6 is a schematic structural diagram of a resource scheduling apparatus provided by this application;
Figure 7 is a schematic diagram of the hardware structure of a computing device provided by this application.
Detailed description
The technical solutions in this application are described below with reference to the drawings in the embodiments of this application.
Refer to Figure 1, which is a schematic diagram of an HPC cluster provided by this application. As shown in Figure 1, the HPC cluster 100 includes a computing cluster 101 and a scheduler 102. The computing cluster 101 includes multiple computing nodes, which are communicatively connected through multiple switches. The scheduler 102 may be configured with a queue used to temporarily store one or more jobs submitted to the HPC cluster, such as data aggregation computing jobs. For each job in the queue, the scheduler 102 can schedule a certain number of computing nodes from the computing cluster 101 so that the scheduled computing nodes execute the job.
For example, the computing nodes in the computing cluster 101 can be implemented by devices with computing capabilities, such as servers. Specifically, a device with computing capabilities may be provided with one or more processors, implemented, for example, with a central processing unit (CPU), an application-specific integrated circuit (ASIC), or a programmable logic device (PLD); the PLD may be a complex programmable logic device (CPLD), an FPGA, generic array logic (GAL), or any combination thereof.
The scheduler 102 can be implemented in hardware or software. When implemented in hardware, the scheduler 102 may be, for example, a server or another device with data processing capabilities. When implemented in software, the scheduler 102 may be, for example, an application running on a computing device.
It is worth noting that the HPC cluster 100 may further be configured with nodes having other functions. For example, as shown in Figure 1, the HPC cluster may also include a management node 103, which is used to manage the computing nodes in the computing cluster 101. Moreover, the architecture of the HPC cluster 100 shown in Figure 1 is only an example; this application can also be applied to other applicable HPC cluster architectures.
In practical applications, the multiple computing nodes and multiple switches in the HPC cluster 100 can be networked in various ways to construct the computing cluster 101. In one example, the networking between the computing nodes and the switches may be a two-layer fat-tree network structure, as shown in Figure 2 (Figure 2 uses 8 switches and n computing nodes as an example). In a two-layer fat-tree network, the switches fall into two categories: edge switches and aggregation switches. Each computing node accesses the HPC cluster 100 through an edge switch, and the edge switches are coupled to the aggregation switches, so that different computing nodes connected to the same edge switch can communicate through that edge switch; in Figure 2, computing node 1 and computing node 2 can communicate through edge switch 1. Different edge switches can communicate through the aggregation switches; in Figure 2, edge switch 1 and edge switch 2 can communicate through aggregation switch 1, or through aggregation switch 2 or aggregation switch 3.
In another example, the networking between the computing nodes and the switches may be a three-layer fat-tree network structure, as shown in Figure 3 (Figure 3 uses 12 switches and n computing nodes as an example). On top of the two-layer fat-tree network, the switches in a three-layer fat-tree network include not only edge switches and aggregation switches but also core switches; a core switch is coupled to the aggregation switches and is used to implement data communication between different aggregation switches. In Figure 3, aggregation switch 2 and aggregation switch 3 can communicate through core switch 1, or through core switch 2.
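In such fat-tree structures, the number of switches traversed between two computing nodes can be estimated with a shortest-path search over the topology graph (computing nodes as leaves, switches as interior vertices). This is an illustrative sketch, not an algorithm taken from the application; the adjacency-map representation and node names are assumptions.

```python
from collections import deque

def switches_between(topo, src, dst):
    """Breadth-first search over an undirected adjacency map
    (vertex -> set of neighbours). Every interior vertex on a
    node-to-node path is a switch, so the number of switches traversed
    equals the shortest path length minus one."""
    queue, seen = deque([(src, 0)]), {src}
    while queue:
        vertex, dist = queue.popleft()
        if vertex == dst:
            return dist - 1
        for nxt in topo.get(vertex, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # unreachable

# Two-layer fat-tree fragment: n1, n2 under e1; n3 under e2; e1, e2 under a1.
topo = {"n1": {"e1"}, "n2": {"e1"}, "n3": {"e2"},
        "e1": {"n1", "n2", "a1"}, "e2": {"n3", "a1"}, "a1": {"e1", "e2"}}
print(switches_between(topo, "n1", "n2"))  # → 1 (edge switch only)
print(switches_between(topo, "n1", "n3"))  # → 3 (edge, aggregation, edge)
```

A scheduler enforcing the threshold from the first aspect could sum this count over all pairs of candidate first computing nodes and reject candidate sets whose total is too large.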
当然,在其它示例中,多个计算节点与多个交换机之间的组网方式也可以是二维网格(2D mesh)网络、三维网格(3D mesh)网络、二维环网(2D torus)、三维环网(3D torus)等,本申请实施例对此并不进行限定。 Of course, in other examples, the networking method between multiple computing nodes and multiple switches may also be a two-dimensional mesh (2D mesh) network, a three-dimensional mesh (3D mesh) network, or a two-dimensional torus network (2D torus). ), 3D torus, etc., the embodiments of the present application are not limited to this.
调度器102在为队列中的作业分配计算资源时,可以从计算集群101中随机选取多个计算节点,并指示该多个计算节点执行该作业。具体实现时,调度器102可以向其中一个计算节点发送启动执行作业的指令,并由该计算节点触发其余计算节点启动执行该作业。但是,该计算节点与其余各个计算节点之间可能通过多个交换机实现通信连接,这使得该计算节点在触发其余计算节点启动执行该作业时,需要跨交换机进行通信,导致多个计算节点执行作业所产生的通信成本较高,而且,计算节点之间跨交换机通信的网络传输时延较高,从而导致多个计算节点执行作业的效率较低。When allocating computing resources to a job in the queue, the scheduler 102 may randomly select multiple computing nodes from the computing cluster 101 and instruct the multiple computing nodes to execute the job. During specific implementation, the scheduler 102 can send an instruction to start executing the job to one of the computing nodes, and the computing node triggers the other computing nodes to start executing the job. However, communication connections between the computing node and other computing nodes may be realized through multiple switches, which requires the computing node to communicate across switches when triggering other computing nodes to start executing the job, resulting in multiple computing nodes executing the job. The resulting communication cost is high, and the network transmission delay for cross-switch communication between computing nodes is high, resulting in low efficiency in executing jobs on multiple computing nodes.
基于此,本申请实施例提供了一种资源调度方法,调度器102在为待处理的作业调度计算资源时,根据HPC集群100的拓扑结构,从计算集群101中确定出一个计算节点集合,计算节点集合包括至少一个计算节点,计算节点集合中计算节点用于执行待处理的作业,并通知计算节点集合中的计算节点执行该作业。为了便于描述,可以将执行上述作业的计算节点称为第一计算节点,也即计算节点集合中包括至少一个第一计算节点。其中,该多个第一计算节点之间数据传输所经过的交换机的总数量小于阈值。由于调度器102在选取执行作业的多个第一计算节点时,限制了该多个第一计算节点之间实现通信连接时所跨越的交换机的数量,这使得该多个第一计算节点执行作业所产生的通信成本较低,即可以不用跨越较多数量的交换机进行通信,而且,多个第一计算节点之间进行数据通信时所产生的网络传输时延也可以得到降低,以此可以提高第一计算节点执行作业的效率。Based on this, embodiments of the present application provide a resource scheduling method. When scheduling computing resources for a job to be processed, the scheduler 102 determines a set of computing nodes from the computing cluster 101 according to the topology of the HPC cluster 100, and calculates The node set includes at least one computing node, and the computing nodes in the computing node set are used to execute the pending job, and notify the computing nodes in the computing node set to execute the job. For convenience of description, the computing node that performs the above job may be called a first computing node, that is, the computing node set includes at least one first computing node. Wherein, the total number of switches through which data transmission between the plurality of first computing nodes passes is less than a threshold. When the scheduler 102 selects a plurality of first computing nodes to execute a job, it limits the number of switches across the communication connections between the plurality of first computing nodes, which causes the plurality of first computing nodes to execute the job. The resulting communication cost is low, that is, there is no need to communicate across a large number of switches, and the network transmission delay generated during data communication between multiple first computing nodes can also be reduced, which can improve The efficiency of the first computing node in executing the job.
下面,结合附图进一步介绍本申请提供的调度器102为作业调度计算资源的过程。Next, the process of scheduling computing resources for jobs by the scheduler 102 provided by the present application will be further introduced with reference to the accompanying drawings.
Referring to Figure 4, Figure 4 is a schematic flowchart of a resource scheduling method provided by an embodiment of this application. The method can be applied to the HPC cluster 100 shown in Figure 1, or to HPC clusters of other architectures. In this embodiment, the method is described by way of example as being applied to the HPC cluster 100 and executed by the scheduler 102 in the HPC cluster 100. As shown in Figure 4, the method may specifically include the following steps.
S401: The scheduler 102 obtains a job to be processed.
In this embodiment, for a job submitted to the HPC cluster 100, the scheduler 102 may allocate the computing resources for processing the job.
In a possible implementation, the HPC cluster 100 may provide an external interactive interface, which may be presented to a user through a client. The client can then generate a corresponding job based on the user's operations on the interface and submit it to the scheduler 102, for example, a job submitted through the message passing interface (MPI). When the number of jobs submitted to the scheduler 102 is small, the scheduler 102 can directly allocate computing resources to each submitted job, so that the job is executed using the allocated resources. When the number of submitted jobs is large, jobs may be temporarily stored in a queue in the scheduler 102. After finishing scheduling computing resources for the current job, the scheduler 102 can take the next job out of the queue, for example in the order in which jobs were submitted to the scheduler 102 or according to job priority, and schedule the corresponding computing resources for the job taken from the queue.
S402: The scheduler 102 determines a plurality of first computing nodes from the multiple computing nodes included in the computing cluster 101 according to the topology of the HPC cluster, where the total number of switches through which data transmission between the plurality of first computing nodes passes is less than a threshold.
In this embodiment, when allocating computing resources to a job to be processed, the scheduler 102 can preferentially select computing nodes whose network locations are relatively concentrated as the first computing nodes for processing the job, thereby minimizing the communication cost incurred when the plurality of first computing nodes process the job.
In a specific implementation, before scheduling computing resources for a job, the scheduler 102 can read the topology of the HPC cluster locally, or receive the topology sent by the management node 103. The topology indicates the connection relationships between the computing nodes in the computing cluster 101 and the switches. The topology obtained by the scheduler 102 includes the internet protocol (IP) addresses of the computing nodes and the identifiers of the switches. For example, when a two-layer fat-tree network structure is used for networking, as shown in Figure 5a, the topology can take the form "compute node IP address: [edge switch name]"; when a three-layer fat-tree network structure is used, as shown in Figure 5b, the topology can take the form "compute node IP address: [edge switch name] [aggregation switch name, aggregation switch name, ...]", where an edge switch can interconnect with one or more core switches.
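As an illustration of the two record formats above, the topology could be held in memory as a mapping from compute-node IP address to the list of switch names on the node's uplink path. The dictionary layout, IP addresses, and switch names below are hypothetical examples, not part of this application:

```python
# Hypothetical in-memory form of the topology records described above.
# Two-layer fat tree: each node maps to its edge switch only.
topology_two_layer = {
    "10.0.0.1": ["edge1"],
    "10.0.0.2": ["edge1"],
    "10.0.1.1": ["edge2"],
}

# Three-layer fat tree: each node maps to its edge switch followed by
# the switches that edge switch is attached to at the next layer.
topology_three_layer = {
    "10.0.0.1": ["edge1", "agg1", "agg2"],
    "10.0.1.1": ["edge2", "agg1"],
}

def edge_switch_of(topology, node_ip):
    """Return the edge switch a compute node accesses the cluster through."""
    return topology[node_ip][0]
```

In both layouts the first entry of each list is the edge switch, so a scheduler can group nodes by edge switch without caring which fat-tree depth is in use.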
In practice, the scheduler 102 or the management node 103 can collect information about the computing nodes and switches in the computing cluster 101 and generate the topology of the HPC cluster. Taking topology generation by the scheduler 102 as an example, the scheduler 102 can obtain first connection information, which indicates the connection relationships between the computing nodes and the switches in the HPC cluster 100, and second connection information, which indicates the connection relationships between different switches, and then generate the topology of the HPC cluster from the first and second connection information. In practice, a user may also manually generate the topology based on the interconnection relationships between the computing nodes and the switches and configure it in the scheduler 102. This embodiment does not limit the specific manner in which the scheduler 102 obtains the topology.
The scheduler 102 can then traverse the topology of the HPC cluster and determine a plurality of first computing nodes from the computing cluster 101, where the total number of switches through which data transmission between the plurality of first computing nodes passes is less than a threshold. The threshold may be a fixed value, for example preconfigured in the scheduler 102 by an administrator; alternatively, the threshold may be determined in real time by the scheduler 102 according to the job to be processed, and the thresholds for different jobs may differ. For example, for job 1 the threshold limiting the number of switches may be a value a, while for job 2 the threshold may be a value b. Likewise, the number of first computing nodes the scheduler 102 allocates to a job may be a fixed value, or may be determined by the scheduler 102 according to the job; for example, the scheduler 102 may allocate 32 first computing nodes to job 3 and 64 first computing nodes to job 4. The scheduler 102 allocates the determined plurality of first computing nodes to the job so that they can subsequently execute it.
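The threshold check described above can be sketched as a switch count over the topology mapping: in a two-layer fat tree, nodes under one edge switch communicate through that single switch, while nodes spread over several edge switches additionally cross an aggregation switch. The one-aggregation-switch assumption and the function names below are simplifications for illustration, not the claimed algorithm:

```python
def total_switches(topology, node_ips):
    """Count the switches used by pairwise traffic within a node set,
    assuming a two-layer fat tree in which any two edge switches are
    joined by a single aggregation switch."""
    edges = {topology[ip][0] for ip in node_ips}
    if len(edges) <= 1:
        return len(edges)  # all traffic stays on one edge switch
    return len(edges) + 1  # the edge switches plus one aggregation switch

def fits_threshold(topology, node_ips, threshold):
    """True if the candidate node set satisfies the switch-count limit."""
    return total_switches(topology, node_ips) < threshold
```

A scheduler can apply `fits_threshold` to each candidate set while traversing the topology and keep only sets below the per-job threshold.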
By way of example, this embodiment provides the following scheduling strategies for determining the first computing nodes.
In the first scheduling strategy, the scheduler 102 can determine the number of computing nodes connected to each edge switch by traversing the topology of the HPC cluster. When the number of computing nodes connected to some edge switch reaches a first quantity threshold, the scheduler 102 can allocate multiple computing nodes connected to that edge switch to the job as the first computing nodes for processing the job. In this case, the total number of switches through which data transmission between the first computing nodes passes is 1; that is, different first computing nodes can communicate through this edge switch alone.
In a further implementation, some computing nodes may be heavily loaded, or different jobs may have different requirements for computing nodes; for example, some jobs have strict latency requirements for job processing while others do not. Based on this, when traversing the topology, the scheduler 102 can take into account information such as the load on each computing node and the job requirements to determine the available computing nodes capable of processing the job (for example, lightly loaded nodes, or nodes with low processing latency), and then allocate multiple available computing nodes connected to the same edge switch to the job as the first computing nodes.
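The first strategy, including the availability filtering just described, can be sketched as a scan over edge switches for one whose pool of available nodes already covers the request. The helper name and the pre-filtered `available` list are illustrative assumptions:

```python
from collections import defaultdict

def pick_single_edge_switch(topology, available, wanted):
    """First strategy: group the available nodes by edge switch and
    return a full allocation from one switch if any switch has enough
    of them; otherwise return None so another strategy can be tried.
    `available` is assumed to already exclude overloaded or otherwise
    unsuitable nodes."""
    by_edge = defaultdict(list)
    for ip in available:
        by_edge[topology[ip][0]].append(ip)
    for edge, nodes in by_edge.items():
        if len(nodes) >= wanted:
            return nodes[:wanted]  # all traffic crosses this one switch
    return None
```

Returning `None` rather than a partial allocation lets the caller fall back to the second strategy when no single edge switch suffices.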
In the second scheduling strategy, the scheduler 102 can traverse the topology of the HPC cluster to determine the number of computing nodes connected to each edge switch. When the number of computing nodes connected to every edge switch is less than the first quantity threshold, the scheduler 102 can allocate computing nodes under multiple edge switches to the job. In this case, to reduce the communication cost between the computing nodes allocated to the job, the scheduler 102 can limit the number of switches (for example, aggregation switches) through which data communication between different edge switches passes, and allocate to the job, as the first computing nodes, the computing nodes under the group of edge switches whose mutual data communication passes through the fewest switches. For example, suppose the first quantity threshold is 80 and, by traversing the topology, the scheduler 102 determines that 64 computing nodes are connected to edge switch 1, 32 computing nodes are connected to edge switch 2, and 32 computing nodes are connected to edge switch 3, where data transmission between edge switch 1 and edge switch 2 passes through one aggregation switch, and data transmission between edge switch 1 and edge switch 3 passes through two aggregation switches. Then, when allocating 80 computing nodes to the job, the scheduler 102 can allocate the 64 computing nodes connected to edge switch 1 and 16 of the nodes connected to edge switch 2 to the job as the first computing nodes. In this way, subsequent data transmission between the 80 first computing nodes passes only through edge switch 1, or only through edge switch 2, or through edge switch 1, one aggregation switch, and edge switch 2, so that the communication cost between the 80 first computing nodes allocated to the job is minimized.
Further, when multiple candidate groups of edge switches involve the same number of aggregation switches for data transmission, the scheduler 102 can also take the physical locations of the edge switches into account and preferentially allocate to the job the computing nodes under edge switches that are physically close to each other, to further reduce the communication cost between the first computing nodes. Alternatively, the scheduler 102 can take the load of the edge switches into account and preferentially allocate the computing nodes under lightly loaded edge switches to the job as the first computing nodes, so as to balance the load of the edge switches in the network. This embodiment does not limit this.
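The worked example of the second strategy (64 nodes under edge switch 1 topped up with 16 under edge switch 2) amounts to filling the request from the best-populated edge switch first and then from the switches reachable through the fewest aggregation switches. The sketch below assumes a pairwise aggregation-hop table is available and ignores the physical-location and load tie-breakers; it is an illustration, not the claimed algorithm:

```python
def pick_across_edge_switches(by_edge, agg_hops, wanted):
    """Second strategy: start from the edge switch with the most
    available nodes, then add nodes from other switches ordered by the
    number of aggregation switches between them and the first switch.
    by_edge: {edge switch name: [node ids under that switch]}
    agg_hops: {(edge_a, edge_b): aggregation switches between them}"""
    base = max(by_edge, key=lambda e: len(by_edge[e]))
    picked = list(by_edge[base][:wanted])
    others = sorted(
        (e for e in by_edge if e != base),
        key=lambda e: agg_hops.get((base, e), agg_hops.get((e, base), 0)),
    )
    for edge in others:
        if len(picked) >= wanted:
            break
        picked.extend(by_edge[edge][: wanted - len(picked)])
    return picked if len(picked) >= wanted else None
```

With the figures from the example (64/32/32 nodes, one hop to edge switch 2 and two to edge switch 3), the sketch reproduces the 64 + 16 allocation.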
In the third scheduling strategy, the switches in the computing cluster 101 further include core switches; for example, the computing cluster 101 may be networked using a three-layer fat-tree network structure. The scheduler 102 can traverse the topology of the HPC cluster to determine the computing nodes connected to each aggregation switch through the edge switches. When the number of computing nodes connected to every aggregation switch is less than a second quantity threshold, the scheduler 102 can determine, as the first computing nodes allocated to the job to be processed, multiple computing nodes whose communication connections pass through at least one core-layer switch, thereby limiting the number of switches traversed during data transmission between different aggregation switches and reducing the communication cost between the plurality of first computing nodes.
In practice, the scheduler 102 can apply all three scheduling strategies together. For example, it can first execute the first scheduling strategy; when no edge switch has a number of connected computing nodes reaching the first quantity threshold, the scheduler 102 can execute the second scheduling strategy; and when no aggregation switch has a number of connected computing nodes reaching the second quantity threshold, the scheduler 102 can further execute the third scheduling strategy, thereby minimizing the communication cost between the different computing nodes that process the job. Alternatively, the scheduler 102 may execute only some of the three scheduling strategies, or execute other applicable scheduling strategies, which is not limited in this embodiment.
S403: The scheduler 102 notifies the plurality of first computing nodes to execute the job.
After determining the plurality of first computing nodes allocated to the job to be processed, the scheduler 102 can notify them to execute the job.
By way of example, the scheduler 102 can notify the first computing nodes one by one to trigger each of them to start executing the job. Alternatively, the scheduler 102 can notify a head node among the plurality of first computing nodes to start executing the job, where the head node is one of the plurality of first computing nodes, and the head node notifies the remaining first computing nodes to start executing the job. In a specific implementation, the scheduler 102 can send the head node an execution instruction for the job, such as an mpirun command; the execution instruction notifies the head node to execute the job. The head node parses the execution instruction to obtain the IP addresses of the remaining first computing nodes carried in it, and can then notify the remaining first computing nodes one by one, according to their IP addresses, to start executing the job. The scheduler 102 can randomly select one of the plurality of first computing nodes as the head node and instruct it to notify the remaining first computing nodes to start executing the job. Alternatively, based on the network distance or communication delay between each first computing node and the scheduler 102, the scheduler 102 can determine, as the head node, the first computing node with the smallest network distance or the shortest communication delay to the scheduler 102, and instruct it to notify the remaining first computing nodes to start executing the job. In practice, the scheduler 102 may also determine the head node from the plurality of first computing nodes according to other rules, which is not limited in this embodiment.
In practice, the scheduler 102 may allocate a large number of first computing nodes to a job, for example 1000 first computing nodes; in that case, having the head node notify the remaining first computing nodes one by one to start executing the job may take a long time. Therefore, in other possible implementations, the head node can use a tree-structured notification scheme to increase the concurrency of notifying the first computing nodes to start executing the job, thereby reducing notification time and improving notification efficiency. Taking the case where the head node uses a two-layer tree to start the remaining first computing nodes as an example, after receiving the execution instruction, the head node can determine at least one agent computing node among the remaining first computing nodes; for example, the execution instruction may carry the IP addresses of the agent computing nodes. Then, based on the network topology of the HPC cluster and the IP addresses of the agent computing nodes, the head node can notify each agent computing node through the corresponding switches to start executing the job; the head node and the agent computing nodes can then each notify different first computing nodes to start executing the job. By having the head node and the agent computing nodes notify the remaining first computing nodes in parallel, notification efficiency is improved and job execution is accelerated. In practice, the total number of head and agent computing nodes can be smaller than the number of remaining first computing nodes among the plurality of first computing nodes.
For example, in the two-layer fat-tree network shown in Figure 2, suppose the first computing nodes allocated to the job to be processed include computing node 1 through computing node a and computing node a+1 through computing node b. The scheduler 102 can first send an execution instruction to computing node 1 to notify it to start executing the job; computing node 1 is then the head node described above. The execution instruction can designate computing node a+1 as an agent computing node and notify it to start executing the job, so that computing node 1 can trigger computing node a+1, acting as the agent computing node, to notify the remaining computing nodes processing the job to start execution. In this way, computing node 1 can notify computing node 2 through computing node a one by one, and the agent computing node (computing node a+1) can notify computing node a+2 through computing node b one by one, thereby improving notification efficiency.
The agent computing nodes can be computing nodes under different edge switches. Having a computing node under each edge switch act as the agent computing node that notifies the remaining first computing nodes under that edge switch can further reduce the communication cost of notifying the plurality of first computing nodes to start executing the job. For example, for computing node a+2 through computing node b, compared with having computing node 1 notify these nodes through edge switch 1, aggregation switch 1, and edge switch 2, having the agent computing node notify them through edge switch 2 can effectively reduce the communication cost incurred when notifying computing node a+2 through computing node b to start executing the job, and can also further improve notification efficiency.
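The two-layer fan-out above (head node 1 notifies nodes 2 through a, agent node a+1 notifies nodes a+2 through b) can be sketched as a planning step that splits the remaining nodes into contiguous chunks, one per notifier. The function name and the plan dictionary are invented for illustration:

```python
def plan_two_layer_notify(node_ids, agent_ids):
    """Assign every non-head, non-agent node to a notifier: the head
    (node_ids[0]) and each agent each take one contiguous chunk of the
    remaining nodes, so all chunks can be notified in parallel.
    Returns {notifier id: [node ids it notifies]}."""
    head = node_ids[0]
    rest = [n for n in node_ids[1:] if n not in set(agent_ids)]
    notifiers = [head] + list(agent_ids)
    chunk = -(-len(rest) // len(notifiers))  # ceiling division
    plan = {}
    for i, notifier in enumerate(notifiers):
        plan[notifier] = rest[i * chunk : (i + 1) * chunk]
    return plan
```

With nodes 1 through 8 and node 5 as the agent, the plan reproduces the pattern in the example: node 1 notifies 2, 3, 4 while node 5 notifies 6, 7, 8 in parallel.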
It is worth noting that the above example uses a two-layer tree structure to notify the plurality of first computing nodes to start executing the job. When the number of first computing nodes is large (for example, 1000 or more), the scheduler 102 can also use a tree structure with three or more layers to notify the plurality of first computing nodes to start executing the job. Taking a three-layer tree as an example, the scheduler 102 can send an execution instruction to the head node to notify it to start executing the job. The head node then parses the IP addresses of the first agent computing nodes and the second agent computing nodes from the execution instruction, notifies each first agent computing node, according to its IP address, to start executing the job, and instructs each first agent computing node to notify the second agent computing nodes to start executing the job. Finally, after receiving the notification from a first agent computing node, each second agent computing node can notify the remaining first computing nodes (that is, the first computing nodes other than the head node and the agent computing nodes) in parallel to start executing the job. By way of example, different first agent computing nodes can be connected to different edge switches, a second agent computing node can be a first computing node connected to the same edge switch as its first agent computing node, and the number of second agent computing nodes can be greater than the number of first agent computing nodes. In this way, multiple second agent computing nodes can, through the edge switches they are connected to, notify the remaining first computing nodes under those edge switches to start executing the job.
It is worth noting that when the number of first computing nodes the scheduler 102 needs to allocate to a job is large, the scheduler 102 can determine the plurality of first computing nodes in the manner described above to minimize the communication cost between them. For some jobs in practice, however, the number of first computing nodes the scheduler 102 allocates to the job to be processed is small (for example, 5 first computing nodes); in that case, the scheduler 102 can also randomly select multiple first computing nodes from the computing cluster 101 to process the job. Alternatively, the scheduler 102 can be configured with two resource scheduling modes, where a job submitted to the HPC cluster 100 can carry an identifier of the resource scheduling mode. When the identifier indicates resource scheduling mode 1, the scheduler 102 can use the above process to allocate the plurality of first computing nodes to the job to be processed; when the identifier indicates resource scheduling mode 2, the scheduler 102 can randomly select multiple computing nodes from the computing cluster 101 as the first computing nodes for processing the job.
In actual application scenarios, the scheduler 102 can schedule different resources for multiple different jobs. For example, for job 1 submitted to the HPC cluster 100, the scheduler 102 can determine, from the computing cluster 101, a plurality of first computing nodes for processing job 1. For job 2 submitted to the HPC cluster 100, the scheduler 102 can determine, from the computing cluster 101, a plurality of second computing nodes for processing job 2, where the manner in which the scheduler 102 determines the second computing nodes for job 2 is similar to the manner described above for determining the first computing nodes for job 1; see the foregoing process description for details. Further, when a job has finished executing in the HPC cluster 100, the HPC cluster 100 can return an execution result to the client. The execution result can indicate that the job executed successfully, executed with errors, and so on, and can also include other job-related information, such as intermediate data generated by the HPC cluster 100 during job execution.
In this embodiment, because the scheduler 102, when selecting the plurality of first computing nodes to execute the job, limits the number of switches spanned by the communication connections between them, the communication cost incurred when the first computing nodes execute the job is low: communication does not need to cross a large number of switches. In addition, the network transmission delay incurred during data communication between the first computing nodes is reduced, thereby improving the efficiency with which the first computing nodes execute the job.
As a possible embodiment, there may be one or more jobs to be processed in the method of Figure 4; that is, the scheduler 102 can perform the resource allocation and scheduling process for single jobs one by one, or perform the resource allocation and scheduling process for multiple jobs in batches. In a specific implementation, the scheduler can execute the jobs one by one in the chronological order in which they were obtained. Optionally, the scheduler can also execute the jobs one by one according to the priority of the service associated with each job or according to the degree to which the resources required by each job can be satisfied, where the degree of satisfaction indicates how well the idle resources in the system match the resources required by the job: when the idle resources in the system are greater than or equal to the resources required by the job, it can be determined that the scheduler can execute the scheduling process for the job; when the idle resources in the system are less than the resources required by the job, it can be determined that the scheduler cannot execute the scheduling process for the job, since the system's idle resources cannot meet the resource requirements of the job scheduling process. Optionally, the scheduler can also divide the multiple jobs to be processed into at least one group and execute a batch process for each group, for example separately executing the scheduling process of each job in each group according to the method shown in Figure 4.
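The satisfaction test described above (idle resources versus resources required by a job) reduces to a per-resource-type comparison, and the ordering by priority and satisfaction can be expressed as a sort key. The dictionary layout and function names are illustrative assumptions:

```python
def can_schedule(idle, required):
    """A job is schedulable only if the idle resources meet or exceed
    its requirement for every resource type it asks for."""
    return all(idle.get(res, 0) >= amount for res, amount in required.items())

def order_jobs(jobs, idle):
    """Sketch of one possible ordering: jobs whose requirements the
    idle resources satisfy come first, higher priority first within
    each group. jobs: [(name, priority, required resources)]"""
    return sorted(jobs, key=lambda j: (not can_schedule(idle, j[2]), -j[1]))
```

Jobs that cannot currently be satisfied sort to the back, so the scheduler can skip them until resources free up rather than blocking the queue.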
It is worth noting that other reasonable combinations of steps that a person skilled in the art can conceive based on the above description also fall within the protection scope of this application. In addition, a person skilled in the art should understand that the embodiments described in this specification are preferred embodiments, and the actions involved are not necessarily required by this application.
The resource scheduling method provided by this application is described in detail above with reference to Figures 1 to 5. The resource scheduling apparatus and computing device provided by this application are described below with reference to Figures 6 to 7.
Figure 6 is a schematic structural diagram of a resource scheduling apparatus provided by this application. The resource scheduling apparatus 600 is applied to the scheduler in an HPC cluster; the HPC cluster further includes multiple computing nodes and multiple switches, and the multiple computing nodes are communicatively connected through the multiple switches. As shown in Figure 6, the resource scheduling apparatus 600 may include:
an acquisition module 601, configured to acquire a job to be processed;
a determining module 602, configured to determine, according to the topology of the HPC cluster, multiple first computing nodes from the multiple computing nodes, where the total number of switches traversed by data transmission between the multiple first computing nodes is less than a threshold; and
a notification module 603, configured to notify the multiple first computing nodes to execute the job.
In a possible implementation, the multiple switches include aggregation switches and edge switches; the multiple computing nodes access the HPC cluster through the edge switches, and the edge switches are coupled to the aggregation switches.
In a possible implementation, the determining module 602 is configured to traverse the topology of the HPC cluster and determine multiple computing nodes attached to the same edge switch as the multiple first computing nodes.
In a possible implementation, the determining module 602 is configured to:
traverse the topology of the HPC cluster and determine the number of computing nodes attached to each edge switch; and
when the number of computing nodes attached to every edge switch is less than a first quantity threshold, determine multiple computing nodes attached to multiple edge switches as the multiple first computing nodes, the multiple first computing nodes being communicatively connected through at least one aggregation switch.
In a possible implementation, the multiple switches further include a core switch coupled to the aggregation switches, and the determining module 602 is configured to:
traverse the topology of the HPC cluster and determine the number of computing nodes connected to each aggregation switch through the edge switches; and
when the number of computing nodes connected to every aggregation switch through the edge switches is less than a second quantity threshold, determine multiple computing nodes communicatively connected through at least one core-layer switch as the multiple first computing nodes.
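The implementations above form a hierarchy of fallbacks: prefer first computing nodes that share a single edge switch, then nodes under a single aggregation switch, and finally nodes reachable only across the core layer. For illustration only (the `topology` shape and function name are hypothetical, not part of the disclosure), a minimal sketch of that selection order — here a level is skipped when no single switch at that level can supply the job's demand, which plays the role of the first and second quantity thresholds:

```python
def select_nodes(topology, needed):
    """Pick `needed` free nodes with the fewest switch hops between them.
    `topology` maps aggregation switch -> edge switch -> list of free nodes
    (a simplified two-layer view of the cluster)."""
    # Level 1: all nodes attached to one edge switch.
    for edges in topology.values():
        for nodes in edges.values():
            if len(nodes) >= needed:
                return nodes[:needed]
    # Level 2: nodes under one aggregation switch, spanning edge switches.
    for edges in topology.values():
        pooled = [n for nodes in edges.values() for n in nodes]
        if len(pooled) >= needed:
            return pooled[:needed]
    # Level 3: fall back to nodes communicating via the core layer.
    pooled = [n for edges in topology.values()
              for nodes in edges.values() for n in nodes]
    return pooled[:needed] if len(pooled) >= needed else None
```

Each level bounds the total number of switches traversed between the selected nodes, which is the threshold condition the determining module enforces.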
In a possible implementation, the multiple first computing nodes include a head computing node and at least one agent computing node. The notification module 603 is configured to send an execution instruction to the head computing node, the execution instruction being used to notify the head computing node to execute the job; the at least one agent computing node is notified by the head computing node to execute the job; and the first computing nodes remaining among the multiple first computing nodes, other than the head computing node and the at least one agent computing node, are notified by the at least one agent computing node to execute the job.
In a possible implementation, the at least one agent computing node includes a first agent computing node and a second agent computing node. The first agent computing node is notified by the head computing node to execute the job, the second agent computing node is notified by the first agent computing node to execute the job, and the second agent computing node is further configured to notify the remaining first computing nodes among the multiple first computing nodes to execute the job.
In a possible implementation, the total number of the head computing node and the at least one agent computing node is less than the number of the remaining first computing nodes among the multiple first computing nodes.
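The relayed notification in these implementations can be pictured as a shallow tree: the scheduler sends one execution instruction to the head computing node, the head notifies agent computing nodes, and the agents notify the remaining first computing nodes. A minimal sketch under assumed names (`fanout` is a hypothetical knob, not a parameter from the disclosure):

```python
from collections import deque

def build_notification_tree(nodes, fanout=2):
    """nodes[0] is the head computing node; each already-notified node
    relays the job to up to `fanout` not-yet-notified nodes, so the
    scheduler itself sends only a single message."""
    children = {n: [] for n in nodes}
    pending = deque([nodes[0]])
    i = 1
    while pending and i < len(nodes):
        parent = pending.popleft()
        for _ in range(fanout):
            if i >= len(nodes):
                break
            children[parent].append(nodes[i])
            pending.append(nodes[i])
            i += 1
    return children
```

With seven nodes and `fanout=2`, the head and two agents relay to the four remaining nodes, consistent with the constraint that the head plus agents are fewer than the remaining first computing nodes.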
In a possible implementation, the networking mode of the multiple computing nodes and the multiple switches is one of the following networks:
a two-layer fat-tree network, a three-layer fat-tree network, a two-dimensional mesh network, a three-dimensional mesh network, a two-dimensional ring network, or a three-dimensional ring network.
In a possible implementation, the acquisition module is further configured to acquire first connection information between the computing nodes and the switches in the HPC cluster and second connection information among the multiple switches.
The resource scheduling apparatus 600 further includes:
a generating module 604, configured to generate the topology of the HPC cluster according to the first connection information and the second connection information.
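As a rough illustration of the generating module's task (the function and variable names are hypothetical), the topology can be assembled as an adjacency map from the two kinds of connection information:

```python
def build_topology(node_links, switch_links):
    """Build an undirected adjacency map of the cluster from
    node<->switch links (first connection information) and
    switch<->switch links (second connection information)."""
    topo = {}
    def add(a, b):
        topo.setdefault(a, set()).add(b)
        topo.setdefault(b, set()).add(a)
    for node, switch in node_links:
        add(node, switch)
    for s1, s2 in switch_links:
        add(s1, s2)
    return topo
```

The determining module can then traverse this map to count the switches between any candidate set of first computing nodes.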
It should be understood that the resource scheduling apparatus 600 of the embodiments of this application may be implemented by a CPU, by an application-specific integrated circuit (ASIC), or by a programmable logic device (PLD); the PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. When the resource scheduling methods shown in Figures 1 to 5 are implemented in software, the resource scheduling apparatus 600 and its modules may also be software modules.
The resource scheduling apparatus 600 according to the embodiments of this application may correspond to performing the methods described in the embodiments of this application, and the above and other operations and/or functions of the modules of the resource scheduling apparatus 600 serve to implement the corresponding procedures of the methods in Figure 4; for brevity, they are not repeated here.
In this embodiment, when the scheduler selects the multiple first computing nodes that will execute the job, it limits the number of switches spanned by the communication connections among those nodes. As a result, the communication cost incurred when the first computing nodes execute the job is low, since communication does not have to cross a large number of switches; moreover, the network transmission latency of data communication among the multiple first computing nodes is also reduced, which improves the efficiency with which the first computing nodes execute the job.
Figure 7 is a schematic diagram of a computing device 800 provided by this application. As shown in the figure, the computing device 800 includes a scheduler 700, and the scheduler 700 includes a processor 701, a memory 702, a communication interface 703, and a bus 704. The processor 701, the memory 702, and the communication interface 703 communicate over the bus 704; communication may also be implemented by other means such as wireless transmission. In addition to a data bus, the bus 704 may include a power bus, a control bus, a status signal bus, and the like; for clarity, the various buses are all labeled as bus 704 in the figure.
The processor 701 may be a CPU, or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
The memory 702 may include read-only memory and random access memory, and provides computer instructions and data to the processor 701. The memory 702 may be volatile memory or non-volatile memory, or may include both. The non-volatile memory may be read-only memory (ROM), programmable ROM (PROM), erasable programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. The volatile memory may be random access memory (RAM), used as an external cache. By way of example rather than limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The communication interface 703 is used to communicate with other devices or apparatuses connected to the scheduler 700.
Further, the computing device 800 may also include a communication interface 705 and a bus 706, where the scheduler 700 and the communication interface 705 communicate over the bus 706; communication may likewise be implemented by other means such as wireless transmission. Like the bus 704, the bus 706 may include a power bus, a control bus, and a status signal bus in addition to a data bus; for clarity, the various buses are all labeled as bus 706 in the figure.
The communication interface 705 is used to communicate with other devices connected to the computing device 800, for example with the computing nodes connected to the computing device 800.
It should be understood that the computing device 800 according to the embodiments of this application may correspond to the resource scheduling apparatus 600 in the embodiments of this application, and may correspond to performing the method shown in Figure 4 of the embodiments of this application; the above and other operations and/or functions implemented by the computing device 800 serve to implement the corresponding procedures of the methods in Figure 4 and, for brevity, are not repeated here.
As a possible implementation, this application further provides a scheduler whose structure is that of the scheduler 700 shown in Figure 7, configured to implement the corresponding procedures of the method shown in Figure 4; for brevity, details are not repeated here.
As another possible implementation, this application further provides a chip, composed of electronic components, which can be used to implement the operation steps of the method described in Figure 4 above.
As another possible implementation, this application further provides a chip that may further include a processor, the processor being configured to implement the operation steps of the method described in Figure 4 above.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the foregoing embodiments may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, over coaxial cable, optical fiber, or a digital subscriber line (DSL)) or wirelessly (for example, over infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium; the semiconductor medium may be a solid-state drive (SSD).
The foregoing are merely specific embodiments of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and such modifications or substitutions shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (14)

  1. A resource scheduling method, wherein the method is applied to a high-performance computing (HPC) cluster, the HPC cluster comprises a scheduler, multiple computing nodes, and multiple switches, the multiple computing nodes are communicatively connected through the multiple switches, and the method is performed by the scheduler and comprises:
    acquiring a job to be processed;
    determining, according to a topology of the HPC cluster, multiple first computing nodes from the multiple computing nodes, wherein the total number of switches traversed by data transmission between the multiple first computing nodes is less than a threshold; and
    notifying the multiple first computing nodes to execute the job.
  2. The method according to claim 1, wherein the multiple switches comprise aggregation switches and edge switches, the multiple computing nodes access the HPC cluster through the edge switches, and the edge switches are coupled to the aggregation switches.
  3. The method according to claim 2, wherein the determining, according to the topology of the HPC cluster, multiple first computing nodes from the multiple computing nodes comprises:
    traversing the topology of the HPC cluster, and determining multiple computing nodes attached to the same edge switch as the multiple first computing nodes.
  4. The method according to claim 2, wherein the determining, according to the topology of the HPC cluster, multiple first computing nodes from the multiple computing nodes comprises:
    traversing the topology of the HPC cluster, and determining the number of computing nodes attached to each edge switch; and
    when the number of computing nodes attached to every edge switch is less than a first quantity threshold, determining multiple computing nodes attached to multiple edge switches as the multiple first computing nodes, the multiple first computing nodes being communicatively connected through at least one aggregation switch.
  5. The method according to claim 2, wherein the multiple switches further comprise a core switch, the aggregation switches are coupled to the core switch, and the determining, according to the topology of the HPC cluster, multiple first computing nodes from the multiple computing nodes comprises:
    traversing the topology of the HPC cluster, and determining the number of computing nodes connected to each aggregation switch through the edge switches; and
    when the number of computing nodes connected to every aggregation switch through the edge switches is less than a second quantity threshold, determining multiple computing nodes communicatively connected through at least one core-layer switch as the multiple first computing nodes.
  6. The method according to any one of claims 1 to 5, wherein the multiple first computing nodes comprise a head computing node and at least one agent computing node, and the notifying the multiple first computing nodes to execute the job comprises:
    sending an execution instruction to the head computing node, the execution instruction being used to notify the head computing node to execute the job, wherein the at least one agent computing node is notified by the head computing node to execute the job, and the first computing nodes remaining among the multiple first computing nodes other than the head computing node and the at least one agent computing node are notified by the at least one agent computing node to execute the job.
  7. The method according to claim 6, wherein the at least one agent computing node comprises a first agent computing node and a second agent computing node, the first agent computing node is notified by the head computing node to execute the job, the second agent computing node is notified by the first agent computing node to execute the job, and the second agent computing node is further configured to notify the remaining first computing nodes among the multiple first computing nodes to execute the job.
  8. The method according to claim 6 or 7, wherein the total number of the head computing node and the at least one agent computing node is less than the number of the remaining first computing nodes among the multiple first computing nodes.
  9. The method according to any one of claims 1 to 8, wherein the networking mode of the multiple computing nodes and the multiple switches is one of the following networks:
    a two-layer fat-tree network, a three-layer fat-tree network, a two-dimensional mesh network, a three-dimensional mesh network, a two-dimensional ring network, or a three-dimensional ring network.
  10. The method according to any one of claims 1 to 9, wherein the method further comprises:
    acquiring first connection information between the computing nodes and the switches in the HPC cluster and second connection information among the multiple switches; and
    generating the topology of the HPC cluster according to the first connection information and the second connection information.
  11. A resource scheduling apparatus, wherein the resource scheduling apparatus is applied to a scheduler in a high-performance computing (HPC) cluster, the HPC cluster further comprises multiple computing nodes and multiple switches, the multiple computing nodes are communicatively connected through the multiple switches, and the resource scheduling apparatus comprises:
    an acquisition module, configured to acquire a job to be processed;
    a determining module, configured to determine, according to a topology of the HPC cluster, multiple first computing nodes from the multiple computing nodes, wherein the total number of switches traversed by data transmission between the multiple first computing nodes is less than a threshold; and
    a notification module, configured to notify the multiple first computing nodes to execute the job.
  12. A scheduler, comprising a processor and a memory, wherein the memory is configured to store computer instructions, and the processor is configured to perform, according to the computer instructions, the operation steps of the method according to any one of claims 1 to 10.
  13. A computing device, comprising a scheduler, wherein the scheduler comprises a processor and a memory; the memory is configured to store computer instructions, and the processor is configured to perform, according to the computer instructions, the operation steps of the method according to any one of claims 1 to 10.
  14. A computer-readable storage medium, comprising instructions that, when run on a computing device, cause the computing device to perform the operation steps of the method according to any one of claims 1 to 10.
PCT/CN2023/080047 2022-03-08 2023-03-07 Resource scheduling method, apparatus, and related device WO2023169408A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210227649.1 2022-03-08
CN202210227649.1A CN116775258A (en) 2022-03-08 2022-03-08 Resource scheduling method and device and related equipment

Publications (1)

Publication Number Publication Date
WO2023169408A1 (en)




Also Published As

Publication number Publication date
CN116775258A (en) 2023-09-19


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23765986

Country of ref document: EP

Kind code of ref document: A1