CN117041259A - Scheduling method and device for computing resources - Google Patents

Scheduling method and device for computing resources

Info

Publication number
CN117041259A
Authority
CN
China
Prior art keywords
board
computing
convergence
task
interface board
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311285593.6A
Other languages
Chinese (zh)
Other versions
CN117041259B (en)
Inventor
宛清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New H3C Technologies Co Ltd
Original Assignee
New H3C Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by New H3C Technologies Co Ltd filed Critical New H3C Technologies Co Ltd
Priority to CN202311285593.6A priority Critical patent/CN117041259B/en
Publication of CN117041259A publication Critical patent/CN117041259A/en
Application granted granted Critical
Publication of CN117041259B publication Critical patent/CN117041259B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L 67/1004 Server selection for load balancing
    • H04L 67/1012 Server selection for load balancing based on compliance of requirements or conditions with available server resources
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 49/00 Packet switching elements
    • H04L 49/10 Packet switching elements characterised by the switching fabric construction
    • H04L 49/111 Switch interfaces, e.g. port details
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 49/00 Packet switching elements
    • H04L 49/10 Packet switching elements characterised by the switching fabric construction
    • H04L 49/112 Switch control, e.g. arbitration
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L 67/1031 Controlling of the operation of servers by a load balancer, e.g. adding or removing servers that serve requests
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/14 Session management
    • H04L 67/146 Markers for unambiguous identification of a particular session, e.g. session cookie or URL-encoding
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a scheduling method and device for computing resources. When a single interface board in a leaf switch cannot satisfy the computing resources required by a computing task, the method processes the computing subtasks of the computing task in parallel on the computing resources of a plurality of interface boards, so that the interface boards cooperatively complete the computing task. This reduces the probability that the idle computing resources of the interface boards sit unused, improves the utilization of computing resources in in-network computing, and improves the efficiency of in-network computing.

Description

Scheduling method and device for computing resources
Technical Field
The present application relates to the field of communications technologies, and in particular, to a method and an apparatus for scheduling computing resources.
Background
With the rise of high performance computing (High Performance Computing, HPC), collective communication systems have been widely adopted, because replacing a large number of point-to-point operations with collective operations can effectively improve computing performance. In a collective communication system, data processing and computation are completed entirely by servers; a server must complete a task through many rounds of data exchange over its communication channels, so a great deal of time is spent on communication latency. To further improve computing efficiency, the industry has therefore proposed in-network computing (In-network Computing, INC), which offloads computation onto the switches that carry the traffic, reducing the number of data exchanges, lowering the communication latency needed to complete a computing task, improving computing efficiency, and realizing high-performance data computation.
In in-network computing, so that a switch has both computing and communication functions, each interface board of the switch is equipped with a hardware computing unit (that is, each interface board has its own computing resources). The conventional scheduling manner is: if the idle computing resources of any one interface board in the switch can satisfy the sum M of computing resources required by all computing subtasks of a computing task, all task messages of that computing task are steered to that interface board for computation. However, with this manner, when no single interface board can satisfy the sum M required by the computing task, the computing task cannot be allocated at all, even though the combined idle computing resources of the interface boards very likely do satisfy the total amount of computing resources the task would occupy. The idle computing resources of the interface boards therefore sit empty, i.e., the utilization of computing resources in in-network computing is low.
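As an illustrative aside (not part of the patent text), the conventional single-board scheduling described above can be sketched as follows. The function name, board names and resource figures are hypothetical; the figures anticipate the worked example given later in the description.

```python
def conventional_allocate(free_per_board, required_sum_m):
    """Conventional scheduling: the whole computing task must fit on a single
    interface board, otherwise it cannot be allocated at all."""
    for board, free in free_per_board.items():
        if free >= required_sum_m:
            return board            # steer all task messages to this board
    return None                     # no single board suffices; resources sit idle


# 10 computing subtasks of 50M each need M = 500M in total, but no single
# board offers 500M even though 700M is idle across the switch.
boards = {"slot4": 300, "slot5": 200, "slot6": 200}
print(conventional_allocate(boards, 500))   # -> None
```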
Disclosure of Invention
In view of the above, the present application provides a method and apparatus for scheduling computing resources to improve the utilization of computing resources in network computing.
In a first aspect, an embodiment of the present application provides a method for scheduling computing resources, where the method is applied to a main control board in a leaf switch, and the method includes:
Receiving task information sent by each server, where the task information sent by any server comprises a server process identifier of a computing subtask belonging to a computing task and the computing resources required by that computing subtask;
determining, according to the sum M of computing resources required by all computing subtasks belonging to the computing task, whether any of the local interface boards has idle computing resources satisfying M;
if not, allocating a corresponding target interface board to each computing subtask from among the interface boards according to the computing resources required by each computing subtask belonging to the computing task and the idle computing resources present on each interface board, and designating at least one interface board from among the interface boards as a convergence board;
when a target interface board is not the convergence board, issuing to that target interface board a first type of mapping relation and a second type of mapping relation corresponding to the computing task, and when a target interface board is the convergence board, issuing to that target interface board the first type of mapping relation corresponding to the computing task; the first type of mapping relation at least comprises the correspondence between the server process identifier running a computing subtask and the target interface board to which that computing subtask is allocated, so that any target interface board forwards the task message sent by each server process to the target interface board corresponding to that server process for computation to obtain a calculation sub-result; the second type of mapping relation is used to instruct each target interface board that is not designated as the convergence board to send its calculation sub-result to the convergence board to be summarized into a convergence result, which the convergence board outputs.
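The first-aspect flow can be illustrated with the following hedged Python sketch. The data structures, the greedy allocation policy and the way the convergence board is chosen are assumptions made for illustration only; the embodiments below describe the actual method, and a second sketch after the worked example reproduces the allocation used there.

```python
def schedule_computing_task(subtasks, free_per_board):
    """Hedged sketch of the first-aspect flow on the main control board.

    subtasks:       {rank ID: computing resources required by that subtask}
    free_per_board: {interface board ID: idle computing resources}
    Returns a first-type mapping (rank ID -> target interface board) and the
    designated convergence board.
    """
    m = sum(subtasks.values())

    # If one interface board already has idle resources >= M, use it alone.
    for board, free in free_per_board.items():
        if free >= m:
            return {rank: board for rank in subtasks}, board

    # Otherwise allocate a target interface board to every computing subtask.
    first_type, remaining = {}, dict(free_per_board)
    for rank, need in subtasks.items():
        board = max(remaining, key=remaining.get)    # one possible policy
        if remaining[board] < need:
            raise RuntimeError("combined idle resources are also insufficient")
        first_type[rank] = board
        remaining[board] -= need

    # Designate at least one interface board as the convergence board,
    # e.g. the board with the most idle resources left.
    convergence_board = max(remaining, key=remaining.get)
    return first_type, convergence_board
```

The main control board would then issue the first-type mapping to every target board, and the second-type mapping only to the target boards other than the convergence board.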
In a second aspect, an embodiment of the present application further provides a method for scheduling computing resources, where the method is applied to an interface board in a leaf switch, the interface board has a first identifier, and the method includes:
receiving a task message, where the task message comprises a server process identifier running a computing subtask;
searching a locally stored first type of mapping relation for the interface board identifier corresponding to the server process identifier;
if the interface board identifier is different from the first identifier, sending the task message to the interface board indicated by the interface board identifier, so that the interface board indicated by the interface board identifier computes the task message;
if the interface board identifier is the same as the first identifier, computing the task message to obtain a calculation sub-result.
In a third aspect, an embodiment of the present application further provides a scheduling apparatus for computing resources, where the apparatus is applied to a main control board in a leaf switch, and the apparatus includes:
a first receiving module, used for receiving task information sent by each server, where the task information sent by any server comprises a server process identifier of a computing subtask belonging to a computing task and the computing resources required by that computing subtask;
a first determining module, used for determining, according to the sum M of computing resources required by all computing subtasks belonging to the computing task, whether any of the interface boards has idle computing resources satisfying M;
an allocation module, used for, if no interface board with idle computing resources satisfying M exists among the interface boards, allocating a corresponding target interface board to each computing subtask from among the interface boards according to the computing resources required by each computing subtask belonging to the computing task and the idle computing resources present on each interface board, and designating at least one interface board from among the interface boards as a convergence board;
an issuing module, used for issuing, when a target interface board is not the convergence board, the first type of mapping relation and the second type of mapping relation corresponding to the computing task to that target interface board, and for issuing, when a target interface board is the convergence board, the first type of mapping relation corresponding to the computing task to that target interface board; the first type of mapping relation at least comprises the correspondence between the server process identifier running a computing subtask and the target interface board to which that computing subtask is allocated, so that any target interface board forwards the task message sent by each server process to the target interface board corresponding to that server process for computation to obtain a calculation sub-result; the second type of mapping relation is used to instruct each target interface board that is not designated as the convergence board to send its calculation sub-result to the convergence board to be summarized into a convergence result, which the convergence board outputs.
In a fourth aspect, an embodiment of the present application further provides a scheduling apparatus for a computing resource, where the apparatus is applied to an interface board in a leaf switch, where the interface board includes a first identifier; the device comprises:
the second receiving module is used for receiving a task message, wherein the task message comprises a server process identifier for running a calculation sub-task;
the searching module is used for searching the interface board identification corresponding to the server process identification from the locally stored first type of mapping relation;
the sending module is used for sending the task message to the interface board indicated by the interface board identifier if the interface board identifier is different from the first identifier, so that the interface board indicated by the interface board identifier computes the task message;
and the processing module is used for carrying out calculation processing on the task message if the interface board identification is the same as the first identification, so as to obtain a calculation sub-result.
In a fifth aspect, an embodiment of the present application further provides an electronic device, including: a processor and a memory for storing computer program instructions which, when executed by the processor, cause the processor to perform the steps of the method as above.
In a sixth aspect, embodiments of the present application also provide a machine-readable storage medium storing computer program instructions which, when executed, enable the steps of the method as above to be carried out.
According to the above technical solution, when the main control board of the leaf switch determines that none of the local interface boards has idle computing resources satisfying the sum M of computing resources required by all computing subtasks belonging to the computing task, it allocates a corresponding target interface board to each computing subtask from among the interface boards according to the computing resources required by each computing subtask and the idle computing resources present on each local interface board, and designates at least one interface board as the convergence board. It then issues the first type of mapping relation and the second type of mapping relation corresponding to the computing task to each target interface board that is not the convergence board, and issues the first type of mapping relation corresponding to the computing task to the target interface board that is the convergence board, so that whenever an interface board subsequently receives a task message, the message is steered for computation according to the first and second types of mapping relations. In this way, when a single interface board in the leaf switch cannot satisfy the computing resources that the computing task needs to occupy, the computing subtasks of the computing task are processed in parallel on the computing resources of a plurality of interface boards, which cooperatively complete the computing task. This reduces the probability that the idle computing resources of the interface boards sit empty, improves the utilization of computing resources in in-network computing, and improves the efficiency of in-network computing.
Drawings
FIG. 1 is a block diagram of the networking architecture of an in-network computing system according to an exemplary embodiment of the present application;
FIG. 2 is a flow chart of a method for scheduling computing resources according to an exemplary embodiment of the present application;
FIG. 3 is a flow chart of a method for scheduling computing resources according to another exemplary embodiment of the present application;
fig. 4 is a schematic diagram of a calculation path of each task packet received by a leaf switch according to an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of a basic hardware structure of a device where a computing resource scheduling apparatus according to an embodiment of the present application is located;
FIG. 6 is a schematic diagram illustrating a computing resource scheduling apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a computing resource scheduling apparatus according to another embodiment of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings identify the same or similar elements, unless otherwise indicated. The implementations described in the following exemplary examples do not represent all implementations consistent with the application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any or all possible combinations of one or more of the associated listed items.
In order to better understand the technical solution provided by the embodiments of the present application and make the above objects, features and advantages of the embodiments of the present application more obvious, the technical solution in the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
In order to facilitate understanding of the method, before describing the method, a system provided by an embodiment of the present application is described:
referring to fig. 1, fig. 1 is a block diagram of a networking architecture of an online computing system according to an exemplary embodiment of the present application. It should be noted that, the implementation environments of the computing resource scheduling method provided in the embodiment of the present application are all the networking architecture of the online computing system shown in fig. 1. As shown in fig. 1, the system includes: a plurality of servers connected to the network computing manager 101, the spine switch located at the upper layer, the leaf switches located at the lower layer, and each leaf switch. The high port density switch of the spine switch, the leaf switch, serves as an access layer, provides network connectivity to multiple server servers, and interfaces with the spine switch. The number of the spine switch and the leaf switch may be one or more, and embodiments of the present application are not particularly limited.
The leaf switch is provided with a main control board and a plurality of interface boards, each interface board provides computing resources in the form of a hardware computing unit, and the hardware computing unit can be an FPGA board card or a computing chip.
As an embodiment, the specific workflow of the online computing system is as follows:
1. Based on a computing need (e.g., computing weather data to predict future weather, or computing oil exploration data to predict drilling locations), if a task manager is present in the system, the user enters a creation instruction for a computing task in the task manager. The creation instruction carries the computing task, which consists of multiple parallel computing subtasks, and each computing subtask corresponds to a server process running on a designated server. Alternatively, if the system has no task manager, the user starts the server processes directly on the designated servers.
For example, server 1, server 2, server 3 and server 4 in fig. 1 are selected to participate in the computation, the computing task has 10 computing subtasks in total, each computing subtask corresponds to a server process running on one of these servers, and the server processes are distinguished by server process identifiers (rank IDs) during the computation. The server processes run by server 1 are rank ID 1 and rank ID 2, the server processes run by server 2 are rank ID 3, rank ID 4 and rank ID 5, the server processes run by server 3 are rank ID 6, rank ID 7 and rank ID 8, and the server processes run by server 4 are rank ID 9 and rank ID 10.
The user designates the servers participating in the computing task according to how busy each server is.
2. Each designated server transmits task information to the network computing manager 101. The task information transmitted by a server includes at least: the identifier of the computing task (task ID), the identifier of the server process running a computing subtask of the computing task, the computing resources required by that computing subtask, and the total number of computing subtasks (which may also be described as the total number of server processes running all computing subtasks of the computing task).
3. The network computing manager 101 collects the task information sent by all servers participating in the present computing task. Illustratively, when the number of task information entries carrying this task ID received by the network computing manager 101 equals the total number of computing subtasks of the computing task carried in any one of those entries, it determines that all the task information sent by the designated servers has been collected.
The network computing manager 101 then determines whether the servers hosting the server processes of all computing subtasks of the computing task are distributed across switches. If they are, then for each of the leaf switches to which these servers belong, the network computing manager 101 forwards to that leaf switch the task information sent by the servers connected to it, designates a spine switch from among the spine switches commonly connected to these leaf switches as the summarizing switch, and sends the address information of the designated spine switch to each leaf switch (the designated spine switch later gathers the convergence results sent by the leaf switches to obtain a target result and sends the target result back to each leaf switch that sent a convergence result).
If the servers are not distributed across switches, the network computing manager 101 sends the task information sent by all the servers to the single leaf switch to which all the servers belong.
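To make step 3 concrete, a minimal Python sketch of this dispatch decision is given below. The dictionary shapes, the "server" key and the default spine name are illustrative assumptions, not the manager's actual interface.

```python
def dispatch_task_info(task_infos, server_to_leaf, spine_switch="spine1"):
    """Sketch of the network computing manager's forwarding decision.

    task_infos:     list of task-information dicts, each with a "server" key
    server_to_leaf: {server name: leaf switch the server is attached to}
    Returns the task information grouped per leaf switch and, when the servers
    are distributed across switches, the designated summarizing spine switch.
    """
    per_leaf = {}
    for info in task_infos:
        per_leaf.setdefault(server_to_leaf[info["server"]], []).append(info)

    if len(per_leaf) == 1:
        # Not distributed across switches: one leaf switch handles everything.
        return per_leaf, None

    # Distributed across switches: each leaf switch only receives the task
    # information of its own servers, and a commonly connected spine switch
    # is designated to summarize the convergence results.
    return per_leaf, spine_switch
```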
4. If all the servers belong to the same leaf switch (i.e., all the servers are not distributed across switches), the leaf switch executes the flow shown in fig. 2 and 3 and embodiment 1, which will be described in detail later, and will not be repeated here.
If the servers belong to multiple leaf switches (i.e., the servers of the present computing task are distributed across switches), refer to the flow shown in embodiment 2, which is not detailed here.
The above presents a workflow of an on-network computing system and the following describes in detail the scheduling method of computing resources performed by a leaf switch:
referring to fig. 2, fig. 2 is a flowchart of a method for scheduling computing resources according to an exemplary embodiment of the present application. As an embodiment, the method may be applied to a main control board in a leaf switch.
It should be noted that, for convenience of description, the embodiments shown in fig. 2 and 3 are described in a case where servers where server processes of all computing subtasks of a computing task are located are not distributed across switches.
As shown in fig. 2, the process may include the steps of:
s201, receiving task information sent by each server; the task information sent by any one of the servers includes a server process identification of the computing subtask running on the server as belonging to the computing task and the computing resources required by the computing subtask.
Illustratively, the task information sent by any server includes: an identification of the computing task (task ID), an identification of a server process running a computing subtask belonging to the computing task on the server, and computing resources required by the computing subtask.
S202, determining, according to the sum M of computing resources required by all computing subtasks belonging to the computing task, whether any of the local interface boards has idle computing resources satisfying M.
If not, the following step S203 is performed. If yes, that interface board is allocated to the computing task, and all task messages of the computing task are steered to that interface board for computation.
In this embodiment, the sum M of computing resources required by all computing subtasks belonging to the computing task may be calculated by the network computing manager and forwarded to the main control board of the leaf switch, so that the local main control board obtains the sum M, or it may be obtained by the main control board itself summing the computing resources required by each computing subtask.
S203, if it is determined in S202 that no interface board with idle computing resources satisfying M exists among the interface boards, allocating a corresponding target interface board to each computing subtask from among the interface boards according to the computing resources required by each computing subtask belonging to the computing task and the idle computing resources present on each interface board, and designating at least one interface board from among the interface boards as a convergence board.
In this embodiment, optionally, only one interface board is selected as the aggregation board, and the calculation sub-results of all interface boards that are assigned computing subtasks but not designated as the aggregation board are summarized onto that one aggregation board. Optionally, if the number of computing subtasks is very large, a plurality of interface boards may be selected as aggregation boards, each such aggregation board aggregates the results of a part of the interface boards that are assigned computing subtasks but not designated as aggregation boards, and finally everything is summarized onto one aggregation board.
Optionally, if the aggregation board is one of the target interface boards, the idle computing resources of the aggregation board must satisfy both the computing resources required by the computing subtasks allocated to it and the computing resources required by the aggregation operation. The aggregation operation is the operation in which the aggregation board summarizes the received calculation sub-results to obtain an aggregation result.
Optionally, if the aggregation board is not one of the target interface boards, the idle computing resources of the aggregation board need only satisfy the computing resources required by the aggregation operation.
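A hedged sketch of these two optional conditions follows; the parameter names are assumptions made for illustration.

```python
def aggregation_board_ok(is_target_board, idle, assigned_subtask_need, aggregation_need):
    """Sketch of the idle-resource condition for a candidate aggregation board."""
    if is_target_board:
        # The board also runs computing subtasks of this computing task.
        return idle >= assigned_subtask_need + aggregation_need
    # The board only performs the aggregation operation.
    return idle >= aggregation_need
```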
S204, when a target interface board is not the convergence board, issuing to that target interface board the first type of mapping relation and the second type of mapping relation corresponding to the computing task, and when a target interface board is the convergence board, issuing to that target interface board the first type of mapping relation corresponding to the computing task. The first type of mapping relation at least comprises the correspondence between the server process identifier running a computing subtask and the target interface board to which that computing subtask is allocated, so that any target interface board forwards the task message sent by each server process to the target interface board corresponding to that server process for computation to obtain a calculation sub-result. The second type of mapping relation is used to instruct each target interface board that is not designated as the convergence board to send its calculation sub-result to the convergence board to be summarized into a convergence result, which the convergence board outputs.
Illustratively, the first type of mapping relation at least includes the correspondence among the task ID, the server process identifier running a computing subtask, and the target interface board to which that computing subtask is assigned.
In this embodiment, there are many implementation manners for obtaining the first type of mapping relationship in specific implementation, and the following is described by way of example.
For example, the computing task has 10 computing subtasks, whose server processes are identified as rank ID 1 to rank ID 10; the server processes represented by rank ID 1 to rank ID 10 all run on servers connected below the leaf1 switch, and the computing resources required by each of rank ID 1 to rank ID 10 are 50M. The leaf1 switch has 8 interface boards (denoted slot 1 to slot 8); of these, slot 4 has 300M of idle computing resources, slot 5 has 200M, slot 6 has 200M, and the remaining slots have 0M.
Allocation starts with slot 5 and slot 6, the boards with the least idle computing resources: 200M can handle 4 ranks, so slot 5 is assigned rank ID 1, rank ID 2, rank ID 3 and rank ID 4, and slot 6 is assigned rank ID 5, rank ID 6, rank ID 7 and rank ID 8. Then slot 4, the board with the most idle computing resources, is allocated the remaining 2 ranks, rank ID 9 and rank ID 10. A sketch reproducing this allocation is given after the mapping entries below.
The first class of mapping is presented in the following manner:
task ID+rank ID 1- - -slot 5
Task ID+rank ID 2- - -slot 5
Task ID+rank ID 3- - -slot 5
Task ID+rank ID 4- - -slot 5
Task ID+rank ID 5- - -slot 6
Task ID+rank ID 6- - -slot 6
Task ID+rank ID 7- - -slot 6
Task ID+rank ID 8- - -slot 6
Task ID+rank ID 9- - -slot 4
Task ID+rank ID 10- - -slot 4
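The following hedged Python sketch reproduces exactly these mapping entries and the 200M left idle on slot 4 that the later aggregation draws on. The function name and the allocation policy (fill the boards with the least idle resources first) are assumptions made for illustration, not the literal device logic.

```python
def build_first_type_mapping(task_id, rank_needs, free_per_slot):
    """Illustrative sketch: target boards with the least idle resources are
    filled first, then the board with the most idle resources takes the rest."""
    mapping, remaining = [], dict(free_per_slot)
    pending = list(rank_needs)                       # rank IDs in order
    for slot in sorted(remaining, key=remaining.get):
        while pending and remaining[slot] >= rank_needs[pending[0]]:
            rank = pending.pop(0)
            mapping.append((task_id, rank, slot))
            remaining[slot] -= rank_needs[rank]
    return mapping, remaining


needs = {f"rank ID {i}": 50 for i in range(1, 11)}
free = {"slot 4": 300, "slot 5": 200, "slot 6": 200}
entries, left = build_first_type_mapping("Task ID", needs, free)
for task, rank, slot in entries:
    print(f"{task}+{rank}- - -{slot}")               # the ten entries above
print(left)                                          # {'slot 4': 200, 'slot 5': 0, 'slot 6': 0}
```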
As an embodiment, the method further comprises: allocating a corresponding logical process service identifier to each target interface board that is not designated as the convergence board; the logical process service identifier is different from any server process identifier. The second type of mapping relation at least comprises the correspondence between the logical process service identifier corresponding to a target interface board that is not designated as the convergence board and the convergence board, so that any target interface board not designated as the convergence board forms a new task message from its allocated logical process service identifier and its calculation sub-result, and forwards the new task message to the convergence board according to the second type of mapping relation.
Illustratively, the second type of mapping relation includes the correspondence among the identifier (ID) of the computing task, the logical process service identifiers corresponding to the target interface boards not serving as the convergence board, and the identifier of the convergence board. There are many ways to obtain the second type of mapping relation in a specific implementation; an example follows.
For example, the main control board determines that slot 4 serves as the convergence board, assigns logical process service identifier rank ID 11 to the aggregation from slot 5 to slot 4, and assigns logical process service identifier rank ID 12 to the aggregation from slot 6 to slot 4. Each of rank ID 11 and rank ID 12 needs to occupy 50M of computing resources, which can be scheduled from the idle computing resources of slot 4. At this point, of slots 1 to 8, slot 4 has 200M of idle computing resources and the remaining slots have 0M.
At this time, the second-type mapping relationship is presented in the following manner:
task ID+rank ID 11- - -slot 4
Task ID+rank ID 12- - -slot 4
After the main control board generates the first type of mapping relation and the second type of mapping relation, it issues the first type of mapping relation to each local interface board, and issues the second type of mapping relation to the interface boards that are allocated computing subtasks but not designated as the convergence board (namely slot 5 and slot 6), so as to guide the computation of subsequently received task messages.
As an embodiment, the method further comprises: issuing a third type of mapping relation corresponding to the computing task to the convergence board, so that the convergence board determines, based on the third type of mapping relation, whether all data to be converged has been collected, and performs the convergence operation once it determines that all data to be converged has been collected.
In this embodiment, the third type of mapping relation includes the task ID and the logical process service identifiers generated by the main control board, for example task ID+rank ID 11+rank ID 12- - -slot 4; the convergence board then performs its computation only after the task messages of both rank ID 11 and rank ID 12 have all arrived.
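A hedged sketch of how a convergence board might use the third-type mapping to decide when the aggregation can run is given below. The class name, the element-wise-sum aggregation and the message shapes are assumptions; the board's own locally computed sub-result (for rank ID 9 and rank ID 10 in this example) would be folded in the same way.

```python
class ConvergenceBoard:
    """Sketch: the convergence board runs its aggregation only after the
    sub-results of every logical process listed in the third-type mapping
    (here rank ID 11 and rank ID 12) have arrived."""

    def __init__(self, expected_logical_ranks):
        self.expected = set(expected_logical_ranks)
        self.received = {}                      # logical rank ID -> sub-result

    def on_sub_result(self, logical_rank, sub_result):
        self.received[logical_rank] = sub_result
        if self.expected.issubset(self.received):
            return self.aggregate()
        return None                             # still waiting for data

    def aggregate(self):
        # Assumed aggregation operation: element-wise sum of the sub-results.
        return [sum(values) for values in zip(*self.received.values())]


board = ConvergenceBoard(["rank ID 11", "rank ID 12"])
board.on_sub_result("rank ID 11", [1, 2, 3])         # returns None, waiting
print(board.on_sub_result("rank ID 12", [4, 5, 6]))  # -> [5, 7, 9]
```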
Thus, the flow shown in fig. 2 is completed.
The effect achieved by the flow of fig. 2 is as follows. When the main control board of the leaf switch determines that none of the local interface boards has idle computing resources satisfying the sum M of computing resources required by all computing subtasks belonging to the computing task, it allocates a corresponding target interface board to each computing subtask from among the interface boards according to the computing resources required by each computing subtask and the idle computing resources present on each local interface board, designates at least one interface board as the convergence board, issues the first type of mapping relation and the second type of mapping relation corresponding to the computing task to each target interface board that is not the convergence board, and issues the first type of mapping relation corresponding to the computing task to the target interface board that is the convergence board, so that whenever an interface board subsequently receives a task message, the message is steered for computation according to the first and second types of mapping relations. When a single interface board in the leaf switch cannot satisfy the computing resources that the computing task needs to occupy, the computing subtasks of the computing task are thus processed in parallel on the computing resources of a plurality of interface boards, which cooperatively complete the computing task. This reduces the probability that the idle computing resources of the interface boards sit empty, improves the utilization of computing resources in in-network computing, and improves the efficiency of in-network computing.
The above describes in detail how the main control board of the leaf switch plans the subsequent computation of task messages; how an interface board of the leaf switch operates after receiving the task messages sent by the server processes is described in detail below:
referring to fig. 3, fig. 3 is a flowchart illustrating a method for scheduling computing resources according to another exemplary embodiment of the present application. As an embodiment, the method may be applied to an interface board, where the interface board and the main control board provided in the above embodiment are located in the same leaf switch. Each interface board comprises a first identification, which is the interface board identification mentioned in the above embodiments, the first identification being used to distinguish the identity of the respective interface board.
As shown in fig. 3, the process may include the steps of:
s301, receiving a task message, wherein the task message comprises a server process identifier for running a computing sub-task.
As an embodiment, before step S301, the method further comprises:
when the interface board is not designated by the main control board as the convergence board of the computing task (that is, it is a target interface board that is allocated a computing subtask but not designated as the convergence board in the embodiment shown in fig. 2), the first type of mapping relation and the second type of mapping relation are received.
When the interface board is designated as the convergence board, the first type of mapping relation is received. The first type of mapping relation at least comprises the correspondence between the server process identifier running each computing subtask and the identifier of the target interface board to which that computing subtask is allocated. The second type of mapping relation is used to indicate that calculation sub-results are to be sent to the convergence board to be summarized into a convergence result, which the convergence board outputs.
It should be noted that, the first type of mapping relationship and the second type of mapping relationship are issued by the main control board on the same leaf switch as the interface board by executing the steps shown in fig. 2.
S302, searching an interface board identifier corresponding to the server process identifier from the locally stored first type of mapping relation.
S303, if the interface board identifier is different from the first identifier, sending the task message to the interface board indicated by the interface board identifier, so that the interface board indicated by the interface board identifier computes the task message;
S304, if the interface board identifier is the same as the first identifier, computing the task message to obtain a calculation sub-result.
As an embodiment, the header (denoted MPI header) of the task packet sent by each server at least includes: the task ID, the rank ID, the identifier of each piece of data in the task message (denoted sequence number), the operation type of the collective operation (for example, AllReduce), the data type and the data quantity; this information is encoded in a fixed format to form the MPI header.
After the MPI header of a server's task message has been encoded in the above manner, each interface board determines the slot ID corresponding to the rank ID carried in the MPI header of a received task message according to the first type of mapping relation. If the slot ID equals the first identifier, the interface board computes the task message to obtain a calculation sub-result; if not, the task message corresponding to that rank ID is steered to the interface board indicated by the slot ID for computation.
It should be noted that an interface board that computes task messages may start computing only after the task messages of all computing subtasks allocated to it have arrived. When it receives the first type of mapping relation, the interface board can determine, from its own first identifier and the interface board identifiers in the mapping, which server processes' messages it has to compute itself. For example, after receiving the 10 entries of the first type of mapping relation mentioned in the foregoing embodiment, slot 5 knows clearly that rank ID 1, rank ID 2, rank ID 3 and rank ID 4 are to be computed by itself, and it therefore performs the operation only after the task messages of rank ID 1, rank ID 2, rank ID 3 and rank ID 4 have all arrived. Similarly, slot 6 performs its operation after the task messages of rank ID 5, rank ID 6, rank ID 7 and rank ID 8 have all arrived.
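A hedged Python sketch of this per-message decision follows. The header field names, the tuple-shaped return value and the compute callable are assumptions, and the buffering until all assigned rank IDs have arrived is omitted for brevity.

```python
def handle_task_message(first_identifier, msg, first_type_mapping, compute):
    """Sketch of steps S301 to S304 on an interface board.

    msg:                parsed task message; its MPI header carries at least the
                        task ID, rank ID, sequence number, collective operation
                        type (e.g. AllReduce), data type and data quantity.
    first_type_mapping: {(task ID, rank ID): target interface board identifier}
    compute:            callable standing in for the hardware computing unit.
    """
    slot_id = first_type_mapping[(msg["task_id"], msg["rank_id"])]
    if slot_id != first_identifier:
        # S303: steer the task message to the board named in the mapping.
        return ("forward", slot_id, msg)
    # S304: compute locally to obtain a calculation sub-result.
    return ("local", first_identifier, compute(msg))
```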
The above describes in detail how the task message sent by each server process is directed to the corresponding interface board for computation.
As an embodiment, when the interface board performing the above steps S301 to S304 is not designated by the main control board as the convergence board of the computing task, in other words, when it is a target interface board that is assigned a computing subtask but is not the convergence board: after the interface board obtains its calculation sub-result, it sends the calculation sub-result to the convergence board according to the locally stored second type of mapping relation.
Illustratively, the second type of mapping relation includes the correspondence among the task ID, the logical process service identifiers corresponding to the interface boards not designated as the convergence board, and the convergence board.
The specific manner of sending the calculation sub-result to the convergence board according to the locally stored second type of mapping relation is as follows: a new task message is formed from the obtained logical process service identifier and the calculation sub-result, and the new task message is forwarded to the convergence board according to the second type of mapping relation. Specifically, since the calculation sub-result of an interface board not designated as the convergence board must be transmitted to the interface board designated as the convergence board, every interface board that is allocated a computing subtask but not designated as the convergence board re-forms a task message from its calculation sub-result. Compared with the header of the task message the interface board received, only the rank ID in the header of the re-formed task message changes: it is replaced by the rank ID of the logical process service identifier. Each target interface board not designated as the convergence board then sends its re-formed task message to the convergence board according to the second type of mapping relation. A sketch of this re-forming step is given below.
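The re-forming step referred to above can be sketched as follows; the field names and message shape are assumptions made for illustration.

```python
def reform_and_forward(task_id, logical_rank_id, sub_result,
                       received_header, second_type_mapping):
    """Sketch: a target board that is not the convergence board re-forms a task
    message from its calculation sub-result; compared with the header it
    received, only the rank ID changes (to the logical process service
    identifier, e.g. rank ID 11 or rank ID 12)."""
    new_header = dict(received_header)
    new_header["rank_id"] = logical_rank_id
    convergence_board = second_type_mapping[(task_id, logical_rank_id)]
    return convergence_board, {"header": new_header, "payload": sub_result}
```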
As an embodiment, when the interface board is designated as the convergence board, it receives the calculation sub-results sent by the interface boards of the computing task that are not designated as the convergence board, performs the convergence operation on the local calculation sub-result and the received calculation sub-results, and outputs the convergence result.
As an embodiment, when the interface board serves as the convergence board, it also receives a third type of mapping relation. The third type of mapping relation is used to let the convergence board determine whether all the data to be converged has been collected; the convergence operation on the local calculation sub-result and the received calculation sub-results is performed only once the convergence board determines, based on the third type of mapping relation, that all the data to be converged has been collected. For example, slot 4 performs the operation after the task messages re-formed from the calculation sub-results of rank ID 11 and rank ID 12 have all arrived.
Alternatively, slot 4 may first perform an operation once the task messages of rank ID 9 and rank ID 10 have all arrived, obtaining a local calculation sub-result, and then, once the task messages re-formed from the calculation sub-results of rank ID 11 and rank ID 12 have all arrived, summarize the local calculation sub-result with the calculation sub-results of rank ID 11 and rank ID 12 to obtain the convergence result. Or slot 4 may wait until the task messages re-formed from the calculation sub-results of rank ID 11 and rank ID 12 have all arrived, and then summarize the task messages of rank ID 9 and rank ID 10 together with the calculation sub-results of rank ID 11 and rank ID 12 to obtain the convergence result.
Since, in this embodiment, the servers hosting the server processes of all computing subtasks of the computing task are not distributed across switches, outputting the convergence result means sending the target result to each server process through the remote direct memory access (Remote Direct Memory Access, RDMA) channel established between the present leaf switch and the server hosting that server process. The RDMA channel is established according to the QPN number of the server where the server process is located, the IP address of that server, and the QPN number and IP address of the leaf switch.
For example, an RDMA channel may be expressed as:
the identifier (ID) of the computing task+rank ID n- - -local qpn+peer ip addr+peer qpn, where n is any positive integer from 1 to 10.
For example, task ID 3+rank ID 0 corresponds to 1.1.1.1 - - - 2.2.2.2 5, where 1.1.1.1 is the IP address of the present switch, 3 is the qpn number of the process, corresponding to rank ID 0, on the server connected to the switch, 2.2.2.2 is the IP address of that server, and 5 is the local qpn number on the present switch for the connection to that server process.
In this embodiment, an MPI header is added to the target result obtained above; the rank IDs in this MPI header are the rank IDs carried in the task information received from the servers, for example rank ID 1 to rank ID 10. Using the rank ID in the newly added MPI header and the RDMA channels established above, the target result is sent back to each server process that sent a computing subtask to the present leaf switch.
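A hedged sketch of this return step follows; the channel-table shape, the header fields and the send callable are assumptions, and the actual RDMA transfer is performed by the switch hardware.

```python
def return_target_result(task_id, rank_ids, target_result, rdma_channels, send):
    """Sketch: the convergence board returns the target result to every server
    process over the RDMA channel recorded for that rank ID.

    rdma_channels: {(task ID, rank ID): (local qpn, peer IP address, peer qpn)}
    send:          callable standing in for the actual RDMA transmission.
    """
    for rank_id in rank_ids:                            # e.g. rank ID 1 .. 10
        channel = rdma_channels[(task_id, rank_id)]
        mpi_header = {"task_id": task_id, "rank_id": rank_id}  # re-added header
        send(channel, {"header": mpi_header, "payload": target_result})
```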
Thus, the flow shown in fig. 3 is completed.
The effect achieved by the flow of fig. 3 is as follows. By steering received task messages for computation in the above manner, the target result of the computing task is obtained. When a single interface board in the leaf switch cannot satisfy the computing resources that the computing task needs to occupy, the computing subtasks of the computing task are processed in parallel on the computing resources of a plurality of interface boards in the leaf switch, which cooperatively complete the computing task. This reduces the amount of interface board computing resources left idle, improves the utilization of computing resources in in-network computing, and improves the efficiency of in-network computing.
If the servers hosting the server processes of all computing subtasks of the computing task are distributed across switches, then after each leaf switch executes the above steps and its convergence board obtains a convergence result, each leaf switch sends its convergence result to the spine switch, and the spine switch gathers the convergence results sent by the leaf switches to obtain the target result of the computing task.
After the spine switch obtains the target result, it returns the target result to the convergence board of each leaf switch, and the convergence board then sends the target result to the server processes over the established RDMA channels.
For a more detailed understanding of the present process, the specific steps of the process provided herein are set forth in more detail in examples 1 and 2 below.
Example 1:
In embodiment 1, the leaf-spine networking is as shown in fig. 1, and server processes are started on server 1, server 2, server 3 and server 4, all connected under the leaf1 switch; that is, the servers hosting the server processes of all computing subtasks of the computing task in embodiment 1 are not distributed across switches.
This example 1 includes the following steps:
1. the user inputs a creation instruction of a calculation task at the task manager based on the calculation demand to start a server process on the server 1, the server 2, the server 3, and the server 4 connected under the leaf1 switch, respectively.
Wherein the computing task is divided into 10 sub-computing tasks, each sub-computing task corresponding to a server process running on a server. The server processes operated by the server 1 are rank ID 1 and rank ID 2, the server processes operated by the server 2 are rank ID 3, rank ID 4 and rank ID 5, the server processes operated by the server 3 are rank ID 6, rank ID 7 and rank ID 8, and the server processes operated by the server 4 are rank ID 9 and rank ID 10.
2. The server 1, the server 2, the server 3, and the server 4 respectively transmit the task information to the online computing manager 101.
For example, the task information transmitted by the server 1 includes: the task ID, the rank ID 1 and the rank ID 2, the computational resources required by each of the rank ID 1 and the rank ID 2 are 50M, and the total number of server processes is 10.
3. After the network computing manager receives the task information sent by each server, and once the task information corresponding to rank ID 1 to rank ID 10 has all arrived, it determines that the servers hosting rank ID 1 to rank ID 10 all belong to leaf1, and then sends the task information of the 10 computing subtasks to the leaf1 switch.
The task information of the 10 computing subtasks received by the leaf1 switch is expressed as follows:
the required computing resources for task ID, rank ID 1 are 50M.
The required computing resources for task ID, rank ID 2 are 50M.
Task ID, rank ID 3 requires 50M computing resources.
The required computing resources for task ID, rank ID 4 are 50M.
The required computing resources for task ID, rank ID 5 are 50M.
The required computing resources for task ID, rank ID 6 are 50M.
The required computing resources for task ID, rank ID 7 are 50M.
The required computing resources for task ID, rank ID 8 are 50M.
The required computing resources for task ID, rank ID 9 are 50M.
The required computing resources for task ID, rank ID 10 are 50M.
4. After the leaf1 switch receives the task information of the 10 computing subtasks through the local master control board, the idle computing resource amounts of slots 1 to 8 are obtained, wherein slot 4 is 300M, slot 5 is 200M, slot 6 is 200M, and the rest slots are 0M. The interface board designated as the convergence board is slot 4.
(1) The leaf1 switch main control board generates a first type of mapping relation and transmits the first type of mapping relation to each local interface board.
The first type of mapping is presented in the following manner:
task ID+rank ID 1- - -slot 5
Task ID+rank ID 2- - -slot 5
Task ID+rank ID 3- - -slot 5
Task ID+rank ID 4- - -slot 5
Task ID+rank ID 5- - -slot 6
Task ID+rank ID 6- - -slot 6
Task ID+rank ID 7- - -slot 6
Task ID+rank ID 8- - -slot 6
Task ID+rank ID 9- - -slot 4
Task ID+rank ID 10- - -slot 4
(2) And the Leaf1 switch main control board generates a second type mapping relation and transmits the second type mapping relation to slot 5 and slot 6.
The second type of mapping is presented in the following manner:
task ID+rank ID 11- - -slot 4
Task ID+rank ID 12- - -slot 4
(3) The leaf1 switch main control board generates a third type of mapping relation and issues the third type of mapping relation to slot 4.
The third type of mapping is presented in the following manner:
Task ID+rank ID 11+rank ID 12- - -slot 4
5. After the leaf1 switch receives, through its interface boards, the task messages uploaded from server 1, server 2, server 3 and server 4, the task message carrying rank ID 1 is steered to slot 5 according to the first type of mapping relation, and so on. Refer to fig. 4 for the path planning of each task message.
After slot 5 computes the task messages carrying rank ID 1 to rank ID 4 to obtain a calculation sub-result, it forms a new message under rank ID 11 and steers it to slot 4; after slot 6 computes the task messages carrying rank ID 5 to rank ID 8 to obtain a calculation sub-result, it forms a new message under rank ID 12 and steers it to slot 4; slot 4 then summarizes the task messages of rank ID 9 and rank ID 10 together with the re-formed messages sent by slot 5 and slot 6 to obtain the convergence result.
6. slot 4 sends the convergence result to the server processes on server 1, server 2, server 3 and server 4 via the established RDMA channels. An RDMA channel may be expressed as:
Task ID+rank ID n- - -local qpn+peer ip addr+peer qpn, where n is any positive integer from 1 to 10.
For example, computing task identifier ID 3+rank ID 0 corresponds to 1.1.1.1 - - - 2.2.2.2 5, where 1.1.1.1 is the IP address of the leaf1 switch, 3 is the qpn number of the process, corresponding to rank ID 0, on the server connected to the switch, 2.2.2.2 is the IP address of that server, and 5 is the local qpn number on the leaf1 switch for the connection to that server process.
Through the above steps, the server processes running the computing subtasks of the computing task on each server obtain the convergence result.
Example 2:
In embodiment 2, the leaf-spine networking is as shown in fig. 1, and server processes are started on server 1 and server 2 connected under the leaf1 switch and on server 6 and server 8 connected under the leaf2 switch; that is, the servers hosting the server processes of all computing subtasks of the computing task in embodiment 2 are distributed across switches (across the leaf1 switch and the leaf2 switch).
This example 2 includes the following steps:
1. the user inputs a creation instruction of a calculation task at the task manager based on the calculation requirement to start a server process on the server 1, the server 2, the server 6 and the server 8 connected with the leaf1 switch and the leaf2 switch respectively.
The identities of the processes operated by the server 1 in the operation are rank ID 1 and rank ID 2, the identities of the processes operated by the server 2 in the operation are rank ID 3, rank ID 4 and rank ID 5, the identities of the processes operated by the server 6 in the operation are rank ID 6, rank ID 7 and rank ID 8, and the identities of the processes operated by the server 8 in the operation are rank ID 9 and rank ID 10.
2. The server 1, the server 2, the server 6, and the server 8 respectively transmit the task information to the online computing manager 101.
For example, the task information sent by server 6 to the network computing manager includes: the task ID, rank ID 6, rank ID 7 and rank ID 8; the computing resources required by each of rank ID 6, rank ID 7 and rank ID 8 are 50M; and the total number of server processes is 10.
3. After the network computing manager receives the task information sent by each server, and once the task information corresponding to rank ID 1 to rank ID 10 has all arrived, it determines that the servers hosting rank ID 1 to rank ID 5 belong to leaf1 and the servers hosting rank ID 6 to rank ID 10 belong to leaf2, and therefore sends the task information of 5 computing subtasks to the leaf1 switch and the task information of the other 5 computing subtasks to the leaf2 switch.
The task information of the 5 computing subtasks received by the leaf1 switch is expressed as follows:
The task ID, rank ID 1 has a computational resource requirement of 50M.
The task ID, rank ID 2 has a computational resource requirement of 50M.
The task ID, rank ID 3 has a computational resource requirement of 50M.
The task ID, rank ID 4 has a computational resource requirement of 50M.
The task ID, rank ID 5 has a computational resource requirement of 50M.
The task information of the 5 computing subtasks received by the leaf2 switch is expressed as follows:
the task ID, rank ID 6 has a computational resource requirement of 50M.
The task ID, rank ID 7 has a computational resource requirement of 50M.
The task ID, rank ID 8 has a computational resource requirement of 50M.
The task ID, rank ID 9 has a computational resource requirement of 50M.
The task ID, rank ID 10 has a computational resource requirement of 50M.
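The grouping performed in step 3 can be sketched as follows, assuming hypothetical rank-to-server and server-to-leaf tables that mirror this example; this is a minimal illustration, not the manager's actual implementation.

```python
# Hypothetical sketch of step 3: once task information for all rank IDs has arrived,
# group the computing subtasks by the leaf switch their server is attached to.
rank_to_server = {1: "server1", 2: "server1", 3: "server2", 4: "server2", 5: "server2",
                  6: "server6", 7: "server6", 8: "server6", 9: "server8", 10: "server8"}
server_to_leaf = {"server1": "leaf1", "server2": "leaf1", "server6": "leaf2", "server8": "leaf2"}
resource_mb = {rank: 50 for rank in rank_to_server}   # 50M per subtask, as listed above

def group_subtasks_by_leaf():
    """Return {leaf: [(rank_id, required_mb), ...]} for dispatch to each leaf switch."""
    per_leaf = {}
    for rank, server in sorted(rank_to_server.items()):
        leaf = server_to_leaf[server]
        per_leaf.setdefault(leaf, []).append((rank, resource_mb[rank]))
    return per_leaf

print(group_subtasks_by_leaf())
# -> {'leaf1': [(1, 50), ..., (5, 50)], 'leaf2': [(6, 50), ..., (10, 50)]}
```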
4. When the network computing manager determines that the servers where rank IDs 1 to 10 are located span the leaf1 switch and the leaf2 switch, it designates the spine1 switch as the spine switch that summarizes the aggregation results of the leaf1 and leaf2 switches. It sends the IP address of the spine1 switch, the qpn number of the spine1 switch, and rank ID 11 to the leaf1 switch, and sends the IP address of the spine1 switch, the qpn number of the spine1 switch, and rank ID 12 to the leaf2 switch (rank ID 11 and rank ID 12 are logical process service identifiers, which are different from any server process identifier). Rank ID 11 and rank ID 12 are also sent to the spine1 switch, where the number of logical process service identifiers is 2. In addition, the network computing manager sends the qpn number of the leaf1 switch and the qpn number of the leaf2 switch to the spine1 switch, so that the target result subsequently obtained on the spine1 switch can be returned to the interface boards serving as convergence boards in the leaf1 switch and the leaf2 switch.
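For illustration, the coordination information distributed in step 4 might be grouped as below. All field names and the concrete address and qpn values are placeholders invented for the sketch, since the text above only refers to them abstractly.

```python
# Hypothetical sketch of the step-4 coordination information for the cross-switch case.
# "3.3.3.3", 9, 7 and 8 are placeholders, not values given in the text.
spine_ip, spine_qpn = "3.3.3.3", 9

to_leaf1 = {"spine_ip": spine_ip, "spine_qpn": spine_qpn, "uplink_rank_id": 11}
to_leaf2 = {"spine_ip": spine_ip, "spine_qpn": spine_qpn, "uplink_rank_id": 12}
to_spine1 = {
    "logical_rank_ids": [11, 12],           # the number of logical process service identifiers is 2
    "leaf_qpns": {"leaf1": 7, "leaf2": 8},  # used to return the target result to the convergence boards
}
print(to_leaf1, to_leaf2, to_spine1)
```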
5. After the leaf1 switch receives the 5 computing subtasks through its local main control board, it obtains the remaining computing resources of slots 1 to 8: slot 3 has 200M, slot 5 has 150M, and the remaining slots have 0M. Slot 5 is designated as the convergence board (a sketch of this allocation and mapping generation follows the mappings below).
(1) The master control board of the leaf1 switch generates a first type of mapping relation and transmits the first type of mapping relation to each interface board.
The first type of mapping is presented in the following manner:
task ID+rank ID 1- - -slot 3
Task ID+rank ID 2- - -slot 3
Task ID+rank ID 3- - -slot 3
Task ID+rank ID 4- - -slot 3
Task ID+rank ID 5- - -slot 5
(2) The leaf1 switch main control board generates a second type of mapping relation and transmits it to slot 3.
The second type of mapping is presented in the following manner:
task ID+rank ID 13- - -slot 5
(3) The leaf1 switch main control board generates a third type of mapping relation and transmits it to slot 5.
The third type of mapping is presented in the following manner:
task ID+rank ID 13- - -slot 5
(4) The main control board of the leaf1 switch generates a fourth type of mapping relation and issues it to slot 5. The fourth type of mapping relation is used to indicate that the aggregation result output by the leaf1 switch is to be summarized at the designated spine1 switch.
The fourth type of mapping is presented in the following manner:
task ID+rank ID 11- - -spine1
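The allocation and mapping generation of step 5 can be sketched as follows. The greedy first-fit assignment, the rule for picking the convergence board, and the hard-coded logical rank IDs (13 within leaf1, 11 toward spine1) are assumptions chosen so that this toy reproduces the mappings listed above; the patent does not prescribe a particular allocation algorithm.

```python
# Hypothetical sketch of step 5 on the leaf1 main control board.
free_mb = {"slot3": 200, "slot5": 150}          # remaining computing resources per slot
subtasks = [(r, 50) for r in range(1, 6)]       # (rank_id, required_mb) for rank IDs 1..5
total_mb = sum(mb for _, mb in subtasks)        # M = 250M

if any(mb >= total_mb for mb in free_mb.values()):
    pass  # a single interface board could host the whole task (not the case here)
else:
    first_type, assignment = {}, {}
    remaining = dict(free_mb)
    for rank, mb in subtasks:                   # first-fit over the slots
        slot = next(s for s, left in remaining.items() if left >= mb)
        remaining[slot] -= mb
        first_type[rank] = slot
        assignment.setdefault(slot, []).append(rank)

    # Convergence board: the slot with the most resources left over for the aggregation work.
    convergence = max(remaining, key=remaining.get)                 # -> slot5
    logical_ids = {s: 13 for s in assignment if s != convergence}   # rank ID 13 for slot3 in this example

    second_type = {logical_ids[s]: convergence for s in logical_ids}  # issued to slot3
    third_type = dict(second_type)                                    # issued to slot5
    fourth_type = {11: "spine1"}                                      # issued to slot5 (cross-switch case)

    print(first_type)   # {1: 'slot3', 2: 'slot3', 3: 'slot3', 4: 'slot3', 5: 'slot5'}
    print(second_type, third_type, fourth_type)
    # {13: 'slot5'} {13: 'slot5'} {11: 'spine1'}
```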
6. After the leaf2 switch receives the 5 computing subtasks through its local main control board, the remaining computing resources of slots 1 to 6 on the leaf2 switch are obtained: slot 3 has 200M, slot 5 has 150M, and the remaining slots have 0M. Slot 5 is designated as the convergence board.
(1) The leaf2 switch main control board generates a first type of mapping relation and transmits it to each interface board.
The first type of mapping is presented in the following manner:
task ID+rank ID 6- - -slot 3
Task ID+rank ID 7- - -slot 3
Task ID+rank ID 8- - -slot 3
Task ID+rank ID 9- - -slot 3
Task ID+rank ID 10- - -slot 5
(2) The leaf2 switch main control board generates a second type of mapping relation and transmits it to slot 3.
The second type of mapping is presented in the following manner:
task ID+rank ID 14- - -slot 5
(3) The leaf2 switch main control board generates a third type of mapping relation and transmits it to slot 5.
The third type of mapping is presented in the following manner:
task ID+rank ID 14- - -slot 5
(4) The main control board of the leaf2 switch generates a fourth type of mapping relation and issues it to slot 5. The fourth type of mapping relation is used to indicate that the aggregation result output by the leaf2 switch is to be summarized at the designated spine1 switch.
The fourth type of mapping is presented in the following manner:
task ID+rank ID 12- - -spine1
7. After the leaf1 switch receives the task messages uploaded from server 1 and server 2 through its interface boards, the task messages carrying rank ID 1 are drained to slot 3, and so on. After slot 3 in the leaf1 switch computes the task messages carrying rank IDs 1 to 4 to obtain a calculation sub-result, the calculation sub-result is formed into a new message with rank ID 13 and drained to slot 5; slot 5 then aggregates the task messages of rank ID 5 and the reformed message of rank ID 13 to obtain an aggregation result.
The workflow in the leaf2 switch is similar to that in the leaf1 switch and will not be described in detail here.
8. The aggregation result of slot 5 of the leaf1 switch is formed into a new message with rank ID 11 and sent to the spine1 switch, and the aggregation result of slot 5 of the leaf2 switch is formed into a new message with rank ID 12 and sent to the spine1 switch for summarization.
9. The spine1 switch gathers the reformed message sent by the leaf1 switch and the reformed message sent by the leaf2 switch to obtain a target result, and sends the target result to slot5 in the leaf1 switch and slot5 in the leaf2 switch.
10. Slot 5 in the leaf1 switch returns the target result, through the established RDMA channels, to the server processes with rank IDs 1 to 5 on server 1 and server 2, which are connected below the leaf1 switch. Slot 5 in the leaf2 switch returns the target result, through the established RDMA channels, to the server processes with rank IDs 6 to 10 on server 6 and server 8, which are connected below the leaf2 switch.
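Taken together, steps 7 to 10 perform a two-level aggregation. The sketch below illustrates the data path with plain integers and addition standing in for the real task messages and for whatever aggregation the computing task actually performs; all values are illustrative.

```python
# Hypothetical sketch of steps 7-10: hierarchical aggregation in embodiment 2.
def aggregate(values):
    return sum(values)  # stand-in for the in-network computation

# leaf1: slot 3 computes rank IDs 1-4, reforms the sub-result as rank ID 13 and
# drains it to slot 5; slot 5 adds its own rank ID 5 data to form leaf1's result.
leaf1_slot3_sub = aggregate([1, 2, 3, 4])
leaf1_result = aggregate([leaf1_slot3_sub, 5])

# leaf2 mirrors leaf1 with rank IDs 6-10 (logical rank ID 14 inside leaf2).
leaf2_slot3_sub = aggregate([6, 7, 8, 9])
leaf2_result = aggregate([leaf2_slot3_sub, 10])

# spine1 summarizes the two reformed messages (rank ID 11 from leaf1, rank ID 12 from
# leaf2) into the target result and returns it to slot 5 of each leaf, which fans it
# back out to the server processes over the established RDMA channels.
target_result = aggregate([leaf1_result, leaf2_result])
print(target_result)  # 55 for this toy input
```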
Specific steps of the methods provided herein in the case of non-cross-switch distribution and cross-switch distribution are set forth in detail above by examples 1 and 2.
The embodiment of the computing resource scheduling apparatus can be applied to a main control board or an interface board in a leaf switch. The apparatus embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the apparatus in a logical sense is formed by the processor of the device where it is located reading the corresponding computer program instructions from nonvolatile memory into memory for execution. In terms of hardware, fig. 5 shows a hardware structure diagram of the device where the computing resource scheduling apparatus of the present application is located; in addition to the processor, memory, network interface, and nonvolatile memory shown in fig. 5, the device where the apparatus is located in this embodiment generally includes other hardware according to the actual function of the device, which is not described here again.
Fig. 6 is a schematic structural diagram of a computing resource scheduling apparatus according to an embodiment of the present application. The computing resource scheduling apparatus 600 provided by this embodiment can be applied to a main control board in a leaf switch. As shown in fig. 6, the computing resource scheduling apparatus 600 includes a first receiving module 601, a first determining module 602, an allocation module 603, and an issuing module 604.
A first receiving module 601, configured to receive task information sent by each server; the task information sent by any server comprises a server process identifier of a computing subtask which belongs to a computing task and computing resources required by the computing subtask;
a first determining module 602, configured to determine, from among the interface boards, whether there is an interface board with idle computing resources satisfying M, according to a sum M of computing resources required by each computing subtask belonging to the computing task;
an allocation module 603, configured to allocate, if there is no interface board with idle computing resources satisfying M in each interface board, a corresponding target interface board from each interface board for each computing subtask according to computing resources required by each computing subtask belonging to the computing task and idle computing resources existing in each interface board, and designate at least one interface board from each interface board as a convergence board;
The issuing module 604 is configured to issue, when the target interface board is not a convergence board, a first type mapping relationship and a second type mapping relationship corresponding to the computing task to the target interface board;
when the target interface board is a convergence board, a first type mapping relation corresponding to the calculation task is issued to the target interface board;
the first type of mapping relation at least comprises the correspondence between the server process identifier running a computing subtask and the target interface board to which the computing subtask is allocated, so that any target interface board forwards the task message sent by each server process to the target interface board corresponding to that server process for calculation, to obtain a calculation sub-result; the second type of mapping relation is used to instruct each target interface board that is not designated as the convergence board to send its calculation sub-result to the convergence board for summarization to obtain a convergence result, and the convergence board outputs the convergence result.
The issuing module 604 is further configured to allocate, for each target interface board that is not designated as a convergence board, a corresponding logical process service identifier; the logical process service identifier is different from any server process identifier;
the second type of mapping relation at least comprises the correspondence between the logical process identifier corresponding to a target interface board that is not designated as the convergence board and the convergence board, so that any target interface board that is not designated as the convergence board forms a new task message from its assigned logical process service identifier and its calculation sub-result, and forwards the new task message to the convergence board according to the second type of mapping relation.
As an embodiment, when the convergence board is one of the target interface boards, the idle computing resources of the convergence board satisfy both the computing resources required by the computing subtasks assigned to the convergence board and the computing resources required by the convergence operation; the convergence operation refers to the operation in which the convergence board summarizes the received calculation sub-results to obtain a convergence result; or,
when the convergence board is different from the target interface board, the idle computing resources of the convergence board meet the computing resources required by the convergence operation.
As one embodiment, the issuing module 604 is further configured to:
issue a third type of mapping relation corresponding to the computing task to the convergence board, so that the convergence board determines, based on the third type of mapping relation, whether the data to be converged has been collected, and performs the convergence operation when it determines that the data to be converged has been collected.
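As a rough illustration of the issuing logic described for module 604, the sketch below maps each interface board to the mapping types it would receive; the cross_switch flag and the treatment of the fourth type of mapping are assumptions drawn from embodiment 2, not part of the module definition above.

```python
# Hypothetical sketch of the issuing module (604): which mapping types go to which board.
def plan_issuance(target_boards, convergence_board, cross_switch=False):
    plan = {b: ["first"] for b in target_boards}            # every target board gets the first type
    for b in target_boards:
        if b != convergence_board:
            plan[b].append("second")                        # tells it where to send its sub-result
    # The convergence board gets the third type even if it is not itself a target board.
    plan.setdefault(convergence_board, []).append("third")  # tells it what to wait for
    if cross_switch:
        plan[convergence_board].append("fourth")            # tells it to summarize at the spine switch
    return plan

print(plan_issuance(["slot3", "slot5"], "slot5", cross_switch=True))
# -> {'slot3': ['first', 'second'], 'slot5': ['first', 'third', 'fourth']}
```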
Fig. 7 is a schematic structural diagram of a computing resource scheduling apparatus according to an embodiment of the present application. As shown in fig. 7, the computing resource scheduling device 700 provided in the embodiment of the present application is applied to an interface board, where the interface board includes a first identifier, and the device includes: a second receiving module 701, a searching module 702, a sending module 703 and a processing module 704.
A second receiving module 701, configured to receive a task packet, where the task packet includes a server process identifier for running a computing sub-task;
The searching module 702 is configured to search an interface board identifier corresponding to the server process identifier from the locally stored first type of mapping relationship;
a sending module 703, configured to send a task message to an interface board indicated by the interface board identifier if the interface board identifier is different from the first identifier, and perform calculation processing on the task message by the interface board indicated by the interface board identifier;
and the processing module 704 is configured to perform calculation processing on the task message if the interface board identifier is the same as the first identifier, so as to obtain a calculation sub-result.
As an example of an implementation of this embodiment,
the second receiving module 701 is specifically configured to: before the task message is received,
when the interface board is not designated as a convergence board of the computing task by the main control board, receiving a first type of mapping relation and a second type of mapping relation;
when the interface board is designated as a convergence board, receiving a first type of mapping relation;
the first type of mapping relation at least comprises the correspondence between the server process identifier running each computing subtask and the identifier of the target interface board to which each computing subtask is allocated;
the second kind of mapping relation is used for indicating that the calculation sub-results are sent to the convergence board to be summarized to obtain convergence results, and the convergence board outputs the convergence results.
As an example of an implementation of this embodiment,
after performing calculation processing on the task message to obtain a calculation sub-result, when the interface board is not designated by the main control board as the convergence board of the computing task, the sending module 703 is configured to send the calculation sub-result to the convergence board according to the locally stored second type of mapping relation;
when the interface board is designated as the convergence board, the second receiving module 701 is configured to receive the calculation sub-results sent by the other interface boards of the computing task that are not designated as the convergence board, and the processing module 704 is configured to perform the convergence operation on the local calculation sub-result and the received calculation sub-results and output the convergence result.
As an example of an implementation of this embodiment,
the second type of mapping relation at least comprises the correspondence between the logical process identifier corresponding to the interface board that is not designated as the convergence board and the convergence board;
according to the second type of mapping relation stored locally, sending the calculation sub-result to the convergence board comprises:
forming a new task message according to the obtained logic process service identifier and the calculation sub-result;
and forwarding the new task message to the convergence board according to the second type mapping relation.
As an example of an implementation of this embodiment,
the second receiving module 701 is specifically configured to receive a third type of mapping relation when the interface board is designated as an aggregation board, where the third type of mapping relation is used to instruct the aggregation board to determine whether the data to be aggregated has been collected;
The aggregation operation is performed on the local calculation sub-result and the received calculation sub-results on the premise that, based on the third type of mapping relation, the data to be aggregated is determined to have been collected.
As an example of an implementation of this embodiment,
when the interface board is designated as a convergence board, outputting the convergence result specifically includes:
for each server process sending the computing subtasks, sending an aggregation result to the server process through the established RDMA channel;
the RDMA channel is established according to the QPN number of the server where the server process is located, the IP address of the server, and the QPN number and IP address of the leaf switch.
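The per-packet behavior described for apparatus 700 can be sketched as a small class: look up the target board in the first type of mapping, forward if it is another board, otherwise compute locally, and then either reform the sub-result with the logical rank ID (non-convergence board) or collect it (convergence board). The class and method names are hypothetical, the computation is a stand-in, and the sketch works per message rather than per assigned rank ID for brevity.

```python
class InterfaceBoard:
    """Toy model of the per-packet decision described for apparatus 700."""
    def __init__(self, slot_id, first_type, second_type=None, is_convergence=False):
        self.slot_id = slot_id                 # this board's own "first identifier"
        self.first_type = first_type           # {rank_id: target slot_id}
        self.second_type = second_type or {}   # {logical rank_id: convergence slot_id}
        self.is_convergence = is_convergence
        self.collected = []                    # sub-results gathered on a convergence board

    def on_task_message(self, rank_id, payload):
        target = self.first_type[rank_id]
        if target != self.slot_id:
            return ("forward", target, rank_id, payload)    # send to the board named in the mapping
        sub_result = sum(payload)                           # stand-in for the real computation
        if self.is_convergence:
            self.collected.append(sub_result)               # wait for the remaining sub-results
            return ("collected", sub_result)
        logical_id, conv = next(iter(self.second_type.items()))
        return ("reform_and_forward", conv, logical_id, [sub_result])

# slot 3 of leaf1 in embodiment 2: handles rank IDs 1-4, drains sub-results to slot 5 as rank ID 13.
slot3 = InterfaceBoard("slot3",
                       first_type={1: "slot3", 2: "slot3", 3: "slot3", 4: "slot3", 5: "slot5"},
                       second_type={13: "slot5"})
print(slot3.on_task_message(1, [10, 20]))   # computed locally, reformed as rank ID 13
print(slot3.on_task_message(5, [7]))        # forwarded to slot 5 per the first type of mapping
```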
For the apparatus embodiments, since they essentially correspond to the method embodiments, reference may be made to the description of the method embodiments for relevant details. The apparatus embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purposes of the present application. Those of ordinary skill in the art can understand and implement the present application without undue effort.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and structural equivalents thereof, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on a manually-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general purpose and/or special purpose microprocessors, or any other type of central processing unit. Typically, the central processing unit will receive instructions and data from a read only memory and/or a random access memory. The essential elements of a computer include a central processing unit for carrying out or executing instructions and one or more memory devices for storing instructions and data. Typically, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks, etc. However, a computer does not have to have such a device. Furthermore, the computer may be embedded in another device, such as a mobile phone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices including, for example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disk or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features of specific embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. On the other hand, the various features described in the individual embodiments may also be implemented separately in the various embodiments or in any suitable subcombination. Furthermore, although features may be acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, although operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Furthermore, the processes depicted in the accompanying drawings are not necessarily required to be in the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The foregoing description of the preferred embodiments of the application is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the application.

Claims (12)

1. The method for scheduling the computing resources is characterized by being applied to a main control board and comprising the following steps of:
receiving task information sent by each server; the task information sent by any server comprises a server process identifier of a computing subtask which runs on the server and belongs to the computing task and computing resources required by the computing subtask;
determining whether an interface board with idle computing resources meeting the requirement M exists in each interface board according to the sum M of computing resources required by each computing subtask belonging to the computing task;
If not, distributing corresponding target interface boards for each computing subtask from each interface board according to the computing resources required by each computing subtask belonging to the computing task and the idle computing resources existing in each interface board, and designating at least one interface board from each interface board as a convergence board;
when the target interface board is not a convergence board, a first type mapping relation and a second type mapping relation corresponding to the computing task are issued to the target interface board;
when the target interface board is a convergence board, a first type mapping relation corresponding to the computing task is issued to the target interface board;
the first type of mapping relation at least comprises the correspondence between the server process identifier running a computing subtask and the target interface board to which the computing subtask is allocated, so that any target interface board forwards the task message sent by each server process to the target interface board corresponding to that server process for calculation, to obtain a calculation sub-result; and the second type of mapping relation is used to instruct each target interface board that is not designated as a convergence board to send its calculation sub-result to the convergence board for summarization to obtain a convergence result, and the convergence board outputs the convergence result.
2. The method according to claim 1, wherein the method further comprises: distributing corresponding logic process service identifiers for each target interface board which is not designated as a convergence board; the logic process service identifier is different from any server process identifier;
the second type of mapping relation at least comprises the correspondence between the logical process identifier corresponding to a target interface board that is not designated as a convergence board and the convergence board, so that any target interface board that is not designated as a convergence board forms a new task message from its assigned logical process service identifier and its calculation sub-result, and forwards the new task message to the convergence board according to the second type of mapping relation.
3. The method of claim 1, wherein when the convergence board is one of the target interface boards, the idle computing resources of the convergence board satisfy the computing resources required by the assigned computing subtasks of the convergence board and the computing resources required by the convergence operation; the aggregation operation refers to the operation that the aggregation board gathers the received calculation sub-results to obtain an aggregation result, or,
When the convergence board is different from the target interface board, the idle computing resources of the convergence board meet the computing resources required by the convergence operation.
4. The method according to claim 1, characterized in that the method further comprises:
issuing a third type of mapping relation corresponding to the computing task to the convergence board, so that the convergence board determines, based on the third type of mapping relation, whether the data to be converged has been collected, and performs a convergence operation when it determines that the data to be converged has been collected.
5. A method for scheduling computing resources, the method being applied to an interface board, the interface board including a first identifier, the method comprising:
receiving a task message, wherein the task message comprises a server process identifier for running a computing sub-task;
searching an interface board identifier corresponding to the server process identifier from a first type of mapping relation stored locally;
if the interface board identification is different from the first identification, sending the task message to an interface board indicated by the interface board identification, and calculating the task message by the interface board indicated by the interface board identification;
and if the interface board identifier is the same as the first identifier, performing calculation processing on the task message to obtain a calculation sub-result.
6. The method of claim 5, wherein prior to receiving the task message, the method further comprises:
when the interface board is not designated as a convergence board of a computing task by the main control board, receiving the first type mapping relation and the second type mapping relation;
when the interface board is designated as the convergence board, receiving the first type of mapping relation;
the first type of mapping relation at least comprises the correspondence between the server process identifier running each computing subtask and the identifier of the target interface board to which each computing subtask is allocated;
and the second type of mapping relation is used for indicating that the calculation sub-results are sent to a convergence board to be summarized to obtain convergence results, and the convergence board outputs the convergence results.
7. The method according to claim 6, wherein after performing calculation processing on the task message to obtain a calculation sub-result, the method further comprises:
when the interface board is not designated as a convergence board of a calculation task by the main control board, sending the calculation sub-result to the convergence board according to the locally stored second-class mapping relation;
and when the interface board is designated as the convergence board, receiving the calculation sub-results sent by other interface boards which are not designated as calculation tasks of the convergence board, executing convergence operation on the local calculation sub-results and the received calculation sub-results, and outputting convergence results.
8. The method of claim 7, wherein the second type of mapping relation at least comprises the correspondence between the logical process identifier corresponding to an interface board that is not designated as a convergence board and the convergence board;
the sending the computation sub-result to the convergence board according to the second type of mapping relation stored locally comprises:
forming a new task message according to the obtained logic process service identifier and the calculation sub-result;
and forwarding the new task message to the aggregation board according to the second type mapping relation.
9. The method of claim 7, wherein when the interface board is designated as the convergence board, the method further comprises:
receiving a third type of mapping relation, where the third type of mapping relation is used to instruct the aggregation board to determine whether the data to be aggregated has been collected;
the aggregation operation is performed on the local calculation sub-result and the received calculation sub-results on the premise that, based on the third type of mapping relation, the data to be aggregated is determined to have been collected.
10. The method of claim 7, wherein when the interface board is designated as the convergence board, the outputting the convergence result specifically comprises:
For each server process sending the computing subtasks, sending the convergence result to the server process through the established RDMA channel;
the RDMA channel is established according to the QPN number of the server where the server process is located, the IP address of the server, and the QPN number and the IP address of the leaf switch.
11. A computing resource scheduling apparatus for use in a master control board in a leaf switch, the apparatus comprising:
the first receiving module is used for receiving task information sent by each server; the task information sent by any server comprises a server process identifier of a computing subtask which runs on the server and belongs to the computing task and computing resources required by the computing subtask;
the first determining module is used for determining whether an interface board with idle computing resources meeting the requirement M exists in the interface boards according to the sum M of computing resources required by each computing subtask belonging to the computing task;
the allocation module is used for allocating corresponding target interface boards for each computing subtask from each interface board according to the computing resources required by each computing subtask belonging to the computing task and the idle computing resources existing in each interface board if no idle computing resources exist in each interface board to meet the M, and allocating at least one interface board from each interface board as a convergence board;
The issuing module is used for issuing a first type mapping relation and a second type mapping relation corresponding to the computing task to the target interface board when the target interface board is not a convergence board;
when the target interface board is a convergence board, a first type mapping relation corresponding to the computing task is issued to the target interface board;
the first type of mapping relation at least comprises the correspondence between the server process identifier running a computing subtask and the target interface board to which the computing subtask is allocated, so that any target interface board forwards the task message sent by each server process to the target interface board corresponding to that server process for calculation, to obtain a calculation sub-result; and the second type of mapping relation is used to instruct each target interface board that is not designated as a convergence board to send its calculation sub-result to the convergence board for summarization to obtain a convergence result, and the convergence board outputs the convergence result.
12. A computing resource scheduling apparatus, the apparatus being applied to an interface board, the interface board including a first identifier, the apparatus comprising:
the second receiving module is used for receiving a task message, wherein the task message comprises a server process identifier for running a calculation sub-task;
The searching module is used for searching the interface board identification corresponding to the server process identification from the locally stored first type mapping relation;
the sending module is used for sending the task message to the interface board indicated by the interface board identifier if the interface board identifier is different from the first identifier, and calculating the task message by the interface board indicated by the interface board identifier;
and the processing module is used for carrying out calculation processing on the task message if the interface board identifier is the same as the first identifier, so as to obtain a calculation sub-result.
CN202311285593.6A 2023-09-28 2023-09-28 Scheduling method and device for computing resources Active CN117041259B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311285593.6A CN117041259B (en) 2023-09-28 2023-09-28 Scheduling method and device for computing resources

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311285593.6A CN117041259B (en) 2023-09-28 2023-09-28 Scheduling method and device for computing resources

Publications (2)

Publication Number Publication Date
CN117041259A true CN117041259A (en) 2023-11-10
CN117041259B CN117041259B (en) 2024-01-12

Family

ID=88639871

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311285593.6A Active CN117041259B (en) 2023-09-28 2023-09-28 Scheduling method and device for computing resources

Country Status (1)

Country Link
CN (1) CN117041259B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108810042A (en) * 2017-04-28 2018-11-13 华为技术有限公司 A kind of task processing method, relevant device and system
CN110209496A (en) * 2019-05-20 2019-09-06 中国平安财产保险股份有限公司 Task sharding method, device and sliced service device based on data processing
CN111459659A (en) * 2020-03-10 2020-07-28 中国平安人寿保险股份有限公司 Data processing method, device, scheduling server and medium
CN111522641A (en) * 2020-04-21 2020-08-11 北京嘀嘀无限科技发展有限公司 Task scheduling method and device, computer equipment and storage medium
CN113268323A (en) * 2021-05-17 2021-08-17 北京京东振世信息技术有限公司 Task processing method and device, electronic equipment and storage medium
CN114285847A (en) * 2021-12-17 2022-04-05 中国电信股份有限公司 Data processing method and device, model training method and device, electronic equipment and storage medium
CN114779806A (en) * 2022-04-02 2022-07-22 北京航天晨信科技有限责任公司 Distributed cooperative task processing method, device, equipment and storage medium
WO2022166465A1 (en) * 2021-02-02 2022-08-11 华为技术有限公司 Message processing method and related apparatus
CN115408152A (en) * 2022-08-23 2022-11-29 吉兴信(广东)信息技术有限公司 Adaptive resource matching obtaining method and system
WO2023024663A1 (en) * 2021-08-27 2023-03-02 中兴通讯股份有限公司 Routing method and apparatus, cloud resource registration method and apparatus, storage medium, and electronic apparatus
CN116089051A (en) * 2021-10-31 2023-05-09 华为技术有限公司 Task allocation method, device and system
CN116471277A (en) * 2023-04-07 2023-07-21 西安万像电子科技有限公司 Computing power distribution method, computing power distribution device, server and computer readable storage medium

Also Published As

Publication number Publication date
CN117041259B (en) 2024-01-12

Similar Documents

Publication Publication Date Title
CN111651253B (en) Computing resource scheduling method and device
US9619292B2 (en) Resource placement in networked cloud based on resource constraints
WO2021147353A1 (en) Order dispatch
US20140297864A1 (en) Method and system to allocate bandwidth for heterogeneous bandwidth request in cloud computing networks
WO2012100544A1 (en) Method, device and cluster system for virtual machine migration based on network data flow direction
CN105183565A (en) Computer and service quality control method and device
CN112835695B (en) Method for communication between Pod and distributed computing system
US11734172B2 (en) Data transmission method and apparatus using resources in a resource pool of a same NUMA node
CN111611076B (en) Fair distribution method for mobile edge computing shared resources under task deployment constraint
CN112888005B (en) MEC-oriented distributed service scheduling method
CN115033340A (en) Host selection method and related device
Wen et al. Load balancing job assignment for cluster-based cloud computing
CN115208812A (en) Service processing method and device, equipment and computer readable storage medium
CN117041259B (en) Scheduling method and device for computing resources
CN113014408A (en) Distributed system and management method thereof
CN111679918B (en) Message transmission method and device
CN115344358A (en) Resource scheduling method, device and management node
CN113098841A (en) Queuing method for logging in cloud computer, computer equipment and readable storage medium
JP2022014662A (en) Communication control device, communication control method and communication control program
CN113452729A (en) Serial number determination method, equipment and storage medium
CN117082152B (en) Service processing method, system and device
KR20150137796A (en) Optimal method for resource allocation and data distribution in mobile cloud computing
CN117724853B (en) Data processing method and device based on artificial intelligence
CN110677463B (en) Parallel data transmission method, device, medium and electronic equipment
JP2020016910A (en) Resource allocation apparatus, resource management system and resource allocation program

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant