WO2023179741A1

WO2023179741A1 - Computation system and data transmission method

Info

Publication number: WO2023179741A1
Application number: PCT/CN2023/083532
Authority: WO
Inventors: 勾文进
Original assignee: 华为技术有限公司
Priority date: 2022-03-24
Filing date: 2023-03-24
Publication date: 2023-09-28
Also published as: CN116841946A

Abstract

Disclosed in the embodiments of the present application are a computation system and a data transmission method, used for reducing a communication time delay of a computation system. The computation system in the embodiments of the present application comprises a first node and a second node, the first node being connected to a first network interface controller, and the first node being connected to the second node by means of the first network interface controller. The first node comprises a plurality of first processor cores, each first processor core being used for executing a computation task, generating first data according to the computation task, and sending the first data to the first network interface controller. The first network interface controller is used for executing an aggregation operation on the plurality of first data sent by the plurality of first processor cores, and sending the data generated after execution of the aggregation operation to the second node by means of the first network interface controller.

Description

Computing system and data transmission method

This application claims priority to the Chinese patent application filed with the China Patent Office on March 24, 2022, with the application number 202210295346.3 and the invention title "A computing system and data transmission method". The entire content of the patent application is incorporated by reference. incorporated in this application.

Technical field

Embodiments of the present application relate to the field of computers, and in particular, to a computing system and a data transmission method.

Background technique

With the development of high performance computing (HPC) technology, the message passing interface (message passing interface), as a cross-language communication protocol, accounts for an increasing proportion of HPC communications. Among them, collective communication is an integral part of MPI communication, and optimizing collective communication delay is an important direction for HPC communication optimization.

Set communication is divided into inter-node communication and intra-node communication, and intra-node communication and inter-node communication are executed serially. In the current collective communication process, when communicating between two nodes, it is first necessary for a processor in the sending node to aggregate all the data of all processor cores in the sending node, and then send it to the receiving node through the network card. node.

As the number of processor cores in a node increases, the time-consuming of aggregation operations within nodes in collective communication also increases. In addition, aggregation operations within nodes also increase the load on the processor cores.

Contents of the invention

This application provides a computing system and a data transmission method for reducing the communication delay of the computing system.

A first aspect of this application provides a computing system. The computing system includes a first node and a second node. The first node is connected to a first network card, and the first node is connected to the second node through the first network card. The first node includes a plurality of first processor cores. The first processor core is the processor core responsible for communicating with the network card in each processor core group. The first processor core may also be called a leader in this application. (leader) processor core or root processor core. Each first processor core is used to perform computing tasks, generate first data according to the computing tasks, and send the first data to the first network card. The first data includes data generated by collective communication between the first processor cores in the core group. ,Gather communication such as ALL-Reduce, All-to-All and ALL-Gather, etc. The first network card is used to identify multiple first processors and sent data, perform an aggregation operation on multiple first data sent by multiple first processor cores, and pass the data generated after performing the aggregation operation through the first network card. sent to the second node.

The first network card in the computing system provided in this application can implement aggregation operations on data sent by processor cores. Compared with data sent by multiple first processor cores in the first node, it is first processed by one of the first processors. The data is checked and aggregated before being sent to the first network card. The aggregation operation performed by the first network card reduces the communication between multiple first processor cores within the first node, thereby reducing the time-consuming aggregation operation within the node. And reduce the load on the processor core in the node.

In a possible implementation, the first node includes a plurality of core groups, the first core group is any one of the plurality of core groups, and the first core group includes a first processor core and at least a second processor core. The computing tasks performed by the first processor core in the first core group include aggregating data sent by at least one second processor core in the first core group and data in the first processor core into a first data.

In this application, multiple first processor cores of the first node perform aggregation operations on data sent by other processor cores in the same core group, thereby avoiding communication between the first processor cores of different core groups, thereby reducing the number of Multiple first processes within one node The communication delay between processor cores across core groups is further reduced, thereby further reducing the communication delay of the computing system.

In a possible implementation, after receiving the data sent by other processor cores in the same core group, the first processor core in the first core group performs an aggregation operation with the data in the first processor core to obtain the first data. , by packaging the first data into a first message and sending it to the first network card, where the first message includes a mark indicating that the first network card performs an aggregation operation on the first data. The first network card is used to aggregate the first data in the first packet with the mark. Specifically, the first network card receives the message sent by multiple processor cores in the first node, identifies the mark in the message, and performs an aggregation operation on the first data in the first message with the mark. The tag in the first message includes the target field in the header of the first message, such as the "Coll Tag" field.

In this application, the first network card can identify the packets that need to be aggregated according to the tag in the first packet, and perform the aggregation operation on the packets with the tags, so that it can accurately identify the packets that need to be aggregated, thereby improving the solution. achievability.

In a possible implementation, the first network card is also used to provide multiple first processor cores in the first network card. Specifically, the first network card pre-sets multiple first processor cores in the first node that need to interact with the first network card, that is, a leader processor core is set. There are many ways to set multiple first processor cores in the first network card. For example, when the computing system creates a communication domain, during the resource initialization process, the first node communicates with the network card based on the multiple first processor cores. All-Reduce collects communication operations, and the first network card can collect multiple first processor cores in the communication domain corresponding to the first network card.

In this application, the first network card can pre-set multiple first processor cores in the first node that communicate with the first network card, so that the first network card can identify the first processor core during the collective communication process, further improving the solution. achievability.

In a possible implementation, the first network card is also configured to receive a second message sent by a second node, the second node being an external node of the first node, and the second message includes an instruction indicating that the first network card is responsible for the second message. The message is marked for broadcast. The first network card sends the second message to the plurality of first processor cores according to the tag of the second message and the tags of the multiple first processor cores set in the first network card. Specifically, the first network card matches the tags of the second message according to the tags of the plurality of first processor cores, and the first network card performs a broadcast operation for the first processor core with the tags of the second message.

In this application, the first network card can perform a broadcast operation on the second message sent by the second node. Compared with the first network card forwarding the second message to the first node, and then the first processor core in the first node The solution of broadcasting the second message to the processor cores of different core groups avoids the second message from being broadcast across core groups in the first node, thereby reducing the delay in broadcasting the second message and further reducing calculations. System communication delay.

In a possible implementation, the first processor core in the first core group is used to send the second message to at least one second processor core in the first core group, that is, multiple first processors The cores broadcast the second message within their respective core groups.

In this application, after receiving the second message sent by the first network card, the multiple first processor cores in the first node distribute the second message to other processor cores in the core group where the first processing core is located. Since It avoids the first processor core from distributing the second message across core groups, thereby improving and reducing the communication delay of the computing system.

In a possible implementation, the first processor core and at least one second processor core included in the first core group belong to a central processing unit CPU, or to a non-uniform memory access NUMA unit. That is, the processor cores in the first node are divided into core groups according to hardware ownership. The processor cores belonging to the same CPU can be a core group, or the processor cores belonging to the same NUMA unit can be a core group.

In this application, a first processor core is set in each CPU of the first node or a first processor is set in each NUMA. The first processor directly interacts with the first network card, thereby reducing data communication across CPUs in the first node or data communication across NUMA units, thereby reducing communication delay of the computing system.

The second aspect of this application provides a data transmission method, which can be executed by a computing system or by a computing system. The components of the system, such as the execution of the processor, chip or chip system of the first network card or the first node in the computing system, can also be implemented by logic modules or software that can realize all or part of the functions of the computing system. The data transmission method provided in the first aspect includes: the first network card receives the first data sent by each processor in a plurality of first processor cores in the first node, wherein the plurality of first processor cores are the first network card The preset processor core is responsible for communicating with the first network card. Each first processor core in the first processor core is used to perform a computing task, generate first data according to the computing task, and send the first data to the first network card. The first data includes the first processor in the core group. The data generated by the core's collective communication, such as ALL-Reduce, All-to-All and ALL-Gather, etc. The first network card performs an aggregation operation on the plurality of first data. The first network card sends the data after performing the aggregation operation to the second node.

In this application, the first network card can implement aggregation operations on data sent by multiple first processor cores of the first node. Compared with data sent by multiple first processor cores in the first node, it is first processed by one of the first processor cores. One processor core performs an aggregation operation on the data and then sends it to the first network card, which reduces the communication delay between multiple first processor cores within the first node, thereby reducing the communication delay of the computing system.

In a possible implementation, when the first network card receives the first data sent by each of the plurality of first processor cores in the first node, the first network card receives the first data sent by the first processor core. A first message including first data, the first message including a mark indicating that the first network card performs an aggregation operation on the first data. During the process of the first network card performing an aggregation operation on multiple first data, the first network card aggregates the first data in the first message with the mark.

In the embodiment of the present application, the first network card receives the first message including the first data sent by multiple first processor cores, and identifies the message that needs to be aggregated according to the mark in the first message. For the message with the mark, The aggregation operation is performed on the packets, so that the first packet that needs to be aggregated can be accurately identified, which improves the realizability of the solution.

In a possible implementation, before the first network card receives the first data sent by each of the plurality of first processor cores in the first node, the first network card sets multiple first processor cores. Specifically, when the computing system creates a communication domain, during the resource initialization process, the first node performs an All-reduce set communication operation with the network card based on multiple first processor cores, and the first network card can collect the information corresponding to the first network card. multiple first processor cores of the communication domain.

In a possible implementation, the first network card receives a second message sent by the second node, and the second message includes a mark instructing the first network card to broadcast the second message. The first network card sends the second message to the plurality of first processor cores according to the tag of the second message and the tags of the multiple first processor cores set in the first network card.

In a possible implementation, the first node includes multiple core groups. The core group includes a first processor core and at least one second processor core, where the first core group is any one of the multiple core groups. , the first processor core in the first core group receives data sent by at least one second processor core in the first core group. The first processor core in the first core group aggregates data sent by at least one second processor core in the first core group and data in the first processor core into first data.

In a possible implementation, the first network card receives the first data sent by each of the plurality of first processor cores in the first node, and the first network card receives the first data from the first processor in the first core group. The core sends a first message including first data.

In a possible implementation, the first processor core in the first core group receives the second message sent by the first network card. The first processor core in the first core group sends the second message to at least one second processor core in the first core group.

The third invention of the present application provides a data transmission method, which is executed by a network card. The method includes the steps executed by the first network card in any method provided in the second aspect.

The fourth aspect of this application provides a network card, including a transceiver unit and a processing unit. The transceiver unit and the processing unit are used to implement the steps performed by the first network card in any method provided in the third aspect.

The fifth aspect of this application provides a computing device, including a transceiver unit and an aggregation unit executed by a first processor core, A sending unit executed by the second processor core. The transceiver unit and the aggregation unit are used to perform the functions performed by the first processor core in any method provided by the second aspect, and the sending unit is used to perform the second processor core in any method provided by the second aspect. The functions performed by the processor core.

A sixth aspect of the present application provides a network card, including a processor. The processor is coupled to a memory. The processor is configured to store instructions. When the instructions are executed by the processor, the network card performs any method provided in the third aspect.

A seventh aspect of the present application provides a computing device, including a plurality of core groups. The first core group is any one of the plurality of core groups. The first core group includes a first processor core and At least one second processor core. The computing tasks performed by the first processor core in the first core group include combining the data sent by the at least one second processor core in the first core group with the The data in the first processor core is aggregated into the first data, and the first processor in each core group is used to send the first data to the network card.

The eighth aspect of this application provides a chip, including an interface and a processing unit. The interface is used to send and receive data. The processing unit is used to perform the above-mentioned second aspect and the first network card in any possible implementation of the second aspect. function.

The eighth aspect of the present application provides a computer-readable storage medium on which instructions are stored. When the instructions are executed, the computer executes the first network card in the above-mentioned second aspect or any of the possible implementations of the second aspect. The method of execution, or to cause the computer to execute the method executed by the first node in the above second aspect or any possible implementation manner of the second aspect.

The first aspect of this application provides a computer program product. The computer program product includes instructions. When the instructions are executed, the computer implements the above-mentioned second aspect or any of the possible implementations of the second aspect and is executed by the first network card. method, or to enable the computer to implement the method executed by the first node in the above second aspect or any possible implementation manner of the second aspect.

It can be understood that the beneficial effects that can be achieved by any of the data transmission methods, network cards, computing devices, chips, computer-readable media or computer program products provided above can be referred to the beneficial effects in the corresponding computing system, which will not be discussed here. Repeat.

Description of the drawings

Figure 1 is a schematic diagram of a communication system architecture provided by an embodiment of the present application;

Figure 2 is a schematic flow chart of a data transmission method provided by an embodiment of the present application;

Figure 3a is a schematic diagram of collective communication provided by an embodiment of the present application;

Figure 3b is a schematic diagram of collective communication provided by an embodiment of the present application;

Figure 4 is a schematic diagram of a message format provided by an embodiment of the present application;

Figure 5 is a schematic diagram of a network card performing a message aggregation operation provided by an embodiment of the present application;

Figure 6 is a schematic diagram of another data transmission method according to an embodiment of the present application;

Figure 7 is a schematic structural diagram of a network card provided by an embodiment of the present application;

Figure 8 is a schematic structural diagram of a computing device provided by an embodiment of the present application;

Figure 9 is a schematic structural diagram of another network card provided by an embodiment of the present application.

Detailed ways

Embodiments of the present application provide a computing system and a data transmission method for reducing the communication delay of the computing system.

The terms "first", "second", "third", "fourth", etc. (if present) in the description and claims of this application and the above-mentioned drawings are used to distinguish similar objects without necessarily using Used to describe a specific order or sequence. It is to be understood that data so used are interchangeable under appropriate circumstances so that the embodiments described herein can be used in applications other than those illustrated or described herein. Implemented in a sequence other than those described above. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusions, e.g., a process, method, system, product, or apparatus that encompasses a series of steps or units and need not be limited to those explicitly listed. Those steps or elements may instead include other steps or elements not expressly listed or inherent to the process, method, product or apparatus.

In the embodiments of this application, words such as "exemplary" or "for example" are used to represent examples, illustrations or explanations. Any embodiment or design described as "exemplary" or "such as" in the embodiments of the present application is not to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the words "exemplary" or "such as" is intended to present the concept in a concrete manner.

In the following, some terms used in this application are explained to facilitate understanding by those skilled in the art.

Message passing interface (MPI) is a standardized message passing standard that can run on a variety of parallel computing systems. This standard defines the core syntax and semantics of the communication library. Users can write message passing programs using a variety of programming languages, such as C language, C++ language and Fortran language.

Collective communication is an important communication part of MPI. In this collective communication mode, all processes in the communication domain participate in communication. Collection communication models such as All-reduce, Broadcast and All-to-all.

The computing system and data transmission method provided by the embodiments of the present application will be introduced below with reference to the accompanying drawings.

Please refer to Figure 1, which is a schematic diagram of the system architecture of a collective communication scenario provided by an embodiment of the present application. As shown in Figure 1, the collective communication system provided by the embodiment of the present application includes multiple computing devices and multiple network cards and switches 103. The plurality of computing devices includes a first node 101 and a second node 104, and the plurality of network cards includes a first network card 102 and a second network card 105. Taking the first node 101 as an example, the first node 101 includes multiple processor cores, and the multiple processor cores can be divided into multiple processor core groups according to hardware ownership. For example, the core group can be divided according to the central processing unit CPU, that is, the processor cores belonging to the same CPU are divided into one processor core group. The division of core groups can also be divided according to non-uniform memory access NUMA units. The processor cores belonging to the same NUMA unit are divided into one processor core group.

In the system shown in FIG. 1 , data communication can be carried out between multiple computing devices, for example, data communication can be carried out between the first node 101 and the second node 104 . Data communication can also be carried out between processor cores in a single node device. For example, data communication between processor cores can be carried out between multiple processor cores in a single first node 101. Data communication between processor cores can be carried out. Communication includes data communication between processor cores in the same core group and data communication across core groups between processor cores in different core groups.

In the scenario of data communication between computing devices, the processor core in the first node 101 executes a computing task to generate data to be sent, packages the data to be sent, and sends it to the first network card 102. The first network card 102 transmits the data to be sent. The sent data is sent to the switch 103, and then the switch 103 forwards it to the second node 104.

In the embodiment of the present application, the first network card 102 can identify, aggregate and forward messages sent by multiple processor cores within the first node 101, thereby reducing the transmission of messages across core groups within the node device 101. The first network card 102 includes a processing module 1021. The processing module 1021 is specifically used to set the first processor core in the first node 101. The set first processor core is the one responsible for communicating with the network card in each processor core group. Processor core, the first processor core may also be called a leader processor core or a root processor core in this application. The processing module 1021 is also used to identify and aggregate the messages sent by the set first processor core. Among them, the processing module 1021 can be an application specific integrated circuit (ASIC), a field programmable gate array (field programmable gate array, FPGA), a microprocessor MPU (micro-processing unit) and other hardware forms. , there is no specific limit.

It should be noted that in the embodiment of the present application, the first network card 102 can also be integrated in the first node 101. When the first network card 102 is integrated in the first node 101, the first network card 102 can be a board in the first node 101. network card.

The communication scenarios between the processor core in the first node 101 and the first network card 102 include a many-to-one fan-in scenario (Fan-in) and a one-to-many fan-out (Fan-out) scenario. The many-to-one fan-in scenario refers to a process in which data sent by multiple processor cores in the first node 101 is aggregated by the first network card 102 and then sent by the first network card 102 to the second node 104 . The one-to-many fan-out scenario refers to the process in which the first network card 102 receives the data sent by the second node 104 and then distributes the data to multiple processor cores in the first node 101 .

The data transmission method provided in the embodiment of the present application includes a fan-in scenario and a data fan-out scenario. The following takes the data fan-in scenario as an example and introduces the data transmission method provided in the embodiment of the present application with reference to the accompanying drawings:

Please refer to Figure 2. Figure 2 is a schematic flowchart of a data transmission method in a fan-in scenario provided in the implementation of this application. The method includes the following steps:

201. Multiple first processor cores in the first node 101 execute computing tasks to generate first data.

In the embodiment of the present application, the first node 101 includes multiple core groups, wherein the first core group 1011 is any one of the multiple core groups in the first node 101, and the first core group 1011 includes a first processor. Core 10111 and at least one second processor core 10112, where the first processor core 10111 is a preset leader processor core in the first core group, and the leader processor core is responsible for interacting with the first network card. It can be understood that each of the plurality of core groups in the first node 101 is provided with a first processor core, and the first processor core interacts with the first network card.

Multiple first processor cores in the first node 101 execute computing tasks to generate first data. Specifically, when multiple first processor cores in the first node 101 execute computing tasks, taking the first processor core 10111 in the first core group 1011 as an example, the first processor core 10111 in the first core group 1011 The processor core 10111 aggregates the data sent by at least one second processor core 10112 in the first core group 1011 and the data in the first processor core 10111 into first data, and the first data is the first processor core The first data generated by performing a calculation task.

It should be understood that the first processor core of each of the plurality of core groups in the first node 101 can perform a computing task to generate the first data, that is, each of the plurality of first processor cores in the first node 101 can The processor cores perform calculation tasks to generate first data.

In some embodiments, the processor cores in the first node 101 are not divided into core groups. The first node 101 only includes first processor cores. Each first processor core sends the first data after executing its own computing task. To the network card, the network card performs aggregation.

For convenience of description, the following description takes the division of processing cores in the first node 101 into core groups as an example.

In an example of performing a computing task, the first node 101 performs the task of a high-performance computing HPC application, such as a weather forecast application. The first node 101 starts multiple processes, and the processor cores where the multiple processes are located need to perform collective communication. ,Set communication includes Reduce, Broadcast and All-Reduce. For example, during All-Reduce communication, a reduction operation is performed on the data in the input buffer of each processor core in the communication domain, and the result of the reduction operation is returned to the output buffer of each processor core. That is, it is the first data for performing calculation tasks. The above-mentioned reduction operations include summation, maximum value or minimum value, etc.

In All-Reduce communication, communication parameters include the starting address of the input buffer revbuf, the starting address of the output buffer sendbuf, the number of data count and the data type datatype. In All-Reduce communication, all processor cores in the first node have the same communication parameters, so the processes in each processor core provide input buffers and output buffers with the same length and the same element type.

Since the core group division in the embodiment of the present application is based on the hardware ownership of the processor core, that is, the first core group 1011 may be a central processing unit CPU unit or a non-uniform memory access NUMA unit. The steps in which multiple first processor cores of the first node execute computing tasks to generate the first data in the two core group division methods are described below with reference to the accompanying drawings.

Please refer to Figure 3a, which is a schematic diagram of collective communication provided by the implementation of this application. As shown in Figure 3a, the processor cores in the first node 102 are divided into multiple core groups according to the CPU ownership range. It is assumed that CPU1 is the first core group 1011 and CPU2 is the second core group 1012. The processor cores of the first core group 1011 include processor core 0 to processor core 47 in CPU1, where processor core 0 is the first processor core in the first core group 1011. The processor cores of the second core group 1012 include the processor core 48 to the processor core 95 in the CPU 2 , where the processor core 48 is the first processor core in the second core group.

In the example shown in Figure 3a, taking the processor core in CPU1 as an example, when the processor core in CPU1 is executing a computing task, processor cores 1 to 47 send data to processor core 0. , the processor 0 performs an aggregation operation on the data sent from the processor cores 1 to 47 and the data of the processor core 0 to obtain the first data.

Please refer to Figure 3b, which is a schematic diagram of another collective communication provided by the implementation of this application. As shown in Figure 3b, the processor cores in the first node 101 are divided into multiple core groups according to the NUMA ownership range. Assume that NUMA1 is the first core group 1011, NUMA2 is the second core group 1012, and NUMA3 is the third core group 1013. NUMA4 is the fourth core group 1014. The processing cores of the first core group 1011 include processor core 0 to processor core 23 in NUMA1, where processor core 0 is the first processor core in the first core group 1011 . The processor cores of the second core group 1012 include processor cores 24 to 47 in NUMA2, where the processor core 24 is the first processor core in the second core group 1012 . The processor cores of the third core group 1013 include processor core 48 to processor core 71 in NUMA3, where processor core 48 is the first processor core in the third core group. The processor cores of the fourth core group 1014 include processor cores 72 to 95 in NUMA4, where the processor core 72 is the first processor core in the fourth core group 1014 .

In the example shown in Figure 3b, taking the processor core in NUMA1 as an example, when the processor core in NUMA1 is executing a computing task, processor core 1 to processor core 23 send data to processor core 0. , the processor 0 performs an aggregation operation on the data sent from the processor cores 1 to 23 and the data of the processor core 0 to obtain the first data.

202. Each first processor core among the plurality of first processor cores in the first node 101 sends the first data to the first network card 102 .

Each first processor core among the plurality of first processor cores in the first node 101 sends the first data to the first network card 102 . Specifically, after each first processor core of the plurality of first processor cores in the first node 101 packages the first data into a first message, it sends the first message to the first network card 102, where, The first message includes a mark indicating that the first network card 102 performs an aggregation operation on a plurality of first data.

It can be understood that, in addition to receiving the first message that needs to be aggregated, the first network card 102 will also receive the regular message sent by the first node 1011. The first network card 102 will identify the need to perform the aggregation operation based on the above mark. The first packet is aggregated, and other regular packets are forwarded directly.

Please refer to Figure 4, which is a schematic diagram of a message format of a first message provided by an embodiment of the present application. As shown in Figure 4, the MPI+ header of the first message includes multiple fields, among which the "Coll tag" field or the "Coll tag high 32" field is the tag of the first message. The first network card can identify the first packet that needs to be aggregated based on the "Coll tag" field or the "Coll tag high 32" field. The MPI+ header of the first message also includes other fields, such as communication domain identifier Comm ID, operation type identifier Opt code, and data type identifier data type.

Please continue to refer to Figure 3a. In the example shown in Figure 3a, the processor core 0 in the CPU1 of the first node 101 packages the generated first data into a first message and then sends it to the first network card 102. The first The network card 102 identifies the first packet as a packet that needs to be aggregated according to the tag in the first packet. Similarly, the processor core 0 in CPU2 of the first node 101 also packages the generated data into a first message and sends it to the first network card 102. The first network card 102 receives the first messages sent by the leader cores in multiple CPUs 1. .

Correspondingly, in the example shown in Figure 3b, the leader cores in multiple NUMA units in the first node 101 send the first message to the first network card 102, and the processor cores 0 and NUMA2 units in the NUMA1 unit process it. The processor core 24, the processor core 48 in the NUMA3 unit, and the processor core 72 in the NUMA4 unit respectively send the first message to the first network card 102. The first network card 102 receives the first messages sent from the leader cores of multiple NUMA units, and identifies the messages that need to be aggregated according to the tags of the multiple first messages.

203. The first network card 102 performs an aggregation operation on the first data sent by the plurality of first processor cores.

The first network card 102 performs an aggregation operation on the first data sent by multiple first processor cores. Specifically, after receiving multiple messages sent by the first node, the first network card 102 identifies the first message according to the mark in the message header, and performs an aggregation operation on the first data in the first message with the mark.

Please refer to Figure 5. Figure 5 is a schematic diagram of a network card performing an aggregation operation on packets according to an embodiment of the present application. As shown in Figure 5, after the network interface controller 1022 (NIC) of the first network card 102 receives the first message sent by the leader processor of a different core group of the first node 101, the first network card 102 internally The processing module 1021 identifies the mark in the first message, performs an aggregation operation on the first data in the first message with the mark, generates an aggregated message, and sends the aggregated message to the third A network interface controller 1022 inside the network card 102.

204. The first network card 102 sends the data generated after the aggregation operation to the second node.

The first network card 102 sends the data generated after the aggregation operation to the second network card 105 . Specifically, the network interface controller 1022 of the first network card 102 sends the packet generated by the aggregation operation to the switch, and the switch forwards it to the second network card 105 connected to the second node 104 .

It can be seen from the above embodiments that in the embodiment of the present application, the first network card can perform an aggregation operation on the data sent by the first processor core. Compared with the data sent by multiple first processor cores in the first node, One of the first processor cores performs an aggregation operation on the data and then sends it to the first network card. The solution of having the first network card perform the aggregation operation reduces the communication delay between multiple first processor cores within the first node. This in turn reduces the communication delay of the computing system.

Steps 201 to 204 of the above embodiment describe in detail the data transmission method of the computing system in the fan-in scenario. The data transmission method of the computer system in the fan-out scenario is similar to the above method. The following is a description of the data transmission method in the fan-out scenario with reference to Figure 6 The transmission method is introduced. The data transmission method in the fan-out scenario includes the following steps:

601. The first network card 102 receives the second message sent by the second node 104.

The first network card 102 receives the second message sent by the second node 104. Specifically, the second node 104 sends the second message to the switch 103, and the switch 103 forwards the second message to the first network card 102 according to the destination address of the second message. The second message also includes a mark used to instruct the first network card 102 to broadcast the second message. This mark is the same as the mark of the first message in the above fan-in scenario, and will not be described again here.

602. The first network card 102 broadcasts the second message to multiple first processor cores in the first node 101 according to the tag of the second message.

The first network card 102 broadcasts the second message to the plurality of first processor cores in the first node 101 according to the tag of the second message. Specifically, the first network card 102 broadcasts the second message to multiple first processors in the first node 101 according to the mark of the second message and the marks of the multiple first processor cores set in the first network card 102 Organ core.

It should be understood that the first network card 102 will receive multiple messages sent by the second node 104, the first network card 102 will identify the messages sent by the second node 104, and perform a broadcast operation on the second messages with the above tags.

It should be noted that when the computing system creates a communication domain, the computing system will set the first processor core in the first network card 102, that is, the leader processor core that interacts with the network card will be set in the first network card 102 in advance, so that , the first network When the card 102 receives the second message requiring a broadcast operation, the card 102 may broadcast the second message to multiple leader processor cores in the first node.

The first network card 102 can set the first processor core in a variety of ways. For example, during the resource initialization process when the computing system creates a communication domain, the first node 1011 performs an All-process with the network card based on multiple first processor cores. Reduce aggregates communication operations to set the first processor core of the communication domain corresponding to the first network card 102 in the first network card 102 .

603. Multiple first processor cores distribute the second message to at least one second processor core in the same core group.

The plurality of first processor cores distribute the second message to at least one second processor core in the same core group. For example, in the first core group 1011 of the first node 101, the first processor core in the first core group 1011 distributes the second message to at least one second processor core in the first core group 1011. .

It can be seen from the above embodiments that in the embodiment of the present application, the first network card 102 can perform a broadcast operation on the second message sent by the second node 104. Compared with the first network card 102 forwarding the second message to the second node 104, A node 101, and then the first processor core in the first node 101 broadcasts the second message to the processor cores of different core groups, which avoids the second message being broadcast across core groups in the first node 101. This reduces the delay in broadcasting the second message and further reduces the communication delay in the computing system.

Please refer to Table 1. Table 1 is a table of delay reduction rates of a data transmission method provided by embodiments of the present application. As shown in Table 1, taking the first node with 96 processor cores in Table 1 as an example, the "node-based delay" refers to the data sent by the first processor cores of different core groups in the first node. After one of the first processor cores is aggregated, the communication delay is forwarded to the external node through the network card. "Socket-based delay" refers to the communication time in this application that the data sent by the first processor cores of different core groups in the first node is directly sent to the first network card, and then the first network card performs an aggregation operation and then forwards it to the external node. extension.

As can be seen from Table 1, taking 8 bytes of transmission data as an example, the "socket-based" data transmission method provided by the embodiment of the present application is compared with the "node-based" data transmission method. The communication time of the calculation system is Delay decreased by 26.28%.

Table 1

The above describes the data transmission method in the embodiment of the present application, and the following describes the relevant devices involved in the embodiment of the present application.

Please refer to Figure 7. Figure 7 is a schematic structural diagram of a network card provided by an embodiment of the present application. The network card is used to implement various steps performed by the first network card in the above embodiments. As shown in Figure 7, the network card 700 includes a transceiver unit 701 and a processing unit 702.

The transceiver unit 701 is configured to receive first data sent by each processor in a plurality of first processor cores in the first node. The processing unit 702 is configured to perform an aggregation operation on a plurality of first data. The transceiver unit 701 is also used to send the data after performing the aggregation operation to the second node.

In a possible implementation, the transceiver unit 701 is specifically configured to receive a first message including first data sent by the first processor core, where the first message includes a mark indicating that the first network card performs an aggregation operation on the first data. . Processing unit 702 The body is used to aggregate the first data in the first message with the mark.

In a possible implementation, before the transceiver unit 701 is configured to receive the first data sent by each of the plurality of first processor cores in the first node, the processing unit 702 is further configured to set multiple first Processor core.

In a possible implementation, the transceiver unit 701 is also configured to receive a second message sent by the second node, where the second message includes a mark instructing the first network card to broadcast the second message. The transceiver unit 701 is also configured to send the second message to multiple first processor cores according to the tag of the second message and the tags of the multiple first processor cores set in the first network card.

Please refer to FIG. 8 , which is a schematic structural diagram of a computing device provided by an embodiment of the present application. For example, the function module diagram of the first node. The first processor core of each core group in the computing device 800 includes a transceiver unit 801 and an aggregation unit 802, and the second processor core of each core group includes a transmitter unit 803. The transceiver unit 801 and the aggregation unit 802 are In executing the functions executed by the first processor core in FIGS. 2 and 6 , the sending unit is used to execute the functions executed by the second processor core in FIGS. 2 and 6 . The transceiving unit 801 of the first processor core is used to receive data sent by the sending unit 803 of at least one second processor core. The aggregation unit 802 is used to aggregate the data sent by at least one second processor core and the data in the first processor core into first data. The transceiver unit 801 of the first processor core is also used to send the first data to the network card. And instruct the network card to perform an aggregation operation on the first data sent by the transceiver unit 801 in the first processor in the multiple core groups.

In a possible implementation, the aggregation unit 802 is specifically configured to package the first data into a first message. The transceiver unit 801 is specifically configured to send a first message to the network card, where the first message includes a mark instructing the network card to perform an aggregation operation on the first data.

In a possible implementation, the transceiver unit 801 is also configured to receive a second message sent by the network card. The second message includes a mark instructing the network card to broadcast the second message. The plurality of first processor cores are network cards. Preconfigured processor cores.

In a possible implementation, the transceiving unit 801 is configured to send the second message to at least one second processor core in the first core group.

It should be understood that the division of units in the above device is only a division of logical functions. In actual implementation, all or part of the units may be integrated into a physical entity or physically separated. And the units in the device can all be implemented in the form of software calling through processing components; they can also all be implemented in the form of hardware; some units can also be implemented in the form of software calling through processing components, and some units can be implemented in the form of hardware. For example, each unit can be a separate processing element, or it can be integrated and implemented in a certain chip of the device. In addition, it can also be stored in the memory in the form of a program, and a certain processing element of the device can call and execute the unit. Function. In addition, all or part of these units can be integrated together or implemented independently. The processing element described here can also be a processor, which can be an integrated circuit with signal processing capabilities. During the implementation process, each step of the above method or each unit above can be implemented by an integrated logic circuit of hardware in the processor element or implemented in the form of software calling through the processing element.

It is worth noting that for the above method embodiments, for the sake of simple description, they are all expressed as a series of action combinations. However, those skilled in the art should know that the present invention is not limited by the described action sequence. Secondly, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily necessary for the present application.

Based on the above description, those skilled in the art can think of other reasonable step combinations, which also fall within the protection scope of the present application. Secondly, those skilled in the art should also be familiar with the fact that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily necessary for the present application.

Please refer to FIG. 9 , which is a schematic structural diagram of a network card provided by an embodiment of the present application. As shown in Figure 9, the network card 900 includes: a processor 910, a memory 920 and an interface 930. The processor 910, the memory 920 and the interface 930 are coupled through a bus (not labeled in the figure). The memory 920 stores instructions. When the execution instructions in the memory 920 are executed, the network card 900 executes the method executed by the first network card or the first node in the above method embodiment.

The network card 900 may be one or more integrated circuits configured to implement the above methods, for example: one or more specific Integrated circuit (application specific integrated circuit, ASIC), or one or more microprocessors (digital signal processor, DSP), or one or more field programmable gate arrays (field programmable gate array, FPGA), or these A combination of at least two of the integrated circuit forms. For another example, when the unit in the device can be implemented in the form of a processing element scheduler, the processing element can be a general processor, such as a central processing unit (CPU) or other processor that can call a program. For another example, these units can be integrated together and implemented in the form of a system-on-a-chip (SOC).

The processor 910 may be a central processing unit (CPU), or other general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), or a field-controller processor. Field programmable gate array (FPGA) or other programmable logic devices, transistor logic devices, hardware components or any combination thereof. A general-purpose processor can be a microprocessor or any conventional processor.

Memory 920 may include read-only memory and random access memory and provides instructions and data to processor 910 . Memory 920 may also include non-volatile random access memory. For example, the memory 920 may be provided with multiple partitions, and each area is used to store private keys of different software modules.

Memory 920 may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. Among them, non-volatile memory can be read-only memory (ROM), programmable ROM (PROM), erasable programmable read-only memory (erasable PROM, EPROM), electrically removable memory. Erase electrically programmable read-only memory (EPROM, EEPROM) or flash memory. Volatile memory can be random access memory (RAM), which is used as an external cache. By way of illustration, but not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (synchronous DRAM, SDRAM), Double data rate synchronous dynamic random access memory (double data date SDRAM, DDR SDRAM), enhanced synchronous dynamic random access memory (enhanced SDRAM, ESDRAM), synchronous link dynamic random access memory (synchlink DRAM, SLDRAM) and direct Memory bus random access memory (direct rambus RAM, DR RAM).

In addition to the data bus, the bus may also include a power bus, a control bus, a status signal bus, etc. The bus can be a peripheral component interconnect express (PCIe) bus, an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), or a computer quick link (compute express link (CXL), cache coherent interconnect for accelerators (CCIX), etc. The bus can be divided into address bus, data bus, control bus, etc.

It can be understood that the network card 700 shown in Figure 7 and the network card 900 shown in Figure 9 in the embodiment of the present application can be the first network card in the system architecture shown in Figure 1, and the network card shown in Figure 8 in the embodiment of the present application The computing device 800 may be the first node in the system architecture shown in FIG. 1 .

In another embodiment of the present application, a computer-readable storage medium is also provided. Computer-executable instructions are stored in the computer-readable storage medium. When the processor of the device executes the computer-executed instructions, the device executes the above method embodiment. The method executed by the first network card or the first node.

In another embodiment of the present application, a computer program product is also provided. The computer program product includes computer-executable instructions, and the computer-executable instructions are stored in a computer-readable storage medium. When the processor of the device executes the computer execution instruction, the device executes the method executed by the first network card or the first node in the above method embodiment.

Those skilled in the art can clearly understand that for the convenience and simplicity of description, the specific working processes of the systems, devices and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be described again here.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices and methods can be implemented in other ways. For example, the device embodiments described above are only illustrative. For example, the division of the units is only a logical function division. In actual implementation, there may be other division methods. For example, multiple units or components may be combined or can be integrated into another system, or some features can be ignored, or not implemented. On the other hand, the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit. The above integrated units can be implemented in the form of hardware or software functional units.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application is essentially or contributes to the existing technology, or all or part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium , including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in various embodiments of this application. The aforementioned storage media include: U disk, mobile hard disk, read-only memory (ROM, read-only memory), random access memory (RAM, random access memory), magnetic disk or optical disk and other media that can store program code. .

Claims

A computing system includes a first node and a second node, the first node is connected to a first network card, and the first node is connected to the second node through the first network card;

The first node includes a plurality of first processor cores, each first processor core is used to perform a computing task, generate first data according to the computing task, and send the first data to the first network card;

The first network card is configured to perform an aggregation operation on a plurality of first data sent by the plurality of first processor cores, and send the data generated after performing the aggregation operation to the second node through the first network card. .
The computing system according to claim 1, wherein the first node includes a plurality of core groups, the first core group is any one of the plurality of core groups, and the first core group includes A first processor core and at least one second processor core. The computing tasks performed by the first processor core in the first core group include processing the at least one second processor in the first core group. The data sent by the processor core and the data in the first processor core are aggregated into the first data.
The computing system of claim 2, wherein the first processor core in the first core group packages the first data into a first message and sends it to the first network card, and the The first message includes a mark indicating that the first network card performs an aggregation operation on the first data;

The first network card is used to aggregate the first data in the first packet with the mark.
The computing system according to any one of claims 1 to 3, wherein the first network card is further configured to provide the plurality of first processor cores in the first network card.
The computing system of claim 4, wherein the first network card is further configured to receive a second message sent by the second node, and the second message includes an instruction indicating that the first network card is The second message is marked for broadcast;

The first network card sends the second message to the plurality of first processor cores according to the mark of the second message and the marks of the plurality of first processor cores set in the first network card. Processor core.
The computing system of claim 5, wherein the first processor core in the first core group is configured to send the second message to the at least one processor core in the first core group. in the second processor core.
The computing system according to any one of claims 2 to 6, the first processor core and the at least one second processor core included in the first core group belong to a central processing unit (CPU), or Access NUMA units in a non-uniform memory.
A data transmission method, characterized by including:

The first network card receives the first data sent by each of the plurality of first processor cores in the first node;

The first network card performs an aggregation operation on the plurality of first data;

The first network card sends the data after performing the aggregation operation to the second node.
The method according to claim 8, wherein the first network card receiving the first data sent by each processor core in a plurality of first processor cores in the first node includes:

The first network card receives a first message including the first data sent by the first processor core, and the first message includes instructing the first network card to perform an aggregation operation on the first data. mark;

The first network card performing an aggregation operation on the plurality of first data includes:

The first network card aggregates the first data in the first packet with the mark.
The method according to any one of claims 8 to 9, characterized in that, before the first network card receives the first data sent by each of the plurality of first processor cores in the first node, The method also includes:

The first network card is configured with the plurality of first processor cores.
The method of claim 10, further comprising:

The first network card receives a second message sent by the second node, and the second message includes an instruction indicating that the first network A flag for the card to broadcast the second message;

The first network card sends the second message to the plurality of first processor cores according to the mark of the second message and the marks of the plurality of first processor cores set in the first network card. Processor core.
The method according to any one of claims 8 to 11, characterized in that the first node includes a plurality of core groups, the core group includes a first processor core and at least one second processor core, wherein The first core group is any one of the plurality of core groups, and the method further includes: the first processor core in the first core group receives the at least one processor core in the first core group. Data sent by the second processor core;

The first processor core in the first core group aggregates data sent by at least one second processor core in the first core group and data in the first processor core into first data.
The method of claim 12, further comprising:

The first processor core in the first core group receives the second message sent by the first network card;

The first processor core in the first core group sends the second message to the at least one second processor core in the first core group.
A network card, characterized in that it includes a processor, the processor is coupled to a memory, the memory is used to store instructions, and when the instructions are executed by the processor, the network card executes claims 8 to The method executed by the first network card in any of 13.
A chip includes an interface and a processing unit, the interface is used to send and receive data, and the processing unit is used to perform the functions performed by the first network card in any one of claims 8 to 13.