WO2023179741A1 - Computation system and data transmission method - Google Patents

Computation system and data transmission method

Info

Publication number
WO2023179741A1
Authority
WO
WIPO (PCT)
Prior art keywords
network card
core
processor
data
node
Prior art date
Application number
PCT/CN2023/083532
Other languages
French (fr)
Chinese (zh)
Inventor
勾文进 (Gou Wenjin)
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd.
Publication of WO2023179741A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/16: Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F 15/163: Interprocessor communication
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 15/00: Digital computers in general; Data processing equipment in general
    • G06F 15/16: Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs

Definitions

  • Embodiments of the present application relate to the field of computers, and in particular, to a computing system and a data transmission method.
  • with the development of high-performance computing (HPC) cluster technology, the message passing interface (MPI), a cross-language communication protocol, accounts for an increasing proportion of HPC communication.
  • collective communication is an integral part of MPI communication, and optimizing collective communication delay is an important direction of HPC communication optimization.
  • collective communication is divided into inter-node communication and intra-node communication, and intra-node communication and inter-node communication are executed serially.
  • in the current collective communication process, when two nodes communicate, one processor core in the sending node must first aggregate the data of all processor cores in the sending node, and only then is the aggregated data sent to the receiving node through the network card.
  • This application provides a computing system and a data transmission method for reducing the communication delay of the computing system.
  • a first aspect of this application provides a computing system.
  • the computing system includes a first node and a second node.
  • the first node is connected to a first network card, and the first node is connected to the second node through the first network card.
  • the first node includes a plurality of first processor cores.
  • the first processor core is the processor core responsible for communicating with the network card in each processor core group.
  • the first processor core may also be called a leader processor core or a root processor core in this application.
  • Each first processor core is used to perform computing tasks, generate first data according to the computing tasks, and send the first data to the first network card.
  • the first data includes data generated by collective communication of the first processor core within its core group, such as All-Reduce, All-to-All, and All-Gather.
  • the first network card is used to identify the data sent by the multiple first processor cores, perform an aggregation operation on the multiple first data sent by the multiple first processor cores, and send the data generated by the aggregation operation to the second node.
  • the first network card in the computing system provided by this application can perform aggregation operations on the data sent by the processor cores. Compared with a scheme in which one of the first processor cores first aggregates the data sent by the multiple first processor cores in the first node and then sends the result to the first network card, having the first network card perform the aggregation reduces communication among the multiple first processor cores inside the first node, which reduces the time consumed by the intra-node aggregation operation and reduces the load on the processor cores in the node.
  • the first node includes multiple core groups, the first core group is any one of the multiple core groups, and the first core group includes one first processor core and at least one second processor core.
  • the computing task performed by the first processor core in the first core group includes aggregating the data sent by the at least one second processor core in the first core group and the data in the first processor core into first data.
  • the multiple first processor cores of the first node aggregate the data sent by the other processor cores in their own core groups, which avoids communication between first processor cores of different core groups and reduces the cross-core-group communication delay among the multiple first processor cores within the first node, thereby further reducing the communication delay of the computing system.
  • after receiving the data sent by the other processor cores in the same core group, the first processor core in the first core group performs an aggregation operation with its own data to obtain the first data, packages the first data into a first message, and sends it to the first network card, where the first message includes a mark indicating that the first network card should perform an aggregation operation on the first data.
  • the first network card is used to aggregate the first data in the first packet with the mark.
  • the first network card receives the message sent by multiple processor cores in the first node, identifies the mark in the message, and performs an aggregation operation on the first data in the first message with the mark.
  • the tag in the first message includes the target field in the header of the first message, such as the "Coll Tag" field.
  • the first network card can identify the packets that need to be aggregated according to the tag in the first packet and performs the aggregation operation on the tagged packets, so it can accurately identify the packets that need to be aggregated, which improves the realizability of the solution.
  • the first network card is also used to set the multiple first processor cores in the first network card.
  • the first network card presets the multiple first processor cores in the first node that need to interact with the first network card, that is, the leader processor cores are set in the first network card.
  • the first network card can preset the multiple first processor cores in the first node that communicate with the first network card, so that the first network card can identify the first processor cores during the collective communication process, which further improves the realizability of the solution.
  • the first network card is also configured to receive a second message sent by the second node, the second node being a node external to the first node, where the second message includes a mark instructing the first network card to broadcast the second message.
  • the first network card sends the second message to the plurality of first processor cores according to the tag of the second message and the tags of the multiple first processor cores set in the first network card.
  • the first network card matches the mark of the second message against the marks of the multiple first processor cores set in the first network card, and performs a broadcast operation toward the first processor cores matching the mark of the second message.
  • the first network card can perform a broadcast operation on the second message sent by the second node. Compared with a scheme in which the first network card forwards the second message to the first node and a first processor core in the first node then broadcasts the second message to the processor cores of the other core groups, this avoids broadcasting the second message across core groups in the first node, which reduces the delay in broadcasting the second message and further reduces the communication delay of the computing system.
  • the first processor core in the first core group is used to send the second message to the at least one second processor core in the first core group; that is, the multiple first processor cores broadcast the second message within their respective core groups.
  • the multiple first processor cores in the first node distribute the second message to the other processor cores in the core groups where they are located, which avoids the first processor cores distributing the second message across core groups and thereby reduces the communication delay of the computing system.
  • the first processor core and the at least one second processor core included in the first core group belong to the same central processing unit (CPU) or to the same non-uniform memory access (NUMA) unit; that is, the processor cores in the first node are divided into core groups according to hardware affiliation.
  • the processor cores belonging to the same CPU can be a core group, or the processor cores belonging to the same NUMA unit can be a core group.
  • a first processor core is set in each CPU of the first node, or a first processor core is set in each NUMA unit.
  • the first processor core interacts directly with the first network card, which reduces cross-CPU or cross-NUMA-unit data communication within the first node and thereby reduces the communication delay of the computing system.
  • the second aspect of this application provides a data transmission method, which can be executed by a computing system or by components of the computing system, such as the first network card in the computing system or the processor, chip, or chip system of the first node; it can also be implemented by logic modules or software that can realize all or part of the functions of the computing system.
  • the data transmission method provided in the second aspect includes: the first network card receives the first data sent by each of the multiple first processor cores in the first node, where the multiple first processor cores are the processor cores preset by the first network card as responsible for communicating with the first network card.
  • each first processor core is used to perform a computing task, generate first data according to the computing task, and send the first data to the first network card.
  • the first data includes data generated by collective communication of the first processor core within its core group, such as All-Reduce, All-to-All, and All-Gather.
  • the first network card performs an aggregation operation on the plurality of first data.
  • the first network card sends the data after performing the aggregation operation to the second node.
  • the first network card can perform aggregation operations on the data sent by the multiple first processor cores of the first node. Compared with a scheme in which one of the first processor cores first aggregates the data sent by the multiple first processor cores in the first node and then sends the result to the first network card, this reduces the communication delay among the multiple first processor cores inside the first node, thereby reducing the communication delay of the computing system.
  • when the first network card receives the first data sent by each of the multiple first processor cores in the first node, the first network card receives from each first processor core a first message including the first data, where the first message includes a mark indicating that the first network card should perform an aggregation operation on the first data.
  • the first network card aggregates the first data in the first message with the mark.
  • the first network card receives the first messages including the first data sent by the multiple first processor cores and identifies the messages that need to be aggregated according to the mark in each first message. The aggregation operation is performed on the tagged messages, so the first messages that need to be aggregated can be accurately identified, which improves the realizability of the solution.
  • before the first network card receives the first data sent by each of the multiple first processor cores in the first node, the first network card sets the multiple first processor cores. Specifically, when the computing system creates a communication domain, during the resource initialization process the first node performs an All-Reduce collective communication operation with the network card based on the multiple first processor cores, so that the first network card can record the multiple first processor cores of the communication domain corresponding to the first network card.
  • the first network card can preset the multiple first processor cores in the first node that communicate with the first network card, so that the first network card can identify the first processor cores during the collective communication process, which further improves the realizability of the solution.
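  • purely as an illustration, a registration step of this kind might look as follows from the host side; the nic_register_leaders() call, the descriptor layout, and the core numbers are all hypothetical sketches under the assumptions above, not part of this application or of any real NIC interface:

```c
#include <stdint.h>
#include <stdio.h>

/* descriptor for one leader (first) processor core; illustrative only */
struct leader_desc {
    uint32_t comm_id;   /* communication domain the leader belongs to */
    uint32_t core_id;   /* processor core index within the node       */
};

/* stand-in for a hypothetical NIC driver call that records the leader
 * cores in the NIC so it can later match their tagged messages; a real
 * NIC would expose this through its own verbs/ioctl interface */
static int nic_register_leaders(const struct leader_desc *l, int n)
{
    for (int i = 0; i < n; i++)
        printf("NIC: comm %u expects leader core %u\n",
               l[i].comm_id, l[i].core_id);
    return 0;
}

int main(void)
{
    /* e.g. one leader core per NUMA unit, as in Figure 3b */
    struct leader_desc leaders[] = { {1, 0}, {1, 24}, {1, 48}, {1, 72} };
    return nic_register_leaders(leaders, 4);
}
```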
  • the first network card receives a second message sent by the second node, and the second message includes a mark instructing the first network card to broadcast the second message.
  • the first network card sends the second message to the plurality of first processor cores according to the tag of the second message and the tags of the multiple first processor cores set in the first network card.
  • the first node includes multiple core groups.
  • the core group includes a first processor core and at least one second processor core, where the first core group is any one of the multiple core groups.
  • the first processor core in the first core group receives data sent by at least one second processor core in the first core group.
  • the first processor core in the first core group aggregates data sent by at least one second processor core in the first core group and data in the first processor core into first data.
  • when the first network card receives the first data sent by each of the multiple first processor cores in the first node, the first network card receives, from the first processor core in the first core group, a first message including the first data.
  • the first processor core in the first core group receives the second message sent by the first network card.
  • the first processor core in the first core group sends the second message to at least one second processor core in the first core group.
  • the third aspect of the present application provides a data transmission method, which is executed by a network card.
  • the method includes the steps executed by the first network card in any method provided in the second aspect.
  • the fourth aspect of this application provides a network card, including a transceiver unit and a processing unit.
  • the transceiver unit and the processing unit are used to implement the steps performed by the first network card in any method provided in the third aspect.
  • the fifth aspect of this application provides a computing device, including a transceiver unit and an aggregation unit executed by a first processor core, and a sending unit executed by a second processor core.
  • the transceiver unit and the aggregation unit are used to perform the functions performed by the first processor core in any method provided by the second aspect, and the sending unit is used to perform the functions performed by the second processor core in any method provided by the second aspect.
  • a sixth aspect of the present application provides a network card, including a processor.
  • the processor is coupled to a memory.
  • the memory is configured to store instructions, and when the instructions are executed by the processor, the network card performs any method provided in the third aspect.
  • a seventh aspect of the present application provides a computing device, including a plurality of core groups.
  • the first core group is any one of the plurality of core groups and includes a first processor core and at least one second processor core.
  • the computing task performed by the first processor core in the first core group includes aggregating the data sent by the at least one second processor core in the first core group and the data in the first processor core into first data, and the first processor core in each core group is used to send the first data to the network card.
  • the eighth aspect of this application provides a chip, including an interface and a processing unit.
  • the interface is used to send and receive data, and the processing unit is used to perform the functions of the first network card in the second aspect or any possible implementation of the second aspect.
  • the ninth aspect of the present application provides a computer-readable storage medium on which instructions are stored.
  • when the instructions are executed, a computer performs the method executed by the first network card in the second aspect or any possible implementation of the second aspect.
  • the tenth aspect of this application provides a computer program product.
  • the computer program product includes instructions. When the instructions are executed, a computer implements the method executed by the first network card in the second aspect or any possible implementation of the second aspect, or implements the method executed by the first node in the second aspect or any possible implementation of the second aspect.
  • Figure 1 is a schematic diagram of a communication system architecture provided by an embodiment of the present application.
  • Figure 2 is a schematic flow chart of a data transmission method provided by an embodiment of the present application.
  • Figure 3a is a schematic diagram of collective communication provided by an embodiment of the present application.
  • Figure 3b is a schematic diagram of another collective communication provided by an embodiment of the present application.
  • Figure 4 is a schematic diagram of a message format provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of a network card performing a message aggregation operation provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of another data transmission method according to an embodiment of the present application.
  • Figure 7 is a schematic structural diagram of a network card provided by an embodiment of the present application.
  • Figure 8 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of another network card provided by an embodiment of the present application.
  • Embodiments of the present application provide a computing system and a data transmission method for reducing the communication delay of the computing system.
  • the message passing interface (MPI) is a standardized message-passing specification that can run on a variety of parallel computing systems. The standard defines the core syntax and semantics of the communication library, and users can write message-passing programs in a variety of programming languages, such as C, C++, and Fortran.
  • the collective communication system provided by the embodiment of the present application includes multiple computing devices, multiple network cards, and a switch 103.
  • the plurality of computing devices includes a first node 101 and a second node 104.
  • the plurality of network cards includes a first network card 102 and a second network card 105.
  • the first node 101 includes multiple processor cores, and the multiple processor cores can be divided into multiple processor core groups according to hardware affiliation.
  • core groups can be divided by central processing unit (CPU); that is, the processor cores belonging to the same CPU form one processor core group.
  • core groups can also be divided by non-uniform memory access (NUMA) unit; that is, the processor cores belonging to the same NUMA unit form one processor core group.
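  • as a concrete sketch of such a division, the following program uses Linux's libnuma (a real library, linked with -lnuma) to group cores by NUMA node; choosing the lowest-numbered core of each node as that group's leader is an illustrative policy assumed here, not something this application prescribes:

```c
#include <numa.h>    /* Linux libnuma: compile with -lnuma */
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not supported on this system\n");
        return 1;
    }
    int ncpus  = numa_num_configured_cpus();
    int nnodes = numa_num_configured_nodes();

    /* one core group per NUMA node; the lowest-numbered core in each
     * node is chosen as that group's leader (first processor core) */
    for (int node = 0; node < nnodes; node++) {
        int leader = -1, members = 0;
        for (int cpu = 0; cpu < ncpus; cpu++) {
            if (numa_node_of_cpu(cpu) == node) {
                if (leader < 0)
                    leader = cpu;
                members++;
            }
        }
        printf("core group %d (NUMA node %d): %d cores, leader core %d\n",
               node, node, members, leader);
    }
    return 0;
}
```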
  • data communication can be carried out between multiple computing devices; for example, between the first node 101 and the second node 104.
  • data communication can also be carried out between processor cores within a single node device; for example, the multiple processor cores in the first node 101 can communicate with one another.
  • such communication includes data communication between processor cores in the same core group and cross-core-group data communication between processor cores in different core groups.
  • the processor core in the first node 101 executes a computing task to generate data to be sent, packages the data to be sent, and sends it to the first network card 102.
  • the first network card 102 sends the data to be sent to the switch 103, and the switch 103 forwards it to the second node 104.
  • the first network card 102 can identify, aggregate, and forward the messages sent by the multiple processor cores within the first node 101, thereby reducing cross-core-group transmission of messages within the first node 101.
  • the first network card 102 includes a processing module 1021.
  • the processing module 1021 is specifically used to set the first processor cores in the first node 101.
  • each set first processor core is the processor core responsible for communicating with the network card in its processor core group; the first processor core may also be called a leader processor core or a root processor core in this application.
  • the processing module 1021 is also used to identify and aggregate the messages sent by the set first processor core.
  • the processing module 1021 can be implemented in hardware forms such as an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a micro-processing unit (MPU); this is not specifically limited.
  • the first network card 102 can also be integrated into the first node 101; for example, the first network card 102 can be an on-board network card in the first node 101.
  • the communication scenarios between the processor core in the first node 101 and the first network card 102 include a many-to-one fan-in scenario (Fan-in) and a one-to-many fan-out (Fan-out) scenario.
  • the many-to-one fan-in scenario refers to a process in which data sent by multiple processor cores in the first node 101 is aggregated by the first network card 102 and then sent by the first network card 102 to the second node 104 .
  • the one-to-many fan-out scenario refers to the process in which the first network card 102 receives the data sent by the second node 104 and then distributes the data to multiple processor cores in the first node 101 .
  • the data transmission method provided in the embodiment of the present application includes a fan-in scenario and a data fan-out scenario.
  • the following takes the data fan-in scenario as an example and introduces the data transmission method provided in the embodiment of the present application with reference to the accompanying drawings:
  • Figure 2 is a schematic flowchart of a data transmission method in a fan-in scenario provided by an embodiment of this application. The method includes the following steps:
  • Multiple first processor cores in the first node 101 execute computing tasks to generate first data.
  • the first node 101 includes multiple core groups, where the first core group 1011 is any one of the multiple core groups in the first node 101, and the first core group 1011 includes a first processor core 10111 and at least one second processor core 10112, the first processor core 10111 being the preset leader processor core in the first core group; the leader processor core is responsible for interacting with the first network card.
  • each of the plurality of core groups in the first node 101 is provided with a first processor core, and the first processor core interacts with the first network card.
  • multiple first processor cores in the first node 101 execute computing tasks to generate first data. Specifically, taking the first processor core 10111 in the first core group 1011 as an example, the first processor core 10111 aggregates the data sent by the at least one second processor core 10112 in the first core group 1011 and the data in the first processor core 10111 into first data; the first data is the data generated by the first processor core performing its computing task.
  • the first processor core of each of the multiple core groups in the first node 101 can perform a computing task to generate first data; that is, each of the multiple first processor cores in the first node 101 performs a computing task to generate first data.
  • in another implementation, the processor cores in the first node 101 are not divided into core groups; the first node 101 only includes first processor cores.
  • each first processor core sends the first data generated by executing its own computing task, and the network card performs the aggregation.
  • the first node 101 performs the tasks of a high-performance computing (HPC) application, such as a weather forecasting application.
  • the first node 101 starts multiple processes, and the processor cores where the multiple processes are located need to perform collective communication.
  • collective communication includes Reduce, Broadcast, and All-Reduce.
  • in All-Reduce communication, a reduction operation is performed on the data in the input buffer of each processor core in the communication domain, and the result of the reduction operation is returned to the output buffer of each processor core; that result is the first data generated by performing the computing task.
  • the above-mentioned reduction operations include summation, maximum value or minimum value, etc.
  • communication parameters include the starting address of the input buffer sendbuf, the starting address of the output buffer recvbuf, the data count count, and the data type datatype.
  • all processor cores in the first node have the same communication parameters, so the processes in each processor core provide input buffers and output buffers with the same length and the same element type.
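  • for reference, a minimal standard MPI All-Reduce call with these parameters looks as follows; MPI_Allreduce is the real MPI API, but how a particular implementation maps it onto the leader-core and network-card path described here is outside this excerpt:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* every process supplies input and output buffers with the same
     * count and datatype, as the text requires */
    int sendbuf[4] = {rank, rank, rank, rank};  /* input buffer  */
    int recvbuf[4];                             /* output buffer */

    /* element-wise sum across the communication domain; the result is
     * returned to every process's output buffer */
    MPI_Allreduce(sendbuf, recvbuf, 4, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0)  /* every element holds the sum of ranks 0..size-1 */
        printf("recvbuf[0] = %d\n", recvbuf[0]);

    MPI_Finalize();
    return 0;
}
```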
  • the first core group 1011 may be a central processing unit (CPU) or a non-uniform memory access (NUMA) unit. The steps in which the multiple first processor cores of the first node execute computing tasks to generate the first data under these two core-group division methods are described below with reference to the accompanying drawings.
  • FIG. 3a is a schematic diagram of collective communication provided by an embodiment of this application.
  • the processor cores in the first node 101 are divided into multiple core groups according to CPU affiliation; it is assumed that CPU1 is the first core group 1011 and CPU2 is the second core group 1012.
  • the processor cores of the first core group 1011 include processor core 0 to processor core 47 in CPU1, where processor core 0 is the first processor core in the first core group 1011.
  • the processor cores of the second core group 1012 include the processor core 48 to the processor core 95 in the CPU 2 , where the processor core 48 is the first processor core in the second core group.
  • processor cores 1 to 47 send data to processor core 0.
  • processor core 0 performs an aggregation operation on the data sent by processor cores 1 to 47 and the data of processor core 0 to obtain the first data.
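  • a plain-C sketch of this intra-group step, assuming summation as the reduction and using in-process arrays to stand in for whatever intra-node transport actually carries the data from cores 1 to 47 to core 0:

```c
#include <stdio.h>

#define GROUP_CORES 48   /* cores 0..47 of CPU1 in Figure 3a */
#define COUNT       4    /* elements per core's buffer        */

/* buffers the group members have delivered to the leader; in a real
 * node these would arrive via shared memory or an intra-node channel */
static int group_data[GROUP_CORES][COUNT];

/* leader core 0 aggregates the other cores' data with its own,
 * producing the first data that is later sent to the first network card */
static void leader_aggregate(int first_data[COUNT])
{
    for (int e = 0; e < COUNT; e++) {
        int acc = group_data[0][e];           /* leader's own data     */
        for (int c = 1; c < GROUP_CORES; c++)
            acc += group_data[c][e];          /* data from cores 1..47 */
        first_data[e] = acc;
    }
}

int main(void)
{
    for (int c = 0; c < GROUP_CORES; c++)
        for (int e = 0; e < COUNT; e++)
            group_data[c][e] = c;             /* sample payload */

    int first_data[COUNT];
    leader_aggregate(first_data);
    printf("first_data[0] = %d\n", first_data[0]);  /* 0+1+...+47 = 1128 */
    return 0;
}
```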
  • FIG. 3b is a schematic diagram of another collective communication provided by an embodiment of this application.
  • the processor cores in the first node 101 are divided into multiple core groups according to the NUMA ownership range.
  • NUMA1 is the first core group 1011
  • NUMA2 is the second core group 1012
  • NUMA3 is the third core group 1013
  • NUMA4 is the fourth core group 1014.
  • the processing cores of the first core group 1011 include processor core 0 to processor core 23 in NUMA1, where processor core 0 is the first processor core in the first core group 1011 .
  • the processor cores of the second core group 1012 include processor cores 24 to 47 in NUMA2, where the processor core 24 is the first processor core in the second core group 1012 .
  • the processor cores of the third core group 1013 include processor core 48 to processor core 71 in NUMA3, where processor core 48 is the first processor core in the third core group.
  • the processor cores of the fourth core group 1014 include processor cores 72 to 95 in NUMA4, where the processor core 72 is the first processor core in the fourth core group 1014 .
  • processor core 1 to processor core 23 send data to processor core 0.
  • processor core 0 performs an aggregation operation on the data sent by processor cores 1 to 23 and the data of processor core 0 to obtain the first data.
  • Each first processor core among the plurality of first processor cores in the first node 101 sends the first data to the first network card 102 .
  • Each first processor core among the plurality of first processor cores in the first node 101 sends the first data to the first network card 102. Specifically, after each first processor core of the multiple first processor cores in the first node 101 packages the first data into a first message, it sends the first message to the first network card 102, where the first message includes a mark indicating that the first network card 102 should perform an aggregation operation on the multiple first data.
  • in addition, the first network card 102 will also receive regular messages sent by the first node 101.
  • the first network card 102 identifies, based on the above mark, the first messages that need to be aggregated, performs the aggregation operation on them, and directly forwards the other regular messages.
  • Figure 4 is a schematic diagram of a message format of a first message provided by an embodiment of the present application.
  • the MPI+ header of the first message includes multiple fields, among which the "Coll tag” field or the "Coll tag high 32" field is the tag of the first message.
  • the first network card can identify the first packet that needs to be aggregated based on the "Coll tag” field or the "Coll tag high 32" field.
  • the MPI+ header of the first message also includes other fields, such as communication domain identifier Comm ID, operation type identifier Opt code, and data type identifier data type.
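  • since the excerpt names these fields but not their widths or order, the following struct is an assumed, illustrative layout of such a header, together with the kind of tag check the network card could apply; none of it is the patent's actual format:

```c
#include <stdint.h>
#include <stdio.h>

/* illustrative layout of the MPI+ header fields named above; field
 * widths and ordering are assumptions, not the patented format */
struct mpi_plus_hdr {
    uint32_t comm_id;        /* communication domain identifier (Comm ID) */
    uint32_t coll_tag;       /* "Coll tag": marks first messages the
                                first network card must aggregate         */
    uint32_t coll_tag_high;  /* "Coll tag high 32": upper 32 tag bits     */
    uint16_t opt_code;       /* operation type identifier (Opt code)      */
    uint16_t data_type;      /* data type identifier (data type)          */
};

/* the NIC could classify incoming messages with a check like this;
 * treating a nonzero tag as "needs aggregation" is an assumption */
static int needs_aggregation(const struct mpi_plus_hdr *h)
{
    return h->coll_tag != 0;
}

int main(void)
{
    struct mpi_plus_hdr h = { .comm_id = 1, .coll_tag = 7,
                              .opt_code = 0, .data_type = 0 };
    printf("aggregate? %s\n", needs_aggregation(&h) ? "yes" : "no");
    return 0;
}
```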
  • the processor core 0 in CPU1 of the first node 101 packages the generated first data into a first message and then sends it to the first network card 102; the first network card 102 identifies the first message as a message that needs to be aggregated according to the tag in the first message.
  • the processor core 0 in CPU2 of the first node 101 also packages the generated data into a first message and sends it to the first network card 102.
  • the first network card 102 receives the first messages sent by the leader cores of the multiple CPUs.
  • in the NUMA-based division, the leader cores of the multiple NUMA units in the first node 101 send first messages to the first network card 102: processor core 0 in the NUMA1 unit, processor core 24 in the NUMA2 unit, processor core 48 in the NUMA3 unit, and processor core 72 in the NUMA4 unit each send a first message to the first network card 102.
  • the first network card 102 receives the first messages sent from the leader cores of multiple NUMA units, and identifies the messages that need to be aggregated according to the tags of the multiple first messages.
  • the first network card 102 performs an aggregation operation on the first data sent by the plurality of first processor cores.
  • the first network card 102 performs an aggregation operation on the first data sent by the multiple first processor cores. Specifically, after receiving the multiple messages sent by the first node, the first network card 102 identifies the first messages according to the mark in each message header and performs an aggregation operation on the first data in the tagged first messages.
  • Figure 5 is a schematic diagram of a network card performing an aggregation operation on packets according to an embodiment of the present application.
  • after the network interface controller (NIC) 1022 of the first network card 102 receives the first messages sent by the leader processor cores of the different core groups of the first node 101, the processing module 1021 inside the first network card 102 identifies the mark in each first message, performs an aggregation operation on the first data in the tagged first messages, generates an aggregated message, and sends the aggregated message to the network interface controller 1022 inside the first network card 102.
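  • a sketch of this fan-in logic under simplifying assumptions (fixed element count, integer summation, a single outstanding collective, and a known number of leader cores): tagged first messages are accumulated until one message per leader has arrived, then a single aggregated message is emitted:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define COUNT     4   /* payload elements per first message (assumed)  */
#define N_LEADERS 4   /* expected leader cores, e.g. one per NUMA unit */

struct first_msg {
    uint32_t coll_tag;    /* nonzero => must be aggregated (assumption) */
    int32_t  data[COUNT];
};

static int32_t acc[COUNT];   /* running element-wise sum        */
static int     arrived;      /* tagged messages received so far */

/* returns 1 when the aggregated message in *out is ready to be handed
 * to the network interface controller 1022 for transmission */
static int processing_module_rx(const struct first_msg *m,
                                struct first_msg *out)
{
    if (m->coll_tag == 0)
        return 0;   /* regular message: would be forwarded directly */

    for (int e = 0; e < COUNT; e++)
        acc[e] += m->data[e];            /* element-wise reduction */

    if (++arrived < N_LEADERS)
        return 0;   /* still waiting for the other leader cores */

    out->coll_tag = m->coll_tag;
    memcpy(out->data, acc, sizeof acc);
    memset(acc, 0, sizeof acc);          /* reset for the next round */
    arrived = 0;
    return 1;
}

int main(void)
{
    struct first_msg out;
    for (int i = 0; i < N_LEADERS; i++) {
        struct first_msg m = { .coll_tag = 7, .data = {i, i, i, i} };
        if (processing_module_rx(&m, &out))
            printf("aggregated data[0] = %d\n", out.data[0]); /* 0+1+2+3 */
    }
    return 0;
}
```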
  • the first network card 102 sends the data generated after the aggregation operation to the second node.
  • the first network card 102 sends the data generated after the aggregation operation to the second network card 105 .
  • the network interface controller 1022 of the first network card 102 sends the packet generated by the aggregation operation to the switch, and the switch forwards it to the second network card 105 connected to the second node 104 .
  • in this embodiment, the first network card can perform an aggregation operation on the data sent by the first processor cores. Compared with a scheme in which one of the first processor cores aggregates the data sent by the multiple first processor cores in the first node and then sends the result to the first network card, having the first network card perform the aggregation operation reduces the communication delay among the multiple first processor cores within the first node, which in turn reduces the communication delay of the computing system.
  • Steps 201 to 204 of the above embodiment describe in detail the data transmission method of the computing system in the fan-in scenario.
  • the data transmission method of the computing system in the fan-out scenario is similar to the above method.
  • the data transmission method in the fan-out scenario is introduced below with reference to Figure 6.
  • the data transmission method in the fan-out scenario includes the following steps:
  • the first network card 102 receives the second message sent by the second node 104.
  • the first network card 102 receives the second message sent by the second node 104. Specifically, the second node 104 sends the second message to the switch 103, and the switch 103 forwards the second message to the first network card 102 according to the destination address of the second message.
  • the second message also includes a mark used to instruct the first network card 102 to broadcast the second message. This mark is the same as the mark of the first message in the above fan-in scenario, and will not be described again here.
  • the first network card 102 broadcasts the second message to multiple first processor cores in the first node 101 according to the tag of the second message.
  • the first network card 102 broadcasts the second message to the plurality of first processor cores in the first node 101 according to the tag of the second message. Specifically, the first network card 102 broadcasts the second message to the multiple first processor cores in the first node 101 according to the tag of the second message and the tags of the multiple first processor cores set in the first network card 102.
  • the first network card 102 may receive multiple messages sent by the second node 104; the first network card 102 identifies the messages sent by the second node 104 and performs a broadcast operation on the second messages that carry the above tag.
  • when the computing system creates a communication domain, the first processor cores are set in the first network card 102; that is, the leader processor cores that interact with the network card are set in the first network card 102 in advance, so that the first network card 102 can broadcast the second message to the multiple leader processor cores in the first node.
  • the first network card 102 can set the first processor cores in a variety of ways. For example, during the resource initialization process when the computing system creates a communication domain, the first node 101 performs an All-Reduce collective communication operation with the network card based on the multiple first processor cores, so that the first processor cores of the communication domain corresponding to the first network card 102 are set in the first network card 102.
  • Multiple first processor cores distribute the second message to at least one second processor core in the same core group.
  • the multiple first processor cores distribute the second message to the at least one second processor core in their same core group. For example, the first processor core in the first core group 1011 distributes the second message to the at least one second processor core in the first core group 1011.
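  • the two-stage fan-out can be sketched as follows, again with assumed constants and with plain function calls standing in for the NIC-to-leader and leader-to-member transports:

```c
#include <stdint.h>
#include <stdio.h>

#define N_GROUPS    4   /* core groups, e.g. one per NUMA unit (Fig. 3b) */
#define GROUP_CORES 24  /* cores per group (assumed)                     */

struct second_msg {
    uint32_t bcast_tag;  /* nonzero => NIC must broadcast (assumption) */
    int32_t  payload;
};

/* stage 2: a leader (first processor) core distributes the message to
 * the second processor cores inside its own core group; the intra-node
 * transport (shared memory, etc.) is omitted here */
static void leader_distribute(int group, const struct second_msg *m)
{
    printf("group %d: leader delivered payload %d to %d group members\n",
           group, m->payload, GROUP_CORES - 1);
}

/* stage 1: the first network card broadcasts a tagged second message to
 * the leader core registered for each core group */
static void nic_fan_out(const struct second_msg *m)
{
    if (m->bcast_tag == 0)
        return;                    /* regular message: forward as usual */
    for (int g = 0; g < N_GROUPS; g++)
        leader_distribute(g, m);   /* stands in for the NIC->leader path */
}

int main(void)
{
    struct second_msg m = { .bcast_tag = 7, .payload = 42 };
    nic_fan_out(&m);
    return 0;
}
```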
  • in this embodiment, the first network card 102 can perform a broadcast operation on the second message sent by the second node 104. Compared with a scheme in which the first network card 102 forwards the second message to the first node 101 and the first processor core in the first node 101 then broadcasts the second message to the processor cores of the different core groups, this avoids broadcasting the second message across core groups in the first node 101, which reduces the delay in broadcasting the second message and further reduces the communication delay of the computing system.
  • Table 1 is a table of delay reduction rates of a data transmission method provided by embodiments of the present application.
  • the "node-based delay” refers to the data sent by the first processor cores of different core groups in the first node. After one of the first processor cores is aggregated, the communication delay is forwarded to the external node through the network card.
  • the "socket-based delay" refers to the communication delay in this application when the data sent by the first processor cores of the different core groups in the first node is sent directly to the first network card, which performs the aggregation operation and then forwards the result to the external node.
  • the "socket-based" data transmission method provided by the embodiment of the present application is compared with the "node-based” data transmission method.
  • the communication time of the calculation system is Delay decreased by 26.28%.
  • Figure 7 is a schematic structural diagram of a network card provided by an embodiment of the present application.
  • the network card is used to implement various steps performed by the first network card in the above embodiments.
  • the network card 700 includes a transceiver unit 701 and a processing unit 702.
  • the transceiver unit 701 is configured to receive the first data sent by each of the multiple first processor cores in the first node.
  • the processing unit 702 is configured to perform an aggregation operation on a plurality of first data.
  • the transceiver unit 701 is also used to send the data after performing the aggregation operation to the second node.
  • the transceiver unit 701 is specifically configured to receive a first message including first data sent by the first processor core, where the first message includes a mark indicating that the first network card should perform an aggregation operation on the first data.
  • the processing unit 702 is specifically configured to aggregate the first data in the tagged first messages.
  • the processing unit 702 is further configured to set the multiple first processor cores.
  • the transceiver unit 701 is also configured to receive a second message sent by the second node, where the second message includes a mark instructing the first network card to broadcast the second message.
  • the transceiver unit 701 is also configured to send the second message to multiple first processor cores according to the tag of the second message and the tags of the multiple first processor cores set in the first network card.
  • FIG. 8 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
  • the first processor core of each core group in the computing device 800 includes a transceiver unit 801 and an aggregation unit 802, and the second processor core of each core group includes a sending unit 803.
  • the transceiver unit 801 and the aggregation unit 802 are used to execute the functions executed by the first processor core in Figures 2 and 6, and the sending unit 803 is used to execute the functions executed by the second processor core in Figures 2 and 6.
  • the transceiving unit 801 of the first processor core is used to receive data sent by the sending unit 803 of at least one second processor core.
  • the aggregation unit 802 is used to aggregate the data sent by at least one second processor core and the data in the first processor core into first data.
  • the transceiver unit 801 of the first processor core is also used to send the first data to the network card and to instruct the network card to perform an aggregation operation on the first data sent by the transceiver units 801 of the first processor cores in the multiple core groups.
  • the aggregation unit 802 is specifically configured to package the first data into a first message.
  • the transceiver unit 801 is specifically configured to send a first message to the network card, where the first message includes a mark instructing the network card to perform an aggregation operation on the first data.
  • the transceiver unit 801 is also configured to receive a second message sent by the network card.
  • the second message includes a mark instructing the network card to broadcast the second message.
  • the multiple first processor cores are processor cores preconfigured by the network card.
  • the transceiving unit 801 is configured to send the second message to at least one second processor core in the first core group.
  • each unit in the above device can be a separate processing element, or can be integrated into a chip of the device; it can also be stored in the memory in the form of a program whose function is called and executed by a processing element of the device.
  • all or part of these units can be integrated together or implemented independently.
  • the processing element described here can also be a processor, which can be an integrated circuit with signal processing capabilities.
  • each step of the above method or each unit above can be implemented by an integrated logic circuit of hardware in the processor element or implemented in the form of software calling through the processing element.
  • FIG. 9 is a schematic structural diagram of a network card provided by an embodiment of the present application.
  • the network card 900 includes: a processor 910, a memory 920 and an interface 930.
  • the processor 910, the memory 920 and the interface 930 are coupled through a bus (not labeled in the figure).
  • the memory 920 stores instructions.
  • when the processor 910 executes the instructions, the network card 900 executes the method executed by the first network card or the first node in the above method embodiments.
  • alternatively, the network card 900 may be one or more integrated circuits configured to implement the above methods, for example, one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), one or more field-programmable gate arrays (FPGAs), or a combination of at least two of these integrated circuit forms.
  • alternatively, the units in the device can be implemented in the form of a processing element scheduling a program; the processing element can be a general-purpose processor, such as a central processing unit (CPU) or another processor that can call a program.
  • these units can be integrated together and implemented in the form of a system-on-a-chip (SOC).
  • the processor 910 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof.
  • a general-purpose processor can be a microprocessor or any conventional processor.
  • Memory 920 may include read-only memory and random access memory and provides instructions and data to processor 910 .
  • Memory 920 may also include non-volatile random access memory.
  • the memory 920 may be provided with multiple partitions, each of which is used to store the private keys of different software modules.
  • Memory 920 may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • non-volatile memory can be read-only memory (ROM), programmable ROM (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory.
  • Volatile memory can be random access memory (RAM), which is used as an external cache.
  • by way of example rather than limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchlink dynamic random access memory (SLDRAM), and direct rambus RAM (DR RAM).
  • the bus may also include a power bus, a control bus, a status signal bus, etc.
  • for example, the bus can be a peripheral component interconnect express (PCIe) bus, an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), a compute express link (CXL), a cache coherent interconnect for accelerators (CCIX), or the like.
  • the bus can be divided into address bus, data bus, control bus, etc.
  • the network card 700 shown in Figure 7 and the network card 900 shown in Figure 9 in the embodiments of the present application can be the first network card in the system architecture shown in Figure 1, and the computing device 800 shown in Figure 8 can be the first node in the system architecture shown in Figure 1.
  • a computer-readable storage medium is also provided, in which computer-executable instructions are stored.
  • when the processor of a device executes the computer-executable instructions, the device executes the method executed by the first network card or the first node in the above method embodiments.
  • a computer program product includes computer-executable instructions, and the computer-executable instructions are stored in a computer-readable storage medium.
  • when the processor of a device executes the computer-executable instructions, the device executes the method executed by the first network card or the first node in the above method embodiments.
  • the disclosed systems, devices and methods can be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division; in actual implementation there may be other division methods.
  • for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the coupling or direct coupling or communication connection between each other shown or discussed may be through some interfaces, and the indirect coupling or communication connection of the devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or they may be distributed to multiple network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application can be integrated into one processing unit, each unit can exist physically alone, or two or more units can be integrated into one unit.
  • the above integrated units can be implemented in the form of hardware or software functional units.
  • the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application.
  • the aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

Disclosed in the embodiments of the present application are a computation system and a data transmission method, used for reducing a communication time delay of a computation system. The computation system in the embodiments of the present application comprises a first node and a second node, the first node being connected to a first network interface controller, and the first node being connected to the second node by means of the first network interface controller. The first node comprises a plurality of first processor cores, each first processor core being used for executing a computation task, generating first data according to the computation task, and sending the first data to the first network interface controller. The first network interface controller is used for executing an aggregation operation on the plurality of first data sent by the plurality of first processor cores, and sending the data generated after execution of the aggregation operation to the second node by means of the first network interface controller.

Description

一种计算系统以及数据传输方法Computing system and data transmission method
本申请要求于2022年3月24日提交中国专利局、申请号为202210295346.3、发明名称为“一种计算系统以及数据传输方法”的中国专利申请的优先权,所述专利申请的全部内容通过引用结合在本申请中。This application claims priority to the Chinese patent application filed with the China Patent Office on March 24, 2022, with the application number 202210295346.3 and the invention title "A computing system and data transmission method". The entire content of the patent application is incorporated by reference. incorporated in this application.
技术领域Technical field
本申请实施例涉及计算机领域,尤其涉及一种计算系统以及数据传输方法。Embodiments of the present application relate to the field of computers, and in particular, to a computing system and a data transmission method.
背景技术Background technique
随着高性能计算机群(high performance computing,HPC)技术的发展,信息传递接口(message passing interface)作为一种跨语言的通信协议,在HPC的通信中所占的比重越来越高。其中,集合通信是MPI通信的组成部分,优化集合通信时延是HPC通信优化的重要方向。With the development of high performance computing (HPC) technology, the message passing interface (message passing interface), as a cross-language communication protocol, accounts for an increasing proportion of HPC communications. Among them, collective communication is an integral part of MPI communication, and optimizing collective communication delay is an important direction for HPC communication optimization.
集合通信分为节点间通信和节点内通信,节点内通信和节点间通信串行执行。在目前的集合通信过程中,在两个节点之间进行通信时,首先需要由发送节点内的一个处理器核对发送节点内的所有处理器核的数据全部进行聚合后,再通过网卡发送至接收节点。Set communication is divided into inter-node communication and intra-node communication, and intra-node communication and inter-node communication are executed serially. In the current collective communication process, when communicating between two nodes, it is first necessary for a processor in the sending node to aggregate all the data of all processor cores in the sending node, and then send it to the receiving node through the network card. node.
由于节点内处理器核数量增加,使得集合通信中节点内的聚合操作的耗时也随之增加,另外,节点内的聚合操作也增加了处理器核的负载。As the number of processor cores in a node increases, the time-consuming of aggregation operations within nodes in collective communication also increases. In addition, aggregation operations within nodes also increase the load on the processor cores.
发明内容Contents of the invention
本申请提供了一种计算系统以及数据传输方法,用于降低计算系统的通信时延。This application provides a computing system and a data transmission method for reducing the communication delay of the computing system.
本申请第一方面提供了一种计算系统,该计算系统包括第一节点和第二节点,第一节点连接第一网卡,第一节点通过第一网卡连接至第二节点。第一节点包括多个第一处理器核,该第一处理器核为每个处理器核组中负责与网卡通信的处理器核,该第一处理器核也在本申请中可以称作领导(leader)处理器核或者根处理器核。每个第一处理器核用于执行计算任务,根据计算任务生成第一数据,并将第一数据发送至第一网卡,第一数据包括核组内第一处理器核进行集合通信产生的数据,集合通信例如ALL-Reduce、All-to-All和ALL-Gather等。第一网卡用于识别多个第一处理器和发送的数据,以及对多个第一处理器核发送的多个第一数据执行聚合操作,并将执行聚合操作后生成的数据通过第一网卡发送给第二节点。A first aspect of this application provides a computing system. The computing system includes a first node and a second node. The first node is connected to a first network card, and the first node is connected to the second node through the first network card. The first node includes a plurality of first processor cores. The first processor core is the processor core responsible for communicating with the network card in each processor core group. The first processor core may also be called a leader in this application. (leader) processor core or root processor core. Each first processor core is used to perform computing tasks, generate first data according to the computing tasks, and send the first data to the first network card. The first data includes data generated by collective communication between the first processor cores in the core group. ,Gather communication such as ALL-Reduce, All-to-All and ALL-Gather, etc. The first network card is used to identify multiple first processors and sent data, perform an aggregation operation on multiple first data sent by multiple first processor cores, and pass the data generated after performing the aggregation operation through the first network card. sent to the second node.
In the computing system provided in this application, the first network card can perform the aggregation operation on the data sent by the processor cores. Compared with a solution in which the data sent by the plurality of first processor cores in the first node is first aggregated by one of the first processor cores and then sent to the first network card, having the first network card perform the aggregation operation reduces communication among the plurality of first processor cores inside the first node, thereby reducing the time consumed by the intra-node aggregation operation and the load on the processor cores in the node.
In a possible implementation, the first node includes a plurality of core groups, a first core group is any one of the plurality of core groups, and the first core group includes one first processor core and at least one second processor core. The computing task executed by the first processor core in the first core group includes aggregating the data sent by the at least one second processor core in the first core group and the data in the first processor core into the first data.
In this application, each of the plurality of first processor cores of the first node performs the aggregation operation on the data sent by the other processor cores in its own core group, which avoids communication between first processor cores of different core groups, thereby reducing the latency of cross-core-group communication among the plurality of first processor cores in the first node and further reducing the communication latency of the computing system.
In a possible implementation, after receiving the data sent by the other processor cores in the same core group, the first processor core in the first core group performs an aggregation operation with the data in the first processor core to obtain the first data, packs the first data into a first packet, and sends the first packet to the first network card, where the first packet includes a tag instructing the first network card to perform the aggregation operation on the first data. The first network card is configured to aggregate the first data in the tagged first packets. Specifically, the first network card receives the packets sent by the plurality of processor cores in the first node, identifies the tags in the packets, and performs the aggregation operation on the first data in the tagged first packets. The tag in the first packet includes a target field in the packet header of the first packet, for example, a "Coll Tag" field.
In this application, the first network card can identify, according to the tag in a first packet, the packets on which the aggregation operation needs to be performed, and performs the aggregation operation on the tagged packets. The packets requiring aggregation can therefore be identified accurately, which improves the feasibility of the solution.
In a possible implementation, the first network card is further configured to set the plurality of first processor cores in the first network card. Specifically, the first network card presets the plurality of first processor cores in the first node that need to interact with the first network card, that is, sets the leader processor cores. The plurality of first processor cores may be set in the first network card in a plurality of manners. For example, when the computing system creates a communication domain, in the resource initialization procedure, the first node performs one All-Reduce collective communication operation with the network card based on the plurality of first processor cores, so that the first network card can collect the plurality of first processor cores of the communication domain corresponding to the first network card.
In this application, the first network card can preset the plurality of first processor cores in the first node that communicate with the first network card, so that the first network card can identify the first processor cores during collective communication, which further improves the feasibility of the solution.
In a possible implementation, the first network card is further configured to receive a second packet sent by the second node, where the second node is a node external to the first node, and the second packet includes a tag instructing the first network card to broadcast the second packet. The first network card sends the second packet to the plurality of first processor cores according to the tag of the second packet and the tags of the plurality of first processor cores set in the first network card. Specifically, the first network card matches the tags of the plurality of first processor cores against the tag of the second packet, and performs the broadcast operation toward the first processor cores that carry the tag of the second packet.
In this application, the first network card can perform the broadcast operation on the second packet sent by the second node. Compared with a solution in which the first network card forwards the second packet to the first node and a first processor core in the first node then broadcasts the second packet to the processor cores of the other core groups, this avoids broadcasting the second packet across core groups within the first node, thereby reducing the latency of broadcasting the second packet and further reducing the communication latency of the computing system.
In a possible implementation, the first processor core in the first core group is configured to send the second packet to the at least one second processor core in the first core group, that is, the plurality of first processor cores broadcast the second packet within their respective core groups.
In this application, after receiving the second packet sent by the first network card, each of the plurality of first processor cores in the first node distributes the second packet to the other processor cores in the core group where that first processor core is located. Because the first processor cores are prevented from distributing the second packet across core groups, the communication latency of the computing system is reduced.
In a possible implementation, the first processor core and the at least one second processor core included in the first core group belong to one central processing unit (CPU), or belong to one non-uniform memory access (NUMA) unit. That is, the processor cores in the first node are divided into core groups according to hardware affiliation: the processor cores belonging to the same CPU may form one core group, or the processor cores belonging to the same NUMA unit may form one core group.
In this application, one first processor core is set in each CPU of the first node, or one first processor core is set in each NUMA unit. The first processor cores interact directly with the first network card, which reduces cross-CPU or cross-NUMA-unit data communication within the first node and thus reduces the communication latency of the computing system.
A second aspect of this application provides a data transmission method. The method may be executed by a computing system, or by a component of the computing system, for example, the first network card in the computing system or a processor, chip, or chip system of the first node; it may also be implemented by a logic module or software that can realize all or part of the functions of the computing system. The data transmission method provided in the second aspect includes: the first network card receives the first data sent by each of a plurality of first processor cores in the first node, where the plurality of first processor cores are processor cores preset by the first network card and responsible for communicating with the first network card. Each of the first processor cores is configured to execute a computing task, generate the first data according to the computing task, and send the first data to the first network card, where the first data includes data generated by collective communication of the first processor core within its core group, for example, All-Reduce, All-to-All, or All-Gather. The first network card performs an aggregation operation on the plurality of pieces of first data, and sends the data obtained by the aggregation operation to the second node.
In this application, the first network card can perform the aggregation operation on the data sent by the plurality of first processor cores of the first node. Compared with a solution in which the data sent by the plurality of first processor cores in the first node is first aggregated by one of the first processor cores and then sent to the first network card, this reduces the communication latency among the plurality of first processor cores inside the first node, thereby reducing the communication latency of the computing system.
In a possible implementation, in the process in which the first network card receives the first data sent by each of the plurality of first processor cores in the first node, the first network card receives a first packet that is sent by a first processor core and includes the first data, where the first packet includes a tag instructing the first network card to perform the aggregation operation on the first data. In the process in which the first network card performs the aggregation operation on the plurality of pieces of first data, the first network card aggregates the first data in the tagged first packets.
In this embodiment of this application, the first network card receives the first packets that are sent by the plurality of first processor cores and include the first data, identifies, according to the tags in the first packets, the packets on which the aggregation operation needs to be performed, and performs the aggregation operation on the tagged packets. The first packets requiring aggregation can therefore be identified accurately, which improves the feasibility of the solution.
In a possible implementation, before the first network card receives the first data sent by each of the plurality of first processor cores in the first node, the first network card sets the plurality of first processor cores. Specifically, when the computing system creates a communication domain, in the resource initialization procedure, the first node performs one All-Reduce collective communication operation with the network card based on the plurality of first processor cores, so that the first network card can collect the plurality of first processor cores of the communication domain corresponding to the first network card.
In this application, the first network card can preset the plurality of first processor cores in the first node that communicate with the first network card, so that the first network card can identify the first processor cores during collective communication, which further improves the feasibility of the solution.
In a possible implementation, the first network card receives a second packet sent by the second node, where the second packet includes a tag instructing the first network card to broadcast the second packet. The first network card sends the second packet to the plurality of first processor cores according to the tag of the second packet and the tags of the plurality of first processor cores set in the first network card.
In a possible implementation, the first node includes a plurality of core groups, each core group includes a first processor core and at least one second processor core, and the first core group is any one of the plurality of core groups. The first processor core in the first core group receives the data sent by the at least one second processor core in the first core group, and aggregates the data sent by the at least one second processor core in the first core group and the data in the first processor core into the first data.
In a possible implementation, that the first network card receives the first data sent by each of the plurality of first processor cores in the first node includes: the first network card receives a first packet that is sent by the first processor core in the first core group and includes the first data.
In a possible implementation, the first processor core in the first core group receives a second packet sent by the first network card, and sends the second packet to the at least one second processor core in the first core group.
A third aspect of this application provides a data transmission method. The method is executed by a network card and includes the steps executed by the first network card in any one of the methods provided in the second aspect.
A fourth aspect of this application provides a network card, including a transceiver unit and a processing unit. The transceiver unit and the processing unit are configured to implement the steps executed by the first network card in any one of the methods provided in the third aspect.
A fifth aspect of this application provides a computing device, including a transceiver unit and an aggregation unit executed by a first processor core, and a sending unit executed by a second processor core. The transceiver unit and the aggregation unit are configured to perform the functions performed by the first processor core in any one of the methods provided in the second aspect, and the sending unit is configured to perform the functions performed by the second processor core in any one of the methods provided in the second aspect.
A sixth aspect of this application provides a network card, including a processor coupled to a memory. The memory is configured to store instructions, and when the instructions are executed by the processor, the network card is caused to perform any one of the methods provided in the third aspect.
A seventh aspect of this application provides a computing device, including a plurality of core groups, where a first core group is any one of the plurality of core groups, and the first core group includes one first processor core and at least one second processor core. The computing task executed by the first processor core in the first core group includes aggregating the data sent by the at least one second processor core in the first core group and the data in the first processor core into first data, and the first processor core in each core group is configured to send the first data to a network card.
An eighth aspect of this application provides a chip, including an interface and a processing unit. The interface is configured to send and receive data, and the processing unit is configured to perform the functions executed by the first network card in the second aspect or any one of the possible implementations of the second aspect.
A ninth aspect of this application provides a computer-readable storage medium storing instructions. When the instructions are executed, a computer is caused to perform the method executed by the first network card in the second aspect or any one of the possible implementations of the second aspect, or to perform the method executed by the first node in the second aspect or any one of the possible implementations of the second aspect.
A tenth aspect of this application provides a computer program product. The computer program product includes instructions, and when the instructions are executed, a computer is caused to implement the method executed by the first network card in the second aspect or any one of the possible implementations of the second aspect, or to implement the method executed by the first node in the second aspect or any one of the possible implementations of the second aspect.
It can be understood that, for the beneficial effects achievable by any one of the data transmission methods, network cards, computing devices, chips, computer-readable media, or computer program products provided above, reference may be made to the beneficial effects of the corresponding computing system; details are not repeated here.
Brief Description of Drawings
FIG. 1 is a schematic diagram of a communication system architecture according to an embodiment of this application;
FIG. 2 is a schematic flowchart of a data transmission method according to an embodiment of this application;
FIG. 3a is a schematic diagram of collective communication according to an embodiment of this application;
FIG. 3b is a schematic diagram of collective communication according to an embodiment of this application;
FIG. 4 is a schematic diagram of a packet format according to an embodiment of this application;
FIG. 5 is a schematic diagram of a network card performing a packet aggregation operation according to an embodiment of this application;
FIG. 6 is a schematic diagram of another data transmission method according to an embodiment of this application;
FIG. 7 is a schematic structural diagram of a network card according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of a computing device according to an embodiment of this application;
FIG. 9 is a schematic structural diagram of another network card according to an embodiment of this application.
Detailed Description
Embodiments of this application provide a computing system and a data transmission method, which are used to reduce the communication latency of the computing system.
The terms "first", "second", "third", "fourth", and the like (if any) in the specification, the claims, and the accompanying drawings of this application are used to distinguish between similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable in appropriate circumstances, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "include" and "have" and any variants thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to those steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
In the embodiments of this application, words such as "exemplary" or "for example" are used to represent giving an example, an illustration, or a description. Any embodiment or design scheme described as "exemplary" or "for example" in the embodiments of this application should not be construed as more preferred or advantageous than other embodiments or design schemes. Rather, the use of words such as "exemplary" or "for example" is intended to present a related concept in a specific manner.
In the following, some terms used in this application are explained to facilitate understanding by a person skilled in the art.
The message passing interface (MPI) is a standardized message passing standard that can run on a variety of parallel computing systems. The standard defines the core syntax and semantics of a communication library, and users can write message passing programs in a variety of programming languages, for example, C, C++, and Fortran.
Collective communication is an important part of MPI communication. In the collective communication mode, all processes in a communication domain participate in the communication. Models of collective communication include, for example, All-Reduce, Broadcast, and All-to-All.
The following describes the computing system and the data transmission method provided in the embodiments of this application with reference to the accompanying drawings.
Refer to FIG. 1. FIG. 1 is a schematic diagram of the system architecture of a collective communication scenario according to an embodiment of this application. As shown in FIG. 1, the collective communication system provided in this embodiment of this application includes a plurality of computing devices, a plurality of network cards, and a switch 103. The plurality of computing devices include a first node 101 and a second node 104, and the plurality of network cards include a first network card 102 and a second network card 105. Taking the first node 101 as an example, the first node 101 includes a plurality of processor cores, and the plurality of processor cores may be divided into a plurality of processor core groups according to hardware affiliation. For example, the core groups may be divided by central processing unit (CPU), that is, the processor cores belonging to the same CPU form one processor core group. The core groups may also be divided by non-uniform memory access (NUMA) unit, that is, the processor cores belonging to the same NUMA unit form one processor core group.
In the system shown in FIG. 1, data communication can be performed between the plurality of computing devices, for example, between the first node 101 and the second node 104. Data communication can also be performed between the processor cores within a single node device. For example, the plurality of processor cores within the first node 101 can communicate with one another, including data communication between processor cores in the same core group and cross-core-group data communication between processor cores in different core groups.
In the scenario of data communication between computing devices, a processor core in the first node 101 executes a computing task to generate data to be sent, packs the data, and sends it to the first network card 102. The first network card 102 sends the data to the switch 103, and the switch 103 forwards it to the second node 104.
In this embodiment of this application, the first network card 102 can identify, aggregate, and forward the packets sent by the plurality of processor cores within the first node 101, thereby reducing cross-core-group transmission of packets within the first node 101. The first network card 102 includes a processing module 1021. The processing module 1021 is specifically configured to set the first processor cores in the first node 101, where a set first processor core is the processor core in each processor core group that is responsible for communicating with the network card; in this application, the first processor core may also be referred to as a leader processor core or a root processor core. The processing module 1021 is further configured to identify and aggregate the packets sent by the set first processor cores. The processing module 1021 may be implemented in hardware, for example, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or a micro-processing unit (MPU); this is not specifically limited.
It should be noted that, in this embodiment of this application, the first network card 102 may also be integrated in the first node 101. When the first network card 102 is integrated in the first node 101, the first network card 102 may be an onboard network card of the first node 101.
The communication scenarios between the processor cores in the first node 101 and the first network card 102 include a many-to-one fan-in scenario and a one-to-many fan-out scenario. The many-to-one fan-in scenario refers to the process in which the data sent by the plurality of processor cores in the first node 101 is aggregated by the first network card 102 and then sent by the first network card 102 to the second node 104. The one-to-many fan-out scenario refers to the process in which, after the first network card 102 receives the data sent by the second node 104, the first network card 102 distributes the data to the plurality of processor cores in the first node 101.
The data transmission method provided in the embodiments of this application covers both the fan-in scenario and the fan-out scenario. The following takes the fan-in scenario as an example and describes the data transmission method provided in the embodiments of this application with reference to the accompanying drawings.
Refer to FIG. 2. FIG. 2 is a schematic flowchart of a data transmission method in a fan-in scenario according to an implementation of this application. The method includes the following steps:
201. The plurality of first processor cores in the first node 101 execute computing tasks to generate first data.
In this embodiment of this application, the first node 101 includes a plurality of core groups, where a first core group 1011 is any one of the plurality of core groups in the first node 101, and the first core group 1011 includes a first processor core 10111 and at least one second processor core 10112. The first processor core 10111 is the preset leader processor core of the first core group, and the leader processor core is responsible for interacting with the first network card. It can be understood that one first processor core is set in each of the plurality of core groups in the first node 101, and that first processor core interacts with the first network card.
The plurality of first processor cores in the first node 101 execute the computing tasks to generate the first data. Specifically, taking the first processor core 10111 in the first core group 1011 as an example, in the process of executing the computing task, the first processor core 10111 aggregates the data sent by the at least one second processor core 10112 in the first core group 1011 and the data in the first processor core 10111 into the first data; this first data is the first data generated by the first processor core by executing the computing task.
It should be understood that the first processor core of each of the plurality of core groups in the first node 101 can execute a computing task to generate first data, that is, each of the plurality of first processor cores in the first node 101 executes a computing task to generate first data.
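As a minimal sketch of step 201 (an illustration, not wording from this application), assume each core group owns an MPI communicator group_comm in which the first processor core holds rank 0; the buffer names are illustrative:

```c
#include <mpi.h>

/* Step 201 sketch: within one core group, the leader (group rank 0)
 * aggregates the data of the second processor cores into first_data. */
void generate_first_data(MPI_Comm group_comm, const double *local,
                         double *first_data, int count) {
    /* Every core contributes "local"; the element-wise sum lands only
     * in the leader's first_data buffer. */
    MPI_Reduce(local, first_data, count, MPI_DOUBLE, MPI_SUM, 0, group_comm);
}
```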
In some embodiments, the processor cores in the first node 101 are not divided into core groups, and the first node 101 includes only first processor cores. Each first processor core sends the first data obtained by executing its own computing task to the network card, and the network card performs the aggregation.
For ease of description, the following takes the case in which the processor cores in the first node 101 are divided into core groups as an example.
In an example of executing a computing task, the first node 101 executes a task of a high-performance computing (HPC) application, for example, a weather forecast application. The first node 101 starts a plurality of processes, and the processor cores where the plurality of processes are located need to perform collective communication, where the collective communication includes Reduce, Broadcast, and All-Reduce. For example, during All-Reduce communication, a reduction operation is performed on the data in the input buffer of each processor core in the communication domain, and the result of the reduction operation is returned to the output buffer of each processor core; this result is the first data of the computing task. The reduction operation includes summation, taking the maximum value, taking the minimum value, and the like.
In All-Reduce communication, the communication parameters include the start address sendbuf of the input buffer, the start address recvbuf of the output buffer, the data count count, and the data type datatype. In All-Reduce communication, all processor cores in the first node have the same communication parameters, so the process on each processor core provides an input buffer and an output buffer of the same length and the same element type.
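For reference, the standard MPI call that carries exactly these parameters is MPI_Allreduce; the buffer contents and communicator below are illustrative:

```c
#include <mpi.h>

enum { COUNT = 4 };

/* All-Reduce with the parameters named above: sendbuf (input buffer),
 * recvbuf (output buffer), count, and datatype; MPI_SUM is the reduction. */
void allreduce_example(MPI_Comm comm) {
    double sendbuf[COUNT] = {1.0, 2.0, 3.0, 4.0};
    double recvbuf[COUNT];
    MPI_Allreduce(sendbuf, recvbuf, COUNT, MPI_DOUBLE, MPI_SUM, comm);
    /* Every process in comm now holds the element-wise sum in recvbuf. */
}
```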
Because the core group division in this embodiment of this application is based on the hardware affiliation of the processor cores, the first core group 1011 may be one CPU, or may be one NUMA unit. The following describes, with reference to the accompanying drawings, the step in which the plurality of first processor cores of the first node execute the computing tasks to generate the first data under the two core group division manners.
Refer to FIG. 3a. FIG. 3a is a schematic diagram of collective communication according to an implementation of this application. As shown in FIG. 3a, the processor cores in the first node 101 are divided into a plurality of core groups according to CPU affiliation. Assume that CPU1 is the first core group 1011 and CPU2 is the second core group 1012. The processor cores of the first core group 1011 include processor core 0 to processor core 47 in CPU1, where processor core 0 is the first processor core of the first core group 1011. The processor cores of the second core group 1012 include processor core 48 to processor core 95 in CPU2, where processor core 48 is the first processor core of the second core group.
In the example shown in FIG. 3a, taking the processor cores in CPU1 as an example, in the process of executing the computing task, processor core 1 to processor core 47 send their data to processor core 0, and processor core 0 performs an aggregation operation on the data sent by processor core 1 to processor core 47 and the data of processor core 0 to obtain the first data.
Refer to FIG. 3b. FIG. 3b is a schematic diagram of another collective communication according to an implementation of this application. As shown in FIG. 3b, the processor cores in the first node 101 are divided into a plurality of core groups according to NUMA affiliation. Assume that NUMA1 is the first core group 1011, NUMA2 is the second core group 1012, NUMA3 is the third core group 1013, and NUMA4 is the fourth core group 1014. The processor cores of the first core group 1011 include processor core 0 to processor core 23 in NUMA1, where processor core 0 is the first processor core of the first core group 1011. The processor cores of the second core group 1012 include processor core 24 to processor core 47 in NUMA2, where processor core 24 is the first processor core of the second core group 1012. The processor cores of the third core group 1013 include processor core 48 to processor core 71 in NUMA3, where processor core 48 is the first processor core of the third core group. The processor cores of the fourth core group 1014 include processor core 72 to processor core 95 in NUMA4, where processor core 72 is the first processor core of the fourth core group 1014.
In the example shown in FIG. 3b, taking the processor cores in NUMA1 as an example, in the process of executing the computing task, processor core 1 to processor core 23 send their data to processor core 0, and processor core 0 performs an aggregation operation on the data sent by processor core 1 to processor core 23 and the data of processor core 0 to obtain the first data.
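In an MPI program, such hardware-aligned core groups could be obtained with MPI_Comm_split_type, as sketched below; this is an assumed setup rather than a mechanism specified by this application, and OMPI_COMM_TYPE_NUMA is an Open MPI extension, so MPI_COMM_TYPE_SHARED is used as a portable fallback:

```c
#include <mpi.h>

/* Build a per-NUMA communicator so that group rank 0 naturally plays
 * the leader (first processor core) role within each core group. */
MPI_Comm make_group_comm(MPI_Comm world) {
    MPI_Comm group_comm;
#ifdef OMPI_COMM_TYPE_NUMA   /* Open MPI extension: one group per NUMA unit */
    MPI_Comm_split_type(world, OMPI_COMM_TYPE_NUMA, 0,
                        MPI_INFO_NULL, &group_comm);
#else                        /* portable fallback: one group per node */
    MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &group_comm);
#endif
    return group_comm;
}
```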
202. Each of the plurality of first processor cores in the first node 101 sends the first data to the first network card 102.
Each of the plurality of first processor cores in the first node 101 sends the first data to the first network card 102. Specifically, each of the plurality of first processor cores in the first node 101 packs the first data into a first packet and then sends the first packet to the first network card 102, where the first packet includes a tag instructing the first network card 102 to perform the aggregation operation on the plurality of pieces of first data.
It can be understood that, in addition to the first packets on which the aggregation operation needs to be performed, the first network card 102 also receives regular packets sent by the first node 101. The first network card 102 identifies, based on the foregoing tag, the first packets on which the aggregation operation needs to be performed, performs the aggregation operation on the first packets, and directly forwards the other regular packets.
Refer to FIG. 4. FIG. 4 is a schematic diagram of the packet format of a first packet according to an embodiment of this application. As shown in FIG. 4, the MPI+ packet header of the first packet includes a plurality of fields, among which the "Coll tag" field or the "Coll tag high 32" field is the tag of the first packet. The first network card can identify, according to the "Coll tag" field or the "Coll tag high 32" field, the first packets on which the aggregation operation needs to be performed. The MPI+ packet header of the first packet further includes other fields, for example, a communication domain identifier Comm ID, an operation type identifier Opt code, and a data type identifier data type.
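Because FIG. 4 itself is not reproduced in this text, the following C struct is only an assumed rendering of the named header fields; the field widths and their order are illustrative:

```c
#include <stdint.h>

/* Illustrative layout of the MPI+ header fields named above; widths and
 * ordering are assumptions, not the wire format of this application. */
struct mpi_plus_hdr {
    uint32_t comm_id;         /* Comm ID: communication domain identifier  */
    uint16_t opt_code;        /* Opt code: collective operation type       */
    uint16_t data_type;       /* data type: element type of the payload    */
    uint32_t coll_tag;        /* Coll tag: marks packets to be aggregated  */
    uint32_t coll_tag_high32; /* Coll tag high 32: upper half of a 64b tag */
};
```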
Continue to refer to FIG. 3a. In the example shown in FIG. 3a, after processor core 0 in CPU1 of the first node 101 packs the generated first data into a first packet, it sends the first packet to the first network card 102, and the first network card 102 identifies, according to the tag in the first packet, that the first packet is a packet on which the aggregation operation needs to be performed. Similarly, the leader core in CPU2 of the first node 101, that is, processor core 48, also packs its generated data into a first packet and sends it to the first network card 102, so that the first network card 102 receives the first packets sent by the leader cores of the plurality of CPUs.
Correspondingly, in the example shown in FIG. 3b, the leader cores in the plurality of NUMA units of the first node 101 send first packets to the first network card 102: processor core 0 in the NUMA1 unit, processor core 24 in the NUMA2 unit, processor core 48 in the NUMA3 unit, and processor core 72 in the NUMA4 unit each send a first packet to the first network card 102. The first network card 102 receives the first packets sent by the leader cores of the plurality of NUMA units and identifies, according to the tags of the plurality of first packets, the packets on which the aggregation operation needs to be performed.
203. The first network card 102 performs the aggregation operation on the first data sent by the plurality of first processor cores.
The first network card 102 performs the aggregation operation on the first data sent by the plurality of first processor cores. Specifically, after receiving the plurality of packets sent by the first node, the first network card 102 identifies the first packets according to the tags in the packet headers, and performs the aggregation operation on the first data in the tagged first packets.
Refer to FIG. 5. FIG. 5 is a schematic diagram of a network card performing an aggregation operation on packets according to an embodiment of this application. As shown in FIG. 5, after the network interface controller (NIC) 1022 of the first network card 102 receives the first packets sent by the leader processor cores of the different core groups of the first node 101, the processing module 1021 inside the first network card 102 identifies the tags in the first packets, performs the aggregation operation on the first data in the tagged first packets, generates an aggregated packet, and sends the aggregated packet to the network interface controller 1022 inside the first network card 102.
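The internal logic of the processing module 1021 is not disclosed at this level of detail; as a rough software sketch, with the element type, buffer size, and completion counting all being assumptions, the aggregation could look like this:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_ELEMS 1024

/* Aggregation state for one (comm_id, coll_tag) collective. */
struct agg_slot {
    uint32_t comm_id;
    uint32_t coll_tag;
    int      arrived;        /* tagged first packets received so far     */
    int      expected;       /* number of leader cores, set at init      */
    size_t   count;          /* elements per first packet (<= MAX_ELEMS) */
    double   acc[MAX_ELEMS]; /* running element-wise sum                 */
};

/* Called for each tagged first packet; returns 1 when all leaders have
 * contributed and the aggregated payload in slot->acc can be sent out. */
int nic_aggregate(struct agg_slot *slot, const double *payload) {
    for (size_t i = 0; i < slot->count; i++)
        slot->acc[i] += payload[i];   /* sum chosen as the example op */
    return ++slot->arrived == slot->expected;
}
```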
204. The first network card 102 sends the data generated by the aggregation operation to the second node.
The first network card 102 sends the data generated by the aggregation operation to the second network card 105. Specifically, the network interface controller 1022 of the first network card 102 sends the packet generated by the aggregation operation to the switch, and the switch forwards the packet to the second network card 105 connected to the second node 104.
It can be seen from the foregoing embodiment that, in this embodiment of this application, the first network card can perform the aggregation operation on the data sent by the first processor cores. Compared with a solution in which the data sent by the plurality of first processor cores in the first node is first aggregated by one of the first processor cores and then sent to the first network card, the solution in which the first network card performs the aggregation operation reduces the communication latency among the plurality of first processor cores inside the first node, thereby reducing the communication latency of the computing system.
Steps 201 to 204 of the foregoing embodiment describe in detail the data transmission method of the computing system in the fan-in scenario. The data transmission method of the computing system in the fan-out scenario is similar. The following describes the data transmission method in the fan-out scenario with reference to FIG. 6. The data transmission method in the fan-out scenario includes the following steps:
601. The first network card 102 receives a second packet sent by the second node 104.
The first network card 102 receives the second packet sent by the second node 104. Specifically, the second node 104 sends the second packet to the switch 103, and the switch 103 forwards the second packet to the first network card 102 according to the destination address of the second packet. The second packet likewise includes a tag used to instruct the first network card 102 to broadcast the second packet; this tag is the same as the tag of the first packet in the foregoing fan-in scenario, and details are not repeated here.
602. The first network card 102 broadcasts the second packet to the plurality of first processor cores in the first node 101 according to the tag of the second packet.
The first network card 102 broadcasts the second packet to the plurality of first processor cores in the first node 101 according to the tag of the second packet. Specifically, the first network card 102 broadcasts the second packet to the plurality of first processor cores in the first node 101 according to the tag of the second packet and the tags of the plurality of first processor cores set in the first network card 102.
It should be understood that the first network card 102 receives a plurality of packets sent by the second node 104. The first network card 102 identifies the packets sent by the second node 104 and performs the broadcast operation on the second packets that carry the foregoing tag.
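Mirroring the fan-in sketch above, the NIC side of step 602 can be sketched as follows; the leader table and the deliver_to_core helper are hypothetical names introduced for illustration, not an interface disclosed by this application:

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_LEADERS 8

/* Leader cores registered for one communication domain. */
struct leader_table {
    uint32_t comm_id;
    int      n;
    int      leader_core[MAX_LEADERS];
};

/* Stub for illustration; a real NIC would enqueue the payload into a
 * per-core receive queue or DMA it to a per-core buffer. */
static void deliver_to_core(int core, const void *payload, size_t len) {
    (void)core; (void)payload; (void)len;
}

/* Step 602 sketch: replicate a tagged second packet to every leader core. */
void nic_fanout(const struct leader_table *t, const void *payload, size_t len) {
    for (int i = 0; i < t->n; i++)
        deliver_to_core(t->leader_core[i], payload, len);
}
```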
It should be noted that, when the computing system creates a communication domain, the computing system sets the first processor cores in the first network card 102, that is, the leader processor cores that interact with the network card are set in the first network card 102 in advance. In this way, when receiving a second packet on which the broadcast operation needs to be performed, the first network card 102 can broadcast the second packet to the plurality of leader processor cores in the first node.
The first processor cores may be set in the first network card 102 in a plurality of manners. For example, in the resource initialization procedure when the computing system creates a communication domain, the first node performs one All-Reduce collective communication operation with the network card based on the plurality of first processor cores, so that the first processor cores of the communication domain corresponding to the first network card 102 can be set in the first network card 102.
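How the leader table is populated during this initialization All-Reduce is likewise not spelled out; one hypothetical realization, reusing the illustrative leader_table from the previous sketch, is:

```c
#include <stdint.h>

#define MAX_LEADERS 8

struct leader_table {
    uint32_t comm_id;
    int      n;
    int      leader_core[MAX_LEADERS];
};

/* Hypothetical initialization hook: during the setup All-Reduce, every
 * leader core emits one tagged packet, and the NIC records its source
 * core so that later fan-in and fan-out address exactly these cores. */
void nic_learn_leader(struct leader_table *t, uint32_t comm_id, int src_core) {
    if (t->n == 0)
        t->comm_id = comm_id;
    for (int i = 0; i < t->n; i++)
        if (t->leader_core[i] == src_core)
            return;                      /* already registered */
    if (t->n < MAX_LEADERS)
        t->leader_core[t->n++] = src_core;
}
```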
603. The plurality of first processor cores distribute the second packet to the at least one second processor core in the same core group.
The plurality of first processor cores distribute the second packet to the at least one second processor core in the same core group. For example, in the first core group 1011 of the first node 101, the first processor core in the first core group 1011 distributes the second packet to the at least one second processor core in the first core group 1011.
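Under the same assumed per-group communicator as in the step 201 sketch, the leader's distribution in step 603 could be expressed as a broadcast rooted at group rank 0; this is illustrative only:

```c
#include <mpi.h>

/* Step 603 sketch: the leader (group rank 0) forwards the payload of the
 * second packet to the second processor cores in its own core group. */
void distribute_second_packet(MPI_Comm group_comm, void *payload, int bytes) {
    MPI_Bcast(payload, bytes, MPI_BYTE, 0, group_comm);
}
```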
It can be seen from the foregoing embodiment that, in this embodiment of this application, the first network card 102 can perform the broadcast operation on the second packet sent by the second node 104. Compared with a solution in which the first network card 102 forwards the second packet to the first node 101 and a first processor core in the first node 101 then broadcasts the second packet to the processor cores of the other core groups, this avoids broadcasting the second packet across core groups within the first node 101, thereby reducing the latency of broadcasting the second packet and further reducing the communication latency of the computing system.
Refer to Table 1. Table 1 shows the latency reduction rate of a data transmission method according to an embodiment of this application. As shown in Table 1, taking a first node with 96 processor cores as an example, the "node-based latency" refers to the communication latency when the data sent by the first processor cores of the different core groups in the first node is aggregated by one of the first processor cores inside the node and then forwarded to an external node through the network card. The "socket-based latency" refers to the communication latency in this application when the data sent by the first processor cores of the different core groups in the first node is sent directly to the first network card, aggregated by the first network card, and then forwarded to the external node.
It can be seen from Table 1 that, taking 8 bytes of transmitted data as an example, the "socket-based" data transmission method provided in this embodiment of this application reduces the communication latency of the computing system by 26.28% compared with the "node-based" data transmission method.
Table 1 Latency comparison between the node-based and socket-based data transmission methods
The foregoing describes the data transmission methods in the embodiments of this application. The following describes the related apparatuses involved in the embodiments of this application.
Refer to FIG. 7. FIG. 7 is a schematic structural diagram of a network card according to an embodiment of this application. The network card is configured to implement the steps performed by the first network card in the foregoing embodiments. As shown in FIG. 7, the network card 700 includes a transceiver unit 701 and a processing unit 702.
The transceiver unit 701 is configured to receive the first data sent by each of a plurality of first processor cores in the first node. The processing unit 702 is configured to perform an aggregation operation on the plurality of pieces of first data. The transceiver unit 701 is further configured to send the data obtained by the aggregation operation to the second node.
In a possible implementation, the transceiver unit 701 is specifically configured to receive a first packet that is sent by a first processor core and includes the first data, where the first packet includes a tag instructing the first network card to perform the aggregation operation on the first data. The processing unit 702 is specifically configured to aggregate the first data in the tagged first packets.
In a possible implementation, before the transceiver unit 701 receives the first data sent by each of the plurality of first processor cores in the first node, the processing unit 702 is further configured to set the plurality of first processor cores.
In a possible implementation, the transceiver unit 701 is further configured to receive a second packet sent by the second node, where the second packet includes a tag instructing the first network card to broadcast the second packet. The transceiver unit 701 is further configured to send the second packet to the plurality of first processor cores according to the tag of the second packet and the tags of the plurality of first processor cores set in the first network card.
Refer to FIG. 8. FIG. 8 is a schematic structural diagram of a computing device according to an embodiment of this application, for example, a functional module diagram of the first node. In the computing device 800, the first processor core of each core group includes a transceiver unit 801 and an aggregation unit 802, and the second processor cores of each core group include a sending unit 803. The transceiver unit 801 and the aggregation unit 802 are configured to perform the functions performed by the first processor core in FIG. 2 and FIG. 6, and the sending unit 803 is configured to perform the functions performed by the second processor core in FIG. 2 and FIG. 6. The transceiver unit 801 of the first processor core is configured to receive the data sent by the sending unit 803 of the at least one second processor core. The aggregation unit 802 is configured to aggregate the data sent by the at least one second processor core and the data in the first processor core into the first data. The transceiver unit 801 of the first processor core is further configured to send the first data to the network card and instruct the network card to perform the aggregation operation on the first data sent by the transceiver units 801 of the first processor cores of the plurality of core groups.
In a possible implementation, the aggregation unit 802 is specifically configured to pack the first data into a first message, and the transceiver unit 801 is specifically configured to send the first message to the network card, where the first message includes a mark instructing the network card to perform an aggregation operation on the first data.
In a possible implementation, the transceiver unit 801 is further configured to receive a second message sent by the network card, where the second message includes a mark instructing the network card to broadcast the second message, and the plurality of first processor cores are processor cores preset in the network card.
In a possible implementation, the transceiver unit 801 is configured to send the second message to the at least one second processor core in the first core group.
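For the host side, a corresponding sketch of one core group in computing device 800 is given below, under the same hypothetical message layout as the previous sketch (struct msg, FLAG_AGGREGATE, MAX_PAYLOAD). The helpers recv_from_member, alloc_msg, and send_to_nic are again assumptions introduced only for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative helpers; not interfaces defined in this application. */
extern void recv_from_member(uint32_t member, double *buf, uint32_t len);
extern struct msg *alloc_msg(uint32_t len);  /* allocates header + payload */
extern void send_to_nic(struct msg *m);

/* One leader core of a core group: the aggregation unit 802 folds in the
 * data sent by each second processor core's sending unit 803, then the
 * transceiver unit 801 packs the result into a marked first message for
 * the network card. */
void leader_core_collect(double *local, uint32_t len, uint32_t num_members)
{
    double tmp[MAX_PAYLOAD];

    for (uint32_t m = 1; m < num_members; m++) { /* member 0 is the leader */
        recv_from_member(m, tmp, len);
        for (uint32_t i = 0; i < len; i++)
            local[i] += tmp[i];                  /* intra-node reduction */
    }

    struct msg *out = alloc_msg(len);
    out->flags = FLAG_AGGREGATE;                 /* mark: aggregate on NIC */
    out->len   = len;
    memcpy(out->payload, local, len * sizeof(double));
    send_to_nic(out);                            /* hand off to the NIC */
}
```

In an MPI setting, recv_from_member would typically map to a shared-memory exchange within the node, while send_to_nic posts the marked message to the network card for the inter-node aggregation.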
It should be understood that the division of units in the above apparatus is merely a division of logical functions; in an actual implementation, all or some of the units may be integrated into one physical entity or may be physically separate. The units in the apparatus may all be implemented in the form of software invoked by a processing element, or all in the form of hardware; alternatively, some units may be implemented in the form of software invoked by a processing element and some in the form of hardware. For example, each unit may be a separately disposed processing element, or may be integrated into a chip of the apparatus; a unit may also be stored in the memory in the form of a program that is invoked by a processing element of the apparatus to execute the function of the unit. In addition, all or some of these units may be integrated together, or may be implemented independently. The processing element described here may also be referred to as a processor, and may be an integrated circuit with signal processing capability. In an implementation process, the steps of the above method or the above units may be implemented by an integrated logic circuit of hardware in a processor element, or in the form of software invoked by a processing element.
It is worth noting that, for brevity, the foregoing method embodiments are described as a series of action combinations. However, those skilled in the art should understand that this application is not limited by the described order of actions. Those skilled in the art should also understand that the embodiments described in this specification are all preferred embodiments, and the actions involved are not necessarily required by this application. Other reasonable combinations of steps that those skilled in the art can conceive of based on the foregoing description also fall within the protection scope of this application.
Refer to Figure 9, which is a schematic structural diagram of a network card according to an embodiment of this application. As shown in Figure 9, the network card 900 includes a processor 910, a memory 920, and an interface 930, which are coupled through a bus (not labeled in the figure). The memory 920 stores instructions; when the instructions in the memory 920 are executed, the network card 900 performs the method performed by the first network card or the first node in the foregoing method embodiments.
The network card 900 may be one or more integrated circuits configured to implement the above methods, for example, one or more application-specific integrated circuits (ASIC), one or more digital signal processors (DSP), one or more field programmable gate arrays (FPGA), or a combination of at least two of these integrated circuit forms. For another example, when the units in the apparatus are implemented in the form of a processing element scheduling a program, the processing element may be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of invoking a program. For another example, these units may be integrated together and implemented in the form of a system-on-a-chip (SoC).
The processor 910 may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The general-purpose processor may be a microprocessor or any conventional processor.
The memory 920 may include a read-only memory and a random access memory, and provides instructions and data to the processor 910. The memory 920 may also include a non-volatile random access memory. For example, the memory 920 may be provided with multiple partitions, each of which is used to store the private key of a different software module.
The memory 920 may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memories. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example but not limitation, many forms of RAM are available, such as a static RAM (SRAM), a dynamic RAM (DRAM), a synchronous DRAM (SDRAM), a double data rate SDRAM (DDR SDRAM), an enhanced SDRAM (ESDRAM), a synchlink DRAM (SLDRAM), and a direct rambus RAM (DR RAM).
In addition to a data bus, the bus may further include a power bus, a control bus, a status signal bus, and the like. The bus may be a peripheral component interconnect express (PCIe) bus, an extended industry standard architecture (EISA) bus, a unified bus (Ubus or UB), a compute express link (CXL), a cache coherent interconnect for accelerators (CCIX) bus, or the like. Buses may be classified into an address bus, a data bus, a control bus, and so on.
It can be understood that the network card 700 shown in Figure 7 and the network card 900 shown in Figure 9 in the embodiments of this application may be the first network card in the system architecture shown in Figure 1, and the computing device 800 shown in Figure 8 may be the first node in the system architecture shown in Figure 1.
In another embodiment of this application, a computer-readable storage medium is further provided. The computer-readable storage medium stores computer-executable instructions. When a processor of a device executes the computer-executable instructions, the device performs the method performed by the first network card or the first node in the foregoing method embodiments.
In another embodiment of this application, a computer program product is further provided. The computer program product includes computer-executable instructions, and the computer-executable instructions are stored in a computer-readable storage medium. When a processor of a device executes the computer-executable instructions, the device performs the method performed by the first network card or the first node in the foregoing method embodiments.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems, apparatuses, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For example, the division of the units is merely a division of logical functions, and there may be other divisions in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings or direct couplings or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between apparatuses or units may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of this application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Claims (15)

  1. A computing system, comprising a first node and a second node, wherein the first node is connected to a first network card, and the first node is connected to the second node through the first network card;
    the first node comprises a plurality of first processor cores, wherein each first processor core is configured to perform a computing task, generate first data according to the computing task, and send the first data to the first network card; and
    the first network card is configured to perform an aggregation operation on a plurality of first data sent by the plurality of first processor cores, and send the data generated after the aggregation operation to the second node.
  2. The computing system according to claim 1, wherein the first node comprises a plurality of core groups, a first core group is any one of the plurality of core groups, the first core group comprises one first processor core and at least one second processor core, and the computing task performed by the first processor core in the first core group comprises aggregating data sent by the at least one second processor core in the first core group with data in the first processor core into the first data.
  3. The computing system according to claim 2, wherein the first processor core in the first core group sends the first data to the first network card by packing the first data into a first message, and the first message comprises a mark instructing the first network card to perform an aggregation operation on the first data; and
    the first network card is configured to aggregate the first data in first messages carrying the mark.
  4. The computing system according to any one of claims 1 to 3, wherein the first network card is further configured to set the plurality of first processor cores in the first network card.
  5. The computing system according to claim 4, wherein the first network card is further configured to receive a second message sent by the second node, and the second message comprises a mark instructing the first network card to broadcast the second message; and
    the first network card sends the second message to the plurality of first processor cores according to the mark of the second message and the marks of the plurality of first processor cores set in the first network card.
  6. The computing system according to claim 5, wherein the first processor core in the first core group is configured to send the second message to the at least one second processor core in the first core group.
  7. The computing system according to any one of claims 2 to 6, wherein the first processor core and the at least one second processor core included in the first core group belong to one central processing unit (CPU), or belong to one non-uniform memory access (NUMA) unit.
  8. A data transmission method, comprising:
    receiving, by a first network card, first data sent by each of a plurality of first processor cores in a first node;
    performing, by the first network card, an aggregation operation on the plurality of first data; and
    sending, by the first network card, the data obtained after the aggregation operation to a second node.
  9. The method according to claim 8, wherein the receiving, by the first network card, first data sent by each of the plurality of first processor cores in the first node comprises:
    receiving, by the first network card, a first message, including the first data, sent by a first processor core, wherein the first message comprises a mark instructing the first network card to perform an aggregation operation on the first data; and
    the performing, by the first network card, an aggregation operation on the plurality of first data comprises:
    aggregating, by the first network card, the first data in first messages carrying the mark.
  10. The method according to any one of claims 8 to 9, wherein before the first network card receives the first data sent by each of the plurality of first processor cores in the first node, the method further comprises:
    setting, by the first network card, the plurality of first processor cores.
  11. The method according to claim 10, further comprising:
    receiving, by the first network card, a second message sent by the second node, wherein the second message comprises a mark instructing the first network card to broadcast the second message; and
    sending, by the first network card, the second message to the plurality of first processor cores according to the mark of the second message and the marks of the plurality of first processor cores set in the first network card.
  12. The method according to any one of claims 8 to 11, wherein the first node comprises a plurality of core groups, each core group comprises a first processor core and at least one second processor core, and a first core group is any one of the plurality of core groups; and the method further comprises:
    receiving, by the first processor core in the first core group, data sent by the at least one second processor core in the first core group; and
    aggregating, by the first processor core in the first core group, the data sent by the at least one second processor core in the first core group with data in the first processor core into first data.
  13. The method according to claim 12, further comprising:
    receiving, by the first processor core in the first core group, the second message sent by the first network card; and
    sending, by the first processor core in the first core group, the second message to the at least one second processor core in the first core group.
  14. A network card, comprising a processor, wherein the processor is coupled to a memory, the memory is configured to store instructions, and when the instructions are executed by the processor, the network card is caused to perform the method performed by the first network card in any one of claims 8 to 13.
  15. A chip, comprising an interface and a processing unit, wherein the interface is configured to send and receive data, and the processing unit is configured to perform the functions performed by the first network card in any one of claims 8 to 13.
PCT/CN2023/083532 2022-03-24 2023-03-24 Computation system and data transmission method WO2023179741A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210295346.3A CN116841946A (en) 2022-03-24 2022-03-24 Computing system and data transmission method
CN202210295346.3 2022-03-24

Publications (1)

Publication Number Publication Date
WO2023179741A1 (en)

Family

ID=88100072

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/083532 WO2023179741A1 (en) 2022-03-24 2023-03-24 Computation system and data transmission method

Country Status (2)

Country Link
CN (1) CN116841946A (en)
WO (1) WO2023179741A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090292905A1 (en) * 2008-05-21 2009-11-26 International Business Machines Corporation Performing An Allreduce Operation On A Plurality Of Compute Nodes Of A Parallel Computer
EP3495960A1 (en) * 2017-12-05 2019-06-12 Fujitsu Limited Program, apparatus, and method for communicating data between parallel processor cores
CN112422448A (en) * 2020-08-21 2021-02-26 苏州浪潮智能科技有限公司 FPGA accelerator card network data transmission method and related components
CN114221736A (en) * 2020-09-04 2022-03-22 华为技术有限公司 Data processing method, device, equipment and medium
CN113556403A (en) * 2021-07-30 2021-10-26 中科计算技术西部研究院 Communication method and system for distributed training

Also Published As

Publication number Publication date
CN116841946A (en) 2023-10-03

Similar Documents

Publication Publication Date Title
US10324873B2 (en) Hardware accelerated communications over a chip-to-chip interface
CN108270676B (en) Network data processing method and device based on Intel DPDK
WO2020093887A1 (en) Data transmission method and device for network-on-chip (noc) and electronic device
WO2020078044A1 (en) Data processing method and apparatus, and computing device
EP3291089B1 (en) Data processing method and apparatus
US11341087B2 (en) Single-chip multi-processor communication
KR20180018853A (en) Control messaging in multislot link layer flit
CN112291293B (en) Task processing method, related equipment and computer storage medium
US20220078043A1 (en) Cross network bridging
US12034604B2 (en) MQTT protocol simulation method and simulation device
WO2020224300A1 (en) Message shunting method, apparatus and system based on user mode protocol stack
US11403250B2 (en) Operation accelerator, switch, task scheduling method, and processing system
US11449456B2 (en) System and method for scheduling sharable PCIe endpoint devices
CN114553780A (en) Load balancing method and device and network card
He et al. Accl: Fpga-accelerated collectives over 100 gbps tcp-ip
WO2023072065A1 (en) Data processing method and apparatus, electronic device, and storage medium
JP2024517706A (en) Network-connected MPI processing architecture in SMARTNIC
WO2023179741A1 (en) Computation system and data transmission method
WO2008106879A1 (en) Data transfer process device and method
CN116340246B (en) Data pre-reading method and medium for direct memory access read operation
WO2022160714A1 (en) Communication method, apparatus, and system
WO2016127422A1 (en) System, device and method for processing data
US20220179813A1 (en) Transfer device, information processing device, and data transfer method
WO2024093958A1 (en) Access method and apparatus for storage pool
WO2024077999A1 (en) Collective communication method and computing cluster

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23773986

Country of ref document: EP

Kind code of ref document: A1