WO2022061646A1 - Data processing apparatus and method - Google Patents


Info

Publication number
WO2022061646A1
Authority
WO
WIPO (PCT)
Prior art keywords
module
rdma
calculation
data
bus
Application number
PCT/CN2020/117414
Other languages
English (en)
French (fr)
Inventor
夏晶
李冰
吴双
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 华为技术有限公司
Priority to CN202080103887.0A (CN116171429A)
Priority to EP20954498.0A (EP4206932A4)
Priority to PCT/CN2020/117414 (WO2022061646A1)
Publication of WO2022061646A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00: Digital computers in general; Data processing equipment in general
    • G06F15/16: Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
    • G06F15/163: Interprocessor communication
    • G06F15/173: Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
    • G06F15/17306: Intercommunication techniques
    • G06F15/17331: Distributed shared memory [DSM], e.g. remote direct memory access [RDMA]

Definitions

  • the present application relates to the computer field, and more particularly, to a data processing apparatus and method.
  • the data required for MPI_reduce or MPI_allreduce calculation generally comes from multiple nodes, and the calculation needs to be performed after data transmission between nodes through the network.
  • the remote direct memory access (RDMA) technology can be used as a technical means to increase the communication bandwidth between nodes and reduce the communication delay between nodes in HPC or AI scenarios.
  • the calculation part involves many read operations and write operations.
  • the RDMA module on the server side needs to perform at least one read operation and one write operation on the memory module through the bus, which introduces latency and consumes bus bandwidth. Therefore, how to optimize the calculation part to reduce the delay and bandwidth consumption is an urgent problem to be solved.
  • the present application provides a data processing device and method that optimize the computing part in the nodes that perform collective communication, help to reduce the communication delay between nodes, and reduce the number of bus accesses, thereby helping to reduce the bandwidth consumption of the bus.
  • a data processing device is provided, including a remote direct memory access (RDMA) module, a home agent (HA) module (also referred to as the local agent module or local proxy module) and a memory module, wherein the RDMA module and the HA module communicate through a bus, and the HA module and the memory module communicate through a non-bus interface;
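  • the RDMA module is used to: receive an RDMA message and obtain an operand in the RDMA message; and send a first command and the operand to the HA module through the bus, where the first command is used to instruct the HA module to perform a calculation operation on the operand;
  • the HA module is used to: receive the first command and the operand; calculate the operand and local data according to the first command to obtain a calculation result; and write the calculation result to the memory module through the non-bus interface.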
  • the HA module calculates the local data and the operands in the RDMA message and writes the calculation result into the memory module through the non-bus interface. This helps avoid the RDMA module reading local data from the memory module through the bus and writing the calculation result into the memory module through the bus. The embodiments of the present application can therefore optimize the calculation part of the nodes that perform collective communication, which helps to reduce the communication delay between nodes. Further, since the embodiments of the present application can reduce the number of bus accesses, they can help reduce the bandwidth consumption of the bus.
  • the delay of the data processing process includes only one write operation on the bus, which can help reduce the delay of the collective communication algorithm.
  • the communication between the RDMA module and the HA module is through the bus, which means that when data is exchanged between the RDMA module and the HA module, the data needs to be transmitted through the bus.
  • the HA module and the memory module communicate through a non-bus interface, which means that the HA module and the memory module do not communicate through the bus, that is, when the HA module and the memory module interact with each other, the data does not need to be transmitted through the bus.
  • data between the HA module and the memory module may be transferred through a private interface (or dedicated interface).
  • the HA module can be arranged close to the memory module, which shortens the latency of the read operations and/or write operations that the HA module performs on the memory module.
  • the HA module can directly perform a read operation or a write operation on the memory module without going through a cache (for example, L1 cache, L2 cache, or L3 cache, etc., which are not limited).
  • the HA module may be a hardware module, such as a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, without limitation.
  • DSP digital signal processor
  • ASIC application specific integrated circuit
  • FPGA field programmable gate array
  • the memory module may be dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), or the like, without limitation.
  • the local proxy module is further configured to read the local data from the memory module through the non-bus interface.
  • the HA module can calculate the operand and local data according to the first command. Therefore, the present application can help to prevent the RDMA module from reading local data from the memory module through the bus, thereby optimizing the computing part in the nodes that perform collective communication, and helping to reduce the communication delay between nodes. Further, since the embodiment of the present application can reduce the number of bus accesses, it can help reduce the bandwidth consumption of the bus.
  • the local proxy module may further store local data in advance (for example, before receiving the first command), which is not limited in this embodiment of the present application.
  • the local proxy module is further configured to send a response message corresponding to the first command to the RDMA module through the bus, where the response message is used to indicate that the computing operation is completed.
  • the response message is a response to a bus write operation, and its delay is included in a bus write operation.
  • the RDMA module receives the response message, and can determine, according to the response message, that the calculation operation for the operand in the RDMA message is completed.
  • some implementations of the first aspect further include a central processing unit (CPU), and the RDMA module is further configured to report an interrupt to the CPU, where the interrupt is used to indicate completion of the computing operation.
  • after the CPU receives the interrupt, it can determine that the computing operation is complete.
  • a central processing unit (CPU) is further included, and the RDMA module is further configured to receive a poll from the CPU and, in response, report completion of the computing operation.
  • the RDMA module is specifically configured to determine, according to the first indication field in the header of the RDMA packet, that collective communication calculation needs to be performed on the data in the payload of the RDMA packet, and to determine the data in the payload as the operand. In this way, the RDMA module can determine from the acquired RDMA message that collective communication calculation needs to be performed, and obtain the operand for the collective communication calculation.
  • the first indication field may be used to indicate that the collective communication calculation is performed on the data in the payload of the RDMA message.
  • the present application can flexibly indicate whether to perform collective communication calculation for the data in the payload of each RDMA packet.
  • the first indication field may be included in the opcode of the base transport header of the RDMA message, which is not limited in this application.
  • QPC: queue pair context
  • MR: memory region
  • the collective communication computation comprises a collective communication reduction computation, or a collective communication reduce-and-broadcast computation.
  • the RDMA module is further configured to determine, according to the second indication field in the header of the RDMA message, the computing operation that needs to be performed on the operand, and to generate the first command according to the computing operation. In this way, the RDMA module can generate the first command for instructing the calculation operation according to the acquired RDMA message.
  • the second indication field may be used to indicate a calculation operation performed on the data in the payload of the RDMA packet.
  • the present application can flexibly indicate the type of calculation operation performed on the data in the payload of each RDMA packet.
  • the second indication field may be implemented by the rsv field segment in the base transport header in the RDMA message, or by adding data encoding in the extended transport header or payload, which is not limited in this application.
  • the calculation operation that needs to be performed on the data in the payload of the RDMA message may also be specified through QPC and/or MR.
  • the computing operation includes at least one of an addition operation, a maximum value operation, an AND operation, an OR operation, an exclusive OR operation, and a minimum value operation.
  • the RDMA module is an RDMA network interface controller.
  • a data processing method comprising:
  • the local proxy module receives the first command sent by the RDMA module and the operand in the RDMA message through the bus, and the first command is used to instruct the local proxy module to perform a calculation operation on the operand;
  • the local agent module calculates the operand and local data according to the first command to obtain a calculation result;
  • the local agent module writes the calculation result to the memory module through the non-bus interface.
  • the method may further include: the RDMA module receives an RDMA message, and obtains an operand in the RDMA message;
  • the RDMA module sends the first command and the operand to the local agent module through the bus.
  • the method can be applied to an apparatus comprising a remote direct memory access (RDMA) module, a local proxy module and a memory module, such as the apparatus of the first aspect or various implementations of the first aspect.
  • the RDMA module and the local proxy module communicate through a bus
  • the local proxy module and the memory module communicate through a non-bus interface
  • the local proxy module reads the local data from the memory module through the non-bus interface.
  • the local agent module sends a response message corresponding to the first command to the RDMA module through the bus, where the response message is used to indicate completion of the computing operation.
  • the RDMA module reports an interrupt to the central processing unit CPU after receiving the response message, where the interrupt is used to indicate completion of the computing operation.
  • the RDMA module receives a poll from the CPU to return completion of the computing operation.
  • the RDMA module determines, according to the first indication field in the header of the RDMA message, that collective communication calculation needs to be performed on the data in the payload of the RDMA message;
  • the RDMA module determines data in the payload as the operand.
  • the collective communication computation comprises a collective communication reduction computation, or a collective communication reduce-and-broadcast computation.
  • the RDMA module determines, according to the second indication field in the header of the RDMA message, the computing operation that needs to be performed on the operand;
  • the RDMA module generates the first command according to the calculation operation.
  • the computing operation includes at least one of an addition operation, a maximum value operation, an AND operation, an OR operation, an exclusive OR operation, and a minimum value operation.
  • the RDMA module is an RDMA network interface controller.
  • FIG. 1 is a schematic diagram of a system architecture corresponding to an existing collective communication solution
  • FIG. 2 is a schematic diagram of a system architecture corresponding to another existing collective communication solution
  • FIG. 3 is a schematic block diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 4 is an example of the format of an RDMA message
  • FIG. 5 is a schematic diagram of HA modules and common processing cores
  • FIG. 6 is a schematic block diagram of an apparatus for data processing provided by an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • FIG. 1 shows a schematic diagram of a system architecture corresponding to an existing collective communication solution.
  • the calculation part of the collective communication calculation algorithm is executed by the CPU in the system.
  • the system architecture includes RDMA module, memory module and CPU.
  • after the RDMA module receives the RDMA message from the network, it can write the payload (including data A2) in the RDMA message to the memory module through the bus, and the write operation can correspond to the data transmission process corresponding to the arrow (1) in FIG. 1.
  • after the RDMA module writes the payload in the RDMA message to the memory module, the RDMA module can report an interrupt to the CPU, and the interrupt reporting operation can correspond to the data transmission process corresponding to the arrow (2) in FIG. 1.
  • the CPU can determine that the data A2 in the RDMA message has been written into the memory according to the interrupt.
  • the RDMA module may not report an interrupt to the CPU, but the CPU determines that the data A2 has been written to the memory by polling.
  • after the CPU determines that the data in the RDMA message has been written into the memory, it can initiate a bus read operation to read the data A2 written into the memory module into the CPU. The read operation can correspond to the data transmission process corresponding to the arrow (3) in FIG. 1.
  • the CPU can read the local data A3 stored in the memory module into the CPU, and the read operation can correspond to the data transmission process corresponding to the arrow corresponding to (4) in FIG. 1 .
  • the CPU completes the calculation of the data A2 and A3.
  • the CPU writes the calculation result to the memory module through the bus.
  • the write operation may correspond to the data transmission process corresponding to the arrow corresponding to (5) in FIG. 1 .
  • the delay of the data processing process includes the data transmission processes corresponding to the arrows (1) to (5) in FIG. 1 and the calculation delay of the CPU, wherein each data transfer process in (1) to (5) requires a read or write operation on the bus.
  • FIG. 2 shows a schematic diagram of a system architecture corresponding to another existing collective communication solution, wherein the calculation part of the collective communication calculation algorithm is performed by an RDMA module.
  • the system architecture includes an RDMA module, a memory module and a CPU.
  • when the RDMA module receives the RDMA message, it can read the data A4 in the memory module through the bus, and the data A4 then waits to participate in the calculation.
  • the read operation may correspond to the data transmission process corresponding to the arrow corresponding to (1) in FIG. 2 .
  • the RDMA module receives the RDMA packets from the network, and determines the collective communication calculation (such as MPI_reduce or MPI_allreduce calculation) and the corresponding calculation operation type according to the relevant information in the packet header. After performing the calculation, the RDMA module can write the calculation result to the memory module through the bus.
  • the write operation may correspond to the data transmission process corresponding to the arrow corresponding to (2) in FIG. 2 .
  • the RDMA module may report an interrupt to the CPU, and the reporting interrupt operation may correspond to the data transmission process corresponding to the arrow corresponding to (3) in FIG. 2 .
  • the CPU may determine whether the calculation operation is completed by polling, and at this time, the RDMA module may not report an interrupt to the CPU.
  • the delay of the data processing process includes the data transmission processes corresponding to the arrows (1) and (2) in FIG. 2 and the calculation delay of the RDMA module. The data transmission processes corresponding to (1) and (2) each require a read operation or a write operation on the bus.
  • in both existing solutions, the calculation part involves many read operations and write operations.
  • in the solution of FIG. 1, two read operations and two write operations need to be performed on the memory module through the bus (one write by the RDMA module, and two reads and one write by the CPU).
  • in the solution of FIG. 2, the RDMA module needs to perform one read operation and one write operation on the memory module through the bus, which causes delay and bandwidth consumption.
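  • the bus access counts of the two existing solutions can be tallied with a minimal sketch. The step lists below restate the numbered arrows of FIG. 1 and FIG. 2 as described above; the names `FIG1_CPU_COMPUTES`, `FIG2_RDMA_COMPUTES` and `bus_access_counts` are illustrative and do not come from the application.

```python
from collections import Counter

# Memory accesses over the bus, following the step numbers of FIG. 1 and FIG. 2.
# Interrupt and polling steps are left out because they are not memory accesses.
FIG1_CPU_COMPUTES = [
    ("RDMA module", "write"),  # (1) payload A2 -> memory module
    ("CPU", "read"),           # (3) A2 -> CPU
    ("CPU", "read"),           # (4) local data A3 -> CPU
    ("CPU", "write"),          # (5) calculation result -> memory module
]
FIG2_RDMA_COMPUTES = [
    ("RDMA module", "read"),   # (1) local data A4 -> RDMA module
    ("RDMA module", "write"),  # (2) calculation result -> memory module
]


def bus_access_counts(steps):
    """Count bus reads and writes in a list of (initiator, kind) steps."""
    return Counter(kind for _, kind in steps)


fig1 = bus_access_counts(FIG1_CPU_COMPUTES)
fig2 = bus_access_counts(FIG2_RDMA_COMPUTES)
print(dict(fig1), dict(fig2))
```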
  • an embodiment of the present application provides a data processing solution in which a home agent (HA) module is added to the system architecture of collective communication. The HA module can perform calculations on local data and the operands in RDMA messages and write the calculation results into the memory module through the non-bus interface. This can help avoid the RDMA module reading local data from the memory module through the bus and writing the calculation results into the memory module through the bus, so that the calculation part is optimized and the delay and bandwidth consumption of collective communication are reduced.
  • HA home agent
  • FIG. 3 shows a schematic block diagram of a system architecture 100 provided by an embodiment of the present application.
  • the system architecture 100 may be, for example, an HPC server, or an AI training center server, or the like.
  • a collective communication system may include a plurality of the system architectures 100, and each of the system architectures 100 may be referred to as a node.
  • data transmission between multiple nodes is carried out through RDMA technology, and then the data is calculated to realize MPI_reduce or MPI_allreduce calculation.
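  • the difference between the two collective calculations named above can be sketched as follows: MPI_reduce leaves the combined result on one root node, whereas MPI_allreduce makes it available on every node. This is a plain in-process simulation, not MPI code; the function names `reduce_to_root` and `allreduce` and the sample vectors are illustrative assumptions.

```python
def reduce_to_root(node_data, op, root=0):
    """Combine equal-length vectors from all nodes; only the root holds the result."""
    result = list(node_data[0])
    for vec in node_data[1:]:
        result = [op(a, b) for a, b in zip(result, vec)]
    return {root: result}


def allreduce(node_data, op):
    """Reduce, then broadcast the result so that every node holds a copy."""
    result = reduce_to_root(node_data, op)[0]
    return {rank: list(result) for rank in range(len(node_data))}


def add(a, b):
    return a + b


nodes = [[1, 2], [10, 20], [100, 200]]  # one vector per node (sample data)
reduced = reduce_to_root(nodes, add)
everywhere = allreduce(nodes, add)
print(reduced, everywhere)
```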
  • the data processing process of one of the nodes performing collective communication is described below, wherein the node needs to perform calculation on the obtained data. It can be understood that the data processing process of other nodes that need to perform data calculation for collective communication is the same as or similar to that of this node; reference may be made to the process described below, which is not repeated.
  • the system architecture 100 includes an RDMA module 110, a home agent (HA) module 120, a memory module 130 and a central processing unit (CPU) 140.
  • RDMA remote direct memory access
  • HA home agent
  • CPU central processing unit
  • the RDMA module 110 can support the RDMA protocol, so that the node in FIG. 3 can perform data transmission with other nodes through the RDMA technology, for example, can receive RDMA packets, send RDMA packets, and the like.
  • the RDMA module 110 may specifically be an RDMA engine, or an RDMA network interface controller (NIC, also referred to as an RDMA network card), which is not limited in this application.
  • NIC network interface controller
  • the RDMA module 110 may receive RDMA messages from other nodes in collective communication over the network.
  • the opcode type of the RDMA message can be, for example, send, send with invalidate, send with immediate, RDMA write, RDMA write with immediate, or RDMA read, which is not limited in the embodiments of the present application. That is to say, other nodes may send to the node shown in FIG. 3, through any of these operations, the RDMA messages on which MPI_reduce or MPI_allreduce calculation needs to be performed.
  • the RDMA module 110 may obtain the operands in the RDMA message, that is, the operands that need to be calculated by MPI_reduce or MPI_allreduce.
  • the data type of the operand may include 8/16/32/64-bit (bit) integer type (int), 8/16/32/64-bit unsigned integer (unsigned INT, UINT) or double-precision floating-point number (double) type, which is not limited in this embodiment of the present application.
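  • the listed operand types map directly onto fixed-width binary encodings. The sketch below splits a payload into operands of one type with Python's `struct` module. The little-endian byte order, the type-name strings and the helper `decode_operands` are assumptions made for illustration; the application does not specify how operands are laid out in the payload.

```python
import struct

# struct format characters for the operand types listed above
# (standard sizes are selected by the "<" byte-order prefix).
FORMATS = {
    "int8": "b", "int16": "h", "int32": "i", "int64": "q",
    "uint8": "B", "uint16": "H", "uint32": "I", "uint64": "Q",
    "double": "d",
}


def decode_operands(payload: bytes, dtype: str) -> list:
    """Split a payload into operands of one type (little-endian is assumed)."""
    fmt = FORMATS[dtype]
    size = struct.calcsize("<" + fmt)
    if len(payload) % size:
        raise ValueError("payload length is not a multiple of the operand size")
    count = len(payload) // size
    return list(struct.unpack(f"<{count}{fmt}", payload))


payload = struct.pack("<4i", 1, -2, 3, -4)  # sample payload of four int32 values
as_int32 = decode_operands(payload, "int32")
as_uint32 = decode_operands(payload, "uint32")
print(as_int32, as_uint32)
```

  • decoding the same bytes as `uint32` shows why the operand type matters: the bit patterns are identical, but the values that enter the calculation differ.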
  • the RDMA module 110 may determine, according to the first indication field in the packet header of the RDMA packet, that collective communication calculation needs to be performed on the data in the payload (payload) of the RDMA packet, for example, MPI_reduce or MPI_allreduce calculation.
  • the first indication field may be used to indicate to perform collective communication calculation on the data in the payload of the RDMA message.
  • the RDMA module 110 may determine the data in the payload of the RDMA packet as the above-mentioned operand. Therefore, by using the first indication field in the RDMA packet header, it is possible to flexibly indicate whether to perform collective communication calculation for the data in the payload of each RDMA packet.
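  • whether collective communication calculation needs to be performed on the data in the payload of the RDMA message may also be specified through a queue pair context (QPC) and/or a memory region (MR).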
  • the MR is a storage area defined in the RDMA protocol, for example, a memory space that can be received or sent by the RDMA module.
  • an indication bit #1 may exist in the data structure of the QPC of a queue pair (queue pair, QP) 1 to indicate that a collective communication operation is performed on the data in the payload of the RDMA message.
  • in this case, collective communication calculation is always performed on the messages belonging to QP1.
  • an indication bit #2 exists in the data structure in MR1 to indicate that a collective communication operation is performed on the data in the payload of the RDMA message.
  • in this case, collective communication calculation is always performed on the RDMA messages that operate on MR1.
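  • the three ways of requesting collective communication calculation (per-message header field, per-QP indication bit #1, per-MR indication bit #2) can be combined as in the minimal sketch below. The class and field names are hypothetical, and treating the three sources as a simple logical OR is an assumption; the application does not define how they interact.

```python
from dataclasses import dataclass


@dataclass
class QueuePairContext:
    collective_bit: bool = False  # "indication bit #1" (field name is hypothetical)


@dataclass
class MemoryRegion:
    collective_bit: bool = False  # "indication bit #2" (field name is hypothetical)


def needs_collective_calc(header_flag: bool,
                          qpc: QueuePairContext,
                          mr: MemoryRegion) -> bool:
    """True if the header field, the QPC bit or the MR bit requests the calculation."""
    return header_flag or qpc.collective_bit or mr.collective_bit


qp1 = QueuePairContext(collective_bit=True)
plain_qp = QueuePairContext()
mr1 = MemoryRegion(collective_bit=True)
plain_mr = MemoryRegion()
print(needs_collective_calc(False, qp1, plain_mr))
```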
  • the RDMA module 110 may determine, according to the second indication field in the packet header of the RDMA packet, the calculation operation to be performed on the data in the payload of the RDMA packet (that is, the above-mentioned operands), for example, a calculation operation in MPI_reduce or MPI_allreduce calculation.
  • the second indication field may be used to indicate a calculation operation performed on the data in the payload of the RDMA packet. Therefore, by using the second indication field in the RDMA packet header, it is possible to flexibly indicate the type of calculation operation performed on the data in the payload of each RDMA packet.
  • the calculation operation that needs to be performed on the data in the payload of the RDMA message may also be specified through QPC and/or MR.
  • an indication bit #3 may exist in the data structure of the QPC of QP1 to indicate the calculation operation performed on the data in the payload of the RDMA message. When the indication bit #3 is valid (even if there is no indication field in the header of the RDMA message), the calculation operation specified by the indication bit #3 is always performed on the messages belonging to QP1.
  • an indication bit #4 may exist in the data structure of MR1 to indicate a calculation operation performed on the data in the payload of the RDMA message. When the indication bit #4 is valid (even if there is no indication field in the header of the RDMA message), the calculation operation specified by the indication bit #4 is always performed on the RDMA messages that operate on MR1.
  • the calculation operation on the data in the payload of the RDMA packet may include, for example, at least one of an addition (add) operation, a maximum value (max) operation, an AND (and) operation, an OR (or) operation, an exclusive OR (xor) operation, and a minimum value (min) operation.
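  • the six calculation operations listed above can be sketched as a lookup table of element-wise functions. The table name `REDUCE_OPS` and the helper `apply_op` are illustrative; note that the bitwise operations (and, or, xor) only make sense for the integer operand types.

```python
import operator

# Element-wise operations corresponding to the list above.
REDUCE_OPS = {
    "add": operator.add,
    "max": max,
    "and": operator.and_,
    "or": operator.or_,
    "xor": operator.xor,
    "min": min,
}


def apply_op(name, operands, local_data):
    """Combine message operands with local data, element by element."""
    op = REDUCE_OPS[name]
    return [op(a, b) for a, b in zip(operands, local_data)]


summed = apply_op("add", [1, 2, 3], [10, 20, 30])
xored = apply_op("xor", [0b1100], [0b1010])
print(summed, xored)
```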
  • Figure 4 shows an example of the format of an RDMA message.
  • the RDMA message can include local router header, global transport header, base transport header, extended transport header, payload, invariant CRC and variant CRC.
  • the above-mentioned first indication field may be included in the opcode of the base transport header, and the second indication field may be implemented through the rsv field segment in the base transport header, or by adding data encoding in the extended transport header or payload, which is not limited in this application.
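  • one possible reading of the two indication fields is sketched below for a 12-byte base transport header. The field layout follows the commonly documented InfiniBand-style header (opcode, flags, partition key, an 8-bit reserved field, destination QP, a second reserved byte, packet sequence number) and is itself an assumption here. The opcode value `COLLECTIVE_OPCODE`, the use of the 8-bit reserved field for the calculation operation, and the numeric encoding in `OP_NAMES` are all hypothetical; the application does not define them.

```python
import struct

COLLECTIVE_OPCODE = 0xE0  # hypothetical opcode value marking collective calculation
OP_NAMES = {1: "add", 2: "max", 3: "and", 4: "or", 5: "xor", 6: "min"}  # hypothetical

BTH_FORMAT = "!BBHBBHBBH"  # 12 bytes in network byte order


def parse_bth(bth: bytes) -> dict:
    """Parse a 12-byte base transport header (InfiniBand-style layout assumed)."""
    if len(bth) != 12:
        raise ValueError("BTH must be 12 bytes")
    (opcode, _flags, _pkey, rsv8,
     dqp_hi, dqp_lo, _ack_rsv, psn_hi, psn_lo) = struct.unpack(BTH_FORMAT, bth)
    return {
        "opcode": opcode,
        "is_collective": opcode == COLLECTIVE_OPCODE,  # first indication field
        "calc_op": OP_NAMES.get(rsv8),                 # second indication field, None if unset
        "dest_qp": (dqp_hi << 16) | dqp_lo,            # 24-bit destination QP
        "psn": (psn_hi << 16) | psn_lo,                # 24-bit packet sequence number
    }


header = struct.pack(BTH_FORMAT, COLLECTIVE_OPCODE, 0, 0xFFFF, 1, 0, 7, 0, 0, 42)
info = parse_bth(header)
print(info)
```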
  • the RDMA module 110 may communicate with the HA module 120 through a bus.
  • the RDMA module 110 may send an operation command and an operand in the RDMA message to the HA module 120 through the bus, where the operation command is used to instruct the HA module 120 to perform a calculation operation on the operand.
  • the operation command and operand sent by the RDMA module 110 to the HA module 120 may correspond to the data transmission process corresponding to the arrow corresponding to (1) in FIG. 3 , that is, a bus write operation.
  • the HA module 120 may communicate with the RDMA module 110 through a bus, and communicate with the memory module 130 through a non-bus interface.
  • the HA module 120 may receive operands and operation commands from the RDMA module 110 over the bus.
  • the HA module 120 can also read local data, such as A1, from the memory module 130 through a non-bus interface.
  • the reading of local data from the memory module 130 by the HA module 120 may correspond to the data transmission process corresponding to the arrow corresponding to (2) in FIG. 3 .
  • the HA module 120 may also store local data in advance (for example, before receiving an operation command), which is not limited in this embodiment of the present application.
  • the HA module 120 can also perform calculation operations, for example, according to the obtained operation command from the RDMA module 110, perform calculation operations on the obtained operands of the RDMA module 110 and the local data of the memory module 130 to obtain calculation results. After obtaining the calculation result, the HA module 120 can write the calculation result into the memory module 130 through the non-bus interface. Continuing to refer to FIG. 3 , the writing of the calculation result by the HA module 120 into the memory module 130 may correspond to the data transmission process corresponding to the arrow corresponding to (3) in FIG. 3 .
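  • the flow of FIG. 3 can be sketched as a small simulation in which the only bus transfer is the command and operand sent from the RDMA module to the HA module, while the HA module reads and writes memory through its private interface. All class and method names are illustrative, and writing the result back to the address of the local data is an assumption; the application does not say where the result is stored.

```python
import operator

OPS = {"add": operator.add, "max": max, "min": min}


class MemoryModule:
    def __init__(self, data):
        self.data = dict(data)


class Bus:
    """Counts transfers so that flows can be compared."""
    def __init__(self):
        self.reads = 0
        self.writes = 0


class HomeAgent:
    def __init__(self, memory):
        self.memory = memory  # reached through the non-bus (private) interface

    def execute(self, op_name, operands, addr):
        local = self.memory.data[addr]                            # (2) non-bus read
        result = [OPS[op_name](a, b) for a, b in zip(operands, local)]
        self.memory.data[addr] = result                           # (3) non-bus write
        return "done"                                             # response to the command


class RdmaModule:
    def __init__(self, bus, home_agent):
        self.bus = bus
        self.home_agent = home_agent

    def on_message(self, op_name, operands, addr):
        self.bus.writes += 1                                      # (1) command + operand over the bus
        return self.home_agent.execute(op_name, operands, addr)


memory = MemoryModule({"A1": [1, 2, 3]})
bus = Bus()
rdma = RdmaModule(bus, HomeAgent(memory))
status = rdma.on_message("add", [10, 20, 30], "A1")
print(status, memory.data["A1"], bus.reads, bus.writes)
```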
  • the HA module 120 may be implemented by a circuit module having the functions of reading data from the memory module 130, writing data to the memory module 130, and performing computing operations on the data.
  • the HA module 120 may be a hardware module, for example, a general-purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, without limitation.
  • the memory module 130 may be a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDR SDRAM), or the like, without limitation.
  • DRAM dynamic random access memory
  • SDRAM synchronous dynamic random access memory
  • DDR SDRAM double data rate synchronous dynamic random access memory
  • the communication between the RDMA module 110 and the HA module 120 is through the bus, which means that when data is exchanged between the RDMA module 110 and the HA module 120, the data needs to be transmitted through the bus.
  • the communication between the HA module 120 and the memory module 130 is through a non-bus interface, which means that the HA module 120 and the memory module 130 do not communicate through a bus, that is, when data is exchanged between the HA module 120 and the memory module 130, the data does not need to be transmitted over the bus.
  • data between the HA module 120 and the memory module 130 may be transferred through a private interface (or a dedicated interface).
  • when the HA module 120 is arranged close to the memory module 130, the delay when the HA module 120 performs a read operation or a write operation on the memory module 130 can be shortened.
  • FIG. 5 shows a schematic diagram of the HA module and common processing cores.
  • the description is given by taking DRAM and non-volatile memory (NVM) as the memory module and the processing core as the CPU as an example.
  • the processing core 1 to the processing core n 301 perform read or write operations on the DRAM 305 through their respective layer 1 (L1) caches and layer 2 (L2) caches and the shared layer 3 (L3) cache.
  • the DRAM 305 can further perform data transfer with the NVM 306.
  • the processing core 302 performs read or write operations on the DRAM 307 through the cache, and the DRAM 307 can further perform data transmission with the NVM 308.
  • the HA module 303 can directly perform a read operation or a write operation on the DRAM 307 without going through the cache.
  • the HA module 304 can read or write to the NVM 308 without going through the cache.
  • the processing core 1 to the processing core n 301 and the processing core 302 belong to conventional computing systems, which are centered on the computing processing core. Because the memory module and the computing processing core differ greatly in frequency and communication/processing speed, the computing processing core and the memory module generally do not communicate directly, but communicate through the cache module and the bus.
  • the HA module 303 and the HA module 304 are dedicated computing modules, which can be connected directly to the memory module through a private interface and matched to the memory module in frequency and communication/processing speed, so that they can perform read or write operations on the memory module directly.
  • the HA module calculates the local data and the operands in the RDMA message, and writes the obtained calculation result into the memory module through the non-bus interface, which can help prevent the RDMA module from passing the bus through the bus.
  • the local data is read from the memory module, and the calculation result is written into the memory module through the bus. Therefore, the embodiment of the present application can optimize the calculation part of the nodes that perform collective communication, which helps to reduce the time required for communication between nodes. extension. Further, since the embodiment of the present application can reduce the number of bus accesses, it can help reduce the bandwidth consumption of the bus.
  • the delay of the data processing process includes the data transmission process corresponding to the arrows corresponding to (1) to (3) in FIG. Calculate the delay. Since the read and write operations of the HA module 120 to the memory module 130 do not need to go through the bus, the delay only includes one write operation to the bus, which can help reduce the delay of collective communication algorithm operations.
  • while writing the calculation result to the memory module 130 through the non-bus interface, or after doing so, the HA module 120 can send a response message corresponding to the above-mentioned operation command (that is, the process corresponding to (1) in FIG. 3); the response message indicates that the computing operation corresponding to the operation command is complete.
  • the response is the response of one bus write operation, and its delay is included in that bus write operation.
  • the RDMA module 110 receives the response message and can determine from it that the calculation operation on the operand in the RDMA packet is complete.
  • the RDMA module 110 may notify the CPU 140 that the computing operation is complete by means of an interrupt or polling.
  • the RDMA module 110 may send an interrupt to the CPU 140 through the bus, and when the CPU 140 receives the interrupt, it can determine that the computing operation is complete.
  • the CPU 140 may periodically poll to determine whether the calculation process is completed, and the RDMA module 110 does not need to report an interrupt at this time.
  • compared with the scheme in FIG. 1, the delay of the data processing during a collective communication algorithm operation includes only one write operation on the bus, so the delay of the collective communication algorithm operation is greatly reduced and system performance is improved.
  • the CPU only needs one interrupt or polling operation and does not need to perform a read operation or a write operation on the memory module. Therefore, compared with the scheme shown in FIG. 1, the embodiment of the present application can improve the processing efficiency of the CPU.
  • compared with the scheme in FIG. 2, the delay of the data processing during a collective communication algorithm operation includes only one write operation on the bus.
  • when collective communication operations must be performed on a large amount of data, the embodiment of the present application can significantly reduce the delay of the collective communication algorithm operation and improve system performance.
  • FIG. 6 shows a schematic block diagram of an apparatus 600 for data processing provided by an embodiment of the present application.
  • the apparatus 600 includes a remote direct memory access (RDMA) module 610, a home agent module 620 and a memory module 630.
  • the RDMA module 610 and the home agent module 620 communicate through a bus, and the home agent module 620 and the memory module 630 communicate through a non-bus interface.
  • the RDMA module 610 is used to:
  • receive an RDMA packet and obtain the operand in the RDMA packet; and send a first command and the operand to the home agent module through the bus, where the first command is used to instruct the home agent module to perform a calculation operation on the operand.
  • the home agent module 620 is used to:
  • perform a calculation on the operand and local data according to the first command to obtain a calculation result; and write the calculation result to the memory module through the non-bus interface.
  • the home agent module 620 is further configured to read the local data from the memory module through the non-bus interface.
  • the home agent module 620 is further configured to send a response message corresponding to the first command to the RDMA module through the bus, where the response message is used to indicate completion of the computing operation.
  • the apparatus 600 may further include a central processing unit (CPU).
  • the RDMA module 610 is further configured to report an interrupt to the central processing unit CPU, where the interrupt is used to indicate the completion of the computing operation.
  • the apparatus 600 may further include a central processing unit (CPU).
  • the RDMA module 610 is also configured to receive a poll from the CPU to return the completion of the computing operation.
  • the RDMA module 610 is specifically configured to determine, according to the first indication field in the packet header of the RDMA packet, that a collective communication calculation needs to be performed on the data in the payload of the RDMA packet, and to determine the data in the payload as the operand.
  • the collective communication calculation includes a collective communication reduce calculation, or a collective communication reduce-and-broadcast calculation.
  • the RDMA module 610 is further configured to determine, according to the second indication field in the packet header of the RDMA packet, the calculation operation that needs to be performed on the operand, and to generate the first command according to the calculation operation.
  • the computing operation includes at least one of an add operation, a max operation, an AND operation, an OR operation, an XOR operation, and a min operation.
  • the RDMA module is an RDMA network interface controller.
  • the functions and actions of each module or unit in the above apparatus 600 are only exemplary descriptions.
  • FIG. 7 shows a schematic flowchart of a data processing method 700 provided by an embodiment of the present application.
  • the method 700 can be applied to an apparatus including a remote direct memory access (RDMA) module, a home agent module and a memory module, such as the above-mentioned apparatus 600 or system architecture 100.
  • the method 700 includes the following steps 710-730.
  • the home agent module receives, through the bus, a first command sent by the RDMA module and an operand in the RDMA packet, where the first command is used to instruct the home agent module to perform a calculation operation on the operand.
  • the home agent module performs a calculation on the operand and local data according to the first command to obtain a calculation result.
  • the home agent module writes the calculation result to the memory module through the non-bus interface.
  • the method 700 may further include: the RDMA module receives an RDMA packet, and obtains an operand in the RDMA packet;
  • the RDMA module sends the first command and the operand to the home agent module through the bus.
  • the method 700 may further include: the home agent module reads the local data from the memory module through the non-bus interface.
  • the method 700 further includes: the home agent module sends a response message corresponding to the first command to the RDMA module through the bus, where the response message is used to indicate completion of the computing operation.
  • the method 700 further includes: the RDMA module reports an interrupt to the CPU after receiving the response message, where the interrupt is used to indicate completion of the computing operation.
  • the method 700 further includes: the RDMA module receiving a poll from the CPU to return completion of the computing operation.
  • the method 700 further includes: the RDMA module determines, according to the first indication field in the header of the RDMA packet, that a collective communication calculation needs to be performed on the data in the payload of the RDMA packet; the RDMA module determines the data in the payload as the operand.
  • the collective communication calculation includes a collective communication reduce calculation, or a collective communication reduce-and-broadcast calculation.
  • the method 700 further includes: the RDMA module determines, according to the second indication field in the header of the RDMA packet, the calculation operation that needs to be performed on the operand; the RDMA module generates the first command according to the calculation operation.
  • the computing operation includes at least one of an add operation, a max operation, an AND operation, an OR operation, an XOR operation, and a min operation.
  • the RDMA module is an RDMA network interface controller.
  • the magnitude of the sequence numbers of the above-mentioned processes does not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
  • the technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bus Control (AREA)
  • Multi Processors (AREA)

Abstract

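A data processing apparatus (100) and method. The data processing apparatus (100) includes a remote direct memory access (RDMA) module (110), a home agent (HA) module (120), and a memory module (130). The HA module (120) performs a calculation on local data and an operand in an RDMA packet and writes the calculation result to the memory module (130) through a non-bus interface. This helps avoid reading the local data from the memory module (130) over the bus and writing the calculation result to the memory module (130) over the bus, so the apparatus (100) can optimize the calculation part in nodes performing collective communication and helps reduce inter-node communication latency. In addition, because the apparatus (100) can reduce the number of bus accesses, it helps reduce bus bandwidth consumption.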

Description

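Data Processing Apparatus and Method
Technical Field
This application relates to the computer field, and more specifically, to a data processing apparatus and method.
Background
With the rapid development of high performance computing (HPC) clusters and artificial intelligence (AI), the demand for data computation keeps growing. In HPC or AI scenarios there are many processes, and a large amount of data processing requires multiple processes to cooperate. When multiple processes cooperate, collective communication is required. As an example, in collective communication, the reduce (MPI_reduce) or reduce-and-broadcast (MPI_allreduce) topologies may account for a large share, for example up to 40%. It is therefore necessary to optimize these algorithms specifically. Calculation is an important part of MPI_reduce or MPI_allreduce, and the execution efficiency of the calculation part strongly affects the execution efficiency of the collective computing algorithm. Optimizing the calculation part is therefore particularly important.
In scale-out solutions for HPC or AI, the data required for an MPI_reduce or MPI_allreduce calculation generally comes from multiple nodes, and the calculation is performed after the data has been transmitted between nodes over the network. Because of its high bandwidth and low latency, remote direct memory access (RDMA) technology can be used in HPC or AI scenarios to increase inter-node communication bandwidth and reduce inter-node communication latency.
In current collective communication solutions, the calculation part involves a relatively large number of read and write operations. For example, the RDMA module on the server side needs to perform at least one read operation and one write operation on the memory module through the bus, which introduces latency and consumes bandwidth. How to optimize the calculation part to reduce latency and bandwidth consumption is therefore a problem to be solved urgently.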
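Summary
This application provides a data processing apparatus and method that can optimize the calculation part in nodes performing collective communication, help reduce inter-node communication latency, and reduce the number of bus accesses, thereby helping reduce bus bandwidth consumption.
According to a first aspect, a data processing apparatus is provided, including a remote direct memory access (RDMA) module, a home agent (HA) module, and a memory module, where the RDMA module and the home agent module communicate through a bus, and the home agent module and the memory module communicate through a non-bus interface;
the RDMA module is configured to:
receive an RDMA packet, and obtain an operand in the RDMA packet; and
send a first command and the operand to the home agent module through the bus, where the first command instructs the home agent module to perform a calculation operation on the operand; and
the home agent module is configured to:
perform a calculation on the operand and local data according to the first command to obtain a calculation result; and
write the calculation result to the memory module through the non-bus interface.
Therefore, in the embodiments of this application, the HA module performs a calculation on the local data and the operand in the RDMA packet and writes the calculation result to the memory module through the non-bus interface. This helps avoid the RDMA module reading the local data from the memory module over the bus and writing the calculation result to the memory module over the bus. The embodiments can thus optimize the calculation part in nodes performing collective communication and help reduce inter-node communication latency. Further, because the embodiments can reduce the number of bus accesses, they help reduce bus bandwidth consumption.
As an example, because the HA module's read and write operations on the memory module do not go through the bus, the latency of this data processing includes only one write operation on the bus, which helps reduce the latency of collective communication algorithm operations.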
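It should be noted that communication between the RDMA module and the HA module through the bus means that data exchanged between them is transmitted over the bus. Communication between the HA module and the memory module through a non-bus interface means that they do not communicate through the bus; that is, data exchanged between the HA module and the memory module does not need to be transmitted over the bus. For example, data between the HA module and the memory module may be transmitted through a private (or dedicated) interface.
As an example, in a computer system, the HA module may be placed close to the memory module so that it can perform read and/or write operations on the memory module with a shorter latency. In some optional embodiments, the HA module may read or write the memory module directly, without going through a cache (for example, an L1 cache, L2 cache, or L3 cache; this is not limited).
For example, the HA module may be a hardware module, such as a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. This is not limited.
For example, the memory module may be a dynamic random access memory (DRAM), a synchronous DRAM (SDRAM), a double data rate SDRAM (DDR SDRAM), or the like. This is not limited.
With reference to the first aspect, in some implementations of the first aspect, the home agent module is further configured to read the local data from the memory module through the non-bus interface. In this way, after receiving the first command and the operand, the HA module can perform the calculation on the operand and the local data according to the first command. This application therefore helps avoid the RDMA module reading the local data from the memory module over the bus, which optimizes the calculation part in nodes performing collective communication and helps reduce inter-node communication latency. Further, because the embodiments of this application can reduce the number of bus accesses, they help reduce bus bandwidth consumption.
In some possible embodiments, the home agent module may also store the local data in advance (for example, before receiving the first command). This is not limited in the embodiments of this application.
With reference to the first aspect, in some implementations of the first aspect, the home agent module is further configured to send a response message corresponding to the first command to the RDMA module through the bus, where the response message indicates that the calculation operation is complete. Here, the response message is the response to one bus write operation, and its latency is included in that bus write operation. Correspondingly, the RDMA module receives the response message and can determine from it that the calculation operation on the operand in the RDMA packet is complete.
With reference to the first aspect, in some implementations of the first aspect, the apparatus further includes a central processing unit (CPU), and the RDMA module is further configured to report an interrupt to the CPU, where the interrupt indicates completion of the calculation operation. Correspondingly, on receiving the interrupt, the CPU can determine that the calculation operation is complete.
With reference to the first aspect, in some implementations of the first aspect, the apparatus further includes a central processing unit (CPU), and the RDMA module is further configured to receive a poll from the CPU and return the completion of the calculation operation.
With reference to the first aspect, in some implementations of the first aspect, the RDMA module is specifically configured to: determine, according to a first indication field in the header of the RDMA packet, that a collective communication calculation needs to be performed on the data in the payload of the RDMA packet; and determine the data in the payload as the operand. In this way, the RDMA module can determine from the received RDMA packet that a collective communication calculation is needed and obtain the operand for it.
Here, the first indication field may indicate that a collective communication calculation is to be performed on the data in the payload of the RDMA packet. By using the first indication field in the RDMA packet header, this application can flexibly indicate, for the payload data of each RDMA packet, whether a collective communication calculation is to be performed on it.
In some possible implementations, the first indication field may be included in the opcode of the base transport header of the RDMA packet. This is not limited in this application.
In other possible implementations, a queue pair context (QPC) and/or a memory region (MR) may be used to specify that a collective communication calculation needs to be performed on the data in the payload of the RDMA packet.
With reference to the first aspect, in some implementations of the first aspect, the collective communication calculation includes a collective communication reduce calculation, or a collective communication reduce-and-broadcast calculation.
With reference to the first aspect, in some implementations of the first aspect, the RDMA module is further configured to: determine, according to a second indication field in the header of the RDMA packet, the calculation operation that needs to be performed on the operand; and generate the first command according to the calculation operation. In this way, the RDMA module can generate, from the received RDMA packet, the first command that instructs the calculation operation.
Here, the second indication field may indicate the calculation operation to be performed on the data in the payload of the RDMA packet. By using the second indication field in the RDMA packet header, this application can flexibly indicate, for the payload data of each RDMA packet, the type of calculation operation to be performed on it.
In some possible implementations, the second indication field may be implemented through the rsv field in the base transport header of the RDMA packet, through an added extended transport header, or through data encoding within the payload. This is not limited in this application.
In other possible implementations, the QPC and/or the MR may be used to specify the calculation operation that needs to be performed on the data in the payload of the RDMA packet.
With reference to the first aspect, in some implementations of the first aspect, the calculation operation includes at least one of an add operation, a max operation, an AND operation, an OR operation, an XOR operation, and a min operation.
With reference to the first aspect, in some implementations of the first aspect, the RDMA module is an RDMA network interface controller.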
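According to a second aspect, a data processing method is provided, and the method includes:
a home agent module receives, through a bus, a first command sent by an RDMA module and an operand in an RDMA packet, where the first command instructs the home agent module to perform a calculation operation on the operand;
the home agent module performs a calculation on the operand and local data according to the first command to obtain a calculation result; and
the home agent module writes the calculation result to a memory module through a non-bus interface.
The method may further include: the RDMA module receives the RDMA packet and obtains the operand in the RDMA packet; and
the RDMA module sends the first command and the operand to the home agent module through the bus.
The method may be applied to an apparatus including a remote direct memory access (RDMA) module, a home agent module, and a memory module, for example, the apparatus in the first aspect or in any implementation of the first aspect, where the RDMA module and the home agent module communicate through a bus, and the home agent module and the memory module communicate through a non-bus interface.
With reference to the second aspect, in some implementations of the second aspect, the method further includes:
the home agent module reads the local data from the memory module through the non-bus interface.
With reference to the second aspect, in some implementations of the second aspect, the method further includes:
the home agent module sends a response message corresponding to the first command to the RDMA module through the bus, where the response message indicates that the calculation operation is complete.
With reference to the second aspect, in some implementations of the second aspect, the method further includes:
after receiving the response message, the RDMA module reports an interrupt to a central processing unit (CPU), where the interrupt indicates completion of the calculation operation.
With reference to the second aspect, in some implementations of the second aspect, the method further includes:
the RDMA module receives a poll from the CPU and returns the completion of the calculation operation.
With reference to the second aspect, in some implementations of the second aspect, the method further includes:
the RDMA module determines, according to a first indication field in the header of the RDMA packet, that a collective communication calculation needs to be performed on the data in the payload of the RDMA packet; and
the RDMA module determines the data in the payload as the operand.
With reference to the second aspect, in some implementations of the second aspect, the collective communication calculation includes a collective communication reduce calculation, or a collective communication reduce-and-broadcast calculation.
With reference to the second aspect, in some implementations of the second aspect, the method further includes:
the RDMA module determines, according to a second indication field in the header of the RDMA packet, the calculation operation that needs to be performed on the operand; and
the RDMA module generates the first command according to the calculation operation.
With reference to the second aspect, in some implementations of the second aspect, the calculation operation includes at least one of an add operation, a max operation, an AND operation, an OR operation, an XOR operation, and a min operation.
With reference to the second aspect, in some implementations of the second aspect, the RDMA module is an RDMA network interface controller.
It should be understood that, for the beneficial effects of the second aspect and its implementations, reference may be made to the beneficial effects of the first aspect and its corresponding implementations. Details are not repeated here.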
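Brief Description of the Drawings
FIG. 1 is a schematic diagram of the system architecture of an existing collective communication solution;
FIG. 2 is a schematic diagram of the system architecture of another existing collective communication solution;
FIG. 3 is a schematic block diagram of a system architecture according to an embodiment of this application;
FIG. 4 is an example of the format of an RDMA packet;
FIG. 5 is a schematic diagram of HA modules and ordinary processing cores;
FIG. 6 is a schematic block diagram of a data processing apparatus according to an embodiment of this application;
FIG. 7 is a schematic flowchart of a data processing method according to an embodiment of this application.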
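Detailed Description
The technical solutions of this application are described below with reference to the accompanying drawings.
FIG. 1 is a schematic diagram of the system architecture of an existing collective communication solution, in which the calculation part of the collective communication algorithm is executed by the CPU in the system. As shown in FIG. 1, the system architecture includes an RDMA module, a memory module, and a CPU. After receiving an RDMA packet from the network, the RDMA module may write the payload of the RDMA packet (which contains data A2) to the memory module through the bus. This write operation corresponds to the data transfer indicated by arrow (1) in FIG. 1.
After writing the payload of the RDMA packet to the memory module, the RDMA module may report an interrupt to the CPU. This interrupt report corresponds to the data transfer indicated by arrow (2) in FIG. 1. Correspondingly, the CPU can determine from the interrupt that data A2 in the RDMA packet has been written to memory. In another possible implementation, the RDMA module does not report an interrupt to the CPU; instead, the CPU determines by polling that data A2 has been written to memory.
After determining that the data in the RDMA packet has been written to memory, the CPU may initiate a bus read operation to read data A2 from the memory module into the CPU. This read operation corresponds to the data transfer indicated by arrow (3) in FIG. 1. The CPU may also read local data A3 stored in the memory module into the CPU. This read operation corresponds to the data transfer indicated by arrow (4) in FIG. 1.
The CPU then completes the calculation on data A2 and A3. After obtaining the calculation result, the CPU writes it to the memory module through the bus. This write operation corresponds to the data transfer indicated by arrow (5) in FIG. 1.
In the data processing of FIG. 1, for the RDMA module, a total of two write operations (the data transfers indicated by arrows (1) and (5) in FIG. 1) and two read operations (the data transfers indicated by arrows (3) and (4) in FIG. 1) on the memory module are needed through the bus. The CPU needs to perform one interrupt or polling operation, one write operation on the memory module (arrow (5) in FIG. 1), and two read operations (arrows (3) and (4) in FIG. 1). For the whole system in FIG. 1, the latency of the data processing includes the data transfers indicated by arrows (1) to (5) in FIG. 1 and the computation delay of the CPU, where each of the data transfers (1) to (5) requires a read or write operation on the bus.
FIG. 2 is a schematic diagram of the system architecture of another existing collective communication solution, in which the calculation part of the collective communication algorithm is executed by the RDMA module. As shown in FIG. 2, the system architecture includes an RDMA module, a memory module, and a CPU. Before receiving an RDMA packet, the RDMA module may read data A4 from the memory module through the bus and hold it ready for the calculation. This read operation corresponds to the data transfer indicated by arrow (1) in FIG. 2.
The RDMA module receives an RDMA packet from the network and determines, from the relevant information in the packet header, that a collective communication calculation (for example, an MPI_reduce or MPI_allreduce calculation) is needed, as well as the corresponding type of calculation operation. At this point, the RDMA module may write the calculation result to the memory module through the bus. This write operation corresponds to the data transfer indicated by arrow (2) in FIG. 2.
The RDMA module may then report an interrupt to the CPU. This interrupt report corresponds to the data transfer indicated by arrow (3) in FIG. 2. In another possible implementation, the CPU determines by polling whether the calculation operation is complete, in which case the RDMA module need not report an interrupt to the CPU.
In the data processing of FIG. 2, the RDMA module needs to perform one read operation (arrow (1) in FIG. 2) and one write operation (arrow (2) in FIG. 2) on the memory module through the bus. The CPU needs to perform one interrupt or polling operation. For the whole system in FIG. 2, the latency of the data processing includes the data transfers indicated by arrows (1) and (2) in FIG. 2 and the computation delay of the RDMA module, where the data transfers (1) and (2) each require a read or write operation on the bus.
In the collective communication solutions of FIG. 1 and FIG. 2, the calculation part involves a relatively large number of read and write operations. For example, in FIG. 1 the RDMA module needs two read operations and two write operations on the memory module through the bus, and in FIG. 2 the RDMA module needs one read operation and one write operation on the memory module through the bus, which introduces latency and consumes bandwidth.
In view of this, the embodiments of this application provide a data processing solution in which a home agent (HA) module is added to the system architecture for collective communication. The HA module can perform a calculation on local data and an operand in an RDMA packet and write the calculation result to the memory module through a non-bus interface. This helps avoid the RDMA module reading the local data from the memory module over the bus and writing the calculation result to the memory module over the bus, thereby optimizing the calculation part and reducing the latency and bandwidth consumption of collective communication.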
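FIG. 3 is a schematic block diagram of a system architecture 100 according to an embodiment of this application. The system architecture 100 may be, for example, an HPC server or an AI training center server. For example, an HPC or AI application scenario may include multiple instances of the system architecture 100, each of which may be called a node. In collective communication among multiple nodes, data is transmitted between nodes using RDMA and is then calculated on to implement the MPI_reduce or MPI_allreduce calculation.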
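To make the two collective operations named above concrete, the following is a minimal single-process Python sketch of their semantics. It does not use any MPI library; the function names `simulate_reduce` and `simulate_allreduce` and the list-of-nodes representation are assumptions made for illustration only.

```python
from functools import reduce
import operator


def simulate_reduce(node_values, op, root=0):
    """Model of a reduce: only the root node ends up holding the combined value."""
    combined = reduce(op, node_values)
    return [combined if rank == root else None for rank in range(len(node_values))]


def simulate_allreduce(node_values, op):
    """Model of a reduce-and-broadcast: every node ends up holding the combined value."""
    combined = reduce(op, node_values)
    return [combined] * len(node_values)


# One contribution per node (illustrative values).
contributions = [3, 5, 2, 7]
reduce_view = simulate_reduce(contributions, operator.add)
allreduce_view = simulate_allreduce(contributions, operator.add)
```

In a real deployment each contribution would arrive in an RDMA packet from another node; the sketch only shows which nodes hold the result afterwards.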
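It should be noted that, among the nodes performing collective communication, after data has been transmitted between nodes using RDMA, some nodes need to perform a calculation on the received data, while others do not and may pass the data directly to the next node. Optionally, after a node has performed a calculation on the data, it may also pass the calculation result to the next node.
The data processing of one of the nodes performing collective communication is described below with reference to the system architecture shown in FIG. 3, where this node needs to perform a calculation on the received data. It can be understood that the data processing of the other nodes that need to perform calculations is the same as or similar to that of this node; refer to the description below, and details are not repeated.
As shown in FIG. 3, the system architecture 100 includes an RDMA module 110, a home agent (HA) module 120, a memory module 130, and a central processing unit (CPU) 140.
The RDMA module 110 may support the RDMA protocol so that the node in FIG. 3 can exchange data with other nodes using RDMA, for example, receive and send RDMA packets. For example, the RDMA module 110 may specifically be an RDMA engine or an RDMA network interface controller (NIC, also called an RDMA network card). This is not limited in this application.
As an example, the RDMA module 110 may receive, over the Internet, RDMA packets from other nodes performing collective communication.
Here, the opcode type of the RDMA packet may be, for example, send, send with invalidate, send with immediate, RDMA write, RDMA write with immediate, or RDMA read. This is not limited in the embodiments of this application. In other words, for the node shown in FIG. 3, other nodes may use operations such as send, send with invalidate, send with immediate, RDMA write, RDMA write with immediate, or RDMA read to deliver to this node the RDMA packets on which the MPI_reduce or MPI_allreduce calculation is to be performed.
After receiving an RDMA packet, the RDMA module 110 may obtain the operand in the RDMA packet, that is, the operand on which the MPI_reduce or MPI_allreduce calculation is to be performed. For example, the data type of the operand may include 8/16/32/64-bit integer (int), 8/16/32/64-bit unsigned integer (unsigned int, UINT), or double-precision floating point (double). This is not limited in the embodiments of this application.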
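As an illustration of the operand data types listed above, the sketch below decodes a payload into a list of typed operands using Python's standard `struct` module. The type-tag strings, the big-endian byte order, and the function name `decode_operands` are assumptions; the source does not specify how the data type is signalled or how the payload is laid out.

```python
import struct

# Hypothetical type tags mapped to struct format characters (standard sizes).
DTYPE_FORMATS = {
    "int8": "b", "int16": "h", "int32": "i", "int64": "q",
    "uint8": "B", "uint16": "H", "uint32": "I", "uint64": "Q",
    "double": "d",
}


def decode_operands(payload: bytes, dtype: str, byteorder: str = ">"):
    """Split a payload into equally sized operands of the given type.

    byteorder is a struct prefix; ">" (big-endian) is an assumption.
    """
    fmt_char = DTYPE_FORMATS[dtype]
    item_size = struct.calcsize(byteorder + fmt_char)
    if len(payload) % item_size != 0:
        raise ValueError("payload length is not a multiple of the operand size")
    count = len(payload) // item_size
    return list(struct.unpack(f"{byteorder}{count}{fmt_char}", payload))
```

For example, a 12-byte payload tagged `"int32"` yields three signed integers, and a 16-byte payload tagged `"double"` yields two floating-point values.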
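In some possible implementations, the RDMA module 110 may determine, according to a first indication field in the header of the RDMA packet, that a collective communication calculation, for example an MPI_reduce or MPI_allreduce calculation, needs to be performed on the data in the payload of the RDMA packet. The first indication field may indicate that a collective communication calculation is to be performed on the data in the payload of the RDMA packet. In this case, the RDMA module 110 may determine the data in the payload of the RDMA packet as the operand described above. Thus, by using the first indication field in the RDMA packet header, it is possible to flexibly indicate, for the payload data of each RDMA packet, whether a collective communication calculation is to be performed on it.
In other possible implementations, a queue pair context (QPC) and/or a memory region (MR) may be used to specify that a collective communication calculation needs to be performed on the data in the payload of the RDMA packet. An MR is a memory region defined in the RDMA protocol, for example, the memory space that the RDMA module uses for receiving or sending. For example, the QPC data structure of queue pair (QP) 1 may contain an indicator bit #1 that indicates that a collective communication operation is to be performed on the data in the payload of the RDMA packet. When indicator bit #1 is valid (even if the packet header contains no indication field), a collective communication calculation is always performed on packets belonging to QP1. As another example, the data structure of MR1 may contain an indicator bit #2 that indicates that a collective communication operation is to be performed on the data in the payload of the RDMA packet. When indicator bit #2 is valid (even if the packet header contains no indication field), a collective communication calculation is always performed on RDMA packets that operate on MR1.
In some possible implementations, the RDMA module 110 may determine, according to a second indication field in the header of the RDMA packet, the calculation operation to be performed on the data in the payload of the RDMA packet (that is, the operand described above), for example, the calculation operation within an MPI_reduce or MPI_allreduce calculation. The second indication field may indicate the calculation operation to be performed on the data in the payload of the RDMA packet. Thus, by using the second indication field in the RDMA packet header, it is possible to flexibly indicate, for the payload data of each RDMA packet, the type of calculation operation to be performed on it.
In other possible implementations, the QPC and/or the MR may be used to specify the calculation operation that needs to be performed on the data in the payload of the RDMA packet. For example, the QPC data structure of QP1 may contain an indicator bit #3 that indicates the calculation operation to be performed on the data in the payload of the RDMA packet. When indicator bit #3 is valid (even if the packet header contains no indication field), the calculation operation is always performed on packets belonging to QP1 according to indicator bit #3. As another example, the data structure of MR1 may contain an indicator bit #4 that indicates the calculation operation to be performed on the data in the payload of the RDMA packet. When indicator bit #4 is valid (even if the packet header contains no indication field), the calculation operation is always performed on RDMA packets that operate on MR1 according to indicator bit #4.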
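For example, the calculation operation on the data in the payload of the RDMA packet may include at least one of an add operation, a max operation, an AND operation, an OR operation, an XOR operation, and a min operation.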
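The six calculation operations listed above can be pictured as a small dispatch table applied element by element to the incoming operand and the local data. The sketch below is only an illustration under stated assumptions: the string keys and the function name `apply_calculation` are hypothetical, and element-wise application is assumed rather than specified by the source.

```python
import operator

# Hypothetical operation names mapped to Python callables.
CALCULATION_OPS = {
    "add": operator.add,
    "max": max,
    "min": min,
    "and": operator.and_,
    "or": operator.or_,
    "xor": operator.xor,
}


def apply_calculation(op_name, operands, local_data):
    """Combine each incoming operand with the local value at the same index."""
    if len(operands) != len(local_data):
        raise ValueError("operand and local data lengths differ")
    op = CALCULATION_OPS[op_name]
    return [op(remote, local) for remote, local in zip(operands, local_data)]
```

Note that the bitwise operations (`and`, `or`, `xor`) are only meaningful for integer operands; in this sketch they raise `TypeError` for floating-point values.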
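FIG. 4 shows an example of the format of an RDMA packet. As shown in FIG. 4, the RDMA packet may include a local router header, a global transport header, a base transport header, an extended transport header, a payload, an invariant CRC, and a variant CRC. In some possible implementations, the first indication field may be included in the opcode of the base transport header, and the second indication field may be implemented through the rsv field in the base transport header, through an added extended transport header, or through data encoding within the payload. This application is not limited thereto.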
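A sketch of how the two indication fields might be read from the base transport header follows. It assumes the 12-byte InfiniBand-style layout (opcode, flags byte, 16-bit partition key, one reserved byte, 24-bit destination QP, and a 32-bit word carrying the 24-bit packet sequence number). The opcode value `HYPOTHETICAL_COLLECTIVE_OPCODE` and the mapping from the reserved byte to an operation are invented for illustration; the source does not define concrete encodings.

```python
import struct

BTH_LENGTH = 12
HYPOTHETICAL_COLLECTIVE_OPCODE = 0xC0  # assumption: not a value given by the source
# Assumption: the reserved (rsv) byte selects the calculation operation.
RSV_TO_OPERATION = {0: "add", 1: "max", 2: "min", 3: "and", 4: "or", 5: "xor"}


def parse_base_transport_header(bth: bytes):
    """Extract the fields used as first and second indication fields."""
    if len(bth) < BTH_LENGTH:
        raise ValueError("base transport header must be at least 12 bytes")
    opcode, _flags, _pkey, rsv, qp_high, qp_low, psn_word = struct.unpack(
        ">BBHBBHI", bth[:BTH_LENGTH]
    )
    needs_collective = opcode == HYPOTHETICAL_COLLECTIVE_OPCODE
    return {
        "opcode": opcode,
        "dest_qp": (qp_high << 16) | qp_low,   # 24-bit destination queue pair
        "psn": psn_word & 0xFFFFFF,            # low 24 bits: packet sequence number
        "needs_collective": needs_collective,  # first indication field
        "operation": RSV_TO_OPERATION.get(rsv) if needs_collective else None,
    }
```

Carrying the operation in the reserved byte is only one of the options the text mentions; an extended transport header or an encoding inside the payload would need a different parser.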
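The RDMA module 110 may communicate with the HA module 120 through the bus. For example, the RDMA module 110 may send an operation command and the operand in the RDMA packet to the HA module 120 through the bus, where the operation command instructs the HA module 120 to perform a calculation operation on the operand. Still referring to FIG. 3, the sending of the operation command and the operand from the RDMA module 110 to the HA module 120 corresponds to the data transfer indicated by arrow (1) in FIG. 3, that is, one bus write operation.
The HA module 120 may communicate with the RDMA module 110 through the bus and with the memory module 130 through a non-bus interface. For example, the HA module 120 may receive the operand and the operation command from the RDMA module 110 through the bus. The HA module 120 may also read local data, for example A1, from the memory module 130 through the non-bus interface. Still referring to FIG. 3, the reading of the local data from the memory module 130 by the HA module 120 corresponds to the data transfer indicated by arrow (2) in FIG. 3.
In some possible implementations, the HA module 120 may also store the local data in advance (for example, before receiving the operation command). This is not limited in the embodiments of this application.
The HA module 120 may also perform the calculation operation. For example, according to the operation command obtained from the RDMA module 110, it may perform the calculation operation on the operand obtained from the RDMA module 110 and the local data from the memory module 130 to obtain a calculation result. After obtaining the calculation result, the HA module 120 may write it to the memory module 130 through the non-bus interface. Still referring to FIG. 3, the writing of the calculation result to the memory module 130 by the HA module 120 corresponds to the data transfer indicated by arrow (3) in FIG. 3.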
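The three transfers (1) to (3) described above can be sketched as a toy software model that counts bus transactions. Everything here is an illustrative assumption: the class names, the dictionary used as memory, the command format, and the choice to write the result back to the same address as the local data. The real modules are hardware.

```python
import operator


class Bus:
    """Counts every transfer that crosses the shared bus."""

    def __init__(self):
        self.transactions = 0

    def transfer(self, message):
        self.transactions += 1
        return message


class MemoryModule:
    def __init__(self, initial):
        self.cells = dict(initial)


class HomeAgent:
    OPS = {"add": operator.add, "max": max, "min": min}

    def __init__(self, memory):
        self.memory = memory  # reached through a private, non-bus interface

    def handle(self, command, operand):
        local = self.memory.cells[command["addr"]]        # transfer (2): non-bus read
        result = self.OPS[command["op"]](operand, local)  # calculation in the HA module
        self.memory.cells[command["addr"]] = result       # transfer (3): non-bus write
        return {"status": "done"}


class RdmaModule:
    def __init__(self, bus, home_agent):
        self.bus = bus
        self.home_agent = home_agent

    def on_packet(self, op, addr, operand):
        command = {"op": op, "addr": addr}
        # Transfer (1): the only bus write in this flow.
        command, operand = self.bus.transfer((command, operand))
        return self.home_agent.handle(command, operand)
```

With local data `A1 = 10` and an incoming operand of 5 under `"add"`, the memory ends up holding 15 after a single counted bus transaction.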
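For example, the HA module 120 may be implemented by a circuit module capable of reading data from the memory module 130, writing data to the memory module 130, and performing calculation operations on data. The HA module 120 may be a hardware module, such as a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. This is not limited.
For example, the memory module 130 may be a dynamic random access memory (DRAM), a synchronous DRAM (SDRAM), a double data rate SDRAM (DDR SDRAM), or the like. This is not limited.
It should be noted that communication between the RDMA module 110 and the HA module 120 through the bus means that data exchanged between them is transmitted over the bus. Communication between the HA module 120 and the memory module 130 through a non-bus interface means that they do not communicate through the bus; that is, data exchanged between the HA module 120 and the memory module 130 does not need to be transmitted over the bus. For example, data between the HA module 120 and the memory module 130 may be transmitted through a private (or dedicated) interface.
Still referring to FIG. 3, as an example, because the HA module 120 in the system architecture 100 is placed close to the memory module 130, the latency of read or write operations by the HA module 120 on the memory module 130 can be shortened.
FIG. 5 is a schematic diagram of HA modules and ordinary processing cores, described using DRAM and non-volatile memory (NVM) as the memory modules and CPUs as the processing cores. As shown in FIG. 5, processing core 1 to processing core n 301 each perform read or write operations on the DRAM 305 through their own level 1 (L1) cache and level 2 (L2) cache and a shared level 3 (L3) cache, and the DRAM 305 can in turn exchange data with the NVM 306. The processing core 302 performs read or write operations on the DRAM 307 through a cache, and the DRAM 307 can in turn exchange data with the NVM 308. Meanwhile, the HA module 303 can read or write the DRAM 307 directly, without going through a cache. In addition, the HA module 304 can read or write the NVM 308 without going through a cache.
Processing core 1 to processing core n 301 and the processing core 302 form a conventional computing system centered on the computing processing cores. Because the memory module and the computing processing cores differ greatly in frequency and in communication/processing speed, they generally do not communicate directly but through the cache module and the bus. The HA module 303 and the HA module 304, as dedicated computing modules, can be connected directly to the memory module through a private interface and matched to its frequency and communication/processing speed, so that they can read or write the memory module without going through a cache or the bus.
Therefore, in the embodiments of this application, the HA module performs a calculation on the local data and the operand in the RDMA packet and writes the calculation result to the memory module through the non-bus interface. This helps avoid the RDMA module reading the local data from the memory module over the bus and writing the calculation result to the memory module over the bus. The embodiments can thus optimize the calculation part in nodes performing collective communication and help reduce inter-node communication latency. Further, because the embodiments can reduce the number of bus accesses, they help reduce bus bandwidth consumption.
For example, when the system architecture 100 in FIG. 3 performs a collective communication algorithm operation, the latency of the data processing includes the data transfers indicated by arrows (1) to (3) in FIG. 3 and the computation delay of the HA module. Because the read and write operations of the HA module 120 on the memory module 130 do not go through the bus, this latency includes only one write operation on the bus, which helps reduce the latency of collective communication algorithm operations.
In some optional embodiments, while writing the calculation result to the memory module 130 through the non-bus interface, or after doing so, the HA module 120 may send a response message corresponding to the operation command described above (that is, the process corresponding to (1) in FIG. 3), where the response message indicates that the calculation operation corresponding to the operation command is complete. Here, the response is the response to one bus write operation, and its latency is included in that bus write operation. Correspondingly, the RDMA module 110 receives the response message and can determine from it that the calculation operation on the operand in the RDMA packet is complete.
After receiving the response message, the RDMA module 110 may notify the CPU 140 that the calculation operation is complete by means of an interrupt or polling. As one example, the RDMA module 110 may send an interrupt to the CPU 140 through the bus, and on receiving the interrupt the CPU 140 can determine that the calculation operation is complete. As another example, the CPU 140 may poll periodically to determine whether the calculation is complete, in which case the RDMA module 110 does not need to report an interrupt.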
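The two notification styles described above, interrupt and polling, can be contrasted with a small software analogy. The class `CompletionStatus` and its methods are hypothetical names used only for this sketch; in the described system the interrupt is a hardware signal sent over the bus and the poll is a CPU read of the RDMA module's status.

```python
class CompletionStatus:
    """Toy stand-in for the completion state kept by the RDMA module."""

    def __init__(self):
        self.done = False
        self._interrupt_handlers = []

    def register_interrupt_handler(self, handler):
        """CPU side: install a callback that runs when an interrupt is raised."""
        self._interrupt_handlers.append(handler)

    def mark_done(self, raise_interrupt=True):
        """RDMA side: called once the response message from the HA module arrives."""
        self.done = True
        if raise_interrupt:
            for handler in self._interrupt_handlers:
                handler()

    def poll(self):
        """CPU side: ask whether the calculation operation has completed."""
        return self.done


# Interrupt style: the CPU is told as soon as the operation completes.
cpu_events = []
interrupt_status = CompletionStatus()
interrupt_status.register_interrupt_handler(lambda: cpu_events.append("interrupt"))
interrupt_status.mark_done()

# Polling style: no interrupt is raised; the CPU checks the status itself.
polled_status = CompletionStatus()
before = polled_status.poll()
polled_status.mark_done(raise_interrupt=False)
after = polled_status.poll()
```

In either style the CPU only learns that the operation finished; it does not read or write the memory module itself.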
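Compared with the data processing solution in FIG. 1, in the data processing solution of the embodiments of this application the latency of the data processing during a collective communication algorithm operation includes only one write operation on the bus, so the latency of the collective communication algorithm operation is greatly reduced and system performance is improved.
In addition, in the data processing solution of the embodiments of this application, the CPU needs only one interrupt or polling operation and does not need to perform read or write operations on the memory module. Compared with the solution shown in FIG. 1, the embodiments of this application can therefore improve the processing efficiency of the CPU.
Compared with the data processing solution in FIG. 2, in the data processing solution of the embodiments of this application the latency of the data processing during a collective communication algorithm operation includes only one write operation on the bus. When collective communication operations must be performed on a large amount of data, the embodiments of this application can significantly reduce the latency of the collective communication algorithm operation and improve system performance.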
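The comparison of the three solutions can be summarized as a simple count of bus operations. The sketch below uses the counts stated in the text (two reads and two writes for FIG. 1, one read and one write for FIG. 2, one write for FIG. 3 plus two non-bus accesses). The timing figures passed in are purely illustrative assumptions, not measurements, and interrupt or polling cost is ignored.

```python
# Bus operation counts taken from the descriptions of FIG. 1, FIG. 2 and FIG. 3.
SCHEMES = {
    "fig1_cpu_computes": {"bus_reads": 2, "bus_writes": 2, "non_bus_ops": 0},
    "fig2_rdma_computes": {"bus_reads": 1, "bus_writes": 1, "non_bus_ops": 0},
    "fig3_ha_computes": {"bus_reads": 0, "bus_writes": 1, "non_bus_ops": 2},
}


def estimated_latency(scheme, t_bus_read, t_bus_write, t_non_bus, t_compute):
    """Sum the per-operation costs for one collective calculation step."""
    counts = SCHEMES[scheme]
    return (
        counts["bus_reads"] * t_bus_read
        + counts["bus_writes"] * t_bus_write
        + counts["non_bus_ops"] * t_non_bus
        + t_compute
    )


# Illustrative numbers only (arbitrary time units).
latencies = {
    name: estimated_latency(name, t_bus_read=100, t_bus_write=100, t_non_bus=10, t_compute=20)
    for name in SCHEMES
}
```

Under these assumed costs the model gives 420, 220, and 140 units respectively; the ordering, not the absolute values, is the point of the sketch.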
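FIG. 6 is a schematic block diagram of a data processing apparatus 600 according to an embodiment of this application. As shown in FIG. 6, the apparatus 600 includes a remote direct memory access (RDMA) module 610, a home agent module 620, and a memory module 630. The RDMA module 610 and the home agent module 620 communicate through a bus, and the home agent module 620 and the memory module 630 communicate through a non-bus interface.
The RDMA module 610 is configured to:
receive an RDMA packet, and obtain an operand in the RDMA packet; and
send a first command and the operand to the home agent module through the bus, where the first command instructs the home agent module to perform a calculation operation on the operand.
The home agent module 620 is configured to:
perform a calculation on the operand and local data according to the first command to obtain a calculation result; and
write the calculation result to the memory module through the non-bus interface.
In some optional embodiments, the home agent module 620 is further configured to read the local data from the memory module through the non-bus interface.
In some optional embodiments, the home agent module 620 is further configured to send a response message corresponding to the first command to the RDMA module through the bus, where the response message indicates that the calculation operation is complete.
In some optional embodiments, the apparatus 600 may further include a central processing unit (CPU). The RDMA module 610 is further configured to report an interrupt to the CPU, where the interrupt indicates completion of the calculation operation.
In some optional embodiments, the apparatus 600 may further include a central processing unit (CPU). The RDMA module 610 is further configured to receive a poll from the CPU and return the completion of the calculation operation.
In some optional embodiments, the RDMA module 610 is specifically configured to determine, according to a first indication field in the header of the RDMA packet, that a collective communication calculation needs to be performed on the data in the payload of the RDMA packet, and to determine the data in the payload as the operand.
In some optional embodiments, the collective communication calculation includes a collective communication reduce calculation, or a collective communication reduce-and-broadcast calculation.
In some optional embodiments, the RDMA module 610 is further configured to determine, according to a second indication field in the header of the RDMA packet, the calculation operation that needs to be performed on the operand, and to generate the first command according to the calculation operation.
In some optional embodiments, the calculation operation includes at least one of an add operation, a max operation, an AND operation, an OR operation, an XOR operation, and a min operation.
In some optional embodiments, the RDMA module is an RDMA network interface controller.
The functions and actions of the modules or units in the apparatus 600 above are merely examples. For details, refer to the descriptions of the modules or units in the system 100 above; they are not repeated here.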
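FIG. 7 is a schematic flowchart of a data processing method 700 according to an embodiment of this application. The method 700 may be applied to an apparatus including a remote direct memory access (RDMA) module, a home agent module, and a memory module, for example, the apparatus 600 or the system architecture 100 described above. The method 700 includes the following steps 710 to 730.
710: The home agent module receives, through a bus, a first command sent by the RDMA module and an operand in an RDMA packet, where the first command instructs the home agent module to perform a calculation operation on the operand.
720: The home agent module performs a calculation on the operand and local data according to the first command to obtain a calculation result.
730: The home agent module writes the calculation result to the memory module through a non-bus interface.
In some optional embodiments, the method 700 may further include: the RDMA module receives the RDMA packet and obtains the operand in the RDMA packet; and
the RDMA module sends the first command and the operand to the home agent module through the bus.
In some optional embodiments, the method 700 may further include: the home agent module reads the local data from the memory module through the non-bus interface.
In some optional embodiments, the method 700 further includes: the home agent module sends a response message corresponding to the first command to the RDMA module through the bus, where the response message indicates that the calculation operation is complete.
In some optional embodiments, the method 700 further includes: after receiving the response message, the RDMA module reports an interrupt to the CPU, where the interrupt indicates completion of the calculation operation.
In some optional embodiments, the method 700 further includes: the RDMA module receives a poll from the CPU and returns the completion of the calculation operation.
In some optional embodiments, the method 700 further includes: the RDMA module determines, according to a first indication field in the header of the RDMA packet, that a collective communication calculation needs to be performed on the data in the payload of the RDMA packet; and the RDMA module determines the data in the payload as the operand.
In some optional embodiments, the collective communication calculation includes a collective communication reduce calculation, or a collective communication reduce-and-broadcast calculation.
In some optional embodiments, the method 700 further includes: the RDMA module determines, according to a second indication field in the header of the RDMA packet, the calculation operation that needs to be performed on the operand; and the RDMA module generates the first command according to the calculation operation.
In some optional embodiments, the calculation operation includes at least one of an add operation, a max operation, an AND operation, an OR operation, an XOR operation, and a min operation.
In some optional embodiments, the RDMA module is an RDMA network interface controller.
The steps or processes included in the method 700 above are merely examples. For details of the method 700, refer to the descriptions of the modules or units in the system 100 above; they are not repeated here.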
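It should be understood that, in the various embodiments of this application, the magnitude of the sequence numbers of the above processes does not imply an order of execution. The execution order of each process should be determined by its function and internal logic and should not constitute any limitation on the implementation of the embodiments of this application.
It should also be understood that, in the embodiments of this application, "first", "second", and the various numeric labels are used only for ease of description and are not intended to limit the scope of the embodiments of this application.
A person of ordinary skill in the art may be aware that the units and algorithm steps of the examples described with reference to the embodiments disclosed herein can be implemented by electronic hardware or by a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such an implementation should not be considered beyond the scope of this application.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For example, the division into units is merely a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of this application. The storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing is merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any variation or replacement readily conceivable by a person skilled in the art within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.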

Claims (20)

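  1. A data processing apparatus, comprising a remote direct memory access (RDMA) module, a home agent module, and a memory module, wherein the RDMA module and the home agent module communicate through a bus, and the home agent module and the memory module communicate through a non-bus interface;
    the RDMA module is configured to:
    receive an RDMA packet, and obtain an operand in the RDMA packet; and
    send a first command and the operand to the home agent module through the bus, wherein the first command instructs the home agent module to perform a calculation operation on the operand; and
    the home agent module is configured to:
    perform a calculation on the operand and local data according to the first command to obtain a calculation result; and
    write the calculation result to the memory module through the non-bus interface.
  2. The apparatus according to claim 1, wherein the home agent module is further configured to read the local data from the memory module through the non-bus interface.
  3. The apparatus according to claim 1 or 2, wherein
    the home agent module is further configured to send a response message corresponding to the first command to the RDMA module through the bus, wherein the response message indicates that the calculation operation is complete.
  4. The apparatus according to claim 3, further comprising a central processing unit (CPU), wherein
    the RDMA module is further configured to report an interrupt to the CPU after receiving the response message, wherein the interrupt indicates completion of the calculation operation.
  5. The apparatus according to claim 3, further comprising a CPU, wherein
    the RDMA module is further configured to receive a poll from the CPU and return the completion of the calculation operation.
  6. The apparatus according to any one of claims 1 to 5, wherein the RDMA module is specifically configured to:
    determine, according to a first indication field in a header of the RDMA packet, that a collective communication calculation needs to be performed on data in a payload of the RDMA packet; and
    determine the data in the payload as the operand.
  7. The apparatus according to claim 6, wherein the collective communication calculation comprises a collective communication reduce calculation, or a collective communication reduce-and-broadcast calculation.
  8. The apparatus according to any one of claims 1 to 7, wherein the RDMA module is further configured to:
    determine, according to a second indication field in the header of the RDMA packet, the calculation operation that needs to be performed on the operand; and
    generate the first command according to the calculation operation.
  9. The apparatus according to any one of claims 1 to 8, wherein the calculation operation comprises at least one of an add operation, a max operation, an AND operation, an OR operation, an XOR operation, and a min operation.
  10. The apparatus according to any one of claims 1 to 9, wherein the RDMA module is an RDMA network interface controller.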
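  11. A data processing method, comprising:
    receiving, by a home agent module through a bus, a first command sent by an RDMA module and an operand in an RDMA packet, wherein the first command instructs the home agent module to perform a calculation operation on the operand;
    performing, by the home agent module, a calculation on the operand and local data according to the first command to obtain a calculation result; and
    writing, by the home agent module, the calculation result to a memory module through a non-bus interface.
  12. The method according to claim 11, further comprising:
    reading, by the home agent module, the local data from the memory module through the non-bus interface.
  13. The method according to claim 11 or 12, further comprising:
    sending, by the home agent module, a response message corresponding to the first command to the RDMA module through the bus, wherein the response message indicates that the calculation operation is complete.
  14. The method according to claim 13, further comprising:
    reporting, by the RDMA module after receiving the response message, an interrupt to a central processing unit (CPU), wherein the interrupt indicates completion of the calculation operation.
  15. The method according to claim 13, further comprising:
    receiving, by the RDMA module, a poll from a CPU and returning the completion of the calculation operation.
  16. The method according to any one of claims 11 to 15, further comprising:
    determining, by the RDMA module according to a first indication field in a header of the RDMA packet, that a collective communication calculation needs to be performed on data in a payload of the RDMA packet; and
    determining, by the RDMA module, the data in the payload as the operand.
  17. The method according to claim 16, wherein the collective communication calculation comprises a collective communication reduce calculation, or a collective communication reduce-and-broadcast calculation.
  18. The method according to any one of claims 11 to 17, further comprising:
    determining, by the RDMA module according to a second indication field in the header of the RDMA packet, the calculation operation that needs to be performed on the operand; and
    generating, by the RDMA module, the first command according to the calculation operation.
  19. The method according to any one of claims 11 to 18, wherein the calculation operation comprises at least one of an add operation, a max operation, an AND operation, an OR operation, an XOR operation, and a min operation.
  20. The method according to any one of claims 11 to 19, wherein the RDMA module is an RDMA network interface controller.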
PCT/CN2020/117414 2020-09-24 2020-09-24 数据处理的装置和方法 WO2022061646A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202080103887.0A CN116171429A (zh) 2020-09-24 2020-09-24 数据处理的装置和方法
EP20954498.0A EP4206932A4 (en) 2020-09-24 2020-09-24 DATA PROCESSING DEVICE AND METHOD
PCT/CN2020/117414 WO2022061646A1 (zh) 2020-09-24 2020-09-24 数据处理的装置和方法

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/117414 WO2022061646A1 (zh) 2020-09-24 2020-09-24 数据处理的装置和方法

Publications (1)

Publication Number Publication Date
WO2022061646A1 true WO2022061646A1 (zh) 2022-03-31

Family

ID=80846041

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/117414 WO2022061646A1 (zh) 2020-09-24 2020-09-24 数据处理的装置和方法

Country Status (3)

Country Link
EP (1) EP4206932A4 (zh)
CN (1) CN116171429A (zh)
WO (1) WO2022061646A1 (zh)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104871493A (zh) * 2012-12-18 2015-08-26 国际商业机器公司 高性能计算(hpc)网络中的通信信道故障切换
CN106537367A (zh) * 2014-09-09 2017-03-22 英特尔公司 用于基于代理的多线程消息传递通信的技术
CN108027794A (zh) * 2015-09-24 2018-05-11 英特尔公司 用于在私有高速缓存中使用直接数据放置进行自动处理器核关联管理和通信的技术
EP3198467B1 (en) * 2014-09-24 2020-07-29 Intel Corporation System, method and apparatus for improving the performance of collective operations in high performance computing
CN111611125A (zh) * 2019-02-26 2020-09-01 英特尔公司 用于改善高性能计算应用的性能数据收集的方法与设备

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050091334A1 (en) * 2003-09-29 2005-04-28 Weiyi Chen System and method for high performance message passing
US10891253B2 (en) * 2016-09-08 2021-01-12 Microsoft Technology Licensing, Llc Multicast apparatuses and methods for distributing data to multiple receivers in high-performance computing and cloud-based networks
CN111459418B (zh) * 2020-05-15 2021-07-23 Nanjing University RDMA-based transmission method for a key-value storage system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
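CN104871493A (zh) * 2012-12-18 2015-08-26 International Business Machines Corporation Communication channel failover in a high-performance computing (HPC) network
CN106537367A (zh) * 2014-09-09 2017-03-22 Intel Corporation Technologies for proxy-based multi-threaded message passing communication
EP3198467B1 (en) * 2014-09-24 2020-07-29 Intel Corporation System, method and apparatus for improving the performance of collective operations in high performance computing
CN108027794A (zh) * 2015-09-24 2018-05-11 Intel Corporation Technologies for automatic processor core association management and communication using direct data placement in private caches
CN111611125A (zh) * 2019-02-26 2020-09-01 Intel Corporation Method and apparatus for improving performance data collection for high-performance computing applications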

Also Published As

Publication number Publication date
EP4206932A1 (en) 2023-07-05
CN116171429A (zh) 2023-05-26
EP4206932A4 (en) 2023-11-01

Similar Documents

Publication Publication Date Title
US10360098B2 (en) High performance interconnect link layer
US10204064B2 (en) Multislot link layer flit wherein flit includes three or more slots whereby each slot comprises respective control field and respective payload field
CN110647480B (zh) Data processing method, remote direct memory access network card, and device
US10380059B2 (en) Control messaging in multislot link layer flit
US11366773B2 (en) High bandwidth link layer for coherent messages
JP6433146B2 (ja) Information processing apparatus, system, information processing method, and computer program
US11816052B2 (en) System, apparatus and method for communicating telemetry information via virtual bus encodings
WO2017101080A1 (zh) Method for processing a write request, processor, and computer
US10936048B2 (en) System, apparatus and method for bulk register accesses in a processor
EP4357901A1 (en) Data writing method and apparatus, data reading method and apparatus, and device, system and medium
WO2022061646A1 (zh) Data processing apparatus and method
WO2022178675A1 (zh) Interconnection system, data transmission method, and chip
US20190012282A1 (en) Information processing system, information processing device, and control method of information processing system
WO2024077999A1 (zh) Collective communication method and computing cluster
US12093754B2 (en) Processor, information processing apparatus, and information processing method
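WO2023179741A1 (zh) Computing system and data transmission method
WO2023093065A1 (zh) Data transmission method, computing device, and computing system
WO2024193142A1 (zh) Storage apparatus, method, device, and storage system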

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 20954498

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020954498

Country of ref document: EP

Effective date: 20230327

NENP Non-entry into the national phase

Ref country code: DE