CN115989478A - Data operation method and device - Google Patents

Data operation method and device

Info

Publication number
CN115989478A
CN115989478A (application CN202080103371.6A)
Authority
CN
China
Prior art keywords
data
mpi
network card
memory
message
Prior art date
Legal status
Pending
Application number
CN202080103371.6A
Other languages
Chinese (zh)
Inventor
石达清
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN115989478A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00: Methods or arrangements for processing data by operating upon the order or content of the data handled

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Computer And Data Communications (AREA)

Abstract

The application provides a data operation method and device in the field of communications technologies, used to reduce the delay of MPI operations and improve MPI execution efficiency. The method is applied to a network card that is coupled to a memory through a bus, and includes: receiving a first message, where the first message includes operation indication information and first data; determining, according to the operation indication information, the data operation of the MPI operation that needs to be performed on the first data; obtaining second data from the memory, where the second data is the local data of that data operation in the MPI operation; and completing the data operation on the first data and the second data to obtain a first operation result.

Description

Data operation method and device
Technical Field
The present application relates to the field of communications technologies, and in particular, to a data operation method and apparatus.
Background
With the rapid development of high-performance computing (HPC) and artificial intelligence (AI) scenarios, the execution efficiency of Message Passing Interface (MPI) collective communication functions has become increasingly important. The MPI collective communication functions include reduction functions such as MPI_reduce; reduction functions account for roughly 40% of typical MPI application scenarios, so improving the execution efficiency of the MPI reduction functions yields a correspondingly large gain in overall MPI application performance. An MPI reduction function can be decomposed into three parts: computation, synchronization, and communication. This application is directed to optimizing the computation part of the MPI reduction function.
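For orientation, the following minimal host-side example shows the standard MPI_Allreduce call (standard MPI C API) whose computation part such an offload would accelerate; it is illustrative only and not code from this application.

```c
#include <mpi.h>
#include <stdio.h>

/* Minimal illustration of the MPI_Allreduce collective whose computation
 * part the described offload targets; this uses the standard MPI C API and
 * is not code from this application. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = (double)rank;   /* each process contributes one value */
    double sum = 0.0;

    /* Every process obtains the sum of all processes' contributions. */
    MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: sum = %f\n", rank, sum);
    MPI_Finalize();
    return 0;
}
```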
In the prior art, the computation part of an MPI operation is usually offloaded via the send queue (SQ) tasks of a server's external network card. Specifically, as shown in FIG. 1, the procedure includes: S1, the network card writes data A1 from a network packet into a dynamic random access memory (DRAM); S2, the network card schedules the SQs, and when the selected SQ's task is a reduce operation performing a specified operation on the A1 data, the network card reads the data A1 of the packet from the DRAM; S3, the network card reads local data A2 from the DRAM; and S4, the network card completes the operation on data A1 and data A2 and writes the operation result into the DRAM.
In this method, the network card can execute the data operation only after the SQ corresponding to the computation task has been scheduled. When the networking scale is large, the number of SQs handled by the network card is large, and the delay before the network card schedules the SQ corresponding to the computation task grows, so the delay of the MPI operation is large.
Disclosure of Invention
The application provides a data operation method and a data operation device, which are used for reducing the delay of MPI operation and improving the execution efficiency of the MPI operation.
In order to achieve the above purpose, the embodiments of the present application adopt the following technical solutions:
In a first aspect, a data operation method in an MPI operation is provided, applied to a network card that is coupled to a memory through a bus. The method includes: receiving a first message, where the first message may be sent by another server in the network that is executing a Message Passing Interface (MPI) operation, and the first message includes operation indication information and first data; determining, according to the operation indication information, the data operation of the MPI operation that needs to be performed on the first data; obtaining second data from the memory, where the second data is the local data of that data operation in the MPI operation; completing the data operation (for example, addition or multiplication) on the first data and the second data in the MPI operation to obtain a first operation result; and further, writing the first operation result into the memory.
In the above technical solution, when the network card receives the first message and obtains the operation indication information and the first data in it, the network card can determine from the operation indication information that a data operation needs to be performed on the first data. It therefore directly obtains from the memory the local data of that data operation in the MPI operation, namely the second data, and completes the data operation on the first data and the second data to obtain the first operation result. Compared with the prior art, the network card does not need to first write the first data into the memory; instead, it obtains the second data as soon as it obtains the first data, that is, it performs on-path (inline) computation on the first data and the second data. This reduces the number of memory reads and writes, reduces the delay of the MPI operation, and improves MPI execution efficiency.
In one possible implementation of the first aspect, the method includes: receiving a first message, where the first message includes first data; when the first message carries operation indication information, determining that a data operation of a Message Passing Interface (MPI) operation needs to be performed on the first data, and obtaining second data from the memory, where the second data is the local data of that operation in the MPI operation; and completing the MPI operation on the first data and the second data to obtain a first operation result. It should be understood that the scheme may further include writing the first operation result into the memory.
It should be understood that the operation indication information may be carried in a header of the first packet, for example by extending the header of an existing packet format and carrying the operation indication information in the extended header.
In a possible implementation of the first aspect, the first message further includes a storage address of the second data, and obtaining the second data from the memory includes: reading the second data from the memory according to the storage address. Further, writing the first operation result into the memory includes: storing the first operation result at the storage location of the second data according to the storage address of the second data, so as to overwrite the second data. In this possible implementation, stale data is prevented from occupying storage space in the memory, which improves memory utilization.
In one possible implementation of the first aspect, the network card, the memory, and the bus are integrated in a system on chip (SoC). In this possible implementation, integrating the network card, the memory, and the bus in the SoC reduces end-to-end transmission delay, further improving the execution efficiency of the data operation in the MPI operation.
In a possible implementation of the first aspect, the operation indication information includes an operation type and a data type. In this possible implementation, the data operation of the MPI operation that needs to be performed on the first data can be determined from the operation type and the data type. When the network card obtains this information, it does not need to write the first data into the memory, but directly obtains the second data and performs the data operation on the first data and the second data, which reduces the number of memory reads and writes, reduces the delay of the MPI operation, and improves MPI execution efficiency.
In a possible implementation of the first aspect, the operation indication information is carried in a header of the first packet. This possible implementation provides a simple and effective way of carrying the operation indication information.
In one possible implementation of the first aspect, the MPI operation includes an MPI_reduce operation or an MPI_allreduce operation. In this possible implementation, the delay of the MPI_reduce or MPI_allreduce operation can be reduced, thereby improving the execution efficiency of the MPI_reduce or MPI_allreduce operation.
In a possible implementation of the first aspect, the network card is further coupled to a processor through the bus, and the method further includes: sending notification information to the processor, where the notification information indicates that the data operation has been completed. In this possible implementation, sending the notification information to the processor keeps the state of the MPI operation recorded by the processor consistent with the actual state of the MPI operation, ensuring that the MPI operation is executed in an orderly and efficient manner.
In a second aspect, a data operation device is provided, where the device is a network card or a chip built into the network card, and the network card is coupled to a memory through a bus. The device includes: a receiving unit, configured to receive a first packet from a network; a processing unit, configured to parse the first packet to obtain operation indication information and first data included in the first packet, where the operation indication information indicates the data operation of the Message Passing Interface (MPI) operation that needs to be performed on the first data; and an obtaining unit, configured to obtain second data from the memory, where the second data is the local data of that data operation in the MPI operation. The processing unit is further configured to complete the data operation on the first data and the second data to obtain a first operation result. Further, the device may also include a writing unit, configured to write the first operation result into the memory.
In one possible implementation of the second aspect, the device includes: a receiving unit, configured to receive a first packet, where the first packet includes first data; a processing unit, configured to determine, when the first packet carries operation indication information, the data operation of the Message Passing Interface (MPI) operation that needs to be performed on the first data; and an obtaining unit, configured to obtain second data from the memory, where the second data is the local data of that operation in the MPI operation. The processing unit is further configured to complete the MPI operation on the first data and the second data to obtain a first operation result. It should be understood that the scheme may also include writing the first operation result into the memory, and that the operation indication information may be carried in a header of the first packet, for example by extending the header of an existing packet format and carrying the operation indication information in the extended header.
In a possible implementation of the second aspect, the first message further includes a storage address of the second data, and the obtaining unit is further configured to read the second data from the memory according to the storage address. Further, the writing unit is further configured to store the first operation result at the storage location of the second data according to the storage address of the second data, so as to overwrite the second data.
In one possible implementation of the second aspect, the network card, the memory, and the bus are integrated in a system on chip (SoC).
In a possible implementation manner of the second aspect, the operation indication information includes: an operation type and a data type.
In a possible implementation manner of the second aspect, the operation indication information is carried in a header of the first packet.
In a possible implementation of the second aspect, the MPI operation corresponding to the first data includes an MPI_reduce operation or an MPI_allreduce operation.
In a possible implementation manner of the second aspect, the network card is further coupled to the processor through a bus, and the apparatus further includes: and the sending unit is used for sending notification information to the processor, wherein the notification information is used for indicating that the data operation is completed.
In a third aspect, a data operation device is provided, where the device is a network card or a chip built into the network card, the network card is coupled to a memory through a bus, the memory stores code and data, and the network card runs the code in the memory to enable the device to execute the data operation method provided by the first aspect or any one of its possible implementations.
In another aspect of the present application, a computer-readable storage medium is provided, in which instructions are stored, which when executed on a computer, cause the computer to perform the data operation method provided by the first aspect or any one of the possible implementation manners of the first aspect.
In another aspect of the present application, a computer program product is provided, which is characterized in that when the computer program product is run on a device, the device is caused to execute the data operation method provided by the first aspect or any one of the possible implementation manners of the first aspect.
It is understood that any one of the data operation devices, the computer storage media or the computer program products provided above is used for executing the corresponding method provided above, and therefore, the beneficial effects achieved by the data operation devices, the computer storage media or the computer program products can refer to the beneficial effects in the corresponding methods provided above, and are not described herein again.
Drawings
FIG. 1 is a schematic diagram of an MPI operation;
FIG. 2 is a schematic diagram of an MPI operation provided by an embodiment of the present application;
fig. 3a is a schematic structural diagram of a server according to an embodiment of the present application;
fig. 3b is a schematic structural diagram of another server provided in the embodiment of the present application;
fig. 4 is a schematic flowchart of a data operation method according to an embodiment of the present application;
FIG. 5 is a schematic flowchart illustrating another data operation method according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an MPI operation provided by an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a data operation device according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of another data operation device according to an embodiment of the present disclosure.
Detailed Description
In this application, "at least one" means one or more, and "a plurality of" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate that only A exists, that both A and B exist, or that only B exists, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single items or plural items; for example, at least one of a, b, or c may represent a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple. In addition, the words "first", "second", and the like in the embodiments of this application are used to distinguish between similar items or items having substantially the same function or effect; for example, a first threshold and a second threshold are merely used to distinguish different thresholds, without limiting their order. Those skilled in the art will appreciate that the words "first", "second", and the like do not limit a quantity or an order of execution.
It is noted that the words "exemplary" or "such as" are used herein to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "e.g.," is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present relevant concepts in a concrete fashion.
Before describing the embodiments of the present application, first, related terms referred to in the embodiments of the present application will be described.
The Message Passing Interface (MPI) is a message-passing programming interface that provides a multi-language function library implementing a series of MPI interfaces. The MPI standard defines a set of functions that allow an application to pass messages from one MPI process to another.
MPI collective communication refers to communication that implements different functions through MPI; these functions are called MPI collective communication functions and include reduction functions such as MPI_Reduce and MPI_Allreduce. MPI_Reduce and MPI_Allreduce are standard collective communication functions defined by the MPI standard; the difference between them is that with MPI_Reduce only one designated process node in the communication domain obtains the final computation result, whereas with MPI_Allreduce every process node in the communication domain obtains the final computation result.
MPI collective communication may also be referred to generally as an MPI operation, and can usually be decomposed into three parts: synchronization, computation, and communication. Synchronization refers to synchronization and information exchange between different operation processes, or between tasks of different steps within the same process; computation refers to performing a specified operation on the input data within each process; and communication refers to the transfer of data between different nodes in the communication domain. For ease of description, MPI collective communication is referred to herein collectively as an MPI operation.
For example, as shown in FIG. 2, assume that an MPI_allreduce operation is performed in a communication domain whose networking scale is 8 nodes (denoted P0 to P7), using the recursive doubling algorithm. Each node in the communication domain then only needs to perform 3 rounds of send-and-receive communication, and when all nodes have completed the 3 rounds, the MPI_allreduce operation is complete. The specific implementation may include the following steps S01 to S03; an illustrative sketch of this exchange pattern follows Table 1 below.
S01: Nodes at distance 1 exchange their 1/8 of the data with each other and perform the reduction operation; as a result, each node holds the reduction result of 1/4 of the data. For example, as shown in Table 1 below, data A and B are exchanged between P0 and P1, data C and D between P2 and P3, data E and F between P4 and P5, and data G and H between P6 and P7. Each node then performs the addition, so P0 and P1 obtain A+B, P2 and P3 obtain C+D, P4 and P5 obtain E+F, and P6 and P7 obtain G+H.
S02: Nodes at distance 2 exchange their 1/4 of the data with each other and perform the reduction operation; as a result, each node holds the reduction result of 1/2 of the data. For example, as shown in Table 1 below, data A+B and C+D are exchanged between P0 and P2 and between P1 and P3, and data E+F and G+H are exchanged between P4 and P6 and between P5 and P7. Each node then performs the addition, so P0 to P3 obtain A+B+C+D, and P4 to P7 obtain E+F+G+H.
S03: Nodes at distance 4 exchange their 1/2 of the data with each other and perform the reduction operation; as a result, each node holds the reduction result of all the data. For example, as shown in Table 1 below, data A+B+C+D and E+F+G+H are exchanged between P0 and P4, between P1 and P5, between P2 and P6, and between P3 and P7. Each node then performs the addition, so P0 to P7 all obtain A+B+C+D+E+F+G+H.
TABLE 1 (reproduced as an image in the original publication: the data held by each of nodes P0 to P7 initially and after steps S01, S02, and S03 described above)
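The exchange schedule of steps S01 to S03 can be summarized by the following sketch of the recursive doubling pattern; the helper exchange_and_reduce() is a hypothetical placeholder for one round of send/receive plus the element-wise reduction, and the code is illustrative rather than part of this application.

```c
#include <stddef.h>

/* Illustrative recursive doubling schedule for an allreduce over nprocs
 * ranks (assumed to be a power of two). exchange_and_reduce() is a
 * hypothetical helper standing in for one send/receive round plus the
 * element-wise reduction described in steps S01 to S03. */
void exchange_and_reduce(int my_rank, int partner, double *buf, size_t n);

void allreduce_recursive_doubling(int my_rank, int nprocs,
                                  double *buf, size_t n) {
    for (int distance = 1; distance < nprocs; distance <<= 1) {
        int partner = my_rank ^ distance;   /* distance 1, then 2, then 4 */
        /* After the k-th round each rank holds the reduction over a group
         * of 2^k ranks, matching S01 (1/4), S02 (1/2), S03 (all) above. */
        exchange_and_reduce(my_rank, partner, buf, n);
    }
}
```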
FIG. 3a and FIG. 3b are schematic structural diagrams of two exemplary servers provided in an embodiment of this application. Each server may include a memory 301, a processor 302, a network card 303, and a bus 304, where the memory 301, the processor 302, and the network card 303 are connected to one another through the bus 304.
The memory 301 may be used to store data, software programs, and modules, and mainly includes a program storage area and a data storage area. The program storage area may store an operating system, an application program required by at least one function, and the like; the data storage area may store data created during use of the device, and the like. For example, the operating system may be a Linux, Unix, or Windows operating system; the application (APP) required by the at least one function may be an artificial intelligence (AI) related APP, a high-performance computing (HPC) related APP, a deep learning related APP, a computer graphics (CG) related APP, or the like. In one possible example, the memory 301 includes, but is not limited to, static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), or high-speed random access memory (high-speed RAM). Further, the memory 301 may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
In addition, the processor 302 is configured to control and manage the operation of the server, for example by running or executing the software programs and/or modules stored in the memory 301 and invoking the data stored in the memory 301, so as to perform the various functions of the server and process data. In one possible example, the processor 302 includes, but is not limited to, a central processing unit (CPU), a network processing unit (NPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, transistor logic, or any combination thereof, which may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with this disclosure. The processor 302 may also be a combination that performs computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor.
The network card 303 may be used to implement communication between the server and an external network, for example, the network card 303 may be an intelligent network interface card (smart NIC). In some possible embodiments, the network card 303 may support a Remote Direct Memory Access (RDMA) mode, for example, the network card 303 receives a message from a network in the RDMA mode and sends the message to other devices in the network in the RDMA mode. The network card 303 may store the received packet in the memory 301 in an RDMA manner.
The bus 304 may include an Extended Industry Standard Architecture (EISA) bus and/or a peripheral component interconnect express (PCIe) bus, among others. The bus 304 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 3a and FIG. 3b, but this does not indicate only one bus or one type of bus.
In this embodiment of the application, as shown in FIG. 3a, the memory 301, the processor 302, and the network card 303 may all be integrated in a system on chip (SoC) of the server. Alternatively, as shown in FIG. 3b, the memory 301 and the processor 302 may be integrated in the SoC of the server, while the network card 303 is an external network card connected to the SoC through an external bus.
Fig. 4 is a schematic flowchart of a data operation method provided in an embodiment of the present application, where the method may be executed by the network card in the server provided above, and the method includes the following steps.
S401: The network card receives a first message, where the first message includes operation indication information and first data, and determines, according to the operation indication information, the data operation of the MPI operation that needs to be performed on the first data.
The server may be any server in a communication domain that includes a plurality of servers jointly executing the MPI operation. The servers may send messages to one another through a network (for example, Ethernet), where the messages include the data used for the data operations in the MPI operation. For example, the plurality of servers includes a first server and a second server; the server here may be the first server, which may receive a first message sent by the second server and may also send a second message to the second server. The first message and the second message may have the same format and differ only in the data they carry; the first message is used as the example in the following description.
In addition, the first packet may be a packet based on the RDMA over Converged Ethernet (RoCE) protocol. The first packet may include a packet header and a payload; the operation indication information may be carried in the packet header of the first packet, and the first data may be carried in the payload. For example, on the basis of an existing RoCE protocol packet, an extension header is added and the operation indication information is carried in this extension header: a 4-bit extension header reduce_eth is added to the standard RDMA transport header field, where 1 bit may be used to indicate the data type reduce_type (for example, the data type may be int8, int16, int32, uint8, uint16, uint32, FP16, or FP32), another 1 bit is used to indicate the operation type reduce_code (for example, max, min, sum, and the like), and the remaining 2 bits may be reserved. Accordingly, when the operation indication information is carried in the extension header, the RDMA-based software-hardware interface may also add a corresponding work request (WR) type for reading or writing the extension header.
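Purely as an illustration of the description above, the reduce_eth extension header could be modeled as follows; the field names, widths, and encodings are assumptions made for the example, not a normative layout defined by this application or by the RoCE specification.

```c
/* Illustrative layout of the 4-bit reduce_eth extension header described
 * above. Field names, widths, and encodings are assumptions made for this
 * example; the text only states that the header carries a data type, an
 * operation type, and reserved bits. */
struct reduce_eth {
    unsigned int reduce_type : 1;  /* data type selector, e.g. int32 vs FP32 */
    unsigned int reduce_code : 1;  /* operation selector, e.g. sum vs max    */
    unsigned int reserved    : 2;  /* reserved bits                          */
};

/* Possible (assumed) encodings for the two selector bits. */
enum { REDUCE_TYPE_INT32 = 0, REDUCE_TYPE_FP32 = 1 };
enum { REDUCE_CODE_SUM   = 0, REDUCE_CODE_MAX  = 1 };
```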
The MPI operation corresponding to the first data may be any MPI operation that includes a data operation; for example, the MPI operation may be an MPI_reduce operation or an MPI_allreduce operation.
Optionally, the operation indication information may include the operation type and the data type of the data operation in the MPI operation. For example, the operation type may be addition, subtraction, multiplication, or the like, and the data type may be half-precision floating point, single-precision floating point, double-precision floating point, integer, or the like. For descriptions of the specific operation types and data types, and for the related description of the MPI operation above, reference may be made to the related art; details are not repeated here.
Specifically, when the server is performing the MPI operation, a processor (for example, a CPU) in the server may send a data operation task to the network card. The network card of the server may subsequently receive the first message sent by another server in the network and parse it to obtain the operation indication information and the first data included in the first message. Once the operation indication information has been parsed out, the network card can determine, according to the operation indication information, that the data operation of the MPI operation needs to be performed on the first data. For example, if the operation indication information includes the operation type and data type of an MPI_reduce operation, the network card can determine from that operation type and data type the data operation of the MPI_reduce operation required on the first data.
S402: the network card retrieves the second data from the memory.
The memory may be an internal memory, for example a dynamic random access memory (DRAM). The second data may be local data stored in the DRAM for the data operation of the MPI operation. The data type of the second data may be the same as that of the first data; for example, both are the data type indicated by the operation indication information in the first message.
In addition, the storage address of the second data may be carried in the first message. Specifically, after the network card receives and parses the first message, it can obtain the storage address of the second data from the first message, and can then obtain the second data from the server's memory based on that storage address.
S403: the network card completes data operation of the first data and the second data to obtain a first operation result.
When the network card has obtained the first data and the second data, it may perform the data operation on them based on the operation indication information to obtain the first operation result. For example, if the operation type indicated by the operation indication information is addition and the data type is floating point, the network card adds the first data and the second data according to the addition rule for floating-point numbers to obtain the first operation result; if the operation type is multiplication and the data type is floating point, the network card multiplies the first data and the second data according to the multiplication rule for floating-point numbers to obtain the first operation result.
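A minimal sketch of the element-wise computation the network card could perform once the operation type and data type are known is shown below; the type and function names are illustrative assumptions, not an interface defined by this application.

```c
#include <stddef.h>

/* Illustrative element-wise reduction on the network card: combine the first
 * data (from the packet payload) with the second data (read from memory)
 * according to the operation type carried in the operation indication
 * information. The enum and function names are assumptions for illustration. */
enum op_type { OP_SUM, OP_PROD, OP_MAX };

static void reduce_fp32(enum op_type op, const float *first,
                        const float *second, float *result, size_t n) {
    for (size_t i = 0; i < n; i++) {
        switch (op) {
        case OP_SUM:  result[i] = first[i] + second[i]; break;
        case OP_PROD: result[i] = first[i] * second[i]; break;
        case OP_MAX:  result[i] = (first[i] > second[i]) ? first[i] : second[i]; break;
        }
    }
}
```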
Further, as shown in fig. 5, after S403, the method further includes: and S404.
S404: the network card stores the first operation result in the memory.
Specifically, when the network card obtains the first operation result, the network card may store the first operation result in a memory of the server, for example, the network card stores the first operation result in a DRAM included in the memory. Optionally, the storage address of the first operation result may be the same as the storage address of the second data, that is, the network card may store the first operation result in the storage location where the second data is located according to the storage address of the second data to overwrite the second data.
Optionally, as shown in fig. 5, after S404, the method further includes: and S405.
S405: the network card sends notification information to the processor, wherein the notification information is used for indicating that the data operation is completed.
Specifically, after the network card stores the first operation result in the memory, the network card may send notification information to the processor, where the notification information indicates that the data operation has been completed. When the processor receives the notification information, it can determine that the data operation is complete and synchronize the state information associated with the MPI operation, so that the actual state of the MPI operation is consistent with the recorded state. Optionally, the processor may further send the next task to the network card so that the network card continues executing the corresponding tasks.
Further, the processor may divide the data operation in the MPI operation into a plurality of data operation tasks and send them to the network card one by one in order, that is, after the previous data operation task has been completed, the next data operation task is sent to the network card, until all of the data operation tasks are completed. For each of these data operation tasks, the network card executes it according to the method provided above.
For example, for the MPI operation shown in FIG. 2, the data operation in the MPI operation may include three data operation tasks. Taking the server P0 as an example, the network card completes the MPI operation by performing three data operations in sequence. Specifically, the processor first sends the task of the data operation A+B to the network card, and the network card executes the A+B operation according to S401 to S405 above and reports completion; next, the processor sends the task of the data operation A+B+C+D to the network card, and the network card executes the A+B+C+D operation according to S401 to S405 and reports completion; finally, the processor sends the task of the data operation A+B+C+D+E+F+G+H to the network card, and the network card executes the A+B+C+D+E+F+G+H operation according to S401 to S405 and reports completion.
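The host-side sequencing described above can be pictured as follows; post_reduce_task() and wait_for_completion() are hypothetical helpers introduced for illustration, not an API defined by this application.

```c
/* Illustrative host-side sequencing for the three reduction steps of the
 * 8-node example: the processor posts one data operation task at a time and
 * waits for the network card's completion notification before posting the
 * next one. post_reduce_task() and wait_for_completion() are hypothetical
 * helpers, not an API defined by this application. */
void post_reduce_task(int step);
void wait_for_completion(int step);

void run_reduce_steps(void) {
    for (int step = 0; step < 3; step++) {
        post_reduce_task(step);     /* step 0: A+B; step 1: +C+D; step 2: +E..H */
        wait_for_completion(step);  /* notification reported by the network card */
    }
}
```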
For example, as shown in FIG. 6, the processor, the memory, the network card, and the bus in the server may all be integrated in the server's SoC; in this example the memory is a DRAM and the processor is a CPU. In the data operation method provided in this embodiment, after receiving the first message the network card stores the first operation result in the memory, and the network card only needs to perform one read operation and one write operation (reading the second data from the memory and writing the first operation result into the memory) to complete the data operation. As shown in FIG. 6, step 1 indicates that the network card reads the data in the server-side DDR into the network card and waits for the data from the network to participate in the operation; step 2 indicates that the network card receives the data from the network, identifies from the related information in the packet header that a reduce computation is required, completes the computation while processing the receive queue work queue element (RQ_WQE), and writes the computation result back to the server-side memory; step 3 indicates that the software reads the completion queue element (CQE) by interrupt or polling. Specifically: the network card receives and parses a first message from the network to obtain the operation indication information and the first data, where the operation indication information is carried in the packet header of the first message; the network card determines, according to the operation indication information in the packet header, the data operation of the MPI operation to be performed on the first data; the network card reads the second data A1 from the server's memory into the network card, completes the computation on the first data and the second data while the RQ_WQE is processed, and writes the computation result back to the server's memory; then the processor reads the CQE by interrupt or polling, that is, the processor receives the notification information sent by the network card, completing the information synchronization of the MPI operation. In FIG. 6, the second data is denoted A1 and the first operation result is denoted R1. It should be understood that when an RDMA packet is received from the network, a local RQ_WQE needs to be consumed, and the RQ_WQE indicates a local DDR space. After the first packet is received, it is determined, based on the operation indication information carried in the extension header, that the MPI operation needs to be performed on the first data. Therefore, after the first data is obtained, on-path computation is performed on it first, and once the computation result is obtained it is written into the memory space indicated by the RQ_WQE; the CPU can subsequently obtain the corresponding computation result from that memory space.
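Putting the pieces together, the on-path receive flow of FIG. 6 could be sketched as follows; every structure and helper name here is an assumption introduced for illustration, and the sketch is not this application's implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative network-card receive path for the on-path computation shown
 * in FIG. 6. All structures and helpers below are assumptions introduced for
 * illustration; the application describes the behaviour, not this API. */
struct op_info   { uint8_t reduce_code; uint8_t reduce_type; };
struct rx_packet { bool has_op_info; struct op_info info;
                   const void *payload; size_t len; uint64_t local_addr; };

/* Hypothetical helpers standing in for bus/memory access and the reduce unit. */
void  write_to_memory(uint64_t addr, const void *data, size_t len);
void *read_from_memory(uint64_t addr, size_t len);
void *compute_reduce(struct op_info info, const void *first,
                     const void *second, size_t len);
void  post_completion_notification(void);

void on_packet_received(struct rx_packet *pkt) {
    if (!pkt->has_op_info) {                 /* ordinary RDMA receive */
        write_to_memory(pkt->local_addr, pkt->payload, pkt->len);
        return;
    }
    /* Read the local (second) data from memory over the bus. */
    void *second = read_from_memory(pkt->local_addr, pkt->len);

    /* Compute first OP second while consuming the RQ_WQE, then write the
     * result back over the second data's location, overwriting it. */
    void *result = compute_reduce(pkt->info, pkt->payload, second, pkt->len);
    write_to_memory(pkt->local_addr, result, pkt->len);

    /* Post a completion (CQE) so the processor learns the operation is done. */
    post_completion_notification();
}
```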
In the above execution process, from the CPU's perspective the CPU is unaware of the entire computation process and only handles the interrupt reported after the computation is completed, which greatly reduces the CPU's operating system (OS) noise and improves CPU efficiency. The whole process requires only one DDR read and one DDR write, and the total delay consists of the DDR read delay, the delay of the RDMA network card processing the data operation (computation), and one DDR write operation.
In this embodiment of the application, when the network card receives the first message and obtains the operation indication information and the first data in it, the network card may directly obtain the second data from the memory and, according to the operation indication information, complete the data operation on the first data and the second data to obtain the first operation result. Compared with the prior art, the network card does not need to first write the first data into the memory; instead, it obtains the second data as soon as it obtains the first data, that is, it performs on-path computation on the first data and the second data, which reduces the number of memory reads and writes, reduces the delay of the MPI operation, and improves MPI execution efficiency. In addition, when the network card, the processor, and the memory of the server are all integrated in the server's SoC, the end-to-end transmission delay can be reduced, further improving the execution efficiency of the MPI operation.
The data operation method in the MPI operation provided in the embodiment of the present application is mainly described from the perspective of the server. It is understood that the server includes corresponding hardware structures and/or software modules for performing the respective functions in order to implement the above-described functions. Those of skill in the art will readily appreciate that the various illustrative network elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as hardware or combinations of hardware and computer software. Whether a function is performed as hardware or computer software drives hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiment of the present application, the data operation device in the MPI operation may be divided into functional modules according to the above method, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. It should be noted that, in the embodiment of the present application, the division of the module is schematic, and is only one logic function division, and there may be another division manner in actual implementation.
FIG. 7 shows a possible schematic structural diagram of the data operation device involved in the foregoing embodiments when each functional module is divided according to its corresponding function. The device is a network card or a chip built into the network card, and the network card is coupled to the memory through a bus. The device includes a receiving unit 501, a processing unit 502, and an obtaining unit 503. The receiving unit 501 is configured to support the device in receiving a first packet from the network; the processing unit 502 is configured to support the device in parsing the first packet to obtain the operation indication information and the first data included in the first packet, where the operation indication information indicates the data operation of the Message Passing Interface (MPI) operation that needs to be performed on the first data; the obtaining unit 503 is configured to support the device in obtaining second data from the memory, where the second data is the local data of that data operation in the MPI operation; and the processing unit 502 is further configured to support the device in completing the data operation on the first data and the second data in the MPI operation to obtain a first operation result. Further, the device may also include a writing unit 504 and a sending unit 505. The writing unit 504 is configured to support the device in writing the first operation result into the memory; the sending unit 505 is configured to support the device in sending notification information to the processor, where the notification information indicates that the data operation has been completed.
It should be noted that all relevant contents of each step related to the above method embodiment may be referred to the functional description of the corresponding functional module, and are not described herein again.
On the basis of hardware implementation, the processing unit 502 and the writing unit 504 in this application may be part of functions of a processor of the apparatus, and the receiving unit 501, the obtaining unit 503 and the sending unit 505 may be a set of functions of a transceiver of the apparatus, where the transceiver may generally include a transmitter and a receiver, and a specific transceiver may also be referred to as a communication interface.
FIG. 8 shows another possible schematic structural diagram of the data operation device involved in the foregoing embodiments. The device is a network card or a chip built into the network card, and the network card is coupled to the memory through a bus. The device includes a processor 602 and a communication interface 603. The processor 602 is configured to control and manage the actions of the device; for example, the processor 602, through the communication interface 603, may be configured to support the device in performing processes S401 to S405 in the foregoing embodiments and/or other processes of the techniques described herein. In addition, the device may further include a memory 601 and a bus 604; the processor 602, the communication interface 603, and the memory 601 are connected to one another through the bus 604. The communication interface 603 is configured to support communication by the device, and the memory 601 is configured to store the program code and data of the device.
The processor 602 may be, among other things, a central processing unit, a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, transistor logic, a hardware component, or any combination thereof. Which may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the disclosure. The processor may also be a combination of computing functions, e.g., comprising one or more microprocessors, a digital signal processor and a microprocessor, or the like. The bus 604 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 8, but that does not indicate only one bus or one type of bus.
In another embodiment of the present application, a readable storage medium is further provided. The readable storage medium stores computer-executable instructions, and when a device (which may be a single-chip microcomputer, a chip, or the like) executes the instructions, the device performs the steps of the network card in the method provided in the foregoing method embodiments. The readable storage medium may include any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory, a random access memory, or a magnetic or optical disk.
In another embodiment of the present application, a computer program product is further provided. The computer program product includes computer-executable instructions stored in a computer-readable storage medium; at least one processor of a device may read the computer-executable instructions from the computer-readable storage medium and execute them, so that the device performs the steps of the network card in the method provided by the foregoing method embodiments.
Finally, it should be noted that: the above description is only an embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions within the technical scope of the present disclosure should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (19)

  1. A data operation method applied to a network card, wherein the network card is coupled to a memory through a bus, the method comprising:
    receiving a first message, wherein the first message comprises operation indication information and first data;
    determining, according to the operation indication information, a data operation of a Message Passing Interface (MPI) operation that needs to be performed on the first data;
    acquiring second data from the memory, wherein the second data is local data of the data operation in the MPI operation;
    and completing the data operation of the first data and the second data in the MPI operation to obtain a first operation result.
  2. The method of claim 1, wherein the network card, the memory, and the bus are integrated in a system on a chip (SoC).
  3. The method according to claim 1 or 2, wherein the operation instruction information includes: an operation type and a data type.
  4. The method according to any of claims 1-3, wherein the operation indication information is carried in a header of the first packet.
  5. The method according to any one of claims 1-4, wherein the MPI operation comprises: an MPI_reduce operation, or an MPI_allreduce operation.
  6. The method according to any one of claims 1-5, wherein the first message further includes a storage address of the second data, and the acquiring the second data from the memory comprises:
    acquiring the second data from the memory according to the storage address of the second data.
  7. The method of claim 6, further comprising:
    and storing the first operation result in a storage position where the second data is located according to the storage address of the second data so as to cover the second data.
  8. The method of any of claims 1-7, wherein the network card is further coupled to a processor via the bus, the method further comprising:
    and sending notification information to the processor, wherein the notification information is used for indicating that the data operation is completed.
  9. A data operation device, wherein the device is a network card or a chip built in the network card, the network card is coupled with a memory through a bus, and the device comprises:
    a receiving unit, configured to receive a first message, wherein the first message comprises operation indication information and first data;
    a processing unit, configured to determine, according to the operation indication information, a data operation of a Message Passing Interface (MPI) operation that needs to be performed on the first data;
    the acquisition unit is used for acquiring second data from the memory, wherein the second data is local data of the data operation in the MPI operation;
    the processing unit is further configured to complete data operation of the first data and the second data in the MPI operation, and obtain a first operation result.
  10. The apparatus of claim 9, wherein the network card, the memory, and the bus are integrated in a system on a chip (SoC).
  11. The apparatus according to claim 9 or 10, wherein the operation instruction information includes: an operation type and a data type.
  12. The apparatus according to any of claims 9-11, wherein the operation indication information is carried in a header of the first packet.
  13. The apparatus according to any one of claims 9-12, wherein the MPI operation comprises: an MPI_reduce operation, or an MPI_allreduce operation.
  14. The apparatus according to any one of claims 9-13, wherein the first packet further includes a storage address of the second data, and the obtaining unit is further configured to:
    and acquiring the second data from the memory according to the storage address of the second data.
  15. The apparatus of claim 14, further comprising:
    and the writing unit is used for storing the first operation result in a storage position where the second data is located according to the storage address of the second data so as to cover the second data.
  16. The apparatus of any one of claims 9-15, wherein the network card is further coupled to a processor via the bus, the apparatus further comprising:
    and the sending unit is used for sending notification information to the processor, wherein the notification information is used for indicating that the data operation is completed.
  17. A data operation device, wherein the device is a network card or a chip built in the network card, the network card is coupled with a memory through a bus, the memory stores codes and data, and the network card runs the codes in the memory to make the device execute the data operation method according to any one of claims 1 to 8.
  18. A computer-readable storage medium having stored therein instructions which, when run on a computer, cause the computer to perform the data operation method of any one of claims 1-8.
  19. A computer program product, characterized in that, when the computer program product is run on an apparatus, it causes the apparatus to carry out the data operation method of any one of claims 1 to 8.
CN202080103371.6A 2020-09-01 2020-09-01 Data operation method and device Pending CN115989478A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/112901 WO2022047632A1 (en) 2020-09-01 2020-09-01 Data computation method and device

Publications (1)

Publication Number Publication Date
CN115989478A true CN115989478A (en) 2023-04-18

Family

ID=80492116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080103371.6A Pending CN115989478A (en) 2020-09-01 2020-09-01 Data operation method and device

Country Status (2)

Country Link
CN (1) CN115989478A (en)
WO (1) WO2022047632A1 (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8108876B2 (en) * 2007-08-28 2012-01-31 International Business Machines Corporation Modifying an operation of one or more processors executing message passing interface tasks
CN102279728B (en) * 2011-08-10 2016-03-23 北京百度网讯科技有限公司 Data storage device and method for computing data
CN105183531A (en) * 2014-06-18 2015-12-23 华为技术有限公司 Distributed development platform and calculation method of same
US11290392B2 (en) * 2017-01-30 2022-03-29 Intel Corporation Technologies for pooling accelerator over fabric
CN107391402A (en) * 2017-07-21 2017-11-24 郑州云海信息技术有限公司 A kind of data operating method, device and a kind of data operation card
CN111078286B (en) * 2018-10-19 2023-09-01 上海寒武纪信息科技有限公司 Data communication method, computing system and storage medium

Also Published As

Publication number Publication date
WO2022047632A1 (en) 2022-03-10

Similar Documents

Publication Publication Date Title
US11010681B2 (en) Distributed computing system, and data transmission method and apparatus in distributed computing system
US7802025B2 (en) DMA engine for repeating communication patterns
US9246861B2 (en) Locality mapping in a distributed processing system
US11341087B2 (en) Single-chip multi-processor communication
CN110119304B (en) Interrupt processing method and device and server
CN112035238A (en) Task scheduling processing method and device, cluster system and readable storage medium
CN110825436B (en) Calculation method applied to artificial intelligence chip and artificial intelligence chip
CN110825435B (en) Method and apparatus for processing data
CN114610475A (en) Training method of intelligent resource arrangement model
CN116257471A (en) Service processing method and device
CN115989478A (en) Data operation method and device
CN115994040A (en) Computing system, method for data broadcasting and data reduction, and storage medium
CN115756767A (en) Device and method for multi-core CPU atomic operation memory
CN111078286A (en) Data communication method, computing system and storage medium
WO2021169690A1 (en) Processor communication method and apparatus, electronic device, and computer-readable storage medium
CN115616984A (en) Task processing method based on multi-core processor, numerical control machine and storage medium
US10803007B1 (en) Reconfigurable instruction
CN111461310A (en) Neural network device, neural network system and method for processing neural network model
CN114399034B (en) Data handling method for direct memory access device
CN115604198B (en) Network card controller, network card control method, equipment and medium
US11689605B2 (en) In-network compute assistance
CN113688089B (en) Data processing method, computing system and computer storage medium
US20240111694A1 (en) Node identification allocation in a multi-tile system with multiple derivatives
CN115344192A (en) Data processing method and device and electronic equipment
CN116737690A (en) Data migration method, system and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination