CN117971766A - Data transmission method and system for single network card multiple GPUs based on GPUDirect RDMA technology - Google Patents

Data transmission method and system for single network card multiple GPUs based on GPUDirect RDMA technology

Info

Publication number
CN117971766A
CN117971766A (application number CN202410148139.4A)
Authority
CN
China
Prior art keywords
gpu
data
queue
hash
core
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410148139.4A
Other languages
Chinese (zh)
Inventor
杨露
锁强
岳晨阳
郭燕
奚智雯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Advanced Technology Research Institute
Original Assignee
Wuxi Advanced Technology Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuxi Advanced Technology Research Institute filed Critical Wuxi Advanced Technology Research Institute
Priority to CN202410148139.4A priority Critical patent/CN117971766A/en
Publication of CN117971766A publication Critical patent/CN117971766A/en
Pending legal-status Critical Current

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a data transmission method and system for multiple GPUs sharing a single network card based on GPUDirect RDMA technology. It exploits the multi-queue capability of the HCA and the parallelism of a multi-core CPU (Central Processing Unit): 4 logic cores manage 4 GPU video memories in parallel, forming 4 groups of logic core-GPU communication queues. The invention decouples the 4-way GPU API calls from the CPU I/O transfer operations and allows the CPU to process GPU I/O requests asynchronously, so that the 4 GPU video memories and the HCA perform parallel RDMA data transfers without switching the RDMA channel back and forth between the 4 GPUs and the HCA. GPU I/O calls therefore return faster, without waiting for GPU I/O requests to traverse the high-latency PCIe bus, and data transfers overlap with GPU computation. Access by the 4 GPUs to system memory is minimized, so the system achieves low transmission delay, high node resource utilization and high data throughput, enabling high-performance batch transmission.

Description

Data transmission method and system for single network card multiple GPUs based on GPUDirect RDMA technology
Technical Field
The invention relates to a single-network-card multi-GPU data transmission method and system based on GPUDirect RDMA technology, and belongs to the technical field of data transmission.
Background
With the rapid development of high-performance computing, deep learning, big data and other applications, network rates have gradually transitioned to 100 Gbps, and the requirements on network transmission rate and delay keep rising. In conventional TCP/IP technology, each data packet is processed by the operating system protocol stack and moved multiple times, so a large-scale data transmission task occupies most of the processor resources and memory bus bandwidth. Conventional TCP/IP therefore cannot meet the low-communication-delay and high-resource-utilization requirements of high-bandwidth networking applications; its network delay is severe and it cannot satisfy real-time requirements.
NVIDIA GPUs provide GPUDirect technology, which allows data in GPU video memory to be transferred directly to a remote GPU's video memory through a remote direct memory access (RDMA) network card in a distributed environment. This avoids copying data from video memory to system memory and reduces the cost of data transmission; the RDMA characteristic gives network communication in a distributed environment very low latency while also reducing the overhead on the remote CPU.
Ideally, adding three additional GPUs to the system should multiply the server's throughput nearly fourfold without changing the hardware architecture. However, the PCIe topology on the server allows only one of the four GPUs to use the same Host Channel Adapter (HCA) for GPUDirect RDMA; the other three GPUs must abandon the RDMA transport mechanism and fall back to conventional TCP/IP transmission. Under this constraint, if all 4 GPUs want to use GPUDirect technology, the RDMA path must be switched back and forth, so network data transmission cannot fully utilize the node's resources, system performance degrades, and data transmission throughput cannot meet the requirement.
Disclosure of Invention
The invention provides a data transmission method and system for multiple GPUs sharing a single network card based on GPUDirect RDMA technology, which solves the problems described in the background art.
In order to solve the technical problems, the invention adopts the following technical scheme:
A data transmission method for multiple GPUs sharing a single network card based on GPUDirect RDMA technology comprises the following steps:
4 GPU video memories are respectively managed in parallel by 4 logic cores, so that 4 groups of logic core-GPU communication queues are realized;
Distributing the data streams of the same link to the same logic core-GPU communication queue through an RSS (Receive Side Scaling) splitting mechanism;
Distributing the data stream to 4 groups of logic core-GPU communication queues through a software distribution mechanism;
the 4 sets of logical core-GPU communication queues receive the data streams.
Further, the data traffic of the same link is distributed to the same logic core-GPU communication queue through the RSS splitting mechanism as follows: after the network adapter receives a data packet, the Hash value of the symmetric RSS Hash algorithm Hash1 is calculated from the packet's five-tuple information, and the mapping relation between this Hash value and the 4 groups of logic core-GPU communication queues is recorded in Hash mapping table 1. For each subsequently received packet, Hash mapping table 1 is traversed to check whether the current Hash value already has a corresponding communication queue; if the mapping relation is recorded, the packet is distributed directly to that queue. When the calculated Hash value is not in Hash mapping table 1, a new link data stream is indicated; if the load of the logic core of the communication queue corresponding to the Hash1 value is less than or equal to the average load of the 4 logic cores, the data stream is distributed to that queue.
Further, the data traffic is evenly distributed to the 4 groups of logic core-GPU communication queues through the software distribution mechanism as follows: the Hash value of the symmetric Hash algorithm Hash2 is calculated, and the mapping relation between this Hash value and the 4 groups of logic core-GPU communication queues is recorded in Hash mapping table 2. For each subsequently received packet, Hash mapping table 2 is traversed to check whether the current hash value already has a corresponding communication queue; if the mapping relation is recorded, the packet is distributed directly to that queue. When the calculated Hash value is not in Hash mapping table 2, a new link data stream is indicated; if the load of the logic core of the communication queue corresponding to the symmetric RSS Hash algorithm Hash1 value is greater than the current average logic core load, the data stream is distributed to the communication queue corresponding to the symmetric Hash algorithm Hash2.
Further, the method also comprises the step of updating the mapping relation between the Hash value of the symmetric RSS Hash algorithm Hash1 and the communication queue to a Hash mapping table 1; and updating the mapping relation between the Hash value of the symmetric Hash algorithm Hash2 and the communication queue to the Hash mapping table 2.
Further, the method further comprises the following steps: when GPUNet is initialized, a standard CPU interface is used for initializing a GPU network buffer area, and 4 independent memories are allocated for 4 GPUs; establishing a mapping table from a memory virtual address to a physical address, and injecting the GPU memory into RDMA hardware of the network card; GPUNet uses RDMA registered memory as a memory pool, and allocates a receive buffer and a transmit buffer for each communication queue.
Further, the process of each set of logical core-GPU communication queues receiving a data stream is:
the HCA notifies the logic core through the status write-back queue CQ that new data is to be received;
the logic core passes the descriptor information of the circular buffer between the logic core and the GPU to the HCA through the command send queue SQ;
the data packet is copied directly from the remote host memory to the GPU video memory through RDMA;
the HCA notifies the logic core of the completion of data reception through the status write-back queue CQ;
the logic core updates the circular buffer on behalf of the remote host and informs the GPU that data has been received;
the GPU calls the grecv() function to read the data and updates the circular buffer to indicate that the data has been consumed; this update triggers the logic core to notify the HCA through the command send queue SQ that data reception is finished;
after the HCA receives the command, it updates the remote host, indicating that one data reception is complete.
Accordingly, a data transmission system for multiple GPUs sharing a single network card based on GPUDirect RDMA technology includes: a multi-core load balancing processing module and a data transmission module;
the data transmission module comprises 4 logic cores, 4 GPU video memories are respectively managed in parallel by the 4 logic cores, and 4 groups of logic core-GPU communication queues are realized;
the multi-core load balancing processing module is used for distributing the data streams of the same link to the same logic core-GPU communication queue through an RSS (Receive Side Scaling) splitting mechanism, and for distributing the data streams to the 4 groups of logic core-GPU communication queues through a software distribution mechanism.
Further, the data transmission module further comprises a memory management module, and the memory management module is used for distributing a receiving buffer area and a sending buffer area for each communication queue.
Further, the data transmission module further comprises a flow control module, wherein the flow control module is used for realizing flow control of the network buffer area relevant to each connection flow through a circulation buffer area between the logic core and the GPU and a queue manager between the logic core and the HCA.
Further, the queues between the logical core and the HCA include a state write back queue CQ and a command send queue SQ.
The invention has the following beneficial effects: it exploits the multi-queue capability of the HCA and the parallelism of the multi-core CPU, with 4 logic cores managing 4 GPU video memories in parallel to realize 4 groups of logic core-GPU communication queues. The invention decouples the 4-way GPU API calls from the CPU I/O transfer operations and allows the CPU to process GPU I/O requests asynchronously, so that the 4 GPU video memories and the HCA perform parallel RDMA data transfers without switching the RDMA channel back and forth between the 4 GPUs and the HCA. GPU I/O calls therefore return faster, without waiting for GPU I/O requests to traverse the high-latency PCIe bus, and data transfers overlap with GPU computation. Access by the 4 GPUs to system memory is minimized, so the system achieves low transmission delay, high node resource utilization and high data throughput, enabling high-performance batch transmission.
Drawings
Fig. 1 is a schematic structural diagram of the single-network-card multi-GPU data transmission system and method based on GPUDirect technology provided by the invention.
Fig. 2 is a specific structural schematic diagram of a dual Hash shunting mechanism of a load balancing module provided by the present invention.
Fig. 3 is a schematic diagram of a specific application of the single-network-card multi-GPU data transmission system and method based on GPUDirect technology according to the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present invention, and are not intended to limit the scope of the present invention.
As shown in fig. 1, the single-network-card multi-GPU data transmission system based on GPUDirect technology comprises a multi-core load balancing processing module and a data transmission module. The multi-core load balancing processing module is a dual-Hash data distribution mechanism combining a symmetric RSS (Receive Side Scaling) splitting mechanism with a software distribution mechanism. 4 logic cores manage 4 GPU video memories in parallel; the IP address, protocol and port five-tuple information is obtained by parsing the data packet, Hash mapping table 1 and Hash mapping table 2 are computed through the two Hash functions, and the communication queue is determined. This solves the Hash-collision problem of a single Hash, fully exploits the parallelism of the multi-core CPU in the processor system, and achieves efficient distribution of data over the 4 groups of logic core-GPU communication queues.
The symmetric RSS splitting mechanism modifies the hash key value K on the basis of RSS to obtain Hash mapping table 1, ensuring that bidirectional data packets of the same connection produce the same hash value after the hash algorithm, so that the data flow of the same link is distributed to the same communication queue.
The software distribution mechanism can distribute the data flow evenly over the 4 groups of logic core-GPU communication queues, so that data packets are cached evenly in the 4 GPU network buffers, effectively solving the problem of unbalanced load distribution. A symmetric hash algorithm is adopted: every bit of the packet's five-tuple information enters the operation, each operand is squared, and the results are XOR-ed to obtain Hash mapping table 2, avoiding hash collisions.
The data transmission module consists of 4 parallel, independent GPU network buffer modules and is implemented by a memory management module and a flow control module. A GPUNet network-layer protocol is employed to provide a socket abstraction and a high-level network API for GPU programs. On the hardware side, GPUNet's RDMA network protocol stack is completely offloaded to the network card. On the software side, the module interacts directly with the RDMA network card, avoiding operating-system intervention and achieving zero copy. The invention decouples GPU API calls from CPU I/O transfer operations and allows the logic cores to process GPU I/O requests asynchronously; the logic cores of the 4 groups of communication queues can process data asynchronously with respect to the GPUs.
The memory management module is used for caching data packets in the network buffer of the GPU application. When GPUNet is initialized, the invention allocates 4 large independent memory regions for the 4 groups of logic core-GPU communication queues to cache network data. Each GPU network buffer is initialized from the CPU side using the standard GPUNet CPU interface, and the buffer memory is managed uniformly by GPU sockets. To enable RDMA hardware transfers, the GPU memory is injected into the RDMA hardware of the network card.
The flow control module manages flow control of the network buffers associated with each connection flow. It is implemented through a circular buffer between the logic core and the GPU and a queue manager between the logic core and the HCA; the CPU controls the network card through the standard host driver, making the network card available to all logic cores. The circular buffers are managed in ring fashion (RingBuffer) and are shared by the logic cores and the GPUs. Through the circular buffers, the logic core helps the HCA and the remote HCA update counters in GPU-accessible memory. The queue controller implements message passing between the logic core and the HCA, including the command send queue SQ in the logic core-to-HCA direction and the status write-back queue CQ in the HCA-to-logic core direction. The command send queue SQ and the status write-back queue CQ pass messages with descriptors and are two separate ring buffers in memory.
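As a minimal sketch of the shared circular buffer described above (an assumed layout, not the patented implementation; RING_SLOTS and the field names are hypothetical), the metadata could look as follows, with the head advanced by the logic core/HCA side and the tail advanced by the GPU's gsend/grecv calls:

```c
/* Minimal sketch (assumed layout) of the ring buffer shared by a logic core
 * and its GPU.  The data slots live in the RDMA-registered GPU network buffer;
 * the head/tail counters live in host memory the GPU can access. */
#include <stdint.h>

#define RING_SLOTS 1024                  /* hypothetical slot count */

struct ring_buffer {
    volatile uint64_t head;              /* producer position (logic core / HCA side) */
    volatile uint64_t tail;              /* consumer position (GPU grecv/gsend side)  */
    uint32_t          slot_size;         /* bytes per slot in the GPU network buffer  */
    void             *slots;             /* base address inside the GPU buffer        */
};

/* Number of filled slots as seen by either side. */
static inline uint64_t ring_used(const struct ring_buffer *rb)
{
    return rb->head - rb->tail;
}

static inline int ring_full(const struct ring_buffer *rb)
{
    return ring_used(rb) >= RING_SLOTS;
}
```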
The invention adopts the GPUNet network-layer protocol and follows a layered design: the bottom layer is a channel abstraction that depends on the underlying transport, the middle socket layer implements a reliable, in-order stream abstraction on top of each channel, and the top layer implements a standard socket API for the GPU. Between the user layer and the Verbs interface layer, the invention adopts the Vsocket socket, using the LD_PRELOAD environment variable to intercept API calls and convert TCP/UDP traffic into high-speed RDMA traffic without modifying the application.
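The LD_PRELOAD interception technique itself is standard; the sketch below shows the general shape of such a shim, assuming a hypothetical vsocket_connect() entry point standing in for the patent's Vsocket layer (not its actual code):

```c
/* Hedged sketch of LD_PRELOAD interception: a shared library overriding
 * connect() and steering traffic to an RDMA-backed path when possible. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/socket.h>

/* Hypothetical Vsocket entry point (assumed, not from the patent text). */
int vsocket_connect(int fd, const struct sockaddr *addr, socklen_t len);

int connect(int fd, const struct sockaddr *addr, socklen_t len)
{
    static int (*real_connect)(int, const struct sockaddr *, socklen_t);
    if (!real_connect)
        real_connect = (int (*)(int, const struct sockaddr *, socklen_t))
                           dlsym(RTLD_NEXT, "connect");

    if (vsocket_connect(fd, addr, len) == 0)   /* try the RDMA fast path first */
        return 0;
    return real_connect(fd, addr, len);        /* fall back to the kernel TCP/IP stack */
}
```

Built as a shared object and loaded with LD_PRELOAD=./libvsocket.so ./app, this overrides connect() without modifying the application, which is the effect the Vsocket layer relies on.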
The GPU cannot access the doorbell register through memory-mapped I/O, so it cannot ring the HCA's "doorbell" register to trigger a send operation. Moreover, the HCA driver does not allow completion queue structures to be placed in GPU memory, yet the HCA needs the completion queue to deliver completion notifications.
The invention uses 4 logic cores to manage the 4 GPUs in parallel and complete the send and receive operations. When the logic cores are initialized, 4 independent command ring buffers are initialized, all managed in ring fashion (RingBuffer). Each command ring buffer is a piece of system host memory shared by a logic core and its GPU; the logic core acts as part of the completion-notification processor, and the GPU performs a circular-buffer update for each gsend/grecv call. To ensure consistent concurrent updates, the updates are treated as two independent instances of producer-consumer coordination: first, between the GPU and the HCA, the pointer that produces received data in the GPU network buffer is updated; second, between the GPU and the remote HCA, the pointer that consumes data sent from the GPU network buffer is updated. Within the command ring buffer, the logic core helps the HCA and the remote HCA update the counters in GPU-accessible memory.
The queue controller adopted by the invention implements bidirectional message passing between the logic core and the HCA: the command queue SQ in the logic core-to-HCA direction and the status queue CQ in the HCA-to-logic core direction. The command queue SQ is used by the host to issue commands, including instructions issued to the HCA and instructions for HCA state management. The status queue CQ feeds HCA state information back to the logic core, including the HCA's execution state for each instruction and any abnormal state.
The command send queue SQ and the status write-back queue CQ manage descriptor information with ring buffers occupying two separate regions of memory. Each logic core works in parallel, so each logic core has its own command send queue SQ and status write-back queue CQ. To implement the two descriptor rings for SQ and CQ, a head pointer and a tail pointer are needed to indicate the head and tail of each queue; the distance between them is the number of outstanding descriptors, and the head and tail move as the HCA continuously consumes descriptors. The descriptor-related registers are designed as follows: a BASE address register BASE indicating the location of the descriptors in memory; a descriptor HEAD pointer register HEAD indicating the start position of the current descriptors; a descriptor TAIL pointer register TAIL indicating the end position of the current descriptors; a descriptor count register SIZE indicating the number of descriptors the network card reads at one time; and a descriptor enable register CTRL for enabling the descriptor queue.
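A compact sketch of this per-queue register set, following the names in the text (field widths and the helper are illustrative assumptions, not a real device's register map):

```c
/* Sketch of the per-queue descriptor-ring registers (BASE/HEAD/TAIL/SIZE/CTRL).
 * One instance per logic core for SQ and one for CQ; widths are assumptions. */
#include <stdint.h>

struct queue_regs {
    uint64_t base;   /* BASE: address of the descriptor ring in memory          */
    uint32_t head;   /* HEAD: index of the first descriptor not yet consumed    */
    uint32_t tail;   /* TAIL: index one past the last valid descriptor          */
    uint32_t size;   /* SIZE: descriptors the NIC fetches per read              */
    uint32_t ctrl;   /* CTRL: enable bit(s) for the descriptor queue            */
};

/* Outstanding descriptors: ring distance from head to tail. */
static inline uint32_t queue_pending(const struct queue_regs *q, uint32_t ring_len)
{
    return (q->tail + ring_len - q->head) % ring_len;
}
```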
The data transmission method for multiple GPUs sharing a single network card based on GPUDirect technology disclosed by the invention comprises the following steps:
First, before network data packets are cached by the GPUs, the multi-core load balancing processing module distributes them uniformly over the 4 communication queues. The RSS function is turned on for the multi-core CPU by setting the rss_hf and mq_mode fields in the rte_eth_conf structure. The symmetric RSS function is then added on this basis.
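For reference, a hedged sketch of this DPDK configuration (constants follow recent DPDK releases, RTE_ETH_*; older releases spell them ETH_MQ_RX_RSS / ETH_RSS_*; the symmetric key buffer is an assumption, filled as in the key-construction sketch later):

```c
/* Enable RSS via the mq_mode and rss_hf fields of rte_eth_conf, as referenced above. */
#include <string.h>
#include <rte_ethdev.h>

static uint8_t sym_rss_key[40];          /* symmetric RSS key, filled elsewhere */

static int enable_rss(uint16_t port_id, uint16_t nb_rx_queues, uint16_t nb_tx_queues)
{
    struct rte_eth_conf port_conf;
    memset(&port_conf, 0, sizeof(port_conf));

    port_conf.rxmode.mq_mode = RTE_ETH_MQ_RX_RSS;             /* multi-queue receive via RSS */
    port_conf.rx_adv_conf.rss_conf.rss_key     = sym_rss_key;
    port_conf.rx_adv_conf.rss_conf.rss_key_len = sizeof(sym_rss_key);
    port_conf.rx_adv_conf.rss_conf.rss_hf      = RTE_ETH_RSS_IP |
                                                 RTE_ETH_RSS_TCP |
                                                 RTE_ETH_RSS_UDP;  /* hash on the five-tuple */

    return rte_eth_dev_configure(port_id, nb_rx_queues, nb_tx_queues, &port_conf);
}
```

In the setup described here, nb_rx_queues would presumably be 4, one RX queue per logic core-GPU pair.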
For a pair of bidirectional packets, the protocol type field in their five-tuples is exactly the same. The symmetric RSS Hash algorithm Hash1 is implemented so that the hash values computed from the source IP address, destination IP address, source port and destination port are identical in both directions, so the packets are distributed to the same logic core. In order to make Hash(srcIP, srcPort, dstIP, dstPort, K) = Hash(dstIP, dstPort, srcIP, srcPort, K), the portions of K corresponding to srcIP and dstIP, and to srcPort and dstPort, must be equal respectively, which yields the following constraints:
The key value K required in this case must satisfy:
K[1:15]=K[17:31]=K[33:47]=K[49:63]=K[65:79]=K[81:95]=K[97:111]=K[113:127]
K[16]=K[48]=K[80]=K[96]=K[112]
K[32]=K[64]
Hash mapping table 1 is obtained by computing the symmetric RSS Hash algorithm Hash1 with the key value K obtained above.
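One way to satisfy these bit constraints is a key with a 16-bit period; a minimal sketch, assuming the commonly used repeating 0x6D5A symmetric Toeplitz pattern (the patent does not mandate this particular value):

```c
/* Fill a 40-byte RSS key with a 16-bit-periodic pattern; any such key meets
 * the K[...] equalities above.  0x6D5A is a common symmetric Toeplitz choice. */
#include <stdint.h>
#include <stddef.h>

static void fill_symmetric_rss_key(uint8_t *key, size_t len)
{
    for (size_t i = 0; i < len; i++)
        key[i] = (i % 2 == 0) ? 0x6D : 0x5A;   /* 0x6D 0x5A 0x6D 0x5A ... */
}
```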
The software distribution mechanism adopts the symmetric Hash algorithm Hash2 to achieve balanced distribution of data streams. The symmetric Hash algorithm Hash2 divides the source IP address S_IP, destination IP address D_IP, source port S_Port and destination port D_Port into 8-bit segments, squares each segment, and XORs the low 8 bits of each squared result together to obtain the hash value.
Hash mapping table 2 is obtained according to the symmetric Hash algorithm Hash2.
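The patent's exact Hash2 formula is given as a figure and is not reproduced here; the following is an assumption-consistent reconstruction of the description above (split into 8-bit segments, square, XOR the low 8 bits), shown only as a sketch:

```c
/* Sketch of a symmetric five-tuple hash in the style described for Hash2. */
#include <stdint.h>

static uint8_t sq_low8(uint8_t b)
{
    return (uint8_t)((uint16_t)b * (uint16_t)b);   /* low 8 bits of b^2 */
}

static uint8_t hash2(uint32_t s_ip, uint32_t d_ip, uint16_t s_port, uint16_t d_port)
{
    uint8_t h = 0;
    for (int i = 0; i < 4; i++) {                  /* 8-bit segments of each IP address */
        h ^= sq_low8((uint8_t)(s_ip >> (8 * i)));
        h ^= sq_low8((uint8_t)(d_ip >> (8 * i)));
    }
    for (int i = 0; i < 2; i++) {                  /* 8-bit segments of each port */
        h ^= sq_low8((uint8_t)(s_port >> (8 * i)));
        h ^= sq_low8((uint8_t)(d_port >> (8 * i)));
    }
    return h;   /* XOR is order-independent, so swapping src/dst gives the same value */
}
```

The symmetry needed for bidirectional flows follows directly from XOR being commutative over the segment set.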
The design flow of the multi-core load balancing processing module is shown in fig. 2. After the network adapter receives a data packet, the Hash values of the symmetric RSS Hash algorithm Hash1 and the symmetric Hash algorithm Hash2 are calculated from the packet's five-tuple information, and the mapping relations between these Hash values and the 4 communication queues are recorded in Hash mapping table 1 and Hash mapping table 2. For each subsequently received packet, Hash mapping table 1 and Hash mapping table 2 are traversed to check whether the current hash value exists; if its mapping relation has been recorded, the packet is distributed directly to the corresponding communication queue. When the calculated hash value is not in the mapping tables, a new link data stream is indicated and a further judgment is made. If the load of the logic core corresponding to the symmetric RSS Hash algorithm Hash1 is less than or equal to the average load of the 4 logic cores, the data stream is distributed to the corresponding communication queue, and the mapping between the Hash1 value and the queue is added to Hash mapping table 1. If the logic core load is greater than the current average logic core load, the data stream is distributed to the communication queue corresponding to the symmetric Hash algorithm Hash2, and the mapping between the Hash2 value and the queue is added to Hash mapping table 2.
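A sketch of this dispatch decision is given below; map1/map2, core_load[] and the h % NB_QUEUES queue derivation are hypothetical helpers standing in for the Hash mapping tables and per-core load statistics, under the assumptions stated here rather than the patent's exact data structures:

```c
/* Dual-hash dispatch decision in the style of Fig. 2. */
#include <stdint.h>

#define NB_QUEUES 4

extern int  map1_lookup(uint32_t h1);      /* returns queue id, or -1 if flow unseen */
extern void map1_insert(uint32_t h1, int q);
extern void map2_insert(uint32_t h2, int q);
extern unsigned core_load[NB_QUEUES];      /* per-logic-core load counters */

static unsigned average_load(void)
{
    unsigned sum = 0;
    for (int i = 0; i < NB_QUEUES; i++)
        sum += core_load[i];
    return sum / NB_QUEUES;
}

/* Pick a logic core-GPU communication queue for a packet with hashes h1/h2. */
static int dispatch_queue(uint32_t h1, uint32_t h2)
{
    int q = map1_lookup(h1);
    if (q >= 0)
        return q;                          /* existing flow: keep its queue */

    int q1 = (int)(h1 % NB_QUEUES);        /* candidate from symmetric RSS Hash1 */
    if (core_load[q1] <= average_load()) {
        map1_insert(h1, q1);               /* new flow; Hash1 core is not overloaded */
        return q1;
    }

    int q2 = (int)(h2 % NB_QUEUES);        /* overloaded: fall back to software Hash2 */
    map2_insert(h2, q2);
    return q2;
}
```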
The data transmission module uses the GPUNet network-layer protocol to provide a socket abstraction and a high-level network API (VSocket) for GPU programs. It is mainly implemented by the memory management module and the flow control module.
The memory management module is implemented as follows: when GPUNet is initialized, the standard CPU interface is used to initialize the GPU network buffers, 4 large independent memory regions are allocated for the 4 GPUs, a mapping table from memory virtual addresses to physical addresses is established, and the GPU memory is injected into the RDMA hardware of the network card. GPUNet uses the RDMA-registered memory as a memory pool and allocates a receive buffer and a send buffer for each communication queue.
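A hedged sketch of the "inject GPU memory into the RDMA hardware" step, using the standard CUDA runtime and ibverbs APIs (this assumes the GPUDirect RDMA kernel module, nvidia-peermem, is loaded; it is not the patent's own code):

```c
/* Allocate a GPU network buffer and register it with the HCA so the NIC can
 * DMA into GPU memory directly, without staging through host RAM. */
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t bytes, void **gpu_addr)
{
    if (cudaMalloc(gpu_addr, bytes) != cudaSuccess)   /* per-GPU network buffer */
        return NULL;

    /* With nvidia-peermem loaded, ibv_reg_mr can pin and map device memory. */
    return ibv_reg_mr(pd, *gpu_addr, bytes,
                      IBV_ACCESS_LOCAL_WRITE |
                      IBV_ACCESS_REMOTE_WRITE |
                      IBV_ACCESS_REMOTE_READ);
}
```

In the four-GPU setup described here, this would presumably be invoked once per device (after selecting it with cudaSetDevice), giving one registered memory pool per communication queue.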
An embodiment of the flow control module is shown in fig. 3, which illustrates the flow control process for receiving a data packet. As described above, the data packet is distributed to communication queue x (0-3) by the dual-Hash splitting mechanism; the specific receiving process of the data packet is then as follows (a hedged sketch of the logic-core side of this loop is given after the list):
(1) The HCA, through the status write-back queue CQ, informs logic core x that new data is to be received;
(2) Logic core x passes the descriptor information of the circular buffer RingBuffer between the logic core and the GPU to the HCA through the command send queue SQ;
(3) The data packet is copied directly from the remote host memory to the GPUx video memory through RDMA;
(4) The HCA notifies logic core x of the completion of data reception through the status write-back queue CQ;
(5) Logic core x updates the circular buffer RingBuffer on behalf of the remote host, informing the GPU that data has been received;
(6) The GPU calls the grecv() function to read the data and updates the circular buffer RingBuffer, indicating that the data has been consumed;
(7) This update triggers logic core x to notify the HCA through the command send queue SQ that data reception is finished;
(8) After the HCA receives the command, it updates the remote host, indicating that one data reception is complete.
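The sketch below maps steps (1)-(8) onto a logic-core event loop for one queue; poll_cq(), post_sq_*() and the ring-buffer helpers are hypothetical stand-ins for the HCA queue controller and RingBuffer described above, so this is an assumed structure rather than the patented implementation:

```c
/* Logic-core side of the receive flow for communication queue x. */
#include <stdint.h>

enum cq_event { CQ_NEW_DATA, CQ_RECV_DONE, CQ_NONE };

extern enum cq_event poll_cq(int queue);                      /* status write-back queue CQ */
extern void post_sq_recv_desc(int queue, uint64_t gpu_slot,
                              uint32_t len);                  /* command send queue SQ */
extern void post_sq_recv_complete(int queue);
extern uint64_t ring_next_slot(int queue, uint32_t *len);     /* next free RingBuffer slot */
extern void ring_advance_head(int queue);                     /* producer update visible to GPU */
extern int  ring_consumed_by_gpu(int queue);                  /* GPU's grecv() moved the tail */

void logic_core_rx_loop(int queue)
{
    for (;;) {
        switch (poll_cq(queue)) {
        case CQ_NEW_DATA: {                                   /* step (1) */
            uint32_t len;
            uint64_t slot = ring_next_slot(queue, &len);
            post_sq_recv_desc(queue, slot, len);              /* step (2): tell HCA where to RDMA */
            break;                                            /* step (3) happens in hardware */
        }
        case CQ_RECV_DONE:                                    /* step (4) */
            ring_advance_head(queue);                         /* step (5): GPU may now grecv() */
            break;
        case CQ_NONE:
            break;
        }
        if (ring_consumed_by_gpu(queue))                      /* step (6) observed */
            post_sq_recv_complete(queue);                     /* step (7); HCA then does step (8) */
    }
}
```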
The invention decouples GPU API calls from CPU I/O transfer operations, allowing the CPU to process GPU I/O requests asynchronously. GPU I/O calls therefore return faster, and data transfers overlap with GPU computation instead of waiting for GPU I/O requests to traverse the high-latency PCIe bus, so high-performance batch transmission can be achieved.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing is illustrative of the present invention and is not to be construed as limiting it; all modifications, equivalents and improvements made within the spirit and principles of the present invention are intended to be included within the scope of the present invention as defined by the appended claims.

Claims (10)

1. A data transmission method for multiple GPUs sharing a single network card based on GPUDirect RDMA technology, characterized in that:
4 GPU video memories are respectively managed in parallel by 4 logic cores, so that 4 groups of logic core-GPU communication queues are realized;
Distributing the data streams of the same link to the same logic core-GPU communication queue through an RSS (Receive Side Scaling) splitting mechanism;
Distributing the data stream to 4 groups of logic core-GPU communication queues through a software distribution mechanism;
the 4 sets of logical core-GPU communication queues receive the data streams.
2. The data transmission method for multiple GPUs sharing a single network card based on GPUDirect RDMA technology according to claim 1, characterized in that the data traffic of the same link is distributed to the same logic core-GPU communication queue through the RSS splitting mechanism as follows: after the network adapter receives a data packet, the Hash value of the symmetric RSS Hash algorithm Hash1 is calculated from the packet's five-tuple information, and the mapping relation between this Hash value and the 4 groups of logic core-GPU communication queues is recorded in Hash mapping table 1; for each subsequently received packet, Hash mapping table 1 is traversed to check whether the current Hash value already has a corresponding communication queue, and if the mapping relation is recorded, the packet is distributed directly to that queue; when the calculated Hash value is not in Hash mapping table 1, a new link data stream is indicated, and if the load of the logic core of the communication queue corresponding to the Hash1 value is less than or equal to the average load of the 4 logic cores, the data stream is distributed to that queue.
3. The data transmission method for multiple GPUs sharing a single network card based on GPUDirect RDMA technology according to claim 2, characterized in that the data traffic is evenly distributed to the 4 groups of logic core-GPU communication queues through the software distribution mechanism as follows: the Hash value of the symmetric Hash algorithm Hash2 is calculated, and the mapping relation between this Hash value and the 4 groups of logic core-GPU communication queues is recorded in Hash mapping table 2; for each subsequently received packet, Hash mapping table 2 is traversed to check whether the current hash value already has a corresponding communication queue, and if the mapping relation is recorded, the packet is distributed directly to that queue; when the calculated Hash value is not in Hash mapping table 2, a new link data stream is indicated, and if the load of the logic core of the communication queue corresponding to the symmetric RSS Hash algorithm Hash1 value is greater than the current average logic core load, the data stream is distributed to the communication queue corresponding to the symmetric Hash algorithm Hash2.
4. The data transmission method for multiple GPUs sharing a single network card based on GPUDirect RDMA technology according to claim 3, further comprising updating the mapping relation between the Hash value of the symmetric RSS Hash algorithm Hash1 and the communication queue into Hash mapping table 1, and updating the mapping relation between the Hash value of the symmetric Hash algorithm Hash2 and the communication queue into Hash mapping table 2.
5. The data transmission method for multiple GPUs sharing a single network card based on GPUDirect RDMA technology according to claim 1, further comprising: when GPUNet is initialized, using a standard CPU interface to initialize the GPU network buffers and allocating 4 independent memory regions for the 4 GPUs; establishing a mapping table from memory virtual addresses to physical addresses, and injecting the GPU memory into the RDMA hardware of the network card; GPUNet uses the RDMA-registered memory as a memory pool and allocates a receive buffer and a send buffer for each communication queue.
6. The data transmission method for multiple GPUs sharing a single network card based on GPUDirect RDMA technology according to claim 1, characterized in that each group of logic core-GPU communication queues receives the data stream as follows:
the HCA notifies the logic core through the status write-back queue CQ that new data is to be received;
the logic core passes the descriptor information of the circular buffer between the logic core and the GPU to the HCA through the command send queue SQ;
the data packet is copied directly from the remote host memory to the GPU video memory through RDMA;
the HCA notifies the logic core of the completion of data reception through the status write-back queue CQ;
the logic core updates the circular buffer on behalf of the remote host and informs the GPU that data has been received;
the GPU calls the grecv() function to read the data and updates the circular buffer to indicate that the data has been consumed;
this update triggers the logic core to notify the HCA through the command send queue SQ that data reception is finished;
after the HCA receives the command, it updates the remote host, indicating that one data reception is complete.
7. A single-network-card multi-GPU data transmission system based on GPUDirect RDMA technology, comprising: a multi-core load balancing processing module and a data transmission module;
the data transmission module comprises 4 logic cores, the 4 logic cores respectively manage 4 GPU video memories in parallel, realizing 4 groups of logic core-GPU communication queues;
the multi-core load balancing processing module is used for distributing the data streams of the same link to the same logic core-GPU communication queue through an RSS (Receive Side Scaling) splitting mechanism, and for distributing the data streams to the 4 groups of logic core-GPU communication queues through a software distribution mechanism.
8. The single-network-card multi-GPU data transmission system based on GPUDirect RDMA technology of claim 7, wherein the data transmission module further comprises a memory management module, the memory management module being configured to allocate a receive buffer and a send buffer for each communication queue.
9. The single-network-card multi-GPU data transmission system based on GPUDirect RDMA technology of claim 7, wherein the data transmission module further comprises a flow control module, the flow control module implementing flow control of the network buffers associated with each connection flow through a circular buffer between the logic core and the GPU and a queue manager between the logic core and the HCA.
10. The single-network-card multi-GPU data transmission system based on GPUDirect RDMA technology of claim 9, wherein the queues between the logic core and the HCA comprise a status write-back queue CQ and a command send queue SQ.
CN202410148139.4A 2024-02-01 2024-02-01 Data transmission method and system for single network card multiple GPUs based on GPUDirect RDMA technology Pending CN117971766A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410148139.4A CN117971766A (en) 2024-02-01 2024-02-01 Data transmission method and system for single network card multiple GPUs based on GPUDirect RDMA technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410148139.4A CN117971766A (en) 2024-02-01 2024-02-01 Data transmission method and system for single network card multiple GPUs based on GPUDirect RDMA technology

Publications (1)

Publication Number Publication Date
CN117971766A true CN117971766A (en) 2024-05-03

Family

ID=90856069

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410148139.4A Pending CN117971766A (en) 2024-02-01 2024-02-01 Data transmission method and system for single network card multiple GPUs based on GPUDirect RDMA technology

Country Status (1)

Country Link
CN (1) CN117971766A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination