CN117294642A - Multi-tenant on-network aggregation transmission system and method suitable for RDMA network - Google Patents


Info

Publication number
CN117294642A
CN117294642A (application CN202311200603.1A)
Authority
CN
China
Prior art keywords
node
network
aggregation
module
memory
Prior art date
Legal status
Pending
Application number
CN202311200603.1A
Other languages
Chinese (zh)
Inventor
李文信
李宇龙
李克秋
Current Assignee
Tianjin University
Original Assignee
Tianjin University
Priority date
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202311200603.1A priority Critical patent/CN117294642A/en
Publication of CN117294642A publication Critical patent/CN117294642A/en
Pending legal-status Critical Current

Classifications

    • H04L45/24 Routing or path finding of packets in data switching networks; multipath
    • H04L1/1607 Error detection/prevention using a return channel carrying supervisory signals; details of the supervisory signal
    • H04L12/1868 Broadcast or multicast with improved reliability; measures taken after transmission, e.g. acknowledgments
    • H04L47/10 Traffic control in data switching networks; flow control; congestion control
    • Y02D30/50 Reducing energy consumption in wire-line communication networks, e.g. low power modes or reduced link rate


Abstract

The invention discloses a multi-tenant in-network aggregation transmission system and method suitable for RDMA networks. A Worker node (3) and a parameter server node (4) are deployed on the host side (1). The Worker node (3) is provided with a speed limiter (5) and a dual transmission path control module (6); the parameter server node (4) is provided with a memory request module (10) comprising a data/request aggregation module (11) and a feedback module (12). The switch side (2) is provided with a memory allocation module (15), an aggregation module (16) and a broadcasting module (17). A memory allocation algorithm based on a first-come-first-served strategy on the switch side controls whether the host side uses the fast path mode, i.e. whether in-network aggregation of gradient packets is performed. The invention improves the overall transmission efficiency of the multi-tenant transmission system and makes the INA solution fully compatible with RDMA.

Description

Multi-tenant on-network aggregation transmission system and method suitable for RDMA network
Technical Field
The invention belongs to the field of data center networks, and particularly relates to a multi-tenant on-network aggregation transmission system and method for a large-scale data center.
Background
In recent years, large-scale data centers have increasingly adopted two techniques to accelerate the training of deep neural networks. The first is in-network aggregation (INA), which offloads gradient aggregation from the parameter server (PS) to a network switch or an FPGA middlebox, thereby freeing network resources, accelerating training tasks, and improving training scalability. The second is Remote Direct Memory Access (RDMA), which offloads network functions onto the hardware network card, bypassing the software protocol stack and achieving low latency and high throughput. However, in-network aggregation breaks the boundary between computation and the network: packets are consumed inside the network, causing RDMA's reliability mechanism to misjudge packet loss. Using an RDMA network for an INA solution therefore becomes a problem, which translates into the following two goals:
Goal one, RDMA compatibility: RDMA supports three connection transport types: Reliable Connection (RC), Unreliable Connection (UC), and Unreliable Datagram (UD). Among them, RC supports all RDMA transfer primitives and guarantees exactly-once, in-order, corruption-free delivery. Meanwhile, DNN training performance can drop dramatically when packets are lost. Thus, to make an INA solution fully compatible with RDMA, RC must be enabled.
Goal two, INA multi-tenancy: INA reduces network traffic by using switch memory (registers) to store and aggregate gradients. However, currently available programmable switches have limited memory resources; for example, a Tofino switch has only about 10 MB of usable memory, whereas DNN training clusters are typically shared by a large number of concurrent tasks. When integrated with RDMA, an INA solution must therefore also support multi-tenant training to improve job scalability.
To achieve RDMA-INA compatibility in a multi-tenant environment, a series of solutions have been proposed. They fall into two broad categories: best-effort schemes and static-allocation schemes. (1) Best-effort schemes such as ATP, A2TP, and ESA retain both the parameter server PS and the network switch for gradient aggregation. In these schemes, a Worker streams packets to the top-of-rack switch, which performs dynamic packet-level aggregation in a best-effort manner: if the switch has an idle aggregator it performs in-network aggregation, otherwise it pushes the packet to the parameter server for aggregation, called in-server aggregation (ISA). This dynamic aggregation scheme supports multi-tenant training but is not RDMA RC-compatible, because packets carrying small sequence numbers may be consumed during in-network aggregation; when PS aggregation is then performed on packets carrying larger sequence numbers, the end-side transport protocol may conclude that the network has lost packets, triggering unnecessary retransmission of packets that were already aggregated in the network. (2) Static-allocation schemes such as SwitchML and NetReduce remove the PS and completely offload gradient aggregation onto the switch. These schemes establish RDMA RC connections between Workers, or between Workers and the switch. However, because switch memory is statically partitioned for the entire training duration of each task, the number of concurrent training tasks is limited. Such statically partitioned switch-memory schemes therefore cannot support multi-tenant training scenarios.
Disclosure of Invention
The invention aims to provide a multi-tenant on-network aggregation transmission system and method suitable for an RDMA network, supporting multi-tenant in-network aggregation transmission by combining dual transmission paths with all-reduce-level memory requests.
The invention is realized by the following technical scheme:
the multi-tenant network aggregation transmission system suitable for the RDMA network comprises a host side 1 and a exchanger side 2, wherein at least one workbench node 3 and at least one parameter server node 4 are deployed on the host side 1, the workbench node 3 is connected with the exchanger side 3 through a first RDMA network card 13, and the parameter server node 4 is connected with the exchanger side 3 through a second RDMA network card 14;
the Worker node 3 is provided with a speed limiter 5 and a dual transmission path control module 6; the speed limiter 5 limits the traffic a single task sends into the network; the dual transmission path control module 6 provides the dual-transmission-path control logic on the host side and further comprises a slow path mode 7, a fast path mode 8 and a marker 9; the slow path mode 7 is used for in-server aggregation and the fast path mode 8 is used for in-network aggregation; the marker 9 inserts a request flag into in-server aggregation packets;
the parameter server node 4 is provided with a memory request module 10 for all-reduce-level memory requests in the parameter server node 4; the memory request module 10 further comprises a data/request aggregation module 11, a feedback module 12 and a fast/slow path mode 18; the data/request aggregation module 11 aggregates the gradient data and memory request results from all Worker nodes 3 at the parameter server PS node 4; the feedback module 12 returns the parameter data and memory request results from the PS node to the Worker nodes 3; the fast/slow path mode 18 forms, with the slow path mode 7 and the fast path mode 8 in each Worker node, a reliably connected transmission queue pair and a virtually reliably connected transmission queue pair, respectively;
the switch side 2 is provided with a memory allocation module 15, an aggregation module 16 and a broadcasting module 17; the memory allocation module 15 adopts a memory allocation algorithm based on a first-come-first-served strategy to provide training tasks with the dual-transmission-path control of the host side 1; the aggregation module 16 performs in-network aggregation on the switch side, and the broadcasting module 17 broadcasts packets to restore the lost connection state of the Worker nodes and returns the parameter packets sent from the parameter server node 4.
Further, the slow path mode and the fast path mode of the Worker node 3 comprise, respectively, a point-to-point reliably connected data transmission queue pair and a many-to-one virtually reliably connected data transmission queue pair; the parameter server node 4 comprises a point-to-point reliably connected data transmission queue pair and a one-to-many virtually reliably connected data transmission queue pair; the point-to-point reliably connected queue pair in the Worker node 3 is connected with the point-to-point reliably connected queue pair in the parameter server node 4; the one-to-many virtually reliably connected queue pair in the parameter server node 4 is connected with the many-to-one virtually reliably connected queue pairs in the Worker nodes 3.
Further, the slow path mode 7 establishes a point-to-point reliable connection between the Worker node 3 and the parameter server node 4.
Further, the fast path mode 8 establishes a many-to-one virtual reliable connection between the Worker nodes and the parameter server node 4.
Further, when a task enters the communication phase, the host side 1 transmits the data packet by default in the slow path mode.
Further, the memory allocation algorithm based on the first-come-first-served policy includes:
when the switch receives a data packet, parsing the destination QPN field in the packet header to obtain the task identification number and the requested memory pool index IDX;
tracking the allocation state of the N memory pools; if the packet carries the memory request flag bit and comes from a Worker node, the switch side invokes memory allocation until the allocation requests of all memory pools are completed;
finally, the switch forwards the packet to the parameter server node.
The invention also discloses a multi-tenant on-network aggregation transmission method suitable for an RDMA network, comprising the following steps:
constructing the multi-tenant on-network aggregation transmission system suitable for an RDMA network according to any one of claims 1-6, and establishing connections between the Worker nodes and the parameter server node on the host side;
adopting the memory allocation algorithm based on the first-come-first-served strategy on the switch side to control whether the host side uses the fast path mode; the host side uses the isINA field carried by its gradient packets to determine whether the switch side performs in-network aggregation on the gradient packets transmitted from the Worker nodes; the aggregation module on the switch side performs in-network aggregation of the gradient packets; after aggregation is completed, the switch side modifies the gradient packet header and payload and forwards the gradient packet to the virtually reliably connected transmission queue pair on the parameter server;
realizing in-network ACK broadcasting: when the gradient packet reaches the virtually reliably connected transmission queue pair on the parameter server node, starting the in-network ACK broadcast to the virtually reliably connected transmission queue pairs on all Worker nodes so as to restore the connection state of all Worker nodes.
Compared with the prior art, the invention achieves the following beneficial technical effects:
1) The host-side-driven design enables training tasks to dynamically utilize in-network aggregation (INA) according to their aggregation requirements;
2) unnecessary retransmissions are avoided, improving the overall transmission efficiency of the multi-tenant transmission system;
3) the INA solution is made fully compatible with RDMA without modifying commercial RDMA network cards.
Drawings
FIG. 1 is a diagram of a multi-tenant on-network aggregate transport system architecture suitable for RDMA networks of the present invention;
fig. 2 is a general flow chart of a multi-tenant on-network aggregate transport method applicable to RDMA networks of the present invention.
FIG. 3 is an exemplary diagram of a host side deployment of an embodiment of the present invention;
FIG. 4 is an exemplary diagram of a memory pool allocation algorithm code implementation;
reference numerals:
1. host side, 2. switch side, 3. Worker node, 4. parameter server node, 5. speed limiter, 6. dual transmission path control module (DTP), 7. slow path mode, 8. fast path mode, 9. marker, 10. memory request module (AMR), 11. data/request aggregation module, 12. feedback module, 13. first RDMA network card, 14. second RDMA network card, 15. memory allocation module, 16. aggregation module, 17. broadcast module, 18. fast/slow path mode.
In an embodiment, the end side uses commercial Mellanox ConnectX-5 dual-port 100 Gbps network cards, and the switch is a commercial Tofino programmable switch supporting the P4 language.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
First, the host side can pre-decide the behavior of each packet (ISA or INA) and thus proactively decide on the necessity of retransmission, supporting RDMA compatibility. In addition, the host-side driver enables training tasks to dynamically exploit INA according to their aggregation needs; that is, one task may proactively free memory for other tasks, supporting multi-tenant training.
Examples:
As shown in fig. 1, the multi-tenant on-network aggregation transmission system applicable to RDMA networks of the present invention includes a host side 1 and a switch side 2. The host side 1 specifically includes a Worker node 3 and a parameter server node 4 (PS node). The Worker node 3 includes a speed limiter 5 and a dual transmission paths (DTP) control module 6. The speed limiter 5 limits the traffic of a single task into the network to at most the bandwidth-delay product. The dual transmission path control module 6 further comprises a slow path mode 7, a fast path mode 8 and a marker 9, and provides the dual-transmission-path (DTP) control logic on the host side. Under the DTP control logic, the slow path mode 7 is used for in-server aggregation (ISA) and the fast path mode 8 is used for in-network aggregation (INA). The marker 9 inserts a request flag into ISA packets so that a task can perceive switch memory availability while aggregating at the server. The parameter server node 4 further comprises a memory request module 10 for all-reduce-level memory requests (AMR) in the parameter server node 4. The memory request module 10 further includes a data/request aggregation module 11, a feedback module 12, and a fast/slow path mode 18. The data/request aggregation module 11 aggregates the gradient data and memory request results from all Worker nodes 3 at the parameter server (PS) node 4. The feedback module 12 returns the parameter data or memory request results from the PS node to all Worker nodes.
The fast/slow path mode 18 forms, with the slow path mode 7 and the fast path mode 8 in each Worker node, a reliably connected transmission queue pair and a virtually reliably connected transmission queue pair, respectively.
The Worker node 3 is connected to the switch side 2 via the first RDMA network card 13. The parameter server node 4 is connected to the switch side 2 via the second RDMA network card 14.
The switch side 2 further comprises a memory allocation module 15, an aggregation module 16 and a broadcasting module 17. The memory allocation module 15 adopts a memory allocation algorithm based on a first-come-first-served strategy to provide training tasks with the dual-transmission-path control of the host side 1. The aggregation module 16 performs in-network aggregation on the switch side. The broadcasting module 17 broadcasts packets to restore the lost connection state of the Worker nodes and returns the parameter packets sent from the parameter server node 4.
As shown in fig. 2, the overall flow of the multi-tenant network aggregation transmission method applicable to the RDMA network of the present invention specifically includes the following steps:
constructing a multi-tenant on-network aggregation transmission system suitable for an RDMA network, and establishing connections between the Worker nodes and the parameter server node on the host side;
adopting the memory allocation algorithm based on the first-come-first-served strategy on the switch side to control whether the host side uses the fast path mode; the isINA field carried by each gradient packet from a Worker node determines how the switch side treats the received gradient packet; the aggregation module on the switch side performs in-network aggregation of the gradient packets; after aggregation is completed, the switch side modifies the gradient packet header and payload and forwards the gradient packet to the virtually reliably connected transmission queue pair on the parameter server;
realizing in-network ACK broadcasting: when the gradient packet reaches the virtually reliably connected transmission queue pair on the parameter server node, starting the in-network ACK broadcast to the virtually reliably connected transmission queue pairs on all Worker nodes so as to restore the connection state of all Worker nodes.
As shown in fig. 3, an example of the host side deployment in an embodiment of the present invention establishes two transmission paths for 2 Worker nodes and 1 parameter server (PS) node. All Worker nodes and the PS node follow the standard procedure to establish reliable connections (RC). The Worker nodes transmit gradient data; the PS node transmits parameters.
Worker node No. 1 has a slow path mode and a fast path mode. The slow path mode comprises a reliably connected local gradient data transmission queue pair (RC QP) formed by local QPNs (queue pair numbers) #1 and #3. The fast path mode comprises a virtually reliably connected remote transmission queue pair (virtual RC QP, vQP) formed by remote QPNs #5 and #7. Worker node No. 2 likewise has a slow path and a fast path: its slow path mode comprises a reliably connected local transmission queue pair RC QP formed by local QPNs #2 and #4, and its fast path mode comprises a virtually reliably connected remote transmission queue pair vQP formed by remote QPNs #6 and #7. Each Worker node creates one vQP.
The PS node of this embodiment has two slow path modes and one fast path mode. The slow path modes comprise two reliably connected local data transmission queue pairs RC QP, formed by local QPNs #3, #1 and local QPNs #4, #2. The fast path mode comprises a virtually reliably connected data transmission queue pair vQP formed by remote QPNs #7 and #5. The PS node creates one vQP. The local QPN and remote QPN serve as the context information of the connection between a Worker node and the PS node. The Worker node and PS node deployment on the host side is specifically described as follows:
1. Establishing connections between the Worker nodes and the PS node: the RC QP with local QPN #1 on Worker node No. 1 is connected with the RC QP with local QPN #3 on the PS node, and vice versa; the RC QP with local QPN #2 on Worker node No. 2 is connected with the RC QP with local QPN #4 on the PS node, and vice versa. That is, following the traditional PS architecture, the slow path mode establishes a point-to-point RC connection between each Worker node and the PS node.
The vQP on the PS node with remote QPNs #7, #5 is designated as the destination of the vQPs on all Worker nodes. For example, Worker node No. 1 and Worker node No. 2 each have one vQP (remote QPNs #5, #7 and remote QPNs #6, #7, respectively), and the vQP with remote QPN #7 is designated as their destination. That is, the fast path mode establishes a many-to-one virtual RC connection between all Worker nodes and the PS node.
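The queue-pair wiring of this example can be summarized in a minimal Python sketch (illustrative only; the function name and dictionary fields are hypothetical, not from the patent):

```python
# Illustrative sketch of the dual-path QP layout in fig. 3 (2 Workers, 1 PS).
# build_paths and the dict fields are hypothetical names.

def build_paths():
    # Slow path: one point-to-point RC QP per Worker (QPNs from the example).
    slow_path = {
        "worker1": {"local_qpn": 1, "remote_qpn": 3},  # Worker1 #1 <-> PS #3
        "worker2": {"local_qpn": 2, "remote_qpn": 4},  # Worker2 #2 <-> PS #4
    }
    # Fast path: each Worker creates one vQP; all of them target the single
    # vQP #7 on the PS, forming a many-to-one virtual RC connection.
    fast_path = {
        "worker1": {"local_vqpn": 5, "remote_vqpn": 7},
        "worker2": {"local_vqpn": 6, "remote_vqpn": 7},
        "ps":      {"local_vqpn": 7, "remote_vqpns": [5, 6]},
    }
    return slow_path, fast_path
```

The many-to-one shape of the fast path is visible in the data: every Worker's `remote_vqpn` is the single PS vQP #7.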
2. Realizing aggregation on the switch side: when the memory request succeeds, the host side 1, acting as the end host, streams gradient packets in the fast path mode and determines how the switch side treats the received gradient packets. Specifically, the host side controls the switch-side behavior by letting each gradient packet carry an isINA field indicating whether in-network aggregation is to be performed. If the isINA field is set to 0, the switch side forwards the packet to the PS node without any modification. If isINA is set to 1, the switch side 2 performs in-network aggregation in the aggregation module 16; after aggregation is completed, the switch side modifies the gradient packet header and payload and forwards the gradient packet to the vQP on the PS node 4 for further processing;
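The switch-side isINA rule can be sketched as a simplified software model (not the P4 implementation; packet fields other than isINA are hypothetical, and aggregation is modeled as element-wise summation across NUM_WORKERS packets):

```python
# Simplified software model of the switch-side isINA decision.
NUM_WORKERS = 2

def on_gradient_packet(pkt, agg_state, out):
    """pkt: {'isINA', 'seq', 'grad'}; agg_state: per-seq aggregator slots;
    out: list of (destination, packet) tuples the switch forwards."""
    if not pkt["isINA"]:
        out.append(("ps_rc_qp", pkt))  # isINA == 0: forward unmodified to PS
        return
    slot = agg_state.setdefault(pkt["seq"], {"sum": [0.0] * len(pkt["grad"]),
                                             "count": 0})
    slot["sum"] = [a + b for a, b in zip(slot["sum"], pkt["grad"])]
    slot["count"] += 1
    if slot["count"] == NUM_WORKERS:   # aggregation complete: rewrite header
        out.append(("ps_vqp", {"seq": pkt["seq"], "grad": slot["sum"],
                               "isINA": 1}))  # and payload, send to PS vQP
        del agg_state[pkt["seq"]]
```

The per-sequence slot stands in for one switch aggregator; a real data plane would bound the number of slots by the allocated memory pool.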
3. Realizing in-network ACK broadcasting: when the gradient packet reaches the vQP of the PS node 4, the RDMA network card triggers an in-network ACK broadcast to the vQPs of all Worker nodes to restore their connection state. This broadcast mechanism ensures that every Worker node receives the necessary ACK packets, maintaining a consistent connection state and preventing unnecessary retransmissions.
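The broadcast step itself admits a one-function sketch (hypothetical names; in practice this replication happens in the switch data plane, not in host software):

```python
# A single ACK arriving at the PS vQP is replicated to every Worker vQP so
# that each Worker's RC connection state advances consistently.
def broadcast_ack(ack_seq, worker_vqpns):
    return [{"dest_vqpn": q, "ack_seq": ack_seq} for q in worker_vqpns]
```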
Further description of the related art is as follows:
(1) All-reduce-level memory requests
The deep neural network training task iteratively alternates between communication and computation phases, known as the "on and off" training pattern. During the communication phase of a task, all-reduce operations are typically required to synchronize parameters. Based on this, the invention adopts an all-reduce-level memory request strategy, as follows:
When a task enters the communication phase, the host side transmits packets in the slow path mode by default. This default behavior lets the PS node guarantee gradient aggregation when the switch side lacks sufficient memory. The invention allows each Worker node to set a memory request flag isRequest once per round-trip time (RTT) to request memory from the switch side.
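The host-side default and the per-RTT isRequest flag might be modeled as follows (an illustrative sketch; the state fields fast_ok, last_req and rtt are assumptions, not the patent's driver code):

```python
# Slow path until a memory request has succeeded (fast_ok); the isRequest
# flag is set at most once per round-trip time.
def next_packet(state, now):
    pkt = {"path": "fast" if state["fast_ok"] else "slow"}
    if not state["fast_ok"] and now - state["last_req"] >= state["rtt"]:
        pkt["isRequest"] = 1          # ask the switch for an aggregator pool
        state["last_req"] = now
    return pkt
```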
(2) Memory allocation algorithm based on the first-come-first-served strategy
The present invention splits the switch memory into multiple memory pools, each containing a fixed number of aggregators (with memory of bandwidth-delay-product size), sufficient for the switch to process in-network aggregated packets at line rate. Unlike previous work, the present invention can dynamically allocate and deallocate memory pools for each task during its lifecycle. As shown in fig. 4, the memory pool allocation algorithm proceeds as follows:
when the switch receives a data packet, it parses the destination QPN field in the packet header to obtain the task identification number JOBID and the requested memory pool index IDX (JOBID % N);
the initial state of each memory pool is idle; the allocation state of the N memory pools is tracked, and if the packet carries the memory request flag bit isRequest and comes from a Worker node, the switch side invokes memory allocation until the allocation requests of all memory pools are completed;
finally, the switch forwards the packet to the parameter server node.
Whether a memory pool has completed allocation is handled as follows: if allocation has not yet been completed, the JOBID is written into the switch-side register s_pool to indicate that the memory pool has been allocated to the current task; after allocation, the switch reads back the current content of s_pool, and if the returned value equals the current task ID, indicating that the allocation succeeded, the switch sets the request success flag bit isAllocate to 1.
As shown in fig. 4, an example of the code implementation of the memory pool allocation algorithm proceeds as follows:
assuming there are N memory pools, the present invention uses N registers named s_pool to track the allocation status of each memory pool. Initially, all s_pool registers are set to -1, indicating that all memory pools are idle. When the switch receives a packet, it first looks up the destination QPN field in the packet header to obtain the task number JOBID and the requested memory pool index IDX (JOBID % N).
When the switch receives a packet, it parses the packet header (line 11); if the packet carries the memory request flag isRequest (line 12) and comes from a Worker node (line 13), the switch invokes the memory allocation procedure (line 14). This procedure determines whether the requested pool has been allocated; if not (line 2), the switch writes the JOBID into s_pool (line 3) to indicate that the memory pool has been allocated to the current task (line 9). After allocation, the switch returns the current content of s_pool (line 5); if the returned value equals the current task ID (line 15), indicating that the allocation succeeded, the switch sets the request success flag bit isAllocate to 1 (line 16). Finally, the switch forwards the packet to the PS (line 22).
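Ignoring the P4 specifics, the allocation logic described above (s_pool registers, IDX = JOBID % N, first come first served, isAllocate) can be mirrored in a short Python sketch; the function and field names are hypothetical:

```python
# Python mirror of the fig. 4 allocation logic (the real version is P4).
# s_pool tracks ownership per pool; -1 means idle; first come, first served.

N = 4                                  # number of memory pools (example value)
s_pool = [-1] * N

def handle_packet(pkt):
    job = pkt["JOBID"]
    idx = job % N                      # requested pool index IDX = JOBID % N
    if pkt.get("isRequest") and pkt.get("from_worker"):
        if s_pool[idx] == -1:
            s_pool[idx] = job          # write JOBID: pool now owned by job
        if s_pool[idx] == job:         # read back s_pool: allocation success
            pkt["isAllocate"] = 1
    return pkt                         # finally forwarded to the PS
```

A second task hashing to the same pool index (e.g. JOBID 5 with N = 4) finds the pool occupied and its packet is forwarded without isAllocate, so that task stays on the slow path.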
(3) Parameter server node request aggregation and feedback
Considering that different Worker nodes may not be in the same rack, the memory request of the current task succeeds only if the memory requests of all Worker nodes succeed. Therefore, the PS node is responsible for collecting the memory allocation results of all Worker nodes, aggregating them, and returning the result to each Worker. The specific flow is as follows:
When the parameter server node has received packets with the memory request flag bit isRequest from all Worker nodes, it first verifies whether all of them carry the memory request success flag bit isAllocate. If so, the memory request of the task succeeds, and the parameter server node sets the task request success flag bit isSuccess in the broadcast parameter packet; otherwise the memory request of the task fails. When a Worker node receives a packet carrying the isSuccess flag bit, it can choose to transmit packets on the fast path to enjoy the benefits of in-network aggregation.
(4) Memory release
To support multi-tenancy, switch memory should be dynamically freed for other tasks to request. Specifically, a memory pool needs to be released in two cases:
In the first case, memory allocation has failed: the switch receives a packet sent by the PS node that carries the memory request flag bit isRequest but lacks the task request success flag bit isSuccess;
in the second case, the all-reduce has completed: the switch receives a data packet carrying the aggregation end flag bit isEnd, which indicates the end of the communication phase.
In either case, the invention lets the switch invoke the memory release procedure. In this procedure, the switch sets s_pool to -1, indicating that no task is currently using the memory pool.
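The two release triggers can be sketched as a single check on the incoming packet. Flag names (isRequest, isSuccess, isEnd) and the -1 sentinel follow the description; the packet representation and function name are illustrative assumptions:

```python
# Sketch of the switch-side memory release procedure: free a pool when
# allocation failed or when the all-reduce has ended.

UNALLOCATED = -1

def maybe_release(s_pool, pkt):
    # Case 1: allocation failed -- a PS packet carries isRequest
    # but lacks the task success flag isSuccess.
    failed = (pkt.get("src") == "ps" and pkt.get("isRequest")
              and not pkt.get("isSuccess"))
    # Case 2: the all-reduce finished -- the packet carries isEnd.
    ended = bool(pkt.get("isEnd"))
    if failed or ended:
        # Set the pool owner to -1: no task is using this pool now.
        s_pool[pkt["idx"]] = UNALLOCATED

pool = [7, 9]  # two pools currently owned by tasks 7 and 9
maybe_release(pool, {"src": "ps", "isRequest": True, "idx": 0})  # case 1
maybe_release(pool, {"isEnd": True, "idx": 1})                   # case 2
```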
In summary, the multi-tenant on-network aggregation transmission system and method for RDMA networks of the present invention combines dual transmission paths with all-reduce level memory requests. (1) To remain compatible with RDMA RC, the invention employs dual transmission paths: ISA takes the slow path and INA takes the fast path. For the slow path, the invention reuses the point-to-point connections of a conventional PS architecture. For the fast path, the invention builds a many-to-one virtual connection for each end host and co-designs the switch and host logic to recover lost connection state. In this way, the invention supports RDMA RC compatibility without modifying commodity RDMA network cards, and guarantees that slow-path ISA transmission is unaffected. To support multi-tenant learning, the dual-path design ensures that when switch memory runs out, INA falls back to the default ISA. (2) To raise the efficiency of switch memory, the invention further employs an all-reduce level strategy for allocating memory among tasks, which follows the well-known "on-off" training pattern: each task requests memory when entering the communication phase and releases it during the computation phase. The invention therefore not only enjoys the low latency and high throughput of RDMA, but also allows multiple tasks to share the scarce switch memory, providing scalability for in-network aggregation.
The key insight of this strategy is that the host side decides whether in-network aggregation is performed. First, the host-side driver can pre-decide each packet's behavior (ISA or INA) and thus proactively determine whether retransmission is necessary, which supports RDMA compatibility. In addition, the host-side driver lets training tasks exploit INA dynamically according to their aggregation needs; that is, one task can proactively free memory for other tasks, which supports multi-tenant training.
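The host-side pre-decision reduces to a simple per-phase rule. A minimal sketch, assuming an illustrative function name and a boolean summarizing whether the task's all-reduce level memory request succeeded:

```python
# Sketch of the host driver's path pre-decision: INA (fast path) only
# when the all-reduce level memory request succeeded, otherwise the
# default ISA (slow path). Names are illustrative assumptions.

def choose_path(mem_request_succeeded: bool) -> str:
    # Fast path packets are aggregated in the switch; slow path
    # packets go through the conventional PS point-to-point connection.
    return "INA" if mem_request_succeeded else "ISA"

# When switch memory is exhausted the request fails, so the task
# automatically falls back to ISA without any switch-side involvement.
```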
It should be noted that, while the present invention has been shown and described with reference to certain exemplary embodiments thereof, it should be understood by those skilled in the art that the present invention is not limited to the above-described embodiments.
It should be understood that the detailed description is presented by way of example only and is not intended to limit the invention. All changes and modifications that do not depart from the spirit of the invention are intended to be covered by the scope of the invention.

Claims (7)

1. A multi-tenant on-network aggregation transmission system suitable for an RDMA network, characterized by comprising a host side (1) and a switch side (2), wherein at least one Worker node (3) and at least one parameter server node (4) are deployed on the host side (1), the Worker node (3) is connected with the switch side (2) through a first RDMA network card (13), and the parameter server node (4) is connected with the switch side (2) through a second RDMA network card (14);
the Worker node (3) is provided with a rate limiter (5) and a dual transmission path control module (6); the rate limiter (5) is used for limiting the traffic a single task sends into the network; the dual transmission path control module (6) provides dual transmission path control logic on the host side and further comprises a slow path mode (7), a fast path mode (8) and a marker (9), wherein the slow path mode (7) is used for intra-server aggregation and the fast path mode (8) is used for in-network aggregation; the marker (9) is used for inserting a request mark into the intra-server aggregation data packet;
the parameter server node (4) is provided with a memory request module (10) for all-reduce level memory requests in the parameter server node (4); the memory request module (10) further comprises a data/request aggregation module (11), a feedback module (12) and a fast/slow path mode (18); the data/request aggregation module (11) is used for aggregating the gradient data and memory request results reaching the parameter server node (4) from all Worker nodes (3); the feedback module (12) is configured to return the parameter data and the memory request result sent from the PS node to the Worker node (3); the fast/slow path mode (18) is configured to form a reliably connected transmission queue pair with the slow path mode (7) in the Worker node and a virtually reliably connected transmission queue pair with the fast path mode (8) in the Worker node;
the switch side (2) is provided with a memory allocation module (16), an aggregation module (17) and a broadcasting module (18); the memory allocation module (16) adopts a memory allocation algorithm based on a first-come-first-served strategy to support training tasks and the dual transmission path control of the host side (1); the aggregation module (17) is used for performing in-network aggregation on the switch side, and the broadcasting module (18) is used for redistributing lost gradient data packets to restore the connection state and for returning parameter packets to the parameter server node (4).
2. The multi-tenant on-network aggregation transmission system suitable for an RDMA network according to claim 1, characterized in that the slow path mode and the fast path mode of the Worker node (3) comprise a point-to-point reliably connected data transmission queue pair and a many-to-one virtually reliably connected data transmission queue pair, respectively; the parameter server node (4) comprises a point-to-point reliably connected data transmission queue pair and a one-to-many virtually reliably connected data transmission queue pair; the point-to-point reliably connected data transmission queue pair in the Worker node (3) is connected with the point-to-point reliably connected data transmission queue pair in the parameter server node (4); and the one-to-many virtually reliably connected data transmission queue pair in the parameter server node (4) is connected with the many-to-one virtually reliably connected data transmission queue pairs in the Worker nodes (3).
3. The multi-tenant on-network aggregation transmission system suitable for an RDMA network according to claim 2, characterized in that the slow path mode (7) establishes a point-to-point reliable connection between the Worker node (3) and the parameter server node (4).
4. The multi-tenant on-network aggregation transmission system suitable for an RDMA network according to claim 2, characterized in that the fast path mode (8) establishes a many-to-one virtual reliable connection between the Worker nodes (3) and the parameter server node (4).
5. The multi-tenant on-network aggregation transmission system suitable for an RDMA network according to claim 2, characterized in that the host side (1) transmits data packets over the slow path mode by default when a task enters the communication phase.
6. The multi-tenant on-network aggregation transmission system according to claim 1, characterized in that the memory allocation algorithm based on the first-come-first-served strategy comprises:
when the switch receives a data packet, parsing the destination QPN field in the packet header to obtain the task identification number and the requested memory pool index IDX;
tracking the allocation state of the N memory pools; if the data packet carries the memory request flag bit and comes from a Worker node, the switch side invokes memory allocation until the allocation requests of all memory pools are completed;
finally, the switch forwards the data packet to the parameter server node.
7. A multi-tenant on-network aggregation transmission method suitable for an RDMA network, comprising:
constructing the multi-tenant on-network aggregation transmission system suitable for an RDMA network according to any one of claims 1-6, and establishing connections between the Worker nodes on the host side and the parameter server node;
adopting, on the switch side, the memory allocation algorithm based on the first-come-first-served strategy to control the fast path mode of the host side; determining, on the switch side, to receive a gradient data packet transmitted from a Worker node by means of the isINA field carried in the Worker node's gradient data packet; performing in-network aggregation of the gradient data packets with the aggregation module on the switch side; and after aggregation is completed, modifying the gradient packet header and payload on the switch side and forwarding the packet to the virtually reliably connected transmission queue pair on the parameter server node;
and realizing in-network ACK broadcasting: when the gradient data packet reaches the virtually reliably connected transmission queue pair on the parameter server node, initiating in-network ACK broadcasting to the virtually reliably connected transmission queue pairs on all Worker nodes, so as to restore the connection state of all Worker nodes.
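The method of claim 7 can be summarized with an illustrative sketch (not the claimed data-plane implementation): the switch aggregates the isINA-flagged gradient packets element-wise, forwards the aggregate toward the parameter server, and emits an ACK for every contributing Worker. The packet representation and function name are assumptions.

```python
# Sketch of the claim-7 flow: in-network aggregation of gradient
# packets plus the in-network ACK broadcast that restores the Workers'
# virtual reliable connection state.

def switch_aggregate_and_ack(packets):
    # Only packets flagged isINA participate in in-network aggregation.
    assert all(p.get("isINA") for p in packets)
    # Element-wise sum of the gradient payloads.
    agg = [sum(vals) for vals in zip(*(p["payload"] for p in packets))]
    # The aggregated packet is forwarded to the PS's virtual RC queue
    # pair; one ACK per Worker is broadcast back.
    ps_packet = {"payload": agg}
    acks = [{"ack": True, "dst": p["src"]} for p in packets]
    return ps_packet, acks

ps_pkt, acks = switch_aggregate_and_ack([
    {"isINA": True, "src": "w0", "payload": [1, 2]},
    {"isINA": True, "src": "w1", "payload": [3, 4]},
])
```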
CN202311200603.1A 2023-09-18 2023-09-18 Multi-tenant on-network aggregation transmission system and method suitable for RDMA network Pending CN117294642A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311200603.1A CN117294642A (en) 2023-09-18 2023-09-18 Multi-tenant on-network aggregation transmission system and method suitable for RDMA network


Publications (1)

Publication Number Publication Date
CN117294642A true CN117294642A (en) 2023-12-26

Family

ID=89252841



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination