CN115714750A - ToR switch-based data center RDMA Incast solution method and system - Google Patents

ToR switch-based data center RDMA Incast solution method and system Download PDF

Info

Publication number
CN115714750A
CN115714750A CN202211200973.0A CN202211200973A CN115714750A CN 115714750 A CN115714750 A CN 115714750A CN 202211200973 A CN202211200973 A CN 202211200973A CN 115714750 A CN115714750 A CN 115714750A
Authority
CN
China
Prior art keywords
switch
sending
end switch
receiving
incast
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211200973.0A
Other languages
Chinese (zh)
Inventor
张娇
苏昱臻
潘恬
黄韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications filed Critical Beijing University of Posts and Telecommunications
Priority to CN202211200973.0A priority Critical patent/CN115714750A/en
Publication of CN115714750A publication Critical patent/CN115714750A/en
Pending legal-status Critical Current

Links

Images

Abstract

The invention provides a method and a system for solving RDMA Incast of a data center based on a ToR switch, wherein the method comprises the following steps: based on the fact that a sending end switch senses an incast condition, if the incast condition occurs, caching the data stream; the network card receiving rate of the user receiving end is obtained based on the receiving end switch, the data volume size that the user receiving end can receive in unit time is calculated based on the network card receiving rate, the link utilization rate is calculated based on the data volume size that the user receiving end can receive in unit time, the receiving rate of the receiving end switch is calculated based on the data volume size that the receiving end switch can receive in unit time, the sending rate of every sending end switch is calculated based on the sending end quantity that is connected with the sending end switch and the receiving rate of the receiving end switch, the incast condition occurs the sending end switch sends data to the receiving end switch based on the sending rate.

Description

ToR switch-based data center RDMA Incast solution method and system
Technical Field
The invention relates to the technical field of data transmission, in particular to a method and a system for solving RDMA Incast of a data center based on a TOR switch.
Background
Cloud services, by virtue of their powerful computing resources and unlimited storage capabilities, have become an important dependence for the development of information technology enterprises today. Cloud service resource providers such as amazon, arbiba, google and the like adopt Data Centers (DC) to provide services for large-scale online learning, distributed storage, network search and other applications. With the continuous development of the upper application technology, the performance of the data center infrastructure is put on higher demands. The link rate of data centers has increased from 1/10Gbps in the past to 100Gbps today. The resource decoupling needs to control the in-network delay within 3-5 mus to ensure good program performance, and machine learning expects the in-network tail delay of the data center to be less than 50 mus. The high link rate and low latency requirements in a data center pose significant challenges to the design of transmission control.
The conventional TCP network protocol stack requires the intervention of the operating system kernel to make the data copy. At high link rates, TCP transmission will result in high CPU utilization, resulting in transmission delays of typically hundreds of microseconds. Furthermore, the maximum bandwidth supported by an operating system kernel thread is only tens of Gbps. Such transmission performance has not been able to meet the demands of today's data centers. Thus, remote Direct Memory Access (RDMA) technology is applied to data center networks. RDMA directly reads and writes the memory without the intervention of operating systems of two communication parties. By bypassing the kernel, RDMA achieves high link rates, low CPU utilization, and microsecond intra-network transmission delays. The RDMA technology used in the data center is RoCE (RDMAover converged ethernet), and the data link layer and the network layer of the latest version of RoCEv2 use the existing legacy network. Various schemes have been proposed by academia for transport control for RDMA.
However, the incust problem is easily caused in the existing RDMA network, and the incust problem refers to that when a request direction requests a plurality of servers to synchronously transmit data, the plurality of servers in the network transmit corresponding data almost at the same time under the conditions of shallow cache and high link rate of a data center network. This can result in a large number of packets flooding within the data center network in a short amount of time, causing a catastrophic breakdown of the network.
In the existing Incast problem solution, windows are generally required to be arranged between switches of adjacent layers of an RDMA network, transmission is limited layer by adopting the windows, and arrangement difficulty is high.
Disclosure of Invention
In view of this, the embodiments of the present invention provide a data center RDMA Incast solution based on a ToR switch, so as to eliminate or improve one or more defects existing in the prior art.
One aspect of the present invention provides a method for solving RDMA Incast of a data center based on a ToR switch, wherein an RDMA network includes a sending end switch connected to a sending end and a receiving end switch connected to a user receiving end, both the sending end switch and the receiving end switch are ToR switches, and the method includes the steps of:
based on the fact that a sending end switch senses an incast condition, if the incast condition occurs, caching the data stream in the sending end switch;
the network card receiving rate of a user receiving end is obtained based on a receiving end switch, the data volume size which can be received by the user receiving end in unit time is calculated based on the network card receiving rate, the link utilization rate of the receiving end switch is calculated based on the data volume size which can be received by the user receiving end in unit time, the data volume size which can be received by the receiving end switch in unit time is calculated based on the link utilization rate, the receiving rate of the receiving end switch is calculated based on the data volume size which can be received by the receiving end switch in unit time, the sending rate of each sending end switch is calculated based on the sending end quantity which is connected with the sending end switch and the receiving rate of the receiving end switch, and the incast condition occurs, the sending end switch sends data to the receiving end switch based on the sending rate.
Adopt above-mentioned scheme, this scheme only needs to arrange the procedure at the sending end switch and the receiving end switch of RDMA network, utilize the perception of sending end switch incast condition, when the incast condition takes place, buffer the data stream, avoid a large amount of data streams to rush into the RDMA network and cause the data to block up, on the other hand, the receiving end switch calculates the maximum receiving rate of receiving end switch according to the condition of user receiving end, and distribute sending rate for every sending end switch, output the data that sending end switch cached fast and high-efficiently, this scheme need not arrange between the switch between sending end switch and receiving end switch, the treatment effeciency of the incast condition has not only been improved, and the degree of difficulty of arranging has been reduced.
In some embodiments of the present invention, the step of sensing an incast condition based on the sender switch includes:
the sending end switch counts the data size sent to the same user receiving end by the sending end switch within a preset time length;
and when the data volume sent to the same user receiving end through the sending end switch within the preset time length is larger than a preset incast threshold value, judging that the incast condition occurs.
In some embodiments of the present invention, the sensing an incast condition by the sender-side switch further includes that, after it is determined that the incast condition occurs, the sender-side switch continues to transmit a first preset amount of data to the user receiver; if incast happens, caching the data stream in the sending end switch, and caching the data stream in the sending end switch after the sending end switch transmits the data volume with the first preset size to the user receiving end.
In some embodiments of the present invention, if an incast condition occurs, the step of caching the data stream in the sending-end switch further includes constructing a queue for the data stream cached in the sending-end switch based on an input time of the data stream, and in the step of sending data to the receiving-end switch based on the sending rate, the sending-end switch sends data to the receiving-end switch based on a queue sequence.
In some embodiments of the present invention, in the step of calculating the size of the data amount that can be received by the user receiving end in the unit time based on the network card receiving rate, the size of the data amount that can be received by the user receiving end in the unit time is calculated based on the following formula:
Δ=B dip ×τ update
wherein, B dip Indicating the network card receiving rate, tau, of the user receiving end update The time length of unit time is represented, and delta represents the size of data quantity which can be received by a user receiving end in the unit time;
in the step of calculating the link utilization rate of the receiving-end switch based on the size of the data volume which can be received by the user receiving end in unit time, the link utilization rate of the receiving-end switch is calculated based on the following formula:
Figure BDA0003872457620000031
wherein, Δ represents the data volume that can be received by the user receiving end in unit time, qlen represents the data volume cached by the port connected between the current receiving end switch and the user receiving end, and η represents the link utilization rate of the receiving end switch.
In some embodiments of the present invention, in the step of calculating the size of the data amount that can be received by the receiving-end switch in unit time based on the link utilization, the size of the data amount that can be received by the receiving-end switch in unit time calculated in a previous unit time is obtained, and the size of the data amount that can be received by the receiving-end switch in unit time is calculated based on the size of the data amount that can be received by the receiving-end switch in unit time calculated in the previous unit time and the link utilization.
In some embodiments of the present invention, in the step of calculating the size of the data amount that can be received by the receiving-end switch in a unit time based on the size of the data amount that can be received by the receiving-end switch in a unit time calculated in a previous unit time and the link utilization, the size of the data amount that can be received by the receiving-end switch in a unit time is calculated based on the following formula:
Figure BDA0003872457620000032
wherein, B dip Indicating the network card receiving rate, tau, of the user receiving end update Represents the time length per unit time, eta represents the link utilization of the receiver switch, W old Represents the data volume W calculated in the previous unit time and can be received by the receiving end switch in the unit time new The size of data that the receiver switch can receive in the current unit time is shown, and both alpha and beta are constants.
In some embodiments of the present invention, in the step of calculating the receiving rate of the receiver switch based on the size of the amount of data that the receiver switch can receive in a unit time, the receiving rate of the receiver switch is calculated based on the following formula:
Figure BDA0003872457620000041
wherein R is new Indicating the receiving rate, W, of the receiving side switch new Indicating the size of the data, τ, that the receiver switch can receive at the current unit time update Indicating the length of time per unit time.
In some embodiments of the present invention, in the step of calculating the transmission rate of each transmitting-end switch based on the number of transmitting-ends connected to the transmitting-end switch and the receiving rate of the receiving-end switch, the transmission rate of each transmitting-end switch is calculated based on the following formula:
Figure BDA0003872457620000042
wherein R is ToR Indicating the currently calculated sending rate, R, of the sending-side switch new Indicates the receiving rate, num, of the receiving switch ToR Indicates the number of senders, num, connected to the sender switch currently being computed Indicating the total number of senders connected to the sender switch where the incast condition occurred.
Another aspect of the present invention further provides a data center RDMAIncast solution system based on a ToR switch, where the system includes a sending-end switch management module and a receiving-end switch management module,
the sending end switch management module is used for sensing an incast condition based on a sending end switch, and if the incast condition occurs, caching the data stream in the sending end switch;
the receiving end switch management module is used for acquiring the network card receiving rate of a user receiving end based on a receiving end switch, calculating the data volume size which can be received by the user receiving end in unit time based on the network card receiving rate, calculating the link utilization rate of the receiving end switch based on the data volume size which can be received by the user receiving end in unit time, calculating the data volume size which can be received by the receiving end switch in unit time based on the link utilization rate, calculating the receiving rate of the receiving end switch based on the data volume size which can be received by the receiving end switch in unit time, calculating the sending rate of each sending end switch based on the sending end number connected with the sending end switch and the receiving rate of the receiving end switch, and sending data to the receiving end switch based on the sending rate by the sending end switch under the incast condition.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to the specific details set forth above, and that these and other objects that can be achieved with the present invention will be more clearly understood from the detailed description that follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention.
Fig. 1 is a schematic diagram of an embodiment of a data center RDMA Incast solution based on a ToR switch.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the following embodiments and the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the solution according to the present invention are shown in the drawings, and other details not so related to the present invention are omitted.
It should be emphasized that the term "comprises/comprising" when used herein, is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
It is also noted that, unless otherwise specified, the term "coupled" is used herein to refer not only to a direct connection, but also to an indirect connection with an intermediate.
Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the drawings, the same reference numerals denote the same or similar parts, or the same or similar steps.
In the prior art, the Incast problem mainly has the following two characteristics: 1. in a many-to-one communication mode, a plurality of servers respond to transmission data of the same requester to form an asymmetric data transmission mode; 2. in response to synchronicity, for a data request from a requester, servers in the data center network respond and begin transmitting data at approximately the same time, resulting in a large volume of traffic flooding within the data center network in a short period of time. These characteristics of Incast will have a serious impact on the data center network. For a traditional TCP network, incast may cause an overflow of switch buffers in the network, resulting in a large amount of lost and retransmitted data packets. For RDMA networks, incast may cause head-of-line blocking (head-of-line blocking) problems, trigger Priority Flow Control (PFC) mechanisms, and even packet loss retransmission. The PFC mechanism triggers severe ghost ringing delay and may also cause problems such as pause frame storm and deadlock in the network. For the TCP Incast problem, the prior art proposes many TCP Incast solutions based on end-to-end transmission control and ACK backhaul mechanisms, but the TCP Incast solutions fail within RDMA networks due to high link rates and changes in transmission mechanisms of RDMA networks. An accurate and effective RDMA Incast control scheme is urgently needed to be provided.
In the prior art, the solutions of RDMA Incast of the data center are mainly classified into a Congestion Control (CC) solution and a congestion control-based secondary solution. In RDMA networks, the congestion control mainly used in the existing industry is as follows: 1. the DCQCN employs an explicit congestion control (ECN) mechanism to control the sending rate of the flow based on the length of the switch egress port queue. 2. By adopting the RTT (round-trip-time) gradient as a congestion feedback signal, the congestion condition in the network can be sensed more sensitively, and the congestion is responded in time. The core ideas of DCQCN and Timely are that a heuristic algorithm is adopted to adjust the sending rate of a sending end, the adjustment granularity is thick, and the network performance cannot be rapidly converged when congestion changes. 3. The network utilization rate of the bottleneck link in the network is calculated by adopting an in-network telemetry (INT) technology to serve as a congestion feedback signal, so that the sending rate and window adjustment with finer granularity are realized, and the Flow Completion Time (FCT), fairness, tail time delay and the like are excellent.
The congestion control-based secondary solution is mainly a measure proposed for the features of Incast in combination with the features of RDMA networks. The earliest work was DCQCN +, which is an improvement over the shortcomings of DCQCN in facing Incast. The DCQCN + maintains a flow table recording active flows at the receiving end, and the receiving end feeds back a Congestion Notification Packet (CNP) according to the number of the active flows. The transmitting end of the DCQCN + adopts an improved Reaction Point (RP) algorithm to enhance the processing capability of the Incast problem. Another approach in the prior art proposes an Incast control strategy based on a dynamic window of a switch. The exchanger maintains a sending window for each receiving end, limits the data burst quantity in a fixed time interval, and the upstream link updates the window value of the downstream link according to the available bandwidth period. After the window is exhausted, redundant data at a receiving end of the switch is isolated by adopting a Virtual Output Queue (VOQ), so that the influence of Incast on other flows is reduced, and the overall network performance under the Incast scene is improved.
However, the above conventional RDMA congestion control adopts an end-to-end control scheme, and it takes at least one RTT to respond to congestion in the network, and although the DCQCN + improves the deficiency that the DCQCN faces Incast, it still adopts end-to-end transmission control, and is also not applicable to RDMA high link rate networks. This makes Incast unable to achieve the ideal control, and as RDMA network link rates continue to increase, more and more flows will complete transmission within one RTT. Therefore, the traditional congestion control scheme cannot respond to the Incast problem in time and influences the transmission of non-Incast flows. In addition, most congestion control adopts a heuristic algorithm to adjust the sending rate, and the network performance cannot be converged in a short time. In summary, the conventional congestion control scheme has failed to address the Incast problem in RDMA networks.
To solve the above problems, as shown in fig. 1, the present invention provides a method for RDMA Incast solution of a data center based on a ToR switch, where an RDMA network includes a sending-end switch connected to a sending end and a receiving-end switch connected to a user receiving end, and the method includes the steps of:
in some embodiments of the present invention, the sender switch and the receiver switch are both ToR switches.
In some embodiments of the present invention, a TOR (Top of rack) access is a more common routing from server to switch or switch to switch. The TOR switches may be both access layer switches and aggregation layer switches and core layer switches. When the TOR switch is positioned at the top of the server cabinet and plays a role of connecting the server with the core layer switch or the convergence layer switch, the TOR switch is called as an access layer switch; when the TOR switch is located on top of the switch cabinet, it is called a convergence layer or core layer switch.
Step S100, based on the fact that a sending end switch senses an incast condition, if the incast condition happens, caching a data stream in the sending end switch;
in some embodiments of the present invention, the sending-end switch is provided with a buffer space, and the data stream is buffered in the buffer space of the sending-end switch.
Step S200, acquiring the network card receiving rate of the user receiving end based on the receiving end switch, calculating the data volume size which can be received by the user receiving end in unit time based on the network card receiving rate, calculating the link utilization rate of the receiving end switch based on the data volume size which can be received by the user receiving end in unit time, calculating the data volume size which can be received by the receiving end switch in unit time based on the link utilization rate, calculating the receiving rate of the receiving end switch based on the data volume size which can be received by the receiving end switch in unit time, calculating the sending rate of each sending end switch based on the sending end number connected with the sending end switch and the receiving rate of the receiving end switch, and sending data to the receiving end switch based on the sending rate by the sending end switch under the incast condition.
In some embodiments of the present invention, in the step of calculating the sending rate of each sending-end switch based on the number of sending ends connected to the sending-end switch and the receiving rate of the receiving-end switch, the sending rate is allocated to the sending-end switch based on the number of sending ends connected to the sending-end switch, if the number of sending ends connected to the sending-end switch is large, a large rate is allocated, and if the number of sending ends connected to the sending-end switch is small, a small rate is allocated, so that each sending-end switch in which an incast condition occurs can stably output data.
Adopt above-mentioned scheme, this scheme only needs to arrange the procedure at the sending end switch and the receiving end switch of RDMA network, utilize the perception of sending end switch incast condition, when the incast condition takes place, buffer the data stream, avoid a large amount of data streams to rush into the RDMA network and cause the data to block up, on the other hand, the receiving end switch calculates the maximum receiving rate of receiving end switch according to the condition of user's receiving end, and distribute sending rate for every sending end switch, output the data that sending end switch cached fast and high-efficiently, this scheme need not arrange between the switch between sending end switch and receiving end switch, the treatment effeciency of the incast condition has not only been improved, and reduced and arranged the degree of difficulty, accelerate procedural processing speed.
In some embodiments of the present invention, the sender switch that has an Incast condition performs rate limiting on the transmission of the Incast traffic according to a token bucket mechanism.
The token bucket mechanism is a rate limiting algorithm commonly used by a switch, tokens are generated according to a set rate, the burst amount of data is limited according to the residual quantity of the tokens, so that the data transmission rate is limited, and after the tokens are transmitted, the tokens are returned to the token bucket to continue to execute an output task.
In some embodiments of the present invention, the step of sensing an incast condition based on a sending-end switch includes:
the sending end switch counts the data volume sent to the same user receiving end by the sending end switch within a preset time length;
and when the data volume sent to the same user receiving end through the sending end switch within the preset time length is larger than a preset incast threshold value, judging that the incast condition occurs.
In some embodiments of the present invention, the sending-end switch and the receiving-end switch both maintain a data flow table for recording internal data flows of the sending-end switch or the receiving-end switch, and the data flow table is emptied once every preset time interval;
setting a data sub-flow table for each sending end connected with the sending end switch in a sending end switch data flow table;
in the specific implementation process, when the data volume in the data flow table of the sender switch is larger than a preset incast threshold, it is determined that an incast condition occurs.
In some embodiments of the present invention, when the size of the data volume sent to the same user receiving end through the sending end switch within a preset time length is greater than a preset incast threshold, it is determined that an incast condition occurs, the sending end switch performs an incast flag, when the data volume cached in the sending end switch is less than or equal to the incast threshold, the incast flag is removed, and when the sending end switch removes the incast flag, the sending end switch normally sends the data.
In some embodiments of the present invention, the step of sensing an incast condition based on the sender switch further includes, after it is determined that the incast condition occurs, the sender switch continuously transmits a data volume of a first preset size to the user receiver; if incast occurs, caching the data flow in the transmitting end switch, and caching the data flow in the transmitting end switch after the transmitting end switch transmits the data volume with the first preset size to the user receiving end.
By adopting the scheme, if the data transmission of the sending end switch is stopped immediately after the incast condition is judged to occur, if the waste of the bandwidth on the link when the receiving end switch waits for the allocated rate value is caused, the sending end switch continues to transmit the data volume with the first preset size to the user receiving end after the incast condition is judged to occur, and the waste of the bandwidth on the link when the receiving end switch waits for the allocated rate value is avoided.
In some embodiments of the present invention, if an incast occurs, the step of caching the data stream in the sending-end switch further includes constructing a queue for the data stream cached in the sending-end switch based on an input time of the data stream, and in the step of sending data to the receiving-end switch by the sending-end switch based on the sending rate in the incast, the sending-end switch sends data to the receiving-end switch based on a queue sequence.
In some embodiments of the present invention, in the step of calculating the size of the data amount that can be received by the user receiving end in the unit time based on the network card receiving rate, the size of the data amount that can be received by the user receiving end in the unit time is calculated based on the following formula:
Δ=B dip ×τ update
wherein, B dip Indicates the network card receiving rate of the user receiving end, tau update The time length of unit time is represented, and delta represents the size of data volume which can be received by a user receiving end in unit time;
in the step of calculating the link utilization rate of the receiving-end switch based on the size of the data volume which can be received by the user receiving end in unit time, the link utilization rate of the receiving-end switch is calculated based on the following formula:
Figure BDA0003872457620000091
wherein, Δ represents the data volume that can be received by the user receiving end in unit time, qlen represents the data volume cached by the port connected between the current receiving end switch and the user receiving end, and η represents the link utilization rate of the receiving end switch.
In some embodiments of the present invention, in the step of calculating the size of the data amount that can be received by the receiving-end switch in unit time based on the link utilization, the size of the data amount that can be received by the receiving-end switch in unit time calculated in a previous unit time is obtained, and the size of the data amount that can be received by the receiving-end switch in unit time is calculated based on the size of the data amount that can be received by the receiving-end switch in unit time calculated in the previous unit time and the link utilization.
In some embodiments of the present invention, in the step of calculating the size of the data amount that can be received by the receiver switch in the current unit time based on the size of the data amount that can be received by the receiver switch in the previous unit time and the link utilization ratio, the size of the data amount that can be received by the receiver switch in the unit time is calculated based on the following formula:
Figure BDA0003872457620000092
wherein, B dip Indicating the network card receiving rate, tau, of the user receiving end update Represents the time length per unit time, eta represents the link utilization of the receiver switch, W old Represents the data volume W calculated in the previous unit time and can be received by the receiving end switch in the unit time new The data size that the receiver switch can receive at the current unit time is shown, and both alpha and beta are constants.
In some embodiments of the present invention, the calculated size of the data amount that can be received by the receiving-end switch in the current unit time is used in the next calculation as the size of the data amount that can be received by the receiving-end switch in the unit time calculated in the previous unit time to participate in the next calculation.
By adopting the scheme, the data volume which can be received by the receiving end switch in the unit time next unit time is calculated based on the data volume which can be received by the receiving end switch in the unit time before the receiving end switch is calculated, so that the calculation accuracy is improved.
In some embodiments of the present invention, in the step of calculating the receiving rate of the receiver switch based on the size of the data amount that the receiver switch can receive in a unit time, the receiving rate of the receiver switch is calculated based on the following formula:
Figure BDA0003872457620000101
wherein R is new Indicating the receiving rate, W, of the receiving side switch new Indicating the size of the data, τ, that the receiver switch can receive at the current unit time update Time of unit timeThe length of the gap.
In some embodiments of the present invention, in the step of calculating the transmission rate of each transmitting-end switch based on the number of transmitting-ends connected to the transmitting-end switch and the receiving rate of the receiving-end switch, the transmission rate of each transmitting-end switch is calculated based on the following formula:
Figure BDA0003872457620000102
wherein R is ToR Indicating the currently calculated sending rate, R, of the sending-end switch new Indicating the receiving rate, num, of the receiving side switch ToR Indicates the number of senders, num, connected to the sender switch currently being computed Indicating the total number of senders connected to the sender switch where the incast condition occurred.
By adopting the scheme, the scheme is based on the data volume which can be received by the receiving end switch in unit time and calculated by the receiving end switch in the previous unit time, the data volume which can be received by the receiving end switch in unit time next unit time is calculated, the total receiving rate of the receiving end switch is calculated based on the data volume which can be received by the receiving end switch in unit time next unit time, the sending rate is distributed to each sending end switch based on the quantity of the sending ends connected with the sending end switch which is calculated currently, the sending rate can be distributed to each sending end switch every other unit time, the maximum rate transmission can be ensured, and the incast condition is solved.
Another aspect of the present invention also provides a data center RDMA Incast solution system based on a ToR switch, the system including a sending side switch management module and a receiving side switch management module,
the sending end switch management module is used for sensing an incast condition based on a sending end switch, and caching the data stream in the sending end switch if the incast condition occurs;
the receiving end switch management module is used for obtaining the network card receiving rate of a user receiving end based on a receiving end switch, calculating the data volume size that the user receiving end can receive in unit time based on the network card receiving rate, calculating the link utilization rate of the receiving end switch based on the data volume size that the user receiving end can receive in unit time, calculating the data volume size that the receiving end switch can receive in unit time based on the link utilization rate, calculating the receiving rate of the receiving end switch based on the data volume size that the receiving end switch can receive in unit time, calculating the sending rate of each sending end switch based on the sending end quantity connected with the sending end switch and the receiving rate of the receiving end switch, and generating an incast condition that the sending end switch sends data to the receiving end switch based on the sending rate.
By adopting the scheme, the system only needs to arrange programs at the sending end switch and the receiving end switch of the RDMA network, and utilizes the sending end switch to sense the incast condition, when the incast condition occurs, the data stream is cached, so that the phenomenon that a large number of data streams flow into the RDMA network to cause data congestion is avoided.
In the prior art, the RDMA Incast problem presents a huge challenge to the data center network. Since the traditional RDMA congestion control scheme employs end-to-end control and heuristic algorithms to deal with in-network congestion. After the Incast occurs, a buffer area of the switch accumulates a large amount of data, and the problems of triggering of a PFC mechanism and blocking of a head of a queue, even a serious packet loss phenomenon, are caused. In addition, triggering of the PFC mechanism can also cause problems such as pause frame storm, deadlock, etc.
The goal of RDMA Incast control is to minimize the impact of Incast on non-Incast streaming without limiting its normal transmission. The method has the key points of fast and accurate identification and flow scheduling of the Incast and reduction of accumulation of the Incast flow in a buffer area of the switch. The scheme of coarse-grained Incast identification and flow scheduling is achieved through a dynamic window mechanism of a switch and a VOQ, the coarse-grained control mechanism cannot quickly identify the Incast, a large amount of Incast flow is accumulated in a buffer area of the switch at the beginning stage of the Incast, and transmission of non-Incast flow is seriously affected.
According to the scheme, the Incast is quickly and accurately identified and isolated through the sending end switch and the receiving end switch, incast flow accumulation in a switch buffer area is reduced, and the transmission rate of the Incast flow is controlled through a token bucket mechanism, so that the Incast transmission rate is quickly converged.
The main advantages of the scheme include:
1. the Incast flow is quickly and accurately identified and isolated, and the Incast flow accumulation of a buffer area of the switch is reduced;
2. a token bucket mechanism is adopted to accurately control the transmission rate of the Incast flow, so that the Incast flow is quickly converged to the available bandwidth of a bottleneck link, and the influence of the Incast on the transmission of the non-Incast flow is reduced;
3. the method is only deployed on a sending end switch and a receiving end switch, is compatible with the existing transmission control scheme, and has low hardware overhead.
The embodiment of the present invention further provides a data center RDMAIncast solution device based on a ToR switch, where the device includes a computer device, where the computer device includes a processor and a memory, where the memory stores computer instructions, the processor is configured to execute the computer instructions stored in the memory, and when the computer instructions are executed by the processor, the device implements the steps implemented by the foregoing method.
An embodiment of the present invention further provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the steps implemented by the above data center RDMAIncast solution based on a ToR switch. The computer readable storage medium may be a tangible storage medium such as Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, floppy disks, hard disks, removable storage disks, CD-ROMs, or any other form of storage medium known in the art.
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein may be implemented as hardware, software, or combinations thereof. Whether this is done in hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link.
It is to be understood that the invention is not limited to the specific arrangements and instrumentality described above and shown in the drawings. A detailed description of known methods is omitted herein for the sake of brevity. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and illustrated, and those skilled in the art can make various changes, modifications and additions, or change the order between the steps, after comprehending the spirit of the present invention.
Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments in the present invention.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A data center RDMA Incast solution method based on a TOR switch is characterized in that an RDMA network comprises a sending end switch connected with a sending end and a receiving end switch connected with a user receiving end, the sending end switch and the receiving end switch are both the TOR switch, and the method comprises the following steps:
based on the fact that a sending end switch senses an incast condition, if the incast condition occurs, caching the data stream in the sending end switch;
the network card receiving rate of a user receiving end is obtained based on a receiving end switch, the data volume size which can be received by the user receiving end in unit time is calculated based on the network card receiving rate, the link utilization rate of the receiving end switch is calculated based on the data volume size which can be received by the user receiving end in unit time, the data volume size which can be received by the receiving end switch in unit time is calculated based on the link utilization rate, the receiving rate of the receiving end switch is calculated based on the data volume size which can be received by the receiving end switch in unit time, the sending rate of each sending end switch is calculated based on the sending end quantity which is connected with the sending end switch and the receiving rate of the receiving end switch, and the incast condition occurs, the sending end switch sends data to the receiving end switch based on the sending rate.
2. The ToR-switch-based data center RDMA Incast solution of claim 1, wherein the sender-switch-aware-Incast-based step comprises:
the sending end switch counts the data size sent to the same user receiving end by the sending end switch within a preset time length;
and when the data volume sent to the same user receiving end through the sending end switch within the preset time length is larger than a preset incast threshold value, judging that the incast condition occurs.
3. The ToR-switch-based RDMA Incast solution method according to claim 1 or 2, wherein the step of sensing an Incast condition based on the sender-side switch further comprises the sender-side switch continuing to transmit a first preset amount of data to the user receiver when the Incast condition is determined to occur; if incast occurs, caching the data flow in the transmitting end switch, and caching the data flow in the transmitting end switch after the transmitting end switch transmits the data volume with the first preset size to the user receiving end.
4. The RDMA Incast solution for a ToR switch-based data center according to claim 1, wherein if an Incast condition occurs, the step of buffering the data stream in the sending-end switch further comprises constructing a queue for the data stream buffered in the sending-end switch based on an input time of the data stream, and in the step of sending data to the receiving-end switch based on a sending rate by the sending-end switch in the Incast condition, the sending-end switch sends data to the receiving-end switch based on a queue order.
5. The method for the ToR switch-based RDMA Incast solution of the data center, according to claim 1, wherein in the step of calculating the size of the data volume that can be received by the user receiver per unit time based on the network card receiving rate, the size of the data volume that can be received by the user receiver per unit time is calculated based on the following formula:
Δ=B dip ×τ update
wherein, B dip Indicates the network card receiving rate of the user receiving end, tau update The time length of unit time is represented, and delta represents the size of data volume which can be received by a user receiving end in unit time;
in the step of calculating the link utilization rate of the receiving-end switch based on the size of the data volume which can be received by the user receiving end in unit time, the link utilization rate of the receiving-end switch is calculated based on the following formula:
Figure FDA0003872457610000021
wherein, Δ represents the data volume that the user receiving end can receive in unit time, glen represents the data volume cached at the port connected between the current receiving end switch and the user receiving end, and η represents the link utilization rate of the receiving end switch.
6. The RDMA Incast solution for the ToR-switch-based data center according to claim 1, wherein in the step of calculating the size of the data volume that can be received by the sink switch in a unit time based on the link utilization, the size of the data volume that can be received by the sink switch in a unit time calculated in a previous unit time is obtained, and the size of the data volume that can be received by the sink switch in a current unit time is calculated based on the size of the data volume that can be received by the sink switch in a unit time calculated in a previous unit time and the link utilization.
7. The RDMA Incast solution method based on ToR switch, according to claim 6, wherein in the step of calculating the size of the data amount that can be received by the sink switch in the current unit time based on the size of the data amount that can be received by the sink switch in the previous unit time calculated by the sink switch and the link utilization, the size of the data amount that can be received by the sink switch in the current unit time is calculated based on the following formula:
Figure FDA0003872457610000022
wherein, B dip Indicates the network card receiving rate of the user receiving end, tau update Represents the time length per unit time, eta represents the link utilization of the receiver switch, W old Represents the calculated receiving end intersection of the previous unit timeThe size of the data volume, W, that the switch can receive in unit time new The size of data that the receiver switch can receive in the current unit time is shown, and both alpha and beta are constants.
8. The ToR-switch-based RDMA Incast solution of the data center, according to claim 1, wherein in the step of calculating the receiving rate of the sink switch based on a size of an amount of data that the sink switch can receive in a unit time, the receiving rate of the sink switch is calculated based on a formula:
Figure FDA0003872457610000031
wherein R is new Indicating the receiving rate, W, of the receiving side switch new Indicating the size of the data, τ, that the receiver switch can receive at the current unit time update Indicating the length of time per unit time.
9. The ToR switch-based data center RDMA Incast solution of claim 1, wherein in the step of calculating the sending rate of each sending-end switch based on the number of sending ends connected to the sending-end switch and the receiving rate of the receiving-end switch, the sending rate of each sending-end switch is calculated based on the following formula:
Figure FDA0003872457610000032
wherein R is ToR Indicating the currently calculated sending rate, R, of the sending-end switch new Indicating the receiving rate, num, of the receiving side switch ToR Indicates the number of senders, num, connected to the sender switch currently being computed Indicating the total number of senders connected to the sender switch where the incast condition occurred.
10. A data center RDMA Incast solution system based on a ToR switch is characterized by comprising a sending end switch management module and a receiving end switch management module,
the sending end switch management module is used for sensing an incast condition based on a sending end switch, and caching the data stream in the sending end switch if the incast condition occurs;
the receiving end switch management module is used for obtaining the network card receiving rate of a user receiving end based on a receiving end switch, calculating the data volume size that the user receiving end can receive in unit time based on the network card receiving rate, calculating the link utilization rate of the receiving end switch based on the data volume size that the user receiving end can receive in unit time, calculating the data volume size that the receiving end switch can receive in unit time based on the link utilization rate, calculating the receiving rate of the receiving end switch based on the data volume size that the receiving end switch can receive in unit time, calculating the sending rate of each sending end switch based on the sending end quantity connected with the sending end switch and the receiving rate of the receiving end switch, and generating an incast condition that the sending end switch sends data to the receiving end switch based on the sending rate.
CN202211200973.0A 2022-09-29 2022-09-29 ToR switch-based data center RDMA Incast solution method and system Pending CN115714750A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211200973.0A CN115714750A (en) 2022-09-29 2022-09-29 ToR switch-based data center RDMA Incast solution method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211200973.0A CN115714750A (en) 2022-09-29 2022-09-29 ToR switch-based data center RDMA Incast solution method and system

Publications (1)

Publication Number Publication Date
CN115714750A true CN115714750A (en) 2023-02-24

Family

ID=85230907

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211200973.0A Pending CN115714750A (en) 2022-09-29 2022-09-29 ToR switch-based data center RDMA Incast solution method and system

Country Status (1)

Country Link
CN (1) CN115714750A (en)

Similar Documents

Publication Publication Date Title
CN109120544B (en) Transmission control method based on host end flow scheduling in data center network
EP3468119B1 (en) Method, apparatus and device for balancing load
US7016971B1 (en) Congestion management in a distributed computer system multiplying current variable injection rate with a constant to set new variable injection rate at source node
US20220303217A1 (en) Data Forwarding Method, Data Buffering Method, Apparatus, and Related Device
Hafeez et al. Detection and mitigation of congestion in SDN enabled data center networks: A survey
US20060203730A1 (en) Method and system for reducing end station latency in response to network congestion
JP2003518817A (en) Network switching method using packet scheduling
US11870698B2 (en) Congestion control method and apparatus, communications network, and computer storage medium
CN113949665B (en) Method, device, chip and computer storage medium for determining flow control threshold
US11165705B2 (en) Data transmission method, device, and computer storage medium
CN112995048B (en) Blocking control and scheduling fusion method of data center network and terminal equipment
CN110177051A (en) Data center's jamming control method based on fluidics
EP3108631B1 (en) Buffer bloat control
Khan et al. RecFlow: SDN-based receiver-driven flow scheduling in datacenters
US11805071B2 (en) Congestion control processing method, packet forwarding apparatus, and packet receiving apparatus
CN115714750A (en) ToR switch-based data center RDMA Incast solution method and system
CN114979011A (en) Congestion control method applied to park network
Bai et al. Ssp: Speeding up small flows for proactive transport in datacenters
Le et al. SFC: Near-source congestion signaling and flow control
Chen et al. On meeting deadlines in datacenter networks
TWI831622B (en) Apparatus for managing network flow congestion and method thereof
CN115914106B (en) Self-adaptive buffer method for network traffic forwarding
CN114765585B (en) Service quality detection method, message processing method and device
WO2023123075A1 (en) Data exchange control method and apparatus
Zhang Burst Forwarding Network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination