CN117579559A - RoCEv2 congestion control method based on AI
- Publication number
- CN117579559A (application CN202410064926.0A)
- Authority
- CN
- China
- Prior art keywords
- link
- congestion
- data
- branch
- main
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L47/00—Traffic control in data switching networks
- H04L47/10—Flow control; Congestion control
- H04L47/12—Avoiding congestion; Recovering from congestion
- H04L47/125—Avoiding congestion; Recovering from congestion by balancing the load, e.g. traffic engineering
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/02—Topology update or discovery
- H04L45/08—Learning-based routing, e.g. using neural networks or artificial intelligence
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/24—Multipath
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/28—Routing or path finding of packets in data switching networks using route fault recovery
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L45/00—Routing or path finding of packets in data switching networks
- H04L45/44—Distributed routing
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L47/00—Traffic control in data switching networks
- H04L47/10—Flow control; Congestion control
- H04L47/26—Flow control; Congestion control using explicit feedback to the source, e.g. choke packets
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L47/00—Traffic control in data switching networks
- H04L47/10—Flow control; Congestion control
- H04L47/30—Flow control; Congestion control in combination with information about buffer occupancy at either end or at transit nodes
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/50—Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention discloses an AI-based RoCEv2 congestion control method in the technical field of congestion control. The method comprises: constructing a plurality of transmission links, interconnecting the transmission links using the RDMA protocol to build a topology link cluster network, and collecting link-related data through that network; constructing a link traffic model corresponding to the topology link cluster network from the link-related data, performing feedback training on the model, generating feedback link parameters for link construction optimization, and performing congestion identification through the model; and obtaining the congestion identification result, performing congestion control based on that result, evaluating the link traffic condition after congestion control, and generating a control scheme. The method thereby prevents congestion before it occurs and relieves it promptly when it does occur.
Description
Technical Field
The invention relates to the technical field of congestion control, in particular to an AI-based RoCEv2 congestion control method.
Background
RoCEv2 is one of the most widely used interconnection technologies in current high-performance computing clusters. It uses the RDMA protocol and offers low latency and high bandwidth. Under high load, however, a RoCEv2 network is prone to congestion, which causes packet loss and increased delay during data transmission and degrades system performance.
Effective congestion control in RoCEv2 networks is therefore an important research direction. Three problems need to be addressed: how to prevent congestion before it occurs, how to relieve congestion promptly when it occurs, and how to evaluate the overall situation after congestion control is complete.
Disclosure of Invention
In order to solve the above problems, an object of the present invention is to provide an AI-based RoCEv2 congestion control method.
The object of the invention can be achieved by the following technical scheme: an AI-based RoCEv2 congestion control method comprising the following steps:
step S1: constructing a plurality of transmission links, interconnecting the transmission links using the RDMA protocol to build a topology link cluster network, and collecting link-related data through the topology link cluster network;
step S2: constructing a link traffic model corresponding to the topology link cluster network from the link-related data, performing feedback training on the link traffic model, generating feedback link parameters for link construction optimization, and performing congestion identification through the link traffic model;
step S3: obtaining the congestion identification result, performing congestion control based on that result, evaluating the link traffic condition after congestion control, and generating a control scheme.
Further, the process of constructing a plurality of transmission links and interconnecting them using the RDMA protocol to build the topology link cluster network comprises:
arranging a plurality of hosts at a preset plurality of link points, numbering the hosts i and the link points j, where i = 1, 2, 3, ..., n, j = 1, 2, 3, ..., m, and n and m are natural numbers greater than 0; obtaining a plurality of topological point sequence pairs, denoted L, where L = <i, j>; each host has two real-time states, "working" and "standby"; topological point sequence pairs are designated as link start points and link end points; the host at each link start point sends a communication request, the next host in the "standby" state receives the request and establishes a communication relationship, then generates and sends a new communication request to the host corresponding to the next topological point sequence pair; this operation is repeated until the communication request reaches any one of the link end points, thereby constructing a plurality of transmission links;
configuring a switch for each transmission link, the switch being used for communication interconnection between the transmission links; each switch has a sending area and a cache area; the cache area is registered using the RDMA protocol, and after registration the acquired transmission data is converted into RDMA local memory data; the sending area of the current switch sends a grabbing request to the sending area of the next switch; the grabbing request obtains the address and cache area information of the next switch and returns them to the current switch, and the RDMA local memory data in the cache area of the current switch is then transmitted to the next switch; once all switches have received the RDMA local memory data, the communication interconnection of the transmission links via the RDMA protocol has succeeded and the topology link cluster network is built.
Further, the process of collecting the link related data through the topological link cluster network comprises the following steps:
the topology link cluster network sets a data acquisition time, a link rest time and a data verification time; during the data acquisition time the network collects link-related data and, when a failed transmission link is detected, generates a link maintenance early warning; during the link rest time an administrator overhauls the failed transmission link identified by the early warning; during the data verification time the link-related data is obtained and imported into a preset data cleaning program for cleaning, after which its data format is compared with a preset standard format and the corresponding operation is executed.
Further, the process of constructing the link traffic model according to the link related data includes:
the link-related data comprises link bandwidth, link packet loss rate, link delay, link queue length and link type; the link types are main link and branch link, and each link type has its own link parameters, namely a bandwidth utilization threshold, a packet loss upper limit, a delay threshold and a congestion judgment length; the link-related data and link parameters of the main links are summarized to generate a plurality of main link model fragments, and those of the branch links are summarized to generate a plurality of branch link model fragments; each intersection of a main link and a branch link is marked as a splicing point, and the main link model fragments and branch link model fragments are spliced together at the splicing points to construct the link traffic model.
Further, the process of performing the feedback training on the link traffic model and further generating feedback link parameters to perform the link construction optimization includes:
a primary link traffic monitoring area is mapped for the main link and a secondary link traffic monitoring area is mapped for the branch link through the link traffic model; the model AI computing power is obtained and feedback training of the link traffic model is started; feedback link parameters are generated and transmitted to an administrator, who arranges for operation and maintenance personnel to perform link construction optimization.
Further, the process of performing the congestion identification through the link traffic model includes:
the link bandwidth, link packet loss rate, link delay and link queue length of the main link and of the branch link are obtained through the link traffic model and compared with the bandwidth utilization threshold, packet loss upper limit, delay threshold and congestion judgment length of the respective link, generating the corresponding congestion risk coefficients; a congestion threshold is set for the main link and for the branch link; the congestion risk coefficients of each link are accumulated and compared with that link's congestion threshold, and the congestion identification result is generated according to the comparison.
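The identification rule above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the class and function names are hypothetical, each measured value is compared with its threshold using "greater than or equal to" as in the detailed description, and the numeric values of the risk coefficients and the congestion threshold are assumptions, since the text does not specify them.

```python
from dataclasses import dataclass


@dataclass
class LinkThresholds:
    """Link parameters of one link type (main or branch)."""
    bandwidth_utilization: float  # bandwidth utilization threshold
    packet_loss_limit: float      # packet loss upper limit
    delay: float                  # delay threshold
    queue_length: float           # congestion judgment length


@dataclass
class LinkMeasurement:
    """Values reported by the link traffic model for one link."""
    bandwidth: float
    packet_loss_rate: float
    delay: float
    queue_length: float


def identify_congestion(measured, thresholds, coefficients, congestion_threshold):
    """Return 1 if the accumulated risk coefficients reach the threshold, else 0.

    `coefficients` holds one risk coefficient per check, in the order
    bandwidth, packet loss, delay, queue length (values are assumptions).
    """
    exceeded = (
        measured.bandwidth >= thresholds.bandwidth_utilization,
        measured.packet_loss_rate >= thresholds.packet_loss_limit,
        measured.delay >= thresholds.delay,
        measured.queue_length >= thresholds.queue_length,
    )
    # Only checks that fired contribute their coefficient to the total.
    risk = sum(c for hit, c in zip(exceeded, coefficients) if hit)
    return 1 if risk >= congestion_threshold else 0
```

The same function serves both link types: call it once with the main link's thresholds, coefficients and congestion threshold, and once with the branch link's.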
Further, the process of obtaining the congestion identification result and performing the congestion control based on the corresponding result includes:
when the congestion identification result of the main link or branch link is "0", no congestion control is performed; when it is "1", the congested main link or branch link is located and a data transmission window is associated with it; the real-time link data amount of that link is obtained, a window up-regulation threshold and a window down-regulation threshold are set, the real-time link data amount is compared with the two thresholds, and the corresponding window adjustment operation is executed.
Further, the process of evaluating the link traffic condition after congestion control and generating the control scheme includes:
the number of congestion control operations is obtained and denoted Num1, and the number of successful congestion control operations is obtained and denoted Num2; the congestion control success rate, denoted Sc, is then $Sc = (Num2 / Num1) \times 100\%$; probability interval one, probability interval two and probability interval three are preset and denoted $\Omega_1$, $\Omega_2$ and $\Omega_3$ respectively;
where $\Omega_1 = (0, 0.6)$, $\Omega_2 = [0.6, 0.85]$, $\Omega_3 = (0.85, 1)$;
when $Sc \in \Omega_1$, the evaluation result is "bad";
when $Sc \in \Omega_2$, the evaluation result is "good";
when $Sc \in \Omega_3$, the evaluation result is "excellent";
when the evaluation result is "bad" or "good", the value of the model AI computing power is increased; when the evaluation result is "excellent", all operations performed on the data transmission window are summarized and recorded in a preset scheme template, thereby generating the control scheme.
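The evaluation step can be illustrated with a short sketch. It computes the success rate as the fraction of congestion control operations that succeeded (Num2 out of Num1) and maps it onto the three stated intervals. The function names and the action labels are hypothetical; because the stated intervals are open at 0 and 1, the sketch returns no rating at those endpoints rather than guessing one.

```python
def success_rate(num_controls, num_successes):
    """Fraction of congestion control operations that succeeded (Num2 / Num1)."""
    if num_controls <= 0:
        raise ValueError("at least one congestion control operation is required")
    return num_successes / num_controls


def evaluate(sc):
    """Map a success rate in [0, 1] to the rating of the interval it falls in."""
    if 0 < sc < 0.6:
        return "bad"
    if 0.6 <= sc <= 0.85:
        return "good"
    if 0.85 < sc < 1:
        return "excellent"
    return None  # 0 and 1 lie outside the stated intervals


def next_action(rating):
    """Choose the follow-up step for a rating (action labels are illustrative)."""
    if rating in ("bad", "good"):
        return "increase_model_ai_computing_power"
    if rating == "excellent":
        return "record_window_operations_as_control_scheme"
    raise ValueError(f"no action defined for rating {rating!r}")
```

For example, 9 successes out of 10 control operations give a rate of 0.9, which falls in the third interval, so the window operations would be recorded as a control scheme.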
Compared with the prior art, the invention has the beneficial effects that:
1. A plurality of transmission links are constructed, the transmission data received by the switches is converted into RDMA local memory data using the RDMA protocol, and the transmission links are interconnected to build the topology link cluster network; a data acquisition time, a link rest time and a data verification time are set for collecting link-related data. On the one hand, the RDMA protocol uses direct host memory access, which reduces CPU involvement during data transmission and thereby improves transmission performance and efficiency. On the other hand, failed transmission links are detected during the data acquisition time and overhauled during the link rest time, and the compliance of the link-related data is verified during the data verification time, so that faults are found and removed promptly and the link-related data is correct and compliant.
2. A link traffic model corresponding to the topology link cluster network is constructed from the link-related data, and the main links and branch links are calibrated. A training period is set for feedback training, and the link traffic model is marked as a compliance model once the preset conditions are met, which improves the prediction accuracy of the model to a certain extent. After feedback training, the link traffic model generates feedback link parameters for the main link and the branch link respectively; these parameters indicate the congestion risk of the current link and are used for link construction optimization, so that congestion is prevented before it occurs.
3. If congestion identification shows that the topology link cluster network is still congested after link construction optimization is complete, the congested main link or branch link is located and a data transmission window is set for congestion control, so that congestion is found and cleared promptly. After congestion control is complete, the link traffic condition is evaluated and a corresponding control scheme is generated, which can be reused for congestion control in subsequent identical congestion situations.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
As shown in fig. 1, an AI-based RoCEv2 congestion control method includes the steps of:
step S1: constructing a plurality of transmission links, interconnecting the transmission links using the RDMA protocol to build a topology link cluster network, and collecting link-related data through the topology link cluster network;
step S2: constructing a link traffic model corresponding to the topology link cluster network from the link-related data, performing feedback training on the link traffic model, generating feedback link parameters for link construction optimization, and performing congestion identification through the link traffic model;
step S3: obtaining the congestion identification result, performing congestion control based on that result, evaluating the link traffic condition after congestion control, and generating a control scheme.
Specifically, the process of constructing the plurality of transmission links and adopting the RDMA protocol to carry out communication interconnection among the transmission links so as to construct the topological link cluster network comprises the following steps:
arranging a plurality of hosts at a preset plurality of link points, numbering the hosts and the link points respectively, and recording the numbers of the hosts and the link points as i and j, wherein i=1, 2,3, … …, n, j=1, 2,3, … …, m, and n and m are natural numbers larger than 0;
obtaining a plurality of topological point sequence pairs, denoted L, where L = <i, j>; each host has two real-time states, "working" and "standby"; a fixed number A of topological point sequence pairs are set as link start points and a fixed number B as link end points;
the host at each link start point sends a communication request; the next host in the "standby" state receives the request and establishes a communication relationship; that host then generates and sends a new communication request to the host corresponding to the next topological point sequence pair; this operation is repeated until the communication request reaches any one of the link end points, thereby constructing a plurality of transmission links;
taking the topological point sequence pairs included in each transmission link as identifier sub-symbols, and concatenating these sub-symbols to generate a symbol sequence string, denoted St-ID[k], where k is the number of the transmission link, k = 1, 2, 3, ..., z, and z is a natural number greater than 0;
configuring a switch for each transmission link, the switch being used for communication interconnection between the transmission links; the switch obtains the communication permissions of a plurality of hosts and thereby obtains their transmission data; each switch has a sending area and a cache area;
registering the cache area using the RDMA protocol, and converting the transmission data into RDMA local memory data after registration is complete; the sending area of the current switch sends a grabbing request to the sending area of the next switch; the grabbing request obtains the address and cache area information of the next switch and returns them to the current switch, after which the RDMA local memory data in the cache area of the current switch is transmitted to the next switch;
a switch acts as the sender when it sends a grabbing request and as the receiver when it receives one;
once all switches have received the RDMA local memory data, the communication interconnection of the transmission links via the RDMA protocol has succeeded, and the topology link cluster network is built.
It should be noted that the symbol sequence string St-ID[k] serves as the unique identity of each transmission link, which facilitates monitoring and management of the link in subsequent steps. The fixed number A and the fixed number B are equal in value and can be changed, and the number of constructed transmission links equals A and B. By binding one switch to each transmission link and processing each switch using the RDMA protocol, different transmission links are interconnected and the topology link cluster network is built. The RDMA protocol uses direct host memory access, which reduces CPU involvement during data transmission and improves transmission performance and efficiency.
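As an illustration of how a symbol sequence string St-ID[k] might be built and used as a unique link identity, consider the sketch below. The text only says that the topological point sequence pairs of a link are concatenated; the textual encoding `<i,j>`, the `-` separator and the function names are assumptions made for this example.

```python
def build_st_id(pairs):
    """Concatenate the <i, j> pairs of one transmission link into a string."""
    return "-".join(f"<{i},{j}>" for i, j in pairs)


def index_links(links):
    """Map each St-ID to its link number k, rejecting duplicate identities.

    `links` maps a link number k to the ordered list of (i, j) pairs that
    the link passes through, from its start point to its end point.
    """
    index = {}
    for k, pairs in links.items():
        st_id = build_st_id(pairs)
        if st_id in index:
            raise ValueError(f"links {index[st_id]} and {k} share St-ID {st_id}")
        index[st_id] = k
    return index


if __name__ == "__main__":
    links = {
        1: [(1, 1), (2, 3), (4, 2)],
        2: [(3, 1), (5, 4)],
    }
    for st_id, k in index_links(links).items():
        print(f"St-ID[{k}] = {st_id}")
```

Keeping a reverse index from St-ID to link number is one simple way to locate a failed link from its identity in later steps.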
Specifically, the process of collecting the link related data through the topological link cluster network includes:
the topology link cluster network sets a data acquisition time, a link rest time and a data verification time, denoted $T_{\text{acq}}$, $T_{\text{rest}}$ and $T_{\text{ver}}$ respectively;
during the data acquisition time $T_{\text{acq}}$, the topology link cluster network collects link-related data; the collection speed is obtained and denoted $V$, the historical average collection speed is obtained and denoted $V_{\text{avg}}$, and a link critical speed is preset and denoted $V_{\text{crit}}$; if $V \ge V_{\text{avg}}$ and $V < V_{\text{crit}}$, no operation is performed; otherwise, the current topology link cluster network contains a failed transmission link, which is located according to St-ID[k], and a link maintenance early warning is generated and sent to a preset administrator;
during the link rest time $T_{\text{rest}}$, the administrator overhauls the failed transmission link according to the link maintenance early warning, generates an overhaul report when the overhaul is finished, and stores the report in a preset overhaul database;
during the data verification time $T_{\text{ver}}$, the link-related data is obtained and imported into a preset data cleaning program, which screens out duplicate, redundant and incomplete data; after cleaning, the data format of the link-related data is obtained; if it does not conform to the preset standard format, it is converted into the standard format; otherwise no conversion is performed;
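The acquisition-time fault check and the verification-time cleaning can be sketched as below. This is a hedged illustration only: it reads the speed condition as "no action when the collection speed is at least the historical average and below the critical speed", and the record field names, the treatment of extra fields as redundant data, and the choice of floats as the "standard format" are all assumptions.

```python
NUMERIC_FIELDS = ("bandwidth", "packet_loss_rate", "delay", "queue_length")
REQUIRED_FIELDS = ("link_id",) + NUMERIC_FIELDS


def has_failed_link(v, v_avg, v_crit):
    """True when the collection speed V falls outside V_avg <= V < V_crit."""
    return not (v_avg <= v < v_crit)


def clean_records(records):
    """Drop incomplete and duplicate records, strip extra fields, normalize types."""
    seen = set()
    cleaned = []
    for record in records:
        # Incomplete data: any required field missing or empty.
        if any(record.get(field) is None for field in REQUIRED_FIELDS):
            continue
        # Redundant data: keep only the required fields; convert numbers to float.
        row = {"link_id": str(record["link_id"])}
        for field in NUMERIC_FIELDS:
            row[field] = float(record[field])
        # Duplicate data: skip rows already seen after normalization.
        key = tuple(row[field] for field in REQUIRED_FIELDS)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned
```

Normalizing before de-duplicating means that two records differing only in representation (for example `"10"` and `10.0`) are treated as the same record.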
specifically, the process of constructing the link traffic model according to the link related data includes:
acquiring link related data correspondingly acquired in data acquisition time, wherein the link related data comprises link bandwidth, link packet loss rate, link delay, link queue length and link type;
the link types comprise a main link and a branch link, and different link types are provided with corresponding link parameters, wherein the link parameters comprise a bandwidth utilization threshold, an upper packet loss limit, a delay threshold and a congestion judging length;
the bandwidth utilization threshold, packet loss upper limit, delay threshold and congestion judgment length of the main link are denoted $B_{\text{main}}$, $P_{\text{main}}$, $Lat_{\text{main}}$ and $L_{\text{main}}$ respectively, and those of the branch link are denoted $B_{\text{branch}}$, $P_{\text{branch}}$, $Lat_{\text{branch}}$ and $L_{\text{branch}}$ respectively;
Summarizing the link related data and the link parameters of the main link to generate a plurality of main link model fragments, summarizing the link related data and the link parameters of the branch link to generate a plurality of branch link model fragments;
marking the intersection point of the main link and the branch link as a splicing point, and then splicing and synthesizing the main link model segment and the branch link model segment at a plurality of splicing points to construct a link flow model;
specifically, the process of performing the feedback training on the link traffic model and further generating feedback link parameters to perform the link construction optimization includes:
mapping a primary link traffic monitoring area for the main link and a secondary link traffic monitoring area for the branch link through the link traffic model, and obtaining the model AI computing power associated with the link traffic model, denoted AIOPS;
setting an initial value for the model AI computing power, denoted d, i.e. AIOPS = d, and starting feedback training of the link traffic model;
setting a training period; within the training period, the actual traffic values of the primary and secondary link traffic monitoring areas are obtained and denoted $Mb_1$ and $Mb_2$ respectively; the feedback-predicted traffic value of the primary link traffic monitoring area obtained through feedback training is denoted $Mb_1'$, and that of the secondary link traffic monitoring area is denoted $Mb_2'$;
permitted prediction deviation one and permitted prediction deviation two are preset and denoted $Q_1$ and $Q_2$ respectively; the primary and secondary link traffic differences, denoted $X_1$ and $X_2$, are then $X_1 = |Mb_1 - Mb_1'|$ and $X_2 = |Mb_2 - Mb_2'|$;
when $X_1 \le Q_1$ and $X_2 \le Q_2$ both hold, the link traffic model is marked as a compliance model and feedback training stops;
when either $X_1 > Q_1$ or $X_2 > Q_2$ holds, the link traffic model is marked as a non-compliance model; feedback coefficient one and feedback coefficient two are set and denoted $\alpha$ and $\beta$ respectively; feedback link parameters are generated from $\alpha$, $\beta$, $X_1$ and $X_2$, comprising the main link feedback parameter $G_1$ and the branch link feedback parameter $G_2$, with $G_1 = \alpha X_1$ and $G_2 = \beta X_2$;
the feedback link parameters are transmitted to the administrator, who arranges for operation and maintenance personnel to perform link construction optimization; the personnel obtain the main link feedback parameter $G_1$ and the branch link feedback parameter $G_2$, and congestion risk thresholds corresponding to $G_1$ and $G_2$ are preset and denoted $H_1$ and $H_2$ respectively;
if $G_1 \ge H_1$, the topology of the main link is optimized and the load rate and bandwidth of the main link are adjusted until $G_1 < H_1$; otherwise no operation is performed;
if $G_2 \ge H_2$, the topology of the branch link is optimized: a dynamic routing protocol is used to obtain the link state, real-time traffic and related routes of the branch link, and the routes are adjusted through the dynamic routing protocol to select the path for the transmission data, until $G_2 < H_2$; otherwise no operation is performed;
it should be noted that adjusting the topology optimizes the connection mode and path selection of the transmission links, improving the availability, fault tolerance and transmission efficiency of the network; for example, redundant links and multipath routing improve link reliability and load balancing; the load rate and bandwidth of the main link have corresponding upper limits; setting a training period for feedback training of the link traffic model reduces the model's prediction error; congestion risk thresholds $H_1$ and $H_2$ are set and compared with $G_1$ and $G_2$, and $G_1 \ge H_1$ or $G_2 \ge H_2$ indicates that the transmission link is at risk of congestion, so that link construction optimization is carried out in time and congestion is prevented before it occurs;
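The compliance check and the feedback link parameters described above can be sketched in a few lines. This is an illustration under stated assumptions, not the patented training procedure: the function names and return shapes are hypothetical, and because both feedback coefficients are listed as inputs, the sketch assumes the first coefficient scales the main link difference and the second scales the branch link difference.

```python
def feedback_step(mb1, mb1_pred, mb2, mb2_pred, q1, q2, alpha, beta):
    """Check prediction deviations and, if non-compliant, derive G1 and G2.

    mb1 / mb2 are the actual traffic values of the primary and secondary
    monitoring areas; mb1_pred / mb2_pred are the feedback-predicted values.
    """
    x1 = abs(mb1 - mb1_pred)  # primary link traffic difference X1
    x2 = abs(mb2 - mb2_pred)  # secondary link traffic difference X2
    if x1 <= q1 and x2 <= q2:
        # Compliance model: training stops, no feedback parameters needed.
        return {"compliant": True, "g1": None, "g2": None}
    # Non-compliance model: scale the differences by the feedback coefficients.
    return {"compliant": False, "g1": alpha * x1, "g2": beta * x2}


def links_to_optimize(g1, g2, h1, h2):
    """List which link types reach their congestion risk threshold."""
    targets = []
    if g1 >= h1:
        targets.append("main")
    if g2 >= h2:
        targets.append("branch")
    return targets
```

In use, `feedback_step` would be evaluated once per training period, and `links_to_optimize` would be re-evaluated after each optimization round until it returns an empty list.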
Specifically, the process of congestion identification through the link traffic model includes:
The link bandwidth, link packet loss rate, link delay and link queue length of each of the main link and the branch link in the topology link cluster network are obtained through the link traffic model;
The link bandwidth, link packet loss rate, link delay and link queue length of the main link are denoted $B'_{\text{main}}$, $P'_{\text{main}}$, $Lat'_{\text{main}}$ and $L'_{\text{main}}$ respectively, and those of the branch link are denoted $B'_{\text{branch}}$, $P'_{\text{branch}}$, $Lat'_{\text{branch}}$ and $L'_{\text{branch}}$ respectively;
Congestion identification of the main link is performed according to the measured values $B'_{\text{main}}$, $P'_{\text{main}}$, $Lat'_{\text{main}}$ and $L'_{\text{main}}$ and the corresponding link parameters $B_{\text{main}}$, $P_{\text{main}}$, $Lat_{\text{main}}$ and $L_{\text{main}}$;
When $B'_{\text{main}} \geq B_{\text{main}}$, congestion risk coefficient one is generated, denoted $\tau_1$;
When $P'_{\text{main}} \geq P_{\text{main}}$, congestion risk coefficient two is generated, denoted $\tau_2$;
When $Lat'_{\text{main}} \geq Lat_{\text{main}}$, congestion risk coefficient three is generated, denoted $\tau_3$;
When $L'_{\text{main}} \geq L_{\text{main}}$, congestion risk coefficient four is generated, denoted $\tau_4$;
Otherwise, no congestion risk coefficient is generated for the main link;
A congestion threshold for the main link is set and denoted $YS_1$. If $\tau_1 + \tau_2 + \tau_3 + \tau_4 \geq YS_1$, the congestion identification result is "1"; otherwise, the congestion identification result is "0";
Congestion identification of the branch link is performed according to the measured values $B'_{\text{branch}}$, $P'_{\text{branch}}$, $Lat'_{\text{branch}}$ and $L'_{\text{branch}}$ and the corresponding link parameters $B_{\text{branch}}$, $P_{\text{branch}}$, $Lat_{\text{branch}}$ and $L_{\text{branch}}$;
When $B'_{\text{branch}} \geq B_{\text{branch}}$, congestion risk coefficient five is generated, denoted $\tau_5$;
When $P'_{\text{branch}} \geq P_{\text{branch}}$, congestion risk coefficient six is generated, denoted $\tau_6$;
When $Lat'_{\text{branch}} \geq Lat_{\text{branch}}$, congestion risk coefficient seven is generated, denoted $\tau_7$;
When $L'_{\text{branch}} \geq L_{\text{branch}}$, congestion risk coefficient eight is generated, denoted $\tau_8$;
Otherwise, no congestion risk coefficient is generated for the branch link;
A congestion threshold for the branch link is set and denoted $YS_2$. If $\tau_5 + \tau_6 + \tau_7 + \tau_8 \geq YS_2$, the congestion identification result is "1"; otherwise, the congestion identification result is "0";
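The identification rule above has the same shape for the main link and the branch link: each measured value that reaches its link parameter contributes one risk coefficient, and the sum is compared with the link's congestion threshold. The following Python sketch illustrates that shape. The text does not give numeric values for the coefficients $\tau$ or the thresholds $YS$, so the weights and the threshold used here are assumptions, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LinkMetrics:
    bandwidth: float      # B
    loss_rate: float      # P
    delay: float          # Lat
    queue_length: float   # L


def identify_congestion(measured: LinkMetrics, limits: LinkMetrics,
                        weights: Sequence[float] = (1.0, 1.0, 1.0, 1.0),
                        threshold: float = 2.0) -> int:
    """Return 1 if the summed risk coefficients reach the threshold, else 0."""
    pairs = [
        (measured.bandwidth, limits.bandwidth),
        (measured.loss_rate, limits.loss_rate),
        (measured.delay, limits.delay),
        (measured.queue_length, limits.queue_length),
    ]
    # A coefficient is generated only when the measured value is at or
    # above the corresponding link parameter; otherwise it contributes 0.
    score = sum(w for (value, limit), w in zip(pairs, weights) if value >= limit)
    return 1 if score >= threshold else 0


if __name__ == "__main__":
    measured = LinkMetrics(bandwidth=0.9, loss_rate=0.02, delay=5.0, queue_length=10)
    limits = LinkMetrics(bandwidth=0.8, loss_rate=0.05, delay=4.0, queue_length=50)
    print(identify_congestion(measured, limits))
```

The same function can be called once with the main link's values and once with the branch link's values, each with its own weights and threshold.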
Specifically, the process of obtaining the congestion identification result and performing congestion control based on the corresponding result includes:
The congestion identification result is obtained as either "0" or "1". When the result for the main link or the branch link is "0", no congestion has yet occurred on that link, and congestion control is not performed;
When the result for the main link or the branch link is "1", link congestion exists on that link, and congestion control is performed as follows:
The congested main link or branch link is located, and a data transmission window is associated with it. The data transmission window has a window width and a window height, denoted $E_1$ and $E_2$ respectively, from which the window mapping area is obtained, denoted $S_{\text{window}}$, where $S_{\text{window}} = E_1 \times E_2$;
The real-time link data volume of the main link or branch link is obtained and denoted $D_{\text{real}}$. A window up-regulation threshold and a window down-regulation threshold are set, denoted $ST_{\text{up}}$ and $ST_{\text{down}}$ respectively, with $ST_{\text{up}} < ST_{\text{down}}$. $D_{\text{real}}$ is compared with $ST_{\text{up}}$ and $ST_{\text{down}}$, and the corresponding window adjustment operation is performed;
If $D_{\text{real}} \leq ST_{\text{up}}$, $E_1$ and $E_2$ are increased to $E_1 + E_1'$ and $E_2 + E_2'$, where $E_1'$ is the increment of the window width, $E_2'$ is the increment of the window height, and both $E_1'$ and $E_2'$ are real numbers greater than 0;
If $D_{\text{real}} \geq ST_{\text{down}}$, $E_1$ and $E_2$ are adjusted to $E_1 + E_1''$ and $E_2 + E_2''$, where $E_1''$ is the adjustment of the window width, $E_2''$ is the adjustment of the window height, and both $E_1''$ and $E_2''$ are real numbers less than 0, so that the window shrinks;
If $ST_{\text{up}} < D_{\text{real}} < ST_{\text{down}}$, the data transmission window is not adjusted, and the real-time link data volume on the corresponding main link or branch link is then the optimal data volume. The data transmission speed corresponding to the data transmission window at that moment is obtained and recorded as the optimal transmission speed, and data on the main link or branch link is transmitted at the optimal transmission speed;
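The three window cases above can be written as a small Python function. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, the step sizes for growing and shrinking the window are placeholders (the text only requires the growth steps to be positive and the shrink steps to be negative), and no lower bound on the window size is enforced because the text does not give one.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TransmissionWindow:
    width: float   # E_1
    height: float  # E_2

    @property
    def area(self) -> float:
        # Window mapping area S_window = E_1 * E_2.
        return self.width * self.height


def adjust_window(window: TransmissionWindow, d_real: float,
                  st_up: float, st_down: float,
                  grow: Tuple[float, float] = (1.0, 1.0),
                  shrink: Tuple[float, float] = (-1.0, -1.0)) -> TransmissionWindow:
    """Apply one window adjustment based on the real-time link data volume."""
    if not st_up < st_down:
        raise ValueError("the up-regulation threshold must be below the down-regulation threshold")
    if d_real <= st_up:
        dw, dh = grow      # both steps positive: the window grows
    elif d_real >= st_down:
        dw, dh = shrink    # both steps negative: the window shrinks
    else:
        return window      # data volume is in the optimal band: no change
    return TransmissionWindow(window.width + dw, window.height + dh)


if __name__ == "__main__":
    w = TransmissionWindow(4.0, 2.0)
    print(adjust_window(w, d_real=10, st_up=20, st_down=80).area)
```

Returning a new window instead of mutating the old one keeps each adjustment easy to log, which fits the later step of recording all window operations into a scheme template.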
Specifically, the process of evaluating the link traffic condition after congestion control and then generating the control scheme includes:
The number of times congestion control is performed is obtained and denoted Num1, and the number of times congestion control succeeds is obtained and denoted Num2. The congestion control success rate is then obtained and denoted $Sc$, where $Sc = (\text{Num2} / \text{Num1}) \times 100\%$. Probability interval one, probability interval two and probability interval three are preset and denoted $\Omega_1$, $\Omega_2$ and $\Omega_3$ respectively;
where $\Omega_1 = (0, 0.6)$, $\Omega_2 = [0.6, 0.85]$ and $\Omega_3 = (0.85, 1)$;
When $Sc \in \Omega_1$, the evaluation result is "poor";
When $Sc \in \Omega_2$, the evaluation result is "good";
When $Sc \in \Omega_3$, the evaluation result is "excellent";
When the evaluation result is "poor" or "good", the value of the model AI computing power (AIOPS) is increased; when the evaluation result is "excellent", all operations performed on the data transmission window are summarized and recorded in a preset scheme template, thereby generating a control scheme;
After being generated, the control scheme is stored in a preset terminal database. Access rights to the terminal database are provided so that the control scheme can be retrieved and read for use. When a congestion condition similar to the one addressed by the control scheme occurs on a transmission link, the control scheme is called promptly to relieve the congested transmission link, which further improves the efficiency of congestion relief;
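The evaluation step can be illustrated with a short Python sketch. It assumes the success rate is the fraction of congestion control attempts that succeeded (Num2 out of Num1), which is the reading under which $Sc$ falls inside the three intervals. The function names and the action labels are hypothetical. The endpoints 0 and 1 lie outside the intervals as stated, so the sketch returns `None` for them rather than guessing a rating.

```python
from typing import Optional


def evaluate(num_attempts: int, num_success: int) -> Optional[str]:
    """Classify the congestion control success rate into the three intervals."""
    if num_attempts <= 0 or not 0 <= num_success <= num_attempts:
        raise ValueError("need num_attempts > 0 and 0 <= num_success <= num_attempts")
    sc = num_success / num_attempts
    if 0 < sc < 0.6:          # interval one: (0, 0.6)
        return "poor"
    if 0.6 <= sc <= 0.85:     # interval two: [0.6, 0.85]
        return "good"
    if 0.85 < sc < 1:         # interval three: (0.85, 1)
        return "excellent"
    return None               # sc is exactly 0 or 1: not covered by the intervals


def next_action(result: Optional[str]) -> str:
    """Map an evaluation result to the follow-up step described in the text."""
    if result in ("poor", "good"):
        return "increase_ai_compute"       # hypothetical label
    if result == "excellent":
        return "record_control_scheme"     # hypothetical label
    return "undefined"


if __name__ == "__main__":
    print(evaluate(10, 9), next_action(evaluate(10, 9)))
```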
The above embodiments are intended only to illustrate the technical method of the present invention and not to limit it. Those skilled in the art should understand that the technical method of the present invention may be modified or substituted without departing from its spirit and scope.
Claims (8)
1. An AI-based RoCEv2 congestion control method, comprising the steps of:
Step S1: constructing a plurality of transmission links, interconnecting the transmission links using the RDMA protocol to build a topology link cluster network, and collecting link-related data through the topology link cluster network;
Step S2: constructing a link traffic model corresponding to the topology link cluster network according to the link-related data, performing feedback training on the link traffic model, generating feedback link parameters for link construction optimization, and performing congestion identification through the link traffic model;
Step S3: obtaining the congestion identification result, performing congestion control based on the corresponding result, evaluating the link traffic condition after congestion control, and generating a control scheme.
2. The AI-based RoCEv2 congestion control method of claim 1, wherein the process of constructing a plurality of transmission links and interconnecting them using the RDMA protocol to build the topology link cluster network comprises:
arranging a plurality of hosts at a plurality of preset link points, numbered i and j respectively, where i = 1, 2, 3, ..., n and j = 1, 2, 3, ..., m, and n and m are natural numbers greater than 0; obtaining a plurality of topological point sequence pairs, denoted L, where L = <i, j>; each host has two real-time states, "working" and "standby"; the topological point sequence pairs are set as link start points and link end points; the host at a link start point sends a communication request, and the next host in the "standby" state receives the communication request and establishes a communication relationship, then generates a new communication request and sends it to the host corresponding to the next topological point sequence pair; this operation is repeated until the communication request reaches any link end point, whereby a plurality of transmission links are constructed;
configuring a switch for each transmission link, the switch being used for communication interconnection between the transmission links; the switch is provided with a sending area and a cache area; the cache area is registered using the RDMA protocol, and after registration is complete the acquired transmission data is converted into RDMA local memory data; a grab request is sent through the sending area of the current switch to the sending area of the next switch; the address and cache area information of the next switch are obtained through the grab request and returned to the current switch; the RDMA local memory data in the cache area of the current switch is then transmitted to the next switch; once the switches have received the RDMA local memory data, the transmission links have been successfully interconnected through the RDMA protocol, and the topology link cluster network is built.
3. The AI-based RoCEv2 congestion control method according to claim 2, wherein the process of collecting link-related data through the topology link cluster network comprises:
the topology link cluster network sets a data acquisition time, a link rest time and a data verification time; the topology link cluster network collects link-related data during the data acquisition time and generates a link maintenance early warning; during the link rest time, an administrator overhauls the corresponding failed transmission link according to the link maintenance early warning; during the data verification time, the link-related data is obtained and imported into a preset data cleaning program for data cleaning, the data format of the link-related data is obtained and compared with a preset standard format, and the corresponding operation is then performed.
4. The AI-based RoCEv2 congestion control method of claim 3, wherein constructing the link traffic model from the link-related data comprises:
the link-related data comprises a link bandwidth, a link packet loss rate, a link delay, a link queue length and a link type; the link type comprises a main link and a branch link, and each link type has corresponding link parameters; the link parameters comprise a bandwidth utilization threshold, a packet loss upper limit, a delay threshold and a congestion judgment length; the link-related data and link parameters of the main link and of the branch link are summarized separately to generate a plurality of main link model fragments and branch link model fragments; each junction of the main link and a branch link is marked as a splicing point, and the main link model fragments and the branch link model fragments are spliced together at the splicing points, whereby the link traffic model is constructed.
5. The AI-based RoCEv2 congestion control method of claim 4, wherein performing the feedback training on the link traffic model to generate feedback link parameters for the link construction optimization comprises:
a primary link traffic monitoring area and a secondary link traffic monitoring area are mapped for the main link and the branch link respectively through the link traffic model; the model AI computing power is obtained and feedback training of the link traffic model is started; feedback link parameters are generated and transmitted to an administrator, who assigns the relevant operation and maintenance personnel to perform link construction optimization.
6. The AI-based RoCEv2 congestion control method of claim 5, wherein performing the congestion identification through the link traffic model comprises:
the link bandwidth, link packet loss rate, link delay and link queue length of each of the main link and the branch link are obtained through the link traffic model and compared with the bandwidth utilization threshold, packet loss upper limit, delay threshold and congestion judgment length of the main link and the branch link respectively, so as to generate the corresponding congestion risk coefficients; congestion thresholds are set for the main link and the branch link; the congestion risk coefficients of the main link and of the branch link are each accumulated and compared numerically with the corresponding congestion threshold, and different congestion identification results are generated according to the comparison.
7. The AI-based RoCEv2 congestion control method of claim 6, wherein obtaining the congestion identification result and performing the congestion control based on the corresponding result comprises:
when the congestion identification result corresponding to the main link or the branch link is "0", congestion control is not performed; when the congestion identification result corresponding to the main link or the branch link is "1", the congested main link or branch link is located, a data transmission window is associated with the main link or branch link, the real-time link data volume of the main link or branch link is obtained, a window up-regulation threshold and a window down-regulation threshold are set, the real-time link data volume is compared with the window up-regulation threshold and the window down-regulation threshold, and the corresponding window adjustment operation is then performed.
8. The AI-based RoCEv2 congestion control method of claim 7, wherein evaluating the link traffic condition after congestion control and generating the control scheme comprises:
the number of times congestion control is performed is obtained and denoted Num1, and the number of times congestion control succeeds is obtained and denoted Num2; the congestion control success rate is then obtained and denoted $Sc$, where $Sc = (\text{Num2} / \text{Num1}) \times 100\%$; probability interval one, probability interval two and probability interval three are preset and denoted $\Omega_1$, $\Omega_2$ and $\Omega_3$ respectively;
where $\Omega_1 = (0, 0.6)$, $\Omega_2 = [0.6, 0.85]$ and $\Omega_3 = (0.85, 1)$;
when $Sc \in \Omega_1$, the evaluation result is "poor";
when $Sc \in \Omega_2$, the evaluation result is "good";
when $Sc \in \Omega_3$, the evaluation result is "excellent";
when the evaluation result is "poor" or "good", the value of the model AI computing power is increased; when the evaluation result is "excellent", all operations performed on the data transmission window are summarized and recorded in a preset scheme template, thereby generating a control scheme.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410064926.0A CN117579559B (en) | 2024-01-17 | 2024-01-17 | Control method for RoCEv2 congestion based on AI |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410064926.0A CN117579559B (en) | 2024-01-17 | 2024-01-17 | Control method for RoCEv2 congestion based on AI |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117579559A (en) | 2024-02-20 |
CN117579559B (en) | 2024-04-23 |
Family
ID=89864811
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410064926.0A Active CN117579559B (en) | 2024-01-17 | 2024-01-17 | Control method for RoCEv2 congestion based on AI |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117579559B (en) |
Also Published As
Publication number | Publication date |
---|---|
CN117579559B (en) | 2024-04-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |