CN117579559B

CN117579559B - AI-based RoCEv2 congestion control method

Info

Publication number: CN117579559B
Application number: CN202410064926.0A
Authority: CN
Inventors: 陈新蕾; 王新征; 贾晓洁; 赵玉兵; 王焕成
Original assignee: Enterprise Online Beijing Data Technology Co ltd
Current assignee: Enterprise Online Beijing Data Technology Co ltd
Priority date: 2024-01-17
Filing date: 2024-01-17
Publication date: 2024-04-23
Anticipated expiration: 2044-01-17
Also published as: CN117579559A

Abstract

The invention discloses an AI-based RoCEv2 congestion control method in the technical field of congestion control. The method comprises three stages. First, a plurality of transmission links are constructed and interconnected via the RDMA protocol to build a topology link cluster network, through which link-related data are collected. Second, a link traffic model of the topology link cluster network is constructed from the link-related data; the model undergoes feedback training to generate feedback link parameters used to optimize link construction, and it then performs congestion identification. Third, congestion control is performed based on the congestion identification result, the link traffic after congestion control is evaluated, and a control scheme is generated. The method thereby helps prevent congestion before it occurs and relieve it promptly when it does.

Description

AI-based RoCEv2 congestion control method

Technical Field

The invention relates to the technical field of congestion control, and in particular to an AI-based RoCEv2 congestion control method.

Background

RoCEv2 is one of the most widely used interconnection technologies in current high-performance computing clusters. It uses the RDMA protocol and offers low latency and high bandwidth. Under high load, however, a RoCEv2 network is prone to congestion, which causes packet loss and increased latency in data transmission and degrades system performance.

Effective congestion control in RoCEv2 networks is therefore an important research direction. Three problems must be addressed: how to prevent congestion before it occurs, how to relieve congestion promptly when it occurs, and how to evaluate the overall situation after congestion control is completed.

Disclosure of Invention

To solve the above problems, an object of the present invention is to provide an AI-based RoCEv2 congestion control method.

This object is achieved by the following technical solution: an AI-based RoCEv2 congestion control method, comprising the following steps:

Step S1: constructing a plurality of transmission links, interconnecting the transmission links via the RDMA protocol to construct a topology link cluster network, and collecting link-related data through the topology link cluster network;

Step S2: constructing a link traffic model of the topology link cluster network from the link-related data, performing feedback training on the link traffic model to generate feedback link parameters for link construction optimization, and performing congestion identification through the link traffic model;

Step S3: obtaining the congestion identification result, performing congestion control based on that result, evaluating the link traffic after congestion control, and generating a control scheme.

Further, the process of constructing the plurality of transmission links and interconnecting them via the RDMA protocol to construct the topology link cluster network comprises:

Arranging a plurality of hosts at a plurality of preset link points, numbered i and j respectively, where i = 1, 2, 3, …, n, j = 1, 2, 3, …, m, and n and m are natural numbers greater than 0. A plurality of topological point sequence pairs are obtained, each denoted L = <i, j>. Each host is in one of two real-time states, 'working' or 'standby', and some topological point sequence pairs are designated as link start points and others as link end points. The host at a link start point sends a communication request. The next host in the 'standby' state establishes a communication relationship after receiving the request. It then generates a new communication request and sends it to the host corresponding to the next topological point sequence pair. This is repeated until the request reaches any link end point, which constructs a plurality of transmission links;

Configuring a switch for each transmission link to interconnect the transmission links. Each switch has a sending area and a cache area. The cache area is registered using the RDMA protocol, and after registration the acquired transmission data are converted into RDMA local memory data. The switch sends a grab request from its sending area to the sending area of the next switch. Through the grab request, the address and cache-area information of the next switch are obtained and returned to the current switch. The RDMA local memory data in the cache area of the current switch are then transmitted to the next switch. Once every switch has received the RDMA local memory data, the transmission links have been successfully interconnected via the RDMA protocol and the topology link cluster network is built.

Further, the process of collecting the link-related data through the topology link cluster network comprises:

The topology link cluster network sets a data acquisition time, a link rest time and a data verification time.
- During the data acquisition time, the network collects link-related data and generates link maintenance early warnings.
- During the link rest time, an administrator overhauls the failed transmission links indicated by the warnings.
- During the data verification time, link-related data are collected and imported into a preset data cleaning program. The data format of the cleaned data is then compared with a preset standard format to decide the corresponding operation.

Further, the process of constructing the link traffic model from the link-related data comprises:

The link-related data comprise link bandwidth, link packet loss rate, link delay, link queue length and link type. Link types include main links and branch links, and each type has its own link parameters: a bandwidth utilization threshold, a packet loss upper limit, a delay threshold and a congestion judgment length. The link-related data and link parameters of the main links and of the branch links are summarized separately to generate a plurality of main link model fragments and branch link model fragments. Each junction between a main link and a branch link is marked as a splicing point, and the main link and branch link model fragments are spliced together at the splicing points to construct the link traffic model.

Further, the process of performing feedback training on the link traffic model and generating feedback link parameters for link construction optimization comprises:

The link traffic model maps a primary link traffic monitoring area for the main links and a secondary link traffic monitoring area for the branch links. The model AI computing power is obtained and feedback training of the link traffic model is started. The resulting feedback link parameters are sent to an administrator, who assigns operation and maintenance personnel to optimize link construction.

Further, the process of performing congestion identification through the link traffic model comprises:

Obtaining, through the link traffic model, the link bandwidth, link packet loss rate, link delay and link queue length of the main links and the branch links. Each value is compared with the corresponding bandwidth utilization threshold, packet loss upper limit, delay threshold and congestion judgment length of its link type to generate congestion risk coefficients. Congestion thresholds are set for the main links and branch links. For each link type, the congestion risk coefficients are summed and compared with that type's congestion threshold, and the comparison yields the congestion identification result.

Further, the process of obtaining the congestion identification result and performing congestion control based on it comprises:

When the congestion identification result of a main link or branch link is '0', no congestion control is performed. When it is '1', the congested main link or branch link is located and a data transmission window is associated with it. The real-time link data amount of that link is acquired, and a window up-regulation threshold and a window down-regulation threshold are set. The real-time link data amount is compared with the two thresholds to decide which window adjustment to perform.

Further, the process of evaluating the link traffic after congestion control and generating the control scheme comprises:

Acquiring the number of congestion control operations, Num1, and the number of successful congestion control operations, Num2. The congestion control success rate is computed as Sc = (Num2/Num1) × 100%. A first, second and third probability interval are preset, denoted Ω₁, Ω₂ and Ω₃ respectively;

where Ω₁ = (0, 0.6), Ω₂ = [0.6, 0.85] and Ω₃ = (0.85, 1);

When Sc ∈ Ω₁, the evaluation result is 'poor';

When Sc ∈ Ω₂, the evaluation result is 'good';

When Sc ∈ Ω₃, the evaluation result is 'excellent';

When the evaluation result is 'poor' or 'good', the value of the model AI computing power is increased. When it is 'excellent', all operations performed on the data transmission window are summarized and recorded in a preset scheme template to generate the control scheme.
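The evaluation rule can be sketched in Python as follows. The sketch takes Sc as the fraction of successful congestion controls among all congestion controls (Num2/Num1), so that it falls in the 0 to 1 range covered by Ω₁ to Ω₃. The stated intervals leave Sc = 0 and Sc = 1 uncovered. The sketch assigns them to 'poor' and 'excellent' respectively; this is an assumption, not part of the method. The function names are illustrative only.

```python
def success_rate(num1: int, num2: int) -> float:
    """Sc = Num2 / Num1: successful congestion controls over all congestion controls."""
    if num1 <= 0:
        raise ValueError("Num1 must be positive: no congestion control has been performed")
    if not 0 <= num2 <= num1:
        raise ValueError("Num2 must lie between 0 and Num1")
    return num2 / num1


def evaluate(sc: float) -> str:
    """Map Sc onto Omega_1 = (0, 0.6), Omega_2 = [0.6, 0.85], Omega_3 = (0.85, 1)."""
    if sc < 0.6:          # Omega_1 (assumption: Sc = 0 is also treated as 'poor')
        return "poor"
    if sc <= 0.85:        # Omega_2
        return "good"
    return "excellent"    # Omega_3 (assumption: Sc = 1 is also treated as 'excellent')


def follow_up(result: str) -> str:
    """Action taken after the evaluation."""
    if result in ("poor", "good"):
        return "increase model AI computing power (AIOPS)"
    return "record data transmission window operations in the scheme template"


print(evaluate(success_rate(20, 18)))  # Sc = 0.9 -> excellent
```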

Compared with the prior art, the invention has the following beneficial effects:

1. A plurality of transmission links are constructed. The transmission data received by the switches are converted into RDMA local memory data using the RDMA protocol, which interconnects the transmission links and builds a topology link cluster network. A data acquisition time, a link rest time and a data verification time are set for collecting link-related data. This has two benefits. First, RDMA uses direct memory access, which reduces CPU involvement in data transmission and improves transmission performance and efficiency. Second, failed transmission links are found during the data acquisition time and overhauled during the link rest time, and the link-related data are checked for compliance during the data verification time. Faults are therefore found and removed promptly, and the link-related data remain correct and compliant.

2. A link traffic model of the topology link cluster network is constructed from the link-related data, with main links and branch links calibrated separately. A training period is set for feedback training, and the model is marked as compliant once a preset condition is met, which improves its prediction accuracy to a certain extent. After feedback training, the model generates feedback link parameters for the main links and the branch links that indicate the current risk of congestion. These parameters are used to optimize link construction, which helps prevent congestion and accidents.

3. If congestion identification shows that the topology link cluster network is still congested after link construction optimization, the congested main link or branch link is located and a data transmission window is set for congestion control, so that congestion is detected and cleared promptly. After congestion control, the link traffic is evaluated and a corresponding control scheme is generated, which can be reused for later occurrences of the same congestion condition.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

As shown in FIG. 1, an AI-based RoCEv2 congestion control method includes the following steps:

Specifically, the process of constructing the plurality of transmission links and interconnecting them via the RDMA protocol to construct the topology link cluster network comprises:

Arranging a plurality of hosts at a plurality of preset link points, numbering the hosts and the link points, and denoting their numbers i and j respectively, where i = 1, 2, 3, …, n, j = 1, 2, 3, …, m, and n and m are natural numbers greater than 0;

Obtaining a plurality of topological point sequence pairs, each denoted L = <i, j>. Each host is in one of two real-time states, 'working' or 'standby'. A fixed number A of topological point sequence pairs are set as link start points, and a fixed number B as link end points;

The host at each link start point sends a communication request. The host of the next topological point sequence pair that is in the 'standby' state receives the request and establishes a communication relationship. It then generates a new communication request and sends it to the host of the following topological point sequence pair. This is repeated until the request reaches any link end point, which constructs a plurality of transmission links;

Taking the topological point sequence pairs contained in each transmission link as its identifier sub-symbols, and concatenating these sub-symbols into a symbol sequence string denoted St-ID[k], where k is the number of the transmission link, k = 1, 2, 3, …, z, and z is a natural number greater than 0;
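The request-forwarding procedure and the construction of St-ID[k] can be sketched in Python as follows. The description does not say how the "next" topological point sequence pair is chosen. The sketch therefore assumes a hypothetical `neighbors` map that lists candidate next pairs in order, and the request is accepted by the first candidate in the 'standby' state. The `<i,j>`-joined string format of St-ID[k] is also an illustrative choice.

```python
from typing import Dict, List, Optional, Set, Tuple

Pair = Tuple[int, int]  # topological point sequence pair <i, j>


def build_link(start: Pair, ends: Set[Pair],
               neighbors: Dict[Pair, List[Pair]],
               state: Dict[Pair, str]) -> Optional[List[Pair]]:
    """Forward a communication request from `start` until any end point is reached.

    `neighbors` (hypothetical) lists candidate next pairs in order; the first host
    in the 'standby' state accepts the request and switches to 'working'.
    Returns the pairs on the link, or None if the request cannot be forwarded.
    """
    state[start] = "working"
    path = [start]
    current = start
    while current not in ends:
        nxt = next((p for p in neighbors.get(current, [])
                    if state.get(p) == "standby"), None)
        if nxt is None:
            return None
        state[nxt] = "working"  # accepting host establishes the communication relationship
        path.append(nxt)
        current = nxt
    return path


def st_id(path: List[Pair]) -> str:
    """Concatenate the identifier sub-symbols of a link into its symbol sequence string."""
    return "-".join(f"<{i},{j}>" for i, j in path)


def build_links(starts: List[Pair], ends: Set[Pair],
                neighbors: Dict[Pair, List[Pair]],
                state: Dict[Pair, str]) -> Dict[int, str]:
    """Return {k: St-ID[k]} for every start point whose request reaches an end point."""
    ids: Dict[int, str] = {}
    for start in starts:
        path = build_link(start, ends, neighbors, state)
        if path is not None:
            ids[len(ids) + 1] = st_id(path)  # k = 1, 2, 3, ...
    return ids
```

Because accepting hosts switch to 'working', a second start point whose only candidate is already in use gets no link. This mirrors the requirement that only 'standby' hosts accept requests.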

Configuring a switch for each transmission link to interconnect the transmission links. Each switch obtains communication permissions for the plurality of hosts and thereby acquires their transmission data. Each switch has a sending area and a cache area;

Registering the cache area using the RDMA protocol, and after registration converting the transmission data into RDMA local memory data. The switch sends a grab request from its own sending area to the sending area of the next switch. Through the grab request, the address and cache-area information of the next switch are obtained and returned to the current switch. The RDMA local memory data in the cache area of the current switch are then transmitted to the next switch;

A switch acts as the sender when it sends a grab request and as the receiver when it receives one;

Once all switches have received the RDMA local memory data, the transmission links have been successfully interconnected via the RDMA protocol and the topology link cluster network is built.
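The grab-request exchange between consecutive switches can be illustrated with a toy, in-memory Python simulation. It does not use real RDMA. The method `register_cache` only stands in for memory registration, and copying a list stands in for the transfer of RDMA local memory data. All class and method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Switch:
    """Toy switch with a cache area; no real RDMA verbs are used."""
    name: str
    address: str
    cache: List[bytes] = field(default_factory=list)  # stands in for the registered cache area
    registered: bool = False
    received: bool = False

    def register_cache(self) -> None:
        # Placeholder for registering the cache area via RDMA
        self.registered = True

    def answer_grab_request(self) -> dict:
        # Receiver role: return this switch's address and cache-area information
        return {"address": self.address, "cache_registered": self.registered}

    def push_to(self, nxt: "Switch") -> bool:
        # Sender role: send a grab request, then transfer cached data if the peer is ready
        info = nxt.answer_grab_request()
        if not (self.registered and info["cache_registered"]):
            return False
        nxt.cache.extend(self.cache)
        nxt.received = True
        return True


def interconnect(chain: List[Switch], data: List[bytes]) -> bool:
    """Return True once every switch in the chain holds the data."""
    for sw in chain:
        sw.register_cache()
    chain[0].cache.extend(data)  # transmission data acquired from the hosts
    chain[0].received = True
    for cur, nxt in zip(chain, chain[1:]):
        if not cur.push_to(nxt):
            return False
    return all(sw.received for sw in chain)
```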

It should be noted that the symbol sequence string St-ID[k] serves as the unique identity of each transmission link, which makes monitoring and management easier in subsequent steps. The fixed numbers A and B are equal and adjustable, and the number of constructed transmission links equals A (and therefore B). One switch is bound to each transmission link and every switch is processed with the RDMA protocol, so the different transmission links are interconnected and the topology link cluster network is constructed. Because RDMA uses direct memory access, CPU involvement in data transmission is reduced, which improves transmission performance and efficiency.

Specifically, the process of collecting the link-related data through the topology link cluster network comprises:

The topology link cluster network sets a data acquisition time, a link rest time and a data verification time, denoted T_collect, T_rest and T_verify respectively;

During the data acquisition time T_collect, the topology link cluster network collects link-related data and measures the acquisition speed V. The historical average acquisition speed is V_avg, and a link critical speed V_crit is preset. If V_avg ≤ V < V_crit, no operation is performed. Otherwise, a failed transmission link exists in the current topology link cluster network; it is located via St-ID[k], and a link overhaul early warning is generated and sent to a preset administrator;
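The speed check can be sketched in Python as follows. The description measures a single acquisition speed V for the whole network and then locates the failed link via St-ID[k], but it does not say how. The sketch therefore assumes, hypothetically, that an acquisition speed is available for each link, keyed by its St-ID[k] string.

```python
from typing import Dict, List


def speed_ok(v: float, v_avg: float, v_crit: float) -> bool:
    """No operation is needed when V_avg <= V < V_crit."""
    return v_avg <= v < v_crit


def overhaul_warnings(speeds: Dict[str, float], v_avg: float, v_crit: float) -> List[str]:
    """Return the St-ID[k] of every link whose acquisition speed falls outside [V_avg, V_crit)."""
    return [sid for sid, v in speeds.items() if not speed_ok(v, v_avg, v_crit)]


speeds = {"<1,1>-<1,2>": 5.0, "<2,1>-<2,2>": 2.0, "<3,1>-<3,2>": 12.0}
print(overhaul_warnings(speeds, v_avg=4.0, v_crit=10.0))  # ['<2,1>-<2,2>', '<3,1>-<3,2>']
```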

During the link rest time T_rest, the administrator overhauls the failed transmission link indicated by the link overhaul early warning. When the overhaul is finished, the administrator generates an overhaul report and stores it in a preset overhaul database;

During the data verification time T_verify, link-related data are collected and imported into a preset data cleaning program, which screens out repeated, redundant and incomplete data. After cleaning, the data format of the link-related data is checked. If it does not match a preset standard format, it is converted to the standard format; otherwise, no conversion is performed;
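The cleaning and format check can be sketched in Python as follows. The sketch assumes that records arrive as dictionaries and that the "standard format" is a hypothetical mapping from required field names to Python types. It treats records with missing or unconvertible fields as incomplete, fields outside the standard format as redundant, and exact repeats after type normalization as repeated data.

```python
from typing import Dict, List, Optional

# Hypothetical standard format: required field -> expected type
STANDARD_FORMAT = {"link_id": str, "link_type": str, "bandwidth": float,
                   "loss_rate": float, "delay": float, "queue_len": float}


def normalize(rec: Dict) -> Optional[Dict]:
    """Keep only standard fields and convert mismatched types; None if incomplete."""
    out = {}
    for name, typ in STANDARD_FORMAT.items():
        value = rec.get(name)
        if value is None:
            return None                      # incomplete record
        if not isinstance(value, typ):
            try:
                value = typ(value)           # convert to the standard format
            except (TypeError, ValueError):
                return None                  # unconvertible value: treat as incomplete
        out[name] = value                    # fields not in STANDARD_FORMAT are dropped
    return out


def clean(records: List[Dict]) -> List[Dict]:
    """Drop incomplete, redundant and repeated link-related data."""
    seen = set()
    cleaned = []
    for rec in records:
        norm = normalize(rec)
        if norm is None:
            continue
        key = tuple(norm[name] for name in STANDARD_FORMAT)
        if key in seen:
            continue                         # repeated record
        seen.add(key)
        cleaned.append(norm)
    return cleaned
```

Normalizing before de-duplicating means that `"100"` and `100.0` count as the same value. That choice is part of the sketch, not something the description specifies.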

Specifically, the process of constructing the link traffic model from the link-related data comprises:

Obtaining the link-related data collected during the data acquisition time, which comprise link bandwidth, link packet loss rate, link delay, link queue length and link type;

Link types include main links and branch links, and each type has corresponding link parameters: a bandwidth utilization threshold, a packet loss upper limit, a delay threshold and a congestion judgment length;

The bandwidth utilization threshold, packet loss upper limit, delay threshold and congestion judgment length of the main links are denoted B_main, P_main, Lat_main and L_main respectively. Those of the branch links are denoted B_branch, P_branch, Lat_branch and L_branch respectively;

Summarizing the link-related data and link parameters of the main links to generate a plurality of main link model fragments, and those of the branch links to generate a plurality of branch link model fragments;

Marking each intersection of a main link and a branch link as a splicing point, and splicing the main link model fragments and branch link model fragments together at the splicing points to construct the link traffic model;

Specifically, the process of performing feedback training on the link traffic model and generating feedback link parameters for link construction optimization comprises:

Mapping, through the link traffic model, a primary link traffic monitoring area for the main links and a secondary link traffic monitoring area for the branch links, and obtaining the model AI computing power associated with the link traffic model, denoted AIOPS;

Setting an initial value d for the model AI computing power, i.e. AIOPS = d, and starting feedback training of the link traffic model;

Setting a training period. Within the training period, the actual traffic values of the primary and secondary link traffic monitoring areas are acquired and denoted Mb₁ and Mb₂. The feedback predicted traffic values of the primary and secondary areas, obtained through feedback training, are denoted Mb₁′ and Mb₂′;

Presetting a first and a second allowable prediction deviation, denoted Q₁ and Q₂, and computing the primary and secondary link traffic differences X₁ = |Mb₁ − Mb₁′| and X₂ = |Mb₂ − Mb₂′|;

When X₁ ≤ Q₁ and X₂ ≤ Q₂ both hold, the link traffic model is marked as a compliant model and feedback training stops;

When X₁ > Q₁ or X₂ > Q₂, the link traffic model is marked as a non-compliant model. A first and a second feedback coefficient, denoted α and β, are set, and feedback link parameters are generated from α, β, X₁ and X₂. The feedback link parameters comprise a main link feedback parameter G₁ and a branch link feedback parameter G₂, where G₁ = α·X₁ and G₂ = β·X₂;
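The compliance test and the feedback link parameters can be sketched in Python as follows. The sketch uses G₂ = β·X₂, taking β as the second feedback coefficient that the description says is used to generate the parameters. For a compliant model it returns no parameters, because training stops at that point. The names are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FeedbackResult:
    compliant: bool
    g1: Optional[float] = None  # main link feedback parameter G1
    g2: Optional[float] = None  # branch link feedback parameter G2


def feedback_check(mb1: float, mb1_pred: float, mb2: float, mb2_pred: float,
                   q1: float, q2: float, alpha: float, beta: float) -> FeedbackResult:
    x1 = abs(mb1 - mb1_pred)  # primary link traffic difference X1
    x2 = abs(mb2 - mb2_pred)  # secondary link traffic difference X2
    if x1 <= q1 and x2 <= q2:
        return FeedbackResult(compliant=True)  # compliant model: stop feedback training
    return FeedbackResult(compliant=False, g1=alpha * x1, g2=beta * x2)
```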

The feedback link parameters are sent to the administrator, who assigns operation and maintenance personnel to optimize link construction. The personnel obtain the main link feedback parameter G₁ and the branch link feedback parameter G₂ and preset the corresponding congestion risk thresholds H₁ and H₂;

If G₁ ≥ H₁, the topology of the main link is optimized and its load rate and bandwidth are adjusted until G₁ < H₁; otherwise, no operation is performed;

If G₂ ≥ H₂, the topology of the branch link is optimized. A dynamic routing protocol obtains the link state and real-time traffic of the branch link, adjusts the relevant routes and selects the path for data transmission, until G₂ < H₂. Otherwise, no operation is performed;
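The "adjust until G < H" loops for the main and branch links have the same shape, sketched below in Python. The description does not specify how much each round of topology, load-rate, bandwidth or routing adjustment lowers G. The sketch therefore passes the effect in as a hypothetical `adjust` callback, and a round limit guards against adjustments that never bring G below H.

```python
from typing import Callable, Tuple


def optimize_until_below(g: float, h: float, adjust: Callable[[float], float],
                         max_rounds: int = 100) -> Tuple[float, int]:
    """Apply optimization rounds while G >= H; return (final G, rounds used)."""
    rounds = 0
    while g >= h and rounds < max_rounds:
        g = adjust(g)  # one round of link construction optimization (hypothetical effect)
        rounds += 1
    return g, rounds


# Example: each round halves G1 (an assumed effect, for illustration only)
print(optimize_until_below(10.0, 4.0, lambda g: g / 2))  # (2.5, 2)
```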

It should be noted that adjusting the topology optimizes the connection mode and path selection of the transmission links, which improves network availability, fault tolerance and transmission efficiency. For example, redundant links and multipath routing improve link reliability and load balancing. The load rate and bandwidth of the main link have corresponding upper limits. Feedback training of the link traffic model over a set training period reduces model prediction error. G₁ and G₂ are compared with the congestion risk thresholds H₁ and H₂. G₁ ≥ H₁ or G₂ ≥ H₂ indicates that the transmission link is at risk of congestion, so link construction is optimized promptly to prevent congestion and accidents;

Specifically, the process of performing congestion identification through the link traffic model comprises:

Obtaining, through the link traffic model, the link bandwidth, link packet loss rate, link delay and link queue length of the main links and branch links in the topology link cluster network;

The measured link bandwidth, link packet loss rate, link delay and link queue length of the main link are denoted B_main′, P_main′, Lat_main′ and L_main′ respectively. Those of the branch link are denoted B_branch′, P_branch′, Lat_branch′ and L_branch′ respectively;

Performing congestion identification of the main link from the measured values B_main′, P_main′, Lat_main′ and L_main′ and the thresholds B_main, P_main, Lat_main and L_main;

When B_main′ ≥ B_main, a first congestion risk coefficient τ₁ is generated;

When P_main′ ≥ P_main, a second congestion risk coefficient τ₂ is generated;

When Lat_main′ ≥ Lat_main, a third congestion risk coefficient τ₃ is generated;

When L_main′ ≥ L_main, a fourth congestion risk coefficient τ₄ is generated;

Otherwise, the corresponding congestion risk coefficient of the main link is not generated;

Setting a main link congestion threshold YS₁: if τ₁ + τ₂ + τ₃ + τ₄ ≥ YS₁, the congestion identification result is '1'; otherwise it is '0';

Performing congestion identification of the branch link from the measured values B_branch′, P_branch′, Lat_branch′ and L_branch′ and the thresholds B_branch, P_branch, Lat_branch and L_branch;

When B_branch′ ≥ B_branch, a fifth congestion risk coefficient τ₅ is generated;

When P_branch′ ≥ P_branch, a sixth congestion risk coefficient τ₆ is generated;

When Lat_branch′ ≥ Lat_branch, a seventh congestion risk coefficient τ₇ is generated;

When L_branch′ ≥ L_branch, an eighth congestion risk coefficient τ₈ is generated;

Otherwise, the corresponding congestion risk coefficient of the branch link is not generated;

Setting a branch link congestion threshold YS₂: if τ₅ + τ₆ + τ₇ + τ₈ ≥ YS₂, the congestion identification result is '1'; otherwise it is '0';
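Both identification rules can be expressed by one Python function, sketched below. The description gives no numerical values for τ₁ to τ₈ or for YS₁ and YS₂, so the example uses hypothetical values. Every coefficient is set to 1 and the threshold to 2, which flags a link when at least two of its four metrics reach their limits.

```python
from typing import Dict, Sequence

METRICS = ("bandwidth", "loss_rate", "delay", "queue_len")


def identify(measured: Dict[str, float], limits: Dict[str, float],
             taus: Sequence[float], ys: float) -> int:
    """Return the congestion identification result: 1 = congested, 0 = not congested.

    `taus` holds the four risk coefficients (tau1..tau4 for a main link,
    tau5..tau8 for a branch link); a coefficient only counts when its metric
    reaches the corresponding threshold.
    """
    score = 0.0
    for metric, tau in zip(METRICS, taus):
        if measured[metric] >= limits[metric]:
            score += tau
    return 1 if score >= ys else 0


main_limits = {"bandwidth": 0.8, "loss_rate": 0.01, "delay": 10.0, "queue_len": 1000}
main_now = {"bandwidth": 0.9, "loss_rate": 0.001, "delay": 12.0, "queue_len": 800}
print(identify(main_now, main_limits, taus=(1, 1, 1, 1), ys=2))  # 1
```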

Specifically, the process of obtaining the congestion identification result and performing congestion control based on it comprises:

Obtaining the congestion identification result ('0' or '1'). When the result for a main link or branch link is '0', no congestion has yet occurred on that link and no congestion control is performed;

When the result for a main link or branch link is '1', that link is congested and congestion control is performed as follows:

Locating the congested main link or branch link and associating a data transmission window with it. The window has a width E₁ and a height E₂, giving the window mapping area S_window = E₁ × E₂;

Acquiring the real-time link data amount of the main link or branch link, denoted D_real. A window up-regulation threshold ST_up and a window down-regulation threshold ST_down are set, with ST_up < ST_down. D_real is compared with ST_up and ST_down to decide the window adjustment:

If D_real ≤ ST_up, E₁ and E₂ are increased: E₁ ← E₁ + E₁′ and E₂ ← E₂ + E₂′, where E₁′ and E₂′ are the window width and height increments, both real numbers greater than 0;

If D_real ≥ ST_down, E₁ and E₂ are updated as E₁ ← E₁ + E₁″ and E₂ ← E₂ + E₂″, where E₁″ and E₂″ are the window width and height increments. Both are real numbers less than 0, so the window shrinks;

If ST_up < D_real < ST_down, the data transmission window is not adjusted. The real-time link data amount on the main link or branch link is then the optimal data amount. The data transmission speed of the window at that moment is recorded as the optimal transmission speed, and data on the main link or branch link are transmitted at that speed;
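The window rule can be sketched in Python as follows. The caller passes in the increments (E₁′, E₂′) and (E₁″, E₂″). The sketch also clamps the window at zero so that repeated down-regulation cannot produce a negative size. That clamp is an assumption, not part of the description.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Window:
    width: float   # E1
    height: float  # E2

    @property
    def area(self) -> float:
        return self.width * self.height  # S_window = E1 * E2


def adjust_window(win: Window, d_real: float, st_up: float, st_down: float,
                  inc: Tuple[float, float], dec: Tuple[float, float]) -> str:
    """Apply one window regulation step; `inc` > 0 and `dec` < 0 component-wise."""
    if not st_up < st_down:
        raise ValueError("ST_up must be smaller than ST_down")
    if d_real <= st_up:
        step, action = inc, "up"      # low data amount: enlarge the window
    elif d_real >= st_down:
        step, action = dec, "down"    # high data amount: shrink the window
    else:
        return "hold"                 # ST_up < D_real < ST_down: optimal data amount
    win.width = max(0.0, win.width + step[0])    # assumed clamp at zero
    win.height = max(0.0, win.height + step[1])
    return action
```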

Specifically, the process of evaluating the link traffic after congestion control and generating the control scheme comprises:

where Ω₁ = (0, 0.6), Ω₂ = [0.6, 0.85] and Ω₃ = (0.85, 1);

When Sc ∈ Ω₁, the evaluation result is 'poor';

When Sc ∈ Ω₂, the evaluation result is 'good';

When Sc ∈ Ω₃, the evaluation result is 'excellent';

When the evaluation result is 'poor' or 'good', the value of the model AI computing power AIOPS is increased. When it is 'excellent', all operations performed on the data transmission window are summarized and recorded in a preset scheme template to generate the control scheme;

After generation, the control scheme is stored in a preset terminal database with access permissions, from which it can be retrieved and read. When a transmission link shows a congestion condition similar to one covered by a control scheme, that scheme is invoked promptly to relieve the congested link, which improves the efficiency of congestion relief;

The above embodiments only illustrate the technical solution of the present invention and do not limit it. Those skilled in the art should understand that the technical solution may be modified or equivalently substituted without departing from its spirit and scope.

Claims

1. An AI-based RoCEv2 congestion control method, characterized by comprising the following steps:

Step S3: obtaining the congestion identification result, performing congestion control based on that result, evaluating the link traffic condition after congestion control, and generating a control scheme;

The process of constructing a plurality of transmission links, interconnecting them via the RDMA protocol, and thereby constructing a topological link cluster network comprises:

A switch is configured for each transmission link to interconnect the transmission links. Each switch has a sending area and a buffer area; the buffer area is registered using the RDMA protocol, and once registration is complete, the acquired transmission data is converted into RDMA local memory data. The current switch sends a fetch request from its sending area to the sending area of the next switch, and the next switch's address and buffer area information are returned to the current switch in response; the RDMA local memory data in the current switch's buffer area is then transmitted to the next switch. Once the RDMA local memory data has been received by the plurality of switches, the transmission links have been successfully interconnected via the RDMA protocol and the topological link cluster network is established;
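The switch-to-switch handshake can be simulated as follows. This is only a toy model: `Switch`, `build_cluster`, the fields of the fetch-request reply, and the boolean flag used in place of real RDMA memory registration are assumptions for illustration. A real deployment would register memory and post transfers through an RDMA verbs library rather than copying Python lists.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Switch:
    """Hypothetical switch for one transmission link: a sending area plus a buffer area."""
    address: str
    buffer: List[bytes] = field(default_factory=list)   # buffer area contents
    registered: bool = False                              # stand-in for RDMA memory registration
    has_received: bool = False

    def register_buffer(self) -> None:
        # A real system would call ibv_reg_mr (libibverbs); here it is only a flag.
        self.registered = True

    def load(self, data: bytes) -> None:
        """Store acquired data as 'RDMA local memory data' (requires registration)."""
        if not self.registered:
            raise RuntimeError(f"{self.address}: buffer not registered")
        self.buffer.append(data)

    def answer_fetch_request(self) -> Dict[str, object]:
        """Reply to a fetch request with this switch's address and buffer information."""
        return {"address": self.address, "buffer_registered": self.registered}

    def send_to(self, nxt: "Switch") -> None:
        """Fetch the next switch's details, then copy this buffer into its buffer."""
        info = nxt.answer_fetch_request()
        if not info["buffer_registered"]:
            raise RuntimeError(f"{info['address']}: buffer not registered")
        nxt.buffer.extend(self.buffer)
        nxt.has_received = True


def build_cluster(switches: List[Switch], payload: bytes) -> bool:
    """Chain the switches; succeed once every downstream switch has received the data."""
    if len(switches) < 2:
        raise ValueError("need at least two switches to interconnect")
    for sw in switches:
        sw.register_buffer()
    switches[0].load(payload)
    for cur, nxt in zip(switches, switches[1:]):
        cur.send_to(nxt)
    return all(sw.has_received for sw in switches[1:])
```

The fetch request is modeled as a synchronous method call; in a real network it would be a message exchange with its own failure handling.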

The process of collecting the link related data through the topological link cluster network comprises:

The topological link cluster network sets a data acquisition time, a link rest time, and a data verification time. During the data acquisition time, the network acquires the link related data and generates link maintenance warnings; during the link rest time, an administrator repairs the corresponding failed transmission links according to those warnings; during the data verification time, the link related data is acquired and imported into a preset data cleaning program, which obtains the data format of the link related data, compares it with a preset standard format, and executes the corresponding operation;
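The format check in the data cleaning step might look like the sketch below. Neither the preset standard format nor the "corresponding operation" is specified, so this assumes the standard format is a field-to-type mapping (`STANDARD_FORMAT`, hypothetical, with fields borrowed from the link related data listed in the next step) and that the operation is to keep conforming records and set aside non-conforming ones.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical preset standard format: field name -> accepted Python type(s).
STANDARD_FORMAT: Dict[str, Union[type, Tuple[type, ...]]] = {
    "link_id": str,
    "link_type": str,                      # "main" or "branch"
    "bandwidth_gbps": (int, float),
    "packet_loss_rate": (int, float),
    "delay_us": (int, float),
    "queue_length": int,
}


def conforms(record: dict) -> bool:
    """True if the record has exactly the standard fields with the expected types."""
    if set(record) != set(STANDARD_FORMAT):
        return False
    if not all(isinstance(record[k], t) for k, t in STANDARD_FORMAT.items()):
        return False
    return record["link_type"] in ("main", "branch")


def clean(records: List[dict]) -> Tuple[List[dict], List[dict]]:
    """Split collected link records into (conforming, rejected)."""
    kept: List[dict] = []
    rejected: List[dict] = []
    for rec in records:
        (kept if conforms(rec) else rejected).append(rec)
    return kept, rejected
```

Returning the rejected records instead of silently dropping them lets an administrator inspect what failed the format check.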

The process of constructing the link traffic model according to the link related data comprises:

The link related data comprises link bandwidth, link packet loss rate, link delay, link queue length, and link type. The link type is either main link or branch link, and each link type has its own link parameters: a bandwidth utilization threshold, a packet loss upper limit, a delay threshold, and a congestion judgment length. The link related data and link parameters of the main links and of the branch links are aggregated separately to generate a plurality of main link model fragments and branch link model fragments; each junction point between a main link and a branch link is marked as a splicing point, and the main link and branch link model fragments are spliced together at the splicing points to construct the link traffic model;
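Splicing main link and branch link fragments at their junction points can be sketched as below. `LinkParams`, `Fragment`, and `splice` are hypothetical names; a fragment's `endpoints` pair of node names is an assumed representation of where a link starts and ends, and a splicing point is taken to be any node shared by a main link fragment and a branch link fragment.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class LinkParams:
    """Per-link-type parameters named in the claim."""
    bw_util_threshold: float      # bandwidth utilization threshold (0..1)
    loss_upper_limit: float       # packet loss upper limit
    delay_threshold_us: float     # delay threshold
    congestion_queue_len: int     # congestion judgment length


@dataclass
class Fragment:
    """One main or branch link model fragment (hypothetical representation)."""
    link_id: str
    link_type: str                # "main" or "branch"
    endpoints: Tuple[str, str]    # (node_a, node_b)
    data: Dict[str, float]        # aggregated link related data
    params: LinkParams


def splice(fragments: List[Fragment]) -> Dict[str, List[str]]:
    """Join main and branch fragments at shared nodes (splicing points).

    Returns a mapping: splicing point -> IDs of the links that meet there.
    """
    mains = [f for f in fragments if f.link_type == "main"]
    branches = [f for f in fragments if f.link_type == "branch"]
    points: Dict[str, List[str]] = {}
    for m in mains:
        for b in branches:
            for node in set(m.endpoints) & set(b.endpoints):
                ids = points.setdefault(node, [])
                for lid in (m.link_id, b.link_id):
                    if lid not in ids:
                        ids.append(lid)
    return points
```

Only main-to-branch junctions become splicing points here, mirroring the claim; junctions between two main links or two branch links are not marked.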

The process of performing feedback training on the link traffic model and generating feedback link parameters for link construction optimization comprises:

The link traffic model maps a primary link traffic monitoring area to the main links and a secondary link traffic monitoring area to the branch links; the AI computing power of the model is obtained and feedback training of the link traffic model is started; the resulting feedback link parameters are sent to an administrator, who assigns the relevant operation and maintenance personnel to carry out link construction optimization;

The process of performing congestion identification through the link traffic model comprises:

2. The AI-based RoCEv2 congestion control method according to claim 1, wherein the process of obtaining the congestion identification result and performing congestion control based on that result comprises:

3. The AI-based RoCEv2 congestion control method according to claim 2, wherein the process of evaluating the link traffic condition after congestion control and generating the control scheme comprises:

where Ω₁ = (0, 0.6), Ω₂ = [0.6, 0.85], Ω₃ = (0.85, 1);

When Sc ∈ Ω₁, the evaluation result is 'poor';

When Sc ∈ Ω₂, the evaluation result is 'good';

When Sc ∈ Ω₃, the evaluation result is 'excellent';