CN113315716B - Training method and equipment of congestion control model and congestion control method and equipment - Google Patents


Info

Publication number
CN113315716B
CN113315716B (application CN202110592772.9A)
Authority
CN
China
Prior art keywords
action
congestion control
network
control model
preference
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110592772.9A
Other languages
Chinese (zh)
Other versions
CN113315716A (en)
Inventor
周超 (Zhou Chao)
陈艳姣 (Chen Yanjiao)
夏振厂 (Xia Zhenchang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Wuhan University WHU
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU, Beijing Dajia Internet Information Technology Co Ltd filed Critical Wuhan University WHU
Priority to CN202110592772.9A
Publication of CN113315716A
Application granted
Publication of CN113315716B
Legal status: Active

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/12Avoiding congestion; Recovering from congestion
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00Traffic control in data switching networks
    • H04L47/10Flow control; Congestion control
    • H04L47/20Traffic policing

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The disclosure provides a training method and device for a congestion control model, and a congestion control method and device. The congestion control method comprises the following steps: acquiring the current first network state information and the current application's preference for network transmission performance; inputting the acquired first network state information and the preference into a congestion control model to obtain a predicted action to be executed for adjusting the size of the congestion window; and executing the predicted action to reset the congestion window.

Description

Training method and equipment of congestion control model and congestion control method and equipment
Technical Field
The present disclosure relates generally to the field of communications technologies, and in particular, to a training method and apparatus for a congestion control model, and a congestion control method and apparatus.
Background
In recent years, in order to solve the network congestion problem and improve network performance, many congestion control protocols including heuristic protocols and learning-based protocols have been proposed.
The learning-based congestion control protocols PCC and PCC Vivace learn the relation between rate-control behaviour and observed performance in an online manner. To avoid the hard mapping between states and actions used in conventional TCP variants, they choose the best sending rate with online learning techniques that continually perturb the sending rate within a small range in search of better utility-function performance, and both can achieve good performance. More generally, learning-based congestion control protocols learn a congestion control policy by interacting with the environment, selecting appropriate actions to control the sending rate or the congestion window depending on the network state. However, these protocols drive performance through pre-designed, fixed reward or objective functions; when a new application appears, they fail to meet its performance requirements, so the objective function must be redesigned and a new model retrained.
Disclosure of Invention
Exemplary embodiments of the present disclosure provide a training method and apparatus for a congestion control model, and a congestion control method and apparatus, to solve at least the problems in the related art described above; the embodiments are not, however, required to overcome any particular disadvantage described above.
According to a first aspect of the embodiments of the present disclosure, there is provided a training method for a congestion control model, including: initializing the communication network environment used by the current training round; inputting the present round's preference for network transmission performance, together with the current first network state information, into the congestion control model to obtain a predicted action to be executed for adjusting the congestion window size; executing the predicted action to reset the congestion window, and controlling the sending end to transmit data packets to the receiving end under the currently set congestion window; when the sending end receives the ACK message fed back by the receiving end, calculating the loss function of the congestion control model according to the action, the first network state information before the action was executed, the first network state information after the action was executed, and the preference; and training the congestion control model by adjusting its model parameters according to the loss function, then determining whether to end the present training round. When it is determined that the round should not end, the method returns to the step of inputting the preference and the current first network state information into the congestion control model to obtain the next predicted action for adjusting the congestion window size.
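The training flow of the first aspect can be sketched as a short loop. All class and method names below (`NetworkEnv`, `CongestionControlModel`, `run_training_round`) are hypothetical stand-ins, and the toy policy and loss are placeholders; the disclosure does not specify an implementation:

```python
import random

class NetworkEnv:
    """Toy stand-in for the communication network environment (hypothetical)."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        # state = [cwnd, delay, ack_rate, send_rate]
        self.state = [10.0, 0.05, 0.9, 5.0]

    def step(self, action):
        # Apply the congestion-window adjustment and return the new state.
        self.state[0] = max(1.0, self.state[0] + action)
        return list(self.state)

class CongestionControlModel:
    """Toy stand-in for the learned model (hypothetical placeholder policy)."""
    def predict(self, state, preference):
        # Placeholder: grow the window while the ACK rate is high.
        return 1.0 if state[2] > 0.5 else -1.0

    def loss(self, action, s_before, s_after, preference):
        return abs(s_after[0] - s_before[0])  # placeholder loss

    def update(self, loss):
        pass  # a gradient step on the model parameters would go here

def run_training_round(model, env, preference, max_steps=50):
    """One training round: predict, reset cwnd, observe ACK feedback, update."""
    losses = []
    for _ in range(max_steps):
        s_before = list(env.state)
        action = model.predict(s_before, preference)  # predicted cwnd adjustment
        s_after = env.step(action)                    # send packets, collect ACKs
        loss = model.loss(action, s_before, s_after, preference)
        model.update(loss)
        losses.append(loss)
    return losses
```

The real model and environment would replace the placeholder classes; the loop structure mirrors the steps of the first aspect.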
Optionally, the first network state information includes at least one of: the size of the congestion window, delay, packet acknowledgement rate, and transmission rate; wherein the delay, the packet acknowledgement rate, and the transmission rate are determined based on the ACK message fed back by the receiving end.
Optionally, the preference for network transmission performance includes a degree of preference for at least one of: throughput, packet loss rate, and latency.
Optionally, the step of determining whether to end the present training round includes: determining whether to end the round according to how the second network state information changes.
Optionally, the step of determining whether to end the training round according to the change in the second network state information includes: when the second network state information after executing an action satisfies a first preset condition, labeling the action a winning action; when it satisfies a second preset condition, labeling the action a failed action; when the number of consecutive winning actions reaches a first preset count, ending the training round; when the number of consecutive failed actions reaches a second preset count, ending the training round; and when the total number of executed actions reaches a third preset count, ending the training round.
Optionally, the second network state information includes throughput and delay. The first preset condition is: throughput is 90%–110% of the bandwidth and delay ≤ 0.7 × timeout threshold. The second preset condition is: throughput is 50%–70% of the bandwidth and delay ≥ 0.7 × timeout threshold.
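The win/lose labelling and round-termination rule above can be written directly from the stated thresholds. The counter parameters (`win_limit`, `lose_limit`, `total_limit`) correspond to the first, second, and third preset counts; the default values here are illustrative:

```python
def classify_action(throughput, delay, bandwidth, timeout_threshold):
    """Label an action per the two preset conditions from the disclosure."""
    if 0.9 * bandwidth <= throughput <= 1.1 * bandwidth and delay <= 0.7 * timeout_threshold:
        return "win"
    if 0.5 * bandwidth <= throughput <= 0.7 * bandwidth and delay >= 0.7 * timeout_threshold:
        return "lose"
    return "neutral"

def should_end_round(labels, win_limit=5, lose_limit=5, total_limit=200):
    """End on a long enough win/lose streak, or when the step budget runs out."""
    if len(labels) >= total_limit:
        return True
    if not labels or labels[-1] == "neutral":
        return False
    streak = 0
    for lab in reversed(labels):  # count the trailing streak
        if lab != labels[-1]:
            break
        streak += 1
    limit = win_limit if labels[-1] == "win" else lose_limit
    return streak >= limit
```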
Optionally, the method further comprises: initializing the size of a congestion window; wherein the step of initializing the size of the congestion window comprises: and estimating the bandwidth of the communication network, and determining the initial size of the congestion window based on the estimated bandwidth.
Optionally, the step of estimating the bandwidth of the communication network includes: determining the total number of ACK messages fed back by the receiving end for N data packets sent by the sending end; and determining the bandwidth of the communication network from the average obtained by dividing that total by N.
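The probe-based estimate can be sketched as follows. The disclosure only states that the bandwidth is derived from the per-packet ACK average and that the initial window is based on the estimate; the `probe_rate` scaling and the `fraction` factor mapping the estimate to an initial window are assumptions:

```python
def ack_ratio(total_acks, n_packets):
    """Average number of ACKs per sent packet over an N-packet probe."""
    return total_acks / n_packets

def estimate_bandwidth(total_acks, n_packets, probe_rate):
    """Estimate bandwidth as the probe sending rate scaled by the ACK ratio
    (the scaling is an assumption; the disclosure only specifies the average)."""
    return probe_rate * ack_ratio(total_acks, n_packets)

def initial_cwnd(est_bandwidth_pkts, fraction=0.5):
    """Map the estimated bandwidth (packets per RTT) to an initial window.
    The `fraction` factor is an illustrative assumption."""
    return max(1, int(est_bandwidth_pkts * fraction))
```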
Optionally, when the sending end receives the ACK message fed back by the receiving end, the step of calculating the loss function of the congestion control model according to the action, the first network state information before the action is performed, the first network state information after the action is performed, and the preference includes: when the sending end receives the ACK message fed back by the receiving end, the loss function of the congestion control model is calculated according to the action, the first network state information before the action is executed, the first network state information after the action is executed, the reward function of the action and the preference.
Optionally, the reward function of the action is calculated based on the preference and third network state information after performing the action; wherein the third network state information comprises at least one of: packet loss rate, throughput, and delay.
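A natural form for such a preference-conditioned reward is a weighted combination of the three quantities; the linear form below is an illustrative assumption, since the disclosure only names the inputs that enter the reward:

```python
def reward(preference, throughput, loss_rate, delay):
    """Preference-weighted reward: throughput is rewarded, packet loss and
    delay are penalized. The linear combination is an illustrative
    assumption; the disclosure only names the quantities involved."""
    w_tp, w_loss, w_delay = preference  # per-metric preference weights
    return w_tp * throughput - w_loss * loss_rate - w_delay * delay
```

A throughput-sensitive application would use a large `w_tp`; a delay-sensitive one a large `w_delay`.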
Optionally, the congestion control model is constructed based on a reinforcement learning algorithm; wherein the value function in the reinforcement learning algorithm is a value function related to an action, first network state information, and a preference for network transmission performance.
Optionally, when the congestion control model predicts an action, with probability ε the action is randomly selected from the action set, and with probability 1−ε it is the optimal action obtained using the value function.
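This is the standard ε-greedy selection rule, which can be sketched as:

```python
import random

def epsilon_greedy(q_values, actions, epsilon, rng=random):
    """With probability epsilon, pick a random action from the action set;
    otherwise pick the action maximizing the value function."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_values[a])
```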
Optionally, the loss function of the congestion control model is calculated based on a loss function L_S(θ), whose purpose is to push the value function toward the maximum reward, and an auxiliary loss function L_T(θ).
Optionally, the loss function of the congestion control model is expressed as (1−ε)·L_S(θ) + ε·L_T(θ), where ε is a trade-off coefficient with 0 < ε ≤ 1; the later within a training round an action is predicted, the larger the value of ε used when calculating the loss function for that action.
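A minimal sketch of the combined loss, with an illustrative linear schedule for ε (the disclosure only states that ε grows for actions predicted later in the round, not the exact schedule):

```python
def tradeoff_epsilon(step, total_steps):
    """Illustrative linear schedule: actions predicted later in a round
    weight the auxiliary loss more heavily (0 < eps <= 1)."""
    return min(1.0, (step + 1) / total_steps)

def combined_loss(loss_s, loss_t, eps):
    """(1 - eps) * L_S(theta) + eps * L_T(theta)."""
    assert 0.0 < eps <= 1.0
    return (1.0 - eps) * loss_s + eps * loss_t
```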
Optionally, the objective function of the congestion control model is a composite objective function with respect to: the reward function, the value function, the first network state information after performing the action, the first network state information before performing the action, the preference of the present training round for network transmission performance, and the best preference under the current network environment.
Optionally, the method further comprises: when it is determined to end the present training round, determining whether to end the training process of the congestion control model; and when the training process is not to be ended, returning to the step of initializing the communication network environment to enter the next training round.
According to a second aspect of the embodiments of the present disclosure, there is provided a congestion control method, including: acquiring current first network state information and preference of current application to network transmission performance; inputting the acquired first network state information and the preferences into a congestion control model to obtain a predicted action to be executed for adjusting the size of a congestion window; the predicted actions are performed to reset the congestion window.
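The three steps of the congestion control method reduce to a short control loop; the state layout and method names below are hypothetical:

```python
def control_step(model, get_state, preference, set_cwnd):
    """One congestion-control step: read the current network state, predict a
    congestion-window adjustment, and apply it."""
    state = get_state()                        # current first network state info
    action = model.predict(state, preference)  # predicted window adjustment
    new_cwnd = max(1, state["cwnd"] + action)  # keep at least one packet in flight
    set_cwnd(new_cwnd)                         # reset the congestion window
    return new_cwnd
```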
Optionally, the first network state information includes at least one of: the size of the congestion window, delay, packet acknowledgement rate, and transmission rate; wherein the delay, the packet acknowledgement rate, and the transmission rate are determined based on the ACK message fed back by the receiving end.
Optionally, the preference for network transmission performance includes a degree of preference for at least one of: throughput, packet loss rate, and latency.
Optionally, the method further comprises: initializing the size of a congestion window; wherein the step of initializing the size of the congestion window comprises: the bandwidth of the communication network is estimated and an initial size of the congestion window is determined based on the estimated bandwidth.
Optionally, the step of estimating the bandwidth of the communication network comprises: determining the total number of ACK messages fed back by a receiving end for the N sent data packets; and determining the bandwidth of the communication network according to the average value obtained by dividing the total number by N.
Optionally, the congestion control model is constructed based on a reinforcement learning algorithm; wherein the value function in the reinforcement learning algorithm is a value function related to an action, first network state information, and a preference for network transmission performance.
Optionally, the congestion control model is trained using a training method as described above.
According to a third aspect of embodiments of the present disclosure, there is provided a training apparatus of a congestion control model, including: an environment initializing unit configured to initialize a communication network environment used by a current training round; the prediction unit is configured to input the preference of the training round to the network transmission performance and the current first network state information into the congestion control model to obtain a predicted action which needs to be executed and is used for adjusting the congestion window size; a congestion window setting unit configured to perform a predicted action to reset a congestion window and control a transmitting end to transmit a data packet to a receiving end under the currently set congestion window; a loss function calculating unit configured to calculate, when the transmitting end receives the ACK message fed back by the receiving end, a loss function of the congestion control model according to the action, the first network state information before the action is performed, the first network state information after the action is performed, and the preference; a training unit configured to train the congestion control model by adjusting model parameters of the congestion control model according to the loss function; and the round ending determining unit is configured to determine whether to end the training round, wherein when the training round is determined not to end, the predicting unit inputs the preference of the training round for the network transmission performance and the current first network state information into the congestion control model to obtain a predicted action which needs to be executed and is used for adjusting the congestion window size.
Optionally, the first network state information includes at least one of: the size of the congestion window, delay, packet acknowledgement rate, and transmission rate; wherein the delay, the packet acknowledgement rate, and the transmission rate are determined based on the ACK message fed back by the receiving end.
Optionally, the preference for network transmission performance includes a degree of preference for at least one of: throughput, packet loss rate, and latency.
Optionally, the round-ending determining unit is configured to determine whether to end the present training round according to a change situation of the second network state information.
Optionally, the round-ending determining unit is configured to determine that an action is a winning action when the second network state information after executing the action satisfies a first preset condition; to determine that the action is a failed action when it satisfies a second preset condition; to determine to end the training round when the number of consecutive winning actions reaches a first preset count; to determine to end the training round when the number of consecutive failed actions reaches a second preset count; and to determine to end the training round when the total number of executed actions reaches a third preset count.
Optionally, the second network state information includes throughput and delay. The first preset condition is: throughput is 90%–110% of the bandwidth and delay ≤ 0.7 × timeout threshold. The second preset condition is: throughput is 50%–70% of the bandwidth and delay ≥ 0.7 × timeout threshold.
Optionally, the apparatus further comprises: a window initialization unit configured to initialize a size of a congestion window; wherein the window initialization unit is configured to estimate a bandwidth of the communication network and to determine an initial size of the congestion window based on the estimated bandwidth.
Optionally, the window initializing unit is configured to determine the total number of ACK messages fed back by the receiving end for the N data packets sent by the sending end; and determining the bandwidth of the communication network according to the average value obtained by dividing the total number by N.
Optionally, the loss function calculation unit is configured to calculate, when the sending end receives the ACK message fed back by the receiving end, a loss function of the congestion control model according to the action, the first network state information before the action is performed, the first network state information after the action is performed, the reward function of the action, and the preference.
Optionally, the reward function of the action is calculated based on the preference and third network state information after performing the action; wherein the third network state information comprises at least one of: packet loss rate, throughput, and delay.
Optionally, the congestion control model is constructed based on a reinforcement learning algorithm; wherein the value function in the reinforcement learning algorithm is a value function related to an action, first network state information, and a preference for network transmission performance.
Optionally, when the congestion control model predicts an action, with probability ε the action is randomly selected from the action set, and with probability 1−ε it is the optimal action obtained using the value function.
Optionally, the loss function of the congestion control model is calculated based on a loss function L_S(θ), whose purpose is to push the value function toward the maximum reward, and an auxiliary loss function L_T(θ).
Optionally, the loss function of the congestion control model is expressed as (1−ε)·L_S(θ) + ε·L_T(θ), where ε is a trade-off coefficient with 0 < ε ≤ 1; the later within a training round an action is predicted, the larger the value of ε used when calculating the loss function for that action.
Optionally, the objective function of the congestion control model is a composite objective function with respect to: the reward function, the value function, the first network state information after performing the action, the first network state information before performing the action, the preference of the present training round for network transmission performance, and the best preference under the current network environment.
Optionally, the apparatus further comprises: and a training end determining unit configured to determine whether to end the training process of the congestion control model when it is determined to end the present training round, wherein when it is determined not to end the training process of the congestion control model, the environment initializing unit initializes the communication network environment used by the current training round to enter the next training round.
According to a fourth aspect of embodiments of the present disclosure, there is provided a congestion control apparatus comprising: an acquisition unit configured to acquire current first network state information and a preference of a current application for network transmission performance; a prediction unit configured to input the acquired first network state information and the preference into a congestion control model, and obtain a predicted action to be performed for adjusting the congestion window size; and a congestion window setting unit configured to perform a predicted action to reset the congestion window.
Optionally, the first network state information includes at least one of: the size of the congestion window, delay, packet acknowledgement rate, and transmission rate; wherein the delay, the packet acknowledgement rate, and the transmission rate are determined based on the ACK message fed back by the receiving end.
Optionally, the preference for network transmission performance includes a degree of preference for at least one of: throughput, packet loss rate, and latency.
Optionally, the apparatus further comprises: a window initialization unit configured to initialize a size of a congestion window; wherein the window initialization unit is configured to estimate a bandwidth of the communication network and to determine an initial size of the congestion window based on the estimated bandwidth.
Optionally, the window initializing unit is configured to determine the total number of ACK messages fed back by the receiving end for the N data packets sent; and determining the bandwidth of the communication network according to the average value obtained by dividing the total number by N.
Optionally, the congestion control model is constructed based on a reinforcement learning algorithm; wherein the value function in the reinforcement learning algorithm is a value function related to an action, first network state information, and a preference for network transmission performance.
Optionally, the congestion control model is trained using a training device as described above.
According to a fifth aspect of embodiments of the present disclosure, there is provided an electronic device, comprising: at least one processor; at least one memory storing computer-executable instructions, wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform a training method of a congestion control model as described above and/or a congestion control method as described above.
According to a sixth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium storing instructions which, when executed by at least one processor, cause the at least one processor to perform the training method of a congestion control model as described above and/or the congestion control method as described above.
According to a seventh aspect of embodiments of the present disclosure, there is provided a computer program product comprising computer instructions which, when executed by at least one processor, implement a training method of a congestion control model as described above and/or a congestion control method as described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
According to the congestion control model of the exemplary embodiments of the present disclosure, the optimal congestion control strategy can be selected according to an application's preference for transmission performance, so that the transmission performance requirements of different applications can be met without redesigning the objective function or retraining the model. The congestion control method according to the exemplary embodiments of the present disclosure is suitable for congestion control of various types of applications and can realize a trade-off among throughput, delay, and packet loss, thereby meeting the transmission performance requirements of different types of applications;
the multi-objective reinforcement learning network of the congestion control model of the exemplary embodiments of the present disclosure can optimize over the entire preference space of congestion control, enabling the trained model to generate an optimal strategy for any given preference. This fundamentally changes the fixed objective- or utility-function design of existing protocols and offers great advantages in serving different types of applications;
by setting different initial congestion window values for different network bandwidths, network convergence can be effectively accelerated;
by the method for ending the training rounds according to the change of the network environment, the training rounds can be ended at proper time according to the change of the network bandwidth utilization rate, the time delay and the throughput, so that the training efficiency of the model is improved;
In addition, a method is provided for improving the training quality of the congestion control model by ending training rounds on win/lose/draw outcomes, which mitigates spurious round termination when multi-objective reinforcement learning is applied to the congestion control problem.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
Fig. 1 shows a schematic diagram of an implementation scenario of a congestion control method and apparatus according to an exemplary embodiment of the present disclosure;
fig. 2 illustrates a flowchart of a training method of a congestion control model according to an exemplary embodiment of the present disclosure;
fig. 3 illustrates a flow chart of a congestion control method according to an exemplary embodiment of the present disclosure;
fig. 4 illustrates a schematic diagram of a training method of a congestion control model and a congestion control method according to an exemplary embodiment of the present disclosure;
fig. 5 shows a block diagram of a training apparatus of a congestion control model according to an exemplary embodiment of the present disclosure;
Fig. 6 shows a block diagram of a congestion control apparatus according to an exemplary embodiment of the present disclosure;
fig. 7 shows a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used to distinguish between similar objects and do not necessarily describe a particular sequence or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that the embodiments of the disclosure described herein can be practiced in orders other than those illustrated or described herein. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of apparatuses and methods consistent with some aspects of the disclosure as detailed in the appended claims.
It should be noted that, in this disclosure, "at least one of the items" covers three parallel cases: "any one of the items," "a combination of any of the items," and "all of the items." For example, "including at least one of A and B" covers three cases: (1) including A; (2) including B; (3) including A and B. Likewise, "at least one of step one and step two is executed" covers three cases: (1) executing step one; (2) executing step two; (3) executing step one and step two.
With the rapid development of mobile internet technology and the increasing number of terminals, many terminal devices have different types of applications installed, including delay-sensitive and throughput-sensitive applications. Delay-sensitive applications such as network telephony or cloud gaming require transmission delays as low as a few milliseconds and may not benefit from higher bandwidth, while throughput-sensitive applications such as video streaming or file sharing typically require high bandwidth to achieve good performance. Depending on the type of application and its requirements for network transmission performance (e.g., high throughput, low latency, and low packet loss), congestion control may need to follow quite different policies. As shown in fig. 1, if an application is a throughput-sensitive file-transfer application, throughput is critical: its requirement on file-transfer throughput is high while its requirement on delay is relatively loose. If an application is a delay-sensitive real-time streaming application, minimizing delay is crucial: it requires low transmission delay to reduce video stuttering and relatively low packet loss, while its throughput requirement is comparatively low. As the most important protocol of the network transport layer, the congestion control protocol needs to provide high-quality network service for applications with different performance requirements; that is, the transport layer must adapt not only to changing network conditions but also to different application requirements, thereby meeting the varied needs of users and improving their quality of experience.
The method of the present disclosure utilizes multi-objective reinforcement learning and preferences, and can be adapted to different types of applications. Specifically, an application's preference for network transmission performance and the current network state information can be input into the congestion control model, and the model gives an optimal action for adjusting the congestion window size according to the application's performance requirements.
It should be appreciated that the congestion control method and/or congestion control apparatus according to the present disclosure may be applied not only to the above scenario, but also to other suitable scenarios, to which the present disclosure is not limited.
Fig. 2 shows a flowchart of a training method of a congestion control model according to an exemplary embodiment of the present disclosure.
Referring to fig. 2, in step S101, a communication network environment used for a current training round is initialized.
The communication network environment is used for data transmission between a transmitting end and a receiving end used in the training round.
In addition, as an example, before the training round starts, the sending end and the receiving end used in the training round may be initialized, and a handshake may be performed between them so that, during the training round, the sending end continuously sends data packets to the receiving end over the communication network environment and the receiving end returns ACK (acknowledgement) messages to the sending end. The current network state can then be monitored by analyzing the acknowledgement messages returned by the receiving end.
In step S102, the preference of the present training round for network transmission performance and the current first network state information S are input into the congestion control model, so as to obtain the predicted action a to be performed for adjusting the congestion window size.
As an example, the first network state information may include at least one of: the size of the congestion window, the delay, the packet acknowledgement rate ack_rate, and the sending rate. For example, the delay, packet acknowledgement rate, and sending rate may be determined based on the ACK messages (i.e., acknowledgement messages) fed back by the receiving end. For example, the sending end may be controlled to send data packets to the receiving end under the currently set congestion window size; when the sending end receives an ACK message fed back by the receiving end, it may determine the current delay, packet acknowledgement rate, and sending rate based on that message, thereby obtaining the current first network state information.
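As an illustrative sketch of the first network state information, the record below groups the four fields named above into a small structure that can be fed to a model as a vector; the field names and the sample values are hypothetical, not taken from the disclosure:

```python
from dataclasses import dataclass

@dataclass
class NetState:
    """First network state information s (illustrative field names)."""
    cwnd: float          # current congestion window size
    delay: float         # delay measured from the ACK message
    ack_rate: float      # packet acknowledgement rate
    sending_rate: float  # sending rate

    def as_vector(self):
        # Flatten into the vector form a model would consume.
        return [self.cwnd, self.delay, self.ack_rate, self.sending_rate]

s = NetState(cwnd=20.0, delay=0.05, ack_rate=0.98, sending_rate=18.5)
```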
As an example, the preference for network transmission performance may include a degree of preference for at least one of: throughput, packet loss rate, and latency.
As an example, one sample from the set of preferences may be selected as a preference for network transmission performance for this training round. For example, one sample in the preference set may be: the preference for throughput is 0.7, the preference for packet loss is 0.2, and the preference for time delay is 0.1; another sample in the preference set may be: the preference for throughput is 0.5, the preference for packet loss is 0.1, and the preference for latency is 0.4. It should be appreciated that the set of preferences may be set according to various requirements of different types of applications for network transmission performance.
In step S103, a predicted action is performed to reset the congestion window, and the transmitting end is controlled to transmit a data packet to the receiving end under the currently set congestion window.
As an example, a predicted action may be performed on the current congestion window size to get the size of the congestion window that needs to be set and set.
As an example, the action predicted by the congestion control model may be one action in an action set, where each action is an adjustment applied to the current congestion window size. The action set may be, for example, { +0.5, -50, -10.0, +0.0, +10.0, +2.0, +50 }: +10.0 means the congestion window to be set is the current congestion window size plus 10.0; -10.0 means the congestion window to be set is the current congestion window size minus 10.0; +0.0 means the congestion window size is left unchanged.
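The additive action semantics above can be sketched as follows; clamping the window to a floor of one segment is an assumption added so that a large negative action cannot drive the window to zero or below:

```python
# Additive action set, following the (partially garbled) example in the text:
# each action is a delta added to the current congestion window size.
ACTION_SET = [+0.5, -50.0, -10.0, +0.0, +10.0, +2.0, +50.0]

def apply_action(cwnd, action, min_cwnd=1.0):
    """Apply an additive action to the congestion window, clamped at a floor."""
    return max(min_cwnd, cwnd + action)
```

For example, `apply_action(100.0, +10.0)` yields a new window of 110.0, while `apply_action(5.0, -50.0)` is clamped at the floor.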
In step S104, when the sender receives the ACK message fed back by the receiver, a loss function of the congestion control model is calculated according to the action a, the first network state information before the action is performed (i.e., the first network state information S input to the congestion control model to predict the action a), the first network state information S' after the action is performed, and the preference of the training round for network transmission performance.
Here, the first network state information S' after performing the action is the first network state information determined based on the ACK message fed back by the receiving side for the data packet transmitted to the receiving side in step S103.
As an example, the loss function of the congestion control model may be calculated from the action a, the first network state information S before performing the action, the first network state information S' after performing the action, the reward function r of the action, and the preference of this training round for network transmission performance.
The reward function of an action measures the benefit of the action predicted by the congestion control model. As an example, the reward function of the action may be calculated based on the preference of the present training round for network transmission performance and the third network state information after performing the action. For example, the third network state information may include at least one of: packet loss rate, throughput, and delay. It should be understood that the third network state information after performing the action is the third network state information determined based on the ACK messages fed back by the receiving end for the data packets transmitted to the receiving end in step S103.
As an example, the reward function of the action may be determined based on the triple [L(Throughput(t)), L(Loss_rate(t)), L(Delay(t))]; for example, the reward function of an action may be a weighted sum of these three quantities, where the weight of each quantity is related to the preference of the present training round for network transmission performance. Here, t represents time; L(x) represents an activation function, e.g., L(x) = -10^(-x) + 1; Throughput(t) may represent the normalized throughput, e.g., the throughput divided by the bandwidth; Delay(t) may represent the normalized delay, e.g., the delay divided by a timeout threshold; and Loss_rate(t) may represent the packet loss rate itself, i.e., no normalization of the packet loss rate is required.
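A minimal sketch of such a preference-weighted reward is given below. The disclosure only says the reward is a weighted sum of the three activated quantities; treating the loss-rate and delay terms as penalties (negative weights) is an assumption made here so that higher throughput and lower delay both increase the reward:

```python
def activation(x):
    """L(x) = -10**(-x) + 1, the activation function from the text."""
    return 1.0 - 10.0 ** (-x)

def reward(omega, throughput, bandwidth, loss_rate, delay, timeout):
    """Preference-weighted reward; penalty signs are an assumption."""
    w_tp, w_loss, w_delay = omega
    return (w_tp * activation(throughput / bandwidth)   # normalized throughput
            - w_loss * activation(loss_rate)             # raw packet loss rate
            - w_delay * activation(delay / timeout))     # normalized delay
```

Under a throughput-heavy preference such as (0.7, 0.2, 0.1), doubling the achieved throughput raises the reward, while added loss or delay lowers it.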
As an example, in step S103, the action predicted by the congestion control model is performed to reset the congestion window, the transmitting end is controlled to transmit data packets to the receiving end under the newly set congestion window, and the acknowledgement messages (ACKs) returned by the receiving end are awaited. In step S104, after the ACKs are received, the reward function r of this step of the training round may be obtained by calculating the RTT, comparing the number of packets sent with the number of acknowledgement messages, and the like, together with the observed current first network state information S' (i.e., the first network state information after performing the action). As an example, the state S before this step, the action a of this step, the reward r obtained after performing the action, and the new state S' transferred to may be saved in vector form as a tuple (S, a, r, S') in a replay buffer D. Accordingly, as an example, the replay buffer D may be initialized (emptied) before each training round starts. Furthermore, it should be understood that the first network state information after the action is performed in this step may be used as the first network state information before the action is performed in the next step.
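The replay-buffer bookkeeping described above can be sketched as follows; the capacity and the inclusion of the preference omega in each stored transition are illustrative assumptions:

```python
from collections import deque
import random

class ReplayBuffer:
    """Minimal replay buffer D storing (s, a, r, s', omega) transitions."""
    def __init__(self, capacity=10000):
        self.buf = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, omega):
        # Oldest transitions are evicted automatically once at capacity.
        self.buf.append((s, a, r, s_next, omega))

    def sample(self, batch_size):
        # Uniform minibatch sampling for training updates.
        return random.sample(list(self.buf), batch_size)

    def clear(self):
        # Re-initialize (empty) the buffer at the start of each round.
        self.buf.clear()
```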
As an example, the congestion control model may be constructed based on a reinforcement learning algorithm DQN. By way of example, when the preference for network transmission performance includes a preference level for multiple network transmission performance, the congestion control model is a multi-objective model. As an example, the congestion control model may be a multi-objective reinforcement learning model.
As an example, the value function (Q function) in the reinforcement learning algorithm may be a value function regarding actions, first network state information, and preferences for network transmission performance.
As an example, the congestion control model may sample an action as the predicted action a_t using an ε-greedy policy, as in equation (1): with probability ε the sampled action is one randomly selected from the action set A, and with probability 1-ε the sampled action is the optimal action obtained using the Q function:

a_t = a random action from A, with probability ε; a_t = argmax_{a∈A} ω^T Q(s_t, a, ω; θ), with probability 1-ε   (1)

where A represents the action set, ω represents the preference of this training round for network transmission performance, θ represents the parameters of the Q function, and s_t represents the current first network state information.
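The ε-greedy sampling of equation (1) can be sketched as follows, assuming the per-action Q values have already been scalarized by the preference (i.e., each entry is ω^T Q(s_t, a_i, ω; θ)):

```python
import random

def epsilon_greedy(scalar_q, action_count, epsilon):
    """Return an action index: random with prob. epsilon, greedy otherwise.

    scalar_q[i] is assumed to be the preference-scalarized value
    omega^T Q(s_t, a_i, omega; theta).
    """
    if random.random() < epsilon:
        return random.randrange(action_count)      # exploration
    return max(range(action_count), key=lambda i: scalar_q[i])  # exploitation
```

With `epsilon=0.0` the choice is purely greedy; with `epsilon=1.0` it is purely random over the action set.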
As an example, the congestion control model may be a multi-objective model. To represent the congestion control problem mathematically, assume there are m_f objectives, each expressed by an objective function m_i(O); the goal is to maximize all objective functions simultaneously:

max [ m_1(O), m_2(O), …, m_{m_f}(O) ]

s.t. g_i(O) ≤ 0, i = 1, …, a_g

where m_i(O) represents the objective function of the i-th objective, i = 1, …, m_f, and g_i(O) represents a constraint function of the congestion control problem.
As an example, the objective function of the congestion control model may be a composite objective function with respect to: the reward function, the value function, the first network state information after performing the action, the first network state information before performing the action, the preference of the present training round for network transmission performance, and the best preference under the current network environment.
As an example, the objective function of the congestion control model may be a composite objective function TQ(s, a, ω), which may be expressed as:

TQ(s, a, ω) = r(s, a) + γ · Q(s′, a′, ω′; θ), where (a′, ω′) = argmax_{a∈A, ω′∈Ω} ω^T Q(s′, a, ω′; θ)

Here, r() represents the reward function, γ represents a weight coefficient, Q() represents the value function, s′ represents the first network state information after performing the action, s represents the first network state information before performing the action, a represents the action, ω represents the preference of this training round for network transmission performance, ω′ represents the optimal preference under the current network environment, A represents the action set, and Ω represents the preference set.
As an example, the loss function of the congestion control model may be calculated based on a loss function L_S(θ), whose purpose is to bring the value function closer to the maximum expected reward, and an auxiliary loss function L_T(θ).
Here, the auxiliary loss function is proposed because the optimal boundary contains a large number of discrete solutions, which makes the curve of the loss function non-smooth.
As an example, the loss function of the congestion control model may be expressed as: (1-ε)·L_S(θ) + ε·L_T(θ), where ε is a trade-off coefficient with 0 ≤ ε ≤ 1; the later in a training round an action is predicted, the greater the value of ε used when calculating the loss function of the congestion control model for that action. In other words, in each training round ε has an initial value of 0 and gradually increases toward 1 as the number of steps grows, so that the loss function migrates from L_S(θ) toward L_T(θ).
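The ε schedule described above can be sketched as follows, assuming a linear ramp of ε from 0 to 1 over the steps of a round (the exact growth schedule is not specified in the disclosure):

```python
def combined_loss(loss_s, loss_t, step, total_steps):
    """Blend L_S and L_T: (1 - eps) * L_S + eps * L_T with a linear eps ramp."""
    eps = min(1.0, step / float(total_steps))  # eps grows from 0 toward 1
    return (1.0 - eps) * loss_s + eps * loss_t
```

Early in the round the blended loss is dominated by L_S; by the end it is dominated by L_T.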
As an example, the loss function L_S(θ) may be expressed as:

L_S(θ) = E_{s,a,ω} [ ‖ y - Q(s, a, ω; θ) ‖² ]

As an example, the auxiliary loss function L_T(θ) may be expressed as:

L_T(θ) = E_{s,a,ω} [ | ω^T y - ω^T Q(s, a, ω; θ) | ]

where y = r + γ · Q(s′, a′, ω′; θ_k) with (a′, ω′) = argmax_{a∈A, ω′∈Ω} ω^T Q(s′, a, ω′; θ_k); r represents the reward function, γ represents a weight coefficient, θ represents the model parameters, θ_k represents the parameters of the model at the k-th step, Q() represents the value function, s′ represents the first network state information after performing the action, s represents the first network state information before performing the action, a represents the action, ω represents the preference of this training round for network transmission performance, and ω′ represents the optimal preference under the current network environment.
In step S105, the congestion control model is trained by adjusting model parameters of the congestion control model according to the loss function.
As an example, the parameter θ of the Q-function of the congestion control model may be adjusted according to the loss function.
As an example, stochastic gradient descent may be performed on the parameter θ of the Q function using equation (3) to update the Q function of the model:

θ ← θ - α · ∇_θ L(θ)   (3)

where ∇_θ L(θ) represents the gradient of the loss function with respect to the parameter θ, and α represents the learning rate.
in step S106, it is determined whether to end the present training round, wherein when it is determined not to end the present training round, execution is returned to step S102.
Further, as an example, the training method of the congestion control model according to the exemplary embodiment of the present disclosure may further include: when it is determined to end the present training round, determining whether to end the training process of the congestion control model; when it is determined not to end the training process, returning to step S101 to prepare and enter the next training round; and when it is determined to end the training process, stopping training, i.e., training of the congestion control model is complete. For example, whether to end the training process may be determined based on the predictive performance of the congestion control model, the total training duration, or the like. Further, it should be appreciated that the initial communication network environment of different training rounds may be the same or different, and the preferences of different training rounds for network transmission performance may be the same or different.
As an example, it may be determined whether to end the present training round based on the change in the second network status information.
As an example, an action may be determined to be a winning action when the second network state information after performing the action satisfies a first preset condition, and determined to be a failed action when the second network state information after performing the action satisfies a second preset condition. When the number of consecutive winning actions Win_Num reaches a first preset number, it is determined to end the present training round; when the number of consecutive failed actions Lose_Num reaches a second preset number, it is determined to end the present training round; and when the total number of actions performed reaches a third preset number, it is determined to end the present training round. For example, the first preset number M may be set to 50, the second preset number L may be set to 50, and the third preset number X may be set to 200.
As an example, the second network state information may include: throughput and delay.
As an example, the first preset condition may be: the throughput is 90%-110% of the bandwidth and the delay is ≤ 0.7 × the timeout threshold; the second preset condition may be: the throughput is 50%-70% of the bandwidth and the delay is ≥ 0.7 × the timeout threshold.
Inspired by the win/lose concept of game play and decision making, the present disclosure provides a method for dynamically ending a training round based on changes in the network environment: the round can be ended at an appropriate time according to changes in network bandwidth utilization, delay, and throughput, thereby improving the training efficiency of the model, i.e., the training round can be ended in the most appropriate way. A training round is one complete training process of the reinforcement learning algorithm, in which a series of actions is selected sequentially at each moment according to the network state and the congestion control strategy; the length of a training round has a great influence on whether the optimal model can be learned. As an example, at the start of each training round, Win_Num and Lose_Num are initialized to 0. Whenever a predicted action is determined to be a winning action, Win_Num is incremented by 1 and Lose_Num is reset to 0; whenever a predicted action is determined to be a failed action, Lose_Num is incremented by 1 and Win_Num is reset to 0. If Win_Num ≥ M, the training round is stopped with a win; if Lose_Num ≥ L, the training round is stopped with a loss; and if the current training round has already executed X steps (in other words, Win_Num has not reached M and Lose_Num has not reached L within X steps), the training round is ended with a tie.
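The win/lose/tie termination rule above can be sketched as a small state machine; the win and lose conditions follow the preset conditions stated earlier, and the thresholds M, L, X default to the example values in the text:

```python
class RoundTerminator:
    """Win/lose/tie early-stop rule for a training round (sketch)."""
    def __init__(self, m=50, l=50, x=200):
        self.m, self.l, self.x = m, l, x        # M, L, X from the text
        self.win_num = self.lose_num = self.steps = 0

    def update(self, throughput, bandwidth, delay, timeout):
        """Observe one step; return 'win', 'lose', 'tie', or None (continue)."""
        self.steps += 1
        if 0.9 * bandwidth <= throughput <= 1.1 * bandwidth and delay <= 0.7 * timeout:
            self.win_num += 1                    # winning action
            self.lose_num = 0
        elif 0.5 * bandwidth <= throughput <= 0.7 * bandwidth and delay >= 0.7 * timeout:
            self.lose_num += 1                   # failed action
            self.win_num = 0
        if self.win_num >= self.m:
            return "win"
        if self.lose_num >= self.l:
            return "lose"
        if self.steps >= self.x:
            return "tie"
        return None
```

Steps that satisfy neither condition simply advance the step counter, so a round that never accumulates M wins or L losses still ends in a tie after X steps.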
In addition, the present disclosure considers that the initial congestion window (init-cwnd, i.e., the size of the congestion window set when a transmitting end starts transmitting data) has a significant effect on the model convergence rate. In the related art, however, init-cwnd is generally set to a fixed value, so rapid convergence cannot be achieved when link capacities differ across network scenarios. Considering the large differences between links, the present disclosure proposes dynamically setting the initial congestion window according to the link bandwidth, so as to improve the convergence speed of the congestion control method.
As an example, the training method of the congestion control model according to the exemplary embodiment of the present disclosure may further include: initializing the size of a congestion window; wherein the step of initializing the size of the congestion window may comprise: and estimating the bandwidth of the communication network, and determining the initial size of the congestion window based on the estimated bandwidth.
As an example, the step of estimating the bandwidth of the communication network may comprise: determining the total number Num_ack of ACK messages fed back by the receiving end for N data packets sent by the sending end; and determining the bandwidth bw_i of the communication network according to the average value Num_ave obtained by dividing Num_ack by N.
As an example, num can be found from a predefined Bandwidth combination Bandwidth by the formula (4) ave Corresponding estimated network bandwidth bw i
Figure BDA0003090200030000151
Wherein, the liquid crystal display device comprises a liquid crystal display device,
Figure BDA0003090200030000152
representing a set of characteristic functions defined in equation (5) with respect to the reception rate, (c) j-1 ,c j ) Representing a set of received rates of receipt,
Figure BDA0003090200030000153
as an example, the estimated bandwidth bw may be based on equation (6) i Determining an initial size W of a congestion window init-cwnd Where b is a coefficient, for example, b may be set to 2.5 according to experimental data to achieve better fitting rate and learning effect,
W init-cwnd =b*bw i (6)
as an example, the size of the congestion window may be initialized before starting the present training round to use W when the sender first sends a data packet in the present training round init-cwnd . As another example, the size of the congestion window may be initialized based on network state information of the first N steps of the present training round.
It should be appreciated that if the communication network environments of multiple training rounds are the same, the multiple training rounds may share the same W init-cwnd
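Equations (4)-(6) can be sketched as follows; the bandwidth tiers and cut-off intervals passed in are hypothetical placeholders, since the disclosure does not give concrete values for the predefined bandwidth set or the interval endpoints c_j:

```python
def estimate_bandwidth(num_ack, n, bandwidths, cutoffs):
    """Map Num_ave = Num_ack / N onto a predefined bandwidth tier (eq. 4-5).

    bandwidths[j] is Bandwidth_j; cutoffs[j] = (c_{j-1}, c_j] is the interval
    whose indicator function selects that tier.
    """
    num_ave = num_ack / float(n)
    for bw, (lo, hi) in zip(bandwidths, cutoffs):
        if lo < num_ave <= hi:          # indicator chi_(c_{j-1}, c_j](Num_ave)
            return bw
    return bandwidths[-1]               # fall back to the last tier

def init_cwnd(bw_i, b=2.5):
    """W_init-cwnd = b * bw_i (equation (6)); b = 2.5 per the text."""
    return b * bw_i

# Hypothetical tiers: average ACK counts in (0, 0.5] map to a 12 Mbps link,
# (0.5, 1.0] to a 50 Mbps link (values chosen only for illustration).
bw = estimate_bandwidth(num_ack=80, n=100,
                        bandwidths=[12, 50],
                        cutoffs=[(0.0, 0.5), (0.5, 1.0)])
```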
Fig. 3 shows a flowchart of a congestion control method according to an exemplary embodiment of the present disclosure.
Referring to fig. 3, in step S201, current first network state information and preferences of a current application for network transmission performance are acquired.
Here, the application is an application that performs data transmission using the congestion control method.
In step S202, the obtained first network state information and the preferences are input to a congestion control model, resulting in a predicted action to be performed for adjusting the congestion window size.
In step S203, a predicted action is performed to reset the congestion window. It should be appreciated that the congestion control method according to the exemplary embodiments of the present disclosure may be repeatedly performed to adjust the congestion window size in real time according to the network status.
As an example, the first network state information may include at least one of: the size of the congestion window, delay, packet acknowledgement rate, and transmission rate. For example, the delay, packet acknowledgement rate, and transmission rate may be determined based on the ACK message fed back by the receiving end.
As an example, the preference for network transmission performance may include a degree of preference for at least one of: throughput, packet loss rate, and latency.
As an example, the congestion control model may be constructed based on a reinforcement learning algorithm; wherein the value function in the reinforcement learning algorithm may be a value function regarding actions, first network state information, and preferences for network transmission performance.
As an example, the congestion control model may be trained using the training method as described in the above exemplary embodiments.
As an example, the congestion control method according to an exemplary embodiment of the present disclosure may further include: initializing the size of a congestion window; wherein the step of initializing the size of the congestion window may comprise: the bandwidth of the communication network is estimated and an initial size of the congestion window is determined based on the estimated bandwidth. As an example, the size of the congestion window may be initialized when the application starts data transmission with the receiving end.
As an example, the step of estimating the bandwidth of the communication network may comprise: determining the total number of ACK messages fed back by a receiving end for the N sent data packets; and determining the bandwidth of the communication network according to the average value obtained by dividing the total number by N.
Specific processes in the congestion control method according to the exemplary embodiment of the present disclosure have been described in detail in the above-described embodiments of the training method of the related congestion control model, and will not be described in detail herein.
Fig. 4 shows a schematic diagram of a training method of a congestion control model and a congestion control method according to an exemplary embodiment of the present disclosure.
As shown in fig. 4, when training the congestion control model, at the start of a training round the transmitting end may send data to the receiving end under an initial congestion window size set according to the bandwidth, so as to increase the convergence speed; moreover, training data generated through interaction with the network environment can be used to train the DQN-based congestion control model via multi-objective reinforcement learning; in addition, a training-round interrupt algorithm can be adopted to solve the problem of false interruption of training rounds. By implementing the congestion control method with the trained congestion control model, an optimal congestion control strategy can be generated for any specified preference, thereby meeting the requirements of different types of applications.
According to the training method of the congestion control model, according to the exemplary embodiment of the disclosure, the performance of different network indexes can be improved according to different preference settings without resetting an objective function or a reward function, so that the transmission performance requirements of different types of applications can be met, and the training time and the training cost of the congestion control model are reduced;
In addition, in order to improve the convergence of the model and solve the problem that the convergence speed of the current congestion control model is relatively slow, according to the exemplary embodiment of the present disclosure, a method for dynamically initializing a congestion window is also provided, and different initialized congestion window sizes are set for different network bandwidths, so that the convergence speed of the congestion control model in different network environments can be improved;
in addition, in order to solve the problem of pseudo-interruption of training occurring when multi-objective reinforcement learning is applied to the congestion control problem, according to the exemplary embodiment of the present disclosure, a win-lose-tie interruption round algorithm is also provided to improve the training quality of the congestion control model, so that multi-objective reinforcement learning can be applied to the congestion control problem, and the training efficiency of the algorithm is improved;
in addition, the congestion control method according to the exemplary embodiment of the present disclosure has superior experimental performance. The method achieves a tradeoff between high throughput and low latency and accordingly exhibits excellent congestion control capability in different network environments of 12Mbps and 50 Mbps. For different cellular network scenarios, by setting different preferences, the congestion control method according to the exemplary embodiment of the present disclosure can meet transmission performance requirements of different types of applications, and an optimal trade-off is implemented between different performance indexes, that is, performance requirements of different types of applications in a dynamic network scenario can be met by setting different preferences.
Fig. 5 shows a block diagram of a training apparatus of a congestion control model according to an exemplary embodiment of the present disclosure.
As shown in fig. 5, the training apparatus 10 of the congestion control model according to the exemplary embodiment of the present disclosure includes: an environment initializing unit 101, a predicting unit 102, a congestion window setting unit 103, a loss function calculating unit 104, a training unit 105, and a round end determining unit 106.
Specifically, the environment initializing unit 101 is configured to initialize a communication network environment used by the current training round.
The prediction unit 102 is configured to input the preference of the present training round for network transmission performance, and the current first network state information into the congestion control model, resulting in a predicted action to be performed for adjusting the congestion window size.
The congestion window setting unit 103 is configured to perform a predicted action to reset the congestion window and control the transmitting end to transmit a data packet to the receiving end under the currently set congestion window.
The loss function calculation unit 104 is configured to calculate, when the sender receives the ACK message fed back by the receiver, a loss function of the congestion control model according to the action, the first network state information before the action is performed, the first network state information after the action is performed, and the preference.
The training unit 105 is configured to train the congestion control model by adjusting model parameters of the congestion control model according to the loss function.
The round-ending determining unit 106 is configured to determine whether to end the present training round, wherein when it is determined not to end the present training round, the predicting unit 102 inputs the preference of the present training round for network transmission performance and the current first network state information into the congestion control model, resulting in a predicted action to be performed for adjusting the congestion window size.
As an example, the first network state information may include at least one of: the size of the congestion window, delay, packet acknowledgement rate, and transmission rate; wherein the delay, the packet acknowledgement rate, and the transmission rate are determined based on the ACK message fed back by the receiving end.
As an example, the preference for network transmission performance may include a degree of preference for at least one of: throughput, packet loss rate, and latency.
As an example, the round-ending determination unit 106 may be configured to determine whether to end the present training round according to the change situation of the second network state information.
As an example, the round-trip end determination unit 106 may be configured to determine that the action is a winning action when the second network state information after performing the action satisfies a first preset condition; when the second network state information after executing the action meets a second preset condition, determining the action as a failed action; when the continuous times of the winning actions reach the first preset times, determining to end the training round; when the continuous times of failed actions reach a second preset times, determining to end the training round; and when the total number of times of executing the action reaches a third preset number of times, determining to end the training round.
As an example, the second network state information may include throughput and delay; the first preset condition is: the throughput is 90%-110% of the bandwidth and the delay is ≤ 0.7 × the timeout threshold; the second preset condition is: the throughput is 50%-70% of the bandwidth and the delay is ≥ 0.7 × the timeout threshold.
As an example, the training device 10 of the congestion control model may further include: a window initialization unit (not shown) configured to initialize the size of the congestion window; wherein the window initialization unit is configured to estimate a bandwidth of the communication network and to determine an initial size of the congestion window based on the estimated bandwidth.
As an example, the window initialization unit may be configured to determine the total number of ACK messages fed back by the receiving end for the N data packets transmitted by the transmitting end; and determining the bandwidth of the communication network according to the average value obtained by dividing the total number by N.
As an example, the loss function calculation unit 104 may be configured to calculate, when the transmitting end receives the ACK message fed back by the receiving end, a loss function of the congestion control model according to the action, the first network state information before the action is performed, the first network state information after the action is performed, the bonus function of the action, and the preference.
As an example, the reward function for the action may be calculated based on the preference and third network state information after performing the action; wherein the third network state information comprises at least one of: packet loss rate, throughput, and delay.
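One common way to realize such a preference-dependent reward is a linear scalarization of the post-action metrics. The weights and sign conventions below are illustrative assumptions; the patent does not fix the functional form:

```python
def preference_reward(throughput, loss_rate, delay, preference):
    """Reward for an action: weighted throughput minus weighted penalties for
    packet loss rate and delay, with weights given by the round's preference."""
    w_tp, w_loss, w_delay = preference
    return w_tp * throughput - w_loss * loss_rate - w_delay * delay
```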
As an example, the congestion control model may be constructed based on a reinforcement learning algorithm; wherein the value function in the reinforcement learning algorithm is a value function related to an action, first network state information, and a preference for network transmission performance.
As an example, the action predicted by the congestion control model is, with probability ε, an action randomly selected from the action set, and, with probability 1−ε, the optimal action obtained using the value function.
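This is the standard ε-greedy exploration scheme, sketched below; the candidate action names are placeholders:

```python
import random

def select_action(q_values, actions, epsilon, rng=random):
    """With probability epsilon, explore a random action from the action set;
    otherwise exploit the action with the highest value estimate."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: q_values[a])
```

Here `q_values` maps each candidate action to its estimated value under the current state and preference.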
As an example, the loss function of the congestion control model may be calculated based on: a loss function L_S(θ) whose purpose is to make the value function approach the maximum reward function, and an auxiliary loss function L_T(θ).
As an example, the loss function of the congestion control model may be expressed as: (1−ε)·L_S(θ) + ε·L_T(θ), wherein ε is a trade-off index with 0 < ε ≤ 1; the later an action is predicted within a training round, the larger the value of ε used when calculating the loss function of the congestion control model for that action.
As an example, the loss function L_S(θ) can be expressed as:

L_S(θ) = E_{s,a,ω} [ ‖TQ(s,a,ω) − Q(s,a,ω;θ)‖² ]

and/or the auxiliary loss function L_T(θ) is expressed as:

L_T(θ) = E_{s,a,ω} [ |ω^T·TQ(s,a,ω) − ω^T·Q(s,a,ω;θ)| ]

wherein,

TQ(s,a,ω) = r + γ·Q(s′,a′,ω′;θ_k), with (a′,ω′) = argmax over a′∈A, ω′∈Ω of ω^T·Q(s′,a′,ω′;θ_k),

r represents the reward function, γ represents a weight coefficient, θ represents the model parameters, θ_k represents the parameters of the k-th step model, Q() represents the value function, s′ represents the first network state information after performing the action, s represents the first network state information before performing the action, a represents the action, ω represents the preference of the present training round for network transmission performance, and ω′ represents the optimal preference under the current network environment.
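Numerically, the main loss, the auxiliary loss, and their ε-weighted combination can be sketched as follows. The batch/array layout, the NumPy realization, and the exact squared/absolute forms are assumptions made for illustration, with vector-valued targets TQ, estimates Q, and preference vectors ω:

```python
import numpy as np

def ls_loss(tq_targets, q_estimates):
    """Main loss L_S: squared distance pushing Q(s, a, w) toward the target TQ."""
    return float(np.mean(np.sum((tq_targets - q_estimates) ** 2, axis=-1)))

def lt_loss(tq_targets, q_estimates, prefs):
    """Auxiliary loss L_T: error on the preference-projected (scalarized) values."""
    return float(np.mean(np.abs(np.sum(prefs * (tq_targets - q_estimates), axis=-1))))

def combined_loss(tq_targets, q_estimates, prefs, epsilon):
    """(1 - epsilon) * L_S + epsilon * L_T with trade-off index epsilon."""
    return (1 - epsilon) * ls_loss(tq_targets, q_estimates) \
        + epsilon * lt_loss(tq_targets, q_estimates, prefs)
```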
As an example, the objective function of the congestion control model may be a composite objective function with respect to: the reward function, the value function, the first network state information after performing the action, the first network state information before performing the action, the preference of the present training round for network transmission performance, and the best preference under the current network environment.
As an example, the objective function of the congestion control model may be the composite objective function TQ(s,a,ω), expressed as:

TQ(s,a,ω) = r(s,a,ω) + γ·Q(s′,a′,ω′;θ_k)

wherein,

(a′,ω′) = argmax over a′∈A, ω′∈Ω of ω^T·Q(s′,a′,ω′;θ_k),

r() represents the reward function, γ represents a weight coefficient, Q() represents the value function, θ_k represents the parameters of the k-th step model, s′ represents the first network state information after performing the action, s represents the first network state information before performing the action, a represents the action, ω represents the preference of the present training round for network transmission performance, ω′ represents the optimal preference under the current network environment, A represents the action set, and Ω represents the preference set.
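One step of such a composite backup can be sketched as below, under the assumptions that Q is vector-valued and that the maximization scans a discrete action set and a sampled set of candidate preferences; the tensor layout is an illustrative choice, not specified by the patent:

```python
import numpy as np

def composite_target(reward_vec, gamma, q_next, omega):
    """Composite target: r + gamma * Q(s', a', w'), where (a', w') jointly
    maximize the projection omega^T Q(s', a, w') over the action set and the
    preference set.

    q_next: array of shape (n_actions, n_prefs, reward_dim) holding the
    k-th step model's Q(s', a, w') for every candidate (action, preference) pair.
    """
    # scalarize every candidate with the current round's preference omega
    scores = np.tensordot(q_next, omega, axes=([-1], [0]))  # (n_actions, n_prefs)
    a_idx, w_idx = np.unravel_index(np.argmax(scores), scores.shape)
    return reward_vec + gamma * q_next[a_idx, w_idx]
```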
As an example, the training apparatus 10 of the congestion control model according to the exemplary embodiment of the present disclosure may further include: and a training end determining unit (not shown) configured to determine whether to end the training process of the congestion control model when it is determined to end the present training round, wherein the environment initializing unit 101 initializes the communication network environment used by the current training round to enter the next training round when it is determined not to end the training process of the congestion control model.
Fig. 6 shows a block diagram of a congestion control apparatus according to an exemplary embodiment of the present disclosure.
As shown in fig. 6, the congestion control apparatus 20 according to the exemplary embodiment of the present disclosure includes: an acquisition unit 201, a prediction unit 202, and a congestion window setting unit 203.
Specifically, the acquisition unit 201 is configured to acquire current first network state information and a preference of a current application for network transmission performance.
The prediction unit 202 is configured to input the acquired first network state information and the preferences to a congestion control model resulting in a predicted action to be performed for adjusting the congestion window size.
The congestion window setting unit 203 is configured to perform a predicted action to reset the congestion window.
As an example, the first network state information may include at least one of: the size of the congestion window, delay, packet acknowledgement rate, and transmission rate; wherein the delay, the packet acknowledgement rate, and the transmission rate are determined based on the ACK message fed back by the receiving end.
As an example, the preference for network transmission performance may include a degree of preference for at least one of: throughput, packet loss rate, and latency.
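Put together, one inference step of the congestion control apparatus might look like the following sketch; the multiplicative action set and the callable model interface are illustrative assumptions:

```python
def control_step(model, net_state, preference, cwnd,
                 actions=(0.5, 0.9, 1.0, 1.25, 2.0)):
    """One control step: query the model with the current first network state
    information and the application's preference, then apply the best-scoring
    multiplicative adjustment to the congestion window."""
    q = model(net_state, preference)                 # one value per candidate action
    best = max(range(len(actions)), key=lambda i: q[i])
    return max(1, int(cwnd * actions[best]))         # new window, at least 1 segment
```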
As an example, the congestion control apparatus 20 may further include: a window initialization unit (not shown) configured to initialize the size of the congestion window; wherein the window initialization unit is configured to estimate a bandwidth of the communication network and to determine an initial size of the congestion window based on the estimated bandwidth.
As an example, the window initialization unit may be configured to determine the total number of ACK messages fed back by the receiving end for the N data packets transmitted; and determining the bandwidth of the communication network according to the average value obtained by dividing the total number by N.
As an example, the congestion control model may be constructed based on a reinforcement learning algorithm; wherein the value function in the reinforcement learning algorithm is a value function related to an action, first network state information, and a preference for network transmission performance.
As an example, the congestion control model may be trained using the training apparatus 10 of the above-described exemplary embodiment.
The specific manner in which the respective units perform the operations in the apparatus of the above embodiments has been described in detail in relation to the embodiments of the method, and will not be described in detail here.
Further, it should be understood that the respective units in the training apparatus 10 and the congestion control apparatus 20 of the congestion control model according to the exemplary embodiments of the present disclosure may be implemented as hardware components and/or software components. The individual units may be implemented, for example, using a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC), depending on the processing performed by the individual units as defined.
Fig. 7 shows a block diagram of an electronic device according to an exemplary embodiment of the present disclosure. Referring to fig. 7, the electronic device 30 includes: at least one memory 301 and at least one processor 302, the at least one memory 301 having stored therein a set of computer executable instructions that, when executed by the at least one processor 302, perform the training method and/or the congestion control method of the congestion control model as described in the above exemplary embodiments.
By way of example, the electronic device 30 may be a PC, a tablet device, a personal digital assistant, a smart phone, or any other device capable of executing the above-described set of instructions. Here, the electronic device 30 is not necessarily a single electronic device, but may be any apparatus or collection of circuits capable of executing the above-described instructions (or instruction set), individually or jointly. The electronic device 30 may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In electronic device 30, processor 302 may include a Central Processing Unit (CPU), a Graphics Processor (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processor 302 may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
The processor 302 may execute instructions or code stored in the memory 301, wherein the memory 301 may also store data. The instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The memory 301 may be integrated with the processor 302, for example, RAM or flash memory disposed within an integrated circuit microprocessor or the like. In addition, the memory 301 may include a stand-alone device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The memory 301 and the processor 302 may be operatively coupled or may communicate with each other, for example, through an I/O port, network connection, etc., such that the processor 302 is able to read files stored in the memory.
In addition, the electronic device 30 may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the electronic device 30 may be connected to each other via a bus and/or a network.
According to an exemplary embodiment of the present disclosure, a computer-readable storage medium storing instructions may also be provided, wherein the instructions, when executed by at least one processor, cause the at least one processor to perform the training method of the congestion control model and/or the congestion control method as described in the above exemplary embodiments. Examples of the computer-readable storage medium here include: read-only memory (ROM), programmable read-only memory (PROM), electrically erasable programmable read-only memory (EEPROM), random-access memory (RAM), dynamic random-access memory (DRAM), static random-access memory (SRAM), flash memory, non-volatile memory, CD-ROM, CD-R, CD+R, CD-RW, CD+RW, DVD-ROM, DVD-R, DVD+R, DVD-RW, DVD+RW, DVD-RAM, BD-ROM, BD-R, BD-R LTH, BD-RE, Blu-ray or optical disk storage, hard disk drives (HDD), solid-state drives (SSD), card-type memory (such as a multimedia card, a Secure Digital (SD) card, or an eXtreme Digital (xD) card), magnetic tape, floppy disks, magneto-optical data storage devices, hard disks, solid-state disks, and any other device configured to store a computer program and any associated data, data files, and data structures in a non-transitory manner and to provide the computer program and any associated data, data files, and data structures to a processor or computer so that the processor or computer can execute the program. The computer program in the above computer-readable storage medium can run in an environment deployed in computer equipment such as a client, a host, a proxy device, or a server; further, in one example, the computer program and any associated data, data files, and data structures are distributed across networked computer systems such that the computer program and any associated data, data files, and data structures are stored, accessed, and executed in a distributed fashion by one or more processors or computers.
According to an exemplary embodiment of the present disclosure, a computer program product may also be provided, the instructions in which are executable by at least one processor to perform the training method and/or the congestion control method of the congestion control model as described in the above exemplary embodiments.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following the general principles thereof and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (42)

1. A method for training a congestion control model, comprising:
Initializing a communication network environment used by a current training round;
selecting one sample from the preference set as the preference of the training round on the network transmission performance;
inputting the preference of the training round to the network transmission performance and the current first network state information into a congestion control model to obtain a predicted action which needs to be executed and is used for adjusting the size of a congestion window;
executing the predicted action to reset the congestion window, and controlling the transmitting end to transmit the data packet to the receiving end under the currently set congestion window;
when the sending end receives the ACK message fed back by the receiving end, calculating a loss function of the congestion control model according to the action, the first network state information before the action is executed, the first network state information after the action is executed, the reward function of the action and the preference, wherein the reward function of the action is calculated based on the preference of the training round on the network transmission performance and the third network state information after the action is executed;
training the congestion control model by adjusting model parameters of the congestion control model according to the loss function, and determining whether to end the training round, wherein when the training round is determined not to be ended, the step of inputting the preference of the training round for network transmission performance and the current first network state information into the congestion control model is performed in a returning mode, so that the predicted action required to be performed for adjusting the congestion window size is obtained;
When the training loop is determined to be ended, determining whether to end the training process of the congestion control model;
when it is determined that the training process of the congestion control model is not finished, the steps of initializing the communication network environment used by the current training round and selecting one from the set of preferences as a preference for network transmission performance by the current training round are performed back to enter the next training round,
wherein the preference of different training rounds for network transmission performance is different.
2. The method of claim 1, wherein the first network status information comprises at least one of: the size of the congestion window, delay, packet acknowledgement rate, and transmission rate;
wherein the delay, the packet acknowledgement rate, and the transmission rate are determined based on the ACK message fed back by the receiving end.
3. The method of claim 1, wherein the preference for network transmission performance comprises a preference level for at least one of: throughput, packet loss rate, and latency.
4. The method of claim 1, wherein the step of determining whether to end the present training round comprises:
And determining whether to end the training round according to the change condition of the second network state information.
5. The method of claim 4 wherein the step of determining whether to end the present training round based on the change in the second network status information comprises:
when the second network state information after executing the action meets a first preset condition, determining the action as a winning action;
when the second network state information after executing the action meets a second preset condition, determining the action as a failed action;
when the continuous times of the winning actions reach the first preset times, determining to end the training round;
when the continuous times of failed actions reach a second preset times, determining to end the training round;
and when the total number of times of executing the action reaches a third preset number of times, determining to end the training round.
6. The method of claim 5, wherein the second network status information comprises: throughput and delay;
the first preset condition is: throughput is 90%-110% of the bandwidth, and delay ≤ 0.7 × the timeout threshold;
the second preset condition is: throughput is 50%-70% of the bandwidth, and delay ≥ 0.7 × the timeout threshold.
7. The method according to claim 1, wherein the method further comprises: initializing the size of a congestion window;
wherein the step of initializing the size of the congestion window comprises: and estimating the bandwidth of the communication network, and determining the initial size of the congestion window based on the estimated bandwidth.
8. The method of claim 7, wherein the step of estimating the bandwidth of the communication network comprises:
determining the total number of ACK messages fed back by a receiving end aiming at N data packets sent by a sending end;
and determining the bandwidth of the communication network according to the average value obtained by dividing the total number by N.
9. The method of claim 1, wherein the third network status information comprises at least one of: packet loss rate, throughput, and delay.
10. The method of claim 1, wherein the congestion control model is constructed based on a reinforcement learning algorithm;
wherein the value function in the reinforcement learning algorithm is a value function related to an action, first network state information, and a preference for network transmission performance.
11. The method of claim 10, wherein the action predicted by the congestion control model is, with probability ε, an action randomly selected from the action set, and, with probability 1−ε, the optimal action obtained using the value function.
12. The method of claim 10, wherein the loss function of the congestion control model is calculated based on: a loss function L_S(θ) whose purpose is to make the value function approach the maximum reward function, and an auxiliary loss function L_T(θ).
13. The method of claim 12, wherein the loss function of the congestion control model is expressed as: (1−ε)·L_S(θ) + ε·L_T(θ);
wherein ε is a trade-off index with 0 < ε ≤ 1; the later an action is predicted within a training round, the larger the value of ε used when calculating the loss function of the congestion control model for that action.
14. The method of claim 10, wherein the objective function of the congestion control model is a composite objective function with respect to: the reward function, the value function, the first network state information after performing the action, the first network state information before performing the action, the preference of the present training round for network transmission performance, and the best preference under the current network environment.
15. A congestion control method, comprising:
acquiring current first network state information and preference of current application to network transmission performance;
Inputting the acquired first network state information and the preferences into a congestion control model to obtain a predicted action to be executed for adjusting the size of a congestion window;
the predicted actions are performed to reset the congestion window,
wherein the congestion control model is adapted to different types of applications, different types of applications have different preferences for network transport performance,
wherein the congestion control model is trained using the training method of any of claims 1 to 14.
16. The method of claim 15, wherein the first network status information comprises at least one of: the size of the congestion window, delay, packet acknowledgement rate, and transmission rate;
wherein the delay, the packet acknowledgement rate, and the transmission rate are determined based on the ACK message fed back by the receiving end.
17. The method of claim 15, wherein the preference for network transmission performance includes a degree of preference for at least one of: throughput, packet loss rate, and latency.
18. The method of claim 15, wherein the method further comprises: initializing the size of a congestion window;
Wherein the step of initializing the size of the congestion window comprises: the bandwidth of the communication network is estimated and an initial size of the congestion window is determined based on the estimated bandwidth.
19. The method of claim 18, wherein the step of estimating the bandwidth of the communication network comprises:
determining the total number of ACK messages fed back by a receiving end for the N sent data packets;
and determining the bandwidth of the communication network according to the average value obtained by dividing the total number by N.
20. The method of claim 15, wherein the congestion control model is constructed based on a reinforcement learning algorithm;
wherein the value function in the reinforcement learning algorithm is a value function related to an action, first network state information, and a preference for network transmission performance.
21. A training apparatus for a congestion control model, comprising:
the environment initialization unit is configured to initialize a communication network environment used by the current training round, and select one sample from the preference set as the preference of the current training round on the network transmission performance;
the prediction unit is configured to input the preference of the training round to the network transmission performance and the current first network state information into the congestion control model to obtain a predicted action which needs to be executed and is used for adjusting the congestion window size;
A congestion window setting unit configured to perform a predicted action to reset a congestion window and control a transmitting end to transmit a data packet to a receiving end under the currently set congestion window;
a loss function calculation unit configured to calculate, when the sending end receives the ACK message fed back by the receiving end, a loss function of the congestion control model according to the action, the first network state information before the action is performed, the first network state information after the action is performed, the reward function of the action, and the preference, wherein the reward function of the action is calculated based on the preference of the training round for network transmission performance and the third network state information after the action is performed;
a training unit configured to train the congestion control model by adjusting model parameters of the congestion control model according to the loss function;
a round-ending determining unit configured to determine whether to end the present training round, wherein when it is determined that the present training round is not ended, the predicting unit inputs the preference of the present training round for the network transmission performance and the current first network state information into the congestion control model, to obtain a predicted action to be performed for adjusting the congestion window size;
A training end determination unit configured to determine whether to end the training process of the congestion control model when it is determined to end the present training round, wherein, when it is determined not to end the training process of the congestion control model, the environment initialization unit initializes the communication network environment used by the present training round, and selects one sample from the preference set as a preference of the present training round for network transmission performance to enter the next training round,
wherein the preference of different training rounds for network transmission performance is different.
22. The apparatus of claim 21, wherein the first network status information comprises at least one of: the size of the congestion window, delay, packet acknowledgement rate, and transmission rate;
wherein the delay, the packet acknowledgement rate, and the transmission rate are determined based on the ACK message fed back by the receiving end.
23. The apparatus of claim 21, wherein the preference for network transmission performance comprises a preference level for at least one of: throughput, packet loss rate, and latency.
24. The device according to claim 21, wherein the round-ending determining unit is configured to determine whether to end the present training round based on a change in the second network status information.
25. The apparatus according to claim 24, wherein the round-ending determining unit is configured to: determine that the action is a winning action when the second network state information after performing the action satisfies a first preset condition; determine that the action is a failed action when the second network state information after performing the action satisfies a second preset condition; determine to end the present training round when the number of consecutive winning actions reaches a first preset number; determine to end the present training round when the number of consecutive failed actions reaches a second preset number; and determine to end the present training round when the total number of executed actions reaches a third preset number.
26. The apparatus of claim 25, wherein the second network status information comprises: throughput and delay;
the first preset condition is: throughput is 90%-110% of the bandwidth, and delay ≤ 0.7 × the timeout threshold;
the second preset condition is: throughput is 50%-70% of the bandwidth, and delay ≥ 0.7 × the timeout threshold.
27. The apparatus of claim 21, wherein the apparatus further comprises:
a window initialization unit configured to initialize a size of a congestion window;
Wherein the window initialization unit is configured to estimate a bandwidth of the communication network and to determine an initial size of the congestion window based on the estimated bandwidth.
28. The apparatus according to claim 27, wherein the window initialization unit is configured to determine a total number of ACK messages fed back by the receiving end for the N data packets transmitted by the transmitting end; and determining the bandwidth of the communication network according to the average value obtained by dividing the total number by N.
29. The apparatus of claim 21, wherein the third network status information comprises at least one of: packet loss rate, throughput, and delay.
30. The apparatus of claim 21, wherein the congestion control model is constructed based on a reinforcement learning algorithm;
wherein the value function in the reinforcement learning algorithm is a value function related to an action, first network state information, and a preference for network transmission performance.
31. The apparatus of claim 30, wherein the action predicted by the congestion control model is, with probability ε, an action randomly selected from the action set, and, with probability 1−ε, the optimal action obtained using the value function.
32. The apparatus of claim 30, wherein the loss function of the congestion control model is calculated based on: a loss function L_S(θ) whose purpose is to make the value function approach the maximum reward function, and an auxiliary loss function L_T(θ).
33. The apparatus of claim 32, wherein the loss function of the congestion control model is expressed as: (1−ε)·L_S(θ) + ε·L_T(θ);
wherein ε is a trade-off index with 0 < ε ≤ 1; the later an action is predicted within a training round, the larger the value of ε used when calculating the loss function of the congestion control model for that action.
34. The apparatus of claim 30, wherein the objective function of the congestion control model is a composite objective function with respect to: the reward function, the value function, the first network state information after performing the action, the first network state information before performing the action, the preference of the present training round for network transmission performance, and the best preference under the current network environment.
35. A congestion control apparatus, characterized by comprising:
an acquisition unit configured to acquire current first network state information and a preference of a current application for network transmission performance;
A prediction unit configured to input the acquired first network state information and the preference into a congestion control model, and obtain a predicted action to be performed for adjusting the congestion window size;
a congestion window setting unit configured to perform a predicted action to reset the congestion window,
wherein the congestion control model is adapted to different types of applications, different types of applications have different preferences for network transport performance,
wherein the congestion control model is trained using the training apparatus of any of claims 21 to 34.
36. The apparatus of claim 35, wherein the first network status information comprises at least one of: the size of the congestion window, delay, packet acknowledgement rate, and transmission rate;
wherein the delay, the packet acknowledgement rate, and the transmission rate are determined based on the ACK message fed back by the receiving end.
37. The apparatus of claim 35, wherein the preference for network transmission performance comprises a preference level for at least one of: throughput, packet loss rate, and latency.
38. The apparatus of claim 35, wherein the apparatus further comprises:
A window initialization unit configured to initialize a size of a congestion window;
wherein the window initialization unit is configured to estimate a bandwidth of the communication network and to determine an initial size of the congestion window based on the estimated bandwidth.
39. The device according to claim 38, wherein the window initialization unit is configured to determine a total number of ACK messages fed back by the receiving end for the N transmitted data packets; and determining the bandwidth of the communication network according to the average value obtained by dividing the total number by N.
40. The apparatus of claim 35, wherein the congestion control model is constructed based on a reinforcement learning algorithm;
wherein the value function in the reinforcement learning algorithm is a function of the action, the first network state information, and the preference for network transmission performance.
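Claim 40's value function takes the action, the network state, and the preference jointly, i.e. Q(s, a, w) rather than the usual Q(s, a). A dependency-light stand-in using a single random linear layer illustrates that interface; the actual trained model and its dimensions are not specified here, so all sizes and names below are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

STATE_DIM, ACTION_DIM, PREF_DIM = 4, 1, 3   # (cwnd, delay, ack rate, send rate), action, (tput, loss, latency)
theta = rng.normal(size=STATE_DIM + ACTION_DIM + PREF_DIM)  # stand-in parameters

def q_value(state: np.ndarray, action: float, pref: np.ndarray) -> float:
    """Linear stand-in for the action/state/preference value function Q(s, a, w)."""
    x = np.concatenate([state, [action], pref])
    return float(theta @ x)

def best_action(state: np.ndarray, pref: np.ndarray,
                candidates=(-1.0, 0.0, 1.0)) -> float:
    """Greedy selection over a small set of candidate cwnd adjustments."""
    return max(candidates, key=lambda a: q_value(state, a, pref))
```

Because the preference is an input rather than baked into the reward weights at training time, one model can serve applications with different preferences, which is how claim 35's "adapted to different types of applications" can be realized.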
41. An electronic device, comprising:
at least one processor;
at least one memory storing computer-executable instructions,
wherein the computer-executable instructions, when executed by the at least one processor, cause the at least one processor to perform the training method of the congestion control model according to any one of claims 1 to 14 and/or the congestion control method according to any one of claims 15 to 20.
42. A computer-readable storage medium, characterized in that instructions in the computer-readable storage medium, when executed by at least one processor, cause the at least one processor to perform the training method of the congestion control model according to any one of claims 1 to 14 and/or the congestion control method according to any one of claims 15 to 20.
CN202110592772.9A 2021-05-28 2021-05-28 Training method and equipment of congestion control model and congestion control method and equipment Active CN113315716B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110592772.9A CN113315716B (en) 2021-05-28 2021-05-28 Training method and equipment of congestion control model and congestion control method and equipment

Publications (2)

Publication Number Publication Date
CN113315716A CN113315716A (en) 2021-08-27
CN113315716B true CN113315716B (en) 2023-05-02

Family

ID=77375939

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110592772.9A Active CN113315716B (en) 2021-05-28 2021-05-28 Training method and equipment of congestion control model and congestion control method and equipment

Country Status (1)

Country Link
CN (1) CN113315716B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113825171B (en) * 2021-09-30 2023-07-28 新华三技术有限公司 Network congestion control method, device, equipment and medium
CN114389959B (en) * 2021-12-30 2023-10-27 深圳清华大学研究院 Network congestion control method, device, electronic equipment and storage medium
CN114500383B (en) * 2022-01-25 2024-01-30 苏州全时空信息技术有限公司 Intelligent congestion control method, system and medium for space-earth integrated information network
CN114745337B (en) * 2022-03-03 2023-11-28 武汉大学 Real-time congestion control method based on deep reinforcement learning
CN114726799B (en) * 2022-04-28 2024-03-05 清华大学 Training method of congestion control agent, congestion control method and device
CN116055406B (en) * 2023-01-10 2024-05-03 中国联合网络通信集团有限公司 Training method and device for congestion window prediction model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019192361A1 (en) * 2018-04-06 2019-10-10 Huawei Technologies Co., Ltd. Congestion control in network communications
CN110581808A (en) * 2019-08-22 2019-12-17 武汉大学 Congestion control method and system based on deep reinforcement learning
CN111092823A (en) * 2019-12-25 2020-05-01 深圳大学 Method and system for adaptively adjusting congestion control initial window
CN111818570A (en) * 2020-07-25 2020-10-23 清华大学 Intelligent congestion control method and system for real network environment
CN112770353A (en) * 2020-12-30 2021-05-07 武汉大学 Method and device for training congestion control model and method and device for congestion control


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design of a simulation experiment for TCP congestion control algorithms; Teng Yanping et al.; Experimental Technology and Management (实验技术与管理); 2019-04-30; full text *

Also Published As

Publication number Publication date
CN113315716A (en) 2021-08-27

Similar Documents

Publication Publication Date Title
CN113315716B (en) Training method and equipment of congestion control model and congestion control method and equipment
CN111919423B (en) Congestion control in network communications
CN112770353B (en) Method and device for training congestion control model and method and device for controlling congestion
US7047309B2 (en) Load balancing and dynamic control of multiple data streams in a network
CN106789718B (en) Data transmission congestion control method, equipment, server and programmable equipment
CN110266551A Bandwidth prediction method, apparatus, device and storage medium
JP4984169B2 (en) Load distribution program, load distribution method, load distribution apparatus, and system including the same
US20160182347A1 (en) Detection of end-to-end transport quality
US11042410B2 (en) Resource management of resource-controlled system
WO2021103706A1 (en) Data packet sending control method, model training method, device, and system
US9326161B2 (en) Application-driven control of wireless networking settings
JP6147866B2 (en) Method and client device for receiving HTTP adaptive streaming video
EP4012563A1 (en) Profiling and application monitoring for edge devices based on headroom
CN107135411A Method and electronic device for adjusting video bitrate
US8724693B2 (en) Mechanism for automatic network data compression on a network connection
CN113132490A (en) MQTT protocol QoS mechanism selection scheme based on reinforcement learning
CN114726799B (en) Training method of congestion control agent, congestion control method and device
CN113825171A (en) Network congestion control method, device, equipment and medium
CN109379747B (en) Wireless network multi-controller deployment and resource allocation method and device
CN109815204A Congestion-aware metadata request distribution method and device
US11797189B1 (en) Storage system IO throttling utilizing a reinforcement learning framework
CN114866489A (en) Congestion control method and device and training method and device of congestion control model
CN110290556B (en) Resource load balancing scheduling method based on optimal control variational method
CN114125745A (en) MQTT protocol power control and QoS mechanism selection method
CN114339858B (en) Terminal packet sending parameter adjusting method and device and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant