CN112822718B - Packet transmission method and system based on reinforcement learning and stream coding driving - Google Patents


Info

Publication number
CN112822718B
CN112822718B (application CN202011620034.2A)
Authority
CN
China
Prior art keywords
packet
sending
action
packets
value
Prior art date
Legal status
Active
Application number
CN202011620034.2A
Other languages
Chinese (zh)
Other versions
CN112822718A (en)
Inventor
张非凡
李业
Current Assignee
Nantong University
Original Assignee
Nantong University
Priority date
Filing date
Publication date
Application filed by Nantong University filed Critical Nantong University
Priority to CN202011620034.2A priority Critical patent/CN112822718B/en
Publication of CN112822718A publication Critical patent/CN112822718A/en
Application granted granted Critical
Publication of CN112822718B publication Critical patent/CN112822718B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 28/00 Network traffic management; Network resource management
    • H04W 28/02 Traffic management, e.g. flow control or congestion control
    • H04W 28/0289 Congestion control
    • H04W 28/06 Optimizing the usage of the radio link, e.g. header compression, information sizing, discarding information
    • H04W 72/00 Local resource management
    • H04W 72/12 Wireless traffic scheduling

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a packet transmission method and system based on reinforcement learning and stream coding driving. The packet transmission method specifically comprises the following steps: first, the relevant stream coding parameters are initialized; the sending end then estimates the congestion state of the network and the in-order packet reception progress of the receiving end from the feedback of the receiving end, and uses this series of states as feature vectors for real-time learning of a model; the current action is then selected according to a reward function; finally, online training of the sending end's sending actions is realized during the packet sending process. The packet transmission system comprises a sending end, a receiving end, a state space unit, a reward function unit, a value fitting unit and an action selection unit. The invention dynamically adjusts the packet sending interval and intelligently selects the type of packet to send according to the current network condition and packet loss rate, realizes joint optimization of stream coding rate control and congestion control, improves network throughput, reduces data transmission delay, and can adapt to changing link conditions.

Description

Packet transmission method and system based on reinforcement learning and stream coding driving
Technical Field
The invention belongs to the technical field of wireless communication, and particularly relates to a packet transmission method and system based on reinforcement learning and stream coding driving, in particular to such a method and system oriented to wireless links with a large delay-bandwidth product.
Background
The wireless long fat link, i.e., the wireless link with a large delay-bandwidth product, is an important component of the future air-space-ground integrated network. At present, TCP (Transmission Control Protocol), on which such links conventionally rely, generally suffers from low bandwidth utilization. Most TCP variants treat data packet loss as a congestion signal and therefore reduce the transmission rate. In wireless links, however, data packet loss may be caused by random link errors rather than congestion, and this behavior can lead to unnecessary slowdowns. In many new air-space-ground integrated network scenarios, link-layer automatic repeat request (ARQ) cannot be used because of the large propagation delay, so packet loss caused by link errors inevitably occurs and the problem becomes particularly serious. Secondly, to avoid congestion, TCP increases its sending rate only gradually at the beginning of a transmission (so-called slow start). On a long fat link, where both bandwidth and propagation delay are large, it may take a long time to fill the link with data; especially for connections carrying small amounts of data, this causes a severe drop in link bandwidth utilization.
A number of TCP congestion control variants have been proposed in the art to address these problems; typical examples include TCP Westwood+ and Google's BBR. However, such rule-based congestion control schemes are not sufficient to cope with the highly heterogeneous and dynamic characteristics of the future air-space-ground integrated network, whose heterogeneous, large-scale wireless networks demand higher flexibility and more stringent throughput/delay guarantees. Recently, the Quick UDP Internet Connections (QUIC) protocol proposed by Google has been widely regarded as an alternative to TCP for packet transmission in future networks. QUIC is built entirely on UDP: it exploits the connectionless nature of UDP to reduce the three-way handshake delay of TCP connection establishment, exploits the out-of-order delivery of UDP to multiplex HTTP streams more efficiently, and the lightweight nature of UDP also gives great flexibility in deployment.
However, for UDP-based transport to provide a reliable, in-order application interface like TCP, congestion control and reliability mechanisms still have to be added. Current QUIC designs nevertheless mainly adopt the existing congestion control and retransmission mechanisms of TCP, so on long fat wireless links the original problems of TCP persist.
Disclosure of Invention
In view of the above, the present invention aims to provide a packet transmission method based on reinforcement learning and stream coding driving, so as to solve the problem of low bandwidth utilization of packet transmission over long fat wireless links in the existing TCP and QUIC technologies.
The invention provides a packet transmission method based on reinforcement learning and stream coding driving, which comprises the following steps:
s1, setting stream coding parameters;
s2, a sending end sends a packet, wherein the packet is an uncoded source packet or a coded repair packet;
s3, the receiving end decodes and recovers the received packets and orderly transmits the packets to an upper layer application, and simultaneously sends feedback information to the sending end, wherein the feedback information comprises decoding progress, the number and the type of the latest received packets, the number of the received source packets and the number of the received repair packets;
s4, the sending end processes the feedback information, determines system state information, calculates reward and punishment values according to a reward function, estimates available bandwidth of a link, determines interval time of sending actions of the sending end according to the available bandwidth of the link, and then conducts reinforcement learning;
the reinforcement learning is executed based on a reinforcement learning model, and the reinforcement learning method comprises the following steps:
s41, outputting a value function after weight updating and the value of each sending action according to the system state information and the reward and punishment values;
s42, selecting an optimal sending action according to the value of each sending action, wherein the optimal sending action is used as the sending action with the maximum value in the current state;
the system state information comprises the ratio of the current packet round-trip delay to the minimum packet round-trip delay, the ratio of the current sending packet action number to the total action number, and the ratio of the current sending source packet number to the total packet number; the sending action is one of sending source packet, sending repair packet and abandoning sending; the reward function is determined according to an optimization objective of packet transmission that maximizes its throughput for each user stream while minimizing latency;
s43, the sending end realizes sending action according to the optimal sending action selected in the step S42;
and S5, repeating the steps S3 and S4 to realize congestion control and stream coding rate control.
Further, the repair packet is a linear combination of previously transmitted source packets, as shown in the following equation:

c_k = Σ_{i = w_s}^{i_seq} g_{k,i} · s_i

where c_k denotes the repair packet numbered k, k = 0, 1, 2, 3, …; g_{k,i} are stream coding coefficients selected from a finite field F_q; w_s is the number of the oldest source packet in the current transmit queue, with initial value 0, and the value of w_s is continuously updated according to the feedback information; i_seq denotes the number of the most recently transmitted source packet.
Further, the reward function R(s, a) is defined piecewise: it takes a positive value when the utility function U_n increases after an action; a negative value, whose magnitude decreases as gp/inp approaches 1, when U_n decreases and RTT_ratio ≥ τ; and zero otherwise; wherein:
R(s, a) denotes the reward/punishment value when the system state information is s and the sending action is a; gp is the goodput, i.e., the number of in-order source packets received by the receiving end divided by the elapsed time; inp is the number of all packets sent by the sending end divided by the elapsed time; U_n is the utility function, U_n = log(gp) - δ·log(RTT), where RTT is a smoothed estimate of the round-trip delay; RTT_ratio is the ratio of the currently smoothed-estimated RTT to the minimum RTT; τ is a preset hyper-parameter.
Further, the value function is obtained by the following steps:
mapping the system state information, by means of tile coding, into a feature vector containing only the discrete values 0 and 1, and fitting a linear function of this feature vector in combination with the reward/punishment values to obtain the value function.
Further, selecting the optimal sending action according to the value of each sending action specifically comprises: selecting the optimal sending action using an ε-greedy strategy.
The invention also provides a packet transmission system based on reinforcement learning and stream coding driving, which comprises:
a transmitting end, configured to transmit a packet, where the packet is an uncoded source packet or a coded repair packet;
the receiving end, configured to decode and recover the received packets and deliver them in order to the upper-layer application, while sending feedback information to the sending end, wherein the feedback information comprises the decoding progress, the number and type of the most recently received packet, the number of received source packets, and the number of received repair packets;
the state space unit is arranged at the sending end and used for processing the feedback information and determining system state information; the system state information comprises the ratio of the current packet round-trip delay to the minimum packet round-trip delay, the ratio of the current sending packet action number to the total action number, and the ratio of the current sending source packet number to the total packet number;
a reward function unit, configured to calculate and output a reward/punishment value according to a reward function R(s, a), which is defined piecewise: it takes a positive value when the utility function U_n increases after an action; a negative value, whose magnitude decreases as gp/inp approaches 1, when U_n decreases and RTT_ratio ≥ τ; and zero otherwise; wherein:
R(s, a) denotes the reward/punishment value when the system state information is s and the sending action is a; gp is the goodput, i.e., the number of in-order source packets received by the receiving end divided by the elapsed time; inp is the number of all packets sent by the sending end divided by the elapsed time; U_n is the utility function, U_n = log(gp) - δ·log(RTT), where RTT is a smoothed estimate of the round-trip delay; RTT_ratio is the ratio of the currently smoothed-estimated RTT to the minimum RTT; τ is a preset hyper-parameter;
a value fitting unit, configured to map the system state information, using tile coding, into a feature vector containing only the discrete values 0 and 1, obtain a value function by fitting a linear function of this feature vector in combination with the reward/punishment values, and output the value of each sending action;
and an action selection unit, configured to select, according to the value of each sending action output by the value fitting unit, the sending action with the maximum value using an ε-greedy strategy, the selected action being performed by the sending end.
Compared with the prior art, the invention has the following beneficial effects:
1. On the one hand, the technical scheme of the invention adopts stream coding to realize packet loss recovery and provide a reliability mechanism for UDP, which yields higher throughput than retransmission schemes and smaller decoding delay than block codes. On the other hand, based on a reinforcement learning model, the invention learns online from the current network condition and packet loss rate, dynamically adjusts the packet sending interval, intelligently selects the type of packet to send, and realizes joint optimization of stream coding rate control (the proportion between the two actions of sending source packets and sending repair packets) and congestion control, improving network throughput, reducing data transmission delay, and adapting to changing link conditions.
2. No large amount of sample data is needed; online training of the self-learning model requires only information from the external environment (the current network congestion condition and the in-order packet reception progress of the receiving end), and relies very little on human experience or external data.
3. The sending end can learn and make decisions online according to the network condition, making packet sending more intelligent.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It should be noted that the drawings in the following description are only some embodiments of the invention, and that other drawings may be obtained by those skilled in the art without inventive exercise.
Fig. 1 is a block diagram of a packet transmission system according to the present invention.
Fig. 2 is a block diagram illustrating a structure of a reinforcement learning model in the packet transmission system according to the present invention.
Fig. 3 is a graph comparing throughput of the transmission method of the present invention with other methods.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it should be noted that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a packet transmission method based on reinforcement learning and stream coding driving, which specifically comprises the following steps:
s1, setting stream coding parameters;
the stream coding parameters are seeds of a pseudo-random number generator used to obtain the stream coding coefficients.
S2, a sending end sends a packet, wherein the packet is an uncoded source packet or a coded repair packet;
s3, the receiving end decodes and recovers the received packets and orderly transmits the packets to an upper layer application, and simultaneously sends feedback information to the sending end, wherein the feedback information comprises decoding progress, the number and the type of the latest received packets, the number of the received source packets and the number of the received repair packets;
the transmitting end may send two packets, one being an uncoded source packet and the other being a coded repair packet. Let iseqNumber indicating the most recently transmitted uncoded source packet, initialize iseq-1, i after each transmission of a source packetseqAnd adding 1. The repair packet is represented as
Figure GDA0003245036390000031
Which is a linear combination of source packets that have been previously transmitted. In the formula (1), ckDenotes a repair packet numbered k, gk,iIs from a finite field
Figure GDA0003245036390000032
Where k is 0, and 1,2 … is the number of the repair packet. w is asCorresponding to the number of the oldest (old) source packet in the current transmit queue. Initialization wsAt 0, the original packet acknowledged as received will be removed from the queue according to the feedback from the receiving end, at which time wsAn update will be made. Let we=iseq,[ws,we]Referred to as the coding window of the current repair packet.
The receiving end decodes and recovers the received packets and delivers them to the upper-layer application in order. Let i_ord denote the number of the latest in-order delivered packet, initialized to i_ord = -1; the decoder starts in an ordered state. If the next packet received by the decoder is neither the source packet numbered i_ord + 1 nor a repair packet whose coding window satisfies w_e ≤ i_ord + 1, in-order delivery is interrupted and the decoder enters an out-of-order state, in which it buffers the received packets and attempts decoding. The buffered packets are out-of-order source packets (with numbers greater than i_ord + 1) or repair packets (with w_e > i_ord + 1). Let w_e^max denote the largest coding-window upper bound among the buffered repair packets; [i_ord + 1, w_e^max] is called the decoder's current decoding window. As more packets are buffered, the window may expand (i.e., w_e^max grows). The decoder decodes by Gaussian elimination: it dynamically constructs a linear system of equations A·S = B and performs forward elimination online, where the rows of A and B are, respectively, the coding coefficients of the buffered packets (out-of-order source packets are treated as special repair packets whose coefficient vector has a single non-zero element equal to 1) and the coded information symbols. When decoding succeeds, all decoded source packets in the decoding window are delivered to the upper-layer application, the decoder returns to the ordered state, i_ord is updated to the top of the decoded window, and the process restarts.
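A simplified sketch of this receiver-side solving step is given below. It reuses gf_mul from the encoder sketch, adds a brute-force GF(2^8) inverse, assumes equal-length payloads and coefficient indices restricted to the decoding window, and uses a dense Gauss-Jordan solve; all of these are illustrative simplifications, not the patent's exact implementation.

```python
# Illustrative sketch only: attempt to solve A*S = B for the buffered packets in
# the decoding window [i_ord + 1, w_e_max]; returns recovered source payloads or
# None if the system is not yet solvable.
def gf_inv(a: int) -> int:
    """Multiplicative inverse in GF(2^8), brute force (adequate for a sketch)."""
    return next(x for x in range(1, 256) if gf_mul(a, x) == 1)

def try_decode(rows, i_ord, w_e_max):
    # rows: list of (coeffs, payload); coeffs maps source number -> coefficient.
    # An out-of-order source packet i is the special row ({i: 1}, payload).
    window = list(range(i_ord + 1, w_e_max + 1))
    n = len(window)
    if len(rows) < n:
        return None                                     # not enough packets yet
    A = [[c.get(i, 0) for i in window] for c, _ in rows]
    B = [bytearray(p) for _, p in rows]                 # equal payload lengths assumed
    for col in range(n):                                # Gauss-Jordan elimination
        pivot = next((r for r in range(col, len(A)) if A[r][col]), None)
        if pivot is None:
            return None                                 # rank deficient: keep buffering
        A[col], A[pivot] = A[pivot], A[col]
        B[col], B[pivot] = B[pivot], B[col]
        inv = gf_inv(A[col][col])
        A[col] = [gf_mul(inv, a) for a in A[col]]
        B[col] = bytearray(gf_mul(inv, b) for b in B[col])
        for r in range(len(A)):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [a ^ gf_mul(f, p) for a, p in zip(A[r], A[col])]
                B[r] = bytearray(b ^ gf_mul(f, p) for b, p in zip(B[r], B[col]))
    return {window[j]: bytes(B[j]) for j in range(n)}   # s_{i_ord+1} .. s_{w_e_max}
```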
And S4, the sending end processes the feedback information, determines system state information, calculates reward and punishment values according to a reward function, estimates the available bandwidth of the link, determines the interval time of sending actions of the sending end according to the available bandwidth of the link, and then executes a learning process based on a reinforcement learning model.
In the invention, the system state information is used to represent the network condition and specifically comprises the ratio of the current packet round-trip delay to the minimum packet round-trip delay, the ratio of the number of packet-sending actions to the total number of actions, and the ratio of the number of sent source packets to the total number of packets. The sending action is one of sending a source packet, sending a repair packet, and abandoning sending (backoff). The interval between sending actions is set to 2/3 of the packet size divided by the available link bandwidth.
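A small sketch of how the three state ratios and the interval between sending actions could be computed is shown below; the argument names and units (bytes for packet size, bits per second for bandwidth) are assumptions of this sketch.

```python
# Illustrative sketch only: the three-component state vector and the sending interval.
def build_state(rtt, rtt_min, send_actions, total_actions, source_sent, packets_sent):
    return (
        rtt / rtt_min,                          # current RTT / minimum RTT
        send_actions / max(total_actions, 1),   # packet-sending actions / all actions
        source_sent / max(packets_sent, 1),     # source packets / all packets sent
    )

def send_interval(packet_size_bytes, link_bandwidth_bps):
    # interval = (2/3 of the packet size) / available link bandwidth
    return (2.0 / 3.0) * packet_size_bytes * 8 / link_bandwidth_bps
```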
In the invention, the reward function is determined according to the optimization objective of packet transmission, which is set to maximize the throughput of each user stream while minimizing latency. Specifically, the embodiment of the present invention designs the reward function R(s, a) as a piecewise function: it is positive when the utility function increases after an action, negative when the utility function decreases and RTT_ratio ≥ τ, and zero otherwise.
R(s, a) denotes the reward/punishment value when the system state information is s and the sending action is a; gp is the goodput, i.e., the number of in-order source packets received by the receiving end divided by the time taken for the transmission so far; inp is the number of all packets sent by the sending end divided by the time taken for the transmission so far; U_n is the utility function, U_n = log(gp) - δ·log(RTT), where RTT is a smoothed estimate of the round-trip delay; RTT_ratio is the ratio of the currently smoothed-estimated RTT to the minimum RTT; τ is a preset hyper-parameter. In the embodiment of the present invention, τ is set to 1.2. The function emphasizes that each user stream should maximize its throughput while minimizing delay, and the log terms ensure that the network allocates bandwidth resources fairly when multiple users compete for the same bottleneck link.
After one action, if the utility function value increases, a positive reward value is obtained. If the utility function value decreases and RTT_ratio ≥ τ (since RTT_ratio is the ratio of the currently smoothed-estimated RTT to the minimum RTT, RTT_ratio ≥ τ indicates congestion), the reward value is negative, and the closer gp/inp is to 1, the smaller the penalty. In all other cases, the reward value is zero.
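The sketch below reproduces this reward behaviour. The sign structure and the zero case follow the description above; the particular magnitudes used (+1 for an improvement, -(1 - gp/inp) for the congestion case) and δ = 1 are assumptions, since the published text fixes only the signs and the trend in gp/inp.

```python
# Illustrative sketch only: utility and piecewise reward as described in the text.
import math

DELTA, TAU = 1.0, 1.2          # delta weights the RTT term; tau = 1.2 per the embodiment

def utility(gp, rtt):
    return math.log(gp) - DELTA * math.log(rtt)      # U_n = log(gp) - delta*log(RTT)

def reward(u_prev, u_now, gp, inp, rtt_ratio):
    if u_now > u_prev:
        return 1.0                         # utility improved: positive reward (magnitude assumed)
    if u_now < u_prev and rtt_ratio >= TAU:
        return -(1.0 - gp / inp)           # congestion: penalty shrinks as gp/inp -> 1
    return 0.0                             # all other cases
```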
The continuous-valued system state is mapped, by means of tile coding, into a feature vector containing only the discrete values 0 and 1. A value function reflecting the value of each sending action is then fitted as a linear function of this feature vector. The learning process of reinforcement learning consists in obtaining the weights of the value function for each sending action.
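A minimal sketch of such a tile-coded linear value model and its weight update follows; the number of tilings and tiles, the state bounds, the learning rate, the discount factor, and the use of a semi-gradient Q-learning-style target are assumptions of the sketch, since the text does not fix these hyper-parameters.

```python
# Illustrative sketch only: tile coding of the 3-dimensional state into a 0/1
# feature vector and a linear value function with one weight vector per action.
import numpy as np

class TileCodedQ:
    def __init__(self, n_actions=3, tilings=8, tiles=8,
                 lows=(1.0, 0.0, 0.0), highs=(3.0, 1.0, 1.0),
                 alpha=0.1, gamma=0.9):
        self.tilings, self.tiles = tilings, tiles
        self.lows, self.highs = np.array(lows), np.array(highs)
        self.dim = tilings * tiles ** len(lows)
        self.w = np.zeros((n_actions, self.dim))      # weights of the value function
        self.alpha, self.gamma = alpha / tilings, gamma

    def features(self, state):
        """Map a continuous state to a feature vector containing only 0s and 1s."""
        x = np.zeros(self.dim)
        s = (np.array(state) - self.lows) / (self.highs - self.lows)
        for t in range(self.tilings):                 # each tiling is offset slightly
            offset = t / (self.tilings * self.tiles)
            idx = np.clip(((s + offset) * self.tiles).astype(int), 0, self.tiles - 1)
            flat = t * self.tiles ** len(idx) + int(np.ravel_multi_index(idx, (self.tiles,) * len(idx)))
            x[flat] = 1.0                             # one active tile per tiling
        return x

    def value(self, state, action):
        return float(self.w[action] @ self.features(state))

    def update(self, s, a, r, s_next):
        """Semi-gradient one-step update of the weights for action a."""
        target = r + self.gamma * max(self.value(s_next, b) for b in range(len(self.w)))
        self.w[a] += self.alpha * (target - self.value(s, a)) * self.features(s)
```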
Specifically, the reinforcement learning process of the invention comprises the following steps:
s41, outputting a value function after weight updating and the value of each sending action according to the system state information and the reward and punishment values;
s42, selecting the sending action with the maximum value (namely the optimal sending action) in the current state according to the value of each sending action;
and S43, the sending end realizes the sending action according to the optimal sending action selected in the step S42.
Specifically, when the moment for a sending action arrives, the sending end first decides whether to send a packet at all; if it decides to send, it further decides whether to send a new source packet or to generate a repair packet from previously transmitted source packets.
In the embodiment of the invention, the optimal sending action is selected using an ε-greedy strategy, which works as follows: if a randomly drawn probability is lower than ε, an action is selected at random; otherwise, the action with the highest value in the current state is selected. Selecting actions according to the ε-greedy strategy realizes the joint optimization of stream coding rate control and congestion control; the new source packets or repair packets to be sent are stored in order in the UDP transmission buffer, waiting to be sent.
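A sketch of this ε-greedy selection over the three sending actions is given below, reusing the TileCodedQ model from the previous sketch; the value of ε and the action encoding are assumptions.

```python
# Illustrative sketch only: epsilon-greedy choice among the three sending actions.
import random

ACTIONS = ("send_source", "send_repair", "back_off")     # action indices 0, 1, 2

def choose_action(q, state, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(len(ACTIONS))             # explore: random action
    values = [q.value(state, a) for a in range(len(ACTIONS))]
    return max(range(len(ACTIONS)), key=values.__getitem__)   # exploit: highest value
```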
And S5, continuously repeating steps S3 and S4: according to the current network condition and packet loss rate, the packet sending interval is dynamically adjusted and the type of packet to send is intelligently selected, realizing the joint optimization of stream coding rate control and congestion control.
The code rate of the stream coding is the proportion of source-packet sending actions among all packet-sending actions: R = a/(a + b), where a is the number of transmitted source packets and b is the number of transmitted repair packets. The technical scheme of the invention therefore controls the code rate by controlling the proportion between the actions of sending source packets and sending repair packets.
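Putting the pieces together, the sketch below shows one way the sending-end loop of steps S3 to S5 could be wired up from the earlier sketches; the feedback field names, the encoder interface (send_source, send_repair) and the overall control flow are assumptions of this illustration, not a definitive implementation.

```python
# Illustrative sketch only: the repeated feedback -> state -> reward -> update ->
# action -> send cycle of steps S3-S5, reusing the helper sketches above.
import time

def sender_loop(q, feedback_queue, encoder, seed):
    sent_source = sent_repair = 0
    u_prev, state, last_action = None, (1.0, 0.0, 0.0), None
    while True:
        fb = feedback_queue.get()                 # blocking wait for receiver feedback
        next_state = build_state(fb.rtt, fb.rtt_min, fb.send_actions, fb.total_actions,
                                 fb.source_received, fb.packets_received)
        u_now = utility(fb.goodput, fb.rtt)
        if u_prev is not None:                    # update the value model (step S41)
            r = reward(u_prev, u_now, fb.goodput, fb.inp, fb.rtt / fb.rtt_min)
            q.update(state, last_action, r, next_state)
        u_prev, state = u_now, next_state
        last_action = choose_action(q, state)     # step S42: epsilon-greedy selection
        if ACTIONS[last_action] == "send_source":
            encoder.send_source(); sent_source += 1
        elif ACTIONS[last_action] == "send_repair":
            encoder.send_repair(seed); sent_repair += 1
        # stream coding rate R = a / (a + b), controlled through the action proportion
        code_rate = sent_source / max(sent_source + sent_repair, 1)
        time.sleep(send_interval(fb.packet_size, fb.bandwidth))
```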
As shown in fig. 1, the present invention further provides a packet transmission system based on reinforcement learning and stream coding driving, where the packet transmission system includes a transmitting end, a receiving end, a state space unit, a reward function unit, a value fitting unit, and an action selection unit. Wherein the state space unit, the reward function unit, the value fitting unit and the action selection unit constitute a reinforcement learning model as shown in fig. 2.
The system comprises a sending end and a receiving end, wherein the sending end is provided with an encoder, and the encoder sends an uncoded source packet or a coded repair packet;
and the receiving end, configured to decode and recover the received packets and deliver them in order to the upper-layer application, while sending feedback information to the sending end, wherein the feedback information comprises the decoding progress, the number and type of the most recently received packet, the number of received source packets, and the number of received repair packets.
The state space unit is arranged at the sending end and used for processing the feedback information sent by the receiving end and determining the system state information; the system state information includes the ratio of the current packet round trip delay to the minimum packet round trip delay, the ratio of the number of currently transmitted packet actions to the total number of actions, and the ratio of the number of currently transmitted source packets to the total number of packets.
And the reward function unit is used for calculating an output reward punishment value according to the reward function.
And the value fitting unit is used for mapping the system state information into a feature vector only containing discrete values 0 and 1 in a tile coding mode, then fitting the feature vector in a linear function form by combining the reward and punishment values to obtain a value function, and outputting the value of each sending action.
And the action selection unit is configured to select, according to the value of each sending action output by the value fitting unit, the sending action with the maximum value using an ε-greedy strategy, the selected action being performed by the sending end. The new source packets or repair packets to be sent are stored in order in the UDP transmission buffer, waiting to be sent.
In a specific application, the sending end sends stream-coded packets, the receiving-end decoder decodes them, and the receiving end continuously feeds back decoding and reception progress information and congestion indicators to the sending end. The sending end abstracts the state information from this feedback, calculates a reward value according to the reward function, inputs the reward value and the state information into the value fitting unit to obtain the value of each action and update the related fitting parameters, and finally selects the optimal action through the action selection unit. The reinforcement learning process is a process in which the agent's value fitting function is continuously iterated and updated, driven by the feedback of the receiving end. The model keeps learning as packet transmission proceeds, realizing joint optimization of congestion control and stream coding rate control. Network simulation results show that, under wireless long fat link conditions, the throughput obtained by the method of the invention far exceeds that of other methods. As shown in fig. 3, the goodput (gp) obtained by the scheme of the invention on a long fat wireless link with 1% packet loss rate, 100 ms delay, and 20 Mbps bandwidth is much higher than that of existing schemes such as QUIC, TCP BBR, and TCP CUBIC.
Although the present invention has been described in terms of a preferred embodiment, the invention is not limited to that embodiment. Any equivalent changes or modifications made without departing from the spirit and scope of the present invention also fall within the protection scope of the present invention. The scope of the invention should therefore be determined with reference to the appended claims.

Claims (5)

1. A packet transmission method based on reinforcement learning and stream coding driving is characterized by comprising the following steps:
s1, setting stream coding parameters;
s2, a sending end sends a packet, wherein the packet is an uncoded source packet or a coded repair packet;
s3, the receiving end decodes and recovers the received packets and orderly transmits the packets to an upper layer application, and simultaneously sends feedback information to the sending end, wherein the feedback information comprises decoding progress, the number and the type of the latest received packets, the number of the received source packets and the number of the received repair packets;
s4, the sending end processes the feedback information, determines system state information, calculates reward and punishment values according to a reward function, estimates available bandwidth of a link, determines interval time of sending actions of the sending end according to the available bandwidth of the link, and then conducts reinforcement learning;
the reinforcement learning is executed based on a reinforcement learning model, and the reinforcement learning method comprises the following steps:
s41, outputting a value function after weight updating and the value of each sending action according to the system state information and the reward and punishment values;
s42, selecting an optimal sending action according to the value of each sending action, wherein the optimal sending action is used as the sending action with the maximum value in the current state;
the system state information comprises the ratio of the current packet round-trip delay to the minimum packet round-trip delay, the ratio of the current sending packet action number to the total action number, and the ratio of the current sending source packet number to the total packet number; the sending action is one of sending source packet, sending repair packet and abandoning sending; the reward function is determined according to an optimization objective of packet transmission that maximizes its throughput for each user stream while minimizing latency;
s43, the sending end realizes sending action according to the optimal sending action selected in the step S42;
and S5, repeating the steps S3 and S4 to realize congestion control and stream coding rate control.
2. The packet transmission method according to claim 1, wherein the repair packet is a linear combination of previously transmitted source packets s_i, specifically expressed by the following formula:

c_k = Σ_{i = w_s}^{i_seq} g_{k,i} · s_i

where c_k denotes the repair packet numbered k, k = 0, 1, 2, 3, …; g_{k,i} are stream coding coefficients selected from a finite field F_q; w_s is the number of the oldest source packet in the current transmit queue, with initial value 0, and the value of w_s is continuously updated according to the feedback information; i_seq denotes the number of the most recently transmitted source packet.
3. The packet transmission method according to claim 1, wherein the reward function R(s, a) is defined piecewise: it takes a positive value when the utility function U_n increases after an action; a negative value, whose magnitude decreases as gp/inp approaches 1, when U_n decreases and RTT_ratio ≥ τ; and zero otherwise; wherein:
R(s, a) denotes the reward/punishment value when the system state information is s and the sending action is a; gp is the goodput, i.e., the number of in-order source packets received by the receiving end divided by the elapsed time; inp is the number of all packets sent by the sending end divided by the elapsed time; U_n is the utility function, U_n = log(gp) - δ·log(RTT), where RTT is a smoothed estimate of the round-trip delay; RTT_ratio is the ratio of the currently smoothed-estimated RTT to the minimum RTT; τ is a preset hyper-parameter.
4. The packet transmission method according to claim 1, wherein the value function is obtained by:
mapping the system state information, by means of tile coding, into a feature vector containing only the discrete values 0 and 1, and fitting a linear function of this feature vector in combination with the reward/punishment values to obtain the value function.
5. The packet transmission method according to claim 1, wherein selecting the optimal sending action according to the value of each sending action specifically comprises: selecting the optimal sending action using an ε-greedy strategy.
CN202011620034.2A 2020-12-31 2020-12-31 Packet transmission method and system based on reinforcement learning and stream coding driving Active CN112822718B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011620034.2A CN112822718B (en) 2020-12-31 2020-12-31 Packet transmission method and system based on reinforcement learning and stream coding driving

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011620034.2A CN112822718B (en) 2020-12-31 2020-12-31 Packet transmission method and system based on reinforcement learning and stream coding driving

Publications (2)

Publication Number Publication Date
CN112822718A (en) 2021-05-18
CN112822718B (en) 2021-10-12

Family

ID=75855909

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011620034.2A Active CN112822718B (en) 2020-12-31 2020-12-31 Packet transmission method and system based on reinforcement learning and stream coding driving

Country Status (1)

Country Link
CN (1) CN112822718B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110581808A (en) * 2019-08-22 2019-12-17 武汉大学 Congestion control method and system based on deep reinforcement learning

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101599965B (en) * 2009-07-02 2012-01-25 电子科技大学 Self-adaption high-speed information transmission method based on measurement
CN102137023B (en) * 2011-04-14 2014-01-29 中国人民解放军空军工程大学 Multicast congestion control method based on available bandwidth prediction
US8793557B2 (en) * 2011-05-19 2014-07-29 Cambrige Silicon Radio Limited Method and apparatus for real-time multidimensional adaptation of an audio coding system
CN109217977A (en) * 2017-06-30 2019-01-15 株式会社Ntt都科摩 Data transmission method for uplink, device and storage medium
CN107911242A (en) * 2017-11-15 2018-04-13 北京工业大学 A kind of cognitive radio based on industry wireless network and edge calculations method
CN110958078B (en) * 2019-11-01 2022-06-24 南通先进通信技术研究院有限公司 Low-delay stream code packet transmission method for high-loss link

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110581808A (en) * 2019-08-22 2019-12-17 武汉大学 Congestion control method and system based on deep reinforcement learning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Q-learning-based HTTP adaptive streaming bitrate control methods; 熊丽荣 et al.; 《通信学报》 (Journal on Communications); 2017-09-25 (No. 09); full text *

Also Published As

Publication number Publication date
CN112822718A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
US6934251B2 (en) Packet size control technique
CN107171842B (en) Multipath transmission protocol congestion control method based on reinforcement learning
WO2012174763A1 (en) Tcp-based adaptive network control transmission method and system
US6097697A (en) Congestion control
CN100588177C (en) Data transferring method, and communication system and program applied with the method
US10834368B2 (en) Kind of partially reliable transmission method based on hidden Markov model
CN111314022B (en) Screen updating transmission method based on reinforcement learning and fountain codes
CN105827537A (en) Congestion relieving method based on QUIC protocol
JP5009009B2 (en) Method and apparatus for controlling parameters of wireless data streaming system
US7376737B2 (en) Optimised receiver-initiated sending rate increment
RU2018117504A (en) METHOD FOR ADMINISTRATION OF ADAPTIVE AND JOINT IMPLEMENTATION OF ROUTING POLICY AND REDIAL TRANSFER POLICY AT A UNIT IN A UNDERWATER NETWORK, AND A MEANS FOR ITS IMPLEMENTATION
EP1251661A1 (en) Data flow control method
WO2006065008A1 (en) Apparatus for arq controlling in wireless portable internet system and method thereof
CN105450357A (en) Adjustment method of encoding parameters, adjustment device of encoding parameters, processing method of feedback information and processing device of feedback information
CN111818570A (en) Intelligent congestion control method and system for real network environment
US20130039209A1 (en) Data transfer
CN101588597A (en) Control method of wireless streaming media self-adapting mixing FEC/ARQ based on Kalman filtering
CN107070802A (en) Wireless sensor network Research of Congestion Control Techniques based on PID controller
CN113162850A (en) Artificial intelligence-based heterogeneous network multi-path scheduling method and system
Jarvinen et al. FASOR retransmission timeout and congestion control mechanism for CoAP
CN112822718B (en) Packet transmission method and system based on reinforcement learning and stream coding driving
CN109039541B (en) Link self-adaptive optimization method based on AOS communication system packet loss rate minimization
CN104980365A (en) TCP transmission acceleration method based on continuous packet losing congestion judgment
CN108337167B (en) Video multi-channel parallel transmission and distribution method and system based on ant colony algorithm
KR100419280B1 (en) Indirect acknowledgement method in snoop protocol in accordance with the status of a wireless link and packet transmission apparatus in an integrated network using the same method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant