CN114827683B - Video self-adaptive code rate control system and method based on reinforcement learning - Google Patents
Info
- Publication number
- CN114827683B (application CN202210405701.8A)
- Authority
- CN
- China
- Prior art keywords
- code rate
- throughput
- reinforcement learning
- network
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/25—Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
- H04N21/266—Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
- H04N21/2662—Controlling the complexity of the video stream, e.g. by scaling the resolution or bitrate of the video stream based on the client capabilities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention discloses a video adaptive code rate control system and method based on reinforcement learning, comprising throughput prediction, reinforcement learning decision and smoothing control, in which throughput prediction and buffer information are considered together when making the ABR decision. In a low-delay scenario, throughput is first predicted by a neural network and the decision is then made by reinforcement learning, realizing smoothly controlled, optimized adaptive code rate adjustment in low-delay video streams. Compared with the prior art, the method does not depend on a QoE model in the adaptive code rate decision process, but learns from real-time conditions and continuously adjusts its strategy; it relieves frequent code rate switching when bandwidth fluctuation occurs and improves viewing quality, thereby improving the user's viewing experience.
Description
Technical Field
The invention relates to the field of streaming media transmission over the Internet, and in particular to a video adaptive code rate control system and method based on throughput prediction and reinforcement learning.
Background
In current global Internet traffic statistics, video traffic accounts for over 50% of the total and continues to grow year by year. Video-on-demand has become indispensable, and demand for live video and video conferencing of all kinds keeps increasing. To improve the quality of the video user's experience (quality of experience, QoE), adaptive bitrate (ABR) technology can effectively improve user QoE by automatically adjusting the resolution. In ABR technology the video stream is transmitted in blocks, and the code rate of the video block for the next moment is selected for transmission according to the decision result, thereby realizing the adaptive effect.
Current ABR implementations fall into three classes of algorithms: throughput-prediction-based algorithms, buffer-information-based algorithms, and hybrid algorithms. Throughput-prediction-based algorithms often fall into stalling when the bandwidth environment is poor, and buffer-information-based algorithms may remain conservative for long periods, so neither class alone remains suitable for ABR. Hybrid algorithms use both throughput prediction and buffer information to make the adaptive code rate decision.
With the development of artificial intelligence, reinforcement learning has been applied to ABR technology. How to combine throughput prediction and buffer information in a hybrid algorithm is a key technical issue. Existing solutions address it by placing both kinds of information, throughput prediction and buffer information, into the state space of a reinforcement learning agent so that the two can be fused effectively. However, given changing bandwidth environments and the demand for low-delay video streaming, plain reinforcement learning cannot fully solve the ABR problem: bandwidth fluctuation causes continuous switching of the code rate level, giving the user a poor viewing experience; the smaller buffer in low-delay scenarios weakens the contribution of buffer-information-based decisions; and simply placing throughput information into the reinforcement learning state space does not achieve in low-delay video streams the effect it has in non-low-delay ones.
Disclosure of Invention
The invention provides a video adaptive code rate decision system and method based on reinforcement learning that consider throughput prediction and buffer information together when making the ABR decision. In a low-delay scenario, throughput is first predicted by a neural network and the decision is then made by reinforcement learning, realizing smoothly controlled, optimized adaptive code rate adjustment in low-delay video streams.
The video adaptive code rate control system based on reinforcement learning adjusts the video code rate of a low-delay video stream in real time according to the network environment, and comprises a throughput prediction module, a reinforcement learning decision module and a smoothing control module; wherein:
the throughput prediction module is used for calculating the historical throughput, inputting the throughput sequence into an LSTM neural network for prediction to obtain a throughput prediction result for the future moment, and using this result as a parameter of the state space in the reinforcement learning neural network;
the reinforcement learning decision module is used for constructing a reinforcement learning neural network (Dueling DQN), taking the throughput prediction result, buffer information and other effective information as the state space, and computing a preliminary code rate level through the agent network, while the training part of the module continuously updates the agent according to historical information;
the smoothing control module is used for obtaining the preliminary code rate decision result from the reinforcement learning module together with the code rate information of the previous moment, defining the range of allowed bandwidth fluctuation according to the historical throughput, and deciding whether to use smoothing control according to whether the historical throughput lies within the allowed fluctuation range; when smoothing control is used, the same code rate level as at the previous moment is adopted as the result, except that when the previous moment used the highest code rate, the code rate level of the previous moment reduced by one level is adopted as the result.
The video adaptive code rate control method based on reinforcement learning disclosed by the invention comprises the following steps:
step 1: performing throughput prediction, namely predicting the future throughput from the acquired historical throughput by using the LSTM neural network, and taking the future throughput as a parameter of a state space in the reinforcement learning neural network; the method specifically comprises the following steps:
step 1.1: acquiring historical data, including the historical transmitted video block size Chunk_size and download time Chunk_during;
Step 1.2: the historical throughput is calculated as follows:

historical throughput = Chunk_size / Chunk_during

Then the throughput values of the most recent 30 transmitted video blocks are collected into the historical throughput sequence X, and X is normalized by the following min-max normalization formula:

X_norm = (X - X_min) / (X_max - X_min)

where X_norm is the normalized result, X is the original value, X_max is the maximum value in the sequence, and X_min is the minimum value in the sequence;
step 1.3: constructing an LSTM neural network as the prediction model; the LSTM structure comprises a forget gate, an input gate and an output gate, and, being a recurrent structure, the prediction result of the LSTM at one moment is fed back into the network at the next moment when the data of the next moment are predicted;
step 1.4: taking the result X_norm obtained by normalizing the historical throughput sequence as the input of the LSTM neural network;
step 1.5: outputting the predicted value, i.e., the normalized throughput prediction, which serves as the throughput prediction part of the state space input to the network of step 2.1;
step 2: constructing the reinforcement learning network; its structure comprises an agent, a state space S through which the agent interacts with the environment, a selection action a issued by the agent to the environment, stored transitions (s', a, r, s), an experience pool and a training network; the ABR algorithm of the agent is applied to bandwidth-fluctuating and low-delay video streams to realize the decision function, and the ABR strategy is continuously learned and updated through the training network; this specifically comprises the following steps:
step 2.1: the agent and the training network in the reinforcement learning network share the same network structure; the Reward (return) they use takes the form

Reward = Σ_{i=1..chunk_num} (Quality[i] - λ·Buffer[i] - μ·|Smooth[i]|)

where chunk_num is the number of transmitted video blocks, Quality[i] is the transmission code rate of the i-th video block, Buffer[i] is the rebuffering (stall) duration during transmission of the i-th video block, λ and μ weight the rebuffering and smoothness penalties, and Smooth[i] is the difference between the code rates of the (i+1)-th and the i-th video blocks:

Smooth[i] = Quality[i+1] - Quality[i]
step 2.2: performing action matching, i.e., mapping the output actions of the reinforcement learning network into the ABR algorithm; four code rate levels 360p, 480p, 720p and 1080p are used, corresponding to level = [1, 2, 3, 4] and, respectively, to the selectable action space a = [1, 2, 3, 4];
step 2.3: the state space S is defined as follows:

S = [X, l_t, r_t, c_t, p_t, n_t]

where X is the historical throughput prediction sequence obtained from step 1.2, l_t is the download time of the last video block, r_t is the rebuffering duration at the previous moment, c_t is the code rate selected at the previous moment, p_t is the current buffer size, and n_t is the size of the video block to be transmitted at the next moment;
step 2.4: agent decision, i.e., the corresponding information is obtained from the state space S of step 2.3 and the decision is made in the current agent neural network; during the decision, the agent performs Reward matching against the state space information, the action with the highest score being the optimal action, which is obtained and output as the current optimal action;
step 2.5: training the network; while continuously making decisions, the agent generates a series of combined information, here the stored transitions T = [s', a, r, s], which are put into the experience pool; the training network randomly samples individual transitions from the experience pool, trains on them at fixed intervals, and updates the agent network;

where s' is the state space information of the previous moment, a is the optimal action selected this time, r is the Reward value of the previous action selection, and s is the state space information of the present moment;
step 3: performing smooth control of bandwidth fluctuation; the method specifically comprises the following steps:
step 3.1: acquiring the optimal action a and the other information computed in step 2.4, where the optimal action a corresponds to the video code rate level and the other information includes the code rate selected at the previous moment and the historical throughput sequence;
step 3.2: defining the bandwidth fluctuation range: first, the central axis value C_t of the fluctuation limits at time t is determined by the formula

C_t = α·C_{t-1} + β·Bandwidth_{t-1}

where α and β are the weights of the central axis value at the previous moment and of the throughput at the previous moment respectively, C_{t-1} is the central axis value at time t-1, and Bandwidth_{t-1} is the throughput at time t-1; then the upper and lower limits of the bandwidth fluctuation range, i.e., the bandwidth fluctuation bounds, are defined as an upper bound C_t^up above and a lower bound C_t^low below the central axis value;
Step 3.3: the current bandwidth fluctuation state is obtained, namely: counting throughput number flu in bandwidth fluctuation range r And the number of throughput flu out of range f If flu is r -flu f > 10, it indicates that the bandwidth fluctuation state is present at this time, and flu_flag=1 is set; otherwise, indicating that bandwidth fluctuation does not occur at the moment, and setting flu_flag=0;
step 3.4: selecting a smoothing control strategy; the method comprises the following specific steps:
step 3.4.1: acquiring the current bandwidth fluctuation state in the step 3.3;
step 3.4.2: if in the bandwidth fluctuation state (flu_flag = 1), go to step 3.4.3; otherwise set bitrate = a, where a is the optimal action obtained in step 2.4, and go to step 3.4.4;
step 3.4.3: according to the diminishing marginal effect of the Mean Opinion Score (MOS), selecting a conservative strategy when the bandwidth fluctuates, i.e., if the code rate level selected at the previous moment is 4 and the bandwidth is currently in the fluctuation state, the code rate level selected at the current moment is set to 3 (bitrate = 3); otherwise the code rate level selected at the current moment is set to the code rate level selected at the previous moment (bitrate = last_bitrate);
step 3.4.4: outputting the final code rate level decision.
Compared with the prior art, the invention has the following beneficial technical effects:
1) In the adaptive code rate decision process, the method does not depend on a QoE model; it learns from real-time conditions and continuously adjusts its strategy;
2) The method relieves frequent code rate switching when bandwidth fluctuation occurs and improves viewing quality, thereby improving the user's viewing experience;
3) The method effectively remedies the defect of previous reinforcement-learning ABR algorithms under the condition of a reduced buffer, and performs throughput prediction more effectively while still using a reinforcement-learning ABR algorithm;
4) The fluctuation range is defined, and a fluctuation conservative strategy based on MOS characteristics is used.
Drawings
FIG. 1 is a block diagram of a reinforcement learning-based video adaptive rate control system of the present invention;
FIG. 2 is a flow chart of a video adaptive rate control method based on reinforcement learning according to the present invention;
FIG. 3 is a schematic diagram of a reinforcement learning network structure;
FIG. 4 is a flow chart of bandwidth fluctuation determination and smoothing control;
fig. 5 is a flowchart of the selection smoothing control.
Detailed Description
The technical scheme of the invention is further described in detail below with reference to the attached drawings and specific embodiments.
Fig. 1 is a block diagram of a video adaptive code rate decision system based on reinforcement learning according to the present invention. The system comprises a throughput prediction module, a reinforcement learning decision module and a smoothing control module. Wherein:
the throughput prediction module 100 is configured to calculate throughput of a past period of time, input a past throughput sequence into a prediction neural network (LSTM), obtain a throughput prediction result at a future time through calculation of the prediction neural network, and use the result as a parameter of a state space in the reinforcement learning neural network.
The reinforcement learning decision module 200 is the main module that makes the preliminary code rate decision and realizes the function of learning and updating the decision strategy. The reinforcement learning neural network uses a Dueling DQN network, takes the throughput prediction result, buffer information and other effective information as the state space, and computes the preliminary code rate level through the agent network, while the training part of the module continuously updates the agent according to historical information.
The smoothing control module 300 is configured to obtain the preliminary code rate decision result from the reinforcement learning module together with the code rate information of the previous moment, define the range of allowed fluctuation according to the throughput of the past period, and decide whether to use smoothing control according to whether the historical throughput lies within the allowed fluctuation range; when smoothing control is used, the same code rate level as at the previous moment is adopted as the result, except that when the previous moment used the highest code rate (which also accounts for fluctuation in a good network environment), the code rate level of the previous moment reduced by one level is adopted as the result.
As shown in fig. 2, the adaptive code rate transmission method based on reinforcement learning uses the LSTM prediction neural network to predict throughput from the past throughput sequence and the Dueling DQN reinforcement learning neural network to train and make decisions, realizing optimized adaptive code rate transmission in low-delay video streams; the specific steps are as follows:
step 1: performing throughput prediction, namely predicting the future throughput from the acquired historical throughput by using the LSTM neural network, and taking the future throughput as a parameter of a state space in the reinforcement learning neural network; the method specifically comprises the following steps:
step 1.1: acquiring historical data, including the historical transmitted video block size Chunk_size and download time Chunk_during;
Step 1.2: the historical throughput is calculated as follows:

historical throughput = Chunk_size / Chunk_during

Then the throughput values of the most recent 30 transmitted video blocks are collected into the historical throughput sequence X, and X is normalized by the following min-max normalization formula:

X_norm = (X - X_min) / (X_max - X_min)

where X_norm is the normalized result, X is the original value, X_max is the maximum value in the sequence, and X_min is the minimum value in the sequence;
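By way of illustration only, a minimal Python sketch of steps 1.1-1.2 is given below; the function name, the 30-block window default, and the unit conventions for block size and download time are assumptions of the example rather than features fixed by the method:

```python
import numpy as np

def normalized_throughput_window(chunk_sizes, chunk_durings, window=30):
    """Sketch of steps 1.1-1.2: per-block throughput = Chunk_size / Chunk_during,
    then min-max normalization over the most recent `window` values."""
    throughput = np.asarray(chunk_sizes, dtype=float) / np.asarray(chunk_durings, dtype=float)
    X = throughput[-window:]                  # historical throughput sequence X
    x_min, x_max = X.min(), X.max()
    if x_max == x_min:                        # constant sequence: avoid division by zero
        return np.zeros_like(X)
    return (X - x_min) / (x_max - x_min)      # X_norm = (X - X_min) / (X_max - X_min)
```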
step 1.3: constructing an LSTM neural network as the prediction model; the LSTM structure comprises a forget gate, an input gate and an output gate, and, being a recurrent structure, the prediction result of the LSTM at one moment is fed back into the network at the next moment when the data of the next moment are predicted;
step 1.4: taking the result X_norm obtained by normalizing the historical throughput sequence as the input of the LSTM neural network;
step 1.5: outputting the predicted value, i.e., the normalized throughput prediction, which serves as the throughput prediction part of the state space input to the network of step 2.1;
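A minimal sketch of the LSTM prediction model of steps 1.3-1.5, assuming PyTorch; the hidden size and single-layer topology are assumptions, since the patent does not fix the network dimensions:

```python
import torch
import torch.nn as nn

class ThroughputLSTM(nn.Module):
    """LSTM predictor: the forget, input and output gates are internal to nn.LSTM;
    the model maps a normalized throughput window to the next normalized value."""
    def __init__(self, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                 # x: (batch, 30, 1) normalized throughputs
        out, _ = self.lstm(x)             # out: (batch, 30, hidden_size)
        return self.head(out[:, -1, :])   # predicted value for the next moment
```

A window produced by the normalization sketch above would be fed in as torch.tensor(x_norm, dtype=torch.float32).view(1, -1, 1).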
step 2: establishing a reinforcement learning network; the method specifically comprises the following steps:
FIG. 3 is a schematic diagram of the reinforcement learning network structure. The reinforcement learning network is the main body of the invention; its structure comprises an agent, a state space S through which the agent interacts with the environment, a selection action a issued by the agent to the environment, stored transitions (s', a, r, s), an experience pool and a training network. The ABR algorithm of the agent is applied to bandwidth-fluctuating and low-delay video streams to realize the decision function, and the ABR strategy is continuously learned and updated through the training network. The reinforcement learning network preferably uses a Dueling DQN network; when constructing it, the state value and the action value are first split into two networks and then combined to obtain the final Dueling DQN network.
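The splitting and recombination of the state value and the action value described above can be sketched as follows (layer widths are assumptions; the patent does not specify them):

```python
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Dueling architecture sketch: a shared trunk feeds a state-value stream V(s)
    and an advantage stream A(s, a), recombined as Q = V + A - mean(A)."""
    def __init__(self, state_dim, n_actions=4, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value = nn.Linear(hidden, 1)              # state-value stream V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # advantage stream A(s, a)

    def forward(self, s):
        h = self.trunk(s)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)     # Q(s, a) over the 4 code rate levels
```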
Step 2.1: the agent and the training network in the reinforcement learning network share the same network structure; the Reward (return) they use takes the form

Reward = Σ_{i=1..chunk_num} (Quality[i] - λ·Buffer[i] - μ·|Smooth[i]|)

where chunk_num is the number of transmitted video blocks, Quality[i] is the transmission code rate of the i-th video block, Buffer[i] is the rebuffering (stall) duration during transmission of the i-th video block, λ and μ weight the rebuffering and smoothness penalties, and Smooth[i] is the difference between the code rates of the (i+1)-th and the i-th video blocks:

Smooth[i] = Quality[i+1] - Quality[i]
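The Reward above can be evaluated as in the following sketch; the weights lam and mu are illustrative assumptions, since the text fixes the three terms but not their coefficients:

```python
def episode_reward(quality, rebuffer, lam=1.0, mu=1.0):
    """Sketch: sum over chunks of Quality[i] - lam*Buffer[i] - mu*|Smooth[i]|,
    with Smooth[i] = Quality[i+1] - Quality[i]; lam and mu are assumed weights."""
    total = 0.0
    for i in range(len(quality)):
        smooth = quality[i + 1] - quality[i] if i + 1 < len(quality) else 0.0
        total += quality[i] - lam * rebuffer[i] - mu * abs(smooth)
    return total
```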
step 2.2: performing action matching, i.e., mapping the output actions of the reinforcement learning network into the ABR algorithm; four code rate levels 360p, 480p, 720p and 1080p are used, corresponding to level = [1, 2, 3, 4] and, respectively, to the selectable action space a = [1, 2, 3, 4];
step 2.3: the state space S is defined as follows:

S = [X, l_t, r_t, c_t, p_t, n_t]

where X is the historical throughput prediction sequence obtained from step 1.2, l_t is the download time of the last video block, r_t is the rebuffering duration at the previous moment, c_t is the code rate selected at the previous moment, p_t is the current buffer size, and n_t is the size of the video block to be transmitted at the next moment;
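Steps 2.2-2.3 amount to a fixed action-to-level mapping and a flat state vector; a sketch follows, in which the concatenation order within the vector is an assumption of the example:

```python
import numpy as np

LEVELS = {1: "360p", 2: "480p", 3: "720p", 4: "1080p"}   # action space a = [1, 2, 3, 4]

def build_state(x, l_t, r_t, c_t, p_t, n_t):
    """Flatten S = [X, l_t, r_t, c_t, p_t, n_t] into one vector for the agent network."""
    tail = np.asarray([l_t, r_t, c_t, p_t, n_t], dtype=np.float32)
    return np.concatenate([np.asarray(x, dtype=np.float32), tail])
```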
step 2.4: agent decision, i.e., the corresponding information is obtained from the state space S of step 2.3 and the decision is made in the current agent neural network; during the decision, the agent performs Reward matching against the state space information, the action with the highest score being the optimal action, which is obtained and output as the current optimal action a;
step 2.5: the network is trained. While continuously making decisions, the agent generates a series of combined information, here the stored transitions T = [s', a, r, s], which are put into the experience pool; the training network randomly samples individual transitions from the experience pool, trains on them at fixed intervals, and updates the agent network;

where s' is the state space information of the previous moment, a is the action selected this time, r is the Reward value of the previous action selection, and s is the state space information of the present moment;
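Step 2.5 corresponds to standard experience replay; the sketch below assumes a single Dueling DQN network (no separate target network), a Huber loss, and typical hyperparameters, none of which are fixed by the patent:

```python
import random
from collections import deque
import numpy as np
import torch
import torch.nn.functional as F

replay = deque(maxlen=10_000)       # experience pool of transitions T = (s', a, r, s)

def train_step(agent, optimizer, batch_size=32, gamma=0.99):
    """One update: sample random transitions and regress Q(s', a)
    toward r + gamma * max_a' Q(s, a')."""
    if len(replay) < batch_size:
        return
    s_prev, a, r, s = zip(*random.sample(replay, batch_size))
    s_prev = torch.as_tensor(np.stack(s_prev))
    s = torch.as_tensor(np.stack(s))
    a = torch.as_tensor(a, dtype=torch.int64) - 1          # actions are 1-based levels
    r = torch.as_tensor(r, dtype=torch.float32)
    q = agent(s_prev).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * agent(s).max(dim=1).values
    loss = F.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```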
step 3: performing smooth control of bandwidth fluctuation;
Severe bandwidth fluctuation causes the ABR decision algorithm to adjust the video code rate frequently, giving the user a poor viewing experience; therefore, the code rate switching frequency must be controlled as the bandwidth fluctuates.
as shown in fig. 4, a bandwidth fluctuation determination and smoothing control flow chart is shown.
Step 3.1: acquiring the action a and the other information computed in step 2.4, where the action a corresponds to the video code rate level and the other information includes the code rate selected at the previous moment and the historical throughput sequence;
step 3.2: defining the bandwidth fluctuation range: first, the central axis value C_t of the fluctuation limits at time t is determined by the formula

C_t = α·C_{t-1} + β·Bandwidth_{t-1}

where α and β are the weights of the central axis value at the previous moment and of the throughput at the previous moment (α = 0.2, β = 0.8), C_{t-1} is the central axis value at time t-1, and Bandwidth_{t-1} is the throughput at time t-1; then the upper and lower limits of the bandwidth fluctuation range, i.e., the bandwidth fluctuation bounds, are defined as an upper bound C_t^up above and a lower bound C_t^low below the central axis value;
Step 3.3: obtaining the current bandwidth fluctuation state, i.e., counting the number of throughput samples flu_r inside the bandwidth fluctuation range and the number flu_f outside the range; if flu_r - flu_f > 10, the bandwidth is in the fluctuation state at this moment and flu_flag = 1 is set; otherwise bandwidth fluctuation is not occurring and flu_flag = 0 is set;
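Steps 3.2-3.3 can be sketched together as below; the symmetric multiplicative margin defining the upper and lower bounds is an assumption of the example:

```python
def bandwidth_flu_flag(throughputs, alpha=0.2, beta=0.8, margin=0.15):
    """Sketch of steps 3.2-3.3: track C_t = alpha*C_{t-1} + beta*Bandwidth_{t-1},
    count samples inside (flu_r) and outside (flu_f) the fluctuation band,
    and set flu_flag = 1 when flu_r - flu_f > 10."""
    c = throughputs[0]                         # initialize the central axis value
    flu_r = flu_f = 0
    for t in range(1, len(throughputs)):
        c = alpha * c + beta * throughputs[t - 1]
        lower, upper = c * (1 - margin), c * (1 + margin)   # assumed bound shape
        if lower <= throughputs[t] <= upper:
            flu_r += 1
        else:
            flu_f += 1
    return 1 if flu_r - flu_f > 10 else 0
```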
step 3.4: selecting the smoothing control strategy; the flow is shown in fig. 5. The specific steps are as follows:
Step 3.4.1: acquiring the current bandwidth fluctuation state in the step 3.3;
step 3.4.2: determining that if in the bandwidth fluctuation state (flu_flag=1), turning to step 3.4.3, otherwise, the bit=aa is the optimal action obtained in step 2.4, turning to step 3.4.4;
step 3.4.3: according to the diminishing marginal effect of the Mean Opinion Score (MOS), a conservative strategy is selected when the bandwidth fluctuates, that is, if the code rate level selected at the previous moment is 4 (i.e., a code rate of 1080p) and the bandwidth is currently in the fluctuation state, the code rate level selected at the current moment is set to 3 (bitrate = 3); otherwise the code rate level selected at the current moment is set to the code rate level selected at the previous moment (bitrate = last_bitrate).
Step 3.4.4: outputting the final code rate level decision.
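The selection logic of steps 3.4.1-3.4.4 reduces to a few lines; a sketch:

```python
HIGHEST_LEVEL = 4      # level 4 corresponds to 1080p

def select_bitrate(a, last_bitrate, flu_flag):
    """Hold the previous level during bandwidth fluctuation, stepping down one
    level when the previous level was the highest; otherwise pass through a."""
    if flu_flag == 1:
        return 3 if last_bitrate == HIGHEST_LEVEL else last_bitrate
    return a           # no fluctuation: adopt the agent's optimal action directly
```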
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the particular embodiments disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.
Claims (5)
1. A video adaptive code rate control method based on reinforcement learning, characterized by comprising the following steps:
step 1: performing throughput prediction, namely predicting the future throughput from the acquired historical throughput by using the LSTM neural network, and taking the future throughput as a parameter of a state space in the reinforcement learning neural network; the method specifically comprises the following steps:
step 1.1: acquiring historical data, including the historical transmitted video block size Chunk_size and download time Chunk_during;
Step 1.2: the historical throughput is calculated as follows:

historical throughput = Chunk_size / Chunk_during

Then the throughput values of the most recent 30 transmitted video blocks are collected into the historical throughput sequence X, and X is normalized by the following min-max normalization formula:

X_norm = (X - X_min) / (X_max - X_min)

where X_norm is the normalized result, X is the original value, X_max is the maximum value in the sequence, and X_min is the minimum value in the sequence;
step 1.3: constructing an LSTM neural network as the prediction model; the LSTM structure comprises a forget gate, an input gate and an output gate, and, being a recurrent structure, the prediction result of the LSTM at one moment is fed back into the network at the next moment when the data of the next moment are predicted;
step 1.4: taking the result X_norm obtained by normalizing the historical throughput sequence as the input of the LSTM neural network;
step 1.5: outputting the predicted value, i.e., the normalized throughput prediction, which serves as the throughput prediction part of the state space input to the network of step 2.1;
step 2: constructing the reinforcement learning network; its structure comprises an agent, a state space S through which the agent interacts with the environment, a selection action a issued by the agent to the environment, stored transitions (s', a, r, s), an experience pool and a training network; the ABR algorithm of the agent is applied to bandwidth-fluctuating and low-delay video streams to realize the decision function, and the ABR strategy is continuously learned and updated through the training network; this specifically comprises the following steps:
step 2.1: the agent and the training network in the reinforcement learning network share the same network structure; the Reward (return) they use takes the form

Reward = Σ_{i=1..chunk_num} (Quality[i] - λ·Buffer[i] - μ·|Smooth[i]|)

where chunk_num is the number of transmitted video blocks, Quality[i] is the transmission code rate of the i-th video block, Buffer[i] is the rebuffering (stall) duration during transmission of the i-th video block, λ and μ weight the rebuffering and smoothness penalties, and Smooth[i] is the difference between the code rates of the (i+1)-th and the i-th video blocks:

Smooth[i] = Quality[i+1] - Quality[i]
step 2.2: performing action matching, i.e., mapping the output actions of the reinforcement learning network into the ABR algorithm; four code rate levels 360p, 480p, 720p and 1080p are used, corresponding to level = [1, 2, 3, 4] and, respectively, to the selectable action space a = [1, 2, 3, 4];
step 2.3: the state space S is defined as follows:

S = [X, l_t, r_t, c_t, p_t, n_t]

where X is the historical throughput prediction sequence obtained from step 1.2, l_t is the download time of the last video block, r_t is the rebuffering duration at the previous moment, c_t is the code rate selected at the previous moment, p_t is the current buffer size, and n_t is the size of the video block to be transmitted at the next moment;
step 2.4: agent decision, i.e., the corresponding information is obtained from the state space S of step 2.3 and the decision is made in the current agent neural network; during the decision, the agent performs Reward matching against the state space information, the action with the highest score being the optimal action, which is obtained and output as the current optimal action a;
step 2.5: training the network; while continuously making decisions, the agent generates a series of combined information, here the stored transitions T = [s', a, r, s], which are put into the experience pool; the training network randomly samples individual transitions from the experience pool, trains on them at fixed intervals, and updates the agent network; where s' is the state space information of the previous moment, a is the optimal action selected this time, r is the Reward value of the previous action selection, and s is the state space information of the present moment;
step 3: performing smooth control of bandwidth fluctuation; the method specifically comprises the following steps:
step 3.1: acquiring the optimal action a and the other information computed in step 2.4, where the optimal action a corresponds to the video code rate level and the other information includes the code rate selected at the previous moment and the historical throughput sequence;
step 3.2: defining the bandwidth fluctuation range: first, the central axis value C_t of the fluctuation limits at time t is determined by the formula

C_t = α·C_{t-1} + β·Bandwidth_{t-1}

where α and β represent the weights of the central axis value at the previous moment and of the throughput at the previous moment respectively, C_{t-1} is the central axis value at time t-1, and Bandwidth_{t-1} is the throughput at time t-1; then the upper and lower limits of the bandwidth fluctuation range, i.e., the bandwidth fluctuation bounds, are defined as an upper bound C_t^up above and a lower bound C_t^low below the central axis value;
Step 3.3: acquiring the current bandwidth fluctuation state, i.e., counting the number of throughput samples flu_r inside the bandwidth fluctuation range and the number flu_f outside the range; if flu_r - flu_f > 10, the bandwidth is in the fluctuation state at this moment and flu_flag = 1 is set; otherwise bandwidth fluctuation is not occurring and flu_flag = 0 is set;
step 3.4: selecting a smoothing control strategy; the method comprises the following specific steps:
step 3.4.1: acquiring the current bandwidth fluctuation state in the step 3.3;
step 3.4.2: if in the bandwidth fluctuation state (flu_flag = 1), go to step 3.4.3; otherwise set bitrate = a, where a is the optimal action obtained in step 2.4, and go to step 3.4.4;
step 3.4.3: a conservative strategy is selected when the bandwidth fluctuates, i.e., if the code rate level selected at the previous moment is 4 and the bandwidth is currently in the fluctuation state, the code rate level selected at the current moment is set to 3; otherwise the code rate level selected at the current moment is set to the code rate level selected at the previous moment;
step 3.4.4: outputting the final code rate level decision.
2. A video adaptive code rate control system based on reinforcement learning, for performing the video adaptive code rate control method based on reinforcement learning according to claim 1 and adjusting the video code rate of a low-delay video stream in real time according to the network environment, characterized in that the system comprises a throughput prediction module, a reinforcement learning decision module and a smoothing control module; wherein:
the throughput prediction module is used for calculating the historical throughput, inputting the throughput sequence into an LSTM neural network for prediction to obtain a throughput prediction result for the future moment, and using this result as a parameter of the state space in the reinforcement learning neural network;
the reinforcement learning decision module is used for constructing a reinforcement learning neural network (Dueling DQN), taking the throughput prediction result, buffer information and other effective information as the state space, and computing a preliminary code rate level through the agent network, while the training part of the module continuously updates the agent according to historical information;
the smoothing control module is used for acquiring the preliminary code rate decision result from the reinforcement learning module together with the code rate information of the previous moment, defining the range of allowed bandwidth fluctuation according to the historical throughput, and deciding whether to use smoothing control according to whether the historical throughput lies within the allowed fluctuation range; when smoothing control is used, the same code rate level as at the previous moment is adopted as the result, except that when the previous moment used the highest code rate, the code rate level of the previous moment reduced by one level is adopted as the result.
3. The video adaptive code rate control system according to claim 2, wherein the LSTM neural network structure comprises a forget gate, an input gate and an output gate, and, as a recurrent structure, the prediction result of the LSTM at one moment is fed back into the network at the next moment when the data of the next moment are predicted.
4. The video adaptive code rate control system based on reinforcement learning according to claim 2, wherein the reinforcement learning network structure comprises an agent, a state space S through which the agent interacts with the environment, a selection action a issued by the agent to the environment, stored transitions (s', a, r, s), an experience pool, and a training network; the ABR algorithm of the agent is applied to bandwidth-fluctuating and low-delay video streams to realize the decision function, and the ABR strategy is continuously learned and updated through the training network.
5. The video adaptive code rate control system based on reinforcement learning according to claim 4, wherein, when constructing the reinforcement learning network, the state value and the action value are first split into two networks and then combined to obtain the final Dueling DQN network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210405701.8A CN114827683B (en) | 2022-04-18 | 2022-04-18 | Video self-adaptive code rate control system and method based on reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210405701.8A CN114827683B (en) | 2022-04-18 | 2022-04-18 | Video self-adaptive code rate control system and method based on reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114827683A CN114827683A (en) | 2022-07-29 |
CN114827683B true CN114827683B (en) | 2023-11-07 |
Family
ID=82537363
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210405701.8A Active CN114827683B (en) | 2022-04-18 | 2022-04-18 | Video self-adaptive code rate control system and method based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114827683B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116916113B (en) * | 2023-09-06 | 2023-12-22 | 联通(江苏)产业互联网有限公司 | Data stream smoothing method based on 5G video customer service |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109218744A (en) * | 2018-10-17 | 2019-01-15 | 华中科技大学 | A kind of adaptive UAV Video of bit rate based on DRL spreads transmission method |
CN112508172A (en) * | 2020-11-23 | 2021-03-16 | 北京邮电大学 | Space flight measurement and control adaptive modulation method based on Q learning and SRNN model |
CN113596785A (en) * | 2021-07-26 | 2021-11-02 | 吉林大学 | D2D-NOMA communication system resource allocation method based on deep Q network |
WO2022000298A1 (en) * | 2020-06-30 | 2022-01-06 | Microsoft Technology Licensing, Llc | Reinforcement learning based rate control |
EP3968648A1 (en) * | 2020-01-16 | 2022-03-16 | Beijing Dajia Internet Information Technology Co., Ltd. | Bitrate decision model training method and electronic device |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109218744A (en) * | 2018-10-17 | 2019-01-15 | 华中科技大学 | A kind of adaptive UAV Video of bit rate based on DRL spreads transmission method |
EP3968648A1 (en) * | 2020-01-16 | 2022-03-16 | Beijing Dajia Internet Information Technology Co., Ltd. | Bitrate decision model training method and electronic device |
WO2022000298A1 (en) * | 2020-06-30 | 2022-01-06 | Microsoft Technology Licensing, Llc | Reinforcement learning based rate control |
CN112508172A (en) * | 2020-11-23 | 2021-03-16 | 北京邮电大学 | Space flight measurement and control adaptive modulation method based on Q learning and SRNN model |
CN113596785A (en) * | 2021-07-26 | 2021-11-02 | 吉林大学 | D2D-NOMA communication system resource allocation method based on deep Q network |
Non-Patent Citations (1)
Title |
---|
Adaptive bitrate hybrid control algorithm based on HTTP streaming; Chen Liwei; Li Guoping; Teng Guowei; Zhao Haiwu; Wang Guozhong; Journal of Shanghai University (Natural Science Edition); Vol. 20, No. 3; full text *
Also Published As
Publication number | Publication date |
---|---|
CN114827683A (en) | 2022-07-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110248247B (en) | Embedded dynamic video playing control method and device based on network throughput | |
CN112291620A (en) | Video playing method and device, electronic equipment and storage medium | |
CN111669617B (en) | Live video stream transmission method based on intelligent edge | |
CN114827683B (en) | Video self-adaptive code rate control system and method based on reinforcement learning | |
US11470372B2 (en) | Adaptive bitrate adjustment method for multi-user interactive live broadcast | |
CN112437321B (en) | Adaptive code rate calculation method based on live broadcast streaming media | |
CN112714315B (en) | Layered buffering method and system based on panoramic video | |
CN113259657A (en) | DPPO code rate self-adaptive control system and method based on video quality fraction | |
Gao et al. | Content-aware personalised rate adaptation for adaptive streaming via deep video analysis | |
TWI789581B (en) | Reinforcement learning method for video encoder | |
Xiao et al. | Traffic-aware rate adaptation for improving time-varying QoE factors in mobile video streaming | |
CN117749775A (en) | Real-time communication system and method suitable for non-stationary network environment | |
CN116320620A (en) | Stream media bit rate self-adaptive adjusting method based on personalized federal reinforcement learning | |
CN113872873A (en) | Multi-scene cross-layer congestion control method suitable for 5G new application | |
CN114040257A (en) | Self-adaptive video stream transmission playing method, device, equipment and storage medium | |
CN114363677A (en) | Mobile network video code rate real-time adjustment method and device based on deep learning | |
Ghosh et al. | MO-QoE: Video QoE using multi-feature fusion based optimized learning models | |
CN114071240A (en) | Mobile video QoE (quality of experience) evaluation method based on self-adaptive degree | |
CN117412076A (en) | Panoramic video transmission method based on DASH | |
Martín et al. | Q-learning based control algorithm for HTTP adaptive streaming | |
CN104702974B (en) | Bit rate control method and method for video coding based on fuzzy logic | |
WO2023103200A1 (en) | Video code rate control method and apparatus, and computer-readable storage medium | |
Liu et al. | Throughput Prediction-Enhanced RL for Low-Delay Video Application | |
CN116347170A (en) | Adaptive bit rate control method based on sequential causal modeling | |
CN113411628A (en) | Code rate self-adaption method and device of live video, electronic equipment and readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |