CN114827683B - Video self-adaptive code rate control system and method based on reinforcement learning - Google Patents
Info
- Publication number
- CN114827683B (application CN202210405701.8A)
- Authority
- CN
- China
- Prior art keywords
- code rate
- throughput
- reinforcement learning
- network
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/25—Management operations performed by the server for facilitating the content distribution or administrating data related to end-users or client devices, e.g. end-user or client device authentication, learning user preferences for recommending movies
- H04N21/266—Channel or content management, e.g. generation and management of keys and entitlement messages in a conditional access system, merging a VOD unicast channel into a multicast channel
- H04N21/2662—Controlling the complexity of the video stream, e.g. by scaling the resolution or bitrate of the video stream based on the client capabilities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Databases & Information Systems (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The invention discloses a video adaptive code rate control system and method based on reinforcement learning, comprising throughput prediction, reinforcement learning decision and smoothing control, in which throughput prediction and buffer information are considered together when making the ABR decision. In a low-delay scenario, throughput is first predicted by a neural network and the decision is then made by reinforcement learning, realizing smoothly controlled, optimized adaptive code rate adjustment in low-delay video streams. Compared with the prior art, the method does not depend on a QoE model in the adaptive code rate decision process, but learns from real-time conditions and continuously adjusts its strategy; it relieves frequent code rate switching when bandwidth fluctuation occurs and improves viewing quality, thereby improving the user's viewing experience.
Description
Technical Field
The invention relates to the field of streaming media transmission over the Internet, and in particular to a video adaptive code rate control system and method based on throughput prediction and reinforcement learning.
Background
In current global Internet traffic statistics, video traffic accounts for over 50% of the total and continues to grow year by year. Video-on-demand has become indispensable, and demand for live video and video conferencing of all kinds keeps increasing. To improve the quality of the video user's experience (quality of experience, QoE), adaptive bitrate (ABR) technology can effectively improve user QoE by automatically adjusting the resolution. In ABR technology the video stream is transmitted in blocks, and the code rate of the video block for the next moment is selected for transmission according to the decision result, thereby realizing the adaptive effect.
Current ABR implementations fall into three classes of algorithms: throughput-prediction-based algorithms, buffer-information-based algorithms, and hybrid algorithms. Throughput-prediction-based algorithms often fall into stalling when the bandwidth environment is poor, and buffer-information-based algorithms may remain conservative for long periods, so neither class alone remains suitable for ABR. Hybrid algorithms use both throughput prediction and buffer information to make the adaptive code rate decision.
With the development of artificial intelligence, reinforcement learning has been applied to ABR technology. How to combine throughput prediction and buffer information in a hybrid algorithm is a key technical issue. Existing solutions address it by placing both kinds of information, throughput prediction and buffer information, into the state space of a reinforcement learning agent so that the two can be fused effectively. However, given changing bandwidth environments and the demand for low-delay video streaming, plain reinforcement learning cannot fully solve the ABR problem: bandwidth fluctuation causes continuous switching of the code rate level, giving the user a poor viewing experience; the smaller buffer in low-delay scenarios weakens the contribution of buffer-information-based decisions; and simply placing throughput information into the reinforcement learning state space does not achieve in low-delay video streams the effect it has in non-low-delay ones.
Disclosure of Invention
The invention provides a video adaptive code rate decision system and method based on reinforcement learning that consider throughput prediction and buffer information together when making the ABR decision. In a low-delay scenario, throughput is first predicted by a neural network and the decision is then made by reinforcement learning, realizing smoothly controlled, optimized adaptive code rate adjustment in low-delay video streams.
The video adaptive code rate control system based on reinforcement learning adjusts the video code rate of a low-delay video stream in real time according to the network environment, and comprises a throughput prediction module, a reinforcement learning decision module and a smoothing control module; wherein:
the throughput prediction module is used for calculating the historical throughput, inputting the throughput sequence into an LSTM neural network for prediction to obtain a throughput prediction result for the future moment, and using this result as a parameter of the state space in the reinforcement learning neural network;
the reinforcement learning decision module is used for constructing a reinforcement learning neural network (Dueling DQN), taking the throughput prediction result, buffer information and other effective information as the state space, and computing a preliminary code rate level through the agent network, while the training part of the module continuously updates the agent according to historical information;
the smoothing control module is used for obtaining the preliminary code rate decision result from the reinforcement learning module together with the code rate information of the previous moment, defining the range of allowed bandwidth fluctuation according to the historical throughput, and deciding whether to use smoothing control according to whether the historical throughput lies within the allowed fluctuation range; when smoothing control is used, the same code rate level as at the previous moment is adopted as the result, except that when the previous moment used the highest code rate, the code rate level of the previous moment reduced by one level is adopted as the result.
The video adaptive code rate control method based on reinforcement learning disclosed by the invention comprises the following steps:
step 1: performing throughput prediction, namely predicting the future throughput from the acquired historical throughput by using the LSTM neural network, and taking the future throughput as a parameter of a state space in the reinforcement learning neural network; the method specifically comprises the following steps:
step 1.1: acquiring historical data, including the historical transmitted video block size Chunk_size and download time Chunk_during;
Step 1.2: the historical throughput is calculated as follows:

historical throughput = Chunk_size / Chunk_during

Then the throughput values of the most recent 30 transmitted video blocks are collected into the historical throughput sequence X, and X is normalized by the following min-max normalization formula:

X_norm = (X - X_min) / (X_max - X_min)

where X_norm is the normalized result, X is the original value, X_max is the maximum value in the sequence, and X_min is the minimum value in the sequence;
step 1.3: constructing an LSTM neural network as the prediction model; the LSTM structure comprises a forget gate, an input gate and an output gate, and, being a recurrent structure, the prediction result of the LSTM at one moment is fed back into the network at the next moment when the data of the next moment are predicted;
step 1.4: taking the result X_norm obtained by normalizing the historical throughput sequence as the input of the LSTM neural network;
step 1.5: outputting the predicted value, i.e., the normalized throughput prediction, which serves as the throughput prediction part of the state space input to the network of step 2.1;
step 2: constructing the reinforcement learning network; its structure comprises an agent, a state space S through which the agent interacts with the environment, a selection action a issued by the agent to the environment, stored transitions (s', a, r, s), an experience pool and a training network; the ABR algorithm of the agent is applied to bandwidth-fluctuating and low-delay video streams to realize the decision function, and the ABR strategy is continuously learned and updated through the training network; this specifically comprises the following steps:
step 2.1: the agent and the training network in the reinforcement learning network share the same network structure; the Reward (return) they use takes the form

Reward = Σ_{i=1..chunk_num} (Quality[i] - λ·Buffer[i] - μ·|Smooth[i]|)

where chunk_num is the number of transmitted video blocks, Quality[i] is the transmission code rate of the i-th video block, Buffer[i] is the rebuffering (stall) duration during transmission of the i-th video block, λ and μ weight the rebuffering and smoothness penalties, and Smooth[i] is the difference between the code rates of the (i+1)-th and the i-th video blocks:

Smooth[i] = Quality[i+1] - Quality[i]
step 2.2: performing action matching, i.e., mapping the output actions of the reinforcement learning network into the ABR algorithm; four code rate levels 360p, 480p, 720p and 1080p are used, corresponding to level = [1, 2, 3, 4] and, respectively, to the selectable action space a = [1, 2, 3, 4];
step 2.3: the state space S is defined as follows:

S = [X, l_t, r_t, c_t, p_t, n_t]

where X is the historical throughput prediction sequence obtained from step 1.2, l_t is the download time of the last video block, r_t is the rebuffering duration at the previous moment, c_t is the code rate selected at the previous moment, p_t is the current buffer size, and n_t is the size of the video block to be transmitted at the next moment;
step 2.4: agent decision, i.e., the corresponding information is obtained from the state space S of step 2.3 and the decision is made in the current agent neural network; during the decision, the agent performs Reward matching against the state space information, the action with the highest score being the optimal action, which is obtained and output as the current optimal action;
step 2.5: training the network; while continuously making decisions, the agent generates a series of combined information, here the stored transitions T = [s', a, r, s], which are put into the experience pool; the training network randomly samples individual transitions from the experience pool, trains on them at fixed intervals, and updates the agent network;

where s' is the state space information of the previous moment, a is the optimal action selected this time, r is the Reward value of the previous action selection, and s is the state space information of the present moment;
step 3: performing smooth control of bandwidth fluctuation; the method specifically comprises the following steps:
step 3.1: acquiring the optimal action a and the other information computed in step 2.4, where the optimal action a corresponds to the video code rate level and the other information includes the code rate selected at the previous moment and the historical throughput sequence;
step 3.2: defining the bandwidth fluctuation range: first, the central axis value C_t of the fluctuation limits at time t is determined by the formula

C_t = α·C_{t-1} + β·Bandwidth_{t-1}

where α and β are the weights of the central axis value at the previous moment and of the throughput at the previous moment respectively, C_{t-1} is the central axis value at time t-1, and Bandwidth_{t-1} is the throughput at time t-1; then the upper and lower limits of the bandwidth fluctuation range, i.e., the bandwidth fluctuation bounds, are defined as an upper bound C_t^up above and a lower bound C_t^low below the central axis value;
Step 3.3: the current bandwidth fluctuation state is obtained, namely: counting throughput number flu in bandwidth fluctuation range r And the number of throughput flu out of range f If flu is r -flu f > 10, it indicates that the bandwidth fluctuation state is present at this time, and flu_flag=1 is set; otherwise, indicating that bandwidth fluctuation does not occur at the moment, and setting flu_flag=0;
step 3.4: selecting a smoothing control strategy; the method comprises the following specific steps:
step 3.4.1: acquiring the current bandwidth fluctuation state in the step 3.3;
step 3.4.2: if in the bandwidth fluctuation state (flu_flag = 1), go to step 3.4.3; otherwise set bitrate = a, where a is the optimal action obtained in step 2.4, and go to step 3.4.4;
step 3.4.3: according to the diminishing marginal effect of the Mean Opinion Score (MOS), selecting a conservative strategy when the bandwidth fluctuates, i.e., if the code rate level selected at the previous moment is 4 and the bandwidth is currently in the fluctuation state, the code rate level selected at the current moment is set to 3 (bitrate = 3); otherwise the code rate level selected at the current moment is set to the code rate level selected at the previous moment (bitrate = last_bitrate);
step 3.4.4: outputting the final code rate level decision.
Compared with the prior art, the invention has the following beneficial technical effects:
1) In the adaptive code rate decision process, the method does not depend on a QoE model; it learns from real-time conditions and continuously adjusts its strategy;
2) The method relieves frequent code rate switching when bandwidth fluctuation occurs and improves viewing quality, thereby improving the user's viewing experience;
3) The method effectively remedies the defect of previous reinforcement-learning ABR algorithms under the condition of a reduced buffer, and performs throughput prediction more effectively while still using a reinforcement-learning ABR algorithm;
4) The fluctuation range is defined, and a fluctuation conservative strategy based on MOS characteristics is used.
Drawings
FIG. 1 is a block diagram of a reinforcement learning-based video adaptive rate control system of the present invention;
FIG. 2 is a flow chart of a video adaptive rate control method based on reinforcement learning according to the present invention;
FIG. 3 is a schematic diagram of a reinforcement learning network structure;
FIG. 4 is a flow chart of bandwidth fluctuation determination and smoothing control;
fig. 5 is a flowchart of the selection smoothing control.
Detailed Description
The technical scheme of the invention is further described in detail below with reference to the attached drawings and specific embodiments.
Fig. 1 is a block diagram of a video adaptive code rate decision system based on reinforcement learning according to the present invention. The system comprises a throughput prediction module, a reinforcement learning decision module and a smoothing control module. Wherein:
the throughput prediction module 100 is configured to calculate throughput of a past period of time, input a past throughput sequence into a prediction neural network (LSTM), obtain a throughput prediction result at a future time through calculation of the prediction neural network, and use the result as a parameter of a state space in the reinforcement learning neural network.
The reinforcement learning decision module 200 is the main module that makes the preliminary code rate decision and realizes the function of learning and updating the decision strategy. The reinforcement learning neural network uses a Dueling DQN network, takes the throughput prediction result, buffer information and other effective information as the state space, and computes the preliminary code rate level through the agent network, while the training part of the module continuously updates the agent according to historical information.
The smoothing control module 300 is configured to obtain the preliminary code rate decision result from the reinforcement learning module together with the code rate information of the previous moment, define the range of allowed fluctuation according to the throughput of the past period, and decide whether to use smoothing control according to whether the historical throughput lies within the allowed fluctuation range; when smoothing control is used, the same code rate level as at the previous moment is adopted as the result, except that when the previous moment used the highest code rate (which also accounts for fluctuation in a good network environment), the code rate level of the previous moment reduced by one level is adopted as the result.
As shown in fig. 2, the adaptive code rate transmission method based on reinforcement learning uses the LSTM prediction neural network to predict throughput from the past throughput sequence and the Dueling DQN reinforcement learning neural network to train and make decisions, realizing optimized adaptive code rate transmission in low-delay video streams; the specific steps are as follows:
step 1: performing throughput prediction, namely predicting the future throughput from the acquired historical throughput by using the LSTM neural network, and taking the future throughput as a parameter of a state space in the reinforcement learning neural network; the method specifically comprises the following steps:
step 1.1: acquiring historical data, including the historical transmitted video block size Chunk_size and download time Chunk_during;
Step 1.2: the historical throughput is calculated as follows:

historical throughput = Chunk_size / Chunk_during

Then the throughput values of the most recent 30 transmitted video blocks are collected into the historical throughput sequence X, and X is normalized by the following min-max normalization formula:

X_norm = (X - X_min) / (X_max - X_min)

where X_norm is the normalized result, X is the original value, X_max is the maximum value in the sequence, and X_min is the minimum value in the sequence;
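By way of illustration only, a minimal Python sketch of steps 1.1-1.2 is given below; the function name, the 30-block window default, and the unit conventions for block size and download time are assumptions of the example rather than features fixed by the method:

```python
import numpy as np

def normalized_throughput_window(chunk_sizes, chunk_durings, window=30):
    """Sketch of steps 1.1-1.2: per-block throughput = Chunk_size / Chunk_during,
    then min-max normalization over the most recent `window` values."""
    throughput = np.asarray(chunk_sizes, dtype=float) / np.asarray(chunk_durings, dtype=float)
    X = throughput[-window:]                  # historical throughput sequence X
    x_min, x_max = X.min(), X.max()
    if x_max == x_min:                        # constant sequence: avoid division by zero
        return np.zeros_like(X)
    return (X - x_min) / (x_max - x_min)      # X_norm = (X - X_min) / (X_max - X_min)
```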
step 1.3: constructing an LSTM neural network as the prediction model; the LSTM structure comprises a forget gate, an input gate and an output gate, and, being a recurrent structure, the prediction result of the LSTM at one moment is fed back into the network at the next moment when the data of the next moment are predicted;
step 1.4: taking the result X_norm obtained by normalizing the historical throughput sequence as the input of the LSTM neural network;
step 1.5: outputting the predicted value, i.e., the normalized throughput prediction, which serves as the throughput prediction part of the state space input to the network of step 2.1;
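A minimal sketch of the LSTM prediction model of steps 1.3-1.5, assuming PyTorch; the hidden size and single-layer topology are assumptions, since the patent does not fix the network dimensions:

```python
import torch
import torch.nn as nn

class ThroughputLSTM(nn.Module):
    """LSTM predictor: the forget, input and output gates are internal to nn.LSTM;
    the model maps a normalized throughput window to the next normalized value."""
    def __init__(self, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                 # x: (batch, 30, 1) normalized throughputs
        out, _ = self.lstm(x)             # out: (batch, 30, hidden_size)
        return self.head(out[:, -1, :])   # predicted value for the next moment
```

A window produced by the normalization sketch above would be fed in as torch.tensor(x_norm, dtype=torch.float32).view(1, -1, 1).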
step 2: establishing a reinforcement learning network; the method specifically comprises the following steps:
FIG. 3 is a schematic diagram of the reinforcement learning network structure. The reinforcement learning network is the main body of the invention; its structure comprises an agent, a state space S through which the agent interacts with the environment, a selection action a issued by the agent to the environment, stored transitions (s', a, r, s), an experience pool and a training network. The ABR algorithm of the agent is applied to bandwidth-fluctuating and low-delay video streams to realize the decision function, and the ABR strategy is continuously learned and updated through the training network. The reinforcement learning network preferably uses a Dueling DQN network; when constructing it, the state value and the action value are first split into two networks and then combined to obtain the final Dueling DQN network.
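The splitting and recombination of the state value and the action value described above can be sketched as follows (layer widths are assumptions; the patent does not specify them):

```python
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Dueling architecture sketch: a shared trunk feeds a state-value stream V(s)
    and an advantage stream A(s, a), recombined as Q = V + A - mean(A)."""
    def __init__(self, state_dim, n_actions=4, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.value = nn.Linear(hidden, 1)              # state-value stream V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # advantage stream A(s, a)

    def forward(self, s):
        h = self.trunk(s)
        v, a = self.value(h), self.advantage(h)
        return v + a - a.mean(dim=1, keepdim=True)     # Q(s, a) over the 4 code rate levels
```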
Step 2.1: the agent and the training network in the reinforcement learning network share the same network structure; the Reward (return) they use takes the form

Reward = Σ_{i=1..chunk_num} (Quality[i] - λ·Buffer[i] - μ·|Smooth[i]|)

where chunk_num is the number of transmitted video blocks, Quality[i] is the transmission code rate of the i-th video block, Buffer[i] is the rebuffering (stall) duration during transmission of the i-th video block, λ and μ weight the rebuffering and smoothness penalties, and Smooth[i] is the difference between the code rates of the (i+1)-th and the i-th video blocks:

Smooth[i] = Quality[i+1] - Quality[i]
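The Reward above can be evaluated as in the following sketch; the weights lam and mu are illustrative assumptions, since the text fixes the three terms but not their coefficients:

```python
def episode_reward(quality, rebuffer, lam=1.0, mu=1.0):
    """Sketch: sum over chunks of Quality[i] - lam*Buffer[i] - mu*|Smooth[i]|,
    with Smooth[i] = Quality[i+1] - Quality[i]; lam and mu are assumed weights."""
    total = 0.0
    for i in range(len(quality)):
        smooth = quality[i + 1] - quality[i] if i + 1 < len(quality) else 0.0
        total += quality[i] - lam * rebuffer[i] - mu * abs(smooth)
    return total
```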
step 2.2: performing action matching, i.e., mapping the output actions of the reinforcement learning network into the ABR algorithm; four code rate levels 360p, 480p, 720p and 1080p are used, corresponding to level = [1, 2, 3, 4] and, respectively, to the selectable action space a = [1, 2, 3, 4];
step 2.3: the state space S is defined as follows:

S = [X, l_t, r_t, c_t, p_t, n_t]

where X is the historical throughput prediction sequence obtained from step 1.2, l_t is the download time of the last video block, r_t is the rebuffering duration at the previous moment, c_t is the code rate selected at the previous moment, p_t is the current buffer size, and n_t is the size of the video block to be transmitted at the next moment;
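Steps 2.2-2.3 amount to a fixed action-to-level mapping and a flat state vector; a sketch follows, in which the concatenation order within the vector is an assumption of the example:

```python
import numpy as np

LEVELS = {1: "360p", 2: "480p", 3: "720p", 4: "1080p"}   # action space a = [1, 2, 3, 4]

def build_state(x, l_t, r_t, c_t, p_t, n_t):
    """Flatten S = [X, l_t, r_t, c_t, p_t, n_t] into one vector for the agent network."""
    tail = np.asarray([l_t, r_t, c_t, p_t, n_t], dtype=np.float32)
    return np.concatenate([np.asarray(x, dtype=np.float32), tail])
```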
step 2.4: agent decision, i.e., the corresponding information is obtained from the state space S of step 2.3 and the decision is made in the current agent neural network; during the decision, the agent performs Reward matching against the state space information, the action with the highest score being the optimal action, which is obtained and output as the current optimal action a;
step 2.5: the network is trained. While continuously making decisions, the agent generates a series of combined information, here the stored transitions T = [s', a, r, s], which are put into the experience pool; the training network randomly samples individual transitions from the experience pool, trains on them at fixed intervals, and updates the agent network;

where s' is the state space information of the previous moment, a is the action selected this time, r is the Reward value of the previous action selection, and s is the state space information of the present moment;
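Step 2.5 corresponds to standard experience replay; the sketch below assumes a single Dueling DQN network (no separate target network), a Huber loss, and typical hyperparameters, none of which are fixed by the patent:

```python
import random
from collections import deque
import numpy as np
import torch
import torch.nn.functional as F

replay = deque(maxlen=10_000)       # experience pool of transitions T = (s', a, r, s)

def train_step(agent, optimizer, batch_size=32, gamma=0.99):
    """One update: sample random transitions and regress Q(s', a)
    toward r + gamma * max_a' Q(s, a')."""
    if len(replay) < batch_size:
        return
    s_prev, a, r, s = zip(*random.sample(replay, batch_size))
    s_prev = torch.as_tensor(np.stack(s_prev))
    s = torch.as_tensor(np.stack(s))
    a = torch.as_tensor(a, dtype=torch.int64) - 1          # actions are 1-based levels
    r = torch.as_tensor(r, dtype=torch.float32)
    q = agent(s_prev).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * agent(s).max(dim=1).values
    loss = F.smooth_l1_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```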
step 3: performing smooth control of bandwidth fluctuation;
Severe bandwidth fluctuation causes the ABR decision algorithm to adjust the video code rate frequently, giving the user a poor viewing experience; therefore, the code rate switching frequency must be controlled as the bandwidth fluctuates.
as shown in fig. 4, a bandwidth fluctuation determination and smoothing control flow chart is shown.
Step 3.1: acquiring the action a and the other information computed in step 2.4, where the action a corresponds to the video code rate level and the other information includes the code rate selected at the previous moment and the historical throughput sequence;
step 3.2: defining the bandwidth fluctuation range: first, the central axis value C_t of the fluctuation limits at time t is determined by the formula

C_t = α·C_{t-1} + β·Bandwidth_{t-1}

where α and β are the weights of the central axis value at the previous moment and of the throughput at the previous moment (α = 0.2, β = 0.8), C_{t-1} is the central axis value at time t-1, and Bandwidth_{t-1} is the throughput at time t-1; then the upper and lower limits of the bandwidth fluctuation range, i.e., the bandwidth fluctuation bounds, are defined as an upper bound C_t^up above and a lower bound C_t^low below the central axis value;
Step 3.3: obtaining the current bandwidth fluctuation state, i.e., counting the number of throughput samples flu_r inside the bandwidth fluctuation range and the number flu_f outside the range; if flu_r - flu_f > 10, the bandwidth is in the fluctuation state at this moment and flu_flag = 1 is set; otherwise bandwidth fluctuation is not occurring and flu_flag = 0 is set;
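Steps 3.2-3.3 can be sketched together as below; the symmetric multiplicative margin defining the upper and lower bounds is an assumption of the example:

```python
def bandwidth_flu_flag(throughputs, alpha=0.2, beta=0.8, margin=0.15):
    """Sketch of steps 3.2-3.3: track C_t = alpha*C_{t-1} + beta*Bandwidth_{t-1},
    count samples inside (flu_r) and outside (flu_f) the fluctuation band,
    and set flu_flag = 1 when flu_r - flu_f > 10."""
    c = throughputs[0]                         # initialize the central axis value
    flu_r = flu_f = 0
    for t in range(1, len(throughputs)):
        c = alpha * c + beta * throughputs[t - 1]
        lower, upper = c * (1 - margin), c * (1 + margin)   # assumed bound shape
        if lower <= throughputs[t] <= upper:
            flu_r += 1
        else:
            flu_f += 1
    return 1 if flu_r - flu_f > 10 else 0
```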
step 3.4: selecting the smoothing control strategy; the flow is shown in fig. 5. The specific steps are as follows:
Step 3.4.1: acquiring the current bandwidth fluctuation state in the step 3.3;
step 3.4.2: determining that if in the bandwidth fluctuation state (flu_flag=1), turning to step 3.4.3, otherwise, the bit=aa is the optimal action obtained in step 2.4, turning to step 3.4.4;
step 3.4.3: according to the diminishing marginal effect of the Mean Opinion Score (MOS), a conservative strategy is selected when the bandwidth fluctuates, that is, if the code rate level selected at the previous moment is 4 (i.e., a code rate of 1080p) and the bandwidth is currently in the fluctuation state, the code rate level selected at the current moment is set to 3 (bitrate = 3); otherwise the code rate level selected at the current moment is set to the code rate level selected at the previous moment (bitrate = last_bitrate).
Step 3.4.4: outputting the final code rate level decision.
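The selection logic of steps 3.4.1-3.4.4 reduces to a few lines; a sketch:

```python
HIGHEST_LEVEL = 4      # level 4 corresponds to 1080p

def select_bitrate(a, last_bitrate, flu_flag):
    """Hold the previous level during bandwidth fluctuation, stepping down one
    level when the previous level was the highest; otherwise pass through a."""
    if flu_flag == 1:
        return 3 if last_bitrate == HIGHEST_LEVEL else last_bitrate
    return a           # no fluctuation: adopt the agent's optimal action directly
```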
The foregoing description of the preferred embodiments of the invention is not intended to limit the invention to the particular embodiments disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the invention.
Claims (5)
1. A video adaptive code rate control method based on reinforcement learning, characterized by comprising the following steps:
step 1: performing throughput prediction, namely predicting the future throughput from the acquired historical throughput by using the LSTM neural network, and taking the future throughput as a parameter of a state space in the reinforcement learning neural network; the method specifically comprises the following steps:
step 1.1: acquiring historical data, including the historical transmitted video block size Chunk_size and download time Chunk_during;
Step 1.2: the historical throughput is calculated as follows:

historical throughput = Chunk_size / Chunk_during

Then the throughput values of the most recent 30 transmitted video blocks are collected into the historical throughput sequence X, and X is normalized by the following min-max normalization formula:

X_norm = (X - X_min) / (X_max - X_min)

where X_norm is the normalized result, X is the original value, X_max is the maximum value in the sequence, and X_min is the minimum value in the sequence;
step 1.3: constructing an LSTM neural network as the prediction model; the LSTM structure comprises a forget gate, an input gate and an output gate, and, being a recurrent structure, the prediction result of the LSTM at one moment is fed back into the network at the next moment when the data of the next moment are predicted;
step 1.4: taking the result X_norm obtained by normalizing the historical throughput sequence as the input of the LSTM neural network;
step 1.5: outputting the predicted value, i.e., the normalized throughput prediction, which serves as the throughput prediction part of the state space input to the network of step 2.1;
step 2: constructing the reinforcement learning network; its structure comprises an agent, a state space S through which the agent interacts with the environment, a selection action a issued by the agent to the environment, stored transitions (s', a, r, s), an experience pool and a training network; the ABR algorithm of the agent is applied to bandwidth-fluctuating and low-delay video streams to realize the decision function, and the ABR strategy is continuously learned and updated through the training network; this specifically comprises the following steps:
step 2.1: the agent and the training network in the reinforcement learning network share the same network structure; the Reward (return) they use takes the form

Reward = Σ_{i=1..chunk_num} (Quality[i] - λ·Buffer[i] - μ·|Smooth[i]|)

where chunk_num is the number of transmitted video blocks, Quality[i] is the transmission code rate of the i-th video block, Buffer[i] is the rebuffering (stall) duration during transmission of the i-th video block, λ and μ weight the rebuffering and smoothness penalties, and Smooth[i] is the difference between the code rates of the (i+1)-th and the i-th video blocks:

Smooth[i] = Quality[i+1] - Quality[i]
step 2.2: performing action matching, i.e., mapping the output actions of the reinforcement learning network into the ABR algorithm; four code rate levels 360p, 480p, 720p and 1080p are used, corresponding to level = [1, 2, 3, 4] and, respectively, to the selectable action space a = [1, 2, 3, 4];
step 2.3: the state space S is defined as follows:

S = [X, l_t, r_t, c_t, p_t, n_t]

where X is the historical throughput prediction sequence obtained from step 1.2, l_t is the download time of the last video block, r_t is the rebuffering duration at the previous moment, c_t is the code rate selected at the previous moment, p_t is the current buffer size, and n_t is the size of the video block to be transmitted at the next moment;
step 2.4: agent decision, i.e., the corresponding information is obtained from the state space S of step 2.3 and the decision is made in the current agent neural network; during the decision, the agent performs Reward matching against the state space information, the action with the highest score being the optimal action, which is obtained and output as the current optimal action a;
step 2.5: training the network; while continuously making decisions, the agent generates a series of combined information, here the stored transitions T = [s', a, r, s], which are put into the experience pool; the training network randomly samples individual transitions from the experience pool, trains on them at fixed intervals, and updates the agent network; where s' is the state space information of the previous moment, a is the optimal action selected this time, r is the Reward value of the previous action selection, and s is the state space information of the present moment;
step 3: performing smooth control of bandwidth fluctuation; the method specifically comprises the following steps:
step 3.1: acquiring the optimal action a and the other information computed in step 2.4, where the optimal action a corresponds to the video code rate level and the other information includes the code rate selected at the previous moment and the historical throughput sequence;
step 3.2: defining the bandwidth fluctuation range: first, the central axis value C_t of the fluctuation limits at time t is determined by the formula

C_t = α·C_{t-1} + β·Bandwidth_{t-1}

where α and β represent the weights of the central axis value at the previous moment and of the throughput at the previous moment respectively, C_{t-1} is the central axis value at time t-1, and Bandwidth_{t-1} is the throughput at time t-1; then the upper and lower limits of the bandwidth fluctuation range, i.e., the bandwidth fluctuation bounds, are defined as an upper bound C_t^up above and a lower bound C_t^low below the central axis value;
Step 3.3: acquiring the current bandwidth fluctuation state, i.e., counting the number of throughput samples flu_r inside the bandwidth fluctuation range and the number flu_f outside the range; if flu_r - flu_f > 10, the bandwidth is in the fluctuation state at this moment and flu_flag = 1 is set; otherwise bandwidth fluctuation is not occurring and flu_flag = 0 is set;
step 3.4: selecting a smoothing control strategy; the method comprises the following specific steps:
step 3.4.1: acquiring the current bandwidth fluctuation state in the step 3.3;
step 3.4.2: if in the bandwidth fluctuation state (flu_flag = 1), go to step 3.4.3; otherwise set bitrate = a, where a is the optimal action obtained in step 2.4, and go to step 3.4.4;
step 3.4.3: a conservative strategy is selected when the bandwidth fluctuates, i.e., if the code rate level selected at the previous moment is 4 and the bandwidth is currently in the fluctuation state, the code rate level selected at the current moment is set to 3; otherwise the code rate level selected at the current moment is set to the code rate level selected at the previous moment;
step 3.4.4: outputting the final code rate level decision.
2. A video adaptive code rate control system based on reinforcement learning, for performing the video adaptive code rate control method based on reinforcement learning according to claim 1 and adjusting the video code rate of a low-delay video stream in real time according to the network environment, characterized in that the system comprises a throughput prediction module, a reinforcement learning decision module and a smoothing control module; wherein:
the throughput prediction module is used for calculating the historical throughput, inputting the throughput sequence into an LSTM neural network for prediction to obtain a throughput prediction result for the future moment, and using this result as a parameter of the state space in the reinforcement learning neural network;
the reinforcement learning decision module is used for constructing a reinforcement learning neural network (Dueling DQN), taking the throughput prediction result, buffer information and other effective information as the state space, and computing a preliminary code rate level through the agent network, while the training part of the module continuously updates the agent according to historical information;
the smoothing control module is used for acquiring the preliminary code rate decision result from the reinforcement learning module together with the code rate information of the previous moment, defining the range of allowed bandwidth fluctuation according to the historical throughput, and deciding whether to use smoothing control according to whether the historical throughput lies within the allowed fluctuation range; when smoothing control is used, the same code rate level as at the previous moment is adopted as the result, except that when the previous moment used the highest code rate, the code rate level of the previous moment reduced by one level is adopted as the result.
3. The video adaptive code rate control system according to claim 2, wherein the LSTM neural network structure comprises a forget gate, an input gate and an output gate, and, as a recurrent structure, the prediction result of the LSTM at one moment is fed back into the network at the next moment when the data of the next moment are predicted.
4. The video adaptive code rate control system based on reinforcement learning according to claim 2, wherein the reinforcement learning network structure comprises an agent, a state space S through which the agent interacts with the environment, a selection action a issued by the agent to the environment, stored transitions (s', a, r, s), an experience pool, and a training network; the ABR algorithm of the agent is applied to bandwidth-fluctuating and low-delay video streams to realize the decision function, and the ABR strategy is continuously learned and updated through the training network.
5. The video adaptive code rate control system based on reinforcement learning according to claim 4, wherein, when constructing the reinforcement learning network, the state value and the action value are first split into two networks and then combined to obtain the final Dueling DQN network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210405701.8A CN114827683B (en) | 2022-04-18 | 2022-04-18 | Video self-adaptive code rate control system and method based on reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210405701.8A CN114827683B (en) | 2022-04-18 | 2022-04-18 | Video self-adaptive code rate control system and method based on reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114827683A CN114827683A (en) | 2022-07-29 |
CN114827683B true CN114827683B (en) | 2023-11-07 |
Family
ID=82537363
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210405701.8A Active CN114827683B (en) | 2022-04-18 | 2022-04-18 | Video self-adaptive code rate control system and method based on reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114827683B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116916113B (en) * | 2023-09-06 | 2023-12-22 | 联通(江苏)产业互联网有限公司 | Data stream smoothing method based on 5G video customer service |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109218744A (en) * | 2018-10-17 | 2019-01-15 | 华中科技大学 | A kind of adaptive UAV Video of bit rate based on DRL spreads transmission method |
CN112508172A (en) * | 2020-11-23 | 2021-03-16 | 北京邮电大学 | Space flight measurement and control adaptive modulation method based on Q learning and SRNN model |
CN113596785A (en) * | 2021-07-26 | 2021-11-02 | 吉林大学 | D2D-NOMA communication system resource allocation method based on deep Q network |
WO2022000298A1 (en) * | 2020-06-30 | 2022-01-06 | Microsoft Technology Licensing, Llc | Reinforcement learning based rate control |
EP3968648A1 (en) * | 2020-01-16 | 2022-03-16 | Beijing Dajia Internet Information Technology Co., Ltd. | Bitrate decision model training method and electronic device |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109218744A (en) * | 2018-10-17 | 2019-01-15 | 华中科技大学 | A kind of adaptive UAV Video of bit rate based on DRL spreads transmission method |
EP3968648A1 (en) * | 2020-01-16 | 2022-03-16 | Beijing Dajia Internet Information Technology Co., Ltd. | Bitrate decision model training method and electronic device |
WO2022000298A1 (en) * | 2020-06-30 | 2022-01-06 | Microsoft Technology Licensing, Llc | Reinforcement learning based rate control |
CN112508172A (en) * | 2020-11-23 | 2021-03-16 | 北京邮电大学 | Space flight measurement and control adaptive modulation method based on Q learning and SRNN model |
CN113596785A (en) * | 2021-07-26 | 2021-11-02 | 吉林大学 | D2D-NOMA communication system resource allocation method based on deep Q network |
Non-Patent Citations (1)
Title |
---|
Adaptive bitrate hybrid control algorithm based on HTTP streaming; Chen Liwei; Li Guoping; Teng Guowei; Zhao Haiwu; Wang Guozhong; Journal of Shanghai University (Natural Science Edition); Vol. 20, No. 3; full text *
Also Published As
Publication number | Publication date |
---|---|
CN114827683A (en) | 2022-07-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110248247B (en) | Embedded dynamic video playing control method and device based on network throughput | |
CN112291620A (en) | Video playing method and device, electronic equipment and storage medium | |
CN111669617B (en) | Live video stream transmission method based on intelligent edge | |
CN114827683B (en) | Video self-adaptive code rate control system and method based on reinforcement learning | |
US11470372B2 (en) | Adaptive bitrate adjustment method for multi-user interactive live broadcast | |
CN112437321B (en) | Adaptive code rate calculation method based on live broadcast streaming media | |
CN112714315B (en) | Layered buffering method and system based on panoramic video | |
CN113259657A (en) | DPPO code rate self-adaptive control system and method based on video quality fraction | |
Gao et al. | Content-aware personalised rate adaptation for adaptive streaming via deep video analysis | |
TWI789581B (en) | Reinforcement learning method for video encoder | |
Xiao et al. | Traffic-aware rate adaptation for improving time-varying QoE factors in mobile video streaming | |
CN117749775A (en) | Real-time communication system and method suitable for non-stationary network environment | |
CN116320620A (en) | Stream media bit rate self-adaptive adjusting method based on personalized federal reinforcement learning | |
CN113872873A (en) | Multi-scene cross-layer congestion control method suitable for 5G new application | |
CN114040257A (en) | Self-adaptive video stream transmission playing method, device, equipment and storage medium | |
CN114363677A (en) | Mobile network video code rate real-time adjustment method and device based on deep learning | |
Ghosh et al. | MO-QoE: Video QoE using multi-feature fusion based optimized learning models | |
CN114071240A (en) | Mobile video QoE (quality of experience) evaluation method based on self-adaptive degree | |
CN117412076A (en) | Panoramic video transmission method based on DASH | |
Martín et al. | Q-learning based control algorithm for HTTP adaptive streaming | |
CN104702974B (en) | Bit rate control method and method for video coding based on fuzzy logic | |
WO2023103200A1 (en) | Video code rate control method and apparatus, and computer-readable storage medium | |
Liu et al. | Throughput Prediction-Enhanced RL for Low-Delay Video Application | |
CN116347170A (en) | Adaptive bit rate control method based on sequential causal modeling | |
CN113411628A (en) | Code rate self-adaption method and device of live video, electronic equipment and readable medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |