CN111031387B - Method for controlling video coding flow rate of monitoring video sending end - Google Patents

Method for controlling video coding flow rate of monitoring video sending end

Info

Publication number
CN111031387B
CN111031387B (application CN201911145837.4A)
Authority
CN
China
Prior art keywords
video
rate
encoder
real
training
Prior art date
Legal status
Active
Application number
CN201911145837.4A
Other languages
Chinese (zh)
Other versions
CN111031387A (en)
Inventor
张旭
赵阳超
马展
Current Assignee
Nanjing University
Original Assignee
Nanjing University
Priority date
Filing date
Publication date
Application filed by Nanjing University
Priority to CN201911145837.4A
Publication of CN111031387A
Application granted
Publication of CN111031387B

Classifications

    • H04N 21/44004 — Processing of video elementary streams involving video buffer management, e.g. video decoder buffer or video display buffer
    • H04N 21/440218 — Processing of video elementary streams involving reformatting operations by transcoding between formats or standards, e.g. from MPEG-2 to MPEG-4
    • H04N 21/44227 — Monitoring of local network, e.g. connection or bandwidth variations; detecting new devices in the local network
    • H04N 21/4621 — Controlling the complexity of the content stream or additional data, e.g. lowering the resolution or bit-rate of the video stream for a mobile client with a small screen
    • G06N 3/045 — Neural network architectures; combinations of networks
    • G06N 3/08 — Neural network learning methods


Abstract

The invention discloses a method for controlling the video coding flow rate at a monitoring video sending end, which mainly comprises the following steps: (1) collecting a real-time available-bandwidth data set of the video transmission network; (2) using the real bandwidth data to construct a simulated training environment of the monitoring video sending end, in which the environment determines in real time, from the real bandwidth data, the highest available bandwidth of the sending end and uses it as the video sending rate, and adjusts the encoder's coding rate according to the code rate selected by the deep reinforcement learning model; (3) constructing a trust-region-based deep reinforcement learning model with continuous action output and training it in the simulated environment; (4) integrating the trained model into a real monitoring-video environment and performing online training optimization; (5) integrating the optimized model into the monitoring video sending end to make the encoder's coding-rate decisions. The invention thus applies deep reinforcement learning to solve the problem of controlling the coding flow rate at the monitoring video sending end.

Description

Method for controlling video coding flow rate of monitoring video sending end
Technical Field
The invention relates to the field of real-time video transmission, in particular to a method for controlling the video coding flow rate at a monitoring video sending end.
Background
Surveillance video generally has high requirements for real-time performance, fluency and picture quality. In an actual monitoring environment, however, video transmission from the acquisition end to the receiving end (e.g. a monitoring room) often traverses a complex network, whose limited bandwidth and fluctuating delay degrade the real-time performance, smoothness and clarity seen at the playing (receiving) end. To guarantee the transmission quality of the surveillance video and improve the viewing experience, every link in the transmission chain needs targeted optimization, in particular the coding flow rate control at the video sending end.
The monitoring video sending end needs accurate coding flow rate control for two main reasons. On the one hand, the sending rate of the surveillance video is determined by a complex network environment and therefore changes rapidly and is hard to predict. On the other hand, the rate at the coding stage after video acquisition can be controlled by adjusting the encoder's coding parameters. Between the encoder and the network, the code stream passes through a video sending buffer that keeps sending smooth: the buffer is drained at a speed determined by the real-time available bandwidth of the network, i.e. the actual sending rate, and filled at a speed determined by the encoder's coding rate. A mismatch between the video coding rate and the video sending rate can therefore arise.
If the coding rate at the sending end and the sending rate of the video do not match, the video sending buffer either overflows or 'starves'. Overflow means the number of frames in the buffer has reached the buffer's capacity: to store a newly encoded frame, the earliest frame in the buffer must be discarded, causing frame loss during transmission. Starvation means the buffer is empty because the video coding rate has long been lower than the real-time available bandwidth: bandwidth utilization is too low, a large amount of available bandwidth is wasted, and the video clarity at the monitoring video receiving end could still be further improved.
Therefore, the main aim of video coding flow rate control at the monitoring video sending end is to match the encoder's coding rate to the video sending rate: when the real-time available bandwidth is large, the sending rate is large and the coding rate can be raised appropriately; when the available bandwidth drops, the sending rate drops and the coding rate should be reduced in time, avoiding the frame loss caused by overflow of the video sending buffer.
The most intuitive way to achieve rate matching is to predict the real-time available bandwidth at the next moment in advance and then adjust the encoder's coding rate accordingly. In practice, however, the available bandwidth usually changes irregularly and is quite difficult to estimate, so the current transmission environment can only be inferred roughly from measurable characteristic parameters observed during video transmission, with the next coding rate chosen from those observations. The difficulty of rate matching through measurable parameters is then to judge accurately, from the measured values, the characteristics of the current transmission environment, in particular the real-time available bandwidth of the current network.
Disclosure of Invention
Aiming at the problem of code rate control during monitoring video sending, the invention provides a deep-reinforcement-learning-based method for controlling the video coding flow rate at the monitoring video sending end.
The technical scheme adopted by the invention is as follows:
a method for controlling video coding flow rate at a monitoring video sending end comprises the following steps:
step 1, collecting real bandwidth change data of an actual transmission environment by using an equal-interval sampling mode, and making a video transmission scene network real-time available bandwidth data set for training;
step 2, constructing a simulation training environment of the monitoring video sending end using the real bandwidth data collected in step 1; the training environment determines in real time, from the real bandwidth data, the highest available bandwidth for sending the monitoring video and uses it as the video sending rate, and receives the code rate selected by the deep reinforcement learning model, setting it as the encoder's coding rate for the next time period;
step 3, constructing a trust-region-based deep reinforcement learning model with continuous action output, designing the target reward function required for model training, and training the model with the simulation training environment of step 2; the model takes the various data output by the simulation training environment of step 2 as input and selects the coding rate of the monitoring video sending end for the next moment, and the goal of training the model is set to maximizing the target reward function;
step 4, integrating the model trained in the step 3 into a real environment for interaction, and performing on-line training optimization;
and 5, integrating the optimized deep reinforcement learning model to a monitoring video sending end to select a sending code rate.
Further, in step 1, the real bandwidth change data includes real-time available bandwidth change data recorded while the surveillance video is transmitted as well as existing public bandwidth change data sets. The recorded data is the network available bandwidth of the video transmission, sampled at different time intervals.
Further, in step 2, a simulation training environment of the monitoring video sending end is constructed, and the specific process is as follows:
step 21, constructing a video encoder simulation module, wherein the input of the video encoder simulation module is some fixed encoding parameters of the monitoring video, including the frame rate of the video, the size of the video image group and the selected video encoding code rate; the output of the video encoder simulation module is the data size of one video frame; according to the input fixed coding parameters, the data size of a video frame is determined by using a uniform distribution:
FS = sample(U(a, b))  (figure GDA0002736431050000031)
wherein sample () operation represents sampling from a probability distribution, U (a, b) represents a uniform distribution over the interval [ a, b ]; the video encoder simulation module adds video frames with the size of FS to a buffer area in the video sending buffer area simulation module at regular time according to frame intervals determined by the frame rate of the video;
step 22, constructing a video transmission buffer simulation module, the main body of which is a simulated video transmission buffer, and the maximum frame number which can be accommodated by the buffer needs to be specified, when the buffer is full, if the simulation module of the encoder has a new incoming video frame, the existing earliest incoming video frame in the buffer needs to be cleared, and the new incoming video frame is added into the buffer;
step 23, constructing a video network transmission simulation module, wherein the input of the video network transmission simulation module is the real bandwidth change data of the actual transmission environment obtained in the step 1, and the available bandwidth is used as the video transmission rate to consume the video frame from the video transmission buffer area in the video transmission buffer area simulation module; if the available bandwidth is maintained at BW for the Δ t time interval, the total amount of data D transmitted over the network during the Δ t time interval is:
D=Δt*BW
the total number of data amounts of frames in the buffer that should be cleared out of the zone is of size D.
Further, in step 3, the trust-region-based continuous action output deep reinforcement learning model is constructed as follows:
step 31, processing the output of the simulation training environment of step 2 into the input of the deep reinforcement learning model: first, normalize each parameter of the historical k time nodes — the encoder coding rate, the video sending buffer length, the change of the video sending buffer, and the historical average sending rate of the video; then store the normalized values in an input state matrix;
step 32, building the neural network part of the trust-region-based continuous action output deep reinforcement learning model, comprising an actor deep neural network and a critic deep neural network, and building the training optimization targets of the two networks, i.e. their respective loss functions;
step 33, designing the reward function for training the trust-region-based continuous action deep reinforcement learning model: the reward function gives a higher reward to encoder code rate selections that keep the video sending buffer at a normal level and that keep the encoder code rate stable, and a lower reward to actions that drive the sending buffer length away from the normal level;
step 34, inputting the state matrix of step 31 into the actor and critic networks of step 32, performing a forward pass to obtain their outputs, obtaining the video encoder coding rate for the next moment from the network outputs, computing the reward function constructed in step 33, then computing the corresponding training optimization targets from the reward value and the outputs of the two networks, back-propagating to update the network parameters, and setting the coding rate obtained from the network output as the new encoder coding rate, which will affect the state matrix at the next moment;
step 35, repeat step 34 until the resulting reward function no longer rises.
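The patent specifies trust-region-based training with separate actor and critic loss functions (step 32) but does not give their exact form. As an illustration only, a common trust-region-style choice is the PPO clipped surrogate for the actor and a squared error for the critic; the function names, clipping constant, and loss shapes below are assumptions, not the patent's definitions:

```python
import math

def ppo_actor_loss(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate: an approximation of a trust-region
    constraint on the policy update.  Returns the loss to MINIMIZE."""
    ratio = math.exp(logp_new - logp_old)            # pi_new(a|s) / pi_old(a|s)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    return -min(ratio * advantage, clipped * advantage)

def critic_loss(value_pred, value_target):
    """Squared error between the critic's value estimate and its target."""
    return (value_pred - value_target) ** 2
```

The clipping keeps each policy update inside a small neighborhood of the previous policy, which is the same motivation as the trust-region constraint named in the claims.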
Further, in step 5, the deep reinforcement learning model optimized in step 4 is integrated into the monitoring video sending end; only the actor network of the model needs to be deployed at the sending end, as follows:
step 51, deploying a lightweight operating environment of the selected deep learning framework at a monitoring video sending end;
step 52, converting the actor network of the deep reinforcement learning model optimized in step 4 into a mobile lightweight model;
and step 53, calling the mobile lightweight model generated in step 52 through the running environment configured in step 51 to perform a forward pass and obtain the code rate to select, setting it as the encoder's coding rate; then, following the manner of step 4, collecting the characteristic parameters directly from the system and computing the state matrix, using the new state matrix as the input of the lightweight model to compute the encoder coding rate for the next moment, and repeating this interaction process.
The invention applies deep reinforcement learning, supported by a large amount of bandwidth change data from actual transmission environments, to solve the problem of controlling the coding flow rate at the monitoring video sending end. To achieve the best control effect, on the one hand the invention selects an advanced trust-region-based reinforcement learning method; on the other hand, to guarantee both the response speed of the coding flow rate control and a continuous control range, it selects a reinforcement learning model with continuous action output, which directly outputs the selected code rate value instead of a preset code rate level. Furthermore, after the deep reinforcement learning model is trained in the simulation environment on a large amount of data, it is deployed to the actual system for online optimization training, improving its performance in the specific real scene while preserving its generalization ability.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of a video sender simulation environment;
FIG. 3 is the trust-region-based continuous action output deep reinforcement learning model.
Detailed Description
The invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in fig. 1, a method for controlling video encoding flow rate at a monitoring video sending end in this embodiment specifically includes the following steps:
step 1, collecting real-time available bandwidth change data during monitoring video transmission and collecting an existing public bandwidth change data set by using an equal-interval sampling mode, and making a video transmission scene network real-time available bandwidth data set for training.
The sampling interval t_sample is set to 150 ms; if the sampling interval in a public data set is not 150 ms, it can be uniformly resampled to 150 ms. The network bandwidth data is stored in a number of text files. Each line of a text file contains two values separated by a tab: the 1st value is the current timestamp, starting from 0 and incrementing in steps of t_sample; the 2nd value is the available bandwidth at that time. The network available bandwidth of the video transmission is sampled and collected at different time intervals in order to simulate network bandwidths with various change speeds and increase data diversity.
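Under the file format described above (one `timestamp<TAB>bandwidth` pair per line, timestamps starting at 0 in t_sample steps), a trace file could be parsed with a sketch like the following; the function name and the float conversions are illustrative assumptions:

```python
def load_bandwidth_trace(path):
    """Parse one bandwidth trace file.  Each non-empty line holds
    '<timestamp>\t<available_bandwidth>'; timestamps start at 0 and
    increment in t_sample (150 ms) steps."""
    trace = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue                     # skip blank lines
            ts, bw = line.split("\t")
            trace.append((float(ts), float(bw)))
    return trace
```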
Step 2, constructing a simulation training environment of the monitoring video sending end by using the real bandwidth data collected in the step 1:
and step 21, constructing a video encoder simulation module. The simulation module of the video encoder inputs some fixed encoding parameters of the monitoring video, the Frame Rate (FR) of the video is 25FPS, the size (GOP) of a video image group is 3s and is correspondingly 75 frames, and the selected video encoding rate (BR), wherein the video encoding rate parameters are obtained by calculating the output of the operator network constructed in the step 32 and are direct control quantity of video encoding flow rate control. The output of the video encoder emulation module is primarily the data size (FS) of one video frame. The uniform distribution is used to determine, in this embodiment:
FS = sample(U(a, b))  (figure GDA0002736431050000051)
where the sample() operation denotes sampling from a probability distribution, U(a, b) denotes a uniform distribution over the interval [a, b], and FS is in MB when the selected video coding rate is in Mbps.
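The exact bounds of the uniform distribution are given in the figure; as a rough illustrative sketch only, one can center the distribution on the mean frame size BR/(8·FR) MB (BR Mbps spread over FR frames per second, divided by 8 to convert Mbit to MB) with an assumed ±20% jitter — the jitter width is an assumption, not the patent's value:

```python
import random

def sample_frame_size(br_mbps, fr_fps=25, jitter=0.2, rng=random):
    """Sample one frame's size in MB.

    Mean size = BR / (8 * FR).  The uniform bounds (+/- jitter around
    the mean) are an illustrative assumption; the patent gives the
    actual bounds in a figure."""
    mean = br_mbps / (8.0 * fr_fps)
    return rng.uniform((1.0 - jitter) * mean, (1.0 + jitter) * mean)
```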
Step 22: construct the video sending buffer simulation module. Its main function is to maintain a first-in-first-out queue of video frames, each carrying its frame data size (FS). The maximum number of frames the buffer can hold is specified as 125, corresponding to 5 s of video. When the queue is full and a new frame arrives from the encoder simulation module, the earliest frame in the queue is cleared and the new frame is appended.
Step 23: construct the video network transmission simulation module. Its input is the timestamp and bandwidth data read from the text files of network real-time available bandwidth obtained in step 1; the current bandwidth value is used as the sending rate from the current timestamp to the next to consume video frames from the sending buffer in the video sending buffer simulation module. If the available bandwidth stays at BW for the interval Δt, which is 150 ms in this embodiment, the total amount of data D transmitted over the network during the interval is:
D=0.15*BW
Video frames are consumed sequentially from the head of the video sending buffer (oldest first) until the sum of the consumed frames' data sizes reaches D.
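The interaction of steps 22 and 23 — a FIFO frame queue capped at 125 frames that drops the oldest frame on overflow, drained each 150 ms interval by the available bandwidth — might be sketched as follows. The class and method names are assumptions, as is the Mbit-to-MB conversion (/8) needed to make the bandwidth (Mbps) and frame sizes (MB) unit-consistent:

```python
from collections import deque

class SendBuffer:
    """FIFO video send buffer holding per-frame sizes in MB."""
    def __init__(self, max_frames=125):
        self.frames = deque()
        self.max_frames = max_frames
        self.dropped = 0                     # overflow frame-loss counter

    def push(self, frame_size):
        if len(self.frames) >= self.max_frames:
            self.frames.popleft()            # overflow: drop earliest frame
            self.dropped += 1
        self.frames.append(frame_size)

    def drain(self, bw_mbps, dt_s=0.15):
        """Consume whole frames, oldest first, with the available bandwidth
        over one interval.  Budget D = dt * BW / 8 (Mbit -> MB)."""
        budget = dt_s * bw_mbps / 8.0
        sent = 0.0
        while self.frames and sent + self.frames[0] <= budget:
            sent += self.frames.popleft()
        return sent
```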
Step 3, constructing a continuous action output deep reinforcement learning model based on a trust domain, and training the model by utilizing a simulation training environment:
Step 31: process the output of the simulation training environment of step 2 into the input of the deep reinforcement learning model. First, normalize each parameter over the historical 2 time nodes: the encoder coding rate BR (unit: Mbps), the video sending buffer length BL (unit: frames), the change of the video sending buffer ΔB (unit: frames), and the historical average sending rate TH (unit: Mbps) — 4 parameters, hence 8 values in total. The normalized values are then stored in the input state matrix: each of its 4 columns is a vector of length 2 holding one of the four characteristic parameters, so the state matrix has dimension 2 × 4.
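The construction of the 2 × 4 state matrix could be sketched as below; the per-feature normalization constants are illustrative assumptions, since the embodiment only states that the parameters are normalized:

```python
def build_state_matrix(history, norms):
    """Build the k x 4 input state matrix from the last k observations.

    history: list of k dicts with keys 'BR' (Mbps), 'BL' (frames),
             'dB' (frames), 'TH' (Mbps).
    norms:   per-feature normalization constants (assumed values; the
             patent only says the parameters are normalized)."""
    keys = ("BR", "BL", "dB", "TH")
    return [[obs[k] / norms[k] for k in keys] for obs in history]
```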
Step 32: build the neural network part of the trust-region-based continuous action output deep reinforcement learning model with the popular deep learning framework TensorFlow, constructing an actor network and a critic network along with their training optimization targets, i.e. their respective loss functions.
Step 33: design the reward function for training the trust-region-based continuous action output deep reinforcement learning model. The reward function mainly considers whether the model's choice keeps the video sending buffer in a proper range and whether it stays as consistent as possible with the previous choice; its specific form is given in figure GDA0002736431050000061, where BR and lastBR are the code rates selected by the current and previous decisions respectively, and BL is the video sending buffer length observed at the next decision after the current one, converted from a frame count to the corresponding duration.
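The exact reward formula is given in the figure; a sketch consistent with the stated design intent — reward higher bitrate, penalize deviation of the buffer from a target occupancy, and penalize abrupt bitrate changes — might look as follows, with all weights and the target level being assumptions:

```python
def reward(br, last_br, bl_s, target_bl_s=2.5,
           w_rate=0.1, w_dev=1.0, w_smooth=0.5):
    """Illustrative reward (all constants are assumed, not the patent's):
    favor high bitrate br (Mbps), penalize the send-buffer occupancy bl_s
    (seconds) deviating from a target level, and penalize bitrate jumps."""
    return (w_rate * br
            - w_dev * abs(bl_s - target_bl_s)
            - w_smooth * abs(br - last_br))
```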
Step 34: feed the state matrix of step 31 into the actor and critic networks of step 32 and perform a forward pass to obtain their outputs. Then sample from the normal distribution constructed from the mean and variance output by the actor network to obtain the video coding rate for the next moment, compute the reward function constructed in step 33, and from the reward value and the outputs of the two networks compute the corresponding training optimization targets; back-propagate to update the network parameters. The coding rate obtained from the network output is taken as the new encoder coding rate, which will influence the state matrix at the next moment.
Step 35: repeat step 34, with a decision interval of 1 s, until the obtained reward no longer rises.
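The action-selection part of step 34 — sampling the next bitrate from the normal distribution parameterized by the actor's mean and variance outputs — could be sketched as below; the clipping range of valid encoder bitrates is an assumption:

```python
import random

def select_bitrate(mu, sigma, br_min=0.2, br_max=8.0, rng=random):
    """Sample the next coding bitrate (Mbps) from the actor's
    Normal(mu, sigma) output and clip it to the encoder's valid range
    (range bounds are illustrative assumptions)."""
    br = rng.gauss(mu, sigma)
    return min(max(br, br_min), br_max)
```

Clipping keeps the continuous action inside the range the encoder actually accepts, while still allowing any value in between rather than a preset code rate level.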
Step 4: integrate the model trained in step 3 into a real system for interaction. The overall process is consistent with step 3, except that the construction of steps 32 and 33 is no longer needed — the deep reinforcement learning model and reward function of step 3 are used directly — and that the four characteristic parameters of step 31 are collected directly from the real system every 1 s to form the state matrix used as the training input of the neural networks. After a newly selected coding rate is obtained, it is directly set as the coding rate of the system encoder.
Step 5, integrating the optimized deep reinforcement learning model to a monitoring video sending end, and selecting a sending code rate:
Step 51: deploy the lightweight running environment of the selected deep learning framework at the monitoring video sending end. In this embodiment TensorFlow is selected as the framework; the TensorFlow Lite static library for mobile deployment is compiled and deployed at the monitoring video sending end so that it can be called directly.
And step 52, converting the actor network in the deep reinforcement learning model optimized in the step 4 into a tensoflow-lite model.
And step 53, calling the tensoflow-lite model generated in the step 52 by using the tensoflow-lite static library configured in the step 51 to perform forward calculation to obtain a code rate to be selected, setting the coding code rate of the encoder, directly collecting characteristic parameters from the system according to the mode of the step 4 to calculate a state matrix, calculating a new state matrix every 1s as the input of the lightweight model to calculate the code rate of the encoder at the next moment, and repeating the interaction process.
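The step-53 decision loop can be sketched as follows. This is a hypothetical illustration: the history length `K`, the normalization constants, and the helper names are assumptions, and `run_actor` stands in for the TensorFlow Lite forward pass (`Interpreter.set_tensor` / `invoke` / `get_tensor` in a real deployment).

```python
from collections import deque

K = 8  # assumed number of historical time nodes kept in the state matrix

def normalize(rate_mbps, buf_len, buf_delta, avg_send_mbps,
              rate_max=8.0, buf_max=60.0):
    """Scale the four per-step features (encoder rate, buffer length,
    buffer change, historical average sending rate) to comparable ranges.
    The divisors are illustrative."""
    return (rate_mbps / rate_max,
            buf_len / buf_max,
            buf_delta / buf_max,
            avg_send_mbps / rate_max)

def decision_loop(collect_features, run_actor, set_encoder_rate, steps):
    """Every decision step (1 s in the patent): append the newest
    normalized features, run the actor once the K x 4 state matrix is
    full, and apply the chosen coding rate to the encoder."""
    history = deque(maxlen=K)
    for _ in range(steps):
        history.append(normalize(*collect_features()))
        if len(history) == K:                       # state matrix is full
            state = [list(row) for row in history]  # K x 4 state matrix
            set_encoder_rate(run_actor(state))
```

The callables `collect_features`, `run_actor`, and `set_encoder_rate` are the integration points with the real sending-end system.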

Claims (4)

1. A method for controlling video coding flow rate at a monitoring video sending end is characterized by comprising the following steps:
step 1, collecting real bandwidth change data of the actual transmission environment by equal-interval sampling, and building a data set of the real-time available network bandwidth in the video transmission scenario for training;
step 2, constructing a simulation training environment of the monitoring video sending end with the real bandwidth data collected in step 1; the training environment determines, in real time from the real bandwidth data, the highest available bandwidth for sending the monitoring video and uses it as the video sending rate, and it receives the code rate selected by the deep reinforcement learning model and sets it as the encoder's coding rate for the next time period;
the specific process of constructing the simulation training environment of the monitoring video sending end is as follows:
step 21, constructing a video encoder simulation module, the input of which is the fixed encoding parameters of the monitoring video, including the video frame rate, the size of the group of pictures, and the selected video coding rate, and the output of which is the data size of one video frame; given the input fixed coding parameters, the data size of a video frame is drawn from a uniform distribution:
FS = sample(U(a, b))
wherein sample () operation represents sampling from a probability distribution, U (a, b) represents a uniform distribution over the interval [ a, b ]; the video encoder simulation module adds video frames with the size of FS to a buffer area in the video sending buffer area simulation module at regular time according to frame intervals determined by the frame rate of the video;
step 22, constructing a video sending buffer simulation module, the main body of which is a simulated video sending buffer; the maximum number of frames the buffer can hold must be specified, and when the buffer is full and the encoder simulation module delivers a new video frame, the earliest frame currently in the buffer is cleared and the new frame is added to the buffer;
step 23, constructing a video network transmission simulation module, the input of which is the real bandwidth change data of the actual transmission environment obtained in step 1; the available bandwidth is used as the video sending rate to consume video frames from the buffer in the video sending buffer simulation module; if the available bandwidth stays at BW over a time interval Δt, the total amount of data D transmitted over the network in Δt is:
D=Δt*BW
and the frames cleared from the buffer total D in data amount;
step 3, constructing a trust-region-based continuous-action deep reinforcement learning model, designing the target reward function required for model training, and training the model with the simulation training environment of step 2; the model takes the data output by the simulation training environment of step 2 as input and selects the coding rate of the monitoring video sending end at the next moment, and the training goal is to maximize the target reward function;
the specific implementation process for constructing the trust-region-based continuous-action deep reinforcement learning model comprises the following steps:
step 31, processing the output of the simulation training environment of step 2 into the input of the deep reinforcement learning model, as follows: first, normalize each parameter of the historical k time nodes, the parameters comprising the encoder coding rate, the video sending buffer length, the change in the video sending buffer length, and the historical average video sending rate; then store the normalized parameter values in an input state matrix;
step 32, building the neural network part of the trust-region-based continuous-action deep reinforcement learning model, comprising an actor deep neural network and a critic deep neural network, and constructing the training optimization targets of the two networks, namely their respective loss functions;
step 33, designing the reward function for training the trust-region-based continuous-action deep reinforcement learning model, wherein the reward function gives a higher reward to encoder-code-rate choices that keep the video sending buffer at a normal level and keep the encoder coding rate stable, and a lower reward to actions that cause the video sending buffer length to deviate from the normal level;
step 34, inputting the state matrix of step 31 into the actor network and the critic network of step 32, performing a forward pass of the neural networks to obtain their outputs, then obtaining the video encoder's coding rate at the next moment from the network output, calculating the reward function constructed in step 33, then calculating the corresponding training optimization targets from the value of the reward function and the outputs of the two networks, performing back propagation to update the network parameters, and setting the coding rate produced by the network output as the new encoder coding rate, which will affect the state matrix at the next moment;
step 35, repeating step 34 until the obtained reward function does not rise any more;
step 4, integrating the model trained in the step 3 into a real environment for interaction, and performing on-line training optimization;
and 5, integrating the optimized deep reinforcement learning model to a monitoring video sending end to select a sending code rate.
2. The method for controlling video coding flow rate at a sending end of surveillance video according to claim 1, wherein in step 1, the real bandwidth change data includes real-time available bandwidth change data at the sending end of the surveillance video and an existing public bandwidth change data set.
3. The method according to claim 2, wherein in step 1, the real-time available bandwidth data during the transmission of the surveillance video is the available network bandwidth for video transmission, collected by sampling at different time intervals.
4. The method for controlling the video coding flow rate at the transmitting end of the surveillance video according to claim 1, wherein in step 5, the deep reinforcement learning model optimized in step 4 is integrated into the transmitting end of the surveillance video, and only the actor network in the model needs to be deployed to the transmitting end, the specific process being as follows:
step 51, deploying a lightweight operating environment of the selected deep learning framework at a monitoring video sending end;
step 52, converting the actor network in the deep reinforcement learning model optimized in step 4 into a mobile lightweight model;
and step 53, calling the mobile lightweight model generated in step 52 through the runtime configured in step 51 to perform a forward pass and obtain the code rate to be selected, setting it as the encoder coding rate, collecting characteristic parameters directly from the system as in step 4 to compute the state matrix, continuously computing the encoder coding rate at the next moment with the new state matrix as the input of the lightweight model, and repeating this interaction process.
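The simulation environment of steps 21–23 in claim 1 can be sketched as follows. The buffer capacity and the bounds of the uniform distribution are illustrative assumptions; the patent leaves a and b of U(a, b) as parameters.

```python
import random
from collections import deque

def frame_size(rate_mbps, fps, lo=0.8, hi=1.2):
    """Step 21: draw one frame's size in bits as FS = sample(U(a, b)),
    with bounds here assumed proportional to the average bits per frame."""
    avg = rate_mbps * 1e6 / fps
    return random.uniform(lo * avg, hi * avg)

class SendBuffer:
    """Step 22: fixed-capacity frame buffer; when full, the earliest
    frame is evicted to admit a newly encoded one."""
    def __init__(self, max_frames=60):
        self.frames = deque(maxlen=max_frames)  # deque drops the oldest

    def push(self, fs):
        self.frames.append(fs)

    def drain(self, dt, bw_bps):
        """Step 23: over interval dt the network consumes D = dt * BW
        bits, clearing whole frames from the head of the buffer."""
        budget = dt * bw_bps
        while self.frames and self.frames[0] <= budget:
            budget -= self.frames.popleft()
```

Pushing frames at the frame interval and draining at the bandwidth trace's rate reproduces the buffer dynamics the reward function observes.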
CN201911145837.4A 2019-11-21 2019-11-21 Method for controlling video coding flow rate of monitoring video sending end Active CN111031387B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911145837.4A CN111031387B (en) 2019-11-21 2019-11-21 Method for controlling video coding flow rate of monitoring video sending end

Publications (2)

Publication Number Publication Date
CN111031387A CN111031387A (en) 2020-04-17
CN111031387B true CN111031387B (en) 2020-12-04


Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113132765A (en) * 2020-01-16 2021-07-16 北京达佳互联信息技术有限公司 Code rate decision model training method and device, electronic equipment and storage medium
CN112954401A (en) * 2020-08-19 2021-06-11 赵蒙 Model determination method based on video interaction service and big data platform
CN112468808B (en) * 2020-11-26 2022-08-12 深圳大学 I frame target bandwidth allocation method and device based on reinforcement learning
CN112911408B (en) * 2021-01-25 2022-03-25 电子科技大学 Intelligent video code rate adjustment and bandwidth allocation method based on deep learning
CN114039870B (en) * 2021-09-27 2022-12-09 河海大学 Deep learning-based real-time bandwidth prediction method for video stream application in cellular network
CN115086667B (en) * 2022-07-26 2022-11-18 香港中文大学(深圳) Real-time video transmission method based on adaptive learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108063961A (en) * 2017-12-22 2018-05-22 北京联合网视文化传播有限公司 A kind of self-adaption code rate video transmission method and system based on intensified learning
CN109982118A (en) * 2019-03-27 2019-07-05 北京奇艺世纪科技有限公司 A kind of video code rate self-adapting regulation method, device and electronic equipment
CN110351555A (en) * 2018-04-03 2019-10-18 朱政 Multipass based on intensified learning goes through video frequency coding rate distribution and control optimization method
CN110430398A (en) * 2019-08-06 2019-11-08 杭州微帧信息科技有限公司 A kind of Video coding distributed method based on intensified learning

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9264664B2 (en) * 2010-12-03 2016-02-16 Intouch Technologies, Inc. Systems and methods for dynamic bandwidth allocation
CN106331717B (en) * 2015-06-30 2019-05-07 成都鼎桥通信技术有限公司 Video code rate self-adapting regulation method and sending ending equipment
EP3360085B1 (en) * 2015-11-12 2021-05-19 Deepmind Technologies Limited Asynchronous deep reinforcement learning
CN108494772B (en) * 2018-03-25 2021-08-17 上饶市中科院云计算中心大数据研究院 Model optimization, network intrusion detection method and device and computer storage medium
CN110351561B (en) * 2018-04-03 2021-05-07 杭州微帧信息科技有限公司 Efficient reinforcement learning training method for video coding optimization
CN109947567B (en) * 2019-03-14 2021-07-20 深圳先进技术研究院 Multi-agent reinforcement learning scheduling method and system and electronic equipment


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Experience-driven Networking: A Deep Reinforcement Learning based Approach; Zhiyuan Xu; 《IEEE》; 20181011; entire document *
Improving Cloud Gaming Experience through Mobile Edge Computing; Xu Zhang; 《IEEE》; 20190411; entire document *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant