CN116320620A - Streaming media bitrate adaptive adjustment method based on personalized federated reinforcement learning - Google Patents
Streaming media bitrate adaptive adjustment method based on personalized federated reinforcement learning
- Publication number
- CN116320620A (application CN202310349691.5A)
- Authority
- CN
- China
- Prior art keywords
- network
- bit rate
- model
- reinforcement learning
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/4402—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/442—Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
- H04N21/44209—Monitoring of downstream path of the transmission network originating from a server, e.g. bandwidth variations of a wireless network
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/466—Learning process for intelligent management, e.g. learning user preferences for recommending movies
- H04N21/4662—Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms
- H04N21/4666—Learning process for intelligent management, e.g. learning user preferences for recommending movies characterized by learning algorithms using neural networks, e.g. processing the feedback provided by the user
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses a streaming media bitrate adaptive adjustment method based on personalized federated reinforcement learning. In an HTTP-based dynamic adaptive streaming system, the bitrate adaptation process is formalized as a Markov decision process built on personalized federated learning and deep reinforcement learning. Each user learns a bitrate adaptation strategy locally using reinforcement learning, with an objective equation that maximizes the user's quality of experience. A global model is trained with federated learning, coordinating the users and a central server, and a personalized model is then trained on top of the global model using each user's local data. After sufficient training, a user can select bitrates with the personalized model so as to maximize the objective equation under current network conditions. The invention copes with the heavy-tailed characteristics of network environments and user behavior while protecting user privacy.
Description
Technical Field
The invention relates to the application of machine learning in the field of video streaming, and in particular to a streaming media bitrate adaptive adjustment method based on personalized federated reinforcement learning.
Background
Reinforcement learning is a branch of machine learning in which an intelligent system learns a mapping from environment to behavior by trial and error so as to maximize its cumulative reward. Reinforcement learning is typically described by a Markov Decision Process (MDP), as shown in fig. 1. The machine operates in an environment E with state space S, where each state s ∈ S is the agent's observation of the environment, and the actions the machine can take constitute an action space A. When action a is executed in the current state s, the environment transitions to another state with some probability according to a state transition function P, and a reward is fed back to the machine according to a reward function R. Reinforcement learning can be used to solve an MDP; when the state or action space is large, the MDP can be better solved using Deep Reinforcement Learning (DRL).
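The agent-environment interaction described above can be sketched in a few lines. This is an illustrative toy, not part of the patent: the two-state environment, its transition rule, and its reward values are invented purely to show the MDP loop of observe state → act → receive reward.

```python
class ToyMDP:
    """A tiny deterministic 2-state MDP: states {0, 1}, actions {0, 1}.
    Action 1 moves to state 1, which pays reward 1; action 0 stays in state 0."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # State transition function P: action 1 leads to state 1, action 0 to state 0.
        self.state = 1 if action == 1 else 0
        # Reward function R: being in state 1 yields reward 1.
        reward = 1.0 if self.state == 1 else 0.0
        return self.state, reward

def run_episode(env, policy, steps=10):
    """Trial-and-error loop: observe the state, act, and accumulate rewards."""
    total = 0.0
    s = env.state
    for _ in range(steps):
        a = policy(s)
        s, r = env.step(a)
        total += r
    return total

always_one = lambda s: 1
total = run_episode(ToyMDP(), always_one, steps=10)
```

In this toy, the policy that always picks action 1 collects the maximum cumulative reward; a learning agent would discover such a policy by trial and error.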
Today, video streaming accounts for more than 80% of Internet traffic. HTTP adaptive bitrate streaming (HAS) is the mainstream video delivery solution. It breaks video content into chunks of 2 to 4 seconds and serves the chunks to the user over the HTTP protocol like ordinary web content. Dynamic Adaptive Streaming over HTTP (DASH) is the first open-source solution for adaptive bitrate streaming over HTTP. In a DASH system, each chunk has representations at one or more bitrates. An Adaptive Bitrate (ABR) algorithm dynamically selects the bitrate of each video chunk for the client video player, based on network conditions, device capabilities, and user preferences, so as to maximize the user's quality of experience (QoE). However, selecting an appropriate bitrate in a dynamic network is challenging, because network bandwidth is limited and the individual QoE metrics conflict and must be traded off against one another.
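The simplest kind of ABR rule — a rate-based heuristic in the spirit of the bandwidth-based baselines discussed below, and not the patent's method — picks the highest representation that fits within the estimated bandwidth. A hedged sketch (the safety margin and ladder values are illustrative assumptions):

```python
def pick_bitrate(ladder_kbps, est_bandwidth_kbps, safety=0.9):
    """Rate-based ABR toy: choose the highest bitrate representation whose
    rate fits within a safety margin of the estimated bandwidth."""
    feasible = [r for r in sorted(ladder_kbps) if r <= safety * est_bandwidth_kbps]
    # Fall back to the lowest representation when nothing fits.
    return feasible[-1] if feasible else min(ladder_kbps)

ladder = [300, 750, 1200, 1850, 2850, 4300]  # kbps, a typical DASH ladder
```

With an estimate of 2000 kbps and a 0.9 margin, the rule selects 1200 kbps — illustrating why such heuristics are conservative: they ignore buffer state and all other useful signals.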
As an important component of video streaming systems, the ABR algorithm has been widely studied. Conventional model-based ABR algorithms include bandwidth-based algorithms such as FESTIVE and buffer-occupancy-based algorithms such as BOLA and BBA. These algorithms select the bitrate by modeling the network conditions, either by estimating the current network bandwidth or based solely on the occupancy of the client's playback buffer. They cannot achieve optimal performance because they do not use all of the useful information when making bitrate decisions. MPC uses a model predictive control algorithm, combining network throughput estimates with buffer occupancy information, to select bitrates that maximize QoE over a horizon of several future chunks. However, its fixed control rules prevent MPC from adapting to the wide range of network conditions and different QoE objectives found in the real world. Recent work, such as Stick and Fugu, uses machine learning to generate ABR policies. Machine-learning-based ABR algorithms take the observed raw data (e.g., network throughput, playback buffer occupancy, chunk sizes) as neural network inputs and output the predicted bandwidth, download time, or bitrate of the next chunk. Despite the flexibility and effectiveness of ABR algorithms based on deep reinforcement learning, several challenges remain in applying them to practical video streaming systems. On the one hand, it is difficult to collect training data from video streaming sessions, which often involve user privacy; uploading users' video-viewing information from their various network environments to a central server for reinforcement learning training can raise serious privacy concerns. On the other hand, training a single unified DRL model for heterogeneous clients is infeasible given the complexity and diversity of network conditions, and such a model also struggles with user behavior that changes over time.
Disclosure of Invention
The invention aims to overcome the shortcomings of existing bitrate adaptation algorithms by providing a streaming media bitrate adaptive adjustment method based on personalized federated reinforcement learning, so as to improve the user's quality of experience.
The technical scheme is as follows: to achieve the above object, the streaming media bitrate adaptive adjustment method based on personalized federated reinforcement learning of the present invention comprises the following steps:
(1) Establish a reinforcement learning model. The environmental state comprises: network throughput measurements for several past video chunks, the download times of several past chunks, the size of the next chunk at each available bitrate, the current buffer occupancy, the number of chunks remaining in the current video, and the bitrate of the most recently downloaded chunk. The client's action is the bitrate selected for the next chunk. The reward returned by the environment is the contribution of the currently selected chunk to QoE. The objective equation maximizes the user's quality of experience, accounting for video sharpness, client rebuffering time, and bitrate smoothness.
(2) The client performs deep reinforcement learning locally, as follows: the client collects the current state and inputs it into the policy network of the reinforcement learning model, which returns the selected action; the action is applied to the environment, yielding {state, action, reward} tuples; these tuples are used to train the value function network, and the value function is in turn used to train the policy network; these operations are repeated until the model converges.
(3) Federated learning is performed between the clients and the central server, as follows: each client returns its locally trained model to the central server; the central server receives the models sent by the clients and aggregates them into a new global model, which it sends back to the clients; the clients then continue the local training and uploading operations. The final learning result is one global model plus a number of personalized models, each representing a mapping from states to actions.
(4) The client's state is input into the trained personalized model to obtain the bitrate that maximizes the user's QoE.
Further, the objective equation is expressed as:
QoE = Σ_{n=1}^{N} q(R_n) − μ Σ_{n=1}^{N} T_n − τ Σ_{n=1}^{N−1} |q(R_{n+1}) − q(R_n)|
where N is the total number of chunks in the current video; R_n is the video bitrate of chunk n; T_n is the rebuffering time incurred by chunk n; q(R_n) is a function mapping the bitrate R_n to user-perceived video quality; and μ and τ are non-negative weighting parameters for the rebuffering time and the smoothness of video quality changes, respectively.
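The objective can be evaluated directly for a finished session. A minimal sketch follows; the identity quality map q(r) = r and the numeric weights below are illustrative assumptions, not values prescribed by the patent:

```python
def qoe(bitrates, rebuffer_times, q=lambda r: r, mu=4.3, tau=1.0):
    """QoE = sum q(R_n) - mu * sum T_n - tau * sum |q(R_{n+1}) - q(R_n)|.
    bitrates: per-chunk bitrate R_n; rebuffer_times: per-chunk rebuffering T_n."""
    quality = sum(q(r) for r in bitrates)
    rebuffer_penalty = mu * sum(rebuffer_times)
    smoothness_penalty = tau * sum(abs(q(b) - q(a))
                                   for a, b in zip(bitrates, bitrates[1:]))
    return quality - rebuffer_penalty - smoothness_penalty
```

For bitrates [1.0, 2.0, 2.0] Mbps with 0.5 s of rebuffering at the second chunk, the quality term is 5.0, the rebuffering penalty 2.15, and the smoothness penalty 1.0, giving a QoE of 1.85 — showing how the three terms trade off.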
Further, the reinforcement learning method is the Actor-Critic algorithm, augmented with a context neural network module. The context module takes the client's states and the rewards returned by the environment as input and outputs a latent representation vector of the current environment; this vector is then used as an input to the value function network to guide its learning.
Further, in reinforcement learning training, the context network and the value function network are updated by gradient descent on the temporal-difference loss:
Σ_t ( r_t + γ V(s_{t+1}, z_{t+1}) − V(s_t, z_t) )²
where V(s, z) denotes the value function; θ_v denotes the parameters of the value function network; θ_c denotes the parameters of the context network; γ is the discount factor for future rewards; and r_t is the reward at time t.
Further, in reinforcement learning training, the update equation of the policy network is:
θ ← θ + α Σ_t ∇_θ log π_θ(a_t | s_t, z_t) · (R_t − V(s_t, z_t)) + β ∇_θ H(π_θ(· | s_t, z_t))
where θ denotes the parameters of the policy network; R_t = Σ_k γ^k r_{t+k} denotes the expected discounted reward at time t; α is the learning rate; γ is the discount factor for future rewards; β is a training parameter; a_t is the action of the reinforcement learning model; s_t is the client state of the reinforcement learning model; z_t is the output of the context network module; π_θ denotes the policy network; ∇_θ denotes the gradient with respect to the policy network parameters; and H(·) is an entropy term.
Further, the federated learning method used is the FedAvg algorithm: the models sent back to the central server by the clients are combined into a global model by weighted averaging.
Further, the aggregation equation of the federated learning model is:
w_{t+1} = Σ_{i=1}^{C} (n_i / n) · w^i_{t+1},  with n = Σ_{i=1}^{C} n_i
where n_i is the amount of local training data held by client i; C is the total number of clients participating in federated learning; and w^i_{t+1} is the model trained in round t and sent back by client i.
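The weighted average above is straightforward to compute. A sketch over flat parameter lists (real models would average per-layer tensors, but the arithmetic is identical):

```python
def fedavg(models, data_counts):
    """FedAvg aggregation: w = sum_i (n_i / n) * w_i, with n = sum_i n_i.
    models: one flat parameter list per client; data_counts: n_i per client."""
    n = sum(data_counts)
    dim = len(models[0])
    return [sum(m[j] * c for m, c in zip(models, data_counts)) / n
            for j in range(dim)]
```

A client holding three times as much data pulls the global model three times as hard toward its local weights, which is exactly the n_i/n weighting in the equation.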
The beneficial effects are as follows. The invention provides a streaming media bitrate adaptation method based on personalized federated learning. Each user learns a bitrate adaptation strategy locally using reinforcement learning. The state of the current environment is represented by information such as the client's network throughput measurements, the download times of several past video chunks, and the current buffer occupancy; the bitrate selection for the next chunk is defined as an action; and a target equation is established whose goal is to maximize the user's quality of experience. A global model is trained with federated learning, coordinating the users and a central server, and a personalized model is trained on top of the global model using each user's local data. After sufficient training, a user can select bitrates with the personalized model so as to maximize the target equation under current network conditions. By using federated learning, the invention avoids the privacy leakage inherent in learning-based algorithms. Meanwhile, to cope with the heavy-tailed characteristics of network environments and user behavior, a personalized local model is trained for each user on top of the global model obtained by federated learning, achieving better performance in the local environment. Moreover, a newly joined user can directly download the trained global model and train it with a small amount of local data, obtaining a local model better suited to the local environment.
Drawings
FIG. 1 is a reinforcement learning illustration;
FIG. 2 is a schematic diagram of a personalized federal reinforcement learning framework of the present invention;
FIG. 3 is a flow chart of a method of adaptive streaming media bit rate adjustment based on personalized federal reinforcement learning in accordance with the present invention;
Fig. 4 is a diagram of the reinforcement learning model used by the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
The invention provides a streaming media bitrate adaptive adjustment method based on personalized federated reinforcement learning. In an HTTP-based dynamic adaptive streaming system, the bitrate adaptation process is formalized as a Markov decision process built on personalized federated learning and deep reinforcement learning. Referring to fig. 2, users (i.e., clients) learn bitrate adaptation strategies locally using reinforcement learning; multiple users and a central server train a global model through federated learning; and each user trains a personalized model on top of the global model using local data. After sufficient training, a user can select bitrates with the personalized model to maximize the quality of user experience under current network conditions.
Referring to fig. 3, the training framework of the bitrate adaptation method of the present invention comprises three main phases: a local training phase, a federated aggregation phase, and a personalized adaptation phase. In the local training phase, the deep reinforcement learning algorithm Actor-Critic is used to complete the local model update; in the federated aggregation phase, the FedAvg algorithm is used for model aggregation; and in the personalized adaptation phase, the global model is downloaded and adapted locally by fine-tuning. The specific flow is as follows:
(1) First, the central server initializes the global model w_0 and distributes it to the clients.
(2) In round t, after client c receives the global model w_t, it trains the model with its local data; after several local updates, it sends the updated model w^c_{t+1} back to the central server.
(3) The central server receives the local models w^c_{t+1} sent back by these clients and aggregates them into a new global model w_{t+1}.
(4) Steps (2) and (3) are repeated until the model converges; once the model has converged, step (5) can be performed.
(5) The global model is fine-tuned using the client's local data to obtain a personalized model. A new client that did not participate in federated training can download the trained global model before it starts viewing video and then fine-tune it with local data.
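Step (5) amounts to a few local gradient steps starting from the global weights. A toy one-parameter sketch (the least-squares objective, learning rate, and data here are all invented for illustration — the patent fine-tunes a reinforcement learning model, not a regression):

```python
def fine_tune(w_global, xs, ys, lr=0.1, steps=50):
    """Personalization as fine-tuning: start from the global weight and take
    a few gradient-descent steps on the mean squared error over local data."""
    w = w_global
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

# A client whose local data follows y = 2x pulls the global weight toward 2.
w_personal = fine_tune(0.0, [1.0, 2.0], [2.0, 4.0])
```

The point of the sketch is the shape of the procedure: the global model supplies the starting point, and a small amount of local data moves the parameters toward the client's own environment.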
The global model and the local models in this process are all reinforcement learning models (referred to simply as models); their structures are identical, but their parameters differ. The basic principles of the reinforcement learning model were described above in connection with fig. 1. A reinforcement learning model is often described by a Markov process (sometimes called a Markov model), whose basic elements are states, actions, and rewards. According to an embodiment of the present invention, the following Markov model (i.e., reinforcement learning model) is built:
the environmental state observed by the client is defined as { x } t ,τ t ,n t ,b t ,c t ,l t The meaning of each parameter is as follows: t represents the current point in time. X is x t Refers to network throughput measurement of the past k video blocks. τ t Refers to the download time of the past k video blocks. n is n t Refers to the size of the next video block under various bit rate conditions. b t Refers to the current buffer occupancy, b when video is rebuffering t Is 0.c t Refers to the number of blocks remaining for the current video. l (L) t Refers to the bit rate of the last downloaded video block. In the embodiment, k is set to 7. The state of the client not only reflects the current network state of each client, such as the past network throughput measurement, but also reflects the state of the client watching the video, such as the number of blocks remaining in the current video, the size of the video blocks under various bit rate conditions, and the like.
The action of the client is defined as a_t, the bitrate selected for the next video chunk. The selectable bitrates include, but are not limited to, {300, 750, 1200, 1850, 2850, 4300} kbps.
The reward returned by the environment is defined as r_t and represents the QoE reward contributed by the currently selected chunk.
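The MDP elements above can be written down concretely. This is an illustrative sketch only: the field names and the `action_to_bitrate` helper are assumptions for demonstration (k = 7 and the bitrate ladder follow the embodiment):

```python
from dataclasses import dataclass
from typing import List

BITRATES_KBPS = [300, 750, 1200, 1850, 2850, 4300]  # selectable ladder from the text
K = 7  # history length k used in the embodiment

@dataclass
class ClientState:
    throughput_kbps: List[float]   # x_t: throughput of the past K chunks
    download_time_s: List[float]   # tau_t: download time of the past K chunks
    next_sizes_bits: List[float]   # n_t: size of the next chunk at each bitrate
    buffer_s: float                # b_t: buffer occupancy (0 while rebuffering)
    chunks_left: int               # c_t: chunks remaining in the current video
    last_bitrate_kbps: int         # l_t: bitrate of the last downloaded chunk

def action_to_bitrate(a_t: int) -> int:
    """The discrete action a_t indexes the bitrate ladder for the next chunk."""
    return BITRATES_KBPS[a_t]
```

Framing the action as an index into a fixed ladder keeps the policy network's output a small discrete distribution, which is what makes the Actor-Critic formulation below tractable.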
The target equation is defined as:
QoE = Σ_{n=1}^{N} q(R_n) − μ Σ_{n=1}^{N} T_n − τ Σ_{n=1}^{N−1} |q(R_{n+1}) − q(R_n)|
where N is the total number of chunks in the current video; R_n is the video bitrate of chunk n; T_n is the rebuffering time incurred by chunk n; q(R_n) is a function mapping the bitrate R_n to user-perceived video quality; and μ and τ are non-negative weighting parameters for the rebuffering time and the smoothness of video quality changes, respectively. A relatively small μ indicates that the user is not particularly sensitive to rebuffering time.
The objective equation maximizes the user's quality of experience, which covers video sharpness, rebuffering, and video smoothness; different users weight these three aspects differently. In the present invention, perceived video sharpness can be expressed in terms of the video bitrate.
In the embodiment of the invention, the learning process is realized by modeling the network environment. The modeled environment can be given different parameters according to different requirements, so that different rules are learned for different network environments and good results can be obtained across networks.
The network environment must account for the following factors: the network bandwidth of each user participating in federated learning and the transmission delay of each network. Since network bandwidth and transmission delay are the two factors most commonly used to measure network conditions, and both are measurable, the present invention uses these two quantities to describe a user's network environment. A video streaming client model is also established: the simulated video streaming application requests chunks from the server; if a chunk does not arrive in time, rebuffering occurs; and after a video finishes, playback of the next video automatically begins.
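One step of this simulated client can be sketched as follows. The 80 ms delay matches the embodiment described next; the chunk sizes, bandwidth, and 4-second chunk length are invented for illustration:

```python
def play_chunk(buffer_s, chunk_bits, bandwidth_bps, delay_s=0.08, chunk_len_s=4.0):
    """Download one chunk and return (new_buffer_s, rebuffer_s).
    Download time = size / bandwidth + per-request delay; rebuffering occurs
    when the download outlasts the video remaining in the buffer."""
    download_s = chunk_bits / bandwidth_bps + delay_s
    rebuffer_s = max(0.0, download_s - buffer_s)
    # The buffer drains during the download, then gains one chunk of playback.
    new_buffer_s = max(0.0, buffer_s - download_s) + chunk_len_s
    return new_buffer_s, rebuffer_s
```

With 2 s buffered, a 1 Mb chunk at 1 Mbps downloads in 1.08 s with no rebuffering; with only 0.5 s buffered, the same download stalls playback for 0.58 s — the penalty term the reward function must learn to avoid.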
In the embodiment of the invention, the number of users participating in federated learning is set to 3, each with a different network bandwidth; the FCC, HSDPA, and Sydney datasets are used to simulate WiFi, 3G, and 4G environments, respectively. The transmission delay of each network is set to 80 milliseconds. In the video streaming client model, the client randomly selects network trace data each time to simulate the current network bandwidth.
In the local training phase, the client performs deep reinforcement learning as follows. The client observes the network environment and the video playback information as the state of the current environment and inputs this state into the policy network, which outputs an action — the bitrate of the next video chunk. The action is applied to the environment, which returns the QoE reward contributed by that chunk, yielding the {state, action, reward} tuple for the current chunk. The value function network is trained on these {state, action, reward} tuples so that it better predicts the average cumulative return of different states. The trained value function network is then used to compute the advantage of the action output by the policy network, thereby updating the policy network. These operations are repeated until the model converges.
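The core arithmetic of this loop — discounted returns, advantages, and an advantage-weighted policy update — can be sketched numerically. This is a tabular toy with a softmax policy over one state, not the patent's neural networks; all values are invented:

```python
import math

def discounted_returns(rewards, gamma=0.99):
    """R_t = sum_k gamma^k r_{t+k}, computed backwards over one trajectory."""
    out, acc = [], 0.0
    for r in reversed(rewards):
        acc = r + gamma * acc
        out.append(acc)
    return list(reversed(out))

def softmax(logits):
    m = max(logits)
    e = [math.exp(x - m) for x in logits]
    z = sum(e)
    return [x / z for x in e]

def policy_gradient_step(logits, action, advantage, lr=0.1):
    """Advantage-weighted update: raise the log-probability of `action`
    when its advantage (R_t - V(s_t)) is positive, lower it otherwise."""
    probs = softmax(logits)
    return [l + lr * advantage * ((1.0 if i == action else 0.0) - p)
            for i, (l, p) in enumerate(zip(logits, probs))]
```

Starting from a uniform policy, one step with a positive advantage for action 0 shifts probability mass toward it — the mechanism by which actions that beat the critic's estimate become more likely.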
The specific reinforcement learning model used by the present invention is shown in fig. 4. Note that, on top of the Actor-Critic model, the invention adds a context module implemented with a long short-term memory (LSTM) network. The context module takes both the client's state and the reward returned by the environment as input and outputs a latent representation vector of the current environment; this vector ultimately serves as part of the input to the value function network, guiding its learning. When a new client joins, or a client's network environment changes, the context network can output a new representation vector reflecting the change.
Referring to fig. 4, the detailed procedure of reinforcement learning of the present invention is as follows:
s1, the client collects the current state, including network environment, video playing information and the like.
S2, inputting the current state into a strategy network of an Actor-Critic model, so that the strategy network returns the selected action, namely the bit rate of the next video block.
S3, the obtained action interacts with the environment, and the environment can give the QoE rewards contributed by the video block, so that the { state-action-rewards } pair corresponding to the current video block is obtained.
S4, the {state, action, reward} tuples are used to train the context network module and the value function network, so that the value function network better predicts the average cumulative return of different states. The parameters (θ_v, θ_c) are updated by gradient descent on the temporal-difference loss Σ_t ( r_t + γ V(s_{t+1}, z_{t+1}) − V(s_t, z_t) )², where V(s, z) denotes the value function, θ_v the parameters of the value function network, θ_c the parameters of the context network, γ the discount factor for future rewards, and r_t the reward at time t. In the present invention, "context network", "context neural network module", and "context network module" refer to the same concept and are used interchangeably. The Actor-Critic model consists of an actor network and a critic network, implemented at the algorithmic level as the policy network and the value function network, respectively: the actor network learns the policy, and the critic network evaluates the value of states in the environment. Thus, "actor network" and "policy network" are used interchangeably herein, as are "critic network" and "value function network".
S5, the value function network is used to compute the advantage of the action output by the policy network, which is then used to update the policy network:
θ ← θ + α Σ_t ∇_θ log π_θ(a_t | s_t, z_t) · (R_t − V(s_t, z_t)) + β ∇_θ H(π_θ(· | s_t, z_t))
where θ denotes the parameters of the policy network; R_t denotes the expected discounted reward at time t; α is the learning rate; γ is the discount factor for future rewards, set here to 0.99; and H(·) is an entropy term that encourages the agent to try diverse decisions. The parameter β is set to a larger value at the beginning of training to encourage exploration and is decreased over time to emphasize the environment's rewards. s_t denotes the client state of the reinforcement learning model; referring to fig. 4, z_t is the output of the context network module. π_θ denotes the actor network, i.e., the policy network, where θ is its parameter vector, and ∇_θ denotes the gradient with respect to the actor network parameters.
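The critic update in step S4 minimizes a temporal-difference error. A numeric sketch, with a table of per-step values standing in for the value function network (values here are invented, and the terminal state's value is taken as 0):

```python
def td_loss(values, rewards, gamma=0.99):
    """sum_t (r_t + gamma * V(s_{t+1}) - V(s_t))^2 over one trajectory.
    values[t] is the critic's estimate at step t; the value after the last
    step is treated as 0 (episode end)."""
    loss = 0.0
    for t, r in enumerate(rewards):
        v_next = values[t + 1] if t + 1 < len(values) else 0.0
        delta = r + gamma * v_next - values[t]
        loss += delta * delta
    return loss
```

When the critic's estimates already satisfy the Bellman relation, the loss is zero; any mismatch between predicted and bootstrapped returns produces a gradient signal for θ_v and θ_c.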
In the federated aggregation phase, the global model at the central server is denoted w_t, where the subscript denotes the round of federated training. The specific federated learning process is as follows:
(1) The central server first initializes the global model w t At this time, the 0 th round, namely, the w for initializing the model 0 Representing and transmitting the global model to each user.
(2) User i takes the initialized global model w t Then, local training is carried out, and the trained model is usedRepresenting and returning the trained model to the central server.
(3) The central server receives the models sent by the users and aggregates them into a new global model w_{t+1}. The model aggregation equation is:

w_{t+1} = Σ_{i=1}^{C} (n_i / n) · w_t^i, where n = Σ_{i=1}^{C} n_i,

n_i denotes the amount of local training data of user i, and C denotes the total number of users participating in federated learning.
(4) The central server starts a new round of federated training, i.e., it sends the new global model to each user, and each user repeats the operation of step (2).
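The aggregation in step (3) is the standard FedAvg weighted average. A minimal sketch, treating each client model as a flat list of parameters (the representation and function name are ours, chosen for illustration):

```python
def fedavg(client_models, sample_counts):
    """w_{t+1} = sum_i (n_i / n) * w_t^i: average of the client models,
    each weighted by the size n_i of that client's local training set."""
    n = sum(sample_counts)
    dim = len(client_models[0])
    aggregated = [0.0] * dim
    for model, n_i in zip(client_models, sample_counts):
        for j in range(dim):
            aggregated[j] += (n_i / n) * model[j]
    return aggregated
```

Weighting by n_i means a client that contributed three times as much data pulls the global model three times as hard toward its local optimum.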
The result of federated learning is one global model and multiple personalized models; finally, the client state is input into the personalized model to obtain the bit rate selection that maximizes the user's QoE.
In the invention, the personalized model is the user's local model. Personalization is reflected in two aspects. First, after training converges, the client's model is updated by a further round of local training (i.e., fine-tuning) on top of the global model; this stage is also called the personalization stage. Second, in the overall model design, a context network module is added that takes as input not only the client's state but also the state transitions and rewards observed after the agent acts under different states and actions, so that the context network can abstract the characteristics of the state transition equation and the reward function specific to the current environment. By feeding a series of {state-action-reward} pairs into the context network, the environmental characteristics of different users can be captured, yielding personalized models. These {state-action-reward} pairs are obtained by local training on the client during the personalization stage.
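The idea of abstracting an environment-specific latent vector z from {state-action-reward} pairs can be sketched as a small set encoder. The architecture below (one linear layer + tanh, then mean pooling) and all sizes are illustrative assumptions of ours; the patent does not specify the context network's internals.

```python
import math
import random

random.seed(0)  # fixed seed so the illustrative weights are reproducible

class ContextEncoder:
    """Sketch of the context network idea: embed each {state, action, reward}
    transition, then mean-pool the embeddings into one latent vector z that
    summarizes the client's environment (its dynamics and reward function)."""

    def __init__(self, in_dim, z_dim):
        # one random linear layer stands in for the learned embedding
        self.W = [[random.gauss(0, 0.1) for _ in range(z_dim)]
                  for _ in range(in_dim)]

    def encode(self, transitions):
        """transitions: list of flat [state..., action, reward] vectors."""
        z_dim = len(self.W[0])
        pooled = [0.0] * z_dim
        for x in transitions:
            for k in range(z_dim):
                pooled[k] += math.tanh(sum(x[j] * self.W[j][k]
                                           for j in range(len(x))))
        # mean pooling makes z invariant to the order of the transitions
        return [p / len(transitions) for p in pooled]
```

Mean pooling is a natural choice here because the set of observed transitions has no meaningful order; the same client should produce the same z regardless of how its history is shuffled.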
The preferred embodiments of the present invention have been described in detail above, but the present invention is not limited to the specific details of the above embodiments, and various equivalent changes can be made to the technical solution of the present invention within the scope of the technical concept of the present invention, and all the equivalent changes belong to the protection scope of the present invention.
Claims (7)
1. A streaming media bit rate adaptive adjustment method based on personalized federated reinforcement learning, characterized in that the method comprises the following steps:
(1) Establishing a reinforcement learning model, wherein the environment state comprises: the network throughput measurements of the past several video blocks, the download times of the past several video blocks, the sizes of the next video block at each available bit rate, the current buffer occupancy, the number of blocks remaining in the current video, and the bit rate of the last downloaded video block; the client's action is selecting the bit rate of the next video block; the reward returned by the environment is the contribution of the currently selected video block to QoE; and the objective equation aims to maximize the user's quality of experience, which comprises video sharpness, client rebuffering time, and bit rate smoothness of the video;
(2) The client performs deep reinforcement learning locally, as follows: the client collects the current state and inputs it into the policy network of the reinforcement learning model, which returns a selected action; the action is executed in the environment, yielding {state-action-reward} pairs; the value function network is trained with these {state-action-reward} pairs, and the policy network is trained using the value function; these operations are repeated until the model converges;
(3) Federated learning is performed between the clients and the central server, as follows: each client returns its locally trained model to the central server; the central server receives the models sent by the clients and aggregates them into a new global model; the central server sends the new global model to the clients, which continue the operations of local training and uploading; the final learning result is one global model and multiple personalized models, representing the mapping rule from states to actions;
(4) Inputting the client state into the trained personalized model to obtain the bit rate that maximizes the user's QoE.
2. The method of claim 1, wherein the objective equation is expressed as:

QoE = Σ_{n=1}^{N} q(R_n) − μ Σ_{n=1}^{N} T_n − τ Σ_{n=1}^{N−1} |q(R_{n+1}) − q(R_n)|

wherein N denotes the total number of blocks of the current video; R_n denotes the video bit rate of each block n; T_n denotes the rebuffering time of each block n; q(R_n) is a function mapping bit rate R_n to user-perceived video quality; and μ and τ are the non-negative weighting parameters of the rebuffering time and of the smoothness of video quality changes, respectively.
3. The method of claim 1, wherein the reinforcement learning method used is the Actor-Critic algorithm, with a contextual neural network module added on top of the Actor-Critic algorithm; the contextual neural network module takes the states of the clients and the rewards returned by the environment as inputs and outputs a latent representation vector of the current environment, and this vector is ultimately used as an input to the value function network to guide the learning of the value function network.
4. The method according to claim 3, wherein the update equation for the context network and the value function network in reinforcement learning training is:

θ_v ← θ_v − α′ ∇_{θ_v} (R_t − V(s_t, z_t; θ_v, θ_c))², with R_t = Σ_{k≥0} γ^k r_{t+k},

wherein V(s, z) denotes the value function, θ_v denotes the parameters of the value function network, θ_c denotes the parameters of the context network, γ is the discount factor for future rewards, and r_t denotes the reward at time t.
5. The method according to claim 3, wherein in reinforcement learning training, the update equation of the policy network is:

θ ← θ + α ∇_θ log π_θ(a_t | s_t, z_t) (R_t − V(s_t, z_t)) + β ∇_θ H(π_θ(· | s_t, z_t)),

wherein θ denotes the parameters of the policy network; R_t denotes the expected discounted reward at time t; α is the learning rate; γ is the discount factor for future rewards; β is a training parameter; a_t denotes the action of the reinforcement learning model; s_t denotes the client state of the reinforcement learning model; z_t denotes the output of the contextual network module; π_θ denotes the policy network; ∇_θ denotes the gradient for updating the policy network parameters; and H(·) is an entropy term.
6. The method of claim 1, wherein the federated learning method used is the FedAvg algorithm, and the global model is computed as a weighted average of the models sent back to the central server by the respective clients.
7. The method of claim 1, wherein the federated learning model aggregation equation is:

w_{t+1} = Σ_{i=1}^{C} (n_i / n) · w_t^i, where n = Σ_{i=1}^{C} n_i,

wherein n_i denotes the amount of local training data of client i, C denotes the total number of clients participating in federated learning, and w_t^i denotes the model sent back by client i in round t of training.
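The QoE objective of claim 2 can be computed directly from a downloaded trace. In this sketch, q, μ, and τ are passed in explicitly because the patent leaves the quality mapping and the weight values unspecified; the function name and example values are ours.

```python
def qoe(bitrates, rebuffer_times, q, mu, tau):
    """QoE = sum q(R_n) - mu * sum T_n - tau * sum |q(R_{n+1}) - q(R_n)|.

    bitrates       -- bit rate R_n chosen for each downloaded block
    rebuffer_times -- rebuffering time T_n incurred by each block
    q              -- maps bit rate to perceived video quality
    mu, tau        -- non-negative weights for rebuffering and smoothness
    """
    quality = sum(q(r) for r in bitrates)
    rebuffer = mu * sum(rebuffer_times)
    smoothness = tau * sum(abs(q(b) - q(a))
                           for a, b in zip(bitrates, bitrates[1:]))
    return quality - rebuffer - smoothness
```

The three terms make the trade-off explicit: raising the bit rate increases the quality term but risks rebuffering penalties, while frequent bit rate switches are charged through the smoothness term.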
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310349691.5A CN116320620A (en) | 2023-04-04 | 2023-04-04 | Stream media bit rate self-adaptive adjusting method based on personalized federal reinforcement learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116320620A true CN116320620A (en) | 2023-06-23 |
Family
ID=86799658
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310349691.5A Pending CN116320620A (en) | 2023-04-04 | 2023-04-04 | Stream media bit rate self-adaptive adjusting method based on personalized federal reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116320620A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117376661A (en) * | 2023-12-06 | 2024-01-09 | 山东大学 | Fine-granularity video stream self-adaptive adjusting system and method based on neural network |
CN117376661B (en) * | 2023-12-06 | 2024-02-27 | 山东大学 | Fine-granularity video stream self-adaptive adjusting system and method based on neural network |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |