CN116320620A - Streaming media bit rate adaptive adjustment method based on personalized federated reinforcement learning - Google Patents

Streaming media bit rate adaptive adjustment method based on personalized federated reinforcement learning

Info

Publication number
CN116320620A
CN116320620A (application number CN202310349691.5A)
Authority
CN
China
Prior art keywords
network
bit rate
model
reinforcement learning
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310349691.5A
Other languages
Chinese (zh)
Inventor
李文中
徐业婷
陆桑璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University filed Critical Nanjing University
Priority to CN202310349691.5A
Publication of CN116320620A
Legal status: Pending

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 - Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402 - Processing of video elementary streams, involving reformatting operations of video signals for household redistribution, storage or real-time display
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 - Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/442 - Monitoring of processes or resources, e.g. detecting the failure of a recording device, monitoring the downstream bandwidth, the number of times a movie has been viewed, the storage space available from the internal hard disk
    • H04N21/44209 - Monitoring of downstream path of the transmission network originating from a server, e.g. bandwidth variations of a wireless network
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04N - PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 - Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 - Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45 - Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/466 - Learning process for intelligent management, e.g. learning user preferences for recommending movies
    • H04N21/4662 - Learning process for intelligent management, characterized by learning algorithms
    • H04N21/4666 - Learning process for intelligent management, characterized by learning algorithms using neural networks, e.g. processing the feedback provided by the user
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer And Data Communications (AREA)

Abstract

The invention discloses a streaming media bit rate adaptive adjustment method based on personalized federated reinforcement learning. In an HTTP-based dynamic adaptive streaming system, the bit rate adaptation process is formalized as a Markov decision process built on personalized federated learning and deep reinforcement learning. Each user learns a bit rate adaptation strategy locally using reinforcement learning, with an objective equation that maximizes the user's quality of experience. Federated learning coordinates the users and a central server to train a global model, and each user then trains a personalized model on top of the global model using its local data. After sufficient training, the user can use the personalized model to select the bit rate that maximizes the objective equation under the current network conditions. The invention copes with the heavy-tailed characteristics of network environments and user behavior while protecting user privacy from disclosure.

Description

Streaming media bit rate adaptive adjustment method based on personalized federated reinforcement learning
Technical Field
The invention relates to the application of machine learning techniques in the field of video streaming media, and in particular to a streaming media bit rate adaptive adjustment method based on personalized federated reinforcement learning.
Background
Reinforcement learning is a branch of machine learning in which an intelligent system learns a mapping from environment to behavior by trial and error so as to obtain the greatest reward. Reinforcement learning is typically described by a Markov decision process (MDP), as shown in FIG. 1: the machine is in an environment E with state space S, where each state s ∈ S is the agent's observation of the environment, and the actions the machine can take constitute an action space A. When action a is executed in the current state s, the environment transitions to another state with some probability according to a state transition function P, and a reward is fed back to the machine according to a reward function R. Reinforcement learning can be used to solve the Markov decision process; when the state or action space is large, the MDP can be better solved with deep reinforcement learning (DRL).
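For illustration only, the agent-environment interaction just described can be written as the following minimal Python sketch; the environment interface (reset/step) and the choose_action function are hypothetical placeholders and not part of the invention.

```python
# Minimal sketch of the MDP interaction loop described above.
# Env (reset/step) and choose_action are hypothetical placeholders.

def run_episode(env, choose_action):
    """Run one episode: observe state s, take action a, receive reward r."""
    s = env.reset()                     # initial observation of environment E
    total_reward, done = 0.0, False
    while not done:
        a = choose_action(s)            # action a drawn from action space A
        s_next, r, done = env.step(a)   # transition via P, reward via R
        total_reward += r
        s = s_next
    return total_reward
```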
Today, video streaming accounts for more than 80% of Internet traffic. HTTP adaptive bit rate streaming (HAS) is the mainstream video streaming solution. It splits the video content into blocks of 2 to 4 seconds and delivers these blocks to the user over the HTTP protocol, like ordinary web content. Dynamic adaptive streaming over HTTP (DASH) is the first open-source solution for adaptive bit rate streaming over HTTP. In a DASH system, each block is available in one or more bit rate representations. An adaptive bit rate (ABR) algorithm dynamically selects the bit rate of each video block for the client video player, based on network conditions, device capabilities, and user preferences, so as to maximize the user's quality of experience (QoE). However, selecting an appropriate bit rate in a dynamic network is challenging, because network bandwidth is limited and the individual QoE metrics conflict with each other and must be traded off.
As an important component of video streaming systems, ABR algorithms have been widely studied. Conventional model-based ABR algorithms include bandwidth-based algorithms such as FESTIVE and buffer-based algorithms such as BOLA and BBA. These algorithms select the bit rate by modeling the network conditions, either by estimating the current network bandwidth or based solely on the occupancy of the current playback buffer. They do not achieve optimal performance because they do not use all of the available information when making bit rate decisions. MPC combines a model predictive control algorithm with network throughput estimation and buffer occupancy information to select bit rates that maximize QoE over a horizon of several future video blocks. However, its fixed control rules prevent MPC from adapting to the wide range of network conditions and the different QoE objectives found in the real world. Recent work such as Stick and Fugu uses machine learning algorithms to generate ABR policies. Machine-learning-based ABR algorithms take observed raw data (e.g., network throughput, playback buffer occupancy, video block size) as the input of a neural network and output the predicted bandwidth, download time, or bit rate of the next video block. Despite the flexibility and effectiveness of ABR algorithms based on deep reinforcement learning, various challenges remain in applying them to practical video streaming systems. On the one hand, it is difficult to collect training data from video streaming sessions, which often involves user privacy; uploading users' video viewing information from various network environments to a central server for reinforcement learning training may cause serious privacy problems. On the other hand, training a single unified DRL model for heterogeneous clients is not feasible given the complexity and diversity of network conditions, and it is also difficult to handle user behavior that changes over time.
Disclosure of Invention
The invention aims to: in order to overcome the shortcomings of existing bit rate adaptation algorithms, the invention provides a streaming media bit rate adaptive adjustment method based on personalized federated reinforcement learning, so as to improve the user's quality of experience.
The technical scheme is as follows: in order to achieve the above object, the streaming media bit rate adaptive adjustment method based on personalized federated reinforcement learning of the present invention comprises the following steps:
(1) Establishing a reinforcement learning model, wherein the environment state comprises: the network throughput measurements of several past video blocks, the download times of several past video blocks, the size of the next video block at each candidate bit rate, the current buffer occupancy, the number of blocks remaining in the current video, and the bit rate of the last downloaded video block; the client's action is selecting the bit rate of the next video block; the reward returned by the environment is the contribution of the currently selected video block to QoE; the goal of the objective equation is to maximize the user's quality of experience, which comprises video sharpness, client rebuffering time, and bit rate smoothness of the video;
(2) The client performs deep reinforcement learning locally, as follows: the client collects the current state and inputs it into the policy network of the reinforcement learning model, which returns the selected action; the obtained action interacts with the environment, yielding {state, action, reward} tuples; the value function network is trained using these {state, action, reward} tuples, the policy network is trained using the value function, and these operations are repeated until the model converges;
(3) Federated learning is performed between the clients and the central server, as follows: each client returns its locally trained model to the central server, the central server receives the models sent by the clients and aggregates them into a new global model, the central server sends the new global model back to the clients, and the clients continue the local training and uploading operations; the final learning result is one global model and a plurality of personalized models, which represent the mapping rules from states to actions;
(4) Inputting the client state into the trained personalized model to obtain the bit rate that maximizes the user's QoE.
Further, the objective equation is expressed as:

QoE = \sum_{n=1}^{N} q(R_n) - \mu \sum_{n=1}^{N} T_n - \tau \sum_{n=1}^{N-1} \left| q(R_{n+1}) - q(R_n) \right|

wherein N represents the total number of blocks of the current video; R_n represents the video bit rate of each block n; T_n represents the rebuffering time of each block n; q(R_n) is a function mapping the bit rate R_n to user-perceived video quality; and μ and τ are non-negative weighting parameters for the rebuffering time and the smoothness of video quality changes, respectively.
Further, the reinforcement learning method used is an Actor-Critic algorithm, onto which a context neural network module is added; the context neural network module takes the states of the client and the rewards returned by the environment as inputs and outputs a latent representation vector of the current environment, which is finally used as part of the input of the value function network to guide the learning of the value function network.
Further, the update equation of the context network and the value function network in reinforcement learning training is:

(\theta_v, \theta_c) \leftarrow (\theta_v, \theta_c) - \alpha \nabla_{\theta_v, \theta_c} \left( r_t + \gamma V(s_{t+1}, z_{t+1}; \theta_v, \theta_c) - V(s_t, z_t; \theta_v, \theta_c) \right)^2

wherein V(s, z) represents the value function, θ_v represents the parameters of the value function network, θ_c represents the parameters of the context network, α is the learning rate, γ is the discount factor for future rewards, and r_t denotes the reward at time t.
Further, in reinforcement learning training, the update equation of the policy network is:

\theta \leftarrow \theta + \alpha \sum_t \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \left( R_t - V(s_t, z_t) \right) + \beta \nabla_{\theta} H\left( \pi_{\theta}(\cdot \mid s_t) \right)

wherein θ represents the parameters of the policy network; R_t represents the expected discounted reward at time t; α is the learning rate; γ is the discount factor for future rewards; β is a training parameter; a_t represents the action of the reinforcement learning model; s_t represents the client state of the reinforcement learning model; z_t represents the output of the context network module; π_θ represents the policy network; ∇_θ denotes the gradient with respect to the policy network parameters; and H(·) is an entropy term.
Further, the federated learning method used is the FedAvg algorithm: the global model is computed by taking a weighted average of the models sent back to the central server by the individual clients.
Further, the aggregation equation of the federated learning model is:

w_{t+1} = \sum_{i=1}^{C} \frac{n_i}{n} w_{t+1}^{i}, \quad n = \sum_{i=1}^{C} n_i

wherein n_i represents the amount of local training data of client i, C represents the total number of clients participating in federated learning, and w_{t+1}^{i} represents the locally trained model sent back by client i in the t-th round of training.
The beneficial effects are as follows: the invention provides a streaming media bit rate adaptation method based on personalized federated learning. Each user learns a bit rate adaptation strategy locally using reinforcement learning. The state of the current environment is represented by information such as the client's network throughput measurements, the download times of several past video blocks, and the current buffer occupancy. The bit rate selection for the next video block is defined as an action. An objective equation is established whose goal is to maximize the user's quality of experience. Federated learning coordinates the users and a central server to train a global model, and each user trains a personalized model on top of the global model using its local data. After sufficient training, the user can use the personalized model to select the bit rate that maximizes the objective equation under the current network conditions. The invention solves the privacy-leakage problem of learning-based algorithms by using federated learning. Meanwhile, to cope with the heavy-tailed characteristics of network environments and user behavior, a personalized local model is trained for each user on top of the global model obtained by federated learning, so that better performance can be achieved in the local environment. Moreover, a newly added user can also directly download the trained global model and fine-tune it with a small amount of local data, obtaining a local model that is better suited to the local environment.
Drawings
FIG. 1 is a reinforcement learning illustration;
FIG. 2 is a schematic diagram of the personalized federated reinforcement learning framework of the present invention;
FIG. 3 is a flow chart of the streaming media bit rate adaptive adjustment method based on personalized federated reinforcement learning of the present invention;
FIG. 4 is a diagram of the reinforcement learning model used in the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
The invention provides a streaming media bit rate adaptive adjustment method based on personalized federated reinforcement learning, in which, in an HTTP-based dynamic adaptive streaming system, the bit rate adaptation process is formalized as a Markov decision process built on personalized federated learning and deep reinforcement learning. Referring to FIG. 2, users (i.e., clients) learn bit rate adaptation strategies locally using reinforcement learning, multiple users and a central server train a global model through federated learning, and each user trains a personalized model on top of the global model using local data. After sufficient training, a user can use the personalized model to select the bit rate so as to maximize the quality of user experience under the current network conditions.
Referring to FIG. 3, the training framework of the bit rate adaptation method of the present invention mainly comprises three phases: a local training phase, a federated aggregation phase, and a personalized adaptation phase. In the local training phase, the deep reinforcement learning algorithm Actor-Critic is used for training to complete the local model update; in the federated aggregation phase, the FedAvg algorithm is used for model aggregation; in the personalized adaptation phase, the global model is downloaded and personalized adaptation is performed locally by fine-tuning. The specific flow is as follows:
(1) First, the central server initializes the global model w_0 and distributes it to the clients.
(2) In round t, after client c receives the global model w_t, it trains the model using its local data; after several local updates, it sends the updated model w_{t+1}^{c} back to the central server.
(3) The central server receives the local models w_{t+1}^{c} sent back by the clients and aggregates them to obtain a new global model w_{t+1}.
(4) Steps (2) and (3) are repeated until the model converges; after the model converges, step (5) can be performed.
(5) The global model is fine-tuned using the client's local data to obtain a personalized model. A new client that did not participate in federated training can download the trained global model before it starts viewing video and then fine-tune it with local data. (A sketch of this overall training loop follows these steps.)
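To make the above flow concrete, the following is a minimal Python sketch of the training loop, assuming PyTorch-style models whose parameters can be exported and loaded as state dictionaries; the client interface (local_train, num_samples) and the helper fedavg_aggregate are illustrative assumptions rather than part of the claimed method.

```python
import copy

def federated_training(server_model, clients, rounds, fedavg_aggregate):
    """Sketch of steps (1)-(4): broadcast the global model, train locally, aggregate."""
    global_weights = server_model.state_dict()                # step (1): w_0
    for t in range(rounds):                                   # one federated round
        local_weights, local_sizes = [], []
        for c in clients:                                     # step (2): local training
            w_c = c.local_train(copy.deepcopy(global_weights))
            local_weights.append(w_c)
            local_sizes.append(c.num_samples)
        global_weights = fedavg_aggregate(local_weights, local_sizes)  # step (3)
    return global_weights                                     # step (5) fine-tunes this model
```

A weighted-average implementation of fedavg_aggregate is sketched in the federated aggregation phase below, and the fine-tuning of step (5) is sketched in the personalization discussion at the end of this description.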
The global model and the local models in this process are all reinforcement learning models (referred to simply as models); their structures are the same, but their parameters differ. The basic principles of the reinforcement learning model have been described above in connection with FIG. 1. A reinforcement learning model is often described by a Markov decision process, sometimes referred to as a Markov model, whose basic elements are states, actions, and rewards. According to an embodiment of the present invention, the following Markov model (i.e., reinforcement learning model) is built:
the environmental state observed by the client is defined as { x } tt ,n t ,b t ,c t ,l t The meaning of each parameter is as follows: t represents the current point in time. X is x t Refers to network throughput measurement of the past k video blocks. τ t Refers to the download time of the past k video blocks. n is n t Refers to the size of the next video block under various bit rate conditions. b t Refers to the current buffer occupancy, b when video is rebuffering t Is 0.c t Refers to the number of blocks remaining for the current video. l (L) t Refers to the bit rate of the last downloaded video block. In the embodiment, k is set to 7. The state of the client not only reflects the current network state of each client, such as the past network throughput measurement, but also reflects the state of the client watching the video, such as the number of blocks remaining in the current video, the size of the video blocks under various bit rate conditions, and the like.
The action of the client is defined as a_t and represents the bit rate selected for the next video block. The candidate bit rates include, but are not limited to, 300, 750, 1200, 1850, 2850, and 4300 kbps.
The reward returned by the environment is defined as r_t and represents the QoE reward brought by the bit rate selected for the current block.
The objective equation is defined as:

QoE = \sum_{n=1}^{N} q(R_n) - \mu \sum_{n=1}^{N} T_n - \tau \sum_{n=1}^{N-1} \left| q(R_{n+1}) - q(R_n) \right|

where N represents the total number of blocks of the current video; R_n represents the video bit rate of each block n; T_n represents the rebuffering time of each block n; q(R_n) is a function mapping the bit rate R_n to user-perceived video quality; and μ and τ are non-negative weighting parameters for the rebuffering time and the smoothness of video quality changes, respectively. A relatively small μ indicates that the user is not particularly concerned with rebuffering time.
The goal of the objective equation is to maximize the user's quality of experience, which comprises video sharpness, rebuffering, and video smoothness; different users have different preferences over these three aspects of quality of experience. In the present invention, the experienced video sharpness can be expressed in terms of the video bit rate.
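The per-block reward used during training could be computed as in the sketch below, which follows the objective equation term by term; taking q(·) to be the bit rate in Mbps and the particular weight values are assumptions made for illustration, since the description only requires q to map bit rate to perceived quality and μ, τ to be non-negative.

```python
def qoe_reward(bitrate_kbps, prev_bitrate_kbps, rebuffer_s, mu=4.3, tau=1.0):
    """Per-block QoE contribution: q(R_n) - mu * T_n - tau * |q(R_n) - q(R_{n-1})|.
    q is taken here to be the bit rate in Mbps; mu and tau are example weights."""
    q = bitrate_kbps / 1000.0
    q_prev = prev_bitrate_kbps / 1000.0
    return q - mu * rebuffer_s - tau * abs(q - q_prev)
```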
In the embodiment of the invention, the learning process is carried out by modeling the network environment. The network environment model can be given different parameters according to different environment requirements, so that different rules are learned for different network environments and satisfactory results can be obtained in different networks.
The network environment takes the following factors into account: the network bandwidth of each user participating in federated learning and the transmission delay of each network. Since network bandwidth and transmission delay are two important and measurable factors commonly used to characterize network conditions, the present invention uses these two quantities to describe a user's network environment. A video streaming client model is also established: the simulated video streaming application requests video blocks from the server; if a video block does not arrive in time, rebuffering occurs; and after one video finishes, playback of the next video starts automatically.
In the embodiment of the invention, the number of users participating in federated learning is set to 3, the network bandwidths of the users differ, and the FCC, HSDPA, and Sydney data sets are used to simulate WiFi, 3G, and 4G environments, respectively. The transmission delay of each network is set to 80 milliseconds. In the video streaming client model, the client randomly selects a network trace each time to simulate the current network bandwidth.
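A rough Python sketch of how such a trace-driven client model might compute the download time, rebuffering, and buffer update for one block is given below; the constant per-block bandwidth sample, the 4-second block duration, and the function name simulate_chunk are simplifying assumptions, while the 80 ms delay follows the embodiment.

```python
def simulate_chunk(chunk_bytes, bandwidth_bps, buffer_s, delay_s=0.08, chunk_dur_s=4.0):
    """Download one block at the trace-sampled bandwidth and update the playback buffer.
    Returns (download_time_s, rebuffer_s, new_buffer_s)."""
    download_time = delay_s + chunk_bytes * 8.0 / bandwidth_bps
    rebuffer = max(0.0, download_time - buffer_s)              # playback stalls if the buffer runs dry
    new_buffer = max(0.0, buffer_s - download_time) + chunk_dur_s
    return download_time, rebuffer, new_buffer
```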
In the local training phase, the client performs deep reinforcement learning as follows: the client observes the network environment and the video playback information as the state of the current environment and first inputs this state into the policy network, which outputs an action according to the current state, namely the bit rate of the next video block; the obtained action interacts with the environment, and the environment returns the QoE reward contributed by that video block, yielding the {state, action, reward} tuple corresponding to the current video block; the value function network is trained with these {state, action, reward} tuples so that it can better predict the average cumulative return of different states; the advantage of the action output by the policy network is computed using the trained value function network, and the policy network is updated accordingly; these operations are then repeated until the model converges.
The reinforcement learning model used for learning in the present invention is shown in FIG. 4. It should be noted that the reinforcement learning model used in the present invention adds a context module, implemented with a long short-term memory (LSTM) network, on top of the Actor-Critic structure. The context module takes both the state of the client and the reward returned by the environment as input, and its output is a latent representation vector of the current environment, which is finally used as part of the input of the value function network to guide the learning of the value function network. When a new client joins or a client's network environment changes, the context network can output a new representation vector reflecting that change.
Referring to fig. 4, the detailed procedure of reinforcement learning of the present invention is as follows:
s1, the client collects the current state, including network environment, video playing information and the like.
S2, the current state is input into the policy network of the Actor-Critic model, and the policy network returns the selected action, namely the bit rate of the next video block.
S3, the obtained action interacts with the environment, and the environment returns the QoE reward contributed by that video block, yielding the {state, action, reward} tuple corresponding to the current video block.
S4, the context network module and the value function network are trained using the {state, action, reward} tuples, so that the value function network can better predict the average cumulative return of different states. The update equation is:

(\theta_v, \theta_c) \leftarrow (\theta_v, \theta_c) - \alpha \nabla_{\theta_v, \theta_c} \left( r_t + \gamma V(s_{t+1}, z_{t+1}; \theta_v, \theta_c) - V(s_t, z_t; \theta_v, \theta_c) \right)^2

wherein V(s, z) represents the value function, θ_v represents the parameters of the value function network, θ_c represents the parameters of the context network, α is the learning rate, γ is the discount factor for future rewards, and r_t denotes the reward at time t. In the present invention, the terms context network, context neural network module, and context network module refer to the same concept and are used interchangeably. The Actor-Critic model comprises two parts, an actor (actor network) and a critic (critic network), which at the algorithm level are implemented by the policy network and the value function network, respectively: the actor network learns the policy, and the critic network evaluates the value of states in the environment. Thus, actor network and policy network are used interchangeably herein, as are critic network and value function network. (A code sketch of this update is given after step S5.)
S5, the advantage of the action output by the policy network is computed using the value function network, and the policy network is updated accordingly:

\theta \leftarrow \theta + \alpha \sum_t \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \left( R_t - V(s_t, z_t) \right) + \beta \nabla_{\theta} H\left( \pi_{\theta}(\cdot \mid s_t) \right)

wherein θ represents the parameters of the policy network; R_t represents the expected discounted reward at time t; α is the learning rate; γ is the discount factor for future rewards, here set to 0.99; and H(·) is an entropy term that encourages the agent to make diverse decisions. The parameter β is set to a larger value at the beginning of training to encourage exploration and is decreased over time to emphasize the environment's reward. s_t represents the client state of the reinforcement learning model, and, referring to FIG. 4, z_t is the output of the context network module. π_θ represents the actor network, i.e., the policy network, where θ denotes the actor network's parameters, and ∇_θ denotes the gradient with respect to the actor network's parameters.
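The following is a minimal PyTorch sketch of the S4 update, assuming an LSTM-based context encoder and a one-step temporal-difference target; the layer sizes, the batch layout, and the names ContextCritic and critic_update are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ContextCritic(nn.Module):
    """LSTM context encoder producing z_t, plus a value head V(s_t, z_t)."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.context = nn.LSTM(state_dim + action_dim + 1, hidden, batch_first=True)
        self.value = nn.Sequential(nn.Linear(state_dim + hidden, 128),
                                   nn.ReLU(), nn.Linear(128, 1))

    def forward(self, states, actions, rewards):
        # Encode the {state, action, reward} history into a latent vector z_t.
        seq = torch.cat([states, actions, rewards.unsqueeze(-1)], dim=-1)
        _, (h, _) = self.context(seq)
        z = h[-1]                                             # latent environment representation
        v = self.value(torch.cat([states[:, -1], z], dim=-1))
        return v.squeeze(-1), z

def critic_update(model, optimizer, hist_t, hist_next, r_t, gamma=0.99):
    """One TD(0) step on (r_t + gamma * V(s_{t+1}, z_{t+1}) - V(s_t, z_t))^2,
    updating both the value-function and the context parameters."""
    v_t, _ = model(*hist_t)              # hist_t = (states, actions, rewards) up to time t
    with torch.no_grad():
        v_next, _ = model(*hist_next)    # the same history shifted forward by one step
    loss = ((r_t + gamma * v_next - v_t) ** 2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```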
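Similarly, a minimal sketch of the S5 policy update with the advantage and entropy terms is given below; the discrete softmax policy over the candidate bit rates, the hidden-layer size, and the names Actor and actor_update are assumptions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network pi_theta(a_t | s_t): a softmax over the candidate bit rates."""
    def __init__(self, state_dim, num_bitrates=6, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, num_bitrates))

    def forward(self, states):
        return torch.distributions.Categorical(logits=self.net(states))

def actor_update(actor, optimizer, states, actions, returns, values, beta=0.1):
    """Policy-gradient step: log pi(a_t|s_t) * (R_t - V(s_t, z_t)) plus an entropy bonus."""
    dist = actor(states)
    advantage = (returns - values).detach()                   # R_t - V(s_t, z_t) from the critic
    loss = -(dist.log_prob(actions) * advantage).mean() - beta * dist.entropy().mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```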
In the federated aggregation phase, the central server's global model is denoted by w_t, where the subscript t indicates the number of federated training rounds. The specific federated learning process is as follows:
(1) The central server first initializes the global model w_t; at this point, in round 0, the initialized model is denoted w_0; the server sends this global model to each user.
(2) After user i receives the global model w_t, it performs local training; the trained model is denoted w_{t+1}^{i} and is returned to the central server.
(3) The central server receives the models sent by the users and aggregates them into a new global model w_{t+1}. The model aggregation equation (a weighted average, sketched after step (4)) is:

w_{t+1} = \sum_{i=1}^{C} \frac{n_i}{n} w_{t+1}^{i}, \quad n = \sum_{i=1}^{C} n_i

wherein n_i represents the amount of local training data of user i and C represents the total number of users participating in federated learning.
(4) The central server starts a new round of federated training, i.e., it sends the new global model to each user, and each user repeats the operation of step (2).
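A minimal sketch of this weighted-average aggregation over PyTorch state dictionaries follows; the function name fedavg_aggregate and the dictionary-based model representation are assumptions.

```python
import torch

def fedavg_aggregate(client_weights, client_sizes):
    """Weighted average: w_{t+1} = sum_i (n_i / n) * w_{t+1}^i, with n = sum_i n_i."""
    n = float(sum(client_sizes))
    new_global = {}
    for key in client_weights[0]:
        new_global[key] = sum((n_i / n) * w[key].float()
                              for n_i, w in zip(client_sizes, client_weights))
    return new_global
```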
The result of federated learning is one global model and a plurality of personalized models; finally, the client state is input into the personalized model to obtain the bit rate selection that maximizes the user's QoE.
In the invention, the personalized model is the local model of a user, and personalization is reflected in two aspects. First, after training converges, the client's model is updated by performing local training (i.e., fine-tuning) again on top of the global model; this stage is also called the personalization stage. Second, in the overall model design, the added context network module takes as input not only the client's state but also the state transitions after the agent acts and the rewards obtained under different states and actions, so that the context network can abstract the characteristics of the state transition function and the reward function of the current environment. By feeding a sequence of {state, action, reward} tuples into the context network, the environmental characteristics of different users can be captured, yielding personalized models. These {state, action, reward} tuples are obtained through the client's local training during the personalization phase. (A sketch of the fine-tuning step follows.)
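For illustration, the personalization stage might look like the sketch below: the client loads the converged global model and continues a small amount of local actor-critic training on its own traces; the client interface (build_model, local_rl_step) and the step count are assumptions.

```python
import copy

def personalize(global_weights, client, finetune_steps=100):
    """Fine-tune the downloaded global model in the client's local environment."""
    local_model = client.build_model()
    local_model.load_state_dict(copy.deepcopy(global_weights))
    for _ in range(finetune_steps):
        # Each step collects local {state, action, reward} tuples and applies the
        # same actor-critic (and context) updates used in the local training phase.
        client.local_rl_step(local_model)
    return local_model
```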
The preferred embodiments of the present invention have been described in detail above, but the present invention is not limited to the specific details of the above embodiments, and various equivalent changes can be made to the technical solution of the present invention within the scope of the technical concept of the present invention, and all the equivalent changes belong to the protection scope of the present invention.

Claims (7)

1. A streaming media bit rate adaptive adjustment method based on personalized federated reinforcement learning, characterized in that the method comprises the following steps:
(1) Establishing a reinforcement learning model, wherein the environment state comprises: the network throughput measurements of several past video blocks, the download times of several past video blocks, the size of the next video block at each candidate bit rate, the current buffer occupancy, the number of blocks remaining in the current video, and the bit rate of the last downloaded video block; the client's action is selecting the bit rate of the next video block; the reward returned by the environment is the contribution of the currently selected video block to QoE; the goal of the objective equation is to maximize the user's quality of experience, which comprises video sharpness, client rebuffering time, and bit rate smoothness of the video;
(2) The client performs deep reinforcement learning locally, as follows: the client collects the current state and inputs it into the policy network of the reinforcement learning model, which returns the selected action; the obtained action interacts with the environment, yielding {state, action, reward} tuples; the value function network is trained using the {state, action, reward} tuples, the policy network is trained using the value function, and these operations are repeated until the model converges;
(3) Federated learning is performed between the clients and the central server, as follows: each client returns its locally trained model to the central server, the central server receives the models sent by the clients and aggregates them into a new global model, the central server sends the new global model back to the clients, and the clients continue the local training and uploading operations; the final learning result is one global model and a plurality of personalized models, which represent the mapping rules from states to actions;
(4) Inputting the client state into the trained personalized model to obtain the bit rate that maximizes the user's QoE.
2. The method of claim 1, wherein the objective equation is expressed as:

QoE = \sum_{n=1}^{N} q(R_n) - \mu \sum_{n=1}^{N} T_n - \tau \sum_{n=1}^{N-1} \left| q(R_{n+1}) - q(R_n) \right|

wherein N represents the total number of blocks of the current video; R_n represents the video bit rate of each block n; T_n represents the rebuffering time of each block n; q(R_n) is a function mapping the bit rate R_n to user-perceived video quality; and μ and τ are non-negative weighting parameters for the rebuffering time and the smoothness of video quality changes, respectively.
3. The method of claim 1, wherein the reinforcement learning method used is an Actor-Critic algorithm onto which a context neural network module is added; the context neural network module takes the states of the client and the rewards returned by the environment as inputs and outputs a latent representation vector of the current environment, which is finally used as part of the input of the value function network to guide the learning of the value function network.
4. The method according to claim 3, wherein the update equation of the context network and the value function network in reinforcement learning training is:

(\theta_v, \theta_c) \leftarrow (\theta_v, \theta_c) - \alpha \nabla_{\theta_v, \theta_c} \left( r_t + \gamma V(s_{t+1}, z_{t+1}; \theta_v, \theta_c) - V(s_t, z_t; \theta_v, \theta_c) \right)^2

wherein V(s, z) represents the value function, θ_v represents the parameters of the value function network, θ_c represents the parameters of the context network, α is the learning rate, γ is the discount factor for future rewards, and r_t denotes the reward at time t.
5. The method according to claim 3, wherein, in reinforcement learning training, the update equation of the policy network is:

\theta \leftarrow \theta + \alpha \sum_t \nabla_{\theta} \log \pi_{\theta}(a_t \mid s_t) \left( R_t - V(s_t, z_t) \right) + \beta \nabla_{\theta} H\left( \pi_{\theta}(\cdot \mid s_t) \right)

wherein θ represents the parameters of the policy network; R_t represents the expected discounted reward at time t; α is the learning rate; γ is the discount factor for future rewards; β is a training parameter; a_t represents the action of the reinforcement learning model; s_t represents the client state of the reinforcement learning model; z_t represents the output of the context network module; π_θ represents the policy network; ∇_θ denotes the gradient with respect to the policy network parameters; and H(·) is an entropy term.
6. The method of claim 1, wherein the federated learning method used is the FedAvg algorithm, and the global model is computed by taking a weighted average of the models sent back to the central server by the individual clients.
7. The method of claim 1, wherein the aggregation equation of the federated learning model is:

w_{t+1} = \sum_{i=1}^{C} \frac{n_i}{n} w_{t+1}^{i}, \quad n = \sum_{i=1}^{C} n_i

wherein n_i represents the amount of local training data of client i, C represents the total number of clients participating in federated learning, and w_{t+1}^{i} represents the locally trained model sent back by client i in the t-th round of training.
CN202310349691.5A 2023-04-04 2023-04-04 Streaming media bit rate adaptive adjustment method based on personalized federated reinforcement learning Pending CN116320620A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310349691.5A CN116320620A (en) 2023-04-04 2023-04-04 Streaming media bit rate adaptive adjustment method based on personalized federated reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310349691.5A CN116320620A (en) 2023-04-04 2023-04-04 Streaming media bit rate adaptive adjustment method based on personalized federated reinforcement learning

Publications (1)

Publication Number Publication Date
CN116320620A true CN116320620A (en) 2023-06-23

Family

ID=86799658

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310349691.5A Pending CN116320620A (en) 2023-04-04 2023-04-04 Stream media bit rate self-adaptive adjusting method based on personalized federal reinforcement learning

Country Status (1)

Country Link
CN (1) CN116320620A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117376661A (en) * 2023-12-06 2024-01-09 山东大学 Fine-granularity video stream self-adaptive adjusting system and method based on neural network
CN117376661B (en) * 2023-12-06 2024-02-27 山东大学 Fine-granularity video stream self-adaptive adjusting system and method based on neural network

Similar Documents

Publication Publication Date Title
CN108063961B (en) Self-adaptive code rate video transmission method and system based on reinforcement learning
CN109768940B (en) Flow distribution method and device for multi-service SDN
Zhang et al. A multi-agent reinforcement learning approach for efficient client selection in federated learning
CN112486690B (en) Edge computing resource allocation method suitable for industrial Internet of things
CN110087109B (en) Video code rate self-adaption method and device, electronic equipment and storage medium
CN107948083B (en) SDN data center congestion control method based on reinforcement learning
CN113242469A (en) Self-adaptive video transmission configuration method and system
CN113114581A (en) TCP congestion control method and device based on multi-agent deep reinforcement learning
Cui et al. TCLiVi: Transmission control in live video streaming based on deep reinforcement learning
CN116320620A (en) Streaming media bit rate adaptive adjustment method based on personalized federated reinforcement learning
Feng et al. Vabis: Video adaptation bitrate system for time-critical live streaming
Huo et al. A meta-learning framework for learning multi-user preferences in QoE optimization of DASH
CN111211984A (en) Method and device for optimizing CDN network and electronic equipment
Hafez et al. Reinforcement learning-based rate adaptation in dynamic video streaming
CN112866756B (en) Code rate control method, device, medium and equipment for multimedia file
CN114051252A (en) Multi-user intelligent transmitting power control method in wireless access network
CN115695390B (en) Mine safety monitoring system mass video data self-adaptive streaming method based on mobile edge calculation
Naresh et al. Sac-abr: Soft actor-critic based deep reinforcement learning for adaptive bitrate streaming
Bhattacharyya et al. QFlow: A learning approach to high QoE video streaming at the wireless edge
CN116094983A (en) Intelligent routing decision method, system and storage medium based on deep reinforcement learning
Huo et al. Optimizing QoE of multiple users over DASH: A meta-learning approach
CN116455820A (en) Multi-transmission path adjustment system and method based on congestion avoidance
CN115834924A (en) Interactive video-oriented loosely-coupled coding rate-transmission rate adjusting method
Naresh et al. PPO-ABR: Proximal Policy Optimization based Deep Reinforcement Learning for Adaptive BitRate streaming
CN114173132A (en) Adaptive bit rate selection method and system for dynamic bit rate video

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination