CN117768451A - Video communication resource allocation decision method and system - Google Patents

Video communication resource allocation decision method and system

Info

Publication number
CN117768451A
Authority
CN
China
Prior art keywords
experience
learning
samples
return value
state
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311815947.3A
Other languages
Chinese (zh)
Other versions
CN117768451B (en)
Inventor
牛冠冲
贺国栋
李晓辉
黄振江
颜斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Tongze Kangwei Technology Co ltd
Guangzhou Institute of Technology of Xidian University
Original Assignee
Guangzhou Tongze Kangwei Technology Co ltd
Guangzhou Institute of Technology of Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Tongze Kangwei Technology Co ltd, Guangzhou Institute of Technology of Xidian University filed Critical Guangzhou Tongze Kangwei Technology Co ltd
Priority to CN202311815947.3A priority Critical patent/CN117768451B/en
Publication of CN117768451A publication Critical patent/CN117768451A/en
Application granted granted Critical
Publication of CN117768451B publication Critical patent/CN117768451B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a video communication resource allocation decision method and system, comprising the following steps: constructing an intelligent agent with maximizing the expected return value as its target; taking a plurality of experience samples extracted from the experience buffer as learning samples, and calculating the time difference error of each learning sample according to the multi-step return value corresponding to that learning sample; based on the time difference errors of all the learning samples, updating the expected return values corresponding to all the experience samples in the experience buffer, and distributing sampling priorities to all the learning samples to finish the optimization of the intelligent agent; and acquiring the system state in real time, acquiring the optimal action strategy at the current moment through the optimized agent, and realizing the distribution of network communication resources based on the optimal action strategy. By adopting the embodiment of the invention, the intelligent agent adapts to dynamic changes of the network in real time and selects the optimal action strategy with the maximum expected return value, so as to improve the distribution of network communication resources.

Description

Video communication resource allocation decision method and system
Technical Field
The present invention relates to the field of communication resource allocation, and in particular, to a method and system for deciding allocation of video communication resources.
Background
In the digital age, real-time communication technology has penetrated our daily lives, work and study. WebRTC (Web Real-Time Communication) is an open standard that allows Web applications to communicate real-time audio and video; its core advantage is that it is point-to-point and can transmit data directly between two communication terminals without the intervention of a central server.
At present, the common resource allocation strategies of WebRTC-SVC are as follows. The first is static resource allocation, performed according to a predefined setting or configuration; under this policy the behavior of the sender is fixed regardless of changes in network conditions. Although this method is simple and easy to implement, it lacks flexibility and cannot adapt well to changes in network conditions. The second is dynamic adjustment based on feedback, which depends on RTCP feedback received from the receiving end; the feedback provides information about the network conditions, such as packet loss rate and jitter, and the transmitting end dynamically adjusts the video layers sent according to this feedback. Although this method can adapt to network changes, it may respond with delay or overreact and cannot immediately adapt to rapid changes of the network, thereby reducing user experience. The third is that the transmitting end decides which video layers to transmit by considering the hardware and software capabilities of the receiving end; this only ensures that the receiving end can smoothly play video, but may not fully utilize the available network bandwidth. Conventional resource allocation strategies therefore struggle to balance fairness and efficiency, or to adapt to different user needs and device capabilities.
Disclosure of Invention
The embodiment of the invention provides a video communication resource allocation decision method and system, which adapt to dynamic changes of the network in real time and select the optimal action strategy with the maximum expected return value, so as to optimize the distribution of network communication resources and improve the efficiency and quality of network communication.
In order to solve the above technical problems, an embodiment of the present invention provides a method for deciding allocation of video communication resources, including:
constructing an intelligent agent with the maximum expected return value as a target; the expected return value is obtained by converting a communication index of video data transmitted to the receiving end;
taking a plurality of experience samples extracted from an experience buffer zone as learning samples, and calculating to obtain time difference errors of the learning samples according to multi-step return values corresponding to the learning samples; each learning sample comprises an expected return value obtained by the intelligent agent executing different actions under different system states and a state to which the intelligent agent is transferred by executing different actions under different system states, wherein the multi-step return value corresponding to the learning sample is the maximum value of the expected return values obtained by the intelligent agent executing different multi-step actions from different system states;
based on the time difference error of each learning sample, updating the expected return values corresponding to all the experience samples in the experience buffer zone, and distributing sampling priority to all the learning samples so as to finish the optimization of the intelligent agent;
and acquiring the system state in real time, acquiring the optimal action strategy at the current moment through the optimized agent, and realizing the distribution of network communication resources based on the optimal action strategy.
By implementing the embodiment of the invention, an intelligent agent is constructed with the maximum expected return value as its target, and the expected return value is converted from the communication index of the video data transmitted to the receiving end, which means that the intelligent agent selects the optimal action strategy according to the communication index of the video data so as to achieve the target of maximizing the expected return value. Then, a plurality of experience samples in the experience buffer are extracted to serve as learning samples; the effect of different action strategies can be evaluated more accurately according to the multi-step return values of the learning samples, the expected return values corresponding to all the experience samples in the experience buffer are updated based on the time difference errors of the learning samples, and sampling priorities are distributed to all the learning samples based on the time difference errors, providing reliable reference data for optimizing the action strategy. Finally, the optimized agent can adapt to dynamic changes of the network in real time according to the system state at the current moment and all learned experience, predict the expected return value of future network conditions according to the historical data in the experience buffer, accurately select the optimal action strategy with the maximum expected return value, and apply the optimal action strategy to the distribution of network communication resources, thereby helping to improve the efficiency and quality of network communication and provide better quality of service for users.
Preferably, before the plurality of experience samples extracted from the experience buffer are taken as learning samples, and the time difference error of each learning sample is calculated according to the multi-step return value corresponding to each learning sample, the method further comprises:
extracting a plurality of experience samples from the experience buffer as learning samples using a priority experience playback mechanism;
calculating a multi-step return value corresponding to each learning sample according to the following formula:
G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^(N−1) r_{t+N−1} + γ^N max_a Q(s_{t+N}, a);
wherein G_t represents the N-step return starting from state s_t at time t; γ represents a discount factor used to discount the value of future rewards; r_{t+i} represents the reward of the i-th step starting from time t; a represents an action taken; N represents the number of steps considered; max_a Q(s_{t+N}, a) represents the maximum expected return value obtained by performing all possible actions from the state s_{t+N} reached after the N steps starting from state s_t.
By implementing the preferred scheme of the embodiment of the invention, the priority experience playback mechanism preferentially learns the samples that are more important to, or more instructive for, the current learning target, avoids repeated learning of the same or similar experiences, and makes more effective use of limited training resources; and by calculating the multi-step return value starting from the current state, the long-term influence of different actions in the current state can be estimated more accurately, which helps the intelligent agent find an effective action strategy more quickly and accelerates the learning process.
As a preferred solution, the updating the expected return values corresponding to all the experience samples in the experience buffer based on the time difference error of each of the learning samples, and distributing sampling priorities to all the learning samples to complete the optimization of the agent is specifically as follows:
decomposing the expected return value of each learning sample into a state value function and an advantage function based on a Dueling Network structure;
respectively updating the state value function and the advantage function of each experience sample by using the time difference error of each learning sample, and respectively updating the expected return value corresponding to each experience sample by using the updated state value function and the updated advantage function;
and arranging all the learning samples according to the sequence of the time difference errors from large to small, and sequentially distributing sampling priorities from high to low to the learning samples according to the arrangement result so as to finish the optimization of the intelligent agent.
According to the preferred scheme of the embodiment of the invention, the expected return value of each learning sample is decomposed into a state value function and an advantage function based on the Dueling Network structure, so that interference and redundant information in the learning process are reduced and the learning efficiency is improved. Additionally, the learning samples are ordered according to the sequence of the time difference errors from large to small and allocated different sampling priorities, so that the samples with the greatest influence on the long-term strategy are learned preferentially and the intelligent agent converges to the optimal action strategy more quickly.
As a preferred solution, the plurality of experience samples extracted from the experience buffer are taken as learning samples, and according to the multi-step return value corresponding to each learning sample, a time difference error of each learning sample is calculated, which specifically includes:
taking a plurality of experience samples extracted from the experience buffer as learning samples;
according to a preset error algorithm, calculating the time difference error of each learning sample based on the multi-step return values of the plurality of learning samples extracted from the experience buffer and the expected return value obtained from each learning sample; the preset error algorithm is specifically:
δ_t = G_t − Q(s_t, a_t);
wherein δ_t represents the time difference error of the learning sample; G_t represents the N-step return starting from state s_t at time t; Q(s_t, a_t) represents the expected return value of the agent performing action a_t in state s_t.
By implementing the preferred scheme of the embodiment of the invention, the difference between the expected return value and the actual return value of the intelligent agent under the current strategy can be estimated by calculating the time difference error of the experience sample, so that the intelligent agent can be assisted to better explore and adjust the action strategy by using the time difference error of the experience sample later, and the robustness and the stability of the algorithm are improved.
Preferably, the obtaining of the expected return value specifically includes:
when the intelligent agent performs the action a in the system state s, acquiring a communication index of video data transmitted to a receiving end; wherein the communication index comprises video quality, smoothness and delay;
the expected return value R (s, a) obtained by the agent performing action a in system state s is calculated according to the following formula:
R(s,a)=f(video quality,smoothness,latency);
Q(s,a) = R(s,a) + γ max_a′ Q(s′, a′);
wherein R(s, a) represents the reward obtained by the agent performing action a in system state s; video quality, smoothness and latency are the input parameters of the function f, where video quality represents the video quality, smoothness represents the smoothness and latency represents the delay; γ is a discount factor representing the value of future rewards; s′ represents the new state after performing action a; max_a′ Q(s′, a′) represents the maximum expected return value obtained by performing all possible actions in the new state s′.
By implementing the preferred scheme of the embodiment of the invention, the expected return value obtained by the action of the intelligent agent under the current system state can be comprehensively and intuitively evaluated by acquiring the video quality, fluency and delay of the video data transmitted to the receiving end, thereby helping to realize more fair and comprehensive performance evaluation.
As a preferred scheme, the system state is obtained in real time, and the optimal action strategy at the current moment is obtained through the optimized agent, and the allocation of network communication resources is realized based on the optimal action strategy, specifically:
in a WebRTC-SVC environment, acquiring a system state at the current moment;
the optimal action strategy at the current moment is obtained through analysis based on the system state at the current moment by the intelligent agent for completing optimization, so that the distribution of network communication resources is realized; wherein the optimal action policy is an action performed so that the optimized agent can reach the target in the system state at the current time.
By implementing the preferred scheme of the embodiment of the invention, the system state at the current moment is obtained in the WebRTC-SVC environment, and the optimal behavior strategy at the current moment is obtained through analysis by the optimized agent, so that the network communication resources are reasonably distributed to optimize the resource utilization, reduce the delay and the like, thereby improving the performance and the user experience of the system.
In order to solve the same technical problem, the embodiment of the present invention further provides a video communication resource allocation decision system, including:
the intelligent agent construction module is used for constructing an intelligent agent with the maximum expected return value as a target; the expected return value is obtained by converting a communication index of video data transmitted to the receiving end;
the error analysis module is used for taking a plurality of experience samples extracted from the experience buffer zone as learning samples and calculating the time difference error of each learning sample according to the multi-step return value corresponding to each learning sample; each learning sample comprises an expected return value obtained by the intelligent agent executing different actions under different system states and a state to which the intelligent agent is transferred by executing different actions under different system states, wherein the multi-step return value corresponding to the learning sample is the maximum value of the expected return values obtained by the intelligent agent executing different multi-step actions from different system states;
the intelligent agent optimizing module is used for updating expected return values corresponding to all experience samples in the experience buffer zone based on time difference errors of the learning samples, and distributing sampling priorities to all the learning samples so as to finish optimizing the intelligent agent;
the allocation decision module is used for acquiring the system state in real time, acquiring the optimal action strategy at the current moment through the optimized agent, and realizing the allocation of network communication resources based on the optimal action strategy.
Preferably, the video communication resource allocation decision system further comprises:
a multi-step return analysis module for extracting a plurality of experience samples from the experience buffer using a priority experience playback mechanism as learning samples; calculating a multi-step return value corresponding to each learning sample according to the following formula:
G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^(N−1) r_{t+N−1} + γ^N max_a Q(s_{t+N}, a);
wherein G_t represents the N-step return starting from state s_t at time t; γ represents a discount factor used to discount the value of future rewards; r_{t+i} represents the reward of the i-th step starting from time t; a represents an action taken; N represents the number of steps considered; max_a Q(s_{t+N}, a) represents the maximum expected return value obtained by performing all possible actions from the state s_{t+N} reached after the N steps starting from state s_t.
As a preferred solution, the agent optimization module specifically includes:
the decomposition unit is used for decomposing the expected return value of each learning sample into a state value function and an advantage function based on the Dueling Network structure;
the updating unit is used for respectively updating the state value function and the advantage function of each experience sample by utilizing the time difference error of each learning sample, and respectively updating the expected return value corresponding to each experience sample by utilizing the updated state value function and the updated advantage function;
and the grading unit is used for arranging all the learning samples according to the sequence of the time difference errors from large to small, and sequentially distributing sampling priorities from high to low to the learning samples according to the arrangement result so as to finish the optimization of the intelligent agent.
Preferably, the video communication resource allocation decision system further comprises:
the expected return analysis module is used for acquiring a communication index of video data transmitted to the receiving end when the intelligent agent performs the action a in the system state s; wherein the communication index comprises video quality, smoothness and delay; the expected return value Q(s, a) obtained by the agent performing action a in system state s is calculated according to the following formulas:
R(s,a)=f(video quality,smoothness,latency);
Q(s,a) = R(s,a) + γ max_a′ Q(s′, a′);
wherein R(s, a) represents the reward obtained by the agent performing action a in system state s; video quality, smoothness and latency are the input parameters of the function f, where video quality represents the video quality, smoothness represents the smoothness and latency represents the delay; γ is a discount factor representing the value of future rewards; s′ represents the new state after performing action a; max_a′ Q(s′, a′) represents the maximum expected return value obtained by performing all possible actions in the new state s′.
Drawings
Fig. 1: a flow chart of a video communication resource allocation decision method provided in the first embodiment of the present invention;
fig. 2: a schematic structural diagram of a video communication resource allocation decision system is provided in the first embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Embodiment one:
in real-time video communication, particularly in a scenario using WebRTC-SVC technology, network conditions, receiver hardware conditions, and user requirements are constantly changing. Therefore, how to intelligently allocate resources based on these dynamically changing conditions to provide an optimal user experience is a key issue. WebRTC-SVC provides scalable video coding that allows coded video streams to easily adapt to different network and device conditions. In this context, embodiments of the present invention may consider the resource allocation problem as a reinforcement learning problem, where an agent needs to learn how to optimally allocate resources in interactions with the environment.
Referring to fig. 1, a decision method for video communication resource allocation provided in an embodiment of the present invention includes steps S1 to S4, where each step is specifically as follows:
step S1, constructing an intelligent agent with the maximum expected return value as a target.
The expected return value is converted from the communication index of the video data transmitted to the receiving end.
It should be noted that, the system architecture and environment definitions, and the reinforcement learning model definitions are specifically as follows:
in this scenario, the environment includes the communication network of the WebRTC-SVC, the transmitting end, the receiving end, the network status, the hardware and software capabilities of the receiving end, the quality requirements of the user, and the like.
The agent (A) is a reinforcement learning model running on the transmitting end, which needs to decide how to perform video coding and transmission according to the state of the environment to provide the best video quality.
State (S): referring to equation (1), the state is a vector describing the current system condition. Including current network bandwidth, network delay, data packet loss rate, decoding and display capabilities of the receiving end, current video dynamics and complexity, and the like. Each state corresponds to a particular system condition.
S = (s_1, s_2, ..., s_n); (1)
Action (a): referring to equation (2), the actions describe resource allocation policies that the agent may take. For example, the agent may decide which SVC layers to send, whether to adjust the coding rate, whether to change the video resolution, etc.
A = (a_1, a_2, ..., a_m); (2)
Reward (R): a number representing the reward obtained after the agent takes action a in state s. It is a value that measures how good or bad the action taken is. For example, if the user obtains smooth, high-quality video, the intelligent agent obtains a positive reward; if the video stalls or its quality degrades, the intelligent agent obtains a negative reward.
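By way of illustration only and not limitation, the following Python sketch shows one possible encoding of the state vector S of equation (1), the action set A of equation (2) and the reward signal R defined above; all field names and reward weights are assumptions introduced for the example, not elements of the claimed method.

from dataclasses import dataclass

@dataclass(frozen=True)
class SystemState:
    """One possible discretized state s = (s_1, ..., s_n); fields are illustrative."""
    bandwidth_kbps: int      # current network bandwidth
    delay_ms: int            # network delay
    loss_rate_pct: int       # data packet loss rate
    decode_level: int        # decoding/display capability of the receiving end
    complexity_level: int    # current video dynamics and complexity

@dataclass(frozen=True)
class Action:
    """One possible action a = (a_1, ..., a_m); fields are illustrative."""
    svc_layers: int          # how many SVC layers to send
    bitrate_level: int       # coding-rate adjustment level
    resolution_level: int    # video resolution level

def reward(video_quality: float, smoothness: float, latency: float) -> float:
    """A hypothetical instance of R(s, a) = f(video quality, smoothness, latency):
    smooth, high-quality video earns a positive reward; delay is penalized."""
    return 1.0 * video_quality + 0.5 * smoothness - 0.8 * latency

Because both classes are frozen dataclasses, state and action instances are hashable and can be used directly as keys of a tabular Q-value dictionary in the later sketches.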
In this embodiment, to maximize the long-term reward, the agent learns using the Q-learning algorithm. The Q value represents the expected return value of taking a given action in a given state, and the goal of the agent is to find a strategy that maximizes the expected return value Q. The reinforcement learning model is then trained through steps 1) to 5) below, yielding the agent described in step S1.
1) Data initialization:
a Q value is initialized for each possible state-action pair (s, a) to form a Q value table. These initial values may be random or set to a constant, such as 0.
2) Action selection:
in each state s, action a is selected based on the current Q value and an ε -greedy policy. Specifically, the agent randomly selects an action with a probability of ε, and selects an action with a maximum Q value with a probability of 1 ε.
3) Iterative learning:
(1) in state s, the agent performs the selected action a and, after the execution, obtains the immediate reward r and the new state s′;
(2) referring to equation (3), the Q value is updated:
Q(s, a) ← Q(s, a) + α [r + γ max_a′ Q(s′, a′) − Q(s, a)]; (3)
wherein α represents the learning rate and determines how strongly the new estimate influences the old estimate; γ is the discount factor for future rewards and determines the importance of future rewards.
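A tabular sketch of the update in equation (3), under the same assumed dictionary representation of the Q table; the default α and γ values are arbitrary choices for the example, not parameters specified by the text.

def q_update(q_table, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Equation (3): Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(q_table.get((s_next, a2), 0.0) for a2 in actions)
    q_sa = q_table.get((s, a), 0.0)
    q_table[(s, a)] = q_sa + alpha * (r + gamma * best_next - q_sa)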
4) Policy extraction and application:
based on the learned values, an action policy may be generated for each state. Where a policy is the action of choosing to maximize the value, the policy can then be applied in the actual task to guide the decision of the agent.
5) Evaluation and refinement:
to verify the validity of the learned strategy, an assessment should be made in the test environment. Depending on the evaluation, it may be necessary to adjust certain parameters, such as learning rate and exploration probability, and repeat the learning process to optimize the performance of the strategy.
Preferably, the process of obtaining the expected return value mentioned in step S1 includes steps S01 to S02, and each step is specifically as follows:
step S01, when an intelligent agent performs action a in a system state S, acquiring a communication index of video data transmitted to a receiving end; the communication indexes comprise video quality, smoothness and delay.
In step S02, referring to equations (4) and (5), the expected return value Q(s, a) obtained by the agent performing action a in system state s is calculated.
R(s,a)=f(video quality,smoothness,latency); (4)
Q(s,a) = R(s,a) + γ max_a′ Q(s′, a′); (5)
Wherein R(s, a) represents the reward obtained by the agent performing action a in system state s; video quality, smoothness and latency are the input parameters of the function f, where video quality represents the video quality, smoothness represents the smoothness and latency represents the delay; γ is a discount factor representing the value of future rewards; s′ represents the new state after performing action a; max_a′ Q(s′, a′) represents the maximum expected return value obtained by performing all possible actions in the new state s′.
The agent chooses its actions according to the Q value. Specifically, referring to equation (6), in a given state s the best action a* is selected so that the expected return value Q(s, a) is maximized, in order to obtain the maximum reward over the long term. The agent constantly interacts with the environment, collecting new states, executed actions and reward data to update the Q values and improve the resource allocation policy. To avoid falling into local optima prematurely and to better accommodate changes in the environment, the agent employs an ε-greedy strategy, i.e., it randomly chooses actions with a certain probability so as to maintain a degree of exploration.
a* = argmax_a Q(s, a); (6)
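Putting equations (4) to (6) together, the following sketch shows one interaction step of the agent. Here env.apply is a hypothetical interface standing in for the WebRTC-SVC environment (it is assumed to return the next state and the measured communication indices), and reward, epsilon_greedy and q_update are the helpers sketched above; none of these names come from the original text.

def best_action(q_table, state, actions):
    """Equation (6): a* = argmax_a Q(s, a)."""
    return max(actions, key=lambda a: q_table.get((state, a), 0.0))

def interaction_step(q_table, env, state, actions, epsilon=0.1, alpha=0.1, gamma=0.9):
    """Choose an action, observe the communication indices of the delivered video,
    convert them to a reward R(s, a) = f(...) (equation (4)), and move the Q value
    toward r + gamma * max_a' Q(s', a') (equation (5))."""
    action = epsilon_greedy(q_table, state, actions, epsilon)
    next_state, quality, smoothness, latency = env.apply(action)  # hypothetical API
    r = reward(quality, smoothness, latency)
    q_update(q_table, state, action, r, next_state, actions, alpha, gamma)
    return next_state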
Preferably, before executing step S2, the method further includes an error analysis flow, where the flow includes steps S03 to S04, and the steps are specifically as follows:
in step S03, a plurality of experience samples are extracted from the experience buffer as learning samples using a priority experience playback mechanism.
Step S04, please refer to equation (7), calculate the multi-step return value corresponding to each learning sample:
G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^(N−1) r_{t+N−1} + γ^N max_a Q(s_{t+N}, a); (7)
wherein G_t represents the N-step return starting from state s_t at time t; γ represents a discount factor used to discount the value of future rewards; r_{t+i} represents the reward of the i-th step starting from time t; a represents an action taken; N represents the number of steps considered; max_a Q(s_{t+N}, a) represents the maximum expected return value obtained by performing all possible actions from the state s_{t+N} reached after the N steps starting from state s_t.
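A sketch of steps S03 and S04, assuming each buffer entry stores (s_t, a_t, the N rewards r_t ... r_{t+N−1}, s_{t+N}) and that sampling priorities are kept in a parallel list; the prioritization exponent 0.6 is an assumed constant, not taken from the text.

import random

def sample_by_priority(priorities, batch_size, prio_exponent=0.6):
    """Priority experience playback: sample buffer indices with probability
    proportional to priority ** prio_exponent."""
    weights = [p ** prio_exponent for p in priorities]
    return random.choices(range(len(priorities)), weights=weights, k=batch_size)

def n_step_return(rewards, q_table, state_after_n, actions, gamma=0.9):
    """Equation (7): G_t = sum_{i=0..N-1} gamma^i * r_{t+i}
                          + gamma^N * max_a Q(s_{t+N}, a)."""
    g = sum((gamma ** i) * r for i, r in enumerate(rewards))
    g += (gamma ** len(rewards)) * max(q_table.get((state_after_n, a), 0.0)
                                       for a in actions)
    return g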
And S2, taking a plurality of experience samples extracted from the experience buffer as learning samples, and calculating to obtain the time difference error of each learning sample according to the multi-step return value corresponding to each learning sample.
Each learning sample comprises an expected return value obtained by an agent executing different actions in different system states and a state to which the agent is transferred by executing different actions in different system states, wherein the multi-step return value corresponding to the learning sample is the maximum value of the expected return values obtained by the agent executing different multi-step actions from different system states.
Preferably, step S2 includes steps S21 to S22, and each step is specifically as follows:
step S21, a plurality of experience samples extracted from the experience buffer are taken as learning samples.
Step S22, according to a preset error algorithm, calculating the time difference error of each learning sample based on the multi-step return values of the plurality of learning samples extracted from the experience buffer and the expected return value obtained from each learning sample; the preset error algorithm is shown in equation (8).
δ_t = G_t − Q(s_t, a_t); (8)
wherein δ_t represents the time difference error of the learning sample; G_t represents the N-step return starting from state s_t at time t; Q(s_t, a_t) represents the expected return value of the agent performing action a_t in state s_t.
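A short sketch of equation (8) applied to a sampled batch, reusing n_step_return from the previous sketch; the tuple layout of a learning sample is the same assumption as above.

def td_errors(batch, q_table, actions, gamma=0.9):
    """Equation (8): delta_t = G_t - Q(s_t, a_t) for every sampled learning sample."""
    errors = []
    for s_t, a_t, rewards, s_after_n in batch:
        g_t = n_step_return(rewards, q_table, s_after_n, actions, gamma)
        errors.append(g_t - q_table.get((s_t, a_t), 0.0))
    return errors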
And step S3, based on the time difference errors of the learning samples, updating the expected return values corresponding to all the experience samples in the experience buffer zone, and distributing sampling priorities to all the learning samples so as to finish the optimization of the intelligent agent.
Preferably, step S3 includes steps S31 to S32, and each step is specifically as follows:
step S31, based on the lasting Network structure, the expected return value of each learning sample is decomposed into a state value function V (S) and a dominance function A (S, a).
Step S32, please refer to equations (9) and (10), respectively update the state value function V (S) and the dominance function A (S, a) of each of the empirical samples by using the time difference errors of each of the learning samples, and refer to equation (11), respectively update the expected return values corresponding to each of the empirical samples by using the updated state value function and the updated dominance function.
V(s) ← V(s) + α × IS_weight × δ_t; (9)
A(s, a) ← A(s, a) + α × IS_weight × δ_t; (10)
Q(s, a) = V(s) + A(s, a) − (1/|A|) Σ_a′ A(s, a′); (11)
wherein δ_t represents the time difference error of the learning sample; Q(s, a) represents the predicted return of taking action a in state s; V(s) represents the predicted value of state s; A(s, a) represents the advantage of action a over the average action in state s; A(s, a′) represents the advantage of action a′ over the average action in state s; |A| represents the size of the action space; α represents the learning rate and determines the learning speed; IS_weight represents the importance-sampling weight in priority experience playback and is used to correct the bias caused by priority sampling.
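In the embodiment, equations (9) to (11) are applied inside a Dueling Network; the following is a simplified tabular stand-in (dictionaries for V and A instead of network heads) that illustrates only the arithmetic of the update and the recombination, and is not the claimed implementation.

def dueling_update(V, A, s, a, delta, actions, alpha=0.1, is_weight=1.0):
    """Equations (9)-(11): shift V(s) and A(s, a) by the importance-sampling
    weighted TD error, then recombine them into an updated Q(s, a) estimate."""
    V[s] = V.get(s, 0.0) + alpha * is_weight * delta                # equation (9)
    A[(s, a)] = A.get((s, a), 0.0) + alpha * is_weight * delta      # equation (10)
    mean_adv = sum(A.get((s, a2), 0.0) for a2 in actions) / len(actions)
    return V[s] + A[(s, a)] - mean_adv                              # equation (11)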
And step S33, arranging all the learning samples according to the sequence of the time difference errors from large to small, sequentially distributing high-to-low sampling priorities to the learning samples in the experience buffer according to the arrangement result, updating the sampling priorities of the learning samples in the experience buffer, and completing the optimization of the intelligent agent.
In this embodiment, the learning samples/experience samples with larger time difference errors are considered to be more "unexpected" with higher sampling probability, i.e., the larger the time difference error of the learning samples/experience samples, the higher their sampling priority.
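A sketch of the priority assignment of step S33: the sampled learning samples are ranked by the magnitude of their time difference error and their stored priorities are refreshed; the small positive constant is an assumption that keeps every sample drawable.

def refresh_priorities(priorities, sampled_indices, deltas, min_priority=1e-6):
    """Larger |delta| means a more 'unexpected' sample and a higher sampling priority."""
    for idx, delta in zip(sampled_indices, deltas):
        priorities[idx] = abs(delta) + min_priority

def rank_by_error(sampled_indices, deltas):
    """Order the sampled learning samples from largest to smallest |TD error|,
    the order in which sampling priority is assigned from high to low."""
    order = sorted(zip(sampled_indices, deltas), key=lambda p: abs(p[1]), reverse=True)
    return [idx for idx, _ in order]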
And S4, acquiring the system state in real time, acquiring the optimal action strategy at the current moment through the optimized agent, and realizing the distribution of network communication resources based on the optimal action strategy.
Referring to fig. 2, a schematic structural diagram of a video communication resource allocation decision system provided by an embodiment of the present invention includes an agent building module M1, an error analysis module M2, an agent optimizing module M3, and an allocation decision module M4, where each module is specifically as follows:
the intelligent agent constructing module M1 is used for constructing an intelligent agent with the maximum expected return value as a target; the expected return value is obtained by converting a communication index of video data transmitted to the receiving end;
the error analysis module M2 is used for taking a plurality of experience samples extracted from the experience buffer as learning samples, and calculating the time difference error of each learning sample according to the multi-step return value corresponding to each learning sample; each learning sample comprises an expected return value obtained by an agent executing different actions in different system states and a state to which the agent is transferred by executing different actions in different system states, wherein the multi-step return value corresponding to the learning sample is the maximum value in the expected return value obtained by the agent executing different multi-step actions from different system states;
the intelligent agent optimizing module M3 is used for updating expected return values corresponding to all experience samples in the experience buffer zone based on time difference errors of all the learning samples, and distributing sampling priorities for all the learning samples so as to finish optimizing the intelligent agent;
the allocation decision module M4 is used for acquiring the system state in real time, acquiring the optimal action strategy at the current moment through the optimized agent, and realizing the allocation of network communication resources based on the optimal action strategy.
As a preferred solution, referring to fig. 2, the video communication resource allocation decision system provided in the embodiment of the present invention further includes a multi-step return analysis module M5, which is specifically as follows:
a multi-step return analysis module M5 for extracting a plurality of experience samples from the experience buffer using a preferential experience playback mechanism as learning samples; calculating multi-step return values corresponding to each learning sample according to the following formula:
G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^(N−1) r_{t+N−1} + γ^N max_a Q(s_{t+N}, a);
wherein G_t represents the N-step return starting from state s_t at time t; γ represents a discount factor used to discount the value of future rewards; r_{t+i} represents the reward of the i-th step starting from time t; a represents an action taken; N represents the number of steps considered; max_a Q(s_{t+N}, a) represents the maximum expected return value obtained by performing all possible actions from the state s_{t+N} reached after the N steps starting from state s_t.
As a preferred solution, the agent optimization module M3 specifically includes a decomposition unit 31, an update unit 32, and a classification unit 33, where each unit specifically includes:
a decomposition unit 31, configured to decompose the expected return value of each learning sample into a state value function and an advantage function based on the Dueling Network structure;
an updating unit 32, configured to update the state value function and the advantage function of each experience sample respectively by using the time difference error of each learning sample, and to update the expected return value corresponding to each experience sample respectively by using the updated state value function and the updated advantage function;
and the grading unit 33 is used for arranging all the learning samples according to the sequence from the big time difference error to the small time difference error, and sequentially distributing sampling priorities from high to low to each learning sample according to the arrangement result so as to finish the optimization of the intelligent agent.
As a preferred solution, referring to fig. 2, the video communication resource allocation decision system provided in the embodiment of the present invention further includes an expected return analysis module M6, which is specifically as follows:
the expected return analysis module M6 is configured to obtain a communication index of the video data transmitted to the receiving end when the agent performs action a in system state s, wherein the communication index comprises video quality, smoothness and delay, and to calculate the expected return value Q(s, a) obtained by the agent performing action a in system state s according to the following formulas:
R(s,a)=f(video quality,smoothness,latency);
Q(s,a) = R(s,a) + γ max_a′ Q(s′, a′);
wherein R(s, a) represents the reward obtained by the agent performing action a in system state s; video quality, smoothness and latency are the input parameters of the function f, where video quality represents the video quality, smoothness represents the smoothness and latency represents the delay; γ is a discount factor representing the value of future rewards; s′ represents the new state after performing action a; max_a′ Q(s′, a′) represents the maximum expected return value obtained by performing all possible actions in the new state s′.
It will be clear to those skilled in the art that, for convenience and brevity of description, reference may be made to the corresponding process in the foregoing method embodiment for the specific working process of the above-described system, which is not described herein again.
Compared with the prior art, the embodiment of the invention has the following beneficial effects:
the invention provides a video communication resource allocation decision method and a system, which aim at the maximum expected return value, construct an intelligent body, and the expected return value is converted from the communication index of video data transmitted to a receiving end, which means that the intelligent body can select the optimal action strategy according to the communication index of the video data so as to achieve the aim of maximizing the expected return value. Then, a plurality of experience samples in the experience buffer are extracted to serve as learning samples, the effect of different movement strategies can be evaluated more accurately according to the multi-step return values of the learning samples, the expected return values corresponding to all the experience samples in the experience buffer are updated based on the time difference errors of the learning samples, sampling priorities are distributed to all the learning samples based on the time difference errors, and reliable reference data is provided for optimizing the movement strategies. Finally, the optimized agent can adapt to the dynamic change of the network in real time according to the system state at the current moment and all learned experiences, predicts the expected return value of the future network condition according to the history data in the experience buffer zone, accurately selects the optimal action strategy with the maximum expected return value, and applies the optimal action strategy to the distribution of network communication resources, thereby being beneficial to improving the efficiency and quality of network communication and providing better quality service for users.
The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present invention, and are not to be construed as limiting the scope of the invention. It should be noted that any modifications, equivalent substitutions, improvements, etc. made by those skilled in the art without departing from the spirit and principles of the present invention are intended to be included in the scope of the present invention.

Claims (10)

1. A method for deciding allocation of video communication resources, comprising:
constructing an intelligent agent with the maximum expected return value as a target; the expected return value is obtained by converting a communication index of video data transmitted to the receiving end;
taking a plurality of experience samples extracted from an experience buffer zone as learning samples, and calculating to obtain time difference errors of the learning samples according to multi-step return values corresponding to the learning samples; each learning sample comprises an expected return value obtained by the intelligent agent executing different actions under different system states and a state to which the intelligent agent is transferred by executing different actions under different system states, wherein the multi-step return value corresponding to the learning sample is the maximum value of the expected return values obtained by the intelligent agent executing different multi-step actions from different system states;
based on the time difference error of each learning sample, updating the expected return values corresponding to all the experience samples in the experience buffer zone, and distributing sampling priority to all the learning samples so as to finish the optimization of the intelligent agent;
and acquiring the system state in real time, acquiring the optimal action strategy at the current moment through the optimized agent, and realizing the distribution of network communication resources based on the optimal action strategy.
2. The method of claim 1, further comprising, before the plurality of experience samples extracted from the experience buffer are taken as learning samples and the time difference error of each learning sample is calculated according to the multi-step return value corresponding to each learning sample:
extracting a plurality of experience samples from the experience buffer as learning samples using a priority experience playback mechanism;
calculating a multi-step return value corresponding to each learning sample according to the following formula:
G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^(N−1) r_{t+N−1} + γ^N max_a Q(s_{t+N}, a);
wherein G_t represents the N-step return starting from state s_t at time t; γ represents a discount factor used to discount the value of future rewards; r_{t+i} represents the reward of the i-th step starting from time t; a represents an action taken; N represents the number of steps considered; max_a Q(s_{t+N}, a) represents the maximum expected return value obtained by performing all possible actions from the state s_{t+N} reached after the N steps starting from state s_t.
3. The method of claim 1, wherein the updating the expected return values corresponding to all the experience samples in the experience buffer based on the time difference error of each of the learning samples, and assigning sampling priorities to all the learning samples, to complete the optimization of the agent comprises:
decomposing the expected return value of each learning sample into a state value function and an advantage function based on a Dueling Network structure;
respectively updating the state value function and the advantage function of each experience sample by using the time difference error of each learning sample, and respectively updating the expected return value corresponding to each experience sample by using the updated state value function and the updated advantage function;
and arranging all the learning samples according to the sequence of the time difference errors from large to small, and sequentially distributing sampling priorities from high to low to the learning samples according to the arrangement result so as to finish the optimization of the intelligent agent.
4. The method for deciding video communication resource allocation according to claim 1, wherein the plurality of experience samples extracted from the experience buffer are taken as learning samples, and the time difference error of each learning sample is calculated according to the multi-step return value corresponding to each learning sample, specifically:
taking a plurality of experience samples extracted from the experience buffer as learning samples;
according to a preset error algorithm, calculating the time difference error of each learning sample based on the multi-step return values of the plurality of learning samples extracted from the experience buffer and the expected return value obtained from each learning sample; the preset error algorithm is specifically:
δ_t = G_t − Q(s_t, a_t);
wherein δ_t represents the time difference error of the learning sample; G_t represents the N-step return starting from state s_t at time t; Q(s_t, a_t) represents the expected return value of the agent performing action a_t in state s_t.
5. The method for deciding a video communication resource allocation as claimed in claim 1, wherein the obtaining of the expected return value is specifically:
when the intelligent agent performs the action a in the system state s, acquiring a communication index of video data transmitted to a receiving end; wherein the communication index comprises video quality, smoothness and delay;
the expected return value Q (s, a) obtained by the agent performing action a in system state s is calculated as follows:
R(s,a)=f(video quality,smoothness,latency);
Q(s,a) = R(s,a) + γ max_a′ Q(s′, a′);
wherein R(s, a) represents the reward obtained by the agent performing action a in system state s; video quality, smoothness and latency are the input parameters of the function f, where video quality represents the video quality, smoothness represents the smoothness and latency represents the delay; γ is a discount factor representing the value of future rewards; s′ represents the new state after performing action a; max_a′ Q(s′, a′) represents the maximum expected return value obtained by performing all possible actions in the new state s′.
6. The method for deciding allocation of video communication resources according to claim 1, wherein said acquiring system status in real time, and acquiring the current optimal action policy by the agent performing optimization, and based on said optimal action policy, implementing allocation of network communication resources, specifically:
in a WebRTC-SVC environment, acquiring a system state at the current moment;
the optimal action strategy at the current moment is obtained through analysis based on the system state at the current moment by the intelligent agent for completing optimization, so that the distribution of network communication resources is realized; wherein the optimal action policy is an action performed so that the optimized agent can reach the target in the system state at the current time.
7. A video communication resource allocation decision system, comprising:
the intelligent agent construction module is used for constructing an intelligent agent with the maximum expected return value as a target; the expected return value is obtained by converting a communication index of video data transmitted to the receiving end;
the error analysis module is used for taking a plurality of experience samples extracted from the experience buffer zone as learning samples and calculating the time difference error of each learning sample according to the multi-step return value corresponding to each learning sample; each learning sample comprises an expected return value obtained by the intelligent agent executing different actions under different system states and a state to which the intelligent agent is transferred by executing different actions under different system states, wherein the multi-step return value corresponding to the learning sample is the maximum value of the expected return values obtained by the intelligent agent executing different multi-step actions from different system states;
the intelligent agent optimizing module is used for updating expected return values corresponding to all experience samples in the experience buffer zone based on time difference errors of the learning samples, and distributing sampling priorities to all the learning samples so as to finish optimizing the intelligent agent;
the allocation decision module is used for acquiring the system state in real time, acquiring the optimal action strategy at the current moment through the optimized agent, and realizing the allocation of network communication resources based on the optimal action strategy.
8. The video communication resource allocation decision system of claim 7, further comprising:
a multi-step return analysis module for extracting a plurality of experience samples from the experience buffer using a priority experience playback mechanism as learning samples; calculating a multi-step return value corresponding to each learning sample according to the following formula:
G_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^(N−1) r_{t+N−1} + γ^N max_a Q(s_{t+N}, a);
wherein G_t represents the N-step return starting from state s_t at time t; γ represents a discount factor used to discount the value of future rewards; r_{t+i} represents the reward of the i-th step starting from time t; a represents an action taken; N represents the number of steps considered; max_a Q(s_{t+N}, a) represents the maximum expected return value obtained by performing all possible actions from the state s_{t+N} reached after the N steps starting from state s_t.
9. The video communication resource allocation decision system of claim 7, wherein said agent optimization module comprises:
the decomposition unit is used for decomposing the expected return value of each learning sample into a state value function and an advantage function based on the Dueling Network structure;
the updating unit is used for respectively updating the state value function and the advantage function of each experience sample by utilizing the time difference error of each learning sample, and respectively updating the expected return value corresponding to each experience sample by utilizing the updated state value function and the updated advantage function;
and the grading unit is used for arranging all the learning samples according to the sequence of the time difference errors from large to small, and sequentially distributing sampling priorities from high to low to the learning samples according to the arrangement result so as to finish the optimization of the intelligent agent.
10. The video communication resource allocation decision system of claim 7, further comprising:
the expected return analysis module is used for acquiring a communication index of video data transmitted to the receiving end when the intelligent agent performs the action a in the system state s; wherein the communication index comprises video quality, smoothness and delay; the expected return value Q(s, a) obtained by the agent performing action a in system state s is calculated according to the following formulas:
R(s,a)=f(video quality,smoothness,latency);
Q(s,a) = R(s,a) + γ max_a′ Q(s′, a′);
wherein R(s, a) represents the reward obtained by the agent performing action a in system state s; video quality, smoothness and latency are the input parameters of the function f, where video quality represents the video quality, smoothness represents the smoothness and latency represents the delay; γ is a discount factor representing the value of future rewards; s′ represents the new state after performing action a; max_a′ Q(s′, a′) represents the maximum expected return value obtained by performing all possible actions in the new state s′.
CN202311815947.3A 2023-12-26 2023-12-26 Video communication resource allocation decision method and system Active CN117768451B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311815947.3A CN117768451B (en) 2023-12-26 2023-12-26 Video communication resource allocation decision method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311815947.3A CN117768451B (en) 2023-12-26 2023-12-26 Video communication resource allocation decision method and system

Publications (2)

Publication Number Publication Date
CN117768451A true CN117768451A (en) 2024-03-26
CN117768451B CN117768451B (en) 2024-08-27

Family

ID=90312144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311815947.3A Active CN117768451B (en) 2023-12-26 2023-12-26 Video communication resource allocation decision method and system

Country Status (1)

Country Link
CN (1) CN117768451B (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200824381A (en) * 2006-11-23 2008-06-01 Inst Information Industry Resource allocation apparatus, method, application program, and computer readable medium thereof
US20110270913A1 (en) * 2010-04-29 2011-11-03 Irdeto Corporate B.V. Controlling an adaptive streaming of digital content
US20200302322A1 (en) * 2017-10-04 2020-09-24 Prowler ,Io Limited Machine learning system
WO2022121510A1 (en) * 2020-12-11 2022-06-16 多伦科技股份有限公司 Stochastic policy gradient-based traffic signal control method and system, and electronic device
CN113126498A (en) * 2021-04-17 2021-07-16 西北工业大学 Optimization control system and control method based on distributed reinforcement learning
CN113952733A (en) * 2021-05-31 2022-01-21 厦门渊亭信息科技有限公司 Multi-agent self-adaptive sampling strategy generation method
CN116367223A (en) * 2023-03-30 2023-06-30 广州爱浦路网络技术有限公司 XR service optimization method and device based on reinforcement learning, electronic equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李晓辉; 易克初; 刘乃安; 杜晨光: "A Novel Resource Allocation Algorithm in MIMO Systems" (一种MIMO系统中的新型资源分配算法), Journal of Circuits and Systems (电路与系统学报), no. 05, 30 October 2006 (2006-10-30) *
陈霖: "Research on Value Functions in Deep Reinforcement Learning" (深度强化学习中的值函数研究), China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology, no. 03, 15 March 2022 (2022-03-15), pages 4-47 *

Also Published As

Publication number Publication date
CN117768451B (en) 2024-08-27

Similar Documents

Publication Publication Date Title
US11509703B2 (en) System and method for widescale adaptive bitrate selection
US11373108B2 (en) Reinforcement learning in real-time communications
CN113475089B (en) Method and system for user-oriented content streaming
CN113422751B (en) Streaming media processing method and device based on online reinforcement learning and electronic equipment
CN111629380A (en) Dynamic resource allocation method for high-concurrency multi-service industrial 5G network
CN113438315A (en) Internet of things information freshness optimization method based on dual-network deep reinforcement learning
Yun et al. Reinforcement learning for link adaptation in MIMO-OFDM wireless systems
Meng et al. Practically deploying heavyweight adaptive bitrate algorithms with teacher-student learning
CN117768451B (en) Video communication resource allocation decision method and system
Naresh et al. Sac-abr: Soft actor-critic based deep reinforcement learning for adaptive bitrate streaming
Sharara et al. A recurrent neural network based approach for coordinating radio and computing resources allocation in cloud-ran
CN117560327A (en) Burst traffic oriented service quality adjustment method under limited network
Hafez et al. Reinforcement learning-based rate adaptation in dynamic video streaming
EP4121904A1 (en) System and method for adapting to changing constraints
CN117858124A (en) Multi-resource optimization method for heterogeneous semantic and bit communication network
Huang et al. Digital Twin-Based Network Management for Better QoE in Multicast Short Video Streaming
CN116347170A (en) Adaptive bit rate control method based on sequential causal modeling
Smirnov et al. Real-time data transmission optimization on 5G remote-controlled units using deep reinforcement learning
CN115834924A (en) Interactive video-oriented loosely-coupled coding rate-transmission rate adjusting method
US12015507B2 (en) Training in communication systems
US20220019871A1 (en) Method for Adapting a Software Application Executed in a Gateway
Andrade et al. Accelerated resource allocation based on experience retention for B5G networks
CN111107642B (en) Resource allocation method, device and system suitable for wireless network
US11412283B1 (en) System and method for adaptively streaming video
US11483368B1 (en) Video streaming method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant