CN115344510A - High-dimensional video cache selection method based on deep reinforcement learning
- Publication number
- CN115344510A CN202211270042.8A
- Authority
- CN
- China
- Prior art keywords
- video
- edge server
- dimensional
- network
- action
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0877—Cache access modes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24552—Database cache management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/71—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
- G06F3/0656—Data buffering arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Processing Or Creating Images (AREA)
Abstract
The high-dimensional video cache selection method based on deep reinforcement learning applies deep reinforcement learning to the video cache selection of an edge server. It accounts for both the dynamic and the high-dimensional nature of video cache selection, realizing efficient video caching on the edge server. A decoder is used to improve DDPG, so that the edge server can select suitable videos to cache, reducing the video transmission delay and the traffic cost spent by users. When the edge server selects videos to cache from a massive catalog, the computational overhead is greatly reduced, avoiding excessive pressure on the resource-limited edge server and saving computation cost.
Description
Technical Field
The invention belongs to the technical field of computer application, and particularly relates to a high-dimensional video cache selection method based on deep reinforcement learning.
Background
With advances in technology, multimedia services and their applications have developed rapidly. The number of videos keeps growing, video quality keeps improving, and video traffic keeps increasing. This huge video traffic puts pressure on the backbone network. Edge computing brings data processing closer to the user and can improve the quality of multimedia services. In 5G networks in particular, base stations are already equipped with edge servers that provide storage and computing capability. By combining video caching with edge computing, the edge server selects and caches the videos users are relatively more likely to watch, which reduces the video transmission delay and the traffic cost spent by users.
The edge server equipped at a base station selects videos to cache and can provide the cached videos to multiple users within its coverage area. When a video a user wants to watch has been cached by the edge server, the video is obtained directly from the edge server. Otherwise, the video is acquired from the backbone network, such as a wireless network.
Video popularity changes over time, and the edge server selects different videos to cache accordingly, so video cache selection is dynamic. Because of its limited caching capacity, the edge server must select a subset of videos to cache from a very large catalog, so video cache selection is also high-dimensional. Together, the dynamic and high-dimensional characteristics of video cache selection challenge efficient video caching on the edge server.
Traditional video cache selection methods mostly consider only the dynamics of cache selection and perform caching with reinforcement learning or deep reinforcement learning, without considering the high dimensionality of video cache selection. When the edge server selects videos to cache from a massive catalog, the computation cost is high and puts pressure on the resource-limited edge server.
Disclosure of Invention
Aiming at the defects in the background art, the invention provides a high-dimensional video cache selection method based on deep reinforcement learning, applying deep reinforcement learning to the video cache selection of the edge server. DDPG is improved with a decoder, so that the edge server can select suitable videos to cache, reducing the video transmission delay and the traffic cost spent by users.
The high-dimensional video cache selection method based on deep reinforcement learning comprises the following steps:
step S1: performing system modeling for the high-dimensional video caching problem, and then establishing a high-dimensional video caching action selection model based on an improved deep deterministic policy gradient (DDPG);
step S2: training the network parameters of the improved-DDPG-based high-dimensional video caching action selection model through the Adam algorithm;
step S3: the edge server selecting videos to cache by using the trained improved-DDPG-based high-dimensional video caching action selection model.
Further, in step S1, the specific steps are as follows:
step S1-1: formalizing the high-dimensional video caching problem of the edge server:

Let the number of users within the coverage of the edge server be U, the number of videos be N, the time horizon be T, the maximum storage capacity of the edge server be C, and the unit time delay and unit traffic cost from the user's local device to the edge server be l and p, respectively. The video cache selection strategy of the edge server is $\pi = \{a_1, a_2, \ldots, a_T\}$ (the original formula images are not preserved; the notation here is reconstructed from the surrounding definitions and used consistently throughout).

Here $a_t = (a_{t,1}, \ldots, a_{t,N})$ denotes the video caching action performed by the edge server at time step t, with $a_{t,j} \in \{0,1\}$. Since the number of videos N is huge, $a_t$ is high-dimensional and discrete. When $a_{t,j} = 1$, the edge server caches the video numbered j at time step t; otherwise $a_{t,j} = 0$. If the edge server selects only one video to cache per time step, then $\sum_{j=1}^{N} a_{t,j} = 1$.

At time step t, the viewing situation of user k is $w_{t,k} = (w_{t,k,1}, \ldots, w_{t,k,N})$; $w_{t,k}$ is a high-dimensional vector. When $w_{t,k,j} = 1$, user k watches the video numbered j at time step t; otherwise $w_{t,k,j} = 0$.

At time step t, the caching situation of the edge server is $c_t = (c_{t,1}, \ldots, c_{t,N})$; $c_t$ is also a high-dimensional vector. When $c_{t,j} = 1$, the edge server has cached the video numbered j at time step t; otherwise $c_{t,j} = 0$.

Let the storage size occupied by the video numbered j be $m_j$, and let $r_t$ denote the instant reward obtained after the edge server caches videos at time step t. The optimization target of the whole problem is to solve for the video cache selection strategy $\pi$ that maximizes the cumulative revenue of the edge server, i.e., minimizes the video transmission delay and the traffic cost spent by the user:

$$\max_{\pi} \; \mathbb{E}\Big[\sum_{t=1}^{T} \gamma^{t-1} \sum_{k=1}^{U} r_{t,k}\Big] \quad \text{s.t.} \quad \sum_{j=1}^{N} c_{t,j}\, m_j \le C,$$

wherein $\gamma$ is the discount rate, representing the degree of interest in future rewards; $r_{t,k}$ is the instant reward the edge server obtains at time step t because user k watches a video; e is the positive instant reward obtained by the edge server when the video the user wants to watch has been cached by the edge server; $\beta \in [0,1]$ weights the video transmission delay against the traffic cost spent by the user; C is the maximum storage capacity of the edge server; U is the number of users within the coverage of the edge server;
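For concreteness, the following is a minimal Python sketch of one per-user instant reward consistent with the definitions above; the exact functional form (how e, beta, l, and p combine) is an assumption, since the original formula image is not preserved.

```python
def instant_reward(requested_video, cache, e, beta, l, p):
    """Per-user instant reward at one time step (assumed functional form).
    requested_video: id of the video the user watches, or None if idle.
    cache: set of video ids currently cached on the edge server."""
    if requested_video is None:
        return 0.0
    if requested_video in cache:
        return e                              # cache hit: positive reward e
    return -(beta * l + (1.0 - beta) * p)     # miss: weighted delay / traffic penalty
```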
step S1-2: describing the above problem model as a Markov decision process (MDP) represented by $(S, A, Z, R, P)$; wherein S is the state space, storing the states observable by the edge server; A is the high-dimensional action space, storing the original high-dimensional discrete video caching actions executable by the edge server; Z is the low-dimensional action space, storing the low-dimensional continuous video caching actions selectable by the edge server; R is the reward space, storing the instant rewards obtained by the edge server; P is the state transition probability space, representing the distribution over next states after the edge server executes an action in a given state;

step S1-3: the improved-DDPG-based high-dimensional video caching action selection model combines DDPG with a trained decoder; DDPG comprises an actor, a critic, and a replay buffer;

The actor is divided into an online actor network and a target actor network, both 4-layer deep fully-connected neural networks, with network parameters $\theta^{\mu}$ and $\theta^{\mu'}$, respectively. The input to the online actor network is the state $s_t$ observed by the edge server, and its output is the low-dimensional continuous video caching action $z_t$. The target actor network is used to update the network parameters of the online actor network.

The decoder is a 6-layer deep fully-connected neural network with network parameters $\xi$. The input of the decoder is the low-dimensional continuous video caching action $z_t$, and its output is the original high-dimensional discrete video caching action $a_t$.

The replay buffer stores the state $s_t$ observed by the edge server, the low-dimensional continuous video caching action $z_t$, the instant reward $r_t$ obtained after the edge server selects videos to cache based on the action, and the state $s_{t+1}$ of the next time step observed by the edge server, i.e., the tuple $(s_t, z_t, r_t, s_{t+1})$.

The critic is divided into an online critic network and a target critic network, both 4-layer deep fully-connected neural networks, with network parameters $\theta^{Q}$ and $\theta^{Q'}$, respectively. The input to the online critic network is the data $(s_t, z_t)$ sampled from the replay buffer, and its output is the state-action value $Q(s_t, z_t)$ after the edge server selects videos to cache, i.e., an estimate of the cumulative revenue obtained by the edge server. The target critic network is used to update the network parameters of the online critic network.
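As an illustration, a minimal PyTorch sketch of these networks follows; the hidden width H, the latent dimension D, the catalog size N, and the output activations are assumptions, since the patent fixes only the layer counts (4-layer actor and critic, 6-layer decoder).

```python
import copy
import torch.nn as nn

def mlp(sizes, out_act=None):
    # Fully-connected stack with ReLU between hidden layers.
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

N, D, H = 10000, 32, 256                        # catalog size, latent dim, width (assumed)

actor = mlp([2 * N, H, H, H, D], nn.Tanh())     # 4 linear layers: state s -> action z
critic = mlp([2 * N + D, H, H, H, 1])           # 4 linear layers: (s, z) -> Q value
decoder = mlp([D, H, H, H, H, H, N])            # 6 linear layers: z -> logits over N videos

actor_target = copy.deepcopy(actor)             # target networks start as copies
critic_target = copy.deepcopy(critic)
```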
Further, in step S1-2, the state is defined as:

At time step t-1, the watched situation of each video is $v_{t-1} = (v_{t-1,1}, \ldots, v_{t-1,N})$, where $v_{t-1,j}$ is calculated according to the following formula, summing the viewing indicators over all users:

$$v_{t-1,j} = \sum_{k=1}^{U} w_{t-1,k,j}.$$

Take $v_{t-1}$ and the caching situation $c_{t-1}$ together as the state observed by the current edge server, i.e., $s_t = (v_{t-1}, c_{t-1})$. As described above, $v_{t-1}$ and $c_{t-1}$ are both high-dimensional vectors of dimension N, and thus $s_t$ is a high-dimensional state of dimension 2N.
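As a sketch, the 2N-dimensional state can be assembled from the per-user viewing matrix and the cache flags; the array names here are illustrative.

```python
import numpy as np

def build_state(watch_prev, cache_prev):
    """watch_prev: (U, N) 0/1 matrix, who watched what at time step t-1.
    cache_prev: (N,) 0/1 vector, what the edge server cached at t-1.
    Returns the 2N-dimensional state s_t."""
    v = watch_prev.sum(axis=0)                         # per-video view counts at t-1
    return np.concatenate([v, cache_prev]).astype(np.float32)
```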
Further, in step S1-2, the action is defined as:

Reduce the dimension of the video caching actions in the high-dimensional action space A to obtain the low-dimensional action space Z of dimension D, with D much smaller than N. Then at time step t, the low-dimensional continuous video caching action selectable by the edge server is $z_t = (z_{t,1}, \ldots, z_{t,D})$, $z_t \in Z$; $z_t$ needs to be restored to the original high-dimensional discrete video caching action $a_t$ of dimension N, $a_t \in A$.
Further, in step S1-2, the state transition probability is defined as:

In the MDP, when the edge server in state $s_t$ selects videos to cache according to action $a_t$, the resulting next state is decided by $P(s_{t+1} \mid s_t, a_t)$.
Further, in step S1-2, the instant reward is defined as:

The edge server obtains the instant reward $r_t$ after caching videos at time step t. The accumulated reward obtained by the edge server from time step t is

$$R_t = \sum_{i=0}^{T-t} \gamma^{i} r_{t+i}.$$

The goal of the edge server is to maximize the cumulative revenue, i.e., the expectation of the accumulated reward:

$$\max \; \mathbb{E}[R_t].$$

The optimization objective is thus converted into solving for the optimal video caching action $a_t$ of the edge server at each time step t so as to maximize the cumulative revenue of the edge server.
Further, in step S2, the network parameters of the improved-DDPG-based high-dimensional video caching action selection model are trained through the Adam algorithm, and the training process is based on training samples. Before the decoder is trained, an encoder and a deep fully-connected neural network need to be trained. The encoder is a 6-layer deep fully-connected neural network with network parameters $\phi$; its input is the original high-dimensional discrete video caching action $a_t$ and its output is the low-dimensional continuous video caching action $z_t$. The deep fully-connected neural network has network parameters $\psi$ and 5 layers; its input is the state $s_t$ observed by the edge server and the low-dimensional continuous video caching action $z_t$, and its output is the state $s_{t+1}$ of the next moment.
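Continuing the sketch above, the encoder and the transition-prediction network might be defined as follows; the 6- and 5-layer counts come from the text, while the widths and activation are assumptions (this reuses the `mlp` helper and constants N, D, H from the earlier sketch).

```python
encoder = mlp([N, H, H, H, H, H, D], nn.Tanh())    # 6 linear layers: one-hot a -> z
transition = mlp([2 * N + D, H, H, H, H, 2 * N])   # 5 linear layers: (s, z) -> next state
```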
Further, the step S2 specifically includes the following steps:
step S2-1: randomly initializing the network parameters $\phi$, $\psi$, and $\xi$ of the encoder, the deep fully-connected neural network, and the decoder, respectively;

step S2-2: the encoder reducing the original high-dimensional discrete video caching action $a_t$ to the low-dimensional continuous video caching action $z_t = f_{\phi}(a_t)$;

step S2-3: inputting the low-dimensional continuous video caching action $z_t$ and the state $s_t$ observed by the edge server into the deep fully-connected neural network, whose output gives the state $\hat{s}_{t+1}$ of the next moment;

step S2-4: minimizing the loss $L_1$ of the encoder and the deep fully-connected neural network to update their parameters $\phi$ and $\psi$:

$$L_1(\phi, \psi) = \mathbb{E}_{\rho^{\pi}}\big[-\log p_{\psi}(s_{t+1} \mid s_t, f_{\phi}(a_t))\big] + \beta_1 D_{KL}\big(f_{\phi}(a_t) \,\|\, \mathcal{N}(0, I)\big),$$

wherein $\mathbb{E}_{\rho^{\pi}}$ denotes the expectation over $\rho^{\pi}$, the distribution of state transition probabilities under policy $\pi$; $p_{\psi}(s_{t+1} \mid s_t, z_t)$ is the probability that the deep fully-connected neural network with parameters $\psi$ outputs $s_{t+1}$ given inputs $s_t$ and $z_t$; $f_{\phi}(a_t)$ is the output of the encoder with parameters $\phi$ given input $a_t$; $D_{KL}$ is the KL divergence, representing the difference between the encoder output and the Gaussian distribution $\mathcal{N}(0, I)$; $\beta_1$ is the weight of the KL divergence;
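A minimal training-step sketch for this loss, treating the encoder output as the mean of a unit-variance Gaussian so that the reconstruction term reduces to a squared error; this simplification and the optimizer settings are assumptions (the networks come from the sketches above).

```python
import torch
import torch.nn.functional as F

opt_enc = torch.optim.Adam(list(encoder.parameters()) + list(transition.parameters()), lr=1e-3)
beta1 = 0.1                                    # KL weight beta_1 (assumed value)

def l1_step(s, a_onehot, s_next):
    z = encoder(a_onehot)                      # f_phi(a_t)
    s_pred = transition(torch.cat([s, z], dim=-1))
    recon = F.mse_loss(s_pred, s_next)         # stands in for -log p_psi(s_next | s, z)
    kl = 0.5 * (z ** 2).sum(dim=-1).mean()     # KL(N(z, I) || N(0, I)) up to a constant
    loss = recon + beta1 * kl
    opt_enc.zero_grad(); loss.backward(); opt_enc.step()
    return loss.item()
```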
step S2-5: repeating steps S2-2 to S2-4 until $L_1$ converges, finishing the training of the encoder and the deep fully-connected neural network;

step S2-6: inputting $z_t$ into the decoder, whose output gives the original high-dimensional discrete video caching action:

$$a_t = g_{\xi}(z_t),$$

wherein $g_{\xi}(z_t)$ is the output of the decoder with parameters $\xi$ given input $z_t$;
step S2-7: minimizing the distance $L_2$ between the two low-dimensional continuous video caching actions to update the decoder parameters $\xi$:

$$L_2(\xi) = \mathbb{E}\big[\|f_{\phi}(g_{\xi}(z_t)) - z_t\|^2\big] + \beta_2\, \Omega(\xi),$$

wherein $f_{\phi}(g_{\xi}(z_t))$ is the output of the encoder when fed the original high-dimensional discrete video caching action output by the decoder; the first term ensures that the decoder is a one-sided inverse of the encoder, i.e., $f_{\phi}(g_{\xi}(z)) = z$ although $g_{\xi}(f_{\phi}(a))$ need not equal $a$; the second term, written here as the regularizer $\Omega(\xi)$ since its original expression is not preserved, ensures that $z_t$ is the unique minimum; $\beta_2$ is the weight of the second term;
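A sketch of the decoder update follows; the uniqueness regularizer Omega is not preserved in the source, so a simple weight-decay stand-in is used here as an assumption, and the softmax relaxation of the decoder output keeps the composition differentiable.

```python
import torch

opt_dec = torch.optim.Adam(decoder.parameters(), lr=1e-3)
beta2 = 1e-4                                   # weight beta_2 of the second term (assumed)

def l2_step(z_batch):
    a_relaxed = torch.softmax(decoder(z_batch), dim=-1)      # differentiable g_xi(z)
    z_back = encoder(a_relaxed)                              # f_phi(g_xi(z))
    inverse_term = ((z_back - z_batch) ** 2).sum(dim=-1).mean()
    reg = sum((p ** 2).sum() for p in decoder.parameters())  # stand-in for Omega(xi)
    loss = inverse_term + beta2 * reg
    opt_dec.zero_grad(); loss.backward(); opt_dec.step()
    return loss.item()
```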
step S2-9: the network parameters of the online actor network and the target actor network are $\theta^{\mu}$ and $\theta^{\mu'}$, respectively; the network parameters of the online critic network and the target critic network are $\theta^{Q}$ and $\theta^{Q'}$, respectively; randomly initialize $\theta^{\mu}$ and $\theta^{Q}$, then assign $\theta^{\mu}$ and $\theta^{Q}$ to $\theta^{\mu'}$ and $\theta^{Q'}$, respectively;

step S2-10: after observing the state $s_t$, the edge server selects the low-dimensional continuous video caching action according to the online actor network and random noise:

$$z_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t,$$

wherein $\mu(s_t \mid \theta^{\mu})$ is the output of the online actor network with parameters $\theta^{\mu}$ given input state $s_t$; $\mathcal{N}_t$ is random noise used to increase the exploration of the video caching action;
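A sketch of step S2-10; Gaussian exploration noise is a common choice and an assumption here, since the patent says only "random noise" (the `actor` network comes from the earlier sketch).

```python
import torch

def select_action(s, noise_std=0.1):           # noise scale is an assumed value
    with torch.no_grad():
        z = actor(s)                           # mu(s | theta_mu)
    return z + noise_std * torch.randn_like(z) # add exploration noise N_t
```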
Step S2-12: the edge server obtains the instant reward $r_t$ after selecting videos to cache according to the action, and observes the state $s_{t+1}$ of the next time step;

Step S2-15: minimizing the loss $L_3$ of the online critic network using the Adam algorithm to update its parameters $\theta^{Q}$:

$$L_3(\theta^{Q}) = \mathbb{E}\big[(y_t - Q(s_t, z_t \mid \theta^{Q}))^2\big], \qquad y_t = r_t + \gamma\, Q'(s_{t+1}, z_{t+1} \mid \theta^{Q'}),$$

wherein $r_t$ is the instant reward obtained after the edge server selects videos to cache based on the action; $\gamma$ is the discount rate, representing the degree of interest in future rewards; $Q'(s_{t+1}, z_{t+1} \mid \theta^{Q'})$ is the state-action value output by the target critic network with parameters $\theta^{Q'}$ given inputs $s_{t+1}$ and $z_{t+1}$; $Q(s_t, z_t \mid \theta^{Q})$ is the state-action value output by the online critic network with parameters $\theta^{Q}$ given inputs $s_t$ and $z_t$;
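A sketch of the critic update of step S2-15, using the standard DDPG target; batch shapes and the discount value are assumptions (networks from the earlier sketches).

```python
import torch
import torch.nn.functional as F

opt_q = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99                                   # discount rate (assumed value)

def critic_step(s, z, r, s_next):
    # r: (batch, 1); s, s_next: (batch, 2N); z: (batch, D).
    with torch.no_grad():
        z_next = actor_target(s_next)          # mu'(s_next | theta_mu')
        y = r + gamma * critic_target(torch.cat([s_next, z_next], dim=-1))
    q = critic(torch.cat([s, z], dim=-1))
    loss = F.mse_loss(q, y)
    opt_q.zero_grad(); loss.backward(); opt_q.step()
    return loss.item()
```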
step S2-16: after the online actor network computes the policy gradient $\nabla_{\theta^{\mu}} J$, updating its parameters $\theta^{\mu}$ using the Adam algorithm:

$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}\big[\nabla_{z} Q(s_t, z \mid \theta^{Q})\big|_{z = \mu(s_t \mid \theta^{\mu})}\, \nabla_{\theta^{\mu}} \mu(s_t \mid \theta^{\mu})\big],$$

wherein $\nabla_{z} Q(s_t, z \mid \theta^{Q})$ is the gradient of the state-action value with respect to the action; $\nabla_{\theta^{\mu}} \mu(s_t \mid \theta^{\mu})$ is the gradient of the action output by the online actor network with parameters $\theta^{\mu}$ with respect to those parameters; $\alpha$ is the update step size of the parameters $\theta^{\mu}$;
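A sketch of the actor update of step S2-16; maximizing Q is implemented as minimizing -Q, which Adam then follows with step size alpha (the learning rate, an assumed value here).

```python
import torch

opt_mu = torch.optim.Adam(actor.parameters(), lr=1e-4)  # step size alpha (assumed value)

def actor_step(s):
    loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()  # ascend Q by minimizing -Q
    opt_mu.zero_grad(); loss.backward(); opt_mu.step()
    return loss.item()
```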
step S2-18: repeating steps S2-10 to S2-17 until the loss function $L_3$ converges, finishing the training.
Further, in step S3, the high-dimensional state $s_t$ observed by the edge server is input; the trained improved-DDPG-based high-dimensional video caching action selection model outputs the low-dimensional continuous video caching action, the decoder restores it to the original high-dimensional discrete video caching action, and the edge server selects the videos to cache according to the original high-dimensional discrete video caching action.
Further, the step S3 specifically includes the following steps:
step S3-1: after observing the high-dimensional state $s_t$, the edge server inputs $s_t$ into the trained online actor network, which outputs the low-dimensional continuous video caching action:

$$z_t = \mu(s_t \mid \theta^{\mu}),$$

wherein $\theta^{\mu}$ is the parameter of the online actor network obtained after the training of step S2; $\mu(s_t \mid \theta^{\mu})$ is the network output given input state $s_t$;

step S3-2: inputting $z_t$ into the decoder, which outputs the original high-dimensional discrete video caching action:

$$a_t = g_{\xi}(z_t),$$

wherein $\xi$ is the decoder parameter obtained after the training of step S2; $g_{\xi}(z_t)$ is the decoder output given input $z_t$;

step S3-3: the edge server selects one video from the massive catalog according to $a_t$; if the cached videos of the edge server do not contain this video and the remaining storage capacity of the edge server is sufficient to cache it, the video is cached in the edge server; otherwise, the earliest cached videos in the edge server are deleted in turn until the remaining storage capacity is sufficient to cache the video, and the video is then cached in the edge server.
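A minimal sketch of the step S3-3 admission/eviction rule, assuming a FIFO record of insertion order (the patent specifies only that the earliest cached video is deleted first):

```python
from collections import OrderedDict

class EdgeCache:
    def __init__(self, capacity):
        self.capacity = capacity               # maximum storage capacity C
        self.used = 0
        self.items = OrderedDict()             # video_id -> size, oldest first

    def admit(self, video_id, size):
        if video_id in self.items:
            return                             # already cached: nothing to do
        while self.used + size > self.capacity and self.items:
            _, freed = self.items.popitem(last=False)  # delete the earliest cached video
            self.used -= freed
        self.items[video_id] = size
        self.used += size
```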
The invention has the following beneficial effects:
1) Deep reinforcement learning is applied to the video cache selection of the edge server, and efficient video caching on the edge server is realized by considering the dynamic and high-dimensional nature of video cache selection.
2) A decoder is used to improve DDPG, so that the edge server can select suitable videos to cache, reducing the video transmission delay and the traffic cost spent by users.
3) When the edge server selects videos to cache from a massive catalog, the computational overhead is greatly reduced, avoiding excessive pressure on the resource-limited edge server and saving computation cost.
Drawings
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention.
FIG. 2 is a detailed flow chart of an embodiment of the present invention.
Fig. 3 is a schematic diagram of a high-dimensional video caching action selection model based on the improved DDPG according to an embodiment of the present invention.
Fig. 4 is a schematic flow chart of an Adam-based training algorithm according to an embodiment of the present invention.
FIG. 5 is a graph showing the experimental results of the embodiment of the present invention.
In fig. 1: 1 denotes the edge server, 2 the base station, 3 a user, and 4 the backbone network.
Detailed Description
The technical scheme of the invention is further explained in detail below with reference to the drawings of the specification.
As shown in fig. 1, the system architecture of the embodiment of the present invention is as follows: the edge server 1, with which the base station 2 is equipped, selects videos to cache and can provide the cached videos to multiple users 3 within its coverage area. When a video a user wants to watch has been cached by the edge server 1, the video can be obtained directly from it. Otherwise, the video is acquired from the backbone network 4, such as a wireless network. The arrows in the figure indicate the acquisition of video.
As shown in fig. 2, the overall flow of the embodiment of the present invention includes:
step S1: performing system modeling for the high-dimensional video caching problem, then establishing a high-dimensional video caching action selection model based on an improved deep deterministic policy gradient (DDPG); fig. 3 is a schematic diagram of this model according to the embodiment of the present invention; the specific steps are as follows:

step S1-1: formalizing the high-dimensional video caching problem of the edge server:

Let the number of users within the coverage of the edge server be U, the number of videos be N, the time horizon be T, the maximum storage capacity of the edge server be C, and the unit time delay and unit traffic cost from the user's local device to the edge server be l and p, respectively. The video cache selection strategy of the edge server is $\pi = \{a_1, a_2, \ldots, a_T\}$, where $a_t = (a_{t,1}, \ldots, a_{t,N})$ denotes the video caching action performed by the edge server at time step t, $a_{t,j} \in \{0,1\}$. Since the number of videos N is extremely large, $a_t$ is high-dimensional and discrete. When $a_{t,j} = 1$, the edge server caches the video numbered j at time step t; if not, $a_{t,j} = 0$. If the edge server selects only one video to cache per time step, then $\sum_{j=1}^{N} a_{t,j} = 1$. At time step t, the viewing situation of user k is $w_{t,k} = (w_{t,k,1}, \ldots, w_{t,k,N})$, a high-dimensional vector; when $w_{t,k,j} = 1$, user k watches the video numbered j at time step t; if not, $w_{t,k,j} = 0$. At time step t, the caching situation of the edge server is $c_t = (c_{t,1}, \ldots, c_{t,N})$; clearly, $c_t$ is also a high-dimensional vector; when $c_{t,j} = 1$, the edge server has cached the video numbered j at time step t; if not, $c_{t,j} = 0$. Let the storage size occupied by the video numbered j be $m_j$, and let $r_t$ be the instant reward obtained after the edge server caches videos at time step t. The optimization target of the whole problem is to solve for the video cache selection strategy $\pi$ that maximizes the cumulative revenue of the edge server, i.e., minimizes the video transmission delay and the traffic cost spent by the user:

$$\max_{\pi} \; \mathbb{E}\Big[\sum_{t=1}^{T} \gamma^{t-1} \sum_{k=1}^{U} r_{t,k}\Big] \quad \text{s.t.} \quad \sum_{j=1}^{N} c_{t,j}\, m_j \le C,$$

wherein $\gamma$ is the discount rate, indicating the degree of interest in future rewards; $r_{t,k}$ is the instant reward the edge server obtains at time step t because user k watches a video; e is the positive instant reward obtained by the edge server when the video the user wants to watch has been cached by the edge server; $\beta \in [0,1]$ weights the video transmission delay against the traffic cost spent by the user.
Step S1-2: describing the problem model as a Markov decision process (MDP) represented by $(S, A, Z, R, P)$. Here S is the state space, storing the states that can be observed by the edge server; A is the high-dimensional action space, storing the original high-dimensional discrete video caching actions the edge server can execute; Z is the low-dimensional action space, storing the low-dimensional continuous video caching actions selectable by the edge server; R is the reward space, storing the instant rewards obtained by the edge server; P is the state transition probability space, representing the distribution over next states after the edge server executes an action in a given state.

(1) State:

At time step t-1, the watched situation of each video is $v_{t-1} = (v_{t-1,1}, \ldots, v_{t-1,N})$, computed as $v_{t-1,j} = \sum_{k=1}^{U} w_{t-1,k,j}$. Take $v_{t-1}$ and $c_{t-1}$ as the state currently observed by the edge server, i.e., $s_t = (v_{t-1}, c_{t-1})$. As described above, $v_{t-1}$ and $c_{t-1}$ are both high-dimensional vectors of dimension N, and thus $s_t$ is a high-dimensional state of dimension 2N.

(2) Actions:

Reduce the dimension of the video caching actions in the high-dimensional action space A to obtain the low-dimensional action space Z of dimension D, with D much smaller than N. Then at time step t, the low-dimensional continuous video caching action selectable by the edge server is $z_t = (z_{t,1}, \ldots, z_{t,D})$, $z_t \in Z$, which needs to be restored to the original high-dimensional discrete video caching action $a_t$ of dimension N, $a_t \in A$.

(3) State transition probability:

In the MDP, when the edge server in state $s_t$ selects videos to cache according to action $a_t$, the resulting next state is determined by $P(s_{t+1} \mid s_t, a_t)$.

(4) Instant reward:

The edge server obtains the instant reward $r_t$ after caching videos at time step t. The accumulated reward the edge server obtains from time step t is

$$R_t = \sum_{i=0}^{T-t} \gamma^{i} r_{t+i}.$$

The goal of the edge server is to maximize the cumulative revenue, i.e., the expectation of the accumulated reward, $\max \mathbb{E}[R_t]$. The optimization objective is converted into solving for the optimal video caching action $a_t$ of the edge server at each time step t so as to maximize the cumulative revenue of the edge server.

Step S1-3: the improved-DDPG-based high-dimensional video caching action selection model combines DDPG with a trained decoder. DDPG comprises an actor, a critic, and a replay buffer.

The actor is divided into an online actor network and a target actor network, both 4-layer deep fully-connected neural networks, with network parameters $\theta^{\mu}$ and $\theta^{\mu'}$, respectively. The input to the online actor network is the state $s_t$ observed by the edge server, and its output is the low-dimensional continuous video caching action $z_t$. The target actor network is used to update the network parameters of the online actor network. The decoder is a 6-layer deep fully-connected neural network with network parameters $\xi$; its input is the low-dimensional continuous video caching action $z_t$ and its output is the original high-dimensional discrete video caching action $a_t$. The replay buffer stores the state $s_t$ observed by the edge server, the low-dimensional continuous video caching action $z_t$, the instant reward $r_t$ obtained after the edge server selects videos to cache based on the action, and the state $s_{t+1}$ of the next time step observed by the edge server, i.e., the tuple $(s_t, z_t, r_t, s_{t+1})$. The critic is divided into an online critic network and a target critic network, both 4-layer deep fully-connected neural networks, with network parameters $\theta^{Q}$ and $\theta^{Q'}$, respectively. The input to the online critic network is the data $(s_t, z_t)$ sampled from the replay buffer, and its output is the state-action value $Q(s_t, z_t)$ after the edge server selects videos to cache, i.e., an estimate of the cumulative revenue obtained by the edge server. The target critic network is used to update the network parameters of the online critic network.
Step S2: the network parameters of the improved-DDPG-based high-dimensional video caching action selection model are trained through the Adam algorithm. The training process is based on designed training samples generated during the interaction between the edge server and the video caching environment; each training sample comprises the state observed by the edge server, the original high-dimensional discrete video caching action, the instant reward obtained after the edge server selects videos to cache, and the state of the next time step observed by the edge server. Before the decoder is trained, the encoder and a deep fully-connected neural network need to be trained. The encoder is a 6-layer deep fully-connected neural network with network parameters $\phi$; its input is the original high-dimensional discrete video caching action $a_t$ and its output is the low-dimensional continuous video caching action $z_t$. The deep fully-connected neural network has network parameters $\psi$ and 5 layers; its input is the state $s_t$ observed by the edge server and the low-dimensional continuous video caching action $z_t$, and its output is the state $s_{t+1}$ of the next moment. Fig. 4 is a schematic flow diagram of the Adam-based training algorithm according to the embodiment of the present invention; the specific steps are as follows:
Step S2-1: randomly initialize the network parameters $\phi$, $\psi$, and $\xi$ of the encoder, the deep fully-connected neural network, and the decoder, respectively.

Step S2-2: the encoder reduces the original high-dimensional discrete video caching action $a_t$ to the low-dimensional continuous video caching action $z_t = f_{\phi}(a_t)$.

Step S2-3: input the low-dimensional continuous video caching action $z_t$ and the state $s_t$ observed by the edge server into the deep fully-connected neural network, whose output gives the state $\hat{s}_{t+1}$ of the next moment.

Step S2-4: minimize the loss $L_1$ of the encoder and the deep fully-connected neural network to update their parameters $\phi$ and $\psi$:

$$L_1(\phi, \psi) = \mathbb{E}_{\rho^{\pi}}\big[-\log p_{\psi}(s_{t+1} \mid s_t, f_{\phi}(a_t))\big] + \beta_1 D_{KL}\big(f_{\phi}(a_t) \,\|\, \mathcal{N}(0, I)\big),$$

wherein $\mathbb{E}_{\rho^{\pi}}$ denotes the expectation over $\rho^{\pi}$, the distribution of state transition probabilities under policy $\pi$; $p_{\psi}(s_{t+1} \mid s_t, z_t)$ is the probability that the deep fully-connected neural network with parameters $\psi$ outputs $s_{t+1}$ given inputs $s_t$ and $z_t$; $f_{\phi}(a_t)$ is the encoder output with parameters $\phi$ given input $a_t$; $D_{KL}$ is the KL divergence, representing the difference between the encoder output and the Gaussian distribution $\mathcal{N}(0, I)$; $\beta_1$ is the weight of the KL divergence.

Step S2-5: repeat steps S2-2 to S2-4 until $L_1$ converges, finishing the training of the encoder and the deep fully-connected neural network.

Step S2-6: input $z_t$ into the decoder, whose output gives the original high-dimensional discrete video caching action $a_t = g_{\xi}(z_t)$, wherein $g_{\xi}(z_t)$ is the output of the decoder with parameters $\xi$ given input $z_t$.

Step S2-7: minimize the distance $L_2$ between the two low-dimensional continuous video caching actions to update the decoder parameters $\xi$:

$$L_2(\xi) = \mathbb{E}\big[\|f_{\phi}(g_{\xi}(z_t)) - z_t\|^2\big] + \beta_2\, \Omega(\xi),$$

wherein $f_{\phi}(g_{\xi}(z_t))$ is the encoder output when fed the original high-dimensional discrete video caching action output by the decoder; the first term ensures that the decoder is a one-sided inverse of the encoder, i.e., $f_{\phi}(g_{\xi}(z)) = z$ although $g_{\xi}(f_{\phi}(a))$ need not equal $a$; the second term, written here as the regularizer $\Omega(\xi)$ since its original expression is not preserved, ensures that $z_t$ is the unique minimum; $\beta_2$ is the weight of the second term.

Step S2-9: the network parameters of the online actor network and the target actor network are $\theta^{\mu}$ and $\theta^{\mu'}$, respectively; the network parameters of the online critic network and the target critic network are $\theta^{Q}$ and $\theta^{Q'}$, respectively. Randomly initialize $\theta^{\mu}$ and $\theta^{Q}$, then assign $\theta^{\mu}$ and $\theta^{Q}$ to $\theta^{\mu'}$ and $\theta^{Q'}$, respectively.
Step S2-10: after observing the state $s_t$, the edge server selects the low-dimensional continuous video caching action according to the online actor network and random noise:

$$z_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t,$$

wherein $\mu(s_t \mid \theta^{\mu})$ is the output of the online actor network with parameters $\theta^{\mu}$ given input state $s_t$; $\mathcal{N}_t$ is random noise used to increase the exploration of the video caching action.

Step S2-12: the edge server obtains the instant reward $r_t$ after selecting videos to cache based on the action, and observes the state $s_{t+1}$ of the next time step.

Step S2-15: use the Adam algorithm to minimize the loss $L_3$ of the online critic network to update its parameters $\theta^{Q}$:

$$L_3(\theta^{Q}) = \mathbb{E}\big[(y_t - Q(s_t, z_t \mid \theta^{Q}))^2\big], \qquad y_t = r_t + \gamma\, Q'(s_{t+1}, z_{t+1} \mid \theta^{Q'}),$$

wherein $r_t$ is the instant reward obtained after the edge server selects videos to cache based on the action; $\gamma$ is the discount rate, indicating the degree of interest in future rewards; $Q'(s_{t+1}, z_{t+1} \mid \theta^{Q'})$ is the state-action value output by the target critic network with parameters $\theta^{Q'}$ given inputs $s_{t+1}$ and $z_{t+1}$; $Q(s_t, z_t \mid \theta^{Q})$ is the state-action value output by the online critic network with parameters $\theta^{Q}$ given inputs $s_t$ and $z_t$.

Step S2-16: after the online actor network computes the policy gradient $\nabla_{\theta^{\mu}} J$, update its parameters $\theta^{\mu}$ using the Adam algorithm:

$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}\big[\nabla_{z} Q(s_t, z \mid \theta^{Q})\big|_{z = \mu(s_t \mid \theta^{\mu})}\, \nabla_{\theta^{\mu}} \mu(s_t \mid \theta^{\mu})\big],$$

wherein $\nabla_{z} Q(s_t, z \mid \theta^{Q})$ is the gradient of the state-action value with respect to the action; $\nabla_{\theta^{\mu}} \mu(s_t \mid \theta^{\mu})$ is the gradient of the action output by the online actor network with parameters $\theta^{\mu}$ with respect to those parameters; $\alpha$ is the update step size of the parameters $\theta^{\mu}$.
Step S3: input the high-dimensional state $s_t$ observed by the edge server; the trained improved-DDPG-based high-dimensional video caching action selection model outputs the low-dimensional continuous video caching action, the decoder restores it to the original high-dimensional discrete video caching action, and the edge server selects the videos to cache according to that action. The specific steps are as follows:

Step S3-1: after observing the high-dimensional state $s_t$, the edge server inputs $s_t$ into the trained online actor network, which outputs the low-dimensional continuous video caching action $z_t = \mu(s_t \mid \theta^{\mu})$, wherein $\theta^{\mu}$ is the parameter of the online actor network obtained after the training of step S2, and $\mu(s_t \mid \theta^{\mu})$ is the network output given input state $s_t$.

Step S3-2: input $z_t$ into the decoder, which outputs the original high-dimensional discrete video caching action $a_t = g_{\xi}(z_t)$, wherein $\xi$ is the decoder parameter obtained after the training of step S2, and $g_{\xi}(z_t)$ is the decoder output given input $z_t$.

Step S3-3: the edge server selects one video from the massive catalog according to $a_t$. If the cached videos of the edge server do not contain this video and the remaining storage capacity of the edge server is sufficient to cache it, the video is cached in the edge server. Otherwise, the earliest cached videos in the edge server are deleted in turn until the remaining storage capacity is sufficient to cache the video, and the video is then cached in the edge server.
To demonstrate the effectiveness of the invention, preliminary experiments were conducted. The proposed method was compared with DDPG, DQN, and PPO. In DDPG, the actor outputs a high-dimensional continuous video caching action that is then discretized with a softmax-based operation, the critic outputs the corresponding state-action value, and the edge server selects videos to cache according to the action output by the actor. In DQN, the edge server obtains all state-action values related to the current state from the Q network and then selects videos to cache with an ε-greedy policy. In PPO, the edge server selects videos to cache using the action output by the new actor. The comparison results are shown in fig. 5. As can be seen from the figure, with the proposed method the edge server converges fastest and obtains the largest cumulative revenue. Therefore, with the method of the invention, the video transmission delay and the traffic cost spent by the user are minimized, and the method is well suited for deployment on the edge server.
The above description is only a preferred embodiment of the present invention, and the scope of the present invention is not limited to the above embodiment; equivalent modifications or changes made by those skilled in the art according to the present disclosure shall fall within the scope of the present invention as set forth in the appended claims.
Claims (10)
1. A high-dimensional video cache selection method based on deep reinforcement learning, characterized by comprising the following steps:
step S1: performing system modeling for the high-dimensional video caching problem, and then establishing a high-dimensional video caching action selection model based on an improved deep deterministic policy gradient (DDPG);
step S2: training the network parameters of the improved-DDPG-based high-dimensional video caching action selection model through the Adam algorithm;
step S3: the edge server selecting videos to cache by using the trained improved-DDPG-based high-dimensional video caching action selection model.
2. The high-dimensional video cache selection method based on deep reinforcement learning of claim 1, characterized in that: in step S1, the specific steps are as follows:

step S1-1: formalizing the high-dimensional video caching problem of the edge server:

let the number of users within the coverage of the edge server be U, the number of videos be N, the time horizon be T, the maximum storage capacity of the edge server be C, and the unit time delay and unit traffic cost from the user's local device to the edge server be l and p, respectively; the video cache selection strategy of the edge server is $\pi = \{a_1, a_2, \ldots, a_T\}$;

wherein $a_t = (a_{t,1}, \ldots, a_{t,N})$ denotes the video caching action performed by the edge server at time step t, with $a_{t,j} \in \{0,1\}$; since the number of videos N is huge, $a_t$ is high-dimensional and discrete; when $a_{t,j} = 1$, the edge server caches the video numbered j at time step t, otherwise $a_{t,j} = 0$; if the edge server selects only one video to cache per time step, then $\sum_{j=1}^{N} a_{t,j} = 1$;

at time step t, the viewing situation of user k is $w_{t,k} = (w_{t,k,1}, \ldots, w_{t,k,N})$, a high-dimensional vector; when $w_{t,k,j} = 1$, user k watches the video numbered j at time step t, otherwise $w_{t,k,j} = 0$;

at time step t, the caching situation of the edge server is $c_t = (c_{t,1}, \ldots, c_{t,N})$, also a high-dimensional vector; when $c_{t,j} = 1$, the edge server has cached the video numbered j at time step t, otherwise $c_{t,j} = 0$;

let the storage size occupied by the video numbered j be $m_j$, and let $r_t$ be the instant reward obtained after the edge server caches videos at time step t; the optimization target of the whole problem is to solve for the video cache selection strategy $\pi$ that maximizes the cumulative revenue of the edge server, i.e., minimizes the video transmission delay and the traffic cost spent by the user:

$$\max_{\pi} \; \mathbb{E}\Big[\sum_{t=1}^{T} \gamma^{t-1} \sum_{k=1}^{U} r_{t,k}\Big] \quad \text{s.t.} \quad \sum_{j=1}^{N} c_{t,j}\, m_j \le C,$$

wherein $\gamma$ is the discount rate, representing the degree of interest in future rewards; $r_{t,k}$ is the instant reward obtained by the edge server at time step t because user k watches a video; e is the positive instant reward obtained by the edge server when the video the user wants to watch has been cached by the edge server; $\beta \in [0,1]$ weights the video transmission delay against the traffic cost spent by the user; C is the maximum storage capacity of the edge server; U is the number of users within the coverage of the edge server;

step S1-2: describing the above problem model as a Markov decision process represented by $(S, A, Z, R, P)$; wherein S is the state space, storing the states observable by the edge server; A is the high-dimensional action space, storing the original high-dimensional discrete video caching actions executable by the edge server; Z is the low-dimensional action space, storing the low-dimensional continuous video caching actions selectable by the edge server; R is the reward space, storing the instant rewards obtained by the edge server; P is the state transition probability space, representing the distribution over next states after the edge server executes an action in a given state;

step S1-3: the improved-DDPG-based high-dimensional video caching action selection model combines DDPG with a trained decoder; DDPG comprises an actor, a critic, and a replay buffer;

the actor is divided into an online actor network and a target actor network, both 4-layer deep fully-connected neural networks, with network parameters $\theta^{\mu}$ and $\theta^{\mu'}$, respectively; the input to the online actor network is the state $s_t$ observed by the edge server and its output is the low-dimensional continuous video caching action $z_t$; the target actor network is used to update the network parameters of the online actor network;

the decoder is a 6-layer deep fully-connected neural network with network parameters $\xi$; the input of the decoder is the low-dimensional continuous video caching action $z_t$ and its output is the original high-dimensional discrete video caching action $a_t$;

the replay buffer stores the state $s_t$ observed by the edge server, the low-dimensional continuous video caching action $z_t$, the instant reward $r_t$ obtained after the edge server selects videos to cache based on the action, and the state $s_{t+1}$ of the next time step observed by the edge server, i.e., $(s_t, z_t, r_t, s_{t+1})$;

the critic is divided into an online critic network and a target critic network, both 4-layer deep fully-connected neural networks, with network parameters $\theta^{Q}$ and $\theta^{Q'}$, respectively; the input to the online critic network is the data $(s_t, z_t)$ sampled from the replay buffer and its output is the state-action value $Q(s_t, z_t)$ after the edge server selects videos to cache, i.e., an estimate of the cumulative revenue obtained by the edge server; the target critic network is used to update the network parameters of the online critic network.
3. The high-dimensional video cache selection method based on deep reinforcement learning of claim 2, characterized in that: in step S1-2, the state is defined as:

at time step t-1, the watched situation of each video is $v_{t-1} = (v_{t-1,1}, \ldots, v_{t-1,N})$, calculated according to the following formula:

$$v_{t-1,j} = \sum_{k=1}^{U} w_{t-1,k,j};$$

$v_{t-1}$ and $c_{t-1}$ together form the state observed by the current edge server, i.e., $s_t = (v_{t-1}, c_{t-1})$, a high-dimensional state of dimension 2N.
4. The high-dimensional video cache selection method based on deep reinforcement learning of claim 2, characterized in that: in step S1-2, the action is defined as:

the dimension of the video caching actions in the high-dimensional action space A is reduced to obtain the low-dimensional action space Z of dimension D, D much smaller than N; then at time step t, the low-dimensional continuous video caching action selectable by the edge server is $z_t = (z_{t,1}, \ldots, z_{t,D})$, $z_t \in Z$, which needs to be restored to the original high-dimensional discrete video caching action $a_t$ of dimension N, $a_t \in A$.
6. The high-dimensional video cache selection method based on deep reinforcement learning of claim 2, characterized in that: in step S1-2, the instant reward is defined as:

the edge server obtains the instant reward $r_t$ after caching videos at time step t; the accumulated reward obtained by the edge server from time step t is

$$R_t = \sum_{i=0}^{T-t} \gamma^{i} r_{t+i};$$

the goal of the edge server is to maximize the cumulative revenue, i.e., the expectation of the accumulated reward, $\max \mathbb{E}[R_t]$.
7. The high-dimensional video cache selection method based on deep reinforcement learning of claim 1, characterized in that: in step S2, the network parameters of the improved-DDPG-based high-dimensional video caching action selection model are trained through the Adam algorithm, and the training process is based on training samples; before the decoder is trained, an encoder and a deep fully-connected neural network need to be trained; the encoder is a 6-layer deep fully-connected neural network with network parameters $\phi$; its input is the original high-dimensional discrete video caching action $a_t$ and its output is the low-dimensional continuous video caching action $z_t$; the deep fully-connected neural network has network parameters $\psi$ and 5 layers; its input is the state $s_t$ observed by the edge server and the low-dimensional continuous video caching action $z_t$, and its output is the state $s_{t+1}$ of the next moment.
8. The high-dimensional video cache selection method based on deep reinforcement learning of claim 7, characterized in that: the step S2 comprises the following specific steps:

step S2-1: randomly initializing the network parameters $\phi$, $\psi$, and $\xi$ of the encoder, the deep fully-connected neural network, and the decoder, respectively;

step S2-2: the encoder reducing the original high-dimensional discrete video caching action $a_t$ to the low-dimensional continuous video caching action $z_t = f_{\phi}(a_t)$;

step S2-3: inputting the low-dimensional continuous video caching action $z_t$ and the state $s_t$ observed by the edge server into the deep fully-connected neural network, whose output gives the state $\hat{s}_{t+1}$ of the next moment;

step S2-4: minimizing the loss $L_1$ of the encoder and the deep fully-connected neural network to update their parameters $\phi$ and $\psi$:

$$L_1(\phi, \psi) = \mathbb{E}_{\rho^{\pi}}\big[-\log p_{\psi}(s_{t+1} \mid s_t, f_{\phi}(a_t))\big] + \beta_1 D_{KL}\big(f_{\phi}(a_t) \,\|\, \mathcal{N}(0, I)\big),$$

wherein $\mathbb{E}_{\rho^{\pi}}$ denotes the expectation over $\rho^{\pi}$, the distribution of state transition probabilities under policy $\pi$; $p_{\psi}(s_{t+1} \mid s_t, z_t)$ is the probability that the deep fully-connected neural network with parameters $\psi$ outputs $s_{t+1}$ given inputs $s_t$ and $z_t$; $f_{\phi}(a_t)$ is the encoder output with parameters $\phi$ given input $a_t$; $D_{KL}$ is the KL divergence, representing the difference between the encoder output and the Gaussian distribution $\mathcal{N}(0, I)$; $\beta_1$ is the weight of the KL divergence;

step S2-5: repeating steps S2-2 to S2-4 until $L_1$ converges, finishing the training of the encoder and the deep fully-connected neural network;

step S2-6: inputting $z_t$ into the decoder, whose output gives the original high-dimensional discrete video caching action $a_t = g_{\xi}(z_t)$, wherein $g_{\xi}(z_t)$ is the output of the decoder with parameters $\xi$ given input $z_t$;

step S2-7: minimizing the distance $L_2$ between the two low-dimensional continuous video caching actions to update the decoder parameters $\xi$:

$$L_2(\xi) = \mathbb{E}\big[\|f_{\phi}(g_{\xi}(z_t)) - z_t\|^2\big] + \beta_2\, \Omega(\xi),$$

wherein $f_{\phi}(g_{\xi}(z_t))$ is the encoder output when fed the original high-dimensional discrete video caching action output by the decoder; the first term ensures that the decoder is a one-sided inverse of the encoder, i.e., $f_{\phi}(g_{\xi}(z)) = z$ although $g_{\xi}(f_{\phi}(a))$ need not equal $a$; the second term, written here as the regularizer $\Omega(\xi)$ since its original expression is not preserved, ensures that $z_t$ is the unique minimum; $\beta_2$ is the weight of the second term;

step S2-9: the network parameters of the online actor network and the target actor network being $\theta^{\mu}$ and $\theta^{\mu'}$, respectively, and the network parameters of the online critic network and the target critic network being $\theta^{Q}$ and $\theta^{Q'}$, respectively, randomly initializing $\theta^{\mu}$ and $\theta^{Q}$, then assigning $\theta^{\mu}$ and $\theta^{Q}$ to $\theta^{\mu'}$ and $\theta^{Q'}$, respectively;

step S2-10: after observing the state $s_t$, the edge server selecting the low-dimensional continuous video caching action according to the online actor network and random noise:

$$z_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t,$$

wherein $\mu(s_t \mid \theta^{\mu})$ is the output of the online actor network with parameters $\theta^{\mu}$ given input state $s_t$; $\mathcal{N}_t$ is random noise used to increase the exploration of the video caching action;

step S2-12: the edge server obtaining the instant reward $r_t$ after selecting videos to cache according to the action, and observing the state $s_{t+1}$ of the next time step;

step S2-15: using the Adam algorithm, minimizing the loss $L_3$ of the online critic network to update its parameters $\theta^{Q}$:

$$L_3(\theta^{Q}) = \mathbb{E}\big[(y_t - Q(s_t, z_t \mid \theta^{Q}))^2\big], \qquad y_t = r_t + \gamma\, Q'(s_{t+1}, z_{t+1} \mid \theta^{Q'}),$$

wherein $r_t$ is the instant reward obtained after the edge server selects videos to cache based on the action; $\gamma$ is the discount rate, representing the degree of interest in future rewards; $Q'(s_{t+1}, z_{t+1} \mid \theta^{Q'})$ is the state-action value output by the target critic network with parameters $\theta^{Q'}$ given inputs $s_{t+1}$ and $z_{t+1}$; $Q(s_t, z_t \mid \theta^{Q})$ is the state-action value output by the online critic network with parameters $\theta^{Q}$ given inputs $s_t$ and $z_t$;

step S2-16: after the online actor network computes the policy gradient $\nabla_{\theta^{\mu}} J$, updating its parameters $\theta^{\mu}$ using the Adam algorithm:

$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}\big[\nabla_{z} Q(s_t, z \mid \theta^{Q})\big|_{z = \mu(s_t \mid \theta^{\mu})}\, \nabla_{\theta^{\mu}} \mu(s_t \mid \theta^{\mu})\big],$$

wherein $\nabla_{z} Q(s_t, z \mid \theta^{Q})$ is the gradient of the state-action value with respect to the action; $\nabla_{\theta^{\mu}} \mu(s_t \mid \theta^{\mu})$ is the gradient of the action output by the online actor network with parameters $\theta^{\mu}$ with respect to those parameters; $\alpha$ is the update step size of the parameters $\theta^{\mu}$.
9. The high-dimensional video cache selection method based on deep reinforcement learning of claim 1, characterized in that: in step S3, the high-dimensional state $s_t$ observed by the edge server is input; the trained improved-DDPG-based high-dimensional video caching action selection model outputs the low-dimensional continuous video caching action, the decoder restores it to the original high-dimensional discrete video caching action, and the edge server selects videos to cache according to the original high-dimensional discrete video caching action.
10. The high-dimensional video cache selection method based on deep reinforcement learning of claim 9, characterized in that: the step S3 comprises the following specific steps:

step S3-1: after observing the high-dimensional state $s_t$, the edge server inputting $s_t$ into the trained online actor network, which outputs the low-dimensional continuous video caching action $z_t = \mu(s_t \mid \theta^{\mu})$, wherein $\theta^{\mu}$ is the parameter of the online actor network obtained after the training of step S2 and $\mu(s_t \mid \theta^{\mu})$ is the network output given input state $s_t$;

step S3-2: inputting $z_t$ into the decoder, which outputs the original high-dimensional discrete video caching action $a_t = g_{\xi}(z_t)$, wherein $\xi$ is the decoder parameter obtained after the training of step S2 and $g_{\xi}(z_t)$ is the decoder output given input $z_t$;

step S3-3: the edge server selecting one video from the massive catalog according to $a_t$; if the cached videos of the edge server do not contain this video and the remaining storage capacity of the edge server is sufficient to cache it, caching the video in the edge server; otherwise, deleting the earliest cached videos in the edge server in turn until the remaining storage capacity is sufficient to cache the video, and then caching the video in the edge server.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211270042.8A CN115344510B (en) | 2022-10-18 | 2022-10-18 | High-dimensional video cache selection method based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211270042.8A CN115344510B (en) | 2022-10-18 | 2022-10-18 | High-dimensional video cache selection method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115344510A true CN115344510A (en) | 2022-11-15 |
CN115344510B CN115344510B (en) | 2023-02-03 |
Family
ID=83957657
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211270042.8A Active CN115344510B (en) | 2022-10-18 | 2022-10-18 | High-dimensional video cache selection method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115344510B (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114025017A (en) * | 2021-11-01 | 2022-02-08 | 杭州电子科技大学 | Network edge caching method, device and equipment based on deep cycle reinforcement learning |
CN114281718A (en) * | 2021-12-18 | 2022-04-05 | 中国科学院深圳先进技术研究院 | Industrial Internet edge service cache decision method and system |
Also Published As
Publication number | Publication date |
---|---|
CN115344510B (en) | 2023-02-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Hu et al. | Leveraging meta-path based context for top-n recommendation with a neural co-attention model | |
Zhong et al. | A deep reinforcement learning-based framework for content caching | |
Wu et al. | Mobility-aware cooperative caching in vehicular edge computing based on asynchronous federated and deep reinforcement learning | |
He et al. | QoE-driven content-centric caching with deep reinforcement learning in edge-enabled IoT | |
JP5542688B2 (en) | Apparatus and method for optimizing user access to content | |
CN109995851B (en) | Content popularity prediction and edge caching method based on deep learning | |
Zhao et al. | Mahrl: Multi-goals abstraction based deep hierarchical reinforcement learning for recommendations | |
CN114595396B (en) | Federal learning-based sequence recommendation method and system | |
Fedchenko et al. | Feedforward neural networks for caching: N enough or too much? | |
Zheng et al. | MEC-enabled wireless VR video service: A learning-based mixed strategy for energy-latency tradeoff | |
CN102868936A (en) | Method and system for storing video logs | |
CN113255004A (en) | Safe and efficient federal learning content caching method | |
CN113687960A (en) | Edge calculation intelligent caching method based on deep reinforcement learning | |
Zhou et al. | SACC: A size adaptive content caching algorithm in fog/edge computing using deep reinforcement learning | |
CN117221403A (en) | Content caching method based on user movement and federal caching decision | |
CN115731498A (en) | Video abstract generation method combining reinforcement learning and contrast learning | |
CN115344510B (en) | High-dimensional video cache selection method based on deep reinforcement learning | |
Xue et al. | Prefrec: Recommender systems with human preferences for reinforcing long-term user engagement | |
Avrachenkov et al. | A learning algorithm for the Whittle index policy for scheduling web crawlers | |
Nguyen et al. | User-preference-based proactive caching in edge networks | |
CN112836822A (en) | Federal learning strategy optimization method and device based on width learning | |
CN117459112A (en) | Mobile edge caching method and equipment in LEO satellite network based on graph rolling network | |
Thar et al. | Meta-learning-based deep learning model deployment scheme for edge caching | |
CN114025017B (en) | Network edge caching method, device and equipment based on deep circulation reinforcement learning | |
Yan et al. | Drl-based collaborative edge content replication with popularity distillation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||