CN115344510A - High-dimensional video cache selection method based on deep reinforcement learning - Google Patents

High-dimensional video cache selection method based on deep reinforcement learning

Info

Publication number
CN115344510A
Authority
CN
China
Prior art keywords
video
edge server
dimensional
network
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211270042.8A
Other languages
Chinese (zh)
Other versions
CN115344510B (en)
Inventor
周剑
陈然
张伯雷
严筱永
李鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202211270042.8A priority Critical patent/CN115344510B/en
Publication of CN115344510A publication Critical patent/CN115344510A/en
Application granted granted Critical
Publication of CN115344510B publication Critical patent/CN115344510B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02Addressing or allocation; Relocation
    • G06F12/08Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0877Cache access modes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2455Query execution
    • G06F16/24552Database cache management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601Interfaces specially adapted for storage systems
    • G06F3/0628Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0655Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
    • G06F3/0656Data buffering arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The high-dimensional video cache selection method based on deep reinforcement learning applies deep reinforcement learning to the video cache selection of an edge server, takes the dynamic and high-dimensional nature of video cache selection into account, and achieves efficient video caching on the edge server. A decoder is used to improve DDPG so that the edge server can select suitable videos to cache, reducing the time delay of video transmission and the traffic cost spent by users. When the edge server selects videos to cache from a massive video library, the computational overhead is greatly reduced, excessive pressure on the resource-limited edge server is avoided, and computation cost is saved.

Description

High-dimensional video cache selection method based on deep reinforcement learning
Technical Field
The invention belongs to the technical field of computer application, and particularly relates to a high-dimensional video cache selection method based on deep reinforcement learning.
Background
With the advancement of science and technology, multimedia services and their applications have developed rapidly. The number of videos keeps growing, video quality keeps improving, and video traffic keeps increasing. This huge video traffic puts pressure on the backbone network. Edge computing brings data processing closer to the user and can improve the quality of multimedia services. In 5G networks in particular, base stations have been equipped with edge servers that provide storage and computing capability. Combining video caching with edge computing, the edge server caches the videos that users are relatively more likely to watch, which reduces the time delay of video transmission and the traffic cost spent by users.
The edge server equipped at the base station selects videos to cache and can provide the cached videos to multiple users within its coverage area. When a video that a user wants to watch has been cached by the edge server, the video can be obtained directly from the edge server; otherwise, the video is obtained over the backbone network, such as a wireless network.
The popularity of videos changes over time, so the edge server selects different videos to cache at different times; video cache selection is therefore dynamic. Due to its limited caching capacity, the edge server must select a subset of videos to cache from a massive video library, so video cache selection is also high-dimensional. The dynamic and high-dimensional nature of video cache selection makes efficient video caching on the edge server challenging.
Traditional video cache selection methods mostly consider only the dynamics of video cache selection and perform video caching with reinforcement learning or deep reinforcement learning, but they do not consider the high dimensionality of video cache selection. When the edge server selects videos to cache from a massive video library, the computational overhead is high, which puts pressure on the resource-limited edge server.
Disclosure of Invention
Aiming at the defects in the background art, the invention provides a high-dimensional video cache selection method based on deep reinforcement learning, which applies deep reinforcement learning to the video cache selection of the edge server. A decoder is used to improve the deep deterministic policy gradient (DDPG) algorithm, so that the edge server can select suitable videos to cache, reducing the time delay of video transmission and the traffic cost spent by users.
The high-dimensional video cache selection method based on deep reinforcement learning comprises the following steps:
Step S1: perform system modeling for the high-dimensional video caching problem, and then establish a high-dimensional video caching action selection model based on an improved deep deterministic policy gradient (DDPG);
Step S2: train the network parameters of the high-dimensional video caching action selection model based on the improved DDPG through the Adam algorithm;
Step S3: the edge server selects videos to cache using the trained high-dimensional video caching action selection model based on the improved DDPG.
Further, in step S1, the specific steps are as follows:
step S1-1: formalizing the high-dimensional video caching problem of the edge server:
setting the number of users in the coverage range of the edge server as U, the number of videos as N, the time length as T, the maximum storage capacity of the edge server as C, and the unit time delay and the unit traffic cost from the user's local device to the edge server as l and p respectively; the video cache selection strategy of the edge server is set as π = {a_1, a_2, ..., a_T}, wherein a_t = (a_{t,1}, ..., a_{t,N}) represents the video caching action performed by the edge server at time step t; since the number of videos N is huge, a_t is high-dimensional and discrete; when a_{t,j} = 1, the edge server caches the video numbered j at time step t, otherwise a_{t,j} = 0; if the edge server only selects one video to cache per time step, then Σ_{j=1}^{N} a_{t,j} = 1;
at time step t, the situation in which user k watches videos is x_{t,k} = (x_{t,k,1}, ..., x_{t,k,N}), k = 1, ..., U; x_{t,k} is a high-dimensional vector; when x_{t,k,j} = 1, user k watches the video numbered j at time t, otherwise x_{t,k,j} = 0; at time step t, the situation in which the edge server caches videos is c_t = (c_{t,1}, ..., c_{t,N}); c_t is also a high-dimensional vector; when c_{t,j} = 1, the edge server has cached the video numbered j at time step t, otherwise c_{t,j} = 0;
let the storage size occupied by the video numbered j be s_j, and let r_t be the instant reward obtained after the edge server caches a video at time step t; the optimization target of the whole problem is to solve the video cache selection strategy π of the edge server so as to maximize the cumulative revenue of the edge server, i.e., to minimize the time delay of video transmission and the traffic cost spent by users: the expected discounted sum of the per-user instant rewards r_{t,k} over the T time steps is maximized, subject to the constraint that the total storage occupied by the cached videos at any time step does not exceed C (the full objective, reward and constraint formulas are given as equation images in the original publication);
wherein γ is the discount rate, representing the degree of interest in future rewards; r_{t,k} is the instant reward the edge server obtains at time step t because user k watches a video; e is the positive instant reward obtained by the edge server when the video the user wants to watch has already been cached by the edge server; ω is the preference between the time delay of video transmission and the traffic cost spent by the user, and ω ranges from 0 to 1; C is the maximum storage capacity of the edge server; U is the number of users in the coverage range of the edge server;
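For illustration only, the following Python sketch shows one way the per-step reward described above could be computed from the viewing vectors x_t, the cache indicator c_t and the parameters e, ω, l and p. The exact reward formula of the patent is given as an equation image, so the particular weighting below (reward e per cache hit, and a per-miss penalty weighted between the unit delay l and the unit traffic cost p by ω) is an assumption, not the patented formula.

```python
import numpy as np

def step_reward(x_t, c_t, e=1.0, omega=0.5, l=1.0, p=1.0):
    """Illustrative per-step reward for the edge server (assumed form).

    x_t:   (U, N) 0/1 matrix, x_t[k, j] = 1 if user k watches video j at step t
    c_t:   (N,)   0/1 vector, c_t[j] = 1 if video j is cached on the edge server
    e:     positive reward per request served from the edge cache
    omega: preference in [0, 1] between transmission delay and traffic cost
    l, p:  unit delay and unit traffic cost for requests served from the backbone
    """
    hits = x_t @ c_t                      # per-user: 1 if the watched video is cached
    misses = x_t.sum(axis=1) - hits       # requests that must go to the backbone
    penalty = (omega * l + (1 - omega) * p) * misses.sum()
    return e * hits.sum() - penalty
```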
step S1-2: describing the above problem model as a Markov decision process (MDP) represented by the tuple (S, A, Z, R, P); wherein S is the state space, storing the states observable by the edge server; A is the high-dimensional action space, storing the original high-dimensional discrete video caching actions executable by the edge server; Z is the low-dimensional action space, storing the low-dimensional continuous video caching actions selectable by the edge server; R is the reward space, storing the instant rewards obtained by the edge server; P is the state transition probability space, representing the distribution over the next state entered after the edge server executes an action in a given state;
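As a minimal sketch of what one entry of the replay data defined by this MDP looks like in practice, the container below stores the four-tuple (s_t, z_t, r_t, s_{t+1}); the class and field names are illustrative, not taken from the patent.

```python
from dataclasses import dataclass
from collections import deque
import random
import numpy as np

@dataclass
class Transition:
    state: np.ndarray          # s_t, dimension 2N (viewing counts + cache indicator)
    latent_action: np.ndarray  # z_t, low-dimensional continuous caching action
    reward: float              # r_t, instant reward
    next_state: np.ndarray     # s_{t+1}

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, transition: Transition):
        self.buffer.append(transition)

    def sample(self, m: int):
        return random.sample(self.buffer, m)

    def __len__(self):
        return len(self.buffer)
```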
step S1-3: the high-dimensional video caching action selection model based on the improved DDPG combines DDPG with a trained decoder; the DDPG comprises an actor, a critic and a replay buffer;
the actor is divided into an online actor network and a target actor network, both of which are 4-layer deep fully-connected neural networks with network parameters θ^μ and θ^{μ'} respectively; the input to the online actor network is the state s_t observed by the edge server, and the output is the low-dimensional continuous video caching action z_t; the target actor network is used for updating the network parameters of the online actor network;
the decoder is a 6-layer deep fully-connected neural network with network parameters θ^d; the input of the decoder is the low-dimensional continuous video caching action z_t, and the output is the original high-dimensional discrete video caching action a_t;
the replay buffer stores the state s_t observed by the edge server, the low-dimensional continuous video caching action z_t, the instant reward r_t obtained after the edge server selects a video to cache according to action a_t, and the state s_{t+1} of the next time step observed by the edge server, i.e., the tuple (s_t, z_t, r_t, s_{t+1});
the critic is divided into an online critic network and a target critic network, both of which are 4-layer deep fully-connected neural networks with network parameters θ^Q and θ^{Q'} respectively; the input to the online critic network is the data (s_i, z_i) sampled from the replay buffer, and the output is the state-action value Q(s_i, z_i) after the edge server selects a video to cache, namely an estimate of the cumulative revenue obtained by the edge server; the target critic network is used for updating the network parameters of the online critic network.
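A minimal PyTorch sketch of the network shapes described in step S1-3 (4-layer actor and critic MLPs and a 6-layer decoder MLP) is given below; the hidden width, activation functions and the example values of N and D are assumptions, since the patent does not specify them.

```python
import torch.nn as nn

def mlp(sizes, out_act=nn.Identity):
    """Fully-connected network with len(sizes)-1 linear layers."""
    layers = []
    for i in range(len(sizes) - 1):
        act = nn.ReLU if i < len(sizes) - 2 else out_act
        layers += [nn.Linear(sizes[i], sizes[i + 1]), act()]
    return nn.Sequential(*layers)

N, D, H = 10_000, 32, 256                 # videos, latent action dim, hidden width (assumed)
state_dim = 2 * N                         # viewing counts + cache indicator

actor = mlp([state_dim, H, H, H, D], out_act=nn.Tanh)      # 4 layers: s_t -> z_t
critic = mlp([state_dim + D, H, H, H, 1])                  # 4 layers: (s_t, z_t) -> Q value
decoder = mlp([D, H, H, H, H, H, N], out_act=nn.Sigmoid)   # 6 layers: z_t -> a_t scores
```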
Further, in step S1-2, the state is defined as:
at time step t-1, the watched situation of each video is v_{t-1} = (v_{t-1,1}, ..., v_{t-1,N}), which is computed from the viewing vectors x_{t-1,k} of the U users at time step t-1 (the formula is given as an equation image in the original publication); take v_{t-1} and the caching situation c_{t-1} together as the state observed by the current edge server, i.e., s_t = (v_{t-1}, c_{t-1}); in light of the above description, v_{t-1} and c_{t-1} are both high-dimensional vectors of dimension N, and thus s_t is a high-dimensional state with dimension 2N.
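A small sketch of assembling this 2N-dimensional state is shown below; it assumes the per-video statistic v_{t-1,j} is simply the count of users who watched video j at the previous time step, which is one interpretation of the equation image in the original.

```python
import numpy as np

def build_state(x_prev, c_prev):
    """s_t = (v_{t-1}, c_{t-1}), dimension 2N.

    x_prev: (U, N) 0/1 viewing matrix at time step t-1
    c_prev: (N,)   0/1 cache indicator at time step t-1
    """
    v_prev = x_prev.sum(axis=0)    # assumed: per-video view count over the U users
    return np.concatenate([v_prev, c_prev]).astype(np.float32)
```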
Further, in step S1-2, the action is defined as:
the video caching actions in the high-dimensional action space A are reduced in dimension to obtain the low-dimensional action space Z, whose dimension is denoted D; then at time step t, the low-dimensional continuous video caching action selectable by the edge server is z_t = (z_{t,1}, ..., z_{t,D}); z_t needs to be restored to the original high-dimensional discrete video caching action a_t, whose dimension is N.
Further, in step S1-2, the state transition probability is defined as:
in the MDP, when the edge server is in state s_t and selects a video to cache according to action a_t, the next state is determined by the state transition probability P(s_{t+1} | s_t, a_t).
Further, in step S1-2, the instant reward is defined as:
the edge server obtains the instant reward r_t after caching a video at time step t; the cumulative reward obtained by the edge server from time step t is R_t, with the formula:
R_t = Σ_{i=t}^{T} γ^{i-t} r_i
the goal of the edge server is to maximize the cumulative revenue, i.e., the expectation of the cumulative reward:
J = E[R_t]
the optimization objective is thus converted into solving the optimal video caching action a_t of the edge server at each time step t so as to maximize the cumulative revenue of the edge server.
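For reference, a generic discounted-return computation matching this definition (not patent-specific code):

```python
def cumulative_reward(rewards, gamma=0.99):
    """R_t = sum_{i=t}^{T} gamma^(i-t) * r_i, for the first step of `rewards`."""
    R = 0.0
    for r in reversed(rewards):   # fold from the last time step backwards
        R = r + gamma * R
    return R
```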
Further, in step S2, the network parameters of the high-dimensional video caching action selection model based on the improved DDPG are trained through the Adam algorithm, and the training process is based on training samples; before training the decoder, an encoder and a deep fully-connected neural network need to be trained; the encoder is a 6-layer deep fully-connected neural network with network parameters θ^e; the input of the encoder is the original high-dimensional discrete video caching action a_t, and the output is the low-dimensional continuous video caching action z_t; the deep fully-connected neural network has network parameters θ^f and 5 layers; the inputs of the deep fully-connected neural network are the state s_t observed by the edge server and the low-dimensional continuous video caching action z_t, and the output is the state s_{t+1} of the next moment.
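A minimal PyTorch sketch of the encoder and the state-prediction network described here follows; the layer counts are taken from the text, while the hidden widths, activations and the Tanh-bounded latent output are assumptions.

```python
import torch
import torch.nn as nn

class ActionEncoder(nn.Module):
    """6 fully-connected layers: a_t (dimension N) -> z_t (dimension D)."""
    def __init__(self, n_videos, latent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_videos, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, latent_dim), nn.Tanh(),
        )

    def forward(self, a):
        return self.net(a)

class DynamicsNet(nn.Module):
    """5 fully-connected layers: (s_t, z_t) -> predicted s_{t+1}."""
    def __init__(self, state_dim, latent_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, z):
        return self.net(torch.cat([s, z], dim=-1))
```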
Further, the step S2 specifically includes the following steps:
step S2-1: randomly initializing the network parameters θ^e, θ^f and θ^d of the encoder, the deep fully-connected neural network and the decoder, respectively;
step S2-2: the encoder reduces the original high-dimensional discrete video caching action a_t into the low-dimensional continuous video caching action z_t;
step S2-3: inputting the low-dimensional continuous video caching action z_t and the state s_t observed by the edge server into the deep fully-connected neural network, and outputting the state s_{t+1} of the next moment;
step S2-4: minimizing the loss L1 of the encoder and the deep fully-connected neural network to update their parameters θ^e and θ^f; the loss (given as an equation image in the original publication) takes an expectation over the distribution of state transitions under the policy π, and combines the log-probability, under the deep fully-connected neural network with parameters θ^f, of outputting s_{t+1} given the inputs s_t and the encoder output z_t (the encoder, with parameters θ^e, takes a_t as its input) with a KL-divergence term that measures the difference between the encoder output distribution and a Gaussian distribution, weighted by the KL-divergence weight;
step S2-5: repeating steps S2-2 to S2-4 until L1 converges, finishing the training of the encoder and the deep fully-connected neural network;
step S2-6: inputting z_t into the decoder and outputting the original high-dimensional discrete video caching action a_t = g(z_t; θ^d), where g(z_t; θ^d) is the output of the decoder with parameters θ^d for the input z_t;
step S2-7: minimizing the distance L2 between two low-dimensional continuous video caching actions to update the decoder parameters θ^d; in L2 (given as an equation image in the original publication), the encoder output obtained by feeding the high-dimensional discrete video caching action produced by the decoder back into the encoder is compared with z_t; the first term ensures that the decoder is a one-sided inverse of the encoder, i.e., the encoder applied to the decoder output recovers z_t, while the decoder applied to the encoder output need not recover a_t; the second term ensures that this solution is the unique minimum, and is weighted by the weight of the second term;
step S2-8: repeating steps S2-6 to S2-7 until L2 converges, finishing the decoder training;
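Before continuing with steps S2-9 to S2-18, the following hedged PyTorch sketch summarizes the embedding-training stage just described (steps S2-1 to S2-8). The exact losses L1 and L2 are equation images in the patent, so the mean-squared prediction error with a simple Gaussian-prior surrogate for L1 and the latent-reconstruction form for L2 below are assumptions consistent with the surrounding description; the data iterator and optimizer settings are illustrative.

```python
import torch
import torch.nn.functional as F

def train_embedding(encoder, dynamics, decoder, batches,
                    kl_weight=1e-3, l2_weight=1e-3, lr=1e-3, epochs=10):
    """Stage 1: train encoder + dynamics net (loss L1), then the decoder (loss L2)."""
    opt1 = torch.optim.Adam(list(encoder.parameters()) + list(dynamics.parameters()), lr=lr)
    for _ in range(epochs):                        # until L1 converges
        for s, a, s_next in batches:               # (s_t, a_t, s_{t+1}) samples
            z = encoder(a)                         # S2-2: reduce a_t to z_t
            s_pred = dynamics(s, z)                # S2-3: predict s_{t+1}
            recon = F.mse_loss(s_pred, s_next)     # assumed likelihood term of L1
            kl = kl_weight * z.pow(2).mean()       # assumed KL-to-Gaussian surrogate
            loss1 = recon + kl                     # S2-4
            opt1.zero_grad(); loss1.backward(); opt1.step()

    opt2 = torch.optim.Adam(decoder.parameters(), lr=lr)
    for _ in range(epochs):                        # until L2 converges
        for s, a, s_next in batches:
            with torch.no_grad():
                z = encoder(a)
            a_hat = decoder(z)                     # S2-6: restore high-dimensional action
            z_back = encoder(a_hat)                # feed the decoder output back into the encoder
            loss2 = F.mse_loss(z_back, z) + l2_weight * a_hat.pow(2).mean()  # assumed form of L2
            opt2.zero_grad(); loss2.backward(); opt2.step()
    return encoder, dynamics, decoder
```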
step S2-9: the network parameters of the online actor network and the target actor network are θ^μ and θ^{μ'} respectively, and the network parameters of the online critic network and the target critic network are θ^Q and θ^{Q'} respectively; randomly initialize θ^μ and θ^Q, and then assign θ^μ and θ^Q to θ^{μ'} and θ^{Q'} respectively;
step S2-10: after the edge server observes the state s_t, the low-dimensional continuous video caching action z_t is selected according to the online actor network and random noise, with the formula:
z_t = μ(s_t; θ^μ) + ξ_t
wherein μ(s_t; θ^μ) is the output of the online actor network with parameters θ^μ for the input state s_t, and ξ_t is random noise used for increasing the exploration of video caching actions;
step S2-11: the decoder restores z_t to the original high-dimensional discrete video caching action a_t;
step S2-12: the edge server obtains the instant reward r_t after selecting a video to cache according to action a_t, and observes the state s_{t+1} of the next time step;
step S2-13: storing (s_t, z_t, r_t, s_{t+1}) into the replay buffer;
step S2-14: randomly sampling M pieces of data (s_i, z_i, r_i, s_{i+1}) from the replay buffer;
step S2-15: minimizing the loss L3 of the online critic network using the Adam algorithm to update its parameters θ^Q, with the formula:
L3 = (1/M) Σ_i [ r_i + γ Q'(s_{i+1}, μ'(s_{i+1}; θ^{μ'}); θ^{Q'}) - Q(s_i, z_i; θ^Q) ]²
wherein r_i is the instant reward obtained after the edge server selects a video to cache according to the action; γ is the discount rate, representing the degree of interest in future rewards; Q'(s_{i+1}, μ'(s_{i+1}; θ^{μ'}); θ^{Q'}) is the state-action value output by the target critic network with parameters θ^{Q'} for the inputs s_{i+1} and the target actor output μ'(s_{i+1}; θ^{μ'}); Q(s_i, z_i; θ^Q) is the state-action value output by the online critic network with parameters θ^Q for the inputs s_i and z_i;
step S2-16: after the online actor network computes the policy gradient ∇_{θ^μ}J, the parameters θ^μ are updated using the Adam algorithm, with the formulas:
∇_{θ^μ}J = (1/M) Σ_i ∇_z Q(s_i, z; θ^Q)|_{z=μ(s_i; θ^μ)} ∇_{θ^μ} μ(s_i; θ^μ)
θ^μ ← θ^μ + α ∇_{θ^μ}J
wherein ∇_z Q(s_i, z; θ^Q) is the gradient of the state-action value Q with respect to the action; ∇_{θ^μ} μ(s_i; θ^μ) is the gradient of the action output by the online actor network with respect to θ^μ; α is the update step size of the parameter θ^μ;
step S2-17: updating the parameters θ^{μ'} and θ^{Q'} in a soft manner, with the formulas:
θ^{μ'} ← τ θ^μ + (1 - τ) θ^{μ'}
θ^{Q'} ← τ θ^Q + (1 - τ) θ^{Q'}
wherein τ is the delayed update step size of the parameters θ^{μ'} and θ^{Q'};
step S2-18: repeating steps S2-10 to S2-17 until the loss function L3 converges, finishing the training.
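A compact, hedged PyTorch sketch of one iteration of steps S2-10 to S2-17 is given below; it follows the standard DDPG update with the decoder inserted in the action path. The interfaces of env, buffer and the networks, as well as the noise scale, batch size, γ and τ values, are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ddpg_iteration(actor, critic, decoder, env, buffer,
                   actor_tgt, critic_tgt, actor_opt, critic_opt,
                   s, gamma=0.99, tau=0.005, noise_std=0.1, batch_size=64):
    # S2-10: low-dimensional action from the online actor plus exploration noise
    with torch.no_grad():
        z = actor(s)
        z = z + noise_std * torch.randn_like(z)
        a = decoder(z)                         # S2-11: restore the high-dimensional action
    r, s_next = env.step(a)                    # S2-12: cache the chosen video, observe reward
    buffer.push((s, z, r, s_next))             # S2-13

    if len(buffer) >= batch_size:
        sb, zb, rb, sb_next = buffer.sample(batch_size)   # S2-14: minibatch of M transitions

        # S2-15: critic update (loss L3, mean squared TD error)
        with torch.no_grad():
            y = rb + gamma * critic_tgt(sb_next, actor_tgt(sb_next))
        critic_loss = F.mse_loss(critic(sb, zb), y)
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

        # S2-16: actor update via the deterministic policy gradient
        actor_loss = -critic(sb, actor(sb)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

        # S2-17: soft update of the target networks with step size tau
        for tgt, src in ((actor_tgt, actor), (critic_tgt, critic)):
            for p_t, p in zip(tgt.parameters(), src.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)
    return s_next
```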
Further, in step S3, the high-dimensional state s_t observed by the edge server is input, the trained high-dimensional video caching action selection model based on the improved DDPG outputs the low-dimensional continuous video caching action, the decoder restores this action into the original high-dimensional discrete video caching action, and the edge server selects the video to cache according to the original high-dimensional discrete video caching action.
Further, the step S3 specifically includes the following steps:
step S3-1: after the edge server observes the high-dimensional state s_t, s_t is input into the trained online actor network, and the low-dimensional continuous video caching action is output, with the formula:
z_t = μ(s_t; θ^μ*)
wherein θ^μ* is the parameter of the online actor network obtained after the training of step S2, and μ(s_t; θ^μ*) is the output of the online actor network with parameters θ^μ* for the input state s_t;
step S3-2: z_t is input into the decoder, and the original high-dimensional discrete video caching action is output, with the formula:
a_t = g(z_t; θ^d*)
wherein θ^d* is the parameter of the decoder obtained after the training of step S2, and g(z_t; θ^d*) is the output of the decoder with parameters θ^d* for the input z_t;
step S3-3: the edge server selects one video from the massive videos according to a_t; if the cached videos of the edge server do not contain this video and the remaining storage capacity of the edge server is enough to cache it, the video is cached into the edge server; otherwise, the earliest cached videos in the edge server are deleted in turn until the remaining storage capacity of the edge server is enough to cache the video, and then the video is cached into the edge server.
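A minimal sketch of this inference and cache-replacement procedure (steps S3-1 to S3-3) follows; decoding a_t to a single video index via argmax and evicting in insertion (caching) order are interpretations of the text, and all interfaces are illustrative.

```python
import torch
from collections import OrderedDict

def select_and_cache(actor, decoder, s_t, cache: OrderedDict, video_sizes, capacity):
    """cache maps video_id -> size, kept in the order the videos were cached."""
    with torch.no_grad():
        z_t = actor(s_t)                          # S3-1: low-dimensional action
        a_t = decoder(z_t)                        # S3-2: high-dimensional action scores
    video_id = int(torch.argmax(a_t))             # assumed: pick the highest-scoring video

    if video_id in cache:
        return cache                              # already cached, nothing to do
    used = sum(cache.values())
    while used + video_sizes[video_id] > capacity and cache:
        _, freed = cache.popitem(last=False)      # S3-3: evict the earliest cached video
        used -= freed
    cache[video_id] = video_sizes[video_id]       # cache the selected video
    return cache
```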
The invention has the following beneficial effects:
1) Deep reinforcement learning is applied to the video cache selection of the edge server, and efficient video caching on the edge server is achieved by taking the dynamic and high-dimensional nature of video cache selection into account.
2) A decoder is used to improve DDPG, so that the edge server can select suitable videos to cache, reducing the time delay of video transmission and the traffic cost spent by users.
3) When the edge server selects videos to cache from a massive video library, the computational overhead is greatly reduced, excessive pressure on the resource-limited edge server is avoided, and computation cost is saved.
Drawings
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention.
FIG. 2 is a detailed flow chart of an embodiment of the present invention.
Fig. 3 is a schematic diagram of a high-dimensional video caching action selection model based on the improved DDPG according to an embodiment of the present invention.
Fig. 4 is a schematic flow chart of an Adam-based training algorithm according to an embodiment of the present invention.
FIG. 5 is a graph showing the experimental results of the embodiment of the present invention.
In fig. 1, 1-edge server, 2-base station, 3-subscriber, 4-backbone.
Detailed Description
The technical scheme of the invention is further explained in detail by combining the drawings in the specification.
As shown in fig. 1, the system architecture of the embodiment of the present invention is specifically described as follows: the edge server 1 equipped at the base station 2 selects videos to cache and may provide the cached videos to a plurality of users 3 within its coverage area. When a video that a user wants to watch has been cached by the edge server 1, the video can be obtained directly from it; otherwise, the video is obtained from the backbone network 4, such as a wireless network. The arrows in the figure indicate the acquisition of video.
As shown in fig. 2, the overall flow of the embodiment of the present invention includes:
Step S1: perform system modeling for the high-dimensional video caching problem, and then establish a high-dimensional video caching action selection model based on an improved deep deterministic policy gradient (DDPG); fig. 3 is a schematic diagram of the high-dimensional video caching action selection model based on the improved DDPG according to the embodiment of the present invention. The specific steps are as follows:
Step S1-1: formalizing the high-dimensional video caching problem of the edge server:
The number of users in the coverage range of the edge server is set as U, the number of videos as N, the time length as T, and the maximum storage capacity of the edge server as C; the unit time delay and the unit traffic cost from the user's local device to the edge server are l and p, respectively. The video cache selection strategy of the edge server is set as π = {a_1, a_2, ..., a_T}, wherein a_t = (a_{t,1}, ..., a_{t,N}) represents the video caching action performed by the edge server at time step t. Since the number of videos N is extremely large, a_t is high-dimensional and discrete. When a_{t,j} = 1, the edge server caches the video numbered j at time step t; if not, a_{t,j} = 0. If the edge server only selects one video to cache per time step, then Σ_{j=1}^{N} a_{t,j} = 1. At time step t, the situation in which user k watches videos is x_{t,k} = (x_{t,k,1}, ..., x_{t,k,N}), which is a high-dimensional vector. When x_{t,k,j} = 1, user k watches the video numbered j at time t; if not, x_{t,k,j} = 0. At time step t, the situation in which the edge server caches videos is c_t = (c_{t,1}, ..., c_{t,N}), which is clearly also a high-dimensional vector. When c_{t,j} = 1, the edge server has cached the video numbered j at time step t; if not, c_{t,j} = 0. The storage size occupied by the video numbered j is s_j, and r_t is the instant reward obtained after the edge server caches a video at time step t. The optimization target of the whole problem is to solve the video cache selection strategy π of the edge server so as to maximize the cumulative revenue of the edge server, i.e., to minimize the time delay of video transmission and the traffic cost spent by users: the expected discounted sum of the per-user instant rewards r_{t,k} over the T time steps is maximized, subject to the constraint that the total storage occupied by the cached videos at any time step does not exceed C (the full objective, reward and constraint formulas are given as equation images in the original publication). Here γ is the discount rate, indicating the degree of interest in future rewards; r_{t,k} is the instant reward the edge server obtains at time step t because user k watches a video; e is the positive instant reward obtained by the edge server when the video the user wants to watch has already been cached by the edge server; ω is the preference between the time delay of video transmission and the traffic cost spent by the user, and ω ranges from 0 to 1.
Step S1-2: the problem model is described as a Markov decision process (MDP) represented by the tuple (S, A, Z, R, P). S is the state space, storing the states that can be observed by the edge server. A is the high-dimensional action space, storing the original high-dimensional discrete video caching actions that the edge server can execute. Z is the low-dimensional action space, storing the low-dimensional continuous video caching actions selectable by the edge server. R is the reward space, storing the instant rewards obtained by the edge server. P is the state transition probability space, representing the distribution over the next state entered after the edge server executes an action in a given state.
(1) State:
At time step t-1, the watched situation of each video is v_{t-1} = (v_{t-1,1}, ..., v_{t-1,N}), which is computed from the viewing vectors x_{t-1,k} of the U users at time step t-1 (the formula is given as an equation image in the original publication). v_{t-1} and the caching situation c_{t-1} are taken together as the state currently observed by the edge server, i.e., s_t = (v_{t-1}, c_{t-1}). In accordance with the above description, v_{t-1} and c_{t-1} are both high-dimensional vectors of dimension N, and thus s_t is a high-dimensional state with dimension 2N.
(2) Action:
The video caching actions in the high-dimensional action space A are reduced in dimension to obtain the low-dimensional action space Z, whose dimension is denoted D. Then at time step t, the low-dimensional continuous video caching action selectable by the edge server is z_t = (z_{t,1}, ..., z_{t,D}); z_t needs to be restored to the original high-dimensional discrete video caching action a_t, whose dimension is N.
(3) State transition probability:
In the MDP, when the edge server is in state s_t and selects a video to cache according to action a_t, the next state is determined by the state transition probability P(s_{t+1} | s_t, a_t).
(4) Instant reward:
The edge server obtains the instant reward r_t after caching a video at time step t. The cumulative reward obtained by the edge server from time step t is R_t = Σ_{i=t}^{T} γ^{i-t} r_i. The goal of the edge server is to maximize the cumulative revenue, i.e., the expectation of the cumulative reward, J = E[R_t]. The optimization objective is thus converted into solving the optimal video caching action a_t of the edge server at each time step t so as to maximize the cumulative revenue of the edge server.
Step S1-3: the high-dimensional video caching action selection model based on the improved DDPG combines DDPG with a trained decoder. The DDPG includes an actor, a critic and a replay buffer.
The actor is divided into an online actor network and a target actor network, both of which are 4-layer deep fully-connected neural networks with network parameters θ^μ and θ^{μ'} respectively. The input to the online actor network is the state s_t observed by the edge server, and the output is the low-dimensional continuous video caching action z_t. The target actor network is used for updating the network parameters of the online actor network. The decoder is a 6-layer deep fully-connected neural network with network parameters θ^d. The input of the decoder is the low-dimensional continuous video caching action z_t, and the output is the original high-dimensional discrete video caching action a_t. The replay buffer stores the state s_t observed by the edge server, the low-dimensional continuous video caching action z_t, the instant reward r_t obtained after the edge server selects a video to cache according to action a_t, and the state s_{t+1} of the next time step observed by the edge server, i.e., the tuple (s_t, z_t, r_t, s_{t+1}). The critic is divided into an online critic network and a target critic network, both of which are 4-layer deep fully-connected neural networks with network parameters θ^Q and θ^{Q'} respectively. The input to the online critic network is the data (s_i, z_i) sampled from the replay buffer, and the output is the state-action value Q(s_i, z_i) after the edge server selects a video to cache, i.e., an estimate of the cumulative revenue obtained by the edge server. The target critic network is used for updating the network parameters of the online critic network.
Step S2: the network parameters of the high-dimensional video caching action selection model based on the improved DDPG are trained through the Adam algorithm. The training process is based on designed training samples, which are generated during the interaction between the edge server and the video caching environment; each training sample contains the state observed by the edge server, the original high-dimensional discrete video caching action, the instant reward obtained after the edge server selects a video to cache, and the state of the next time step observed by the edge server. Before training the decoder, the encoder and a deep fully-connected neural network are required to be trained. The encoder is a 6-layer deep fully-connected neural network with network parameters θ^e; the input of the encoder is the original high-dimensional discrete video caching action a_t, and the output is the low-dimensional continuous video caching action z_t. The deep fully-connected neural network has network parameters θ^f and 5 layers; its inputs are the state s_t observed by the edge server and the low-dimensional continuous video caching action z_t, and its output is the state s_{t+1} of the next moment. As shown in fig. 4, which is a schematic flow diagram of the Adam-based training algorithm according to the embodiment of the present invention, the specific steps are as follows:
Step S2-1: randomly initialize the network parameters θ^e, θ^f and θ^d of the encoder, the deep fully-connected neural network and the decoder, respectively.
Step S2-2: the encoder reduces the original high-dimensional discrete video caching action a_t into the low-dimensional continuous video caching action z_t.
Step S2-3: the low-dimensional continuous video caching action z_t and the state s_t observed by the edge server are input into the deep fully-connected neural network, and the state s_{t+1} of the next moment is output.
Step S2-4: the loss L1 of the encoder and the deep fully-connected neural network is minimized to update their parameters θ^e and θ^f. The loss (given as an equation image in the original publication) takes an expectation over the distribution of state transitions under the policy π, and combines the log-probability, under the deep fully-connected neural network with parameters θ^f, of outputting s_{t+1} given the inputs s_t and the encoder output z_t (the encoder, with parameters θ^e, takes a_t as its input) with a KL-divergence term that measures the difference between the encoder output distribution and a Gaussian distribution, weighted by the KL-divergence weight.
Step S2-5: steps S2-2 to S2-4 are repeated until L1 converges, finishing the training of the encoder and the deep fully-connected neural network.
Step S2-6: z_t is input into the decoder, and the original high-dimensional discrete video caching action a_t = g(z_t; θ^d) is output, where g(z_t; θ^d) is the output of the decoder with parameters θ^d for the input z_t.
Step S2-7: the distance L2 between two low-dimensional continuous video caching actions is minimized to update the decoder parameters θ^d. In L2 (given as an equation image in the original publication), the encoder output obtained by feeding the high-dimensional discrete video caching action produced by the decoder back into the encoder is compared with z_t; the first term ensures that the decoder is a one-sided inverse of the encoder, i.e., the encoder applied to the decoder output recovers z_t, while the decoder applied to the encoder output need not recover a_t; the second term ensures that this solution is the unique minimum, and is weighted by the weight of the second term.
Step S2-8: steps S2-6 to S2-7 are repeated until L2 converges, finishing the decoder training.
Step S2-9: the network parameters of the online actor network and the target actor network are θ^μ and θ^{μ'} respectively, and the network parameters of the online critic network and the target critic network are θ^Q and θ^{Q'} respectively. θ^μ and θ^Q are randomly initialized, and then θ^μ and θ^Q are assigned to θ^{μ'} and θ^{Q'} respectively.
Step S2-10: after the edge server observes the state s_t, the low-dimensional continuous video caching action z_t is selected according to the online actor network and random noise, with the formula z_t = μ(s_t; θ^μ) + ξ_t, wherein μ(s_t; θ^μ) is the output of the online actor network with parameters θ^μ for the input state s_t, and ξ_t is random noise used to increase the exploration of video caching actions.
Step S2-11: the decoder restores z_t to the original high-dimensional discrete video caching action a_t.
Step S2-12: the edge server obtains the instant reward r_t after selecting a video to cache according to action a_t, and observes the state s_{t+1} of the next time step.
Step S2-13: (s_t, z_t, r_t, s_{t+1}) is stored into the replay buffer.
Step S2-14: M pieces of data (s_i, z_i, r_i, s_{i+1}) are randomly sampled from the replay buffer.
Step S2-15: the loss L3 of the online critic network is minimized using the Adam algorithm to update its parameters θ^Q, with the formula:
L3 = (1/M) Σ_i [ r_i + γ Q'(s_{i+1}, μ'(s_{i+1}; θ^{μ'}); θ^{Q'}) - Q(s_i, z_i; θ^Q) ]²
wherein r_i is the instant reward obtained after the edge server selects a video to cache according to the action; γ is the discount rate, indicating the degree of interest in future rewards; Q'(s_{i+1}, μ'(s_{i+1}; θ^{μ'}); θ^{Q'}) is the state-action value output by the target critic network with parameters θ^{Q'} for the inputs s_{i+1} and the target actor output μ'(s_{i+1}; θ^{μ'}); Q(s_i, z_i; θ^Q) is the state-action value output by the online critic network with parameters θ^Q for the inputs s_i and z_i.
Step S2-16: after the online actor network computes the policy gradient ∇_{θ^μ}J, the parameters θ^μ are updated using the Adam algorithm, with the formulas:
∇_{θ^μ}J = (1/M) Σ_i ∇_z Q(s_i, z; θ^Q)|_{z=μ(s_i; θ^μ)} ∇_{θ^μ} μ(s_i; θ^μ)
θ^μ ← θ^μ + α ∇_{θ^μ}J
wherein ∇_z Q(s_i, z; θ^Q) is the gradient of the state-action value Q with respect to the action; ∇_{θ^μ} μ(s_i; θ^μ) is the gradient of the action output by the online actor network with respect to θ^μ; α is the update step size of the parameter θ^μ.
Step S2-17: the parameters θ^{μ'} and θ^{Q'} are updated in a soft manner, with the formulas:
θ^{μ'} ← τ θ^μ + (1 - τ) θ^{μ'}
θ^{Q'} ← τ θ^Q + (1 - τ) θ^{Q'}
wherein τ is the delayed update step size of the parameters θ^{μ'} and θ^{Q'}.
Step S2-18: steps S2-10 to S2-17 are repeated until the loss function L3 converges, finishing the training.
Step S3: the high-dimensional state s_t observed by the edge server is input, the trained high-dimensional video caching action selection model based on the improved DDPG outputs the low-dimensional continuous video caching action, the decoder restores this action into the original high-dimensional discrete video caching action, and the edge server selects the video to cache according to the original high-dimensional discrete video caching action. The specific steps are as follows:
Step S3-1: after the edge server observes the high-dimensional state s_t, s_t is input into the trained online actor network, and the low-dimensional continuous video caching action z_t = μ(s_t; θ^μ*) is output, wherein θ^μ* is the parameter of the online actor network obtained after the training of step S2, and μ(s_t; θ^μ*) is the output of the online actor network with parameters θ^μ* for the input state s_t.
Step S3-2: z_t is input into the decoder, and the original high-dimensional discrete video caching action a_t = g(z_t; θ^d*) is output, wherein θ^d* is the parameter of the decoder obtained after the training of step S2, and g(z_t; θ^d*) is the output of the decoder with parameters θ^d* for the input z_t.
Step S3-3: the edge server selects one video from the massive videos according to a_t. If the video is not contained in the cached videos of the edge server and the remaining storage capacity of the edge server is sufficient to cache it, the video is cached into the edge server. Otherwise, the earliest cached videos in the edge server are deleted in turn until the remaining storage capacity of the edge server is sufficient to cache the video, and then the video is cached into the edge server.
To demonstrate the effectiveness of the present invention, preliminary experiments were conducted. The method proposed by the present invention was compared with DDPG, DQN and PPO. In DDPG, the actor outputs a high-dimensional continuous video caching action, which is then discretized by softmax-based processing; the critic outputs the corresponding state-action value, and the edge server selects the video to cache according to the action output by the actor. In DQN, the edge server obtains all state-action values related to the current state from the Q network, and then selects a video to cache using an ε-greedy policy. In PPO, the edge server selects the video to cache using the action output by the new actor network. The comparison results are shown in fig. 5. As can be seen from the figure, with the method proposed by the present invention the edge server converges fastest and obtains the largest cumulative revenue. Therefore, using the method of the present invention, the time delay of video transmission and the traffic cost spent by users are minimized, and the proposed method is more suitable for deployment on the edge server.
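For context on the baselines mentioned above, the following generic sketch illustrates the two discrete selection rules referred to: softmax-based discretization of a continuous score vector (as in the DDPG baseline) and ε-greedy selection over Q-values (as in the DQN baseline). Neither routine is code from the patent.

```python
import numpy as np

def softmax_select(scores, temperature=1.0, rng=None):
    """Discretize a high-dimensional continuous action by sampling from its softmax."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(scores, dtype=np.float64) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

def epsilon_greedy_select(q_values, epsilon=0.1, rng=None):
    """DQN-style selection: a random video with probability epsilon, else the argmax of Q."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```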
The above description is only a preferred embodiment of the present invention, and the scope of the present invention is not limited to the above embodiment, but equivalent modifications or changes made by those skilled in the art according to the present disclosure should be included in the scope of the present invention as set forth in the appended claims.

Claims (10)

1. A high-dimensional video cache selection method based on deep reinforcement learning, characterized by comprising the following steps:
step S1: performing system modeling for the high-dimensional video caching problem, and then establishing a high-dimensional video caching action selection model based on an improved deep deterministic policy gradient (DDPG);
step S2: training the network parameters of the high-dimensional video caching action selection model based on the improved DDPG through the Adam algorithm;
step S3: the edge server selects videos to cache by using the trained high-dimensional video caching action selection model based on the improved DDPG.
2. The method for selecting a high-dimensional video cache based on deep reinforcement learning of claim 1, wherein in step S1 the specific steps are as follows:

Step S1-1: formalizing the high-dimensional video caching problem of the edge server:

Let the number of users within the coverage of the edge server be U, the number of videos be N, the time horizon be T, the maximum storage capacity of the edge server be C, and the unit time delay and unit traffic cost from the local user to the edge server be l and p, respectively. The video cache selection strategy of the edge server is set as

  π = {a_1, a_2, …, a_T},  a_t = (a_{t,1}, a_{t,2}, …, a_{t,N}),

where a_t denotes the video caching action performed by the edge server at time step t, with a_{t,j} ∈ {0,1}. Since the number of videos N is huge, a_t is high-dimensional and discrete. When a_{t,j} = 1, the edge server caches the video numbered j at time step t; otherwise a_{t,j} = 0. If the edge server selects only one video to cache per time step, then Σ_{j=1}^{N} a_{t,j} = 1.

At time step t, the viewing situation of user k is w_{t,k} = (w_{t,k,1}, …, w_{t,k,N}); w_{t,k} is a high-dimensional vector, and w_{t,k,j} = 1 indicates that user k watches the video numbered j at time t, otherwise w_{t,k,j} = 0.

At time step t, the caching situation of the edge server is c_t = (c_{t,1}, …, c_{t,N}); c_t is also a high-dimensional vector, and c_{t,j} = 1 indicates that the edge server has cached the video numbered j at time step t, otherwise c_{t,j} = 0.

Let the storage size occupied by the video numbered j be s_j, and let the instant reward obtained by the edge server after caching a video at time step t be r_t. The optimization objective of the whole problem is to solve for the video cache selection strategy π of the edge server that maximizes the cumulative revenue of the edge server, i.e., minimizes the time delay of video transmission and the traffic cost spent by the users:

  max_π E[ Σ_{t=1}^{T} γ^t Σ_{k=1}^{U} r_{t,k} ]
  s.t. Σ_{j=1}^{N} c_{t,j} · s_j ≤ C, ∀t,
     a_{t,j} ∈ {0,1}, c_{t,j} ∈ {0,1}, w_{t,k,j} ∈ {0,1}, ∀t, j, k,

where γ is the discount rate, representing the degree of attention paid to future rewards; r_{t,k} is the instant reward obtained by the edge server at time step t because user k watches a video: it takes the positive value E when w_{t,k,j} = 1 ∧ c_{t,j} = 1 for the video j watched by user k, i.e., E is the positive instant reward obtained by the edge server when the video the user wants to watch is cached by the edge server, and otherwise it reflects the time delay of video transmission and the traffic cost spent by the user, traded off by the weight β, which ranges from 0 to 1; C is the maximum storage capacity of the edge server; U is the number of users within the coverage of the edge server; the symbol ∧ in the formula denotes logical "and".

Step S1-2: describing the problem model as a Markov decision process represented by the tuple (S, A, Z, R, P), wherein S is the state space, storing the states observable by the edge server; A is the high-dimensional action space, storing the original high-dimensional discrete video caching actions executable by the edge server; Z is the low-dimensional action space, storing the low-dimensional continuous video caching actions selectable by the edge server; R is the reward space, storing the instant rewards obtained by the edge server; P is the state transition probability space, representing the distribution over the next state entered by the edge server after executing an action in a given state.

Step S1-3: the improved-DDPG-based high-dimensional video caching action selection model combines DDPG with a trained decoder; the DDPG comprises an actor, a critic and a replay buffer.

The actor is divided into an online actor network and a target actor network, both of which are deep fully-connected neural networks with 4 layers, with network parameters θ^μ and θ^{μ'}, respectively. The input of the online actor network is the state s_t observed by the edge server, and its output is the low-dimensional continuous video caching action z_t. The target actor network is used for updating the network parameters of the online actor network.

The decoder is a deep fully-connected neural network with 6 layers, with network parameters θ^D. The input of the decoder is the low-dimensional continuous video caching action z_t, and its output is the original high-dimensional discrete video caching action a_t.

The replay buffer stores the state s_t observed by the edge server, the low-dimensional continuous video caching action z_t, the instant reward r_t obtained after the edge server selects a video to cache according to the action a_t, and the state s_{t+1} of the next time step observed by the edge server, i.e., the tuple (s_t, z_t, r_t, s_{t+1}).

The critic is divided into an online critic network and a target critic network, both of which are deep fully-connected neural networks with 4 layers, with network parameters θ^Q and θ^{Q'}, respectively. The input of the online critic network is the data (s_i, z_i) sampled from the replay buffer, and its output is the state-action value Q(s_i, z_i; θ^Q) obtained after the edge server selects a video to cache, i.e., an estimate of the cumulative revenue obtained by the edge server. The target critic network is used for updating the network parameters of the online critic network.
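For readers who want to relate the components of step S1-3 to code, the following PyTorch sketch builds the actor, critic, decoder and replay buffer with the layer counts stated in the claim. The hidden width, activation functions, the tanh output of the actor, the example sizes N and d, and the reading of "4 layers"/"6 layers" as counts of fully-connected layers are all assumptions made for illustration, not part of the claimed method.

```python
import copy
import random
from collections import deque

import torch
import torch.nn as nn


def mlp(sizes, out_act=nn.Identity):
    """Fully-connected network: len(sizes)-1 linear layers with ReLU in between."""
    layers = []
    for i in range(len(sizes) - 1):
        act = nn.ReLU if i < len(sizes) - 2 else out_act
        layers += [nn.Linear(sizes[i], sizes[i + 1]), act()]
    return nn.Sequential(*layers)


N, d, hidden = 10_000, 16, 256          # illustrative sizes: N videos, d-dim latent action
state_dim = 2 * N                        # state = (watched situation, cache indicators)

# Actor: 4 fully-connected layers, state -> low-dimensional continuous caching action.
actor = mlp([state_dim, hidden, hidden, hidden, d], out_act=nn.Tanh)
target_actor = copy.deepcopy(actor)

# Critic: 4 fully-connected layers, (state, latent action) -> scalar state-action value.
critic = mlp([state_dim + d, hidden, hidden, hidden, 1])
target_critic = copy.deepcopy(critic)

# Decoder: 6 fully-connected layers, latent action -> scores over the N discrete actions.
decoder = mlp([d, hidden, hidden, hidden, hidden, hidden, N])

# Replay buffer holding (s_t, z_t, r_t, s_{t+1}) tuples.
replay_buffer = deque(maxlen=100_000)


def sample_batch(m):
    """Randomly sample m transitions and stack them into batched tensors."""
    batch = random.sample(list(replay_buffer), m)
    s, z, r, s2 = zip(*batch)
    return (torch.stack(s), torch.stack(z),
            torch.tensor(r, dtype=torch.float32), torch.stack(s2))
```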
3. The method for selecting a high-dimensional video cache based on deep reinforcement learning of claim 2, wherein in step S1-2 the state is defined as follows:

At time step t-1, the watched situation of each video is d_{t-1} = (d_{t-1,1}, …, d_{t-1,N}), where d_{t-1,j} is calculated according to the following formula:

  d_{t-1,j} = Σ_{k=1}^{U} w_{t-1,k,j}

Taking d_{t-1} and the caching situation c_{t-1} together as the state currently observed by the edge server, i.e., s_t = (d_{t-1}, c_{t-1}). According to the above description, d_{t-1} and c_{t-1} are both high-dimensional vectors of dimension N, and thus s_t is a high-dimensional state with dimension 2N.
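A minimal sketch of this state construction, assuming that the per-video watched situation d_{t-1,j} is the count of users who watched video j at time step t-1 (the aggregation by summation is an assumption) and using random demonstration data:

```python
import numpy as np

U, N = 50, 1_000                        # illustrative numbers of users and videos

# w_prev[k, j] = 1 if user k watched video j at time step t-1 (random demo data).
w_prev = (np.random.rand(U, N) < 0.001).astype(np.float32)

# c_prev[j] = 1 if video j is currently cached at the edge server (random demo data).
c_prev = (np.random.rand(N) < 0.01).astype(np.float32)

# Per-video watched situation at t-1, aggregated over the U users.
d_prev = w_prev.sum(axis=0)

# State observed by the edge server: concatenation of the two N-dimensional vectors,
# giving a high-dimensional state of dimension 2N.
s_t = np.concatenate([d_prev, c_prev])
assert s_t.shape == (2 * N,)
```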
4. The method for selecting a high-dimensional video cache based on deep reinforcement learning of claim 2, wherein in step S1-2 the action is defined as follows:

The video caching actions in the high-dimensional action space A are reduced in dimension to obtain the low-dimensional action space Z, whose dimension is d. At time step t, the low-dimensional continuous video caching action selectable by the edge server is z_t = (z_{t,1}, …, z_{t,d}), z_t ∈ Z. The action z_t needs to be restored to the original high-dimensional discrete video caching action a_t, whose dimension is N, a_t ∈ A.
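The dimension bookkeeping of this claim can be illustrated with the sketch below. The decoder is faked with a random score vector purely to show the shapes of z_t (dimension d) and a_t (dimension N); the one-video-per-step one-hot form follows claim 2, and all sizes are placeholders.

```python
import numpy as np

N, d = 100_000, 16          # number of videos vs. latent action dimension

# A low-dimensional continuous caching action z_t selected by the actor (illustrative values).
z_t = np.random.uniform(-1.0, 1.0, size=d)

# The decoder must map z_t back to a high-dimensional discrete action a_t in {0,1}^N.
# Here a random score vector stands in for the decoder, only to show the shapes involved.
scores = np.random.randn(N)
a_t = np.zeros(N, dtype=np.int8)
a_t[np.argmax(scores)] = 1   # one video cached per time step -> one-hot action

assert z_t.shape == (d,) and a_t.shape == (N,) and a_t.sum() == 1
```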
5. The method for selecting a high-dimensional video cache based on deep reinforcement learning of claim 2, wherein in step S1-2 the state transition probability is defined as follows:

In the MDP, when the edge server is in state s_t and selects a video to cache according to the action a_t, the resulting next state is determined by the state transition probability P(s_{t+1} | s_t, a_t).
6. The method for selecting a high-dimensional video cache based on deep reinforcement learning of claim 2, wherein in step S1-2 the instant reward is defined as follows:

The edge server obtains the instant reward r_t after caching a video at time step t. The cumulative reward obtained by the edge server from time step t is R_t, computed as:

  R_t = Σ_{i=t}^{T} γ^{i-t} · r_i

The goal of the edge server is to maximize the cumulative revenue, i.e., the expectation of the cumulative reward:

  max E[R_t]

The optimization objective is thus converted into solving for the optimal video caching action a_t of the edge server at time step t, so as to maximize the cumulative revenue of the edge server.
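A small sketch of the cumulative reward R_t; the helper name and example numbers are illustrative only.

```python
def cumulative_reward(rewards, gamma, t=0):
    """R_t = sum_{i >= t} gamma**(i - t) * r_i for a finite episode of instant rewards."""
    return sum(gamma ** (i - t) * r for i, r in enumerate(rewards) if i >= t)


# Example: instant rewards of an edge server over T = 5 time steps.
rewards = [1.0, 0.0, 2.0, 0.5, 1.5]
print(cumulative_reward(rewards, gamma=0.9))       # cumulative reward from t = 0
print(cumulative_reward(rewards, gamma=0.9, t=2))  # cumulative reward from t = 2
```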
7. The method for selecting a high-dimensional video cache based on deep reinforcement learning of claim 1, wherein in step S2 the network parameters of the improved-DDPG-based high-dimensional video caching action selection model are trained through the Adam algorithm, and the training process is based on training samples. Before training the decoder, an encoder and a deep fully-connected neural network need to be trained. The encoder is a deep fully-connected neural network with 6 layers, with network parameters θ^E; the input of the encoder is the original high-dimensional discrete video caching action a_t, and its output is the low-dimensional continuous video caching action z_t. The network parameters of the deep fully-connected neural network are θ^P, and its number of network layers is 5; the input of the deep fully-connected neural network is the state s_t observed by the edge server and the low-dimensional continuous video caching action z_t, and its output is the state s_{t+1} of the next time step.
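The encoder and the 5-layer deep fully-connected neural network can be sketched in PyTorch as below. The Gaussian (mean, log-variance) parameterization of the encoder output and the reparameterization step are assumptions made so that the KL-divergence term to a Gaussian in step S2-4 of claim 8 is well defined; widths, activations and sizes are illustrative only.

```python
import torch
import torch.nn as nn

N, d, hidden = 10_000, 16, 256
state_dim = 2 * N


class Encoder(nn.Module):
    """6 fully-connected layers; maps a high-dimensional discrete caching action a_t
    to the mean and log-variance of a low-dimensional continuous action z_t."""

    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(N, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 2 * d)   # 6th layer: outputs (mu, log_var)

    def forward(self, a):
        mu, log_var = self.head(self.body(a)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterization
        return z, mu, log_var


class DynamicsNet(nn.Module):
    """5 fully-connected layers; predicts the next state from (s_t, z_t)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, s, z):
        return self.net(torch.cat([s, z], dim=-1))
```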
8. The method for selecting a high-dimensional video cache based on deep reinforcement learning of claim 7, wherein step S2 comprises the following specific steps:

Step S2-1: randomly initializing the network parameters θ^E, θ^P and θ^D of the encoder, the deep fully-connected neural network and the decoder, respectively.

Step S2-2: the encoder reduces the original high-dimensional discrete video caching action a_t to the low-dimensional continuous video caching action z_t.

Step S2-3: the low-dimensional continuous video caching action z_t and the state s_t observed by the edge server are input into the deep fully-connected neural network, which outputs the state s_{t+1} of the next moment.

Step S2-4: minimizing the loss L_1 of the encoder and the deep fully-connected neural network to update the parameters θ^E and θ^P of the encoder and the deep fully-connected neural network, the formula being:

  L_1 = E_{(s_t, a_t, s_{t+1}) ~ ρ_π} [ −log p(s_{t+1} | s_t, z_t; θ^P) + λ_1 · D_KL( q(z_t | a_t; θ^E) ‖ N(0, I) ) ]

where E_{(·)~ρ_π}[·] denotes calculating the expectation over ρ_π; ρ_π is the distribution of state transition probabilities under the policy π; p(s_{t+1} | s_t, z_t; θ^P) is the probability that the deep fully-connected neural network with parameters θ^P outputs s_{t+1} when its inputs are s_t and z_t; q(z_t | a_t; θ^E) is the output of the encoder with parameters θ^E when its input is a_t; D_KL is the KL divergence, representing the difference between q(z_t | a_t; θ^E) and the Gaussian distribution N(0, I); λ_1 is the weight of the KL divergence.

Step S2-5: repeating steps S2-2 to S2-4 until L_1 converges, completing the training of the encoder and the deep fully-connected neural network.

Step S2-6: inputting z_t into the decoder, which outputs the restored original high-dimensional discrete video caching action a_t, the formula being:

  a_t = D(z_t; θ^D)

where D(z_t; θ^D) is the output of the decoder with parameters θ^D when its input is z_t.

Step S2-7: minimizing the distance L_2 between the two low-dimensional continuous video caching actions to update the decoder parameters θ^D. L_2 consists of two terms: the first term is ‖ q(D(z_t; θ^D); θ^E) − z_t ‖², where q(D(z_t; θ^D); θ^E) is the output of the encoder with parameters θ^E when its input is the original high-dimensional discrete video caching action output by the decoder; this term ensures that the decoder is a one-sided inverse of the encoder, i.e., the encoder maps the decoder output back to z_t, although the decoder is not an exact inverse of the encoder. The second term, weighted by λ_2, ensures that this minimum is unique.

Step S2-8: repeating steps S2-6 to S2-7 until L_2 converges, completing the decoder training.

Step S2-9: the network parameters of the online actor network and the target actor network are θ^μ and θ^{μ'} respectively, and the network parameters of the online critic network and the target critic network are θ^Q and θ^{Q'} respectively; randomly initializing θ^μ and θ^Q, and then assigning θ^μ and θ^Q to θ^{μ'} and θ^{Q'}, respectively.

Step S2-10: after the edge server observes the state s_t, the low-dimensional continuous video caching action z_t is selected according to the online actor network and random noise, the formula being:

  z_t = μ(s_t; θ^μ) + 𝒩_t

where μ(s_t; θ^μ) is the output of the online actor network with parameters θ^μ when its input is the state s_t; 𝒩_t is random noise used to increase exploration of the video caching action.

Step S2-11: the decoder restores z_t to the original high-dimensional discrete video caching action a_t.

Step S2-12: after the edge server selects a video to cache according to the action a_t, it obtains the instant reward r_t and observes the state s_{t+1} of the next time step.

Step S2-13: storing (s_t, z_t, r_t, s_{t+1}) into the replay buffer.

Step S2-14: randomly sampling M pieces of data (s_i, z_i, r_i, s_{i+1}) from the replay buffer.

Step S2-15: using the Adam algorithm, the online critic network minimizes the loss L_3 to update its parameters θ^Q, the formula being:

  L_3 = (1/M) · Σ_{i=1}^{M} ( r_i + γ · Q'(s_{i+1}, μ'(s_{i+1}; θ^{μ'}); θ^{Q'}) − Q(s_i, z_i; θ^Q) )²

where r_i is the instant reward obtained after the edge server selects a video to cache according to the action a_i; γ is the discount rate, indicating the degree of attention paid to future rewards; Q'(s_{i+1}, μ'(s_{i+1}; θ^{μ'}); θ^{Q'}) is the state-action value output by the target critic network with parameters θ^{Q'} when its inputs are s_{i+1} and μ'(s_{i+1}; θ^{μ'}); Q(s_i, z_i; θ^Q) is the state-action value output by the online critic network with parameters θ^Q when its inputs are s_i and z_i.

Step S2-16: after the online actor network computes the policy gradient ∇_{θ^μ} J, the parameters θ^μ are updated using the Adam algorithm, the formulas being:

  ∇_{θ^μ} J = (1/M) · Σ_{i=1}^{M} ∇_z Q(s_i, z; θ^Q)|_{z=μ(s_i; θ^μ)} · ∇_{θ^μ} μ(s_i; θ^μ)
  θ^μ ← θ^μ + α · ∇_{θ^μ} J

where ∇_z Q(s_i, z; θ^Q) is the gradient of the state-action value Q(s_i, z; θ^Q) with respect to the action z; ∇_{θ^μ} μ(s_i; θ^μ) is the gradient of the action output by the online actor network with parameters θ^μ with respect to θ^μ; α is the update step size of the parameters θ^μ.

Step S2-17: updating the parameters θ^{μ'} and θ^{Q'} in a soft manner, the formulas being:

  θ^{μ'} ← τ · θ^μ + (1 − τ) · θ^{μ'}
  θ^{Q'} ← τ · θ^Q + (1 − τ) · θ^{Q'}

where τ is the delayed update step size of the parameters θ^{μ'} and θ^{Q'}.

Step S2-18: repeating steps S2-10 to S2-17 until the loss function L_3 converges, completing the training.
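Steps S2-15 to S2-17 follow the standard DDPG update pattern. The following sketch shows one critic step, one actor step and one soft update under assumed function and argument names; it is an illustration, not the patented implementation, and the Adam optimizers are assumed to be created elsewhere (e.g. torch.optim.Adam(critic.parameters(), lr=1e-3)).

```python
import torch
import torch.nn.functional as F


def critic_update(batch, target_actor, critic, target_critic, critic_opt, gamma):
    """One online-critic step (step S2-15): minimize the squared TD error against the targets."""
    s, z, r, s2 = batch                           # tensors sampled from the replay buffer
    with torch.no_grad():
        z2 = target_actor(s2)                     # target actor action for the next state
        y = r.unsqueeze(-1) + gamma * target_critic(torch.cat([s2, z2], dim=-1))
    q = critic(torch.cat([s, z], dim=-1))
    loss = F.mse_loss(q, y)
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
    return loss.item()


def actor_update(batch, actor, critic, actor_opt):
    """One online-actor step (step S2-16): ascend the deterministic policy gradient
    by minimizing -Q(s, mu(s)) with respect to the actor parameters only."""
    s = batch[0]
    loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()
    return loss.item()


def soft_update(online, target, tau):
    """Step S2-17: target <- tau * online + (1 - tau) * target."""
    with torch.no_grad():
        for p, p_t in zip(online.parameters(), target.parameters()):
            p_t.mul_(1.0 - tau).add_(tau * p)
```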
9. The method for selecting a high-dimensional video cache based on deep reinforcement learning of claim 1, wherein in step S3 the high-dimensional state s_t observed by the edge server is input, the trained improved-DDPG-based high-dimensional video caching action selection model outputs a low-dimensional continuous video caching action, the decoder restores this action to the original high-dimensional discrete video caching action, and the edge server selects a video to cache according to the original high-dimensional discrete video caching action.
10. The method for selecting a high-dimensional video cache based on deep reinforcement learning of claim 9, wherein step S3 comprises the following specific steps:

Step S3-1: after the edge server observes the high-dimensional state s_t, s_t is input into the trained online actor network, which outputs the low-dimensional continuous video caching action z_t, the formula being:

  z_t = μ(s_t; θ^μ)

where θ^μ is the parameter of the online actor network obtained after the training of step S2; μ(s_t; θ^μ) is the output of the online actor network with parameters θ^μ when its input is the state s_t.

Step S3-2: inputting z_t into the decoder, which outputs the original high-dimensional discrete video caching action a_t, the formula being:

  a_t = D(z_t; θ^D)

where θ^D is the parameter of the decoder obtained after the training of step S2; D(z_t; θ^D) is the output of the decoder with parameters θ^D when its input is z_t.

Step S3-3: the edge server selects one video from the numerous videos according to a_t; if the video is not contained in the videos cached by the edge server and the remaining storage capacity of the edge server is sufficient to cache the video, the video is cached in the edge server; otherwise, videos cached in the edge server are deleted in turn until the remaining storage capacity is sufficient to cache the video, and the video is then cached in the edge server.
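Putting claim 10 together, a hypothetical inference helper might look like the sketch below. It reuses the illustrative actor and decoder modules and the cache_selected_video helper sketched earlier, and assumes the decoder output is a score vector over the N videos from which the single video to cache is taken by argmax; none of these names come from the patent itself.

```python
import torch


def select_and_cache(state, actor, decoder, cache, video_size, capacity):
    """Step S3 end to end: actor -> decoder -> pick one video -> apply the caching rule."""
    with torch.no_grad():
        z_t = actor(state)                     # S3-1: low-dimensional continuous action
        scores = decoder(z_t)                  # S3-2: restore the high-dimensional action
    selected_id = int(torch.argmax(scores))    # index of the single video to cache
    # S3-3: cache it, evicting cached videos in turn if capacity is insufficient.
    return cache_selected_video(selected_id, video_size, cache, capacity)
```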
CN202211270042.8A 2022-10-18 2022-10-18 High-dimensional video cache selection method based on deep reinforcement learning Active CN115344510B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211270042.8A CN115344510B (en) 2022-10-18 2022-10-18 High-dimensional video cache selection method based on deep reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211270042.8A CN115344510B (en) 2022-10-18 2022-10-18 High-dimensional video cache selection method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN115344510A true CN115344510A (en) 2022-11-15
CN115344510B CN115344510B (en) 2023-02-03

Family

ID=83957657

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211270042.8A Active CN115344510B (en) 2022-10-18 2022-10-18 High-dimensional video cache selection method based on deep reinforcement learning

Country Status (1)

Country Link
CN (1) CN115344510B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114025017A (en) * 2021-11-01 2022-02-08 杭州电子科技大学 Network edge caching method, device and equipment based on deep cycle reinforcement learning
CN114281718A (en) * 2021-12-18 2022-04-05 中国科学院深圳先进技术研究院 Industrial Internet edge service cache decision method and system

Also Published As

Publication number Publication date
CN115344510B (en) 2023-02-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant