CN115344510A - High-dimensional video cache selection method based on deep reinforcement learning
- Publication number
- CN115344510A CN202211270042.8A
- Authority
- CN
- China
- Prior art keywords
- video
- edge server
- dimensional
- network
- action
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F12/00—Accessing, addressing or allocating within memory systems or architectures
- G06F12/02—Addressing or allocation; Relocation
- G06F12/08—Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
- G06F12/0802—Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
- G06F12/0877—Cache access modes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24552—Database cache management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/71—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/06—Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
- G06F3/0601—Interfaces specially adapted for storage systems
- G06F3/0628—Interfaces specially adapted for storage systems making use of a particular technique
- G06F3/0655—Vertical data movement, i.e. input-output transfer; data movement between one or more hosts and one or more storage devices
- G06F3/0656—Data buffering arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Human Computer Interaction (AREA)
- Processing Or Creating Images (AREA)
Abstract
The high-dimensional video cache selection method based on deep reinforcement learning applies deep reinforcement learning to the video cache selection of an edge server. It accounts for both the dynamic and the high-dimensional nature of video cache selection, realizing efficient video caching on the edge server. A decoder is used to improve DDPG, so that the edge server can select suitable videos to cache, reducing the video transmission delay and the traffic cost spent by users. When the edge server selects videos to cache from a massive catalog, the computational overhead is greatly reduced, avoiding excessive pressure on the resource-limited edge server and saving computation cost.
Description
Technical Field
The invention belongs to the technical field of computer application, and particularly relates to a high-dimensional video cache selection method based on deep reinforcement learning.
Background
With advances in technology, multimedia services and their applications have developed rapidly. The number of videos keeps growing, video quality keeps improving, and video traffic keeps increasing. This huge video traffic puts pressure on the backbone network. Edge computing brings data processing closer to the user and can improve the quality of multimedia services. In 5G networks in particular, base stations are already equipped with edge servers that provide storage and computing capability. By combining video caching with edge computing, the edge server selects and caches the videos users are relatively more likely to watch, which reduces the video transmission delay and the traffic cost spent by users.
The edge server equipped at a base station selects videos to cache and can provide the cached videos to multiple users within its coverage area. When a video a user wants to watch has been cached by the edge server, the video is obtained directly from the edge server. Otherwise, the video is acquired from the backbone network, such as a wireless network.
Video popularity changes over time, and the edge server selects different videos to cache accordingly, so video cache selection is dynamic. Because of its limited caching capacity, the edge server must select a subset of videos to cache from a very large catalog, so video cache selection is also high-dimensional. Together, the dynamic and high-dimensional characteristics of video cache selection challenge efficient video caching on the edge server.
Traditional video cache selection methods mostly consider only the dynamics of cache selection and perform caching with reinforcement learning or deep reinforcement learning, without considering the high dimensionality of video cache selection. When the edge server selects videos to cache from a massive catalog, the computation cost is high and puts pressure on the resource-limited edge server.
Disclosure of Invention
Aiming at the defects in the background art, the invention provides a high-dimensional video cache selection method based on deep reinforcement learning, applying deep reinforcement learning to the video cache selection of the edge server. DDPG is improved with a decoder, so that the edge server can select suitable videos to cache, reducing the video transmission delay and the traffic cost spent by users.
The high-dimensional video cache selection method based on deep reinforcement learning comprises the following steps:
step S1: performing system modeling for the high-dimensional video caching problem, and then establishing a high-dimensional video caching action selection model based on an improved deep deterministic policy gradient (DDPG);
step S2: training the network parameters of the improved-DDPG-based high-dimensional video caching action selection model through the Adam algorithm;
step S3: the edge server selecting videos to cache by using the trained improved-DDPG-based high-dimensional video caching action selection model.
Further, in step S1, the specific steps are as follows:
step S1-1: formalizing the high-dimensional video caching problem of the edge server:

Let the number of users within the coverage of the edge server be U, the number of videos be N, the time horizon be T, the maximum storage capacity of the edge server be C, and the unit time delay and unit traffic cost from the user's local device to the edge server be l and p, respectively. The video cache selection strategy of the edge server is $\pi = \{a_1, a_2, \ldots, a_T\}$ (the original formula images are not preserved; the notation here is reconstructed from the surrounding definitions and used consistently throughout).

Here $a_t = (a_{t,1}, \ldots, a_{t,N})$ denotes the video caching action performed by the edge server at time step t, with $a_{t,j} \in \{0,1\}$. Since the number of videos N is huge, $a_t$ is high-dimensional and discrete. When $a_{t,j} = 1$, the edge server caches the video numbered j at time step t; otherwise $a_{t,j} = 0$. If the edge server selects only one video to cache per time step, then $\sum_{j=1}^{N} a_{t,j} = 1$.

At time step t, the viewing situation of user k is $w_{t,k} = (w_{t,k,1}, \ldots, w_{t,k,N})$; $w_{t,k}$ is a high-dimensional vector. When $w_{t,k,j} = 1$, user k watches the video numbered j at time step t; otherwise $w_{t,k,j} = 0$.

At time step t, the caching situation of the edge server is $c_t = (c_{t,1}, \ldots, c_{t,N})$; $c_t$ is also a high-dimensional vector. When $c_{t,j} = 1$, the edge server has cached the video numbered j at time step t; otherwise $c_{t,j} = 0$.

Let the storage size occupied by the video numbered j be $m_j$, and let $r_t$ denote the instant reward obtained after the edge server caches videos at time step t. The optimization target of the whole problem is to solve for the video cache selection strategy $\pi$ that maximizes the cumulative revenue of the edge server, i.e., minimizes the video transmission delay and the traffic cost spent by the user:

$$\max_{\pi} \; \mathbb{E}\Big[\sum_{t=1}^{T} \gamma^{t-1} \sum_{k=1}^{U} r_{t,k}\Big] \quad \text{s.t.} \quad \sum_{j=1}^{N} c_{t,j}\, m_j \le C,$$

wherein $\gamma$ is the discount rate, representing the degree of interest in future rewards; $r_{t,k}$ is the instant reward the edge server obtains at time step t because user k watches a video; e is the positive instant reward obtained by the edge server when the video the user wants to watch has been cached by the edge server; $\beta \in [0,1]$ weights the video transmission delay against the traffic cost spent by the user; C is the maximum storage capacity of the edge server; U is the number of users within the coverage of the edge server;
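For concreteness, the following is a minimal Python sketch of one per-user instant reward consistent with the definitions above; the exact functional form (how e, beta, l, and p combine) is an assumption, since the original formula image is not preserved.

```python
def instant_reward(requested_video, cache, e, beta, l, p):
    """Per-user instant reward at one time step (assumed functional form).
    requested_video: id of the video the user watches, or None if idle.
    cache: set of video ids currently cached on the edge server."""
    if requested_video is None:
        return 0.0
    if requested_video in cache:
        return e                              # cache hit: positive reward e
    return -(beta * l + (1.0 - beta) * p)     # miss: weighted delay / traffic penalty
```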
step S1-2: describing the above problem model as a Markov decision process (MDP) represented by $(S, A, Z, R, P)$; wherein S is the state space, storing the states observable by the edge server; A is the high-dimensional action space, storing the original high-dimensional discrete video caching actions executable by the edge server; Z is the low-dimensional action space, storing the low-dimensional continuous video caching actions selectable by the edge server; R is the reward space, storing the instant rewards obtained by the edge server; P is the state transition probability space, representing the distribution over next states after the edge server executes an action in a given state;

step S1-3: the improved-DDPG-based high-dimensional video caching action selection model combines DDPG with a trained decoder; DDPG comprises an actor, a critic, and a replay buffer;

The actor is divided into an online actor network and a target actor network, both 4-layer deep fully-connected neural networks, with network parameters $\theta^{\mu}$ and $\theta^{\mu'}$, respectively. The input to the online actor network is the state $s_t$ observed by the edge server, and its output is the low-dimensional continuous video caching action $z_t$. The target actor network is used to update the network parameters of the online actor network.

The decoder is a 6-layer deep fully-connected neural network with network parameters $\xi$. The input of the decoder is the low-dimensional continuous video caching action $z_t$, and its output is the original high-dimensional discrete video caching action $a_t$.

The replay buffer stores the state $s_t$ observed by the edge server, the low-dimensional continuous video caching action $z_t$, the instant reward $r_t$ obtained after the edge server selects videos to cache based on the action, and the state $s_{t+1}$ of the next time step observed by the edge server, i.e., the tuple $(s_t, z_t, r_t, s_{t+1})$.

The critic is divided into an online critic network and a target critic network, both 4-layer deep fully-connected neural networks, with network parameters $\theta^{Q}$ and $\theta^{Q'}$, respectively. The input to the online critic network is the data $(s_t, z_t)$ sampled from the replay buffer, and its output is the state-action value $Q(s_t, z_t)$ after the edge server selects videos to cache, i.e., an estimate of the cumulative revenue obtained by the edge server. The target critic network is used to update the network parameters of the online critic network.
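As an illustration, a minimal PyTorch sketch of these networks follows; the hidden width H, the latent dimension D, the catalog size N, and the output activations are assumptions, since the patent fixes only the layer counts (4-layer actor and critic, 6-layer decoder).

```python
import copy
import torch.nn as nn

def mlp(sizes, out_act=None):
    # Fully-connected stack with ReLU between hidden layers.
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

N, D, H = 10000, 32, 256                        # catalog size, latent dim, width (assumed)

actor = mlp([2 * N, H, H, H, D], nn.Tanh())     # 4 linear layers: state s -> action z
critic = mlp([2 * N + D, H, H, H, 1])           # 4 linear layers: (s, z) -> Q value
decoder = mlp([D, H, H, H, H, H, N])            # 6 linear layers: z -> logits over N videos

actor_target = copy.deepcopy(actor)             # target networks start as copies
critic_target = copy.deepcopy(critic)
```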
Further, in step S1-2, the state is defined as:

At time step t-1, the watched situation of each video is $v_{t-1} = (v_{t-1,1}, \ldots, v_{t-1,N})$, where $v_{t-1,j}$ is calculated according to the following formula, summing the viewing indicators over all users:

$$v_{t-1,j} = \sum_{k=1}^{U} w_{t-1,k,j}.$$

Take $v_{t-1}$ and the caching situation $c_{t-1}$ together as the state observed by the current edge server, i.e., $s_t = (v_{t-1}, c_{t-1})$. As described above, $v_{t-1}$ and $c_{t-1}$ are both high-dimensional vectors of dimension N, and thus $s_t$ is a high-dimensional state of dimension 2N.
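As a sketch, the 2N-dimensional state can be assembled from the per-user viewing matrix and the cache flags; the array names here are illustrative.

```python
import numpy as np

def build_state(watch_prev, cache_prev):
    """watch_prev: (U, N) 0/1 matrix, who watched what at time step t-1.
    cache_prev: (N,) 0/1 vector, what the edge server cached at t-1.
    Returns the 2N-dimensional state s_t."""
    v = watch_prev.sum(axis=0)                         # per-video view counts at t-1
    return np.concatenate([v, cache_prev]).astype(np.float32)
```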
Further, in step S1-2, the action is defined as:

Reduce the dimension of the video caching actions in the high-dimensional action space A to obtain the low-dimensional action space Z of dimension D, with D much smaller than N. Then at time step t, the low-dimensional continuous video caching action selectable by the edge server is $z_t = (z_{t,1}, \ldots, z_{t,D})$, $z_t \in Z$; $z_t$ needs to be restored to the original high-dimensional discrete video caching action $a_t$ of dimension N, $a_t \in A$.
Further, in step S1-2, the state transition probability is defined as:

In the MDP, when the edge server in state $s_t$ selects videos to cache according to action $a_t$, the resulting next state is decided by $P(s_{t+1} \mid s_t, a_t)$.
Further, in step S1-2, the instant reward is defined as:

The edge server obtains the instant reward $r_t$ after caching videos at time step t. The accumulated reward obtained by the edge server from time step t is

$$R_t = \sum_{i=0}^{T-t} \gamma^{i} r_{t+i}.$$

The goal of the edge server is to maximize the cumulative revenue, i.e., the expectation of the accumulated reward:

$$\max \; \mathbb{E}[R_t].$$

The optimization objective is thus converted into solving for the optimal video caching action $a_t$ of the edge server at each time step t so as to maximize the cumulative revenue of the edge server.
Further, in step S2, the network parameters of the improved-DDPG-based high-dimensional video caching action selection model are trained through the Adam algorithm, and the training process is based on training samples. Before the decoder is trained, an encoder and a deep fully-connected neural network need to be trained. The encoder is a 6-layer deep fully-connected neural network with network parameters $\phi$; its input is the original high-dimensional discrete video caching action $a_t$ and its output is the low-dimensional continuous video caching action $z_t$. The deep fully-connected neural network has network parameters $\psi$ and 5 layers; its input is the state $s_t$ observed by the edge server and the low-dimensional continuous video caching action $z_t$, and its output is the state $s_{t+1}$ of the next moment.
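Continuing the sketch above, the encoder and the transition-prediction network might be defined as follows; the 6- and 5-layer counts come from the text, while the widths and activation are assumptions (this reuses the `mlp` helper and constants N, D, H from the earlier sketch).

```python
encoder = mlp([N, H, H, H, H, H, D], nn.Tanh())    # 6 linear layers: one-hot a -> z
transition = mlp([2 * N + D, H, H, H, H, 2 * N])   # 5 linear layers: (s, z) -> next state
```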
Further, the step S2 specifically includes the following steps:
step S2-1: randomly initializing the network parameters $\phi$, $\psi$, and $\xi$ of the encoder, the deep fully-connected neural network, and the decoder, respectively;

step S2-2: the encoder reducing the original high-dimensional discrete video caching action $a_t$ to the low-dimensional continuous video caching action $z_t = f_{\phi}(a_t)$;

step S2-3: inputting the low-dimensional continuous video caching action $z_t$ and the state $s_t$ observed by the edge server into the deep fully-connected neural network, whose output gives the state $\hat{s}_{t+1}$ of the next moment;

step S2-4: minimizing the loss $L_1$ of the encoder and the deep fully-connected neural network to update their parameters $\phi$ and $\psi$:

$$L_1(\phi, \psi) = \mathbb{E}_{\rho^{\pi}}\big[-\log p_{\psi}(s_{t+1} \mid s_t, f_{\phi}(a_t))\big] + \beta_1 D_{KL}\big(f_{\phi}(a_t) \,\|\, \mathcal{N}(0, I)\big),$$

wherein $\mathbb{E}_{\rho^{\pi}}$ denotes the expectation over $\rho^{\pi}$, the distribution of state transition probabilities under policy $\pi$; $p_{\psi}(s_{t+1} \mid s_t, z_t)$ is the probability that the deep fully-connected neural network with parameters $\psi$ outputs $s_{t+1}$ given inputs $s_t$ and $z_t$; $f_{\phi}(a_t)$ is the output of the encoder with parameters $\phi$ given input $a_t$; $D_{KL}$ is the KL divergence, representing the difference between the encoder output and the Gaussian distribution $\mathcal{N}(0, I)$; $\beta_1$ is the weight of the KL divergence;
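A minimal training-step sketch for this loss, treating the encoder output as the mean of a unit-variance Gaussian so that the reconstruction term reduces to a squared error; this simplification and the optimizer settings are assumptions (the networks come from the sketches above).

```python
import torch
import torch.nn.functional as F

opt_enc = torch.optim.Adam(list(encoder.parameters()) + list(transition.parameters()), lr=1e-3)
beta1 = 0.1                                    # KL weight beta_1 (assumed value)

def l1_step(s, a_onehot, s_next):
    z = encoder(a_onehot)                      # f_phi(a_t)
    s_pred = transition(torch.cat([s, z], dim=-1))
    recon = F.mse_loss(s_pred, s_next)         # stands in for -log p_psi(s_next | s, z)
    kl = 0.5 * (z ** 2).sum(dim=-1).mean()     # KL(N(z, I) || N(0, I)) up to a constant
    loss = recon + beta1 * kl
    opt_enc.zero_grad(); loss.backward(); opt_enc.step()
    return loss.item()
```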
step S2-5: repeating steps S2-2 to S2-4 until $L_1$ converges, finishing the training of the encoder and the deep fully-connected neural network;

step S2-6: inputting $z_t$ into the decoder, whose output gives the original high-dimensional discrete video caching action:

$$a_t = g_{\xi}(z_t),$$

wherein $g_{\xi}(z_t)$ is the output of the decoder with parameters $\xi$ given input $z_t$;
step S2-7: minimizing the distance $L_2$ between the two low-dimensional continuous video caching actions to update the decoder parameters $\xi$:

$$L_2(\xi) = \mathbb{E}\big[\|f_{\phi}(g_{\xi}(z_t)) - z_t\|^2\big] + \beta_2\, \Omega(\xi),$$

wherein $f_{\phi}(g_{\xi}(z_t))$ is the output of the encoder when fed the original high-dimensional discrete video caching action output by the decoder; the first term ensures that the decoder is a one-sided inverse of the encoder, i.e., $f_{\phi}(g_{\xi}(z)) = z$ although $g_{\xi}(f_{\phi}(a))$ need not equal $a$; the second term, written here as the regularizer $\Omega(\xi)$ since its original expression is not preserved, ensures that $z_t$ is the unique minimum; $\beta_2$ is the weight of the second term;
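A sketch of the decoder update follows; the uniqueness regularizer Omega is not preserved in the source, so a simple weight-decay stand-in is used here as an assumption, and the softmax relaxation of the decoder output keeps the composition differentiable.

```python
import torch

opt_dec = torch.optim.Adam(decoder.parameters(), lr=1e-3)
beta2 = 1e-4                                   # weight beta_2 of the second term (assumed)

def l2_step(z_batch):
    a_relaxed = torch.softmax(decoder(z_batch), dim=-1)      # differentiable g_xi(z)
    z_back = encoder(a_relaxed)                              # f_phi(g_xi(z))
    inverse_term = ((z_back - z_batch) ** 2).sum(dim=-1).mean()
    reg = sum((p ** 2).sum() for p in decoder.parameters())  # stand-in for Omega(xi)
    loss = inverse_term + beta2 * reg
    opt_dec.zero_grad(); loss.backward(); opt_dec.step()
    return loss.item()
```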
step S2-9: the network parameters of the online actor network and the target actor network are $\theta^{\mu}$ and $\theta^{\mu'}$, respectively; the network parameters of the online critic network and the target critic network are $\theta^{Q}$ and $\theta^{Q'}$, respectively; randomly initialize $\theta^{\mu}$ and $\theta^{Q}$, then assign $\theta^{\mu}$ and $\theta^{Q}$ to $\theta^{\mu'}$ and $\theta^{Q'}$, respectively;

step S2-10: after observing the state $s_t$, the edge server selects the low-dimensional continuous video caching action according to the online actor network and random noise:

$$z_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t,$$

wherein $\mu(s_t \mid \theta^{\mu})$ is the output of the online actor network with parameters $\theta^{\mu}$ given input state $s_t$; $\mathcal{N}_t$ is random noise used to increase the exploration of the video caching action;
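A sketch of step S2-10; Gaussian exploration noise is a common choice and an assumption here, since the patent says only "random noise" (the `actor` network comes from the earlier sketch).

```python
import torch

def select_action(s, noise_std=0.1):           # noise scale is an assumed value
    with torch.no_grad():
        z = actor(s)                           # mu(s | theta_mu)
    return z + noise_std * torch.randn_like(z) # add exploration noise N_t
```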
Step S2-12: the edge server obtains the instant reward $r_t$ after selecting videos to cache according to the action, and observes the state $s_{t+1}$ of the next time step;

Step S2-15: minimizing the loss $L_3$ of the online critic network using the Adam algorithm to update its parameters $\theta^{Q}$:

$$L_3(\theta^{Q}) = \mathbb{E}\big[(y_t - Q(s_t, z_t \mid \theta^{Q}))^2\big], \qquad y_t = r_t + \gamma\, Q'(s_{t+1}, z_{t+1} \mid \theta^{Q'}),$$

wherein $r_t$ is the instant reward obtained after the edge server selects videos to cache based on the action; $\gamma$ is the discount rate, representing the degree of interest in future rewards; $Q'(s_{t+1}, z_{t+1} \mid \theta^{Q'})$ is the state-action value output by the target critic network with parameters $\theta^{Q'}$ given inputs $s_{t+1}$ and $z_{t+1}$; $Q(s_t, z_t \mid \theta^{Q})$ is the state-action value output by the online critic network with parameters $\theta^{Q}$ given inputs $s_t$ and $z_t$;
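A sketch of the critic update of step S2-15, using the standard DDPG target; batch shapes and the discount value are assumptions (networks from the earlier sketches).

```python
import torch
import torch.nn.functional as F

opt_q = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99                                   # discount rate (assumed value)

def critic_step(s, z, r, s_next):
    # r: (batch, 1); s, s_next: (batch, 2N); z: (batch, D).
    with torch.no_grad():
        z_next = actor_target(s_next)          # mu'(s_next | theta_mu')
        y = r + gamma * critic_target(torch.cat([s_next, z_next], dim=-1))
    q = critic(torch.cat([s, z], dim=-1))
    loss = F.mse_loss(q, y)
    opt_q.zero_grad(); loss.backward(); opt_q.step()
    return loss.item()
```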
step S2-16: after the online actor network computes the policy gradient $\nabla_{\theta^{\mu}} J$, updating its parameters $\theta^{\mu}$ using the Adam algorithm:

$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}\big[\nabla_{z} Q(s_t, z \mid \theta^{Q})\big|_{z = \mu(s_t \mid \theta^{\mu})}\, \nabla_{\theta^{\mu}} \mu(s_t \mid \theta^{\mu})\big],$$

wherein $\nabla_{z} Q(s_t, z \mid \theta^{Q})$ is the gradient of the state-action value with respect to the action; $\nabla_{\theta^{\mu}} \mu(s_t \mid \theta^{\mu})$ is the gradient of the action output by the online actor network with parameters $\theta^{\mu}$ with respect to those parameters; $\alpha$ is the update step size of the parameters $\theta^{\mu}$;
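A sketch of the actor update of step S2-16; maximizing Q is implemented as minimizing -Q, which Adam then follows with step size alpha (the learning rate, an assumed value here).

```python
import torch

opt_mu = torch.optim.Adam(actor.parameters(), lr=1e-4)  # step size alpha (assumed value)

def actor_step(s):
    loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()  # ascend Q by minimizing -Q
    opt_mu.zero_grad(); loss.backward(); opt_mu.step()
    return loss.item()
```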
step S2-18: repeating steps S2-10 to S2-17 until the loss function $L_3$ converges, finishing the training.
Further, in step S3, the high-dimensional state $s_t$ observed by the edge server is input; the trained improved-DDPG-based high-dimensional video caching action selection model outputs the low-dimensional continuous video caching action, the decoder restores it to the original high-dimensional discrete video caching action, and the edge server selects the videos to cache according to the original high-dimensional discrete video caching action.
Further, the step S3 specifically includes the following steps:
step S3-1: after observing the high-dimensional state $s_t$, the edge server inputs $s_t$ into the trained online actor network, which outputs the low-dimensional continuous video caching action:

$$z_t = \mu(s_t \mid \theta^{\mu}),$$

wherein $\theta^{\mu}$ is the parameter of the online actor network obtained after the training of step S2; $\mu(s_t \mid \theta^{\mu})$ is the network output given input state $s_t$;

step S3-2: inputting $z_t$ into the decoder, which outputs the original high-dimensional discrete video caching action:

$$a_t = g_{\xi}(z_t),$$

wherein $\xi$ is the decoder parameter obtained after the training of step S2; $g_{\xi}(z_t)$ is the decoder output given input $z_t$;

step S3-3: the edge server selects one video from the massive catalog according to $a_t$; if the cached videos of the edge server do not contain this video and the remaining storage capacity of the edge server is sufficient to cache it, the video is cached in the edge server; otherwise, the earliest cached videos in the edge server are deleted in turn until the remaining storage capacity is sufficient to cache the video, and the video is then cached in the edge server.
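A minimal sketch of the step S3-3 admission/eviction rule, assuming a FIFO record of insertion order (the patent specifies only that the earliest cached video is deleted first):

```python
from collections import OrderedDict

class EdgeCache:
    def __init__(self, capacity):
        self.capacity = capacity               # maximum storage capacity C
        self.used = 0
        self.items = OrderedDict()             # video_id -> size, oldest first

    def admit(self, video_id, size):
        if video_id in self.items:
            return                             # already cached: nothing to do
        while self.used + size > self.capacity and self.items:
            _, freed = self.items.popitem(last=False)  # delete the earliest cached video
            self.used -= freed
        self.items[video_id] = size
        self.used += size
```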
The invention has the following beneficial effects:
1) Deep reinforcement learning is applied to the video cache selection of the edge server, and efficient video caching on the edge server is realized by considering the dynamic and high-dimensional nature of video cache selection.
2) A decoder is used to improve DDPG, so that the edge server can select suitable videos to cache, reducing the video transmission delay and the traffic cost spent by users.
3) When the edge server selects videos to cache from a massive catalog, the computational overhead is greatly reduced, avoiding excessive pressure on the resource-limited edge server and saving computation cost.
Drawings
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present invention.
FIG. 2 is a detailed flow chart of an embodiment of the present invention.
Fig. 3 is a schematic diagram of a high-dimensional video caching action selection model based on the improved DDPG according to an embodiment of the present invention.
Fig. 4 is a schematic flow chart of an Adam-based training algorithm according to an embodiment of the present invention.
FIG. 5 is a graph showing the experimental results of the embodiment of the present invention.
In fig. 1: 1 denotes the edge server, 2 the base station, 3 a user, and 4 the backbone network.
Detailed Description
The technical scheme of the invention is further explained in detail below with reference to the drawings of the specification.
As shown in fig. 1, the system architecture of the embodiment of the present invention is as follows: the edge server 1, with which the base station 2 is equipped, selects videos to cache and can provide the cached videos to multiple users 3 within its coverage area. When a video a user wants to watch has been cached by the edge server 1, the video can be obtained directly from it. Otherwise, the video is acquired from the backbone network 4, such as a wireless network. The arrows in the figure indicate the acquisition of video.
As shown in fig. 2, the overall flow of the embodiment of the present invention includes:
step S1: performing system modeling for the high-dimensional video caching problem, then establishing a high-dimensional video caching action selection model based on an improved deep deterministic policy gradient (DDPG); fig. 3 is a schematic diagram of this model according to the embodiment of the present invention; the specific steps are as follows:

step S1-1: formalizing the high-dimensional video caching problem of the edge server:

Let the number of users within the coverage of the edge server be U, the number of videos be N, the time horizon be T, the maximum storage capacity of the edge server be C, and the unit time delay and unit traffic cost from the user's local device to the edge server be l and p, respectively. The video cache selection strategy of the edge server is $\pi = \{a_1, a_2, \ldots, a_T\}$, where $a_t = (a_{t,1}, \ldots, a_{t,N})$ denotes the video caching action performed by the edge server at time step t, $a_{t,j} \in \{0,1\}$. Since the number of videos N is extremely large, $a_t$ is high-dimensional and discrete. When $a_{t,j} = 1$, the edge server caches the video numbered j at time step t; if not, $a_{t,j} = 0$. If the edge server selects only one video to cache per time step, then $\sum_{j=1}^{N} a_{t,j} = 1$. At time step t, the viewing situation of user k is $w_{t,k} = (w_{t,k,1}, \ldots, w_{t,k,N})$, a high-dimensional vector; when $w_{t,k,j} = 1$, user k watches the video numbered j at time step t; if not, $w_{t,k,j} = 0$. At time step t, the caching situation of the edge server is $c_t = (c_{t,1}, \ldots, c_{t,N})$; clearly, $c_t$ is also a high-dimensional vector; when $c_{t,j} = 1$, the edge server has cached the video numbered j at time step t; if not, $c_{t,j} = 0$. Let the storage size occupied by the video numbered j be $m_j$, and let $r_t$ be the instant reward obtained after the edge server caches videos at time step t. The optimization target of the whole problem is to solve for the video cache selection strategy $\pi$ that maximizes the cumulative revenue of the edge server, i.e., minimizes the video transmission delay and the traffic cost spent by the user:

$$\max_{\pi} \; \mathbb{E}\Big[\sum_{t=1}^{T} \gamma^{t-1} \sum_{k=1}^{U} r_{t,k}\Big] \quad \text{s.t.} \quad \sum_{j=1}^{N} c_{t,j}\, m_j \le C,$$

wherein $\gamma$ is the discount rate, indicating the degree of interest in future rewards; $r_{t,k}$ is the instant reward the edge server obtains at time step t because user k watches a video; e is the positive instant reward obtained by the edge server when the video the user wants to watch has been cached by the edge server; $\beta \in [0,1]$ weights the video transmission delay against the traffic cost spent by the user.
Step S1-2: describing the problem model as a Markov decision process (MDP) represented by $(S, A, Z, R, P)$. Here S is the state space, storing the states that can be observed by the edge server; A is the high-dimensional action space, storing the original high-dimensional discrete video caching actions the edge server can execute; Z is the low-dimensional action space, storing the low-dimensional continuous video caching actions selectable by the edge server; R is the reward space, storing the instant rewards obtained by the edge server; P is the state transition probability space, representing the distribution over next states after the edge server executes an action in a given state.

(1) State:

At time step t-1, the watched situation of each video is $v_{t-1} = (v_{t-1,1}, \ldots, v_{t-1,N})$, computed as $v_{t-1,j} = \sum_{k=1}^{U} w_{t-1,k,j}$. Take $v_{t-1}$ and $c_{t-1}$ as the state currently observed by the edge server, i.e., $s_t = (v_{t-1}, c_{t-1})$. As described above, $v_{t-1}$ and $c_{t-1}$ are both high-dimensional vectors of dimension N, and thus $s_t$ is a high-dimensional state of dimension 2N.

(2) Actions:

Reduce the dimension of the video caching actions in the high-dimensional action space A to obtain the low-dimensional action space Z of dimension D, with D much smaller than N. Then at time step t, the low-dimensional continuous video caching action selectable by the edge server is $z_t = (z_{t,1}, \ldots, z_{t,D})$, $z_t \in Z$, which needs to be restored to the original high-dimensional discrete video caching action $a_t$ of dimension N, $a_t \in A$.

(3) State transition probability:

In the MDP, when the edge server in state $s_t$ selects videos to cache according to action $a_t$, the resulting next state is determined by $P(s_{t+1} \mid s_t, a_t)$.

(4) Instant reward:

The edge server obtains the instant reward $r_t$ after caching videos at time step t. The accumulated reward the edge server obtains from time step t is

$$R_t = \sum_{i=0}^{T-t} \gamma^{i} r_{t+i}.$$

The goal of the edge server is to maximize the cumulative revenue, i.e., the expectation of the accumulated reward, $\max \mathbb{E}[R_t]$. The optimization objective is converted into solving for the optimal video caching action $a_t$ of the edge server at each time step t so as to maximize the cumulative revenue of the edge server.

Step S1-3: the improved-DDPG-based high-dimensional video caching action selection model combines DDPG with a trained decoder. DDPG comprises an actor, a critic, and a replay buffer.

The actor is divided into an online actor network and a target actor network, both 4-layer deep fully-connected neural networks, with network parameters $\theta^{\mu}$ and $\theta^{\mu'}$, respectively. The input to the online actor network is the state $s_t$ observed by the edge server, and its output is the low-dimensional continuous video caching action $z_t$. The target actor network is used to update the network parameters of the online actor network. The decoder is a 6-layer deep fully-connected neural network with network parameters $\xi$; its input is the low-dimensional continuous video caching action $z_t$ and its output is the original high-dimensional discrete video caching action $a_t$. The replay buffer stores the state $s_t$ observed by the edge server, the low-dimensional continuous video caching action $z_t$, the instant reward $r_t$ obtained after the edge server selects videos to cache based on the action, and the state $s_{t+1}$ of the next time step observed by the edge server, i.e., the tuple $(s_t, z_t, r_t, s_{t+1})$. The critic is divided into an online critic network and a target critic network, both 4-layer deep fully-connected neural networks, with network parameters $\theta^{Q}$ and $\theta^{Q'}$, respectively. The input to the online critic network is the data $(s_t, z_t)$ sampled from the replay buffer, and its output is the state-action value $Q(s_t, z_t)$ after the edge server selects videos to cache, i.e., an estimate of the cumulative revenue obtained by the edge server. The target critic network is used to update the network parameters of the online critic network.
Step S2: the network parameters of the improved-DDPG-based high-dimensional video caching action selection model are trained through the Adam algorithm. The training process is based on designed training samples generated during the interaction between the edge server and the video caching environment; each training sample comprises the state observed by the edge server, the original high-dimensional discrete video caching action, the instant reward obtained after the edge server selects videos to cache, and the state of the next time step observed by the edge server. Before the decoder is trained, the encoder and a deep fully-connected neural network need to be trained. The encoder is a 6-layer deep fully-connected neural network with network parameters $\phi$; its input is the original high-dimensional discrete video caching action $a_t$ and its output is the low-dimensional continuous video caching action $z_t$. The deep fully-connected neural network has network parameters $\psi$ and 5 layers; its input is the state $s_t$ observed by the edge server and the low-dimensional continuous video caching action $z_t$, and its output is the state $s_{t+1}$ of the next moment. Fig. 4 is a schematic flow diagram of the Adam-based training algorithm according to the embodiment of the present invention; the specific steps are as follows:
Step S2-1: randomly initialize the network parameters $\phi$, $\psi$, and $\xi$ of the encoder, the deep fully-connected neural network, and the decoder, respectively.

Step S2-2: the encoder reduces the original high-dimensional discrete video caching action $a_t$ to the low-dimensional continuous video caching action $z_t = f_{\phi}(a_t)$.

Step S2-3: input the low-dimensional continuous video caching action $z_t$ and the state $s_t$ observed by the edge server into the deep fully-connected neural network, whose output gives the state $\hat{s}_{t+1}$ of the next moment.

Step S2-4: minimize the loss $L_1$ of the encoder and the deep fully-connected neural network to update their parameters $\phi$ and $\psi$:

$$L_1(\phi, \psi) = \mathbb{E}_{\rho^{\pi}}\big[-\log p_{\psi}(s_{t+1} \mid s_t, f_{\phi}(a_t))\big] + \beta_1 D_{KL}\big(f_{\phi}(a_t) \,\|\, \mathcal{N}(0, I)\big),$$

wherein $\mathbb{E}_{\rho^{\pi}}$ denotes the expectation over $\rho^{\pi}$, the distribution of state transition probabilities under policy $\pi$; $p_{\psi}(s_{t+1} \mid s_t, z_t)$ is the probability that the deep fully-connected neural network with parameters $\psi$ outputs $s_{t+1}$ given inputs $s_t$ and $z_t$; $f_{\phi}(a_t)$ is the encoder output with parameters $\phi$ given input $a_t$; $D_{KL}$ is the KL divergence, representing the difference between the encoder output and the Gaussian distribution $\mathcal{N}(0, I)$; $\beta_1$ is the weight of the KL divergence.

Step S2-5: repeat steps S2-2 to S2-4 until $L_1$ converges, finishing the training of the encoder and the deep fully-connected neural network.

Step S2-6: input $z_t$ into the decoder, whose output gives the original high-dimensional discrete video caching action $a_t = g_{\xi}(z_t)$, wherein $g_{\xi}(z_t)$ is the output of the decoder with parameters $\xi$ given input $z_t$.

Step S2-7: minimize the distance $L_2$ between the two low-dimensional continuous video caching actions to update the decoder parameters $\xi$:

$$L_2(\xi) = \mathbb{E}\big[\|f_{\phi}(g_{\xi}(z_t)) - z_t\|^2\big] + \beta_2\, \Omega(\xi),$$

wherein $f_{\phi}(g_{\xi}(z_t))$ is the encoder output when fed the original high-dimensional discrete video caching action output by the decoder; the first term ensures that the decoder is a one-sided inverse of the encoder, i.e., $f_{\phi}(g_{\xi}(z)) = z$ although $g_{\xi}(f_{\phi}(a))$ need not equal $a$; the second term, written here as the regularizer $\Omega(\xi)$ since its original expression is not preserved, ensures that $z_t$ is the unique minimum; $\beta_2$ is the weight of the second term.

Step S2-9: the network parameters of the online actor network and the target actor network are $\theta^{\mu}$ and $\theta^{\mu'}$, respectively; the network parameters of the online critic network and the target critic network are $\theta^{Q}$ and $\theta^{Q'}$, respectively. Randomly initialize $\theta^{\mu}$ and $\theta^{Q}$, then assign $\theta^{\mu}$ and $\theta^{Q}$ to $\theta^{\mu'}$ and $\theta^{Q'}$, respectively.
Step S2-10: after observing the state $s_t$, the edge server selects the low-dimensional continuous video caching action according to the online actor network and random noise:

$$z_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t,$$

wherein $\mu(s_t \mid \theta^{\mu})$ is the output of the online actor network with parameters $\theta^{\mu}$ given input state $s_t$; $\mathcal{N}_t$ is random noise used to increase the exploration of the video caching action.

Step S2-12: the edge server obtains the instant reward $r_t$ after selecting videos to cache based on the action, and observes the state $s_{t+1}$ of the next time step.

Step S2-15: use the Adam algorithm to minimize the loss $L_3$ of the online critic network to update its parameters $\theta^{Q}$:

$$L_3(\theta^{Q}) = \mathbb{E}\big[(y_t - Q(s_t, z_t \mid \theta^{Q}))^2\big], \qquad y_t = r_t + \gamma\, Q'(s_{t+1}, z_{t+1} \mid \theta^{Q'}),$$

wherein $r_t$ is the instant reward obtained after the edge server selects videos to cache based on the action; $\gamma$ is the discount rate, indicating the degree of interest in future rewards; $Q'(s_{t+1}, z_{t+1} \mid \theta^{Q'})$ is the state-action value output by the target critic network with parameters $\theta^{Q'}$ given inputs $s_{t+1}$ and $z_{t+1}$; $Q(s_t, z_t \mid \theta^{Q})$ is the state-action value output by the online critic network with parameters $\theta^{Q}$ given inputs $s_t$ and $z_t$.

Step S2-16: after the online actor network computes the policy gradient $\nabla_{\theta^{\mu}} J$, update its parameters $\theta^{\mu}$ using the Adam algorithm:

$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}\big[\nabla_{z} Q(s_t, z \mid \theta^{Q})\big|_{z = \mu(s_t \mid \theta^{\mu})}\, \nabla_{\theta^{\mu}} \mu(s_t \mid \theta^{\mu})\big],$$

wherein $\nabla_{z} Q(s_t, z \mid \theta^{Q})$ is the gradient of the state-action value with respect to the action; $\nabla_{\theta^{\mu}} \mu(s_t \mid \theta^{\mu})$ is the gradient of the action output by the online actor network with parameters $\theta^{\mu}$ with respect to those parameters; $\alpha$ is the update step size of the parameters $\theta^{\mu}$.
Step S3: input the high-dimensional state $s_t$ observed by the edge server; the trained improved-DDPG-based high-dimensional video caching action selection model outputs the low-dimensional continuous video caching action, the decoder restores it to the original high-dimensional discrete video caching action, and the edge server selects the videos to cache according to that action. The specific steps are as follows:

Step S3-1: after observing the high-dimensional state $s_t$, the edge server inputs $s_t$ into the trained online actor network, which outputs the low-dimensional continuous video caching action $z_t = \mu(s_t \mid \theta^{\mu})$, wherein $\theta^{\mu}$ is the parameter of the online actor network obtained after the training of step S2, and $\mu(s_t \mid \theta^{\mu})$ is the network output given input state $s_t$.

Step S3-2: input $z_t$ into the decoder, which outputs the original high-dimensional discrete video caching action $a_t = g_{\xi}(z_t)$, wherein $\xi$ is the decoder parameter obtained after the training of step S2, and $g_{\xi}(z_t)$ is the decoder output given input $z_t$.

Step S3-3: the edge server selects one video from the massive catalog according to $a_t$. If the cached videos of the edge server do not contain this video and the remaining storage capacity of the edge server is sufficient to cache it, the video is cached in the edge server. Otherwise, the earliest cached videos in the edge server are deleted in turn until the remaining storage capacity is sufficient to cache the video, and the video is then cached in the edge server.
To demonstrate the effectiveness of the invention, preliminary experiments were conducted. The proposed method was compared with DDPG, DQN, and PPO. In DDPG, the actor outputs a high-dimensional continuous video caching action that is then discretized with a softmax-based operation, the critic outputs the corresponding state-action value, and the edge server selects videos to cache according to the action output by the actor. In DQN, the edge server obtains all state-action values related to the current state from the Q network and then selects videos to cache with an ε-greedy policy. In PPO, the edge server selects videos to cache using the action output by the new actor. The comparison results are shown in fig. 5. As can be seen from the figure, with the proposed method the edge server converges fastest and obtains the largest cumulative revenue. Therefore, with the method of the invention, the video transmission delay and the traffic cost spent by the user are minimized, and the method is well suited for deployment on the edge server.
The above description is only a preferred embodiment of the present invention, and the scope of the present invention is not limited to the above embodiment; equivalent modifications or changes made by those skilled in the art according to the present disclosure shall fall within the scope of the present invention as set forth in the appended claims.
Claims (10)
1. A high-dimensional video cache selection method based on deep reinforcement learning, characterized by comprising the following steps:
step S1: performing system modeling for the high-dimensional video caching problem, and then establishing a high-dimensional video caching action selection model based on an improved deep deterministic policy gradient (DDPG);
step S2: training the network parameters of the improved-DDPG-based high-dimensional video caching action selection model through the Adam algorithm;
step S3: the edge server selecting videos to cache by using the trained improved-DDPG-based high-dimensional video caching action selection model.
2. The high-dimensional video cache selection method based on deep reinforcement learning of claim 1, characterized in that: in step S1, the specific steps are as follows:

step S1-1: formalizing the high-dimensional video caching problem of the edge server:

let the number of users within the coverage of the edge server be U, the number of videos be N, the time horizon be T, the maximum storage capacity of the edge server be C, and the unit time delay and unit traffic cost from the user's local device to the edge server be l and p, respectively; the video cache selection strategy of the edge server is $\pi = \{a_1, a_2, \ldots, a_T\}$;

wherein $a_t = (a_{t,1}, \ldots, a_{t,N})$ denotes the video caching action performed by the edge server at time step t, with $a_{t,j} \in \{0,1\}$; since the number of videos N is huge, $a_t$ is high-dimensional and discrete; when $a_{t,j} = 1$, the edge server caches the video numbered j at time step t, otherwise $a_{t,j} = 0$; if the edge server selects only one video to cache per time step, then $\sum_{j=1}^{N} a_{t,j} = 1$;

at time step t, the viewing situation of user k is $w_{t,k} = (w_{t,k,1}, \ldots, w_{t,k,N})$, a high-dimensional vector; when $w_{t,k,j} = 1$, user k watches the video numbered j at time step t, otherwise $w_{t,k,j} = 0$;

at time step t, the caching situation of the edge server is $c_t = (c_{t,1}, \ldots, c_{t,N})$, also a high-dimensional vector; when $c_{t,j} = 1$, the edge server has cached the video numbered j at time step t, otherwise $c_{t,j} = 0$;

let the storage size occupied by the video numbered j be $m_j$, and let $r_t$ be the instant reward obtained after the edge server caches videos at time step t; the optimization target of the whole problem is to solve for the video cache selection strategy $\pi$ that maximizes the cumulative revenue of the edge server, i.e., minimizes the video transmission delay and the traffic cost spent by the user:

$$\max_{\pi} \; \mathbb{E}\Big[\sum_{t=1}^{T} \gamma^{t-1} \sum_{k=1}^{U} r_{t,k}\Big] \quad \text{s.t.} \quad \sum_{j=1}^{N} c_{t,j}\, m_j \le C,$$

wherein $\gamma$ is the discount rate, representing the degree of interest in future rewards; $r_{t,k}$ is the instant reward obtained by the edge server at time step t because user k watches a video; e is the positive instant reward obtained by the edge server when the video the user wants to watch has been cached by the edge server; $\beta \in [0,1]$ weights the video transmission delay against the traffic cost spent by the user; C is the maximum storage capacity of the edge server; U is the number of users within the coverage of the edge server;

step S1-2: describing the above problem model as a Markov decision process represented by $(S, A, Z, R, P)$; wherein S is the state space, storing the states observable by the edge server; A is the high-dimensional action space, storing the original high-dimensional discrete video caching actions executable by the edge server; Z is the low-dimensional action space, storing the low-dimensional continuous video caching actions selectable by the edge server; R is the reward space, storing the instant rewards obtained by the edge server; P is the state transition probability space, representing the distribution over next states after the edge server executes an action in a given state;

step S1-3: the improved-DDPG-based high-dimensional video caching action selection model combines DDPG with a trained decoder; DDPG comprises an actor, a critic, and a replay buffer;

the actor is divided into an online actor network and a target actor network, both 4-layer deep fully-connected neural networks, with network parameters $\theta^{\mu}$ and $\theta^{\mu'}$, respectively; the input to the online actor network is the state $s_t$ observed by the edge server and its output is the low-dimensional continuous video caching action $z_t$; the target actor network is used to update the network parameters of the online actor network;

the decoder is a 6-layer deep fully-connected neural network with network parameters $\xi$; the input of the decoder is the low-dimensional continuous video caching action $z_t$ and its output is the original high-dimensional discrete video caching action $a_t$;

the replay buffer stores the state $s_t$ observed by the edge server, the low-dimensional continuous video caching action $z_t$, the instant reward $r_t$ obtained after the edge server selects videos to cache based on the action, and the state $s_{t+1}$ of the next time step observed by the edge server, i.e., $(s_t, z_t, r_t, s_{t+1})$;

the critic is divided into an online critic network and a target critic network, both 4-layer deep fully-connected neural networks, with network parameters $\theta^{Q}$ and $\theta^{Q'}$, respectively; the input to the online critic network is the data $(s_t, z_t)$ sampled from the replay buffer and its output is the state-action value $Q(s_t, z_t)$ after the edge server selects videos to cache, i.e., an estimate of the cumulative revenue obtained by the edge server; the target critic network is used to update the network parameters of the online critic network.
3. The high-dimensional video cache selection method based on deep reinforcement learning of claim 2, characterized in that: in step S1-2, the state is defined as:

at time step t-1, the watched situation of each video is $v_{t-1} = (v_{t-1,1}, \ldots, v_{t-1,N})$, calculated according to the following formula:

$$v_{t-1,j} = \sum_{k=1}^{U} w_{t-1,k,j};$$

$v_{t-1}$ and $c_{t-1}$ together form the state observed by the current edge server, i.e., $s_t = (v_{t-1}, c_{t-1})$, a high-dimensional state of dimension 2N.
4. The high-dimensional video cache selection method based on deep reinforcement learning of claim 2, characterized in that: in step S1-2, the action is defined as:

the dimension of the video caching actions in the high-dimensional action space A is reduced to obtain the low-dimensional action space Z of dimension D, D much smaller than N; then at time step t, the low-dimensional continuous video caching action selectable by the edge server is $z_t = (z_{t,1}, \ldots, z_{t,D})$, $z_t \in Z$, which needs to be restored to the original high-dimensional discrete video caching action $a_t$ of dimension N, $a_t \in A$.
6. The high-dimensional video cache selection method based on deep reinforcement learning of claim 2, characterized in that: in step S1-2, the instant reward is defined as:

the edge server obtains the instant reward $r_t$ after caching videos at time step t; the accumulated reward obtained by the edge server from time step t is

$$R_t = \sum_{i=0}^{T-t} \gamma^{i} r_{t+i};$$

the goal of the edge server is to maximize the cumulative revenue, i.e., the expectation of the accumulated reward, $\max \mathbb{E}[R_t]$.
7. The high-dimensional video cache selection method based on deep reinforcement learning of claim 1, characterized in that: in step S2, the network parameters of the improved-DDPG-based high-dimensional video caching action selection model are trained through the Adam algorithm, and the training process is based on training samples; before the decoder is trained, an encoder and a deep fully-connected neural network need to be trained; the encoder is a 6-layer deep fully-connected neural network with network parameters $\phi$; its input is the original high-dimensional discrete video caching action $a_t$ and its output is the low-dimensional continuous video caching action $z_t$; the deep fully-connected neural network has network parameters $\psi$ and 5 layers; its input is the state $s_t$ observed by the edge server and the low-dimensional continuous video caching action $z_t$, and its output is the state $s_{t+1}$ of the next moment.
8. The high-dimensional video cache selection method based on deep reinforcement learning of claim 7, characterized in that: the step S2 comprises the following specific steps:

step S2-1: randomly initializing the network parameters $\phi$, $\psi$, and $\xi$ of the encoder, the deep fully-connected neural network, and the decoder, respectively;

step S2-2: the encoder reducing the original high-dimensional discrete video caching action $a_t$ to the low-dimensional continuous video caching action $z_t = f_{\phi}(a_t)$;

step S2-3: inputting the low-dimensional continuous video caching action $z_t$ and the state $s_t$ observed by the edge server into the deep fully-connected neural network, whose output gives the state $\hat{s}_{t+1}$ of the next moment;

step S2-4: minimizing the loss $L_1$ of the encoder and the deep fully-connected neural network to update their parameters $\phi$ and $\psi$:

$$L_1(\phi, \psi) = \mathbb{E}_{\rho^{\pi}}\big[-\log p_{\psi}(s_{t+1} \mid s_t, f_{\phi}(a_t))\big] + \beta_1 D_{KL}\big(f_{\phi}(a_t) \,\|\, \mathcal{N}(0, I)\big),$$

wherein $\mathbb{E}_{\rho^{\pi}}$ denotes the expectation over $\rho^{\pi}$, the distribution of state transition probabilities under policy $\pi$; $p_{\psi}(s_{t+1} \mid s_t, z_t)$ is the probability that the deep fully-connected neural network with parameters $\psi$ outputs $s_{t+1}$ given inputs $s_t$ and $z_t$; $f_{\phi}(a_t)$ is the encoder output with parameters $\phi$ given input $a_t$; $D_{KL}$ is the KL divergence, representing the difference between the encoder output and the Gaussian distribution $\mathcal{N}(0, I)$; $\beta_1$ is the weight of the KL divergence;

step S2-5: repeating steps S2-2 to S2-4 until $L_1$ converges, finishing the training of the encoder and the deep fully-connected neural network;

step S2-6: inputting $z_t$ into the decoder, whose output gives the original high-dimensional discrete video caching action $a_t = g_{\xi}(z_t)$, wherein $g_{\xi}(z_t)$ is the output of the decoder with parameters $\xi$ given input $z_t$;

step S2-7: minimizing the distance $L_2$ between the two low-dimensional continuous video caching actions to update the decoder parameters $\xi$:

$$L_2(\xi) = \mathbb{E}\big[\|f_{\phi}(g_{\xi}(z_t)) - z_t\|^2\big] + \beta_2\, \Omega(\xi),$$

wherein $f_{\phi}(g_{\xi}(z_t))$ is the encoder output when fed the original high-dimensional discrete video caching action output by the decoder; the first term ensures that the decoder is a one-sided inverse of the encoder, i.e., $f_{\phi}(g_{\xi}(z)) = z$ although $g_{\xi}(f_{\phi}(a))$ need not equal $a$; the second term, written here as the regularizer $\Omega(\xi)$ since its original expression is not preserved, ensures that $z_t$ is the unique minimum; $\beta_2$ is the weight of the second term;

step S2-9: the network parameters of the online actor network and the target actor network being $\theta^{\mu}$ and $\theta^{\mu'}$, respectively, and the network parameters of the online critic network and the target critic network being $\theta^{Q}$ and $\theta^{Q'}$, respectively, randomly initializing $\theta^{\mu}$ and $\theta^{Q}$, then assigning $\theta^{\mu}$ and $\theta^{Q}$ to $\theta^{\mu'}$ and $\theta^{Q'}$, respectively;

step S2-10: after observing the state $s_t$, the edge server selecting the low-dimensional continuous video caching action according to the online actor network and random noise:

$$z_t = \mu(s_t \mid \theta^{\mu}) + \mathcal{N}_t,$$

wherein $\mu(s_t \mid \theta^{\mu})$ is the output of the online actor network with parameters $\theta^{\mu}$ given input state $s_t$; $\mathcal{N}_t$ is random noise used to increase the exploration of the video caching action;

step S2-12: the edge server obtaining the instant reward $r_t$ after selecting videos to cache according to the action, and observing the state $s_{t+1}$ of the next time step;

step S2-15: using the Adam algorithm, minimizing the loss $L_3$ of the online critic network to update its parameters $\theta^{Q}$:

$$L_3(\theta^{Q}) = \mathbb{E}\big[(y_t - Q(s_t, z_t \mid \theta^{Q}))^2\big], \qquad y_t = r_t + \gamma\, Q'(s_{t+1}, z_{t+1} \mid \theta^{Q'}),$$

wherein $r_t$ is the instant reward obtained after the edge server selects videos to cache based on the action; $\gamma$ is the discount rate, representing the degree of interest in future rewards; $Q'(s_{t+1}, z_{t+1} \mid \theta^{Q'})$ is the state-action value output by the target critic network with parameters $\theta^{Q'}$ given inputs $s_{t+1}$ and $z_{t+1}$; $Q(s_t, z_t \mid \theta^{Q})$ is the state-action value output by the online critic network with parameters $\theta^{Q}$ given inputs $s_t$ and $z_t$;

step S2-16: after the online actor network computes the policy gradient $\nabla_{\theta^{\mu}} J$, updating its parameters $\theta^{\mu}$ using the Adam algorithm:

$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}\big[\nabla_{z} Q(s_t, z \mid \theta^{Q})\big|_{z = \mu(s_t \mid \theta^{\mu})}\, \nabla_{\theta^{\mu}} \mu(s_t \mid \theta^{\mu})\big],$$

wherein $\nabla_{z} Q(s_t, z \mid \theta^{Q})$ is the gradient of the state-action value with respect to the action; $\nabla_{\theta^{\mu}} \mu(s_t \mid \theta^{\mu})$ is the gradient of the action output by the online actor network with parameters $\theta^{\mu}$ with respect to those parameters; $\alpha$ is the update step size of the parameters $\theta^{\mu}$.
9. The high-dimensional video cache selection method based on deep reinforcement learning of claim 1, characterized in that: in step S3, the high-dimensional state $s_t$ observed by the edge server is input; the trained improved-DDPG-based high-dimensional video caching action selection model outputs the low-dimensional continuous video caching action, the decoder restores it to the original high-dimensional discrete video caching action, and the edge server selects videos to cache according to the original high-dimensional discrete video caching action.
10. The high-dimensional video cache selection method based on deep reinforcement learning of claim 9, characterized in that: the step S3 comprises the following specific steps:

step S3-1: after observing the high-dimensional state $s_t$, the edge server inputting $s_t$ into the trained online actor network, which outputs the low-dimensional continuous video caching action $z_t = \mu(s_t \mid \theta^{\mu})$, wherein $\theta^{\mu}$ is the parameter of the online actor network obtained after the training of step S2 and $\mu(s_t \mid \theta^{\mu})$ is the network output given input state $s_t$;

step S3-2: inputting $z_t$ into the decoder, which outputs the original high-dimensional discrete video caching action $a_t = g_{\xi}(z_t)$, wherein $\xi$ is the decoder parameter obtained after the training of step S2 and $g_{\xi}(z_t)$ is the decoder output given input $z_t$;

step S3-3: the edge server selecting one video from the massive catalog according to $a_t$; if the cached videos of the edge server do not contain this video and the remaining storage capacity of the edge server is sufficient to cache it, caching the video in the edge server; otherwise, deleting the earliest cached videos in the edge server in turn until the remaining storage capacity is sufficient to cache the video, and then caching the video in the edge server.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211270042.8A CN115344510B (en) | 2022-10-18 | 2022-10-18 | High-dimensional video cache selection method based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211270042.8A CN115344510B (en) | 2022-10-18 | 2022-10-18 | High-dimensional video cache selection method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115344510A true CN115344510A (en) | 2022-11-15 |
CN115344510B CN115344510B (en) | 2023-02-03 |
Family
ID=83957657
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211270042.8A Active CN115344510B (en) | 2022-10-18 | 2022-10-18 | High-dimensional video cache selection method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115344510B (en) |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114025017A (en) * | 2021-11-01 | 2022-02-08 | 杭州电子科技大学 | Network edge caching method, device and equipment based on deep cycle reinforcement learning |
CN114281718A (en) * | 2021-12-18 | 2022-04-05 | 中国科学院深圳先进技术研究院 | Industrial Internet edge service cache decision method and system |
Also Published As
Publication number | Publication date |
---|---|
CN115344510B (en) | 2023-02-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Hu et al. | Leveraging meta-path based context for top-n recommendation with a neural co-attention model | |
Zhong et al. | A deep reinforcement learning-based framework for content caching | |
Wu et al. | Mobility-aware cooperative caching in vehicular edge computing based on asynchronous federated and deep reinforcement learning | |
He et al. | QoE-driven content-centric caching with deep reinforcement learning in edge-enabled IoT | |
JP5542688B2 (en) | Apparatus and method for optimizing user access to content | |
CN109995851B (en) | Content popularity prediction and edge caching method based on deep learning | |
Zhao et al. | Mahrl: Multi-goals abstraction based deep hierarchical reinforcement learning for recommendations | |
CN114595396B (en) | Federal learning-based sequence recommendation method and system | |
Fedchenko et al. | Feedforward neural networks for caching: N enough or too much? | |
Zheng et al. | MEC-enabled wireless VR video service: A learning-based mixed strategy for energy-latency tradeoff | |
CN102868936A (en) | Method and system for storing video logs | |
CN113255004A (en) | Safe and efficient federal learning content caching method | |
CN113687960A (en) | Edge calculation intelligent caching method based on deep reinforcement learning | |
Zhou et al. | SACC: A size adaptive content caching algorithm in fog/edge computing using deep reinforcement learning | |
CN117221403A (en) | Content caching method based on user movement and federal caching decision | |
CN115731498A (en) | Video abstract generation method combining reinforcement learning and contrast learning | |
CN115344510B (en) | High-dimensional video cache selection method based on deep reinforcement learning | |
Xue et al. | Prefrec: Recommender systems with human preferences for reinforcing long-term user engagement | |
Avrachenkov et al. | A learning algorithm for the Whittle index policy for scheduling web crawlers | |
Nguyen et al. | User-preference-based proactive caching in edge networks | |
CN112836822A (en) | Federal learning strategy optimization method and device based on width learning | |
CN117459112A (en) | Mobile edge caching method and equipment in LEO satellite network based on graph rolling network | |
Thar et al. | Meta-learning-based deep learning model deployment scheme for edge caching | |
CN114025017B (en) | Network edge caching method, device and equipment based on deep circulation reinforcement learning | |
Yan et al. | Drl-based collaborative edge content replication with popularity distillation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
GR01 | Patent grant ||