CN112752308B - Mobile prediction wireless edge caching method based on deep reinforcement learning - Google Patents
Mobile prediction wireless edge caching method based on deep reinforcement learning
- Publication number
- CN112752308B CN112752308B CN202011620501.1A CN202011620501A CN112752308B CN 112752308 B CN112752308 B CN 112752308B CN 202011620501 A CN202011620501 A CN 202011620501A CN 112752308 B CN112752308 B CN 112752308B
- Authority
- CN
- China
- Prior art keywords
- user
- cache
- service node
- neural network
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W28/00—Network traffic management; Network resource management
- H04W28/02—Traffic management, e.g. flow control or congestion control
- H04W28/10—Flow control between communication endpoints
- H04W28/14—Flow control between communication endpoints using intermediate storage
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/14—Network analysis or design
- H04L41/147—Network analysis or design for predicting network behaviour
Abstract
The invention relates to a mobile prediction wireless edge caching method based on deep reinforcement learning, which comprises the following steps: constructing a wireless intelligent cache network model, which comprises a user set, a service node set, a user request content set, a cache content set, a source content library, user historical track vectors and user classification groups; constructing a long short-term memory (LSTM) network model, taking the historical track vector of each user as input to predict the user's position in the next time slot, and classifying the predicted positions to obtain the user classification groups; establishing a replacement cache strategy, acquiring the predicted user set of each service node according to the user classification groups, and replacing the cache content of the current service node; and constructing a neural network combining Q-learning and DQN reinforcement learning, training the neural network to obtain a trained dynamic cache replacement model, and using the dynamic cache replacement model in the cache replacement strategy.
Description
Technical Field
The invention relates to a mobile prediction wireless edge caching method based on deep reinforcement learning, and belongs to the technical field of wireless communication and computers.
Background
With the exponential growth of mobile wireless communication and data demand, and the continuous improvement of device storage and computing capabilities, real-time multimedia services are gradually becoming a major business in 5G communication networks. Human life and work are migrating toward the mobile internet, pushing various network functions, such as computing and caching, to the edge of the network. By pre-storing popular content requested by users, edge caching aims to reduce traffic load and duplicate transmissions in the backhaul network, thereby significantly reducing latency; accurately predicting users' future needs is therefore critical for edge cache replacement. To capture content popularity and the dynamics of a time-varying wireless environment, policy control frameworks have been introduced into the field of wireless caching. Deep reinforcement learning, which combines deep neural networks with Q-learning, has shown excellent performance on complex control problems and is receiving increasing attention in wireless edge caching research.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a mobile prediction wireless edge caching method based on deep reinforcement learning. The method predicts the positions of mobile users with a long short-term memory (LSTM) network to overcome the influence of user mobility on the cache hit rate, and uses a neural network framework combining Q-learning and DQN reinforcement learning to carry out the cache replacement strategy of each service node, thereby solving the caching problem in wireless networks and improving the capability of mobile prediction wireless edge caching.
The technical scheme of the invention is as follows:
a mobile prediction wireless edge caching method based on deep reinforcement learning comprises the following steps:
constructing a wireless intelligent cache network model, which comprises a service node model and a service node control model, wherein the service node model comprises a user set, a service node set, a user request content set, a cache content set and a source content library; the service node control model comprises a user historical track vector and a user classification group;
mobile prediction, namely constructing a long short-term memory (LSTM) network model, taking the historical track vector of each user as input, and outputting the predicted position of the user in the next time slot; classifying the users according to their predicted positions in the next time slot to obtain the user classification groups;
establishing a replacement cache strategy, acquiring a predicted user set of each service node in a service node set in the next time slot according to a user classification group, and acquiring replacement contents from a source content library according to history request contents of users in the predicted user set and cache contents of a current service node to replace the cache contents of the current service node;
and optimizing the model, namely constructing a neural network combining Q-learning and DQN reinforcement learning, taking a sample state in the state space formed by the predicted user set, the user request content set and the cache content set as input, taking an action in the action space formed by the replacement content as output, training the neural network to obtain a trained dynamic cache replacement model, and using the dynamic cache replacement model in the cache replacement strategy.
Furthermore, the wireless intelligent cache network model operates in a time discrete mode, and in each time slot, the user request content and the user historical track are updated.
Further, the user historical track vector is a position sequence representing the movement track of the user within a period of time, and the historical track vector of each user is stored in the service node control model;
the historical track vector of the user is input into the LSTM network model, weight matrices are introduced, and the predicted position of each user in the next time slot is output.
Further, in the process of training the neural network, a reward function is constructed based on the cache hit rate to train the neural network, and the method specifically comprises the following steps:
constructing a reward function which calculates an instant reward value through an input sample state and an output action and provides the instant reward value to a neural network;
constructing a cache hit rate calculation formula, wherein the cache hit rate refers to the probability that the request content of each user in a user set corresponding to a service node can be found in the cache content of the corresponding service node;
presetting a threshold value, wherein the threshold value belongs to (0,1), acquiring the state of the sample in the next time slot according to the input sample state and the output action, calculating the cache hit rate of the sample in the state of the next time slot according to the cache hit rate calculation formula, comparing with the threshold value, and obtaining a positive instantaneous reward value when the cache hit rate of the sample in the state of the next time slot is greater than the threshold value.
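As an illustrative sketch of the hit-rate-based reward described above (the function names and content identifiers are assumptions; the embodiment below fixes the threshold at 0.6):

```python
def cache_hit_rate(requests, cache):
    """Fraction of user requests found in the node's cache
    (sum of indicator values, normalized by the number of users)."""
    if not requests:
        return 0.0
    hits = sum(1 for r in requests if r in cache)
    return hits / len(requests)

def instant_reward(next_requests, next_cache, zeta=0.6):
    """Positive instantaneous reward only when the next-slot
    hit rate exceeds the preset threshold zeta in (0,1)."""
    h = cache_hit_rate(next_requests, next_cache)
    return 1.0 if h > zeta else 0.0

# example: 3 of 4 predicted requests are cached -> hit rate 0.75 > 0.6
print(instant_reward(["o1", "o2", "o3", "o9"], {"o1", "o2", "o3"}))  # 1.0
```

The magnitude of the positive reward (1.0 here) is an assumption; the patent only requires that the reward be positive above the threshold.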
Furthermore, an experience replay mechanism is arranged in the neural network, and the input sample state, the output action, the instant reward value and the state of the sample in the next time slot are combined and stored in an experience replay library to be used as a training sample of the neural network.
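A minimal experience replay library matching this description can be sketched as follows; the class name, capacity and tuple layout are illustrative assumptions, not the patent's implementation:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state) combinations
    as training samples for the neural network."""
    def __init__(self, capacity=10000):
        # deque with maxlen evicts the oldest samples automatically
        self.memory = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state):
        self.memory.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a random minibatch (no more than the stored count)."""
        return random.sample(self.memory, min(batch_size, len(self.memory)))

buf = ReplayBuffer(capacity=2)
buf.push("s0", 0, 0.0, "s1")
buf.push("s1", 1, 1.0, "s2")
buf.push("s2", 2, 0.0, "s3")   # capacity 2: "s0" tuple is evicted
print(len(buf.memory))          # 2
```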
Further, the step of constructing the neural network combining Q learning and DQN reinforcement learning specifically includes:
defining an action value function for calculating a Q value through training samples in an experience replay library through Q learning;
the DQN reinforcement learning adopts a neural network to predict the q value: for each training sample in the experience replay library, the q value of the currently taken action is predicted from the state and action of the sample, and the q value of the action taken in the next state is predicted from the state of the sample in the next time slot;
a loss function is then constructed based on the difference between the q value of the action taken in the next state and the q value of the currently taken action, and the weight parameters of the neural network are iteratively updated by a gradient descent method to make the neural network converge.
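The loss computation can be sketched numerically as below; this assumes the q values have already been produced by the estimation and target networks, and all names are illustrative:

```python
def td_target(reward, next_qs, gamma=0.9):
    """Target value: instantaneous reward plus the discounted maximum
    q value over the next state's actions (target-network output)."""
    return reward + gamma * max(next_qs)

def dqn_loss(q_current, reward, next_qs, gamma=0.9):
    """Squared difference between the target value and the
    estimation network's q value for the current action."""
    return (td_target(reward, next_qs, gamma) - q_current) ** 2

# reward 1.0, best next-state q 2.0, current q 1.0:
# loss = (1 + 0.9*2 - 1)^2
print(dqn_loss(1.0, 1.0, [0.5, 2.0], 0.9))  # ~3.24
```

Gradient descent on this squared error with respect to the estimation-network weights, with the target network updated only periodically, is the standard DQN training step the description refers to.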
The invention has the following beneficial effects:
1. The mobile prediction wireless edge caching method based on deep reinforcement learning disclosed by the invention predicts the positions of mobile users with a long short-term memory (LSTM) network, which overcomes the influence of user mobility on the cache hit rate, and uses a neural network framework combining Q-learning and DQN reinforcement learning to carry out the cache replacement strategy of each service node, solving the caching problem in wireless networks and improving the capability of mobile prediction wireless edge caching.
2. The invention relates to a mobile prediction wireless edge caching method based on deep reinforcement learning, which is characterized in that a reward function based on cache hit rate is established, and a positive instantaneous reward value is given only when the cache hit rate is greater than a threshold value after cache content is replaced, so that the accuracy of a neural network output result is improved.
3. The invention relates to a mobile prediction wireless edge caching method based on deep reinforcement learning, which is characterized in that a prediction user set of each node is obtained according to the prediction position of a user, so that the user can obtain caching resources in a service node as much as possible, and the time delay is reduced.
4. The method uses the neural network to approximate Q values and iteratively generates the action that obtains the maximum Q value in each state, thereby obtaining an optimal cache replacement strategy; the neural network continuously updates its parameters through gradient descent so that the loss function stabilizes at its minimum and the whole network converges.
Drawings
FIG. 1 is an overall flow chart of an embodiment of the present invention;
FIG. 2 is a diagram of an exemplary wireless intelligent cache network model according to an embodiment of the present invention;
FIG. 3 is a flow chart of a caching policy in an embodiment of the invention;
FIG. 4 is a diagram illustrating an example of different movement modes according to an embodiment of the present invention;
fig. 5 is a comparison example diagram of the calculation results after the scheme of the present embodiment is adopted for different movement modes.
Detailed Description
The invention is described in detail below with reference to the figures and the specific embodiments.
The first embodiment is as follows:
referring to fig. 1, a method for caching mobile prediction wireless edges based on deep reinforcement learning includes the following steps:
the method comprises the steps of constructing a wireless intelligent cache network model, wherein the wireless intelligent cache network model comprises a service node model and a service node control model, and the service node model comprises a user set U = {U_1, U_2, ..., U_I}, a service node set B = {B_1, B_2, ..., B_J}, a user request content set R^(t) = {r_1^(t), r_2^(t), ..., r_I^(t)}, a cache content set C^(t) = {c_1^(t), c_2^(t), ..., c_J^(t)}, and a source content library O = {O_1, O_2, ..., O_K}; r_i^(t) indicates the request content of the i-th user in the t-th time slot, and c_j^(t) indicates the cache content of the j-th service node in the t-th time slot;
the service node control model comprises a user historical track vector and a user classification group;
mobile prediction, namely constructing a long short-term memory (LSTM) network model, taking the historical track vector of each user as input, and outputting the predicted position of the user in the next time slot; classifying the users according to their predicted positions in the next time slot to obtain the user classification groups;
establishing a replacement cache strategy, namely acquiring the predicted user set U_j^(t+1) of each service node in the service node set in the next time slot according to the user classification groups, and, based on the predicted user set, checking the history request contents of its users from the user request content set against the cache content c_j^(t) of the current service node; when the history request content of a user does not exist in the cache content of the current service node, the replacement content is obtained from the source content library O = {O_1, O_2, ..., O_K} to replace the cache content of the current service node, that is, part of the cached content is replaced with new content provided from O;
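The replacement step can be sketched as follows, under the assumption (not stated explicitly in the patent) that contents requested by predicted users but absent from the cache are fetched from the source library while unneeded items are evicted first; all names and the capacity handling are illustrative:

```python
def replace_cache(cache, predicted_requests, source_library, capacity):
    """Rebuild a node's cache for the next slot: keep cached items the
    predicted users still want, add their missing requests from the
    source library, and respect the cache capacity."""
    wanted = [r for r in predicted_requests if r in source_library]
    keep = [c for c in cache if c in wanted]          # evict unneeded items
    merged = list(dict.fromkeys(keep + wanted))       # dedupe, keep order
    return merged[:capacity]

cache = ["o1", "o4"]
print(replace_cache(cache, ["o1", "o2"], {"o1", "o2", "o3", "o4"}, capacity=2))
# ['o1', 'o2']  -- 'o4' evicted, 'o2' fetched from the source library
```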
optimizing the model, namely constructing a deep neural network combining Q-learning and DQN reinforcement learning; the state space is formed by the predicted user set, the user request content set and the cache content set, and a sample state s^(t) in the state space is taken as input; the action space is defined as A^(t) = {x_1, x_2, ..., x_K}, and an action (a choice of replacement content) in the action space is taken as output; the neural network is trained to obtain a trained dynamic cache replacement model, which is used in the replacement cache strategy.
This embodiment predicts the positions of mobile users with the LSTM network, which overcomes the influence of user mobility on the cache hit rate, and uses a neural network framework combining Q-learning and DQN reinforcement learning to carry out the cache replacement strategy of each service node, thereby solving the caching problem in wireless networks and improving the capability of mobile prediction wireless edge caching.
Example two:
further, the wireless intelligent cache network model runs in a time discrete mode, T ═ {1,2, …, T }. In each time slot, the location information and the request of the user are updated, i.e.Andis updated if the requested content is cached in the collectionIt will transmit directly to the user; otherwise, the content request and delivery need to be sent from a remote server over the backhaul. For updating the cache contents, the slave servicePredictive user set sent by node controllerWill be provided withAndused as input to a neural network to determine the buffer content of the next time slot, i.e.Some of the content in the store will be replaced by new content provided by the remote server.
Further, the user historical track vector is a position sequence representing the movement track of the user within a period of time, and the historical track vector of each user is stored in the service node control model; it is defined as a position sequence containing a total of β historical access records.
The position sequence is used as the input of the LSTM network, which outputs the predicted position of the user in the next time slot; here W denotes the different weight matrices of the network.
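For illustration only, a single LSTM cell unrolled over the β-record position sequence can be sketched in pure Python; the scalar weight layout W and all names are simplifying assumptions, not the patent's network:

```python
import math

def lstm_step(x, h_prev, c_prev, W):
    """One LSTM cell step on a scalar position input; W holds the
    weight entries the description refers to (scalars for readability)."""
    sig = lambda z: 1.0 / (1.0 + math.exp(-z))
    f = sig(W["f_x"] * x + W["f_h"] * h_prev + W["f_b"])        # forget gate
    i = sig(W["i_x"] * x + W["i_h"] * h_prev + W["i_b"])        # input gate
    o = sig(W["o_x"] * x + W["o_h"] * h_prev + W["o_b"])        # output gate
    g = math.tanh(W["g_x"] * x + W["g_h"] * h_prev + W["g_b"])  # candidate
    c = f * c_prev + i * g
    h = o * math.tanh(c)
    return h, c

def predict_next_position(trajectory, W):
    """Feed the historical track through the cell; the final hidden
    state stands in for the predicted next-slot position."""
    h = c = 0.0
    for pos in trajectory:
        h, c = lstm_step(pos, h, c, W)
    return h
```

In practice the weights would be learned (e.g., with an ML framework) and positions would be coordinate vectors; this sketch only shows the gate structure that maps the β-record input to a next-slot prediction.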
Further, in the process of training the neural network, a reward function is constructed based on the cache hit rate to train the neural network, and the method specifically comprises the following steps:
constructing a reward function: at each time slot, after the service node receives the request contents of the users, it combines the cache content of the current service node with the predicted user set from the service node controller model to generate a state s^(t); this state is taken as the input of the neural network, which selects an action a^(t) in the action space as output; when the action is executed, an instantaneous reward value r^(t) is obtained according to the reward function;
constructing a cache hit rate calculation formula:
h_j^(t) = (1 / |U_j^(t)|) Σ_{i ∈ U_j^(t)} 1( r_i^(t) ∈ c_j^(t) )
wherein 1(·) is an indicator function whose value is 1 when the request r_i^(t) of a user can be found in the cache content c_j^(t) of the current service node, and 0 otherwise; h_j^(t) is the cache hit rate of the j-th service node: the indicator function is evaluated once for every user currently located at that service node, and the sum is normalized to obtain the percentage hit rate.
A threshold ζ ∈ (0,1) is preset; when the cache hit rate is greater than the threshold, a positive reward r^(t) is obtained. Experiments show that ζ = 0.6 yields a better cache hit rate than other threshold values; the purpose of the system is to maximize the cache hit rate of each service node.
Furthermore, an experience replay mechanism is arranged in the neural network. After receiving the request contents of the users and the predicted user set, the service node combines the cache content of the current service node at each time slot to generate the state s^(t), which is taken as the input of the neural network; the neural network selects an action a^(t) in the action space as output, and when the action is executed, the system obtains an instantaneous reward value r^(t) according to the reward function and enters the next state s^(t+1). These four elements are then combined into a tuple (s^(t), a^(t), r^(t), s^(t+1)) and stored in the experience replay library as a training sample of the neural network.
Further, the step of constructing the neural network combining Q learning and DQN reinforcement learning specifically includes:
defining an action-value function for calculating the Q value through the training samples in the experience replay library by Q-learning:
q_π(s^(t), a^(t)) = E[ Σ_{k=0}^{∞} γ^k r^(t+k) | s^(t), a^(t) ]
where γ ∈ (0,1) represents the discount factor.
Because the large dimension of the action space would consume a large amount of memory, DQN reinforcement learning adopts a neural network with weight parameters ω to estimate the q value, q_π(s^(t), a^(t), ω) ≈ q_π(s^(t), a^(t)). When samples are extracted from the experience replay library for training, for each sample the q value of the action taken in the current state, q_π(s^(t), a^(t), ω), is calculated in the estimation network, while the next state s^(t+1) in the sample is input into the target network (whose structure is consistent with the estimation network but whose update is delayed) to calculate the q value of the action taken in the next state, max_a q_π(s^(t+1), a, ω⁻). The loss function is defined as the square of the difference between the two:
L(ω) = ( r^(t) + γ max_a q_π(s^(t+1), a, ω⁻) − q_π(s^(t), a^(t), ω) )²
Updating a weight parameter omega of the neural network by using a gradient descent method;
the trained weight parameter omega is stable, and the whole neural network is in a convergence state.
In order to make those skilled in the art further understand the solution proposed in the present embodiment, the following detailed description is made with reference to specific embodiments. The embodiment is implemented on the premise of the technical scheme of the invention, and a detailed implementation mode and a specific operation process are given.
Fig. 2 shows a wireless intelligent cache network model.
The model mainly comprises service nodes, a controller, a source server and a cache module, and introduces a user cache model under the service nodes: each service node can download the content requested by users from the source server through a backhaul link, cache it locally, and directly serve the users in its cell.
Fig. 3 shows a flow chart of the cache policy.
In time slot t, the requests and locations of the users are processed. Request side: if the content requested by a user is cached at the service node, it is sent directly to the user; if not, it is downloaded from the remote server (source content library). Location side: the historical track vector of the user is updated, the mobility of the user is predicted through the LSTM network, the predicted user set corresponding to each service node is then obtained through a classification function, and finally the cache content is updated through the neural network.
As shown in fig. 4, which is an exemplary diagram of different movement scenarios.
In order to study the proposed DRL-based caching scheme under various mobility scenarios, three different mobility patterns were tested and compared. In fig. 4, (a) is linear movement, used to simulate a user's straight-line movement on a street or road; (b) is circular movement, a typical deterministic pattern used to simulate a fixed-path trajectory; and (c) is random movement, used to simulate irregular movement of a user in an open area.
The calculation results are shown in fig. 5. The results show that the algorithm with mobility prediction is superior to the algorithm without it, with cache hit rate gains of 14.5%, 19.3% and 10.0% under linear, circular and random movement, respectively, which indicates that accurately predicting users plays a key role in adapting content replacement to users' data requests.
The above analysis shows that the proposed scheme achieves better caching capability than existing methods and effectively alleviates the caching problem for users.
The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes performed by the present specification and drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.
Claims (5)
1. A mobile prediction wireless edge caching method based on deep reinforcement learning is characterized by comprising the following steps:
constructing a wireless intelligent cache network model, which comprises a service node model and a service node control model, wherein the service node model comprises a user set, a service node set, a user request content set, a cache content set and a source content library; the service node control model comprises a user historical track vector and a user classification group;
mobile prediction, namely constructing a long short-term memory network model, taking the historical track vector of the user as input, and outputting the predicted position of the user in the next time slot; classifying according to the predicted position of each user in the user set in the next time slot to obtain the user classification group;
establishing a replacement cache strategy, acquiring a predicted user set of each service node in a service node set in the next time slot according to a user classification group, and acquiring replacement contents from a source content library according to history request contents of users in the predicted user set and cache contents of a current service node to replace the cache contents of the current service node;
the optimization model is used for constructing a neural network combining Q learning and DQN reinforcement learning, taking a sample state in a state space formed by a prediction user set, a user request content set and a cache content set as an input, taking a certain action in an action space formed by replacement content as an output, training the neural network to obtain a trained dynamic cache replacement model, and using the dynamic cache replacement model in a cache replacement strategy; the wireless intelligent cache network model operates in a time discrete mode, and in each time slot, user request content and user historical tracks are updated.
2. The method for caching the mobile prediction wireless edge based on the deep reinforcement learning of claim 1, wherein: the user historical track vector is a position sequence and represents the moving track of the user within a period of time, and the historical track vector of each user is stored in the service node control model;
and inputting the historical track vector of the user into the long-short term memory network model, introducing a weight matrix, and outputting the predicted position of each user in the next time slot.
3. The method for mobile prediction wireless edge caching based on deep reinforcement learning according to claim 1, wherein in the process of training the neural network, a reward function is constructed based on a cache hit rate to train the neural network, and the method comprises the following specific steps:
constructing a reward function which calculates an instant reward value through an input sample state and an output action and provides the instant reward value to a neural network;
constructing a cache hit rate calculation formula, wherein the cache hit rate refers to the probability that the request content of each user in a user set corresponding to a service node can be found in the cache content of the corresponding service node;
presetting a threshold value, wherein the threshold value belongs to (0,1), acquiring the state of the sample in the next time slot according to the input sample state and the output action, calculating the cache hit rate of the sample in the state of the next time slot according to the cache hit rate calculation formula, comparing with the threshold value, and obtaining a positive instantaneous reward value when the cache hit rate of the sample in the state of the next time slot is greater than the threshold value.
4. The method of claim 3, wherein the method comprises: the neural network is provided with an experience replay mechanism, and the input sample state, the output action, the instant reward value and the state of the sample in the next time slot are combined and stored in an experience replay library to be used as a training sample of the neural network.
5. The method for mobile prediction wireless edge caching based on deep reinforcement learning of claim 4, wherein the step of constructing the neural network combining Q learning and DQN reinforcement learning specifically comprises:
defining an action value function for calculating a Q value through training samples in an experience replay library through Q learning;
the DQN reinforcement learning adopts a neural network to predict a q value, and for each training sample in an experience playback library, the q value of a currently taken action is predicted through the state and the action of the sample, and then the q value of the next state taken action is predicted through the state and the action of the sample in the next time slot; and constructing a loss function taking the difference value between the q value of the action taken in the next state and the q value of the current action taken as a reference, and iteratively updating the weight parameters of the neural network by using a gradient descent method to make the neural network converge.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011620501.1A CN112752308B (en) | 2020-12-31 | 2020-12-31 | Mobile prediction wireless edge caching method based on deep reinforcement learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011620501.1A CN112752308B (en) | 2020-12-31 | 2020-12-31 | Mobile prediction wireless edge caching method based on deep reinforcement learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112752308A CN112752308A (en) | 2021-05-04 |
CN112752308B true CN112752308B (en) | 2022-08-05 |
Family
ID=75650307
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011620501.1A Active CN112752308B (en) | 2020-12-31 | 2020-12-31 | Mobile prediction wireless edge caching method based on deep reinforcement learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112752308B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113315978B (en) * | 2021-05-13 | 2022-03-15 | 江南大学 | Collaborative online video edge caching method based on federal learning |
CN113422801B (en) * | 2021-05-13 | 2022-12-06 | 河南师范大学 | Edge network node content distribution method, system, device and computer equipment |
CN114025017B (en) * | 2021-11-01 | 2024-04-16 | 杭州电子科技大学 | Network edge caching method, device and equipment based on deep circulation reinforcement learning |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103282891A (en) * | 2010-08-16 | 2013-09-04 | 甲骨文国际公司 | System and method for effective caching using neural networks |
US9047225B1 (en) * | 2012-09-27 | 2015-06-02 | Emc Corporation | Dynamic selection of data replacement protocol for cache |
CN110651279A (en) * | 2017-06-28 | 2020-01-03 | 渊慧科技有限公司 | Training motion selection neural networks with apprentices |
CN111901392A (en) * | 2020-07-06 | 2020-11-06 | 北京邮电大学 | Mobile edge computing-oriented content deployment and distribution method and system |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10595037B2 (en) * | 2016-10-28 | 2020-03-17 | Nec Corporation | Dynamic scene prediction with multiple interacting agents |
EP3850548A1 (en) * | 2018-09-11 | 2021-07-21 | NVIDIA Corporation | Future object trajectory predictions for autonomous machine applications |
US20200134445A1 (en) * | 2018-10-31 | 2020-04-30 | Advanced Micro Devices, Inc. | Architecture for deep q learning |
CN109976909B (en) * | 2019-03-18 | 2022-11-08 | Central South University | Learning-based low-delay task scheduling method in edge computing network |
US11636393B2 (en) * | 2019-05-07 | 2023-04-25 | Cerebri AI Inc. | Predictive, machine-learning, time-series computer models suitable for sparse training sets |
EP3748455B1 (en) * | 2019-06-07 | 2022-03-16 | Tata Consultancy Services Limited | A method and a system for hierarchical network based diverse trajectory proposal |
- 2020-12-31: CN application CN202011620501.1A granted as patent CN112752308B, status Active
Non-Patent Citations (3)
Title |
---|
UAV trajectory prediction model and simulation based on Bi-LSTM; Yang Rennong et al.; Advances in Aeronautical Science and Engineering; 2020-02-28 (No. 01); full text * |
Deep learning-based intelligent caching for mobile edge networks; Song Xuming et al.; Journal of University of Chinese Academy of Sciences; 2020-01-15 (No. 01); full text * |
Crime location prediction method based on long short-term memory convolutional neural networks; Xiao Yanhui et al.; Data Analysis and Knowledge Discovery; 2018-10-25 (No. 10); full text * |
Also Published As
Publication number | Publication date |
---|---|
CN112752308A (en) | 2021-05-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112752308B (en) | Mobile prediction wireless edge caching method based on deep reinforcement learning | |
Zhu et al. | Caching transient data for Internet of Things: A deep reinforcement learning approach | |
CN109660598B (en) | Cache replacement method and system for transient data of Internet of things | |
Li et al. | Energy-latency tradeoffs for edge caching and dynamic service migration based on DQN in mobile edge computing | |
Zhang et al. | Toward edge-assisted video content intelligent caching with long short-term memory learning | |
CN109639760A (en) | Caching policy method in D2D networks based on deep reinforcement learning | |
CN109982104B (en) | Mobility-aware video prefetching and cache replacement decision method in mobile edge computing | |
CN111491331B (en) | Network-aware adaptive caching method based on transfer learning in fog computing networks | |
CN114553963B (en) | Multi-edge node collaborative caching method based on deep neural networks in mobile edge computing | |
CN115809147B (en) | Multi-edge collaborative cache scheduling optimization method, system and model training method | |
Yan et al. | Distributed edge caching with content recommendation in fog-rans via deep reinforcement learning | |
Wang et al. | Deepchunk: Deep q-learning for chunk-based caching in wireless data processing networks | |
Negara et al. | Caching and machine learning integration methods on named data network: A survey | |
CN113687960A (en) | Edge computing intelligent caching method based on deep reinforcement learning | |
Malektaji et al. | Deep reinforcement learning-based content migration for edge content delivery networks with vehicular nodes | |
Li et al. | DQN-enabled content caching and quantum ant colony-based computation offloading in MEC | |
CN117221403A (en) | Content caching method based on user mobility and federated caching decisions | |
CN116321307A (en) | Bidirectional cache placement method based on deep reinforcement learning in non-cellular network | |
Yu et al. | Mobility-aware proactive edge caching for large files in the internet of vehicles | |
Liu et al. | Mobility-aware video prefetch caching and replacement strategies in mobile-edge computing networks | |
CN113114762B (en) | Data caching method and system | |
CN114786200A (en) | Intelligent data caching method based on cooperative sensing | |
CN115129888A (en) | Active content caching method based on network edge knowledge graph | |
Hatami et al. | Online caching policy with user preferences and time-dependent requests: A reinforcement learning approach | |
Li et al. | A deep reinforcement learning-based content updating algorithm for high definition map edge caching |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||