CN113992770B - Policy-based federal reinforcement learning collaborative caching method in fog wireless access network - Google Patents

Policy-based federal reinforcement learning collaborative caching method in fog wireless access network

Info

Publication number
CN113992770B
CN113992770B (application CN202111270116.3A)
Authority
CN
China
Prior art keywords
content
node
cache
network
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111270116.3A
Other languages
Chinese (zh)
Other versions
CN113992770A (en)
Inventor
蒋雁翔
王宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202111270116.3A priority Critical patent/CN113992770B/en
Publication of CN113992770A publication Critical patent/CN113992770A/en
Application granted granted Critical
Publication of CN113992770B publication Critical patent/CN113992770B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/24Negotiation of communication capabilities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Security & Cryptography (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a policy-based federated reinforcement learning collaborative caching method in a fog radio access network, which comprises the following steps: 1. initialize the local cache content of the nodes in the edge fog caching network, initialize the global model training period and the model weight parameters, and distribute the local model to each node; 2. each cache node shares its cache content state information with the adjacent cache nodes and the cloud server; 3. according to the user request information received in each time slot, the cache node decides among the local cache, the adjacent node caches and the cloud server to serve the user request; 4. calculate the cache hit rate and the content request delay of the user; 5. the cache node updates its local cache content and training model parameters according to the local content cache state and the user's content request information; 6. the training model weight parameters of the nodes are jointly updated. The invention reduces user request delay and protects user privacy.

Description

Policy-based federal reinforcement learning collaborative caching method in fog wireless access network
Technical Field
The invention belongs to the field of edge-network collaborative caching in mobile communication systems, and particularly relates to a policy-based federated reinforcement learning collaborative caching method in a fog radio access network.
Background
With the arrival of the 5G era, the number of mobile devices and applications has grown rapidly, and the resulting massive data places tremendous traffic pressure on wireless cellular networks. The fog radio access network is a promising approach to relieving congestion on cellular network communication links. In a fog radio access network, edge caching places popular content in fog radio access points, also referred to as cache nodes, that are closer to the user. Introducing cache nodes effectively reduces the load on the backhaul link and the content transmission delay. Because the communication resources and local storage capacity of cache nodes are limited, how to cache the most popular content is an important direction of current edge caching research.
In recent years, reinforcement learning has become an important tool for optimizing cooperative content caching in fog radio access networks. However, most reinforcement learning algorithms applied to the cooperative edge caching problem in fog radio access networks are value-based: they must evaluate the Q value of every possible state-action pair to obtain the optimal action selection, and as the dimension of the action space grows, the number of Q values to be computed grows with it, so such algorithms perform poorly on problems with large action spaces. In addition, most reinforcement learning algorithms require users to upload their own data to the cloud for training, neglecting the protection of users' sensitive data. Finally, the traditional way of training a reinforcement learning network in a fog radio access network is to place a single learning agent on the cloud for independent training, which wastes the computing resources of the individual nodes and slows convergence.
Disclosure of Invention
The invention aims to provide a policy-based federated reinforcement learning collaborative caching method in a fog radio access network, in order to solve the technical problems of high user content request delay, wasted storage and computing resources, poor suitability for high-dimensional action spaces, slow network convergence, and inadequate protection of user privacy.
In order to solve the technical problems, the specific technical scheme of the invention is as follows:
a federated deep deterministic policy gradient learning collaborative caching method in a fog radio access network comprises the following steps:
Step 1, initialize the total cache content state $s(0)$, the single-training-period step length $l$, the total number of model periods $T$, and the network parameters of the model, including the parameter $\theta^Q$ of the online Q-value network $Q(s,a|\theta^Q)$, the parameter $\theta^\mu$ of the online policy network $\mu(s|\theta^\mu)$, the parameter $\theta^{Q'}$ of the target Q-value network $Q'(s,a|\theta^{Q'})$, and the parameter $\theta^{\mu'}$ of the target policy network $\mu'(s|\theta^{\mu'})$, where $s$ denotes the state input to the network and $a$ denotes the action-selection input to the network; the initialized target Q-value network parameter $\theta^{Q'}$ equals the initialized online Q-value network parameter $\theta^Q$, and the initialized target policy network parameter $\theta^{\mu'}$ equals $\theta^\mu$;
Step 2, selecting a popularity estimation algorithm to calculate global content popularity of the time slot tDegree ofWherein the method comprises the steps ofP f (t) is popularity of content f; each base station is used as a cache node, collects content request information of users, and makes action selection based on a local content cache state set of all the base stations to acquire a next state;
Step 3, based on the action selection made by the cache node in step 2 and the content popularity in the period, calculate the users' average content request delays $D_{F\text{-}U}(t)$, $D_{F\text{-}F\text{-}U}(t)$ and $D_{C\text{-}F\text{-}U}(t)$, where $D_{F\text{-}U}(t)$ denotes the request delay incurred when the user obtains its requested content directly from the local cache node, $D_{F\text{-}F\text{-}U}(t)$ denotes the request delay incurred when the user obtains its requested content from an adjacent cache node, and $D_{C\text{-}F\text{-}U}(t)$ denotes the request delay incurred when the user must obtain its requested content from the cloud server;
Step 4, calculate the reward value of the state-action pair under each of the content acquisition modes in step 3;
Step 5, store the transition group in an experience replay pool, randomly sample transition groups from the experience replay pool to update the network parameters, and enter the next time slot after the update;
Step 6, after a training period ends, upload the model network parameters of each cache node to the cloud, generate the global network parameters at the cloud, distribute them to each node, and enter the next training period.
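As an illustration of the four networks initialized in step 1 above, the following is a minimal sketch assuming a PyTorch implementation; the class names, hidden-layer width, and the state and action dimensions are hypothetical and not taken from the patent.

```python
import copy
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Online Q-value network Q(s, a | theta_Q)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

class Actor(nn.Module):
    """Online policy network mu(s | theta_mu)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim))
    def forward(self, s):
        return self.net(s)

# Hypothetical sizes: F = 50 contents in the library, C = 10 cache slots (+1 no-replacement action).
state_dim, action_dim = 50, 11
critic, actor = Critic(state_dim, action_dim), Actor(state_dim, action_dim)
critic_target = copy.deepcopy(critic)   # theta_Q' initialized equal to theta_Q
actor_target = copy.deepcopy(actor)     # theta_mu' initialized equal to theta_mu
```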
Further, the step 2 specifically includes the following steps:
Step 2.1, the local online policy network $\mu(s|\theta^\mu)$ generates the action selection $a(t)$ from the current total cache state $s(t)$, i.e. $a(t) = \mu(s(t)|\theta^\mu)$, where $s(t) = \{s_1(t), \ldots, s_n(t), \ldots, s_N(t)\}$, $N$ denotes the total number of cache nodes in the fog network, $s_n(t)$ is the state space of cache node $n$ at the $t$-th time slot, $n_c$ denotes the cache index of content $c$ in cache node $n$, and $F$ denotes the total number of contents in the content library;
Step 2.2, if cache node $n$ receives a content request in time slot $t$, denote that content as $f$, and denote the most popular content not cached by the node in time slot $t$ as $f'$; the node executes a cache replacement action according to $a(t)$, with $C$ denoting the cache capacity of the node; the replacement action covers three cases: if $f$ is already cached at the node, $a(t) = c$ with $c \neq C+1$ means that node $n$ replaces its cached content $n_c$ with $f'$; if $f$ is not cached at the node, $a(t) = c$ with $c \neq C+1$ means that node $n$ replaces its cached content $n_c$ with $f$; and $a(t) = C+1$ means that node $n$ does not replace any cached content in the $t$-th time slot;
Step 2.3, after cache node $n$ completes the cache replacement of step 2.2, update the local state space of cache node $n$ and sort the cached content indices in the state space in descending order of popularity to obtain the state space $s_n(t+1)$ of the next time slot $t+1$; then integrate the state spaces of all nodes to obtain the total new state space $s(t+1)$.
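To make the replacement rule of steps 2.2 and 2.3 concrete, the following is an illustrative sketch; representing a node's cache as a Python list of content indices and the popularity array are assumptions for illustration only.

```python
def apply_cache_action(cache, a, f, f_prime, popularity):
    """Apply the step-2.2 replacement at one node.

    cache:      list of C content indices currently cached at node n
    a:          action in {1, ..., C+1}; a == C+1 means no replacement
    f:          the requested content; f_prime: the most popular uncached content
    popularity: popularity[c-1] is the popularity of content c
    """
    C = len(cache)
    if a <= C:
        # Slot a receives f' if f is already cached locally, otherwise it receives f.
        cache[a - 1] = f_prime if f in cache else f
    # Step 2.3: re-sort the cached indices by descending popularity to form s_n(t+1).
    cache.sort(key=lambda c: popularity[c - 1], reverse=True)
    return cache
```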
Further, the step 3 specifically includes the following steps:
Step 3.1, node $n$ receives the content $f$ requested by the user; if content $f$ is cached in the local node, the local node sends content $f$ directly to the user, and the resulting content request delay is $d_{n1}$, where $d_{n1}$ denotes the time required to transmit the content from the local node to the user; the average local request delay $D_{F\text{-}U}(t)$ of all nodes in the model at time slot $t$ can then be calculated as:
where $N$ denotes the number of cache nodes and $c_{f,n}(t) = 1$ indicates that content $f$ is cached in node $n$;
Step 3.2, if the local node does not cache content $f$ and an adjacent cache node stores content $f$, the user obtains its requested content $f$ from the adjacent cache node, and the resulting content request delay is $d_{n1} + d_{n2}$, where $d_{n2}$ denotes the time required to transmit content between two adjacent nodes; the average cooperation request delay $D_{F\text{-}F\text{-}U}(t)$ of all nodes in the model at time slot $t$ can then be calculated as:
Step 3.3, if neither the local node nor the adjacent cache nodes store content $f$, the user obtains the requested content $f$ from the cloud server, and the resulting content request delay is $d_{n1} + d_{n3}$, where $d_{n3}$ denotes the transmission time required to send the content from the cloud to the cache node; the average cloud request delay $D_{C\text{-}F\text{-}U}(t)$ of all nodes in the model at time slot $t$ can then be calculated as:
Step 3.4, the total average request delay of all nodes in the model at time slot $t$ is calculated as:
$$D_{total}(t) = D_{F\text{-}U}(t) + D_{F\text{-}F\text{-}U}(t) + D_{C\text{-}F\text{-}U}(t).$$
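The following sketch illustrates how the three per-slot delay averages of step 3 could be computed from per-node request counts and hit indicators; the link delays d1, d2, d3 and the request matrix are placeholders, and the aggregation shown is an assumed form rather than the patent's exact expressions.

```python
import numpy as np

def average_delays(requests, local_hit, neighbor_hit, d1=5.0, d2=20.0, d3=100.0):
    """requests[n, f]: number of requests for content f at node n in slot t.
    local_hit / neighbor_hit: boolean arrays of the same shape indicating where
    content f is available locally or at an adjacent node.
    Returns the per-node averages (D_F-U, D_F-F-U, D_C-F-U)."""
    N = requests.shape[0]
    cloud = ~local_hit & ~neighbor_hit
    D_local = (requests * local_hit * d1).sum() / N
    D_coop = (requests * (~local_hit & neighbor_hit) * (d1 + d2)).sum() / N
    D_cloud = (requests * cloud * (d1 + d3)).sum() / N
    return D_local, D_coop, D_cloud

# Step 3.4: D_total = D_F-U + D_F-F-U + D_C-F-U.
```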
Further, the step 4 specifically includes the following steps:
Step 4.1, the local node $n$ receives the content $f$ requested by the user; if content $f$ is cached in the local node, the reward value of the state-action pair in time slot $t$ is:
where $\lambda_1$ is a reward-function parameter set according to the actual application scenario;
Step 4.2, if the local node $n$ does not cache content $f$ and an adjacent cache node stores content $f$, the user obtains its requested content $f$ from the adjacent cache node, and the reward value of the state-action pair in time slot $t$ is:
where $\lambda_2$ is a reward-function parameter set according to the actual application scenario;
Step 4.3, if neither the local node nor the adjacent cache nodes store content $f$, the user obtains the requested content $f$ from the cloud server, and the reward value of the state-action pair in time slot $t$ is:
where $\lambda_3$ is a reward-function parameter set according to the actual application scenario; furthermore, $\lambda_1 + \lambda_2 + \lambda_3 = 1$ and $\lambda_1 < \lambda_2 \ll \lambda_3$.
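The exact reward expressions are not reproduced here, so the sketch below simply penalizes the delay of whichever acquisition mode served the request, weighted by the corresponding lambda; the numerical lambda values are assumptions chosen to satisfy lambda1 < lambda2 << lambda3 and to sum to 1.

```python
def slot_reward(source, d1=5.0, d2=20.0, d3=100.0, lambdas=(0.05, 0.15, 0.80)):
    """Illustrative reward r(t): the farther the content travels, the larger the penalty.
    source: 'local', 'neighbor' or 'cloud'."""
    lam1, lam2, lam3 = lambdas  # assumed values with lam1 < lam2 << lam3, summing to 1
    if source == 'local':
        return -lam1 * d1
    if source == 'neighbor':
        return -lam2 * (d1 + d2)
    return -lam3 * (d1 + d3)
```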
Further, the step 5 specifically includes the following steps:
Step 5.1, each node composes the state $s(t)$, the action $a(t)$, the next state $s(t+1)$ and the reward $r(t)$ obtained in step 4 into a transition group, i.e. $\{s(t), a(t), r(t), s(t+1)\}$, and stores this transition group in the experience replay pool $\varepsilon$ of the node;
Step 5.2, randomly sample $N$ transition groups $\{s(i), a(i), r(i), s(i+1)\}$ from the experience replay pool $\varepsilon$ and calculate the loss function $L$ of the Q-value network as:
$$L = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - Q(s(i), a(i)|\theta^Q)\big)^2,$$
where $y_i$ is an intermediate quantity of the loss function given by:
$$y_i = r(i) + \gamma Q'\big(s(i+1), \mu'(s(i+1)|\theta^{\mu'})\,\big|\,\theta^{Q'}\big);$$
by minimizing the loss function $L$ of the Q-value network, the online Q-value network parameter $\theta^Q$ is updated;
Step 5.3, calculate the policy network objective function; according to the sampled transition groups, the gradient of the approximate policy network objective function is calculated by the Monte Carlo method as:
$$\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_{i=1}^{N}\nabla_a Q(s, a|\theta^Q)\big|_{s=s(i),\,a=\mu(s(i))}\,\nabla_{\theta^\mu}\mu(s|\theta^\mu)\big|_{s=s(i)};$$
this gradient is used to update the online policy network parameter $\theta^\mu$;
Step 5.4, update the target Q-value network parameter $\theta^{Q'}$ according to the online Q-value network parameter $\theta^Q$ as follows:
$$\theta^{Q'} \leftarrow \tau\theta^Q + (1-\tau)\theta^{Q'};$$
update the target policy network parameter $\theta^{\mu'}$ according to the online policy network parameter $\theta^\mu$ as follows:
$$\theta^{\mu'} \leftarrow \tau\theta^\mu + (1-\tau)\theta^{\mu'};$$
Step 5.5, enter the next time slot and let $s(t) = s(t+1)$.
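A condensed sketch of the step-5 update is given below, assuming a PyTorch implementation that reuses the Critic and Actor classes sketched above; the batch size, discount factor gamma and soft-update rate tau are assumed values.

```python
import random
import torch
import torch.nn.functional as F

def ddpg_update(replay, actor, critic, actor_t, critic_t, actor_opt, critic_opt,
                batch=64, gamma=0.99, tau=0.005):
    # Step 5.2: sample transition groups {s(i), a(i), r(i), s(i+1)} from the replay pool.
    s, a, r, s2 = map(torch.stack, zip(*random.sample(replay, batch)))
    with torch.no_grad():
        y = r.unsqueeze(-1) + gamma * critic_t(s2, actor_t(s2))   # y_i
    critic_loss = F.mse_loss(critic(s, a), y)                      # loss L
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Step 5.3: sampled policy gradient, implemented by ascending Q(s, mu(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Step 5.4: soft target updates theta' <- tau*theta + (1 - tau)*theta'.
    for net, tgt in ((critic, critic_t), (actor, actor_t)):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```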
Further, in step 6 the network parameters of the cache nodes are jointly updated, which specifically includes the following steps:
Step 6.1, after one period of training, each cache node uploads its network parameters $\theta_n(t_l)$ to the cloud;
Step 6.2, the cloud calculates and updates the network parameters $\theta_G(l)$ of the global model:
$$\theta_G(l) = \sum_{n=1}^{N}\frac{|D_n|}{\sum_{m=1}^{N}|D_m|}\,\theta_n(t_l),$$
where $D_n$ is the local data set of cache node $n$;
Step 6.3, the cloud server sends the global model network parameters to each cache node and uses them as the initialization parameters for the next period of training.
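The aggregation of step 6.2 is a data-size-weighted average of the node parameters in the style of federated averaging; the sketch below assumes each node's parameters arrive as a PyTorch state_dict and that |D_n| is the size of its local data set.

```python
import torch

def federated_average(node_state_dicts, dataset_sizes):
    """theta_G = sum_n (|D_n| / sum_m |D_m|) * theta_n, applied key by key."""
    total = float(sum(dataset_sizes))
    global_sd = {}
    for key in node_state_dicts[0]:
        global_sd[key] = sum((size / total) * sd[key]
                             for sd, size in zip(node_state_dicts, dataset_sizes))
    return global_sd  # broadcast back to every node as the next-period initialization
```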
The policy-based federated reinforcement learning collaborative caching method in a fog radio access network has the following advantages:
1. The invention adopts a policy-based reinforcement learning algorithm, deep deterministic policy gradient learning, which combines the policy-based actor-critic framework with the asynchronous target-update idea of the deep Q-learning algorithm; action selections are generated directly by the policy network, which adapts well to problems with high-dimensional action spaces, and the asynchronous update strategy ensures the convergence of the network.
2. By adopting horizontal federated learning, the invention aggregates the local reinforcement learning network parameters of all cache nodes into the global network parameters at the cloud server, which strengthens cache cooperation among the cache nodes, makes effective use of storage and computing resources, and improves the network convergence speed through parallel training at each node.
3. In the global model training process, model parameters are transmitted instead of user data, so the user data always remains in the respective cache nodes and is never sent to the cloud, which well protects the privacy of user data.
Drawings
Fig. 1 is a schematic flow diagram of the policy-based federated reinforcement learning collaborative caching method in a fog radio access network according to the present invention;
Fig. 2 is a graph of simulation results comparing the content acquisition delay of the present invention with baseline edge caching policies.
Detailed Description
For a better understanding of the purposes, structure and functions of the present invention, the policy-based federated reinforcement learning collaborative caching method in a fog radio access network is described in further detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of the policy-based federated reinforcement learning collaborative caching method in a fog radio access network according to the present invention, which includes the following steps:
S1: calculate the global content popularity $P(0)$ according to the Mandelbrot-Zipf distribution, and initialize the total cache content state $s(0)$, the single-training-period step length $l$, the total number of model periods $T$, and the network parameters of the model, including the parameter $\theta^Q$ of the online Q-value network $Q(s,a|\theta^Q)$, the parameter $\theta^\mu$ of the online policy network $\mu(s|\theta^\mu)$, the parameter $\theta^{Q'}$ of the target Q-value network $Q'(s,a|\theta^{Q'})$, and the parameter $\theta^{\mu'}$ of the target policy network $\mu'(s|\theta^{\mu'})$, where $s$ denotes the state input to the network and $a$ denotes the action-selection input to the network; the initialized target Q-value network parameter $\theta^{Q'}$ equals the initialized online Q-value network parameter $\theta^Q$, and the initialized target policy network parameter $\theta^{\mu'}$ equals $\theta^\mu$.
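Since S1 computes the initial popularity from the Mandelbrot-Zipf distribution, the following small sketch illustrates that computation; the skewness and plateau parameters are assumed values rather than values from the patent.

```python
import numpy as np

def mandelbrot_zipf_popularity(num_contents, alpha=0.8, q=5.0):
    """P_f proportional to 1 / (rank_f + q)^alpha, normalized to sum to 1."""
    ranks = np.arange(1, num_contents + 1)
    weights = 1.0 / (ranks + q) ** alpha
    return weights / weights.sum()

# P[f-1] is the popularity of content f in a library of 50 contents (hypothetical size).
P = mandelbrot_zipf_popularity(num_contents=50)
```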
S2: select a popularity estimation algorithm to calculate the global content popularity $P(t) = \{P_1(t), \ldots, P_f(t), \ldots, P_F(t)\}$ of time slot $t$, where $P_f(t)$ is the popularity of content $f$. Each base station acts as a cache node, collects the content request information of users, and makes an action selection based on the set of local content cache states of all base stations to obtain the next state;
s2-1: local online policy network μ (s|θ μ ) Generating action choices a (t), i.e. a (t) =μ (s (t) |θ, from the current total cache state s (t) μ ) Wherein s (t) = { s 1 (t),...,s n (t),...,s N (t) }, N represents the total number of cache nodes in the fog network,to buffer the state space of node n at the t-th time slot, n c Representing the cache index of content c in cache node n, and F represents the total number of content in the content library.
S2-2: if cache node $n$ receives a content request in time slot $t$, denote that content as $f$, and denote the most popular content not cached by the node in time slot $t$ as $f'$. The node executes a cache replacement action according to $a(t)$, with $C$ denoting the cache capacity of the node. The replacement action covers three cases: if $f$ is already cached at the node, $a(t) = c$ with $c \neq C+1$ means that node $n$ replaces its cached content $n_c$ with $f'$; if $f$ is not cached at the node, $a(t) = c$ with $c \neq C+1$ means that node $n$ replaces its cached content $n_c$ with $f$; and $a(t) = C+1$ means that node $n$ does not replace any cached content in the $t$-th time slot.
S2-3: after cache node $n$ completes the cache replacement of step S2-2, update the local state space of cache node $n$ and sort the cached content indices in the state space in descending order of popularity to obtain the state space $s_n(t+1)$ of the next time slot $t+1$; then integrate the state spaces of all nodes to obtain the total new state space $s(t+1)$.
S3: based on the action selection made by the cache node in step 2 and the content popularity in the period, calculate the users' average content request delays $D_{F\text{-}U}(t)$, $D_{F\text{-}F\text{-}U}(t)$ and $D_{C\text{-}F\text{-}U}(t)$, where $D_{F\text{-}U}(t)$ denotes the request delay incurred when the user obtains its requested content directly from the local cache node, $D_{F\text{-}F\text{-}U}(t)$ denotes the request delay incurred when the user obtains its requested content from an adjacent cache node, and $D_{C\text{-}F\text{-}U}(t)$ denotes the request delay incurred when the user must obtain its requested content from the cloud server. This specifically comprises the following steps:
S3-1: node $n$ receives the content $f$ requested by the user; if content $f$ is cached in the local node, the local node sends content $f$ directly to the user, and the resulting content request delay is $d_{n1}$, where $d_{n1}$ denotes the time required to send the content from the local node to the user. The average local request delay $D_{F\text{-}U}(t)$ of all nodes in the model at time slot $t$ can then be calculated as:
where $N$ denotes the number of cache nodes and $c_{f,n}(t) = 1$ indicates that content $f$ is cached in node $n$.
S3-2: if the local node does not cache content $f$ and an adjacent cache node stores content $f$, the user obtains its requested content $f$ from the adjacent cache node, and the resulting content request delay is $d_{n1} + d_{n2}$, where $d_{n2}$ denotes the time required to transfer content between two adjacent nodes. The average cooperation request delay $D_{F\text{-}F\text{-}U}(t)$ of all nodes in the model at time slot $t$ can then be calculated as:
S3-3: if neither the local node nor the adjacent cache nodes store content $f$, the user obtains the requested content $f$ from the cloud server, and the resulting content request delay is $d_{n1} + d_{n3}$, where $d_{n3}$ denotes the transmission time required to send the content from the cloud to the cache node. The average cloud request delay $D_{C\text{-}F\text{-}U}(t)$ of all nodes in the model at time slot $t$ can then be calculated as:
S3-4: the total average request delay of all nodes in the model at time slot $t$ is calculated as:
$$D_{total}(t) = D_{F\text{-}U}(t) + D_{F\text{-}F\text{-}U}(t) + D_{C\text{-}F\text{-}U}(t).$$
S4: calculate the reward value of the state-action pair under each of the content acquisition modes discussed in step 3. This specifically comprises the following steps:
S4-1: the local node $n$ receives the content $f$ requested by the user; if content $f$ is cached in the local node, the reward value of the state-action pair in time slot $t$ is:
where $\lambda_1$ is a reward-function parameter set according to the actual application scenario.
S4-2: if the local node $n$ does not cache content $f$ and an adjacent cache node stores content $f$, the user obtains its requested content $f$ from the adjacent cache node, and the reward value of the state-action pair in time slot $t$ is:
where $\lambda_2$ is a reward-function parameter set according to the actual application scenario.
S4-3: if the local node and the adjacent cache node do not store the content f, the user will get the request content f from the cloud server, and the reward value of the action state pair in the time slot t is:
wherein lambda is 3 And setting parameters of the reward function according to actual application scenes. Furthermore lambda 123 =1,λ 1 <λ 2 <<λ 3
S5: and storing the conversion group into an experience playback pool, randomly sampling the conversion group from the experience playback pool for updating network parameters, and entering the next time slot after updating. The method specifically comprises the following steps:
S5-1: each node composes the state $s(t)$ and action $a(t)$ from step 2, the next state $s(t+1)$, and the reward $r(t)$ acquired in step 4 into a "state-action-reward-next state" transition group, i.e. $\{s(t), a(t), r(t), s(t+1)\}$. This transition group is stored in the experience replay pool $\varepsilon$ of each node.
S5-2: randomly sample $N$ transition groups $\{s(i), a(i), r(i), s(i+1)\}$ from the experience replay pool $\varepsilon$ and calculate the loss function $L$ of the Q-value network as:
$$L = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - Q(s(i), a(i)|\theta^Q)\big)^2,$$
where $y_i$ is an intermediate quantity of the loss function given by:
$$y_i = r(i) + \gamma Q'\big(s(i+1), \mu'(s(i+1)|\theta^{\mu'})\,\big|\,\theta^{Q'}\big).$$
By minimizing the loss function $L$ of the Q-value network, the online Q-value network parameter $\theta^Q$ is updated.
S5-3: according to the sampled transition groups, the gradient of the approximate policy network objective function is calculated by the Monte Carlo method as:
$$\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_{i=1}^{N}\nabla_a Q(s, a|\theta^Q)\big|_{s=s(i),\,a=\mu(s(i))}\,\nabla_{\theta^\mu}\mu(s|\theta^\mu)\big|_{s=s(i)}.$$
This gradient is used to update the online policy network parameter $\theta^\mu$.
S5-4: update the target Q-value network parameter $\theta^{Q'}$ according to the online Q-value network parameter $\theta^Q$ as follows:
$$\theta^{Q'} \leftarrow \tau\theta^Q + (1-\tau)\theta^{Q'};$$
update the target policy network parameter $\theta^{\mu'}$ according to the online policy network parameter $\theta^\mu$ as follows:
$$\theta^{\mu'} \leftarrow \tau\theta^\mu + (1-\tau)\theta^{\mu'}.$$
S5-5: enter the next time slot and let $s(t) = s(t+1)$.
S6: after a training period ends, each cache node uploads its model network parameters $\theta_n(t_l)$ to the cloud; the global network parameters are generated at the cloud, distributed to each node, and the next training period begins. This specifically comprises the following steps:
S6-1: after one period of training, each cache node uploads its network parameters to the cloud.
S6-2: the cloud calculates and updates the network parameters $\theta_G(l)$ of the global model:
$$\theta_G(l) = \sum_{n=1}^{N}\frac{|D_n|}{\sum_{m=1}^{N}|D_m|}\,\theta_n(t_l),$$
where $D_n$ is the local data set of cache node $n$.
S6-3: the cloud server sends the global model network parameters to each cache node and uses them as the initialization parameters for the next period of training.
As can be seen from the simulation results of Fig. 2, compared with four conventional caching methods, namely the least recently used (LRU) caching method, the least frequently used (LFU) caching method, deep deterministic policy gradient learning and deep Q-learning, the federated deep deterministic policy gradient learning method of the present invention achieves significantly better delay reduction. The introduction of federated learning gives the algorithm a faster convergence speed and more stable performance than deep deterministic policy gradient learning.
It will be understood that the invention has been described in terms of several embodiments, and that various changes and equivalent substitutions may be made to these features and embodiments by those skilled in the art without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from its essential scope. Therefore, it is intended that the invention not be limited to the particular embodiments disclosed, but that the invention include all embodiments falling within the scope of the appended claims.

Claims (5)

1. A federated deep deterministic policy gradient learning collaborative caching method in a fog radio access network, characterized by comprising the following steps:
Step 1, calculate the global content popularity $P(0)$ according to the Zipf distribution; initialize the total cache content state $s(0)$, the single-training-period step length $l$, the total number of model periods $T$, and the network parameters of the model, including the parameter $\theta^Q$ of the online Q-value network $Q(s,a|\theta^Q)$, the parameter $\theta^\mu$ of the online policy network $\mu(s|\theta^\mu)$, the parameter $\theta^{Q'}$ of the target Q-value network $Q'(s,a|\theta^{Q'})$, and the parameter $\theta^{\mu'}$ of the target policy network $\mu'(s|\theta^{\mu'})$, where $s$ denotes the state input to the network and $a$ denotes the action-selection input to the network; the initialized target Q-value network parameter $\theta^{Q'}$ equals the initialized online Q-value network parameter $\theta^Q$, and the initialized target policy network parameter $\theta^{\mu'}$ equals $\theta^\mu$;
Step 2, selecting a popularity estimation algorithm to calculate the global content popularity P (t) of the time slot t, whereinP f (t) is popularity of content f; each base station is used as a cache node, content request information of a user is collected, and a local online policy network mu (s|theta μ ) Generating action choices a (t), i.e. a (t) =μ (s (t) |θ, from the current total cache state s (t) μ ) Wherein s (t) = { s 1 (t),...,s n (t),...,s N (t) }, N representing the total number of cache nodes in the fog network; based on the local content caching state s (t) set of all the base stations, making action selection a (t) and obtaining the next state s (t+1);
Step 3, based on the action selection made by the cache node in step 2 and the content popularity in the period, calculate the users' average content request delays $D_{F\text{-}U}(t)$, $D_{F\text{-}F\text{-}U}(t)$ and $D_{C\text{-}F\text{-}U}(t)$, where $D_{F\text{-}U}(t)$ denotes the request delay incurred when the user obtains its requested content directly from the local cache node, $D_{F\text{-}F\text{-}U}(t)$ denotes the request delay incurred when the user obtains its requested content from an adjacent cache node, and $D_{C\text{-}F\text{-}U}(t)$ denotes the request delay incurred when the user must obtain its requested content from the cloud server;
Step 4, calculate the reward value $r(t)$ of the state-action pair under each of the content acquisition modes in step 3;
Step 5, store the transition group in an experience replay pool, randomly sample transition groups from the experience replay pool to update the network parameters, and enter the next time slot after the update;
the step 5 specifically comprises the following steps:
Step 5.1, each node composes the state $s(t)$ and action $a(t)$ from step 2, the next state $s(t+1)$, and the reward $r(t)$ obtained in step 4 into a transition group, i.e. $\{s(t), a(t), r(t), s(t+1)\}$, and stores this transition group in the experience replay pool $\varepsilon$ of the node;
Step 5.2, randomly sample $N$ transition groups $\{s(i), a(i), r(i), s(i+1)\}$ from the experience replay pool $\varepsilon$ and calculate the loss function $L$ of the Q-value network as:
$$L = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - Q(s(i), a(i)|\theta^Q)\big)^2,$$
where $y_i$ is an intermediate quantity of the loss function given by:
$$y_i = r(i) + \gamma Q'\big(s(i+1), \mu'(s(i+1)|\theta^{\mu'})\,\big|\,\theta^{Q'}\big);$$
by minimizing the loss function $L$ of the Q-value network, the online Q-value network parameter $\theta^Q$ is updated;
Step 5.3, calculate the policy network objective function; according to the sampled transition groups, the gradient of the approximate policy network objective function is calculated by the Monte Carlo method as:
$$\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_{i=1}^{N}\nabla_a Q(s, a|\theta^Q)\big|_{s=s(i),\,a=\mu(s(i))}\,\nabla_{\theta^\mu}\mu(s|\theta^\mu)\big|_{s=s(i)};$$
this gradient is used to update the online policy network parameter $\theta^\mu$;
Step 5.4, update the target Q-value network parameter $\theta^{Q'}$ according to the online Q-value network parameter $\theta^Q$ as follows:
$$\theta^{Q'} \leftarrow \tau\theta^Q + (1-\tau)\theta^{Q'};$$
update the target policy network parameter $\theta^{\mu'}$ according to the online policy network parameter $\theta^\mu$ as follows:
$$\theta^{\mu'} \leftarrow \tau\theta^\mu + (1-\tau)\theta^{\mu'};$$
Step 5.5, enter the next time slot and let $s(t) = s(t+1)$;
Step 6, after a training period ends, upload the model network parameters of each cache node to the cloud, generate the global network parameters at the cloud, distribute them to each node, and enter the next training period.
2. The federated deep deterministic policy gradient learning collaborative caching method according to claim 1, wherein the step 2 specifically comprises the following steps:
Step 2.1, $s_n(t)$ is the state space of cache node $n$ at the $t$-th time slot, $n_c$ denotes the cache index of content $c$ in cache node $n$, and $F$ denotes the total number of contents in the content library;
Step 2.2, if cache node $n$ receives a content request in time slot $t$, denote that content as $f$, and denote the most popular content not cached by the node in time slot $t$ as $f'$; the node executes a cache replacement action according to $a(t)$, with $C$ denoting the cache capacity of the node; the replacement action covers three cases: if $f$ is already cached at the node, $a(t) = c$ with $c \neq C+1$ means that node $n$ replaces its cached content $n_c$ with $f'$; if $f$ is not cached at the node, $a(t) = c$ with $c \neq C+1$ means that node $n$ replaces its cached content $n_c$ with $f$; and $a(t) = C+1$ means that node $n$ does not replace any cached content in the $t$-th time slot;
Step 2.3, after cache node $n$ completes the cache replacement of step 2.2, update the local state space of cache node $n$ and sort the cached content indices in the state space in descending order of popularity to obtain the state space $s_n(t+1)$ of the next time slot $t+1$; then integrate the state spaces of all nodes to obtain the total new state space $s(t+1)$.
3. The federated deep deterministic policy gradient learning collaborative caching method according to claim 2, wherein the step 3 specifically comprises the following steps:
Step 3.1, node $n$ receives the content $f$ requested by the user; if content $f$ is cached in the local node, the local node sends content $f$ directly to the user, and the resulting content request delay is $d_{n1}$, where $d_{n1}$ denotes the time required to transmit the content from the local node to the user; the average local request delay $D_{F\text{-}U}(t)$ of all nodes in the model at time slot $t$ can then be calculated as:
where $N$ denotes the number of cache nodes and $c_{f,n}(t) = 1$ indicates that content $f$ is cached in node $n$;
Step 3.2, if the local node does not cache content $f$ and an adjacent cache node stores content $f$, the user obtains its requested content $f$ from the adjacent cache node, and the resulting content request delay is $d_{n1} + d_{n2}$, where $d_{n2}$ denotes the time required to transmit content between two adjacent nodes; the average cooperation request delay $D_{F\text{-}F\text{-}U}(t)$ of all nodes in the model at time slot $t$ can then be calculated as:
Step 3.3, if neither the local node nor the adjacent cache nodes store content $f$, the user obtains the requested content $f$ from the cloud server, and the resulting content request delay is $d_{n1} + d_{n3}$, where $d_{n3}$ denotes the transmission time required to send the content from the cloud to the cache node; the average cloud request delay $D_{C\text{-}F\text{-}U}(t)$ of all nodes in the model at time slot $t$ can then be calculated as:
Step 3.4, the total average request delay of all nodes in the model at time slot $t$ is calculated as:
$$D_{total}(t) = D_{F\text{-}U}(t) + D_{F\text{-}F\text{-}U}(t) + D_{C\text{-}F\text{-}U}(t).$$
4. The federated deep deterministic policy gradient learning collaborative caching method according to claim 3, wherein the step 4 specifically comprises the following steps:
Step 4.1, the local node $n$ receives the content $f$ requested by the user; if content $f$ is cached in the local node, the reward value of the state-action pair in time slot $t$ is:
where $\lambda_1$ is a parameter of the reward function;
Step 4.2, if the local node $n$ does not cache content $f$ and an adjacent cache node stores content $f$, the user obtains its requested content $f$ from the adjacent cache node, and the reward value of the state-action pair in time slot $t$ is:
where $\lambda_2$ is a parameter of the reward function;
Step 4.3, if neither the local node nor the adjacent cache nodes store content $f$, the user obtains the requested content $f$ from the cloud server, and the reward value of the state-action pair in time slot $t$ is:
where $\lambda_3$ is a parameter of the reward function; $\lambda_1 + \lambda_2 + \lambda_3 = 1$ and $\lambda_1 < \lambda_2 \ll \lambda_3$.
5. The federated deep deterministic policy gradient learning collaborative caching method according to claim 4, wherein in the step 6 the network parameters of the cache nodes are jointly updated, specifically comprising the following steps:
Step 6.1, after one period of training, each cache node uploads its network parameters $\theta_n(t_l)$ to the cloud;
Step 6.2, the cloud calculates and updates the network parameters $\theta_G(l)$ of the global model:
$$\theta_G(l) = \sum_{n=1}^{N}\frac{|D_n|}{\sum_{m=1}^{N}|D_m|}\,\theta_n(t_l),$$
where $D_n$ is the local data set of cache node $n$;
Step 6.3, the cloud server sends the global model network parameters to each cache node and uses them as the initialization parameters for the next period of training.
CN202111270116.3A 2021-10-29 2021-10-29 Policy-based federal reinforcement learning collaborative caching method in fog wireless access network Active CN113992770B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111270116.3A CN113992770B (en) 2021-10-29 2021-10-29 Policy-based federal reinforcement learning collaborative caching method in fog wireless access network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111270116.3A CN113992770B (en) 2021-10-29 2021-10-29 Policy-based federal reinforcement learning collaborative caching method in fog wireless access network

Publications (2)

Publication Number Publication Date
CN113992770A CN113992770A (en) 2022-01-28
CN113992770B (en) 2024-02-09

Family

ID=79744194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111270116.3A Active CN113992770B (en) 2021-10-29 2021-10-29 Policy-based federal reinforcement learning collaborative caching method in fog wireless access network

Country Status (1)

Country Link
CN (1) CN113992770B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114980127B (en) * 2022-05-18 2024-07-02 东南大学 Computing and unloading method based on federal reinforcement learning in fog wireless access network
CN115484569A (en) * 2022-08-12 2022-12-16 北京邮电大学 Cache data transmission method and device, electronic equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340277A (en) * 2020-02-19 2020-06-26 东南大学 Popularity prediction model and method based on federal learning in fog wireless access network
CN113364854A (en) * 2021-06-02 2021-09-07 东南大学 Privacy protection dynamic edge cache design method based on distributed reinforcement learning in mobile edge computing network
CN113382059A (en) * 2021-06-08 2021-09-10 东南大学 Collaborative caching method based on federal reinforcement learning in fog wireless access network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340277A (en) * 2020-02-19 2020-06-26 东南大学 Popularity prediction model and method based on federal learning in fog wireless access network
CN113364854A (en) * 2021-06-02 2021-09-07 东南大学 Privacy protection dynamic edge cache design method based on distributed reinforcement learning in mobile edge computing network
CN113382059A (en) * 2021-06-08 2021-09-10 东南大学 Collaborative caching method based on federal reinforcement learning in fog wireless access network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Min Zhang, "Cooperative Edge Caching via Federated Deep Reinforcement Learning in Fog-RANs," 2021 IEEE International Conference on Communications Workshops (ICC Workshops), full text. *

Also Published As

Publication number Publication date
CN113992770A (en) 2022-01-28

Similar Documents

Publication Publication Date Title
CN111031102B (en) Multi-user, multi-task mobile edge computing system cacheable task migration method
Yu et al. Federated learning based proactive content caching in edge computing
Yao et al. Joint content placement and storage allocation in C-RANs for IoT sensing service
CN113992770B (en) Policy-based federal reinforcement learning collaborative caching method in fog wireless access network
CN111865826B (en) Active content caching method based on federal learning
CN113115368B (en) Base station cache replacement method, system and storage medium based on deep reinforcement learning
CN113382059B (en) Collaborative caching method based on federal reinforcement learning in fog wireless access network
CN112597388B (en) Cache-enabled D2D communication joint recommendation and caching method
CN115002113A (en) Mobile base station edge computing power resource scheduling method, system and electronic equipment
CN103781115B (en) Distributed base station buffer replacing method based on transmission cost in a kind of cellular network
CN113255004A (en) Safe and efficient federal learning content caching method
CN111556511B (en) Partial opportunistic interference alignment method based on intelligent edge cache
CN108541025B (en) Wireless heterogeneous network-oriented base station and D2D common caching method
CN114863683B (en) Heterogeneous Internet of vehicles edge computing unloading scheduling method based on multi-objective optimization
Zhang et al. Two time-scale caching placement and user association in dynamic cellular networks
CN116916390A (en) Edge collaborative cache optimization method and device combining resource allocation
CN108521640A (en) A kind of content distribution method in cellular network
Shi et al. Content caching policy for 5g network based on asynchronous advantage actor-critic method
CN116723547A (en) Collaborative caching method based on localized federal reinforcement learning in fog wireless access network
CN116362345A (en) Edge caching method and system based on multi-agent reinforcement learning and federal learning
CN113766540B (en) Low-delay network content transmission method, device, electronic equipment and medium
Hua et al. On cost minimization for cache-enabled D2D networks with recommendation
Cai et al. Mobility Prediction-Based Wireless Edge Caching Using Deep Reinforcement Learning
CN112286689A (en) Cooperative shunting and storing method suitable for block chain workload certification
Oualil et al. A personalized learning scheme for internet of vehicles caching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant