CN113992770B - Policy-based federal reinforcement learning collaborative caching method in fog wireless access network - Google Patents

Policy-based federal reinforcement learning collaborative caching method in fog wireless access network

Info

Publication number
CN113992770B
CN113992770B (application CN202111270116.3A)
Authority
CN
China
Prior art keywords
content
node
cache
network
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111270116.3A
Other languages
Chinese (zh)
Other versions
CN113992770A (en)
Inventor
蒋雁翔
王宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202111270116.3A priority Critical patent/CN113992770B/en
Publication of CN113992770A publication Critical patent/CN113992770A/en
Application granted granted Critical
Publication of CN113992770B publication Critical patent/CN113992770B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/24Negotiation of communication capabilities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Computer Security & Cryptography (AREA)
  • Mobile Radio Communication Systems (AREA)

Abstract

The invention discloses a policy-based federated reinforcement learning collaborative caching method in a fog radio access network, which comprises the following steps: 1. initialize the local cache content of the nodes in the edge fog caching network, initialize the global model training period and the model weight parameters, and distribute the local model to each node; 2. each cache node shares its cache content state information with the adjacent cache nodes and the cloud server; 3. according to the user request information received in each time slot, the cache node decides among the local cache, the adjacent node caches and the cloud server to serve the user request; 4. calculate the cache hit rate and the content request delay of the user; 5. the cache node updates its local cache content and training model parameters according to the local content cache state and the user's content request information; 6. the training model weight parameters of the nodes are jointly updated. The invention reduces user request delay and protects user privacy.

Description

Policy-based federal reinforcement learning collaborative caching method in fog wireless access network
Technical Field
The invention belongs to the field of edge-network collaborative caching in mobile communication systems, and particularly relates to a policy-based federated reinforcement learning collaborative caching method in a fog radio access network.
Background
With the arrival of the 5G era, the number of mobile devices and applications has grown rapidly, and the resulting massive data places tremendous traffic pressure on wireless cellular networks. The fog radio access network is a promising approach to relieving congestion on cellular network communication links. In a fog radio access network, edge caching places popular content in fog radio access points, also referred to as cache nodes, that are closer to the user. Introducing cache nodes effectively reduces the load on the backhaul link and the content transmission delay. Because the communication resources and local storage capacity of cache nodes are limited, how to cache the most popular content is an important direction of current edge caching research.
In recent years, reinforcement learning has become an important tool for optimizing cooperative content caching in fog radio access networks. However, most reinforcement learning algorithms applied to the cooperative edge caching problem in fog radio access networks are value-based: they must evaluate the Q value of every possible state-action pair to obtain the optimal action selection, and as the dimension of the action space grows, the number of Q values to be computed grows with it, so such algorithms perform poorly on problems with large action spaces. In addition, most reinforcement learning algorithms require users to upload their own data to the cloud for training, neglecting the protection of users' sensitive data. Finally, the traditional way of training a reinforcement learning network in a fog radio access network is to place a single learning agent on the cloud for independent training, which wastes the computing resources of the individual nodes and slows convergence.
Disclosure of Invention
The invention aims to provide a policy-based federated reinforcement learning collaborative caching method in a fog radio access network, in order to solve the technical problems of high user content request delay, wasted storage and computing resources, poor suitability for high-dimensional action spaces, slow network convergence, and inadequate protection of user privacy.
In order to solve the technical problems, the specific technical scheme of the invention is as follows:
a federated deep deterministic policy gradient learning collaborative caching method in a fog radio access network comprises the following steps:
Step 1, initialize the total cache content state $s(0)$, the single-training-period step length $l$, the total number of model periods $T$, and the network parameters of the model, including the parameter $\theta^Q$ of the online Q-value network $Q(s,a|\theta^Q)$, the parameter $\theta^\mu$ of the online policy network $\mu(s|\theta^\mu)$, the parameter $\theta^{Q'}$ of the target Q-value network $Q'(s,a|\theta^{Q'})$, and the parameter $\theta^{\mu'}$ of the target policy network $\mu'(s|\theta^{\mu'})$, where $s$ denotes the state input to the network and $a$ denotes the action-selection input to the network; the initialized target Q-value network parameter $\theta^{Q'}$ equals the initialized online Q-value network parameter $\theta^Q$, and the initialized target policy network parameter $\theta^{\mu'}$ equals $\theta^\mu$;
Step 2, selecting a popularity estimation algorithm to calculate global content popularity of the time slot tDegree ofWherein the method comprises the steps ofP f (t) is popularity of content f; each base station is used as a cache node, collects content request information of users, and makes action selection based on a local content cache state set of all the base stations to acquire a next state;
Step 3, based on the action selection made by the cache node in step 2 and the content popularity in the period, calculate the users' average content request delays $D_{F\text{-}U}(t)$, $D_{F\text{-}F\text{-}U}(t)$ and $D_{C\text{-}F\text{-}U}(t)$, where $D_{F\text{-}U}(t)$ denotes the request delay incurred when the user obtains its requested content directly from the local cache node, $D_{F\text{-}F\text{-}U}(t)$ denotes the request delay incurred when the user obtains its requested content from an adjacent cache node, and $D_{C\text{-}F\text{-}U}(t)$ denotes the request delay incurred when the user must obtain its requested content from the cloud server;
Step 4, calculate the reward value of the state-action pair under each of the content acquisition modes in step 3;
Step 5, store the transition group in an experience replay pool, randomly sample transition groups from the experience replay pool to update the network parameters, and enter the next time slot after the update;
Step 6, after a training period ends, upload the model network parameters of each cache node to the cloud, generate the global network parameters at the cloud, distribute them to each node, and enter the next training period.
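As an illustration of the four networks initialized in step 1 above, the following is a minimal sketch assuming a PyTorch implementation; the class names, hidden-layer width, and the state and action dimensions are hypothetical and not taken from the patent.

```python
import copy
import torch
import torch.nn as nn

class Critic(nn.Module):
    """Online Q-value network Q(s, a | theta_Q)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))
    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

class Actor(nn.Module):
    """Online policy network mu(s | theta_mu)."""
    def __init__(self, state_dim, action_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim))
    def forward(self, s):
        return self.net(s)

# Hypothetical sizes: F = 50 contents in the library, C = 10 cache slots (+1 no-replacement action).
state_dim, action_dim = 50, 11
critic, actor = Critic(state_dim, action_dim), Actor(state_dim, action_dim)
critic_target = copy.deepcopy(critic)   # theta_Q' initialized equal to theta_Q
actor_target = copy.deepcopy(actor)     # theta_mu' initialized equal to theta_mu
```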
Further, the step 2 specifically includes the following steps:
Step 2.1, the local online policy network $\mu(s|\theta^\mu)$ generates the action selection $a(t)$ from the current total cache state $s(t)$, i.e. $a(t) = \mu(s(t)|\theta^\mu)$, where $s(t) = \{s_1(t), \ldots, s_n(t), \ldots, s_N(t)\}$, $N$ denotes the total number of cache nodes in the fog network, $s_n(t)$ is the state space of cache node $n$ at the $t$-th time slot, $n_c$ denotes the cache index of content $c$ in cache node $n$, and $F$ denotes the total number of contents in the content library;
Step 2.2, if cache node $n$ receives a content request in time slot $t$, denote that content as $f$, and denote the most popular content not cached by the node in time slot $t$ as $f'$; the node executes a cache replacement action according to $a(t)$, with $C$ denoting the cache capacity of the node; the replacement action covers three cases: if $f$ is already cached at the node, $a(t) = c$ with $c \neq C+1$ means that node $n$ replaces its cached content $n_c$ with $f'$; if $f$ is not cached at the node, $a(t) = c$ with $c \neq C+1$ means that node $n$ replaces its cached content $n_c$ with $f$; and $a(t) = C+1$ means that node $n$ does not replace any cached content in the $t$-th time slot;
Step 2.3, after cache node $n$ completes the cache replacement of step 2.2, update the local state space of cache node $n$ and sort the cached content indices in the state space in descending order of popularity to obtain the state space $s_n(t+1)$ of the next time slot $t+1$; then integrate the state spaces of all nodes to obtain the total new state space $s(t+1)$.
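To make the replacement rule of steps 2.2 and 2.3 concrete, the following is an illustrative sketch; representing a node's cache as a Python list of content indices and the popularity array are assumptions for illustration only.

```python
def apply_cache_action(cache, a, f, f_prime, popularity):
    """Apply the step-2.2 replacement at one node.

    cache:      list of C content indices currently cached at node n
    a:          action in {1, ..., C+1}; a == C+1 means no replacement
    f:          the requested content; f_prime: the most popular uncached content
    popularity: popularity[c-1] is the popularity of content c
    """
    C = len(cache)
    if a <= C:
        # Slot a receives f' if f is already cached locally, otherwise it receives f.
        cache[a - 1] = f_prime if f in cache else f
    # Step 2.3: re-sort the cached indices by descending popularity to form s_n(t+1).
    cache.sort(key=lambda c: popularity[c - 1], reverse=True)
    return cache
```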
Further, the step 3 specifically includes the following steps:
Step 3.1, node $n$ receives the content $f$ requested by the user; if content $f$ is cached in the local node, the local node sends content $f$ directly to the user, and the resulting content request delay is $d_{n1}$, where $d_{n1}$ denotes the time required to transmit the content from the local node to the user; the average local request delay $D_{F\text{-}U}(t)$ of all nodes in the model at time slot $t$ can then be calculated as:
where $N$ denotes the number of cache nodes and $c_{f,n}(t) = 1$ indicates that content $f$ is cached in node $n$;
Step 3.2, if the local node does not cache content $f$ and an adjacent cache node stores content $f$, the user obtains its requested content $f$ from the adjacent cache node, and the resulting content request delay is $d_{n1} + d_{n2}$, where $d_{n2}$ denotes the time required to transmit content between two adjacent nodes; the average cooperation request delay $D_{F\text{-}F\text{-}U}(t)$ of all nodes in the model at time slot $t$ can then be calculated as:
Step 3.3, if neither the local node nor the adjacent cache nodes store content $f$, the user obtains the requested content $f$ from the cloud server, and the resulting content request delay is $d_{n1} + d_{n3}$, where $d_{n3}$ denotes the transmission time required to send the content from the cloud to the cache node; the average cloud request delay $D_{C\text{-}F\text{-}U}(t)$ of all nodes in the model at time slot $t$ can then be calculated as:
Step 3.4, the total average request delay of all nodes in the model at time slot $t$ is calculated as:
$$D_{total}(t) = D_{F\text{-}U}(t) + D_{F\text{-}F\text{-}U}(t) + D_{C\text{-}F\text{-}U}(t).$$
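The following sketch illustrates how the three per-slot delay averages of step 3 could be computed from per-node request counts and hit indicators; the link delays d1, d2, d3 and the request matrix are placeholders, and the aggregation shown is an assumed form rather than the patent's exact expressions.

```python
import numpy as np

def average_delays(requests, local_hit, neighbor_hit, d1=5.0, d2=20.0, d3=100.0):
    """requests[n, f]: number of requests for content f at node n in slot t.
    local_hit / neighbor_hit: boolean arrays of the same shape indicating where
    content f is available locally or at an adjacent node.
    Returns the per-node averages (D_F-U, D_F-F-U, D_C-F-U)."""
    N = requests.shape[0]
    cloud = ~local_hit & ~neighbor_hit
    D_local = (requests * local_hit * d1).sum() / N
    D_coop = (requests * (~local_hit & neighbor_hit) * (d1 + d2)).sum() / N
    D_cloud = (requests * cloud * (d1 + d3)).sum() / N
    return D_local, D_coop, D_cloud

# Step 3.4: D_total = D_F-U + D_F-F-U + D_C-F-U.
```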
Further, the step 4 specifically includes the following steps:
Step 4.1, the local node $n$ receives the content $f$ requested by the user; if content $f$ is cached in the local node, the reward value of the state-action pair in time slot $t$ is:
where $\lambda_1$ is a reward-function parameter set according to the actual application scenario;
Step 4.2, if the local node $n$ does not cache content $f$ and an adjacent cache node stores content $f$, the user obtains its requested content $f$ from the adjacent cache node, and the reward value of the state-action pair in time slot $t$ is:
where $\lambda_2$ is a reward-function parameter set according to the actual application scenario;
Step 4.3, if neither the local node nor the adjacent cache nodes store content $f$, the user obtains the requested content $f$ from the cloud server, and the reward value of the state-action pair in time slot $t$ is:
where $\lambda_3$ is a reward-function parameter set according to the actual application scenario; furthermore, $\lambda_1 + \lambda_2 + \lambda_3 = 1$ and $\lambda_1 < \lambda_2 \ll \lambda_3$.
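The exact reward expressions are not reproduced here, so the sketch below simply penalizes the delay of whichever acquisition mode served the request, weighted by the corresponding lambda; the numerical lambda values are assumptions chosen to satisfy lambda1 < lambda2 << lambda3 and to sum to 1.

```python
def slot_reward(source, d1=5.0, d2=20.0, d3=100.0, lambdas=(0.05, 0.15, 0.80)):
    """Illustrative reward r(t): the farther the content travels, the larger the penalty.
    source: 'local', 'neighbor' or 'cloud'."""
    lam1, lam2, lam3 = lambdas  # assumed values with lam1 < lam2 << lam3, summing to 1
    if source == 'local':
        return -lam1 * d1
    if source == 'neighbor':
        return -lam2 * (d1 + d2)
    return -lam3 * (d1 + d3)
```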
Further, the step 5 specifically includes the following steps:
Step 5.1, each node composes the state $s(t)$, the action $a(t)$, the next state $s(t+1)$ and the reward $r(t)$ obtained in step 4 into a transition group, i.e. $\{s(t), a(t), r(t), s(t+1)\}$, and stores this transition group in the experience replay pool $\varepsilon$ of the node;
Step 5.2, randomly sample $N$ transition groups $\{s(i), a(i), r(i), s(i+1)\}$ from the experience replay pool $\varepsilon$ and calculate the loss function $L$ of the Q-value network as:
$$L = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - Q(s(i), a(i)|\theta^Q)\big)^2,$$
where $y_i$ is an intermediate quantity of the loss function given by:
$$y_i = r(i) + \gamma Q'\big(s(i+1), \mu'(s(i+1)|\theta^{\mu'})\,\big|\,\theta^{Q'}\big);$$
by minimizing the loss function $L$ of the Q-value network, the online Q-value network parameter $\theta^Q$ is updated;
Step 5.3, calculate the policy network objective function; according to the sampled transition groups, the gradient of the approximate policy network objective function is calculated by the Monte Carlo method as:
$$\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_{i=1}^{N}\nabla_a Q(s, a|\theta^Q)\big|_{s=s(i),\,a=\mu(s(i))}\,\nabla_{\theta^\mu}\mu(s|\theta^\mu)\big|_{s=s(i)};$$
this gradient is used to update the online policy network parameter $\theta^\mu$;
Step 5.4, update the target Q-value network parameter $\theta^{Q'}$ according to the online Q-value network parameter $\theta^Q$ as follows:
$$\theta^{Q'} \leftarrow \tau\theta^Q + (1-\tau)\theta^{Q'};$$
update the target policy network parameter $\theta^{\mu'}$ according to the online policy network parameter $\theta^\mu$ as follows:
$$\theta^{\mu'} \leftarrow \tau\theta^\mu + (1-\tau)\theta^{\mu'};$$
Step 5.5, enter the next time slot and let $s(t) = s(t+1)$.
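A condensed sketch of the step-5 update is given below, assuming a PyTorch implementation that reuses the Critic and Actor classes sketched above; the batch size, discount factor gamma and soft-update rate tau are assumed values.

```python
import random
import torch
import torch.nn.functional as F

def ddpg_update(replay, actor, critic, actor_t, critic_t, actor_opt, critic_opt,
                batch=64, gamma=0.99, tau=0.005):
    # Step 5.2: sample transition groups {s(i), a(i), r(i), s(i+1)} from the replay pool.
    s, a, r, s2 = map(torch.stack, zip(*random.sample(replay, batch)))
    with torch.no_grad():
        y = r.unsqueeze(-1) + gamma * critic_t(s2, actor_t(s2))   # y_i
    critic_loss = F.mse_loss(critic(s, a), y)                      # loss L
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # Step 5.3: sampled policy gradient, implemented by ascending Q(s, mu(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # Step 5.4: soft target updates theta' <- tau*theta + (1 - tau)*theta'.
    for net, tgt in ((critic, critic_t), (actor, actor_t)):
        for p, p_t in zip(net.parameters(), tgt.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)
```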
Further, in step 6 the network parameters of the cache nodes are jointly updated, which specifically includes the following steps:
Step 6.1, after one period of training, each cache node uploads its network parameters $\theta_n(t_l)$ to the cloud;
Step 6.2, the cloud calculates and updates the network parameters $\theta_G(l)$ of the global model:
$$\theta_G(l) = \sum_{n=1}^{N}\frac{|D_n|}{\sum_{m=1}^{N}|D_m|}\,\theta_n(t_l),$$
where $D_n$ is the local data set of cache node $n$;
Step 6.3, the cloud server sends the global model network parameters to each cache node and uses them as the initialization parameters for the next period of training.
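The aggregation of step 6.2 is a data-size-weighted average of the node parameters in the style of federated averaging; the sketch below assumes each node's parameters arrive as a PyTorch state_dict and that |D_n| is the size of its local data set.

```python
import torch

def federated_average(node_state_dicts, dataset_sizes):
    """theta_G = sum_n (|D_n| / sum_m |D_m|) * theta_n, applied key by key."""
    total = float(sum(dataset_sizes))
    global_sd = {}
    for key in node_state_dicts[0]:
        global_sd[key] = sum((size / total) * sd[key]
                             for sd, size in zip(node_state_dicts, dataset_sizes))
    return global_sd  # broadcast back to every node as the next-period initialization
```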
The policy-based federated reinforcement learning collaborative caching method in a fog radio access network has the following advantages:
1. The invention adopts a policy-based reinforcement learning algorithm, deep deterministic policy gradient learning, which combines the policy-based actor-critic framework with the asynchronous target-update idea of the deep Q-learning algorithm; action selections are generated directly by the policy network, which adapts well to problems with high-dimensional action spaces, and the asynchronous update strategy ensures the convergence of the network.
2. By adopting horizontal federated learning, the invention aggregates the local reinforcement learning network parameters of all cache nodes into the global network parameters at the cloud server, which strengthens cache cooperation among the cache nodes, makes effective use of storage and computing resources, and improves the network convergence speed through parallel training at each node.
3. In the global model training process, model parameters are transmitted instead of user data, so the user data always remains in the respective cache nodes and is never sent to the cloud, which well protects the privacy of user data.
Drawings
Fig. 1 is a schematic flow diagram of the policy-based federated reinforcement learning collaborative caching method in a fog radio access network according to the present invention;
Fig. 2 is a graph of simulation results comparing the content acquisition delay of the present invention with baseline edge caching policies.
Detailed Description
For a better understanding of the purposes, structure and functions of the present invention, the policy-based federated reinforcement learning collaborative caching method in a fog radio access network is described in further detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of the policy-based federated reinforcement learning collaborative caching method in a fog radio access network according to the present invention, which includes the following steps:
S1: calculate the global content popularity $P(0)$ according to the Mandelbrot-Zipf distribution, and initialize the total cache content state $s(0)$, the single-training-period step length $l$, the total number of model periods $T$, and the network parameters of the model, including the parameter $\theta^Q$ of the online Q-value network $Q(s,a|\theta^Q)$, the parameter $\theta^\mu$ of the online policy network $\mu(s|\theta^\mu)$, the parameter $\theta^{Q'}$ of the target Q-value network $Q'(s,a|\theta^{Q'})$, and the parameter $\theta^{\mu'}$ of the target policy network $\mu'(s|\theta^{\mu'})$, where $s$ denotes the state input to the network and $a$ denotes the action-selection input to the network; the initialized target Q-value network parameter $\theta^{Q'}$ equals the initialized online Q-value network parameter $\theta^Q$, and the initialized target policy network parameter $\theta^{\mu'}$ equals $\theta^\mu$.
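Since S1 computes the initial popularity from the Mandelbrot-Zipf distribution, the following small sketch illustrates that computation; the skewness and plateau parameters are assumed values rather than values from the patent.

```python
import numpy as np

def mandelbrot_zipf_popularity(num_contents, alpha=0.8, q=5.0):
    """P_f proportional to 1 / (rank_f + q)^alpha, normalized to sum to 1."""
    ranks = np.arange(1, num_contents + 1)
    weights = 1.0 / (ranks + q) ** alpha
    return weights / weights.sum()

# P[f-1] is the popularity of content f in a library of 50 contents (hypothetical size).
P = mandelbrot_zipf_popularity(num_contents=50)
```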
S2: select a popularity estimation algorithm to calculate the global content popularity $P(t) = \{P_1(t), \ldots, P_f(t), \ldots, P_F(t)\}$ of time slot $t$, where $P_f(t)$ is the popularity of content $f$. Each base station acts as a cache node, collects the content request information of users, and makes an action selection based on the set of local content cache states of all base stations to obtain the next state;
s2-1: local online policy network μ (s|θ μ ) Generating action choices a (t), i.e. a (t) =μ (s (t) |θ, from the current total cache state s (t) μ ) Wherein s (t) = { s 1 (t),...,s n (t),...,s N (t) }, N represents the total number of cache nodes in the fog network,to buffer the state space of node n at the t-th time slot, n c Representing the cache index of content c in cache node n, and F represents the total number of content in the content library.
S2-2: if cache node $n$ receives a content request in time slot $t$, denote that content as $f$, and denote the most popular content not cached by the node in time slot $t$ as $f'$. The node executes a cache replacement action according to $a(t)$, with $C$ denoting the cache capacity of the node. The replacement action covers three cases: if $f$ is already cached at the node, $a(t) = c$ with $c \neq C+1$ means that node $n$ replaces its cached content $n_c$ with $f'$; if $f$ is not cached at the node, $a(t) = c$ with $c \neq C+1$ means that node $n$ replaces its cached content $n_c$ with $f$; and $a(t) = C+1$ means that node $n$ does not replace any cached content in the $t$-th time slot.
S2-3: after cache node $n$ completes the cache replacement of step S2-2, update the local state space of cache node $n$ and sort the cached content indices in the state space in descending order of popularity to obtain the state space $s_n(t+1)$ of the next time slot $t+1$; then integrate the state spaces of all nodes to obtain the total new state space $s(t+1)$.
S3: based on the action selection made by the cache node in step 2 and the content popularity in the period, calculate the users' average content request delays $D_{F\text{-}U}(t)$, $D_{F\text{-}F\text{-}U}(t)$ and $D_{C\text{-}F\text{-}U}(t)$, where $D_{F\text{-}U}(t)$ denotes the request delay incurred when the user obtains its requested content directly from the local cache node, $D_{F\text{-}F\text{-}U}(t)$ denotes the request delay incurred when the user obtains its requested content from an adjacent cache node, and $D_{C\text{-}F\text{-}U}(t)$ denotes the request delay incurred when the user must obtain its requested content from the cloud server. This specifically comprises the following steps:
S3-1: node $n$ receives the content $f$ requested by the user; if content $f$ is cached in the local node, the local node sends content $f$ directly to the user, and the resulting content request delay is $d_{n1}$, where $d_{n1}$ denotes the time required to send the content from the local node to the user. The average local request delay $D_{F\text{-}U}(t)$ of all nodes in the model at time slot $t$ can then be calculated as:
where $N$ denotes the number of cache nodes and $c_{f,n}(t) = 1$ indicates that content $f$ is cached in node $n$.
S3-2: if the local node does not cache content $f$ and an adjacent cache node stores content $f$, the user obtains its requested content $f$ from the adjacent cache node, and the resulting content request delay is $d_{n1} + d_{n2}$, where $d_{n2}$ denotes the time required to transfer content between two adjacent nodes. The average cooperation request delay $D_{F\text{-}F\text{-}U}(t)$ of all nodes in the model at time slot $t$ can then be calculated as:
S3-3: if neither the local node nor the adjacent cache nodes store content $f$, the user obtains the requested content $f$ from the cloud server, and the resulting content request delay is $d_{n1} + d_{n3}$, where $d_{n3}$ denotes the transmission time required to send the content from the cloud to the cache node. The average cloud request delay $D_{C\text{-}F\text{-}U}(t)$ of all nodes in the model at time slot $t$ can then be calculated as:
S3-4: the total average request delay of all nodes in the model at time slot $t$ is calculated as:
$$D_{total}(t) = D_{F\text{-}U}(t) + D_{F\text{-}F\text{-}U}(t) + D_{C\text{-}F\text{-}U}(t).$$
S4: calculate the reward value of the state-action pair under each of the content acquisition modes discussed in step 3. This specifically comprises the following steps:
S4-1: the local node $n$ receives the content $f$ requested by the user; if content $f$ is cached in the local node, the reward value of the state-action pair in time slot $t$ is:
where $\lambda_1$ is a reward-function parameter set according to the actual application scenario.
S4-2: if the local node $n$ does not cache content $f$ and an adjacent cache node stores content $f$, the user obtains its requested content $f$ from the adjacent cache node, and the reward value of the state-action pair in time slot $t$ is:
where $\lambda_2$ is a reward-function parameter set according to the actual application scenario.
S4-3: if the local node and the adjacent cache node do not store the content f, the user will get the request content f from the cloud server, and the reward value of the action state pair in the time slot t is:
wherein lambda is 3 And setting parameters of the reward function according to actual application scenes. Furthermore lambda 123 =1,λ 1 <λ 2 <<λ 3
S5: and storing the conversion group into an experience playback pool, randomly sampling the conversion group from the experience playback pool for updating network parameters, and entering the next time slot after updating. The method specifically comprises the following steps:
S5-1: each node composes the state $s(t)$ and action $a(t)$ from step 2, the next state $s(t+1)$, and the reward $r(t)$ acquired in step 4 into a "state-action-reward-next state" transition group, i.e. $\{s(t), a(t), r(t), s(t+1)\}$. This transition group is stored in the experience replay pool $\varepsilon$ of each node.
S5-2: randomly sample $N$ transition groups $\{s(i), a(i), r(i), s(i+1)\}$ from the experience replay pool $\varepsilon$ and calculate the loss function $L$ of the Q-value network as:
$$L = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - Q(s(i), a(i)|\theta^Q)\big)^2,$$
where $y_i$ is an intermediate quantity of the loss function given by:
$$y_i = r(i) + \gamma Q'\big(s(i+1), \mu'(s(i+1)|\theta^{\mu'})\,\big|\,\theta^{Q'}\big).$$
By minimizing the loss function $L$ of the Q-value network, the online Q-value network parameter $\theta^Q$ is updated.
S5-3: according to the sampled transition groups, the gradient of the approximate policy network objective function is calculated by the Monte Carlo method as:
$$\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_{i=1}^{N}\nabla_a Q(s, a|\theta^Q)\big|_{s=s(i),\,a=\mu(s(i))}\,\nabla_{\theta^\mu}\mu(s|\theta^\mu)\big|_{s=s(i)}.$$
This gradient is used to update the online policy network parameter $\theta^\mu$.
S5-4: update the target Q-value network parameter $\theta^{Q'}$ according to the online Q-value network parameter $\theta^Q$ as follows:
$$\theta^{Q'} \leftarrow \tau\theta^Q + (1-\tau)\theta^{Q'};$$
update the target policy network parameter $\theta^{\mu'}$ according to the online policy network parameter $\theta^\mu$ as follows:
$$\theta^{\mu'} \leftarrow \tau\theta^\mu + (1-\tau)\theta^{\mu'}.$$
S5-5: enter the next time slot and let $s(t) = s(t+1)$.
S6: after a training period ends, each cache node uploads its model network parameters $\theta_n(t_l)$ to the cloud; the global network parameters are generated at the cloud, distributed to each node, and the next training period begins. This specifically comprises the following steps:
S6-1: after one period of training, each cache node uploads its network parameters to the cloud.
S6-2: the cloud calculates and updates the network parameters $\theta_G(l)$ of the global model:
$$\theta_G(l) = \sum_{n=1}^{N}\frac{|D_n|}{\sum_{m=1}^{N}|D_m|}\,\theta_n(t_l),$$
where $D_n$ is the local data set of cache node $n$.
S6-3: the cloud server sends the global model network parameters to each cache node and uses them as the initialization parameters for the next period of training.
As can be seen from the simulation results of Fig. 2, compared with four conventional caching methods, namely the least recently used (LRU) caching method, the least frequently used (LFU) caching method, deep deterministic policy gradient learning and deep Q-learning, the federated deep deterministic policy gradient learning method of the present invention achieves significantly better delay reduction. The introduction of federated learning gives the algorithm a faster convergence speed and more stable performance than deep deterministic policy gradient learning.
It will be understood that the invention has been described in terms of several embodiments, and that various changes and equivalent substitutions may be made to these features and embodiments by those skilled in the art without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from its essential scope. Therefore, it is intended that the invention not be limited to the particular embodiments disclosed, but that the invention include all embodiments falling within the scope of the appended claims.

Claims (5)

1. A federated deep deterministic policy gradient learning collaborative caching method in a fog radio access network, characterized by comprising the following steps:
Step 1, calculate the global content popularity $P(0)$ according to the Zipf distribution; initialize the total cache content state $s(0)$, the single-training-period step length $l$, the total number of model periods $T$, and the network parameters of the model, including the parameter $\theta^Q$ of the online Q-value network $Q(s,a|\theta^Q)$, the parameter $\theta^\mu$ of the online policy network $\mu(s|\theta^\mu)$, the parameter $\theta^{Q'}$ of the target Q-value network $Q'(s,a|\theta^{Q'})$, and the parameter $\theta^{\mu'}$ of the target policy network $\mu'(s|\theta^{\mu'})$, where $s$ denotes the state input to the network and $a$ denotes the action-selection input to the network; the initialized target Q-value network parameter $\theta^{Q'}$ equals the initialized online Q-value network parameter $\theta^Q$, and the initialized target policy network parameter $\theta^{\mu'}$ equals $\theta^\mu$;
Step 2, selecting a popularity estimation algorithm to calculate the global content popularity P (t) of the time slot t, whereinP f (t) is popularity of content f; each base station is used as a cache node, content request information of a user is collected, and a local online policy network mu (s|theta μ ) Generating action choices a (t), i.e. a (t) =μ (s (t) |θ, from the current total cache state s (t) μ ) Wherein s (t) = { s 1 (t),...,s n (t),...,s N (t) }, N representing the total number of cache nodes in the fog network; based on the local content caching state s (t) set of all the base stations, making action selection a (t) and obtaining the next state s (t+1);
Step 3, based on the action selection made by the cache node in step 2 and the content popularity in the period, calculate the users' average content request delays $D_{F\text{-}U}(t)$, $D_{F\text{-}F\text{-}U}(t)$ and $D_{C\text{-}F\text{-}U}(t)$, where $D_{F\text{-}U}(t)$ denotes the request delay incurred when the user obtains its requested content directly from the local cache node, $D_{F\text{-}F\text{-}U}(t)$ denotes the request delay incurred when the user obtains its requested content from an adjacent cache node, and $D_{C\text{-}F\text{-}U}(t)$ denotes the request delay incurred when the user must obtain its requested content from the cloud server;
Step 4, calculate the reward value $r(t)$ of the state-action pair under each of the content acquisition modes in step 3;
Step 5, store the transition group in an experience replay pool, randomly sample transition groups from the experience replay pool to update the network parameters, and enter the next time slot after the update;
the step 5 specifically comprises the following steps:
Step 5.1, each node composes the state $s(t)$ and action $a(t)$ from step 2, the next state $s(t+1)$, and the reward $r(t)$ obtained in step 4 into a transition group, i.e. $\{s(t), a(t), r(t), s(t+1)\}$, and stores this transition group in the experience replay pool $\varepsilon$ of the node;
Step 5.2, randomly sample $N$ transition groups $\{s(i), a(i), r(i), s(i+1)\}$ from the experience replay pool $\varepsilon$ and calculate the loss function $L$ of the Q-value network as:
$$L = \frac{1}{N}\sum_{i=1}^{N}\big(y_i - Q(s(i), a(i)|\theta^Q)\big)^2,$$
where $y_i$ is an intermediate quantity of the loss function given by:
$$y_i = r(i) + \gamma Q'\big(s(i+1), \mu'(s(i+1)|\theta^{\mu'})\,\big|\,\theta^{Q'}\big);$$
by minimizing the loss function $L$ of the Q-value network, the online Q-value network parameter $\theta^Q$ is updated;
Step 5.3, calculate the policy network objective function; according to the sampled transition groups, the gradient of the approximate policy network objective function is calculated by the Monte Carlo method as:
$$\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_{i=1}^{N}\nabla_a Q(s, a|\theta^Q)\big|_{s=s(i),\,a=\mu(s(i))}\,\nabla_{\theta^\mu}\mu(s|\theta^\mu)\big|_{s=s(i)};$$
this gradient is used to update the online policy network parameter $\theta^\mu$;
Step 5.4, update the target Q-value network parameter $\theta^{Q'}$ according to the online Q-value network parameter $\theta^Q$ as follows:
$$\theta^{Q'} \leftarrow \tau\theta^Q + (1-\tau)\theta^{Q'};$$
update the target policy network parameter $\theta^{\mu'}$ according to the online policy network parameter $\theta^\mu$ as follows:
$$\theta^{\mu'} \leftarrow \tau\theta^\mu + (1-\tau)\theta^{\mu'};$$
Step 5.5, enter the next time slot and let $s(t) = s(t+1)$;
Step 6, after a training period ends, upload the model network parameters of each cache node to the cloud, generate the global network parameters at the cloud, distribute them to each node, and enter the next training period.
2. The federated deep deterministic policy gradient learning collaborative caching method according to claim 1, wherein the step 2 specifically comprises the following steps:
Step 2.1, $s_n(t)$ is the state space of cache node $n$ at the $t$-th time slot, $n_c$ denotes the cache index of content $c$ in cache node $n$, and $F$ denotes the total number of contents in the content library;
Step 2.2, if cache node $n$ receives a content request in time slot $t$, denote that content as $f$, and denote the most popular content not cached by the node in time slot $t$ as $f'$; the node executes a cache replacement action according to $a(t)$, with $C$ denoting the cache capacity of the node; the replacement action covers three cases: if $f$ is already cached at the node, $a(t) = c$ with $c \neq C+1$ means that node $n$ replaces its cached content $n_c$ with $f'$; if $f$ is not cached at the node, $a(t) = c$ with $c \neq C+1$ means that node $n$ replaces its cached content $n_c$ with $f$; and $a(t) = C+1$ means that node $n$ does not replace any cached content in the $t$-th time slot;
Step 2.3, after cache node $n$ completes the cache replacement of step 2.2, update the local state space of cache node $n$ and sort the cached content indices in the state space in descending order of popularity to obtain the state space $s_n(t+1)$ of the next time slot $t+1$; then integrate the state spaces of all nodes to obtain the total new state space $s(t+1)$.
3. The federated deep deterministic policy gradient learning collaborative caching method according to claim 2, wherein the step 3 specifically comprises the following steps:
Step 3.1, node $n$ receives the content $f$ requested by the user; if content $f$ is cached in the local node, the local node sends content $f$ directly to the user, and the resulting content request delay is $d_{n1}$, where $d_{n1}$ denotes the time required to transmit the content from the local node to the user; the average local request delay $D_{F\text{-}U}(t)$ of all nodes in the model at time slot $t$ can then be calculated as:
where $N$ denotes the number of cache nodes and $c_{f,n}(t) = 1$ indicates that content $f$ is cached in node $n$;
Step 3.2, if the local node does not cache content $f$ and an adjacent cache node stores content $f$, the user obtains its requested content $f$ from the adjacent cache node, and the resulting content request delay is $d_{n1} + d_{n2}$, where $d_{n2}$ denotes the time required to transmit content between two adjacent nodes; the average cooperation request delay $D_{F\text{-}F\text{-}U}(t)$ of all nodes in the model at time slot $t$ can then be calculated as:
Step 3.3, if neither the local node nor the adjacent cache nodes store content $f$, the user obtains the requested content $f$ from the cloud server, and the resulting content request delay is $d_{n1} + d_{n3}$, where $d_{n3}$ denotes the transmission time required to send the content from the cloud to the cache node; the average cloud request delay $D_{C\text{-}F\text{-}U}(t)$ of all nodes in the model at time slot $t$ can then be calculated as:
Step 3.4, the total average request delay of all nodes in the model at time slot $t$ is calculated as:
$$D_{total}(t) = D_{F\text{-}U}(t) + D_{F\text{-}F\text{-}U}(t) + D_{C\text{-}F\text{-}U}(t).$$
4. The federated deep deterministic policy gradient learning collaborative caching method according to claim 3, wherein the step 4 specifically comprises the following steps:
Step 4.1, the local node $n$ receives the content $f$ requested by the user; if content $f$ is cached in the local node, the reward value of the state-action pair in time slot $t$ is:
where $\lambda_1$ is a parameter of the reward function;
Step 4.2, if the local node $n$ does not cache content $f$ and an adjacent cache node stores content $f$, the user obtains its requested content $f$ from the adjacent cache node, and the reward value of the state-action pair in time slot $t$ is:
where $\lambda_2$ is a parameter of the reward function;
Step 4.3, if neither the local node nor the adjacent cache nodes store content $f$, the user obtains the requested content $f$ from the cloud server, and the reward value of the state-action pair in time slot $t$ is:
where $\lambda_3$ is a parameter of the reward function; $\lambda_1 + \lambda_2 + \lambda_3 = 1$ and $\lambda_1 < \lambda_2 \ll \lambda_3$.
5. The federated deep deterministic policy gradient learning collaborative caching method according to claim 4, wherein in the step 6 the network parameters of the cache nodes are jointly updated, specifically comprising the following steps:
Step 6.1, after one period of training, each cache node uploads its network parameters $\theta_n(t_l)$ to the cloud;
Step 6.2, the cloud calculates and updates the network parameters $\theta_G(l)$ of the global model:
$$\theta_G(l) = \sum_{n=1}^{N}\frac{|D_n|}{\sum_{m=1}^{N}|D_m|}\,\theta_n(t_l),$$
where $D_n$ is the local data set of cache node $n$;
Step 6.3, the cloud server sends the global model network parameters to each cache node and uses them as the initialization parameters for the next period of training.
CN202111270116.3A 2021-10-29 2021-10-29 Policy-based federal reinforcement learning collaborative caching method in fog wireless access network Active CN113992770B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111270116.3A CN113992770B (en) 2021-10-29 2021-10-29 Policy-based federal reinforcement learning collaborative caching method in fog wireless access network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111270116.3A CN113992770B (en) 2021-10-29 2021-10-29 Policy-based federal reinforcement learning collaborative caching method in fog wireless access network

Publications (2)

Publication Number Publication Date
CN113992770A CN113992770A (en) 2022-01-28
CN113992770B (en) 2024-02-09

Family

ID=79744194

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111270116.3A Active CN113992770B (en) 2021-10-29 2021-10-29 Policy-based federal reinforcement learning collaborative caching method in fog wireless access network

Country Status (1)

Country Link
CN (1) CN113992770B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114980127B (en) * 2022-05-18 2024-07-02 东南大学 Computing and unloading method based on federal reinforcement learning in fog wireless access network
CN115484569A (en) * 2022-08-12 2022-12-16 北京邮电大学 Cache data transmission method and device, electronic equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340277A (en) * 2020-02-19 2020-06-26 东南大学 Popularity prediction model and method based on federal learning in fog wireless access network
CN113364854A (en) * 2021-06-02 2021-09-07 东南大学 Privacy protection dynamic edge cache design method based on distributed reinforcement learning in mobile edge computing network
CN113382059A (en) * 2021-06-08 2021-09-10 东南大学 Collaborative caching method based on federal reinforcement learning in fog wireless access network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111340277A (en) * 2020-02-19 2020-06-26 东南大学 Popularity prediction model and method based on federal learning in fog wireless access network
CN113364854A (en) * 2021-06-02 2021-09-07 东南大学 Privacy protection dynamic edge cache design method based on distributed reinforcement learning in mobile edge computing network
CN113382059A (en) * 2021-06-08 2021-09-10 东南大学 Collaborative caching method based on federal reinforcement learning in fog wireless access network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Min Zhang, "Cooperative Edge Caching via Federated Deep Reinforcement Learning in Fog-RANs," 2021 IEEE International Conference on Communications Workshops (ICC Workshops), full text. *

Also Published As

Publication number Publication date
CN113992770A (en) 2022-01-28

Similar Documents

Publication Publication Date Title
CN111031102B (en) Multi-user, multi-task mobile edge computing system cacheable task migration method
Yu et al. Federated learning based proactive content caching in edge computing
Yao et al. Joint content placement and storage allocation in C-RANs for IoT sensing service
CN113992770B (en) Policy-based federal reinforcement learning collaborative caching method in fog wireless access network
CN111865826B (en) Active content caching method based on federal learning
CN113115368B (en) Base station cache replacement method, system and storage medium based on deep reinforcement learning
CN113382059B (en) Collaborative caching method based on federal reinforcement learning in fog wireless access network
CN112597388B (en) Cache-enabled D2D communication joint recommendation and caching method
CN115002113A (en) Mobile base station edge computing power resource scheduling method, system and electronic equipment
CN103781115B (en) Distributed base station buffer replacing method based on transmission cost in a kind of cellular network
CN113255004A (en) Safe and efficient federal learning content caching method
CN111556511B (en) Partial opportunistic interference alignment method based on intelligent edge cache
CN108541025B (en) Wireless heterogeneous network-oriented base station and D2D common caching method
CN114863683B (en) Heterogeneous Internet of vehicles edge computing unloading scheduling method based on multi-objective optimization
Zhang et al. Two time-scale caching placement and user association in dynamic cellular networks
CN116916390A (en) Edge collaborative cache optimization method and device combining resource allocation
CN108521640A (en) A kind of content distribution method in cellular network
Shi et al. Content caching policy for 5g network based on asynchronous advantage actor-critic method
CN116723547A (en) Collaborative caching method based on localized federal reinforcement learning in fog wireless access network
CN116362345A (en) Edge caching method and system based on multi-agent reinforcement learning and federal learning
CN113766540B (en) Low-delay network content transmission method, device, electronic equipment and medium
Hua et al. On cost minimization for cache-enabled D2D networks with recommendation
Cai et al. Mobility Prediction-Based Wireless Edge Caching Using Deep Reinforcement Learning
CN112286689A (en) Cooperative shunting and storing method suitable for block chain workload certification
Oualil et al. A personalized learning scheme for internet of vehicles caching

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant