CN111970733A - Deep reinforcement learning-based cooperative edge caching algorithm in ultra-dense network - Google Patents

Deep reinforcement learning-based cooperative edge caching algorithm in ultra-dense network

Info

Publication number
CN111970733A
CN111970733A
Authority
CN
China
Prior art keywords
content
time slot
base station
user equipment
network
Prior art date
Legal status
Granted
Application number
CN202010771674.7A
Other languages
Chinese (zh)
Other versions
CN111970733B (en)
Inventor
韩光洁
张帆
Current Assignee
Changzhou Campus of Hohai University
Original Assignee
Changzhou Campus of Hohai University
Priority date
Filing date
Publication date
Application filed by Changzhou Campus of Hohai University filed Critical Changzhou Campus of Hohai University
Priority to CN202010771674.7A
Publication of CN111970733A
Application granted
Publication of CN111970733B

Classifications

    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04W - WIRELESS COMMUNICATION NETWORKS
    • H04W 28/00 - Network traffic management; Network resource management
    • H04W 28/02 - Traffic management, e.g. flow control or congestion control
    • H04W 28/10 - Flow control between communication endpoints
    • H04W 28/14 - Flow control between communication endpoints using intermediate storage
    • H - ELECTRICITY
    • H04 - ELECTRIC COMMUNICATION TECHNIQUE
    • H04L - TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 - Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 - Protocols
    • H04L 67/10 - Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097 - Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a deep reinforcement learning-based cooperative edge caching algorithm in an ultra-dense network, which comprises the following specific steps: step 1: setting the parameters of a system model; step 2: employing the Double DQN algorithm to make an optimal caching decision for each SBS so as to maximize the total content cache hit rate of all SBSs, wherein the algorithm combines the DQN algorithm with the Double Q-learning algorithm, thereby effectively alleviating the Q-value overestimation problem of the DQN algorithm, and adopts a prioritized experience replay technique to accelerate learning; and step 3: employing an improved branch-and-bound method to make an optimal bandwidth resource allocation decision for each SBS so as to minimize the total content download delay of all user equipments. The invention can effectively reduce the content download delay of all users in the ultra-dense network, improves the content cache hit rate and the spectrum resource utilization rate, has good robustness and scalability, and is suitable for large-scale, user-dense ultra-dense networks.

Description

Deep reinforcement learning-based cooperative edge caching algorithm in ultra-dense network
Technical Field
The invention relates to a deep reinforcement learning-based cooperative edge caching algorithm in an ultra-dense network, belonging to the field of edge caching in ultra-dense networks.
Background
In the 5G era, with the popularization of smart mobile devices and mobile applications, mobile data traffic has grown explosively. In order to meet 5G requirements such as high capacity, high throughput, high user-experienced data rate, high reliability and wide coverage, Ultra-Dense Networks (UDNs) have emerged. A UDN densely deploys low-power Small Base Stations (SBSs) in indoor and outdoor hot-spot areas (such as office buildings, shopping malls, subways, airports and tunnels) within the coverage of a Macro Base Station (MBS) to improve network capacity and spatial reuse and to fill coverage holes that the MBS cannot reach.
However, the SBSs in a UDN are connected to the core network through backhaul links, and as the numbers of SBSs and users grow, backhaul data traffic increases sharply, causing backhaul link congestion and larger service delays, thereby degrading the Quality of Service (QoS) and the Quality of Experience (QoE) of users. Backhaul limitations have therefore become a performance bottleneck restricting the development of UDNs.
In view of the above problems, the edge caching technique has become a promising solution: by caching popular contents at the SBSs, users can obtain requested contents directly from the local SBS without downloading them from a remote cloud server over the backhaul link, which reduces the traffic load of the backhaul links and the core network, lowers the content download delay, and improves the QoS and QoE of users. However, the performance of edge caching may be limited by the restricted cache capacity of an individual SBS. In order to expand the cache capacity and increase cache diversity, a cooperative edge caching scheme can be adopted, in which multiple SBSs cache and update contents cooperatively and share their cached contents with one another, thereby improving the content cache hit rate and reducing the content download delay.
Most existing research on cooperative content caching requires prior knowledge of the probability distribution of content popularity (such as a Zipf distribution) and of a user preference model. In practice, however, content popularity exhibits complex spatio-temporal dynamics and is usually a non-stationary random process, so its probability distribution is difficult to predict and model accurately.
Deep Reinforcement Learning (DRL) combines the powerful perception capability of deep learning with the powerful decision-making capability of reinforcement learning. The most common DRL algorithm is the Deep Q-Network (DQN), which uses a Deep Neural Network (DNN) with weights θ as a function approximator of the Q function, i.e., Q(s, a; θ) ≈ Q*(s, a); this DNN is called the Q network, and its weights θ are updated by stochastic gradient descent to minimize a loss function. The algorithm is suitable for environments with large state and action spaces, thus alleviating the curse of dimensionality. However, the traditional DQN algorithm tends to overestimate Q values, so the Double DQN algorithm, which is based on the Double Q-learning algorithm, is adopted to effectively mitigate the overestimation problem of DQN. In addition, the conventional DQN algorithm usually draws experience samples from the experience replay memory by uniform random sampling, i.e., every experience sample has the same probability of being drawn, so that rare but particularly valuable experience samples are not used efficiently; therefore, a Prioritized Experience Replay technique is adopted to address this sampling problem and accelerate learning.
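For illustration only, the following minimal PyTorch-style sketch shows such a DNN function approximator; the class name, layer sizes and the flattening of the content popularity matrix into a state vector are assumptions of this example rather than part of the invention.

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates Q(s, a; theta): maps a flattened content-popularity state
    to one Q value per candidate caching action (illustrative sizes)."""
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)
```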
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a cooperative edge caching algorithm based on deep reinforcement learning in an ultra-dense network, which is a centralized algorithm. The algorithm does not require prior knowledge such as the probability distribution of content popularity or a user preference model; instead, it computes content popularity from the instantaneous content requests of users, which simplifies the modeling of content popularity. The MBS is responsible for collecting the local content popularity information of all SBSs and making an optimal caching decision for each SBS, with the goal of maximizing the total content cache hit rate of all SBSs. Finally, after the optimal caching decision of each SBS is determined, each SBS makes an optimal resource allocation decision over its bandwidth resources, with the goal of minimizing the total content download delay of all user equipments. The algorithm has good robustness and scalability and is suitable for large-scale, user-dense UDNs.
In order to achieve the purpose, the technical scheme of the invention is realized as follows:
the cooperative edge cache algorithm based on deep reinforcement learning in the ultra-dense network comprises the following steps:
step 1: setting parameters of a system model;
step 2: the Double DQN algorithm is employed to make an optimal caching decision for each SBS so as to maximize the total content cache hit rate of all SBSs, which includes the cache hits served by the local SBS and the cache hits served by other SBSs. The algorithm combines the DQN algorithm with the Double Q-learning algorithm, thereby effectively alleviating the Q-value overestimation problem of the DQN algorithm. In addition, the algorithm adopts a prioritized experience replay technique, which accelerates learning;
and step 3: an improved branch-and-bound approach is employed to make optimal bandwidth resource allocation decisions for each SBS in order to minimize the total content download delay for all user devices. The method combines a branch-and-bound method and a linear lower approximation method, and is suitable for the large-scale separable concave integer programming problem with more decision variables.
Preferably, the specific steps of step 1 are as follows:
1.1 Setting a network model: the network comprises three layers, namely a user equipment layer, an MEC layer and a cloud layer. The user equipment layer comprises a plurality of User Equipments (UEs), and each UE can be connected to only one SBS. The MEC layer comprises M SBSs and one MBS; the MBS covers all SBSs, each SBS covers a plurality of UEs (each SBS corresponds to a small cell), and the coverage areas of the SBSs do not overlap. An MEC server m ∈ M with storage capacity sc_m is deployed on each SBS, and the storage capacities of all MEC servers form a storage-capacity vector sc = [sc_1, sc_2, ..., sc_M]. The MEC server is responsible for providing edge caching resources for the UEs and, at the same time, for collecting the state information of its small cell (such as the size and popularity of each requested content and the channel gains) and transmitting it to the MBS; the SBSs can communicate with each other through the MBS and share their cached contents. The MBS is responsible for collecting the state information of each SBS and making caching decisions for all SBSs, and is connected to the cloud through the core backbone network (e.g., a fiber backhaul link). The cloud layer comprises a plurality of cloud servers, which have abundant computing and caching resources and cache all contents;
1.2 The whole time axis is divided into T time slots of equal length, where t ∈ T denotes the time slot index. A quasi-static model is adopted, i.e., within one time slot all system state parameters (such as the popularity of each requested content, the positions of the user equipments and the channel gains) remain unchanged, while they may differ across time slots;
1.3 Setting a content popularity model: there are F contents in total, each content f has size z_f, the sizes of the contents differ, and the sizes of all contents form a content size vector z = [z_1, z_2, ..., z_f, ..., z_F]. Denote by p_{m,f}^t the popularity of content f in cell m at time slot t, by n_{m,f}^t the total number of requests for content f in cell m at time slot t, and by N_m^t the total number of content requests of all UEs in cell m at time slot t. Thus

p_{m,f}^t = n_{m,f}^t / N_m^t

The popularities of all contents within cell m form the content popularity vector p_m^t = [p_{m,1}^t, p_{m,2}^t, ..., p_{m,F}^t], and the content popularity vectors of all cells form the content popularity matrix p^t.
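A minimal sketch of this popularity computation (the function name and the array layout are assumptions made for illustration):

```python
import numpy as np

def content_popularity(request_counts):
    """request_counts: integer array of shape [M, F], where entry (m, f) is the
    number of requests for content f observed in cell m during the current slot.
    Returns the popularity matrix p^t, whose rows sum to 1 for cells with at
    least one request and stay zero otherwise."""
    totals = request_counts.sum(axis=1, keepdims=True).astype(float)
    popularity = np.zeros_like(request_counts, dtype=float)
    np.divide(request_counts, totals, out=popularity, where=totals > 0)
    return popularity
```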
1.4 Setting a content request model: a total of U UEs send content requests. The set of all UEs that send content requests in cell m at time slot t is defined, and the number of UEs sending content requests in cell m at time slot t is denoted U_m^t. Assuming that each UE requests each content at most once within time slot t, the content request vector of each UE u in cell m at time slot t is defined as a binary vector with one element per content f; the element equals 1 if UE u in cell m requests content f at time slot t, and equals 0 if UE u in cell m does not request content f at time slot t. The content request vectors of all UEs in cell m at time slot t form a content request matrix.
1.5 Setting a caching model: the content caching decision vector maintained in the cache of each MEC server m at time slot t is defined as d_m^t = [d_{m,1}^t, ..., d_{m,F}^t], where each element d_{m,f}^t ∈ {0, 1}; d_{m,f}^t = 1 indicates that content f is cached on MEC server m at time slot t, and d_{m,f}^t = 0 indicates that content f is not cached on MEC server m at time slot t. The total size of the contents cached in each MEC server cannot exceed its storage capacity sc_m. The content caching decision vectors of all MEC servers form the content caching decision matrix d^t.
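The storage constraint of 1.5 can be checked as in the following sketch (the function name is illustrative):

```python
def satisfies_storage_constraint(decision, sizes, capacity):
    """decision: 0/1 caching decision vector d_m^t over the F contents of one
    MEC server; sizes: content size vector z; capacity: storage capacity sc_m."""
    return sum(d * z for d, z in zip(decision, sizes)) <= capacity
```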
1.6 Setting a communication model: assume that all SBSs operate on the same frequency band with bandwidth B, and that the MBS and the SBSs communicate with each other over wired optical fiber, so the data transmission rate between an SBS and the MBS is large. The bandwidth B is divided into β orthogonal sub-channels using orthogonal frequency-division multiplexing; each UE u in cell m at time slot t may be allocated several orthogonal sub-channels, the number of which is denoted β_{m,u}^t, and each sub-channel has bandwidth B_0 = B/β. Because the coverage areas of the SBSs do not overlap, there is no co-channel interference between different SBSs or between different UEs of the same SBS. The downlink SNR between UE u and its local SBS m at time slot t is defined as

SNR_{m,u}^t = P_m^t h_{m,u}^t / σ²

where P_m^t denotes the transmit power of SBS m at time slot t, h_{m,u}^t denotes the channel gain between SBS m and UE u at time slot t with h_{m,u}^t = (d_{m,u}^t)^{-μ}, d_{m,u}^t denotes the distance between SBS m and UE u at time slot t, μ denotes the path loss factor, and σ² denotes the variance of the additive white Gaussian noise. The download rate between UE u and its local SBS m at time slot t is therefore defined as

r_{m,u}^t = β_{m,u}^t · B_0 · log2(1 + SNR_{m,u}^t)

The data transmission rate between each SBS m and the MBS n is defined as a constant, and the data transmission rate between the MBS n and the cloud server c is defined as a constant. Accordingly, the download delay required for UE u to obtain content f from the local MEC server m at time slot t is the time needed to transmit content f (of size z_f) over the local wireless link, i.e., z_f / r_{m,u}^t; the download delay required for UE u to obtain content f from another, non-local MEC server -m at time slot t additionally includes the transfer of the content over the SBS-MBS links; and the download delay required for UE u to obtain content f from the cloud server c at time slot t additionally includes the transfer of the content from the cloud server to the MBS. Therefore, the download delay experienced by UE u for content f at time slot t depends on whether the content is obtained from the local MEC server, from another MEC server, or from the cloud server.
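A minimal numeric sketch of the rate and delay quantities above; the multi-hop compositions of the non-local and cloud delays are rendered as formula images in the original, so the per-hop summation below is an assumption made for illustration, as are the function names.

```python
import math

def downlink_rate(n_subchannels, subchannel_bw_hz, tx_power_w, distance_m, mu, noise_var):
    """Shannon rate between SBS m and UE u, with channel gain h = d^(-mu)."""
    snr = tx_power_w * distance_m ** (-mu) / noise_var
    return n_subchannels * subchannel_bw_hz * math.log2(1.0 + snr)

def download_delay(content_size, rate_local, rate_sbs_mbs=None, rate_mbs_cloud=None):
    """Delay sketch assuming delay = size / rate summed over the traversed links:
    a local hit uses only the wireless link; a non-local hit adds the SBS-MBS hop;
    a cloud fetch additionally adds the MBS-cloud hop."""
    delay = content_size / rate_local
    if rate_sbs_mbs is not None:
        delay += content_size / rate_sbs_mbs
    if rate_mbs_cloud is not None:
        delay += content_size / rate_mbs_cloud
    return delay
```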
1.7 Setting a content delivery model: the basic process of content delivery is that each UE independently requests several contents from its local MEC server. If a content is cached in the cache of the local MEC server, the content is transmitted to the UE directly by the local MEC server; if the content is not cached in the local MEC server, it can be obtained from the MEC server of another SBS via the MBS and then transmitted to the UE by the local MEC server; if no MEC server caches the content, the content is relayed from the cloud server to the MBS through the core network, then transmitted by the MBS to the local MEC server, and finally delivered by the local MEC server to the UE. Three binary indicator variables are defined accordingly: the first equals 1 if UE u obtains content f from the local MEC server m at time slot t and 0 otherwise; the second equals 1 if UE u obtains content f from a non-local MEC server -m at time slot t and 0 otherwise; and the third equals 1 if UE u obtains content f from the cloud server c at time slot t and 0 otherwise.
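The three-tier lookup described in 1.7 can be sketched as follows (the function and tier names are illustrative and are not the notation of the invention):

```python
def serving_tier(content_id, local_cache, neighbor_caches):
    """Return which tier serves a requested content: the local MEC server,
    a neighboring MEC server reached via the MBS, or the cloud server.
    Caches are given as sets of content identifiers."""
    if content_id in local_cache:
        return "local"
    if any(content_id in cache for cache in neighbor_caches):
        return "neighbor"
    return "cloud"
```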
Preferably, the detailed steps of the Double DQN algorithm in step 2 are as follows:
2.1 The content caching decision problem of the M SBSs is described as a Constrained Markov Decision Process (CMDP) problem, which is represented by the tuple <S, A, r, Pr, c_1, c_2, ..., c_M>; the optimization objective is to maximize the long-term cumulative discounted reward of all SBSs, where
2.1.1 S denotes the state space, and s_t ∈ S denotes the state set of all SBSs at time slot t, i.e., the content popularity matrix p^t formed by the content popularity vectors of all SBSs at time slot t; thus s_t = p^t;
2.1.2 A denotes the action space, and a_t ∈ A denotes the action selected by the MBS at time slot t, i.e., a_t = d^t;
2.1.3 r denotes the reward function; r_t(s_t, a_t) denotes the instantaneous reward obtained after the MBS performs action a_t in state s_t, and is a weighted sum of the total cache hit rate of requests served by the local SBSs (with weight w_1) and the total cache hit rate of requests served by non-local SBSs (with weight w_2), where w_1 and w_2 satisfy w_1 + w_2 = 1 and w_1 > w_2, and w_1 = 0.9 may be used;
2.1.4 Pr denotes the state transition function, i.e., the probability Pr(s_{t+1} | s_t, a_t) that the system transitions to the next state s_{t+1} after the MBS performs action a_t in the current state s_t;
2.1.5 c_1, c_2, ..., c_M denote the constraints of the M SBSs, namely that the total size of the contents cached by each SBS m does not exceed its storage capacity sc_m, i.e., Σ_{f=1}^{F} d_{m,f}^t · z_f ≤ sc_m.
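A minimal sketch of this reward under the assumption that the two hit rates are measured as fractions of all requests in a slot served by the local SBS and by other SBSs respectively (the function name and data layout are illustrative):

```python
def caching_reward(requests, caches, w1=0.9, w2=0.1):
    """requests[m]: content ids requested in cell m this slot (one entry per request);
    caches[m]: set of content ids cached by SBS m this slot.
    Returns w1 * local hit rate + w2 * non-local hit rate."""
    total = local_hits = neighbor_hits = 0
    for m, reqs in enumerate(requests):
        for f in reqs:
            total += 1
            if f in caches[m]:
                local_hits += 1
            elif any(f in caches[k] for k in range(len(caches)) if k != m):
                neighbor_hits += 1
    if total == 0:
        return 0.0
    return w1 * local_hits / total + w2 * neighbor_hits / total
```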
2.2 employs a Double DQN algorithm whose training process is similar to the DQN algorithm, including an online Q network and a target Q network, except that the algorithm decomposes the maximum operation of the target Q value in the DQN algorithm into action selection and action evaluation, i.e. using the online Q network to select an action, and using the target Q network to evaluate the action. The Double DQN algorithm includes two procedures, a training procedure and an execution procedure, wherein the training procedure is as follows:
2.2.1 In the initialization phase of the algorithm: initialize the capacity N of the experience replay memory, the sampling batch size k (N > k), the experience replay period K (i.e., the sampling period), the weight θ of the online Q network Q, the weight θ⁻ of the target Q network (initialized as θ⁻ = θ), the learning rate α, the discount factor γ, the parameter ε of the ε-greedy policy, the time interval C for updating the target Q network parameters, the total number of training episodes EP, and the total number of time slots T contained in each episode (T > N); define the episode index i and initialize i = 1;
2.2.2 if i is less than or equal to EP, entering 2.2.2.1; otherwise, training is finished:
2.2.2.1 Initialize t = 1;
2.2.2.2 Input the current state s_t into the online Q network and output the Q values of all actions; select all actions that satisfy the storage capacity constraint, and from these select an action a_t using the ε-greedy policy and execute it. The ε-greedy policy means that at each time slot the agent selects an action at random with a small probability ε and selects the action with the largest Q value with probability 1 - ε;
2.2.2.3 After performing action a_t, the agent obtains an instantaneous reward r_t and transitions to the next state s_{t+1}; the experience sample e_t = (s_t, a_t, r_t, s_{t+1}) is then stored in the experience replay memory;
2.2.2.4 if t < N, let t ← t +1, and return 2.2.2.2; otherwise, go to 2.2.2.5;
2.2.2.5 if t mod K = 0, go to 2.2.2.6; otherwise, let t ← t + 1 and return to 2.2.2.2;
2.2.2.6 Suppose an experience sample j in the experience replay memory is e_j = (s_j, a_j, r_j, s_{j+1}); the priority of experience sample j is defined as

p_j = |δ_j| + ϵ    (9)

where ϵ > 0 ensures that the priority of each sample is not 0, and δ_j denotes the temporal-difference (TD) error of sample j, i.e., the difference between the target Q value and the estimated Q value of sample j. The Double DQN algorithm uses the online Q network to select the action with the largest Q value and the target Q network to evaluate the Q value of that action, i.e.,

δ_j = r_j + γ Q(s_{j+1}, argmax_{a'} Q(s_{j+1}, a'; θ); θ⁻) - Q(s_j, a_j; θ)    (10)

Therefore, the larger the TD error of a sample, the higher its priority. The priorities of all samples in the experience replay memory are then calculated using equations (9) and (10);
2.2.2.7 A Sum Tree data structure is used to extract k experience samples from the experience replay memory. In the Sum Tree, each leaf node at the bottom stores the priority of one experience sample, the value of each parent node equals the sum of the values of its two child nodes, and the root node at the top stores the sum of the priorities of all samples. The specific process is as follows: divide the value of the root node by k to obtain k priority intervals, randomly select a value within each interval, search top-down to determine which bottom-level leaf node the value falls into, and select the sample corresponding to that leaf node, thereby obtaining k experience samples;
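A minimal sketch of such a Sum Tree (the array layout, class and method names are assumptions for illustration; storage of the transitions themselves is omitted):

```python
import random

class SumTree:
    """Leaves hold sample priorities; every parent stores the sum of its two
    children, so the root holds the total priority."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity - 1)    # internal nodes followed by leaves

    def update(self, sample_index, priority):
        i = sample_index + self.capacity - 1      # position of the leaf
        change = priority - self.tree[i]
        while True:                               # propagate the change up to the root
            self.tree[i] += change
            if i == 0:
                break
            i = (i - 1) // 2

    def sample(self, k):
        """Divide the root value into k intervals, draw one value per interval,
        and walk top-down to the matching leaf (step 2.2.2.7)."""
        indices, segment = [], self.tree[0] / k
        for j in range(k):
            v = random.uniform(j * segment, (j + 1) * segment)
            i = 0
            while 2 * i + 1 < len(self.tree):     # descend until a leaf is reached
                left = 2 * i + 1
                if v <= self.tree[left]:
                    i = left
                else:
                    v -= self.tree[left]
                    i = left + 1
            indices.append(i - (self.capacity - 1))
        return indices
```

For example, tree = SumTree(8); tree.update(0, 1.5) sets the priority of sample 0, and tree.sample(4) then returns four sample indices drawn according to the stored priorities.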
2.2.2.8 For each experience sample j among the k sampled experience samples, calculate its target Q value y_j according to equation (11), i.e., use the online Q network to select the action with the largest Q value and use the target Q network to evaluate the Q value of that action:

y_j = r_j + γ Q(s_{j+1}, argmax_{a'} Q(s_{j+1}, a'; θ); θ⁻)    (11)

2.2.2.9 Define the loss function Loss(θ) as the mean squared error between the target Q value y_j and the estimated Q value Q(s_j, a_j; θ):

Loss(θ) = E[(y_j - Q(s_j, a_j; θ))²]    (12)

where E[·] denotes the mathematical expectation. Then, based on the k experience samples, update the weight θ of the online Q network by stochastic gradient descent so as to minimize the loss function;
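For illustration, a minimal PyTorch-style sketch of one such update on a sampled batch is given below; it omits the prioritized importance-sampling weights and terminal-state handling, and the tensor shapes and names are assumptions.

```python
import torch
import torch.nn.functional as F

def double_dqn_update(online_q, target_q, optimizer, batch, gamma):
    """One gradient step on the k sampled transitions (Eqs. (11)-(12)).
    batch is a tuple of tensors (s, a, r, s_next)."""
    s, a, r, s_next = batch
    with torch.no_grad():
        best_a = online_q(s_next).argmax(dim=1, keepdim=True)          # action selection
        y = r + gamma * target_q(s_next).gather(1, best_a).squeeze(1)  # action evaluation
    q = online_q(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q, y)                                            # Eq. (12)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```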
2.2.2.10 If t mod C = 0, copy the updated weight θ of the online Q network into the target Q network so as to update the target Q network weight θ⁻; otherwise, the target Q network weight θ⁻ does not need to be updated;
2.2.2.11 If t < T, let t ← t + 1 and return to 2.2.2.2; otherwise, let i ← i + 1 and return to 2.2.2.1;
After the training process of the Double DQN algorithm is completed, the optimal weight θ* of the online Q network is obtained; the trained Double DQN algorithm is then deployed on the MBS for execution, and the execution process is as follows:
2.2.3 Initialize t = 1;
2.2.4 The MBS collects the state set s_t of all SBSs at time slot t and inputs s_t into the trained online Q network, which outputs the Q values of all actions;
2.2.5 Select all actions that satisfy the storage capacity constraint, and then select from them the action a_t with the largest Q value and execute it, i.e.,

a_t = argmax_{a'} Q(s_t, a'; θ*)    (13)

2.2.6 After the MBS performs action a_t, it obtains an instantaneous reward r_t and transitions to the next state s_{t+1};
2.2.7 If t < T, let t ← t + 1 and return to 2.2.4; otherwise, the algorithm ends.
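A sketch of the constrained greedy selection of step 2.2.5 and equation (13), assuming the candidate actions are enumerated as caching decision matrices (names and layout are illustrative):

```python
def greedy_feasible_action(q_values, actions, sizes, capacities):
    """q_values[i]: Q value of candidate action i output by the online Q network;
    actions[i]: candidate caching decision matrix d[m][f] in {0, 1};
    sizes: content size vector z; capacities: storage capacities sc.
    Returns the index of the feasible action with the largest Q value."""
    def feasible(d):
        return all(
            sum(d[m][f] * sizes[f] for f in range(len(sizes))) <= capacities[m]
            for m in range(len(capacities))
        )
    candidates = [i for i, d in enumerate(actions) if feasible(d)]
    return max(candidates, key=lambda i: q_values[i])
```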
Preferably, the specific steps of step 3 are as follows:
3.1 After the optimal content caching decision vector of each SBS m has been determined, the bandwidth resource allocation problem of each SBS is described as a nonlinear integer programming problem P, whose decision variables are the numbers of orthogonal sub-channels allocated to the UEs served by that SBS. The objective of P is to minimize the total content download delay of all UEs served by the SBS, and the constraints require that the number of sub-channels allocated to each UE is an integer and that the total number of allocated sub-channels does not exceed the number of available sub-channels. Both the objective function and the constraint functions can be expressed as sums of univariate functions of the individual decision variables; within the feasible domain the objective function is a separable concave function and the constraints are linear, so the problem is a separable concave integer programming problem;
3.2 Each SBS adopts an improved branch-and-bound method to solve the separable concave integer programming problem; the specific flow of the method is as follows:
3.2.1 Continuously relax the original problem P, i.e., remove the integer constraints, and linearly approximate the objective function, thereby obtaining the continuous relaxation and linear approximation subproblem LSP of the original problem P, where the LSP is a separable linear programming problem;
3.2.2 solving the continuous optimal solution of the LSP by utilizing the KKT condition, wherein if the continuous optimal solution is an integer solution, the continuous optimal solution is the optimal solution of the original problem P, and otherwise, the objective function value of the continuous optimal solution is a lower bound of the optimal value of the original problem P;
3.2.3 then branch from the continuous optimal solution, where each branch corresponds to a sub-problem, and then solve the continuous relaxation of these sub-problems until a feasible integer solution is found whose objective function value provides an upper bound for the original problem P, and whose objective function value for the continuous optimal solution of each sub-problem provides a lower bound for the corresponding sub-problem. If a branch has no feasible solution, or the continuous optimal solution is an integer solution, or the lower bound exceeds the upper bound, the branch can be cut. And for the branches which are not cut off, repeating the branching and pruning processes until all the branches are cut off. If a branch has a feasible integer solution, the upper bound needs to be updated if necessary to ensure that the upper bound is equal to the minimum objective function value of the existing feasible integer solution;
3.2.4 at the end of the algorithm, the best feasible integer solution at present is the optimal solution of the original problem P.
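For illustration, the sketch below shows the closed-form continuous relaxation of step 3.2.2 via the KKT conditions for a sub-channel allocation whose per-UE delay is proportional to 1/β_u (consistent with the rate model of 1.6), together with a simple rounding that yields a feasible integer solution usable as an initial upper bound in step 3.2.3. The full branching, pruning and linear-approximation logic of the improved method is omitted; the names, and the assumption that at least one sub-channel per UE is available, are made for this example only.

```python
import math

def kkt_relaxation(costs, total):
    """Continuous optimum of  min sum_u costs[u] / beta_u  s.t.  sum_u beta_u = total:
    the KKT conditions give beta_u proportional to sqrt(costs[u])."""
    scale = total / sum(math.sqrt(c) for c in costs)
    return [scale * math.sqrt(c) for c in costs]

def rounded_upper_bound(costs, total):
    """Round the relaxed allocation down (at least 1 sub-channel per UE, assuming
    total >= number of UEs) and hand leftover sub-channels to the UEs with the
    largest marginal delay reduction; returns (allocation, objective value)."""
    beta = [max(1, int(b)) for b in kkt_relaxation(costs, total)]
    while sum(beta) > total:                       # repair any overshoot from max(1, .)
        i = max(range(len(beta)), key=lambda u: beta[u])
        beta[i] -= 1
    for _ in range(total - sum(beta)):             # distribute remaining sub-channels
        i = max(range(len(beta)),
                key=lambda u: costs[u] / beta[u] - costs[u] / (beta[u] + 1))
        beta[i] += 1
    return beta, sum(c / b for c, b in zip(costs, beta))
```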
Beneficial effects: the invention provides a deep reinforcement learning-based cooperative edge caching algorithm in an ultra-dense network, which can effectively reduce the content download delay of all users in the ultra-dense network, improve the content cache hit rate and the spectrum resource utilization rate, has good robustness and scalability, and is suitable for large-scale, user-dense ultra-dense networks.
Drawings
Fig. 1 is a network model of UDN using edge caching in step 1.1;
fig. 2 is a schematic diagram of the use of the data structure Sum Tree to extract k samples in step 2.2.2.7.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application are clearly and completely described below, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The cooperative edge cache algorithm based on deep reinforcement learning in the ultra-dense network specifically comprises the following steps:
step 1: setting parameters of a system model;
step 2: the Double DQN algorithm is employed to make an optimal caching decision for each SBS so as to maximize the total content cache hit rate of all SBSs, which includes the cache hits served by the local SBS and the cache hits served by other SBSs. The algorithm combines the DQN algorithm with the Double Q-learning algorithm, thereby effectively alleviating the Q-value overestimation problem of the DQN algorithm. In addition, the algorithm adopts a prioritized experience replay technique, which accelerates learning;
and step 3: an improved branch-and-bound approach is employed to make optimal bandwidth resource allocation decisions for each SBS in order to minimize the total content download delay for all user devices. The method combines a branch-and-bound method and a linear lower approximation method, and is suitable for the large-scale separable concave integer programming problem with more decision variables.
The methods mentioned in the present invention are all conventional technical means known to those skilled in the art, and thus are not described in detail.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (3)

1. The cooperative edge cache algorithm based on deep reinforcement learning in the ultra-dense network is characterized by comprising the following specific steps:
step 1: setting parameters of a system model;
1.1 Setting a network model: the network comprises three layers, namely a user equipment layer, an MEC layer and a cloud layer, wherein the user equipment layer comprises a plurality of user equipments, and each user equipment can be connected to only one small base station; the MEC layer comprises M small base stations and a macro base station, the macro base station covers all the small base stations, each small base station covers a plurality of user equipments, each small base station represents a small cell, the coverage areas of the small base stations do not overlap with each other, an MEC server m ∈ M with storage capacity sc_m is deployed on each small base station, and the storage capacities of all MEC servers form a storage-capacity vector sc = [sc_1, sc_2, ..., sc_M]; the MEC server is responsible for providing edge caching resources for the user equipments and, at the same time, for collecting the state information of each small cell and transmitting it to the macro base station, and the small base stations communicate with each other through the macro base station and share their cached contents; the macro base station is responsible for collecting the state information of each small base station and making caching decisions for all the small base stations, and is connected to the cloud layer through a core backbone network; the cloud layer comprises a plurality of cloud servers, which have abundant computing and caching resources and are used for caching all contents;
1.2 Dividing the whole time axis into T time slots of equal length, where t ∈ T denotes the time slot index, and adopting a quasi-static model, i.e., within one time slot all system state parameters remain unchanged, while they may differ across time slots;
1.3 Setting a content popularity model: there are F contents in total, each content f has size z_f, the sizes of the contents differ, and the sizes of all contents form a content size vector z = [z_1, z_2, ..., z_f, ..., z_F]; denoting by p_{m,f}^t the popularity of content f in cell m at time slot t, by n_{m,f}^t the total number of requests for content f in cell m at time slot t, and by N_m^t the total number of content requests of all user equipments in cell m at time slot t, it follows that

p_{m,f}^t = n_{m,f}^t / N_m^t

the popularities of all contents within cell m form the content popularity vector p_m^t, and the content popularity vectors of all cells form the content popularity matrix p^t;
1.4 Setting a content request model: a total of U user equipments send content requests; the set of all user equipments that send content requests in cell m at time slot t is defined, and the number of user equipments sending content requests in cell m at time slot t is denoted U_m^t; assuming that each user equipment requests each content at most once within time slot t, the content request vector of each user equipment u in cell m at time slot t is defined as a binary vector with one element per content f, where the element equals 1 if user equipment u in cell m requests content f at time slot t and equals 0 if user equipment u in cell m does not request content f at time slot t; the content request vectors of all user equipments in cell m at time slot t form a content request matrix;
1.5 Setting a caching model: the content caching decision vector maintained in the cache of each MEC server m at time slot t is defined as d_m^t = [d_{m,1}^t, ..., d_{m,F}^t], where each element d_{m,f}^t ∈ {0, 1}; d_{m,f}^t = 1 indicates that content f is cached on MEC server m at time slot t, and d_{m,f}^t = 0 indicates that content f is not cached on MEC server m at time slot t; the total size of the contents cached in each MEC server cannot exceed its storage capacity sc_m; the content caching decision vectors of all MEC servers form the content caching decision matrix d^t;
1.6 setting the communication model: assuming that each small base station operates on the same frequency band with the frequency bandwidth of B, the macro base station and the small base stations communicate with each other by using wired optical fiber, so that the small base stationsThe data transmission rate between the macro base station and the macro base station is high; dividing the frequency bandwidth B into beta orthogonal sub-channels by using an orthogonal frequency division multiplexing technology, and defining that each user equipment u in a cell m at a time slot t is allocated with a plurality of orthogonal sub-channels
Figure FDA00026168751000000210
Each subchannel having a bandwidth of
Figure FDA00026168751000000211
Because the coverage areas of the small base stations are not mutually overlapped, the same frequency interference does not exist between different small base stations and between different user equipment of the same small base station; defining the value of the downlink SNR between the user equipment u and the local small base station m as
Figure FDA00026168751000000212
And is
Figure FDA00026168751000000213
Wherein the content of the first and second substances,
Figure FDA00026168751000000214
represents the transmit power of the small base station m at time slot t,
Figure FDA00026168751000000215
denotes the channel gain between the slot t small base station m and the user equipment u, and
Figure FDA00026168751000000216
Figure FDA00026168751000000217
denotes the distance between the small base station m and the user equipment u at the time slot t, μ denotes the path loss factor, σ2A variance representing additive white gaussian noise; user equipment u and local small base defined in time slot tThe download rate between stations m is
Figure FDA00026168751000000218
And is
Figure FDA00026168751000000219
The data transmission rate between each small base station m and the macro base station n is defined as a constant $r_{m,n}$, and the data transmission rate between the macro base station n and the cloud server c is a constant $r_{n,c}$, with $r_{n,c}<r_{m,n}$;
The download delay required for user equipment u to obtain content f from the local MEC server m at time slot t is defined as $D_{m,u,f}^{t,l}$, and

$D_{m,u,f}^{t,l}=\dfrac{s_f}{r_{m,u}^t}$

the download delay required for user equipment u to obtain content f from another, non-local MEC server −m at time slot t is defined as $D_{m,u,f}^{t,n}$, and

$D_{m,u,f}^{t,n}=\dfrac{s_f}{r_{m,n}}+\dfrac{s_f}{r_{m,u}^t}$

the download delay required for user equipment u to obtain content f from the cloud server c at time slot t is defined as $D_{m,u,f}^{t,c}$, and

$D_{m,u,f}^{t,c}=\dfrac{s_f}{r_{n,c}}+\dfrac{s_f}{r_{m,n}}+\dfrac{s_f}{r_{m,u}^t}$

where $s_f$ denotes the size of content f; therefore $D_{m,u,f}^{t,l}<D_{m,u,f}^{t,n}<D_{m,u,f}^{t,c}$;
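For illustration only (not part of the claim language), the following Python sketch evaluates the download rate and the three candidate download delays defined above for a single user equipment; all numerical values and variable names are assumptions chosen for the example.

```python
import math

# Illustrative parameters (assumed values, not taken from the patent)
B = 20e6        # total bandwidth of the shared band (Hz)
beta = 50       # number of orthogonal sub-channels
beta_mu = 4     # sub-channels allocated to user equipment u
p_m = 1.0       # transmit power of small base station m (W)
sigma2 = 1e-9   # variance of the additive white Gaussian noise
mu = 3.5        # path loss factor
dist = 40.0     # distance between small base station m and user u (m)
s_f = 8e6       # size of content f (bits)
r_mn = 100e6    # SBS <-> MBS fibre rate (bit/s)
r_nc = 50e6     # MBS <-> cloud rate (bit/s)

# Downlink SNR and download rate on the allocated sub-channels
gain = dist ** (-mu)
snr = p_m * gain / sigma2
r_mu = beta_mu * (B / beta) * math.log2(1.0 + snr)

# Download delay depending on where content f is served from
d_local = s_f / r_mu                             # cached at the local MEC server m
d_neighbor = s_f / r_mn + s_f / r_mu             # fetched from another SBS via the macro base station
d_cloud = s_f / r_nc + s_f / r_mn + s_f / r_mu   # fetched from the cloud server

print(round(d_local, 3), round(d_neighbor, 3), round(d_cloud, 3))
```

As expected, the three delays are strictly increasing, matching the ordering stated above.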
1.7 setting the content delivery model: the basic process of content delivery is that each user equipment independently requests a plurality of contents from its local MEC server; if a content is cached in the cache region of the local MEC server, it is transmitted to the user equipment directly by the local MEC server; if the content is not cached in the local MEC server, it is acquired from the MEC server of another small base station via the macro base station and then transmitted to the user equipment by the local MEC server; if no MEC server caches the content, the content is relayed from the cloud server to the macro base station through the core network, transmitted by the macro base station to the local MEC server, and finally delivered to the user equipment by the local MEC server;
Whether user equipment u acquires content f from the local MEC server m at time slot t is defined as a binary variable $y_{m,u,f}^{t,l}\in\{0,1\}$, wherein $y_{m,u,f}^{t,l}=1$ indicates that user equipment u obtains content f from the local server m at time slot t, and $y_{m,u,f}^{t,l}=0$ otherwise; whether user equipment u acquires content f from a non-local server −m at time slot t is defined as a binary variable $y_{m,u,f}^{t,n}\in\{0,1\}$, wherein $y_{m,u,f}^{t,n}=1$ indicates that user equipment u obtains content f from the non-local server −m at time slot t, and $y_{m,u,f}^{t,n}=0$ otherwise; whether user equipment u acquires content f from the cloud server c at time slot t is defined as a binary variable $y_{m,u,f}^{t,c}\in\{0,1\}$, wherein $y_{m,u,f}^{t,c}=1$ indicates that user equipment u obtains content f from the cloud server c at time slot t, and $y_{m,u,f}^{t,c}=0$ otherwise;
Step 2: making an optimal caching decision for each small base station by adopting the Double DQN algorithm so as to maximize the total content cache hit rate of all the small base stations, wherein the total cache hit rate comprises the total cache hit rate achieved at the local small base station and the total cache hit rate achieved at the other small base stations;
and step 3: an improved branch-and-bound approach is employed to make optimal bandwidth resource allocation decisions for each small base station to minimize the total content download delay for all user devices.
2. The cooperative edge caching algorithm based on deep reinforcement learning in the ultra-dense network as claimed in claim 1, wherein the detailed steps of the Double DQN algorithm in the step 2 are as follows:
2.1 describe the content caching decision problem of the M small base stations as a constrained Markov decision process, expressed by the tuple $\langle S,A,r,Pr,c_1,c_2,\ldots,c_M\rangle$; the optimization objective is to maximize the long-term cumulative discounted reward of all the small base stations, wherein:

S represents the state space; $s_t\in S$ represents the state of all the small base stations at time slot t, i.e., the content popularity matrix $p_t$ formed by the content popularity vectors of all the small base stations at time slot t, thus $s_t=p_t$;

A represents the action space; $a_t\in A$ represents the action selected by the macro base station at time slot t, i.e., $a_t=d_t$;

r represents the reward function; the reward function of the macro base station at time slot t is defined as $r_t(s_t,a_t)$, i.e., the immediate reward obtained after the macro base station performs action $a_t$ in state $s_t$, and

$r_t(s_t,a_t)=w_1\sum_{m=1}^{M}hit_m^{t,l}+w_2\sum_{m=1}^{M}hit_m^{t,n}$

wherein $w_1$ and $w_2$ are weights satisfying $w_1+w_2=1$ and $w_1>w_2$, for example $w_1=0.9$ and $w_2=0.1$; $hit_m^{t,l}$ represents the total cache hit rate hit by the local small base station m, and $hit_m^{t,n}$ represents the total cache hit rate hit by the non-local small base stations −m;

Pr denotes the state transition function, i.e., the probability $\Pr(s_{t+1}\mid s_t,a_t)$ that the system shifts to the next state $s_{t+1}$ after the macro base station performs action $a_t$ in the current state $s_t$;

$c_1,c_2,\ldots,c_M$ denote the constraint conditions of the M small base stations, meaning that the total size of the contents cached by each small base station m does not exceed its storage capacity $sc_m$, i.e.,

$\sum_{f=1}^{F}d_{m,f}^t\,s_f\le sc_m,\quad\forall m\in\{1,2,\ldots,M\}$
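A minimal Python sketch of the reward and the per-station capacity constraint described in 2.1; the hit-rate values, content sizes, and array shapes are assumed placeholders, not quantities prescribed by the claim.

```python
import numpy as np

def reward(hit_local, hit_nonlocal, w1=0.9, w2=0.1):
    """Weighted combination of local and non-local cache hit rates,
    summed over all small base stations (w1 + w2 = 1, w1 > w2)."""
    return w1 * np.sum(hit_local) + w2 * np.sum(hit_nonlocal)

def satisfies_capacity(d_t, content_sizes, capacities):
    """d_t: (M, F) 0/1 caching decision matrix; content_sizes: (F,) sizes s_f;
    capacities: (M,) storage capacities sc_m. True if every small base
    station respects its storage capacity."""
    used = d_t @ content_sizes
    return bool(np.all(used <= capacities))

# Toy instance with M = 2 small base stations and F = 4 contents
d_t = np.array([[1, 0, 1, 0],
                [0, 1, 1, 0]])
sizes = np.array([2.0, 3.0, 1.0, 4.0])
caps = np.array([5.0, 5.0])
print(reward(np.array([0.6, 0.5]), np.array([0.2, 0.3])), satisfies_capacity(d_t, sizes, caps))
```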
2.2 the Double DQN algorithm comprises two procedures, a training procedure and an execution procedure, wherein the training procedure is as follows:

2.2.1 in the initialization phase of the algorithm: initialize the storage capacity N of the experience replay memory, the sampling batch size k (N > k), and the experience replay period (i.e., sampling period) K; initialize the weights θ of the online Q network Q and the weights θ⁻ of the target Q network $\hat{Q}$, the learning rate α, the discount factor γ, the parameter ε of the ε-greedy strategy, the time interval C for updating the target Q network parameters, the total number of training episodes EP, and the total number of time slots T (T > N) contained in each episode; define the index of an episode as i and initialize i = 1;
2.2.2 for each i ∈ {1,2, …, EP }, the following steps are performed:
2.2.2.1 initialize t = 1;
2.2.2.2 input the current state $s_t$ into the online Q network and output the Q values of all actions; select all actions satisfying the storage capacity constraint according to the constraint conditions, and select an action $a_t$ from them using the ε-greedy strategy and execute it; the ε-greedy strategy means that at each time slot the agent selects an action randomly with a small probability ε and selects the action with the largest Q value with probability 1−ε;
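A small sketch of this constrained ε-greedy selection; the Q-value vector, the feasibility mask derived from the storage-capacity constraints, and ε are assumed inputs.

```python
import numpy as np

def epsilon_greedy(q_values, feasible_mask, epsilon, rng=None):
    """q_values: (num_actions,) Q values from the online network;
    feasible_mask: boolean array marking actions that satisfy the
    storage-capacity constraints; epsilon: exploration probability."""
    if rng is None:
        rng = np.random.default_rng()
    feasible_actions = np.flatnonzero(feasible_mask)
    if rng.random() < epsilon:
        return int(rng.choice(feasible_actions))           # explore among feasible actions
    masked_q = np.where(feasible_mask, q_values, -np.inf)  # exploit the best feasible action
    return int(np.argmax(masked_q))

q = np.array([0.2, 1.5, 0.7, -0.3])
mask = np.array([True, False, True, True])
print(epsilon_greedy(q, mask, epsilon=0.1))
```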
2.2.2.3 after performing action $a_t$, the agent obtains an immediate reward $r_t$ and transitions to the next state $s_{t+1}$; the experience sample $e_t=(s_t,a_t,r_t,s_{t+1})$ is then stored in the experience replay memory;
2.2.2.4 if t < N, let t ← t + 1 and return to 2.2.2.2; otherwise, go to 2.2.2.5;

2.2.2.5 if t mod K = 0, go to 2.2.2.6; otherwise, let t ← t + 1 and return to 2.2.2.2;
2.2.2.6 assume that an experience sample j in the experience replay memory is $e_j=(s_j,a_j,r_j,s_{j+1})$; the priority of experience sample j is defined as

$p_j=|\delta_j|+\epsilon$ (9)

where $\epsilon>0$ ensures that the priority of each sample is not 0, and $\delta_j$ represents the temporal-difference (TD) error of sample j, i.e., the difference between the target Q value and the estimated Q value of sample j; the Double DQN algorithm uses the online Q network to select the action with the largest Q value and uses the target Q network to evaluate the Q value of that action, i.e.,

$\delta_j=r_j+\gamma\,\hat{Q}\big(s_{j+1},\arg\max_{a'}Q(s_{j+1},a';\theta);\theta^-\big)-Q(s_j,a_j;\theta)$ (10)

therefore, the larger the TD error of a sample, the higher its priority; then the priorities of all samples in the experience replay memory are calculated by equations (9) and (10);
2.2.2.7 use a Sum Tree data structure to extract k experience samples from the experience replay memory, where each leaf node at the bottom layer stores the priority of one experience sample, the value of each parent node equals the sum of the values of its two child nodes, and the root node at the top stores the sum of the priorities of all samples; the specific process is as follows: divide the value of the root node by k to obtain k priority intervals, randomly select one value within each interval, determine by a top-down search which bottom-layer leaf node that value falls into, and select the sample corresponding to that leaf node, thereby obtaining k experience samples;
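One possible minimal sum-tree implementation of the sampling procedure in 2.2.2.7; the array layout and class interface are assumptions, not the patent's data structure.

```python
import random

class SumTree:
    """Binary sum tree: leaves hold sample priorities, each parent holds the
    sum of its two children, and the root holds the total priority."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)        # root at index 1, leaves at capacity..2*capacity-1

    def update(self, idx, priority):
        pos = idx + self.capacity
        self.tree[pos] = priority
        pos //= 2
        while pos >= 1:                           # propagate the new priority up to the root
            self.tree[pos] = self.tree[2 * pos] + self.tree[2 * pos + 1]
            pos //= 2

    def sample(self, k):
        """Split the total priority into k intervals, draw one value per interval,
        and walk top-down to the leaf whose cumulative range contains it."""
        total, indices = self.tree[1], []
        segment = total / k
        for i in range(k):
            value = random.uniform(i * segment, (i + 1) * segment)
            pos = 1
            while pos < self.capacity:            # descend until a leaf is reached
                left = 2 * pos
                if value <= self.tree[left]:
                    pos = left
                else:
                    value -= self.tree[left]
                    pos = left + 1
            indices.append(pos - self.capacity)   # convert tree position to sample index
        return indices

tree = SumTree(capacity=8)
for j, p in enumerate([0.1, 0.5, 0.2, 1.0, 0.05, 0.3, 0.7, 0.15]):
    tree.update(j, p)
print(tree.sample(k=4))
```

Samples with larger TD error (larger priority) occupy wider cumulative ranges and are therefore drawn more often, which is the intended prioritized-replay behaviour.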
2.2.2.8 calculate the target Q value $y_j$ of each experience sample j among the k experience samples according to formula (11), i.e., use the online Q network to select the action with the largest Q value and use the target Q network to evaluate the Q value of that action:

$y_j=r_j+\gamma\,\hat{Q}\big(s_{j+1},\arg\max_{a'}Q(s_{j+1},a';\theta);\theta^-\big)$ (11)
2.2.2.9 define the loss function Loss(θ) as the mean square error between the target Q value $y_j$ and the estimated Q value $Q(s_j,a_j;\theta)$:

$Loss(\theta)=\mathbb{E}\big[(y_j-Q(s_j,a_j;\theta))^2\big]$ (12)

wherein $\mathbb{E}[\cdot]$ denotes the mathematical expectation; then, based on the k experience samples, update the weights θ of the online Q network by stochastic gradient descent so as to minimize the loss function;
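A condensed PyTorch sketch of steps 2.2.2.8–2.2.2.9 under assumed network sizes, batch contents, and optimizer settings; it is meant only to show how the Double DQN target, the loss of equation (12), and the stochastic gradient step fit together.

```python
import torch
import torch.nn as nn

state_dim, num_actions, gamma, k = 16, 32, 0.9, 8

# Online and target Q networks (assumed small multilayer perceptrons)
online_q = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
target_q = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, num_actions))
target_q.load_state_dict(online_q.state_dict())
optimizer = torch.optim.SGD(online_q.parameters(), lr=0.01)

# Assumed mini-batch of k experience samples (s_j, a_j, r_j, s_{j+1})
s = torch.randn(k, state_dim)
a = torch.randint(0, num_actions, (k, 1))
r = torch.randn(k)
s_next = torch.randn(k, state_dim)

with torch.no_grad():
    # Double DQN: the online network selects the action, the target network evaluates it
    best_next = online_q(s_next).argmax(dim=1, keepdim=True)
    y = r + gamma * target_q(s_next).gather(1, best_next).squeeze(1)

q_estimate = online_q(s).gather(1, a).squeeze(1)
loss = nn.functional.mse_loss(q_estimate, y)     # Loss(theta) = E[(y_j - Q(s_j, a_j; theta))^2]

optimizer.zero_grad()
loss.backward()
optimizer.step()                                 # one stochastic gradient step on theta
print(float(loss))
```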
2.2.2.10 if t mod C = 0, copy the updated weights θ of the online Q network into the target Q network to update the target Q network weights θ⁻; otherwise, do not update the target Q network weights θ⁻;

2.2.2.11 if t < T, let t ← t + 1 and return to 2.2.2.2; otherwise, let i ← i + 1 and return to 2.2.2.1; after the training process of the Double DQN algorithm is completed, the optimal weights θ* of the online Q network are obtained; the trained Double DQN algorithm is then deployed on the macro base station for execution, and the execution process is as follows:
2.2.3 initialize t = 1;

2.2.4 the macro base station collects the state $s_t$ of all the small base stations at time slot t and inputs $s_t$ into the trained online Q network to output the Q values of all actions;

2.2.5 select all actions meeting the storage capacity constraint according to the constraint conditions, then select from them the action $a_t$ with the largest Q value and execute it, i.e.,

$a_t=\arg\max_{a'}Q(s_t,a';\theta^*)$ (13)

2.2.6 after the macro base station performs action $a_t$, it obtains an immediate reward $r_t$ and transitions to the next state $s_{t+1}$;

2.2.7 if t < T, let t ← t + 1 and return to 2.2.4; otherwise, the algorithm ends.
3. The cooperative edge caching algorithm based on deep reinforcement learning in the ultra-dense network according to claim 1, wherein the specific steps in the step 3 are as follows:
3.1 after the optimal content caching decision vector $\mathbf{d}_m^{t,*}$ of each small base station m is determined, the bandwidth resource allocation problem of each small base station m is described as a nonlinear integer programming problem P over the decision variables $\beta_{m,u}^t$, $u\in\mathcal{U}_m^t$:

$\min_{\{\beta_{m,u}^t\}}\ \sum_{u\in\mathcal{U}_m^t}\sum_{f=1}^{F}x_{m,u,f}^t\left(y_{m,u,f}^{t,l}D_{m,u,f}^{t,l}+y_{m,u,f}^{t,n}D_{m,u,f}^{t,n}+y_{m,u,f}^{t,c}D_{m,u,f}^{t,c}\right)$

$\text{s.t.}\ \sum_{u\in\mathcal{U}_m^t}\beta_{m,u}^t\le\beta,\qquad \beta_{m,u}^t\in\{1,2,\ldots,\beta\}\ \ \forall u\in\mathcal{U}_m^t$

wherein every delay term depends on its allocation $\beta_{m,u}^t$ only through the download rate $r_{m,u}^t$, so that both the objective function and the constraint function can be expressed as sums of univariate functions of the individual decision variables $\beta_{m,u}^t$; the objective function is a separable concave function on the feasible domain and the constraint is linear, so the problem is a separable concave integer programming problem;
3.2 each small base station adopts an improved branch-and-bound method to solve this separable concave integer programming problem, and the specific flow is as follows:

3.2.1 continuously relax the original problem P, i.e., remove the integer constraints, and linearly approximate the objective function, thereby obtaining the continuous relaxation and linear approximation subproblem LSP of the original problem P, wherein LSP is a separable linear programming problem;
3.2.2 solving the continuous optimal solution of the LSP by utilizing the KKT condition, wherein if the continuous optimal solution is an integer solution, the continuous optimal solution is the optimal solution of the original problem P, and otherwise, the objective function value of the continuous optimal solution is a lower bound of the optimal value of the original problem P;
3.2.3 then branch from the continuous optimal solution, each branch corresponding to one subproblem, and solve the continuous relaxation of each subproblem until a feasible integer solution is found; the objective function value of this feasible integer solution provides an upper bound for the original problem P, and the objective function value of the continuous optimal solution of each subproblem provides a lower bound for the corresponding subproblem; if a branch has no feasible solution, or its continuous optimal solution is an integer solution, or its lower bound exceeds the upper bound, the branch is pruned; for branches that are not pruned, the branching and pruning process is repeated until all branches are pruned; whenever a branch yields a feasible integer solution, the upper bound is updated if necessary so that it equals the minimum objective function value among the feasible integer solutions found so far;

3.2.4 when the algorithm terminates, the best feasible integer solution found is the optimal solution of the original problem P.
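The claimed improved branch-and-bound (continuous relaxation, linear approximation, and KKT-based bounds) is not reproduced here; the following simplified Python sketch only illustrates the bounding-and-pruning idea of step 3.2 on an assumed delay model in which each user's delay is inversely proportional to its number of sub-channels, with a Cauchy–Schwarz relaxation used as the lower bound.

```python
import math

def branch_and_bound(costs, budget):
    """Minimize sum(costs[u] / b[u]) over positive integers b[u] with
    sum(b) <= budget, by depth-first branching on one user at a time and
    pruning branches whose relaxation lower bound exceeds the incumbent."""
    n = len(costs)
    best_value, best_alloc = math.inf, None

    def lower_bound(start, remaining_budget):
        # Continuous relaxation of the unassigned tail: (sum sqrt(c_u))^2 / budget
        s = sum(math.sqrt(c) for c in costs[start:])
        return (s * s) / remaining_budget if remaining_budget > 0 else math.inf

    def dfs(u, remaining_budget, partial_value, alloc):
        nonlocal best_value, best_alloc
        if u == n:
            if partial_value < best_value:
                best_value, best_alloc = partial_value, alloc[:]
            return
        if remaining_budget < n - u:               # not enough sub-channels left
            return
        if partial_value + lower_bound(u, remaining_budget) >= best_value:
            return                                 # lower bound exceeds the incumbent: prune
        max_b = remaining_budget - (n - u - 1)     # keep one sub-channel for each later user
        for b in range(1, max_b + 1):              # branch on the allocation of user u
            alloc.append(b)
            dfs(u + 1, remaining_budget - b, partial_value + costs[u] / b, alloc)
            alloc.pop()

    dfs(0, budget, 0.0, [])
    return best_alloc, best_value

# Toy instance: 3 user equipments with per-user delay constants, 10 sub-channels in total
print(branch_and_bound([4.0, 1.0, 9.0], budget=10))
```

On this toy instance the sketch returns the allocation [3, 2, 5], matching the continuous optimum rounded to the nearest feasible integers.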
CN202010771674.7A 2020-08-04 2020-08-04 Collaborative edge caching algorithm based on deep reinforcement learning in ultra-dense network Active CN111970733B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010771674.7A CN111970733B (en) 2020-08-04 2020-08-04 Collaborative edge caching algorithm based on deep reinforcement learning in ultra-dense network


Publications (2)

Publication Number Publication Date
CN111970733A true CN111970733A (en) 2020-11-20
CN111970733B CN111970733B (en) 2024-05-14

Family

ID=73364237

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010771674.7A Active CN111970733B (en) 2020-08-04 2020-08-04 Collaborative edge caching algorithm based on deep reinforcement learning in ultra-dense network

Country Status (1)

Country Link
CN (1) CN111970733B (en)



Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107079044A (en) * 2014-09-25 2017-08-18 交互数字专利控股公司 The process cached for perception of content and the provided for radio resources management for coordinated multipoint transmission
EP3605329A1 (en) * 2018-07-31 2020-02-05 Commissariat à l'énergie atomique et aux énergies alternatives Connected cache empowered edge cloud computing offloading
US20190138934A1 (en) * 2018-09-07 2019-05-09 Saurav Prakash Technologies for distributing gradient descent computation in a heterogeneous multi-access edge computing (mec) networks
US20190222652A1 (en) * 2019-03-28 2019-07-18 Intel Corporation Sensor network configuration mechanisms
US20190220703A1 (en) * 2019-03-28 2019-07-18 Intel Corporation Technologies for distributing iterative computations in heterogeneous computing environments
CN111447266A (en) * 2020-03-24 2020-07-24 中国人民解放军国防科技大学 Mobile edge calculation model based on chain and service request and scheduling method thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Zhang Kaiyuan, Gui Xiaolin, Ren Dewang, Li Jing, Wu Jie, Ren Dongsheng: "Survey on Computation Offloading and Content Caching in Mobile Edge Networks", Journal of Software, 22 May 2019 (2019-05-22) *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11836093B2 (en) 2020-12-21 2023-12-05 Electronics And Telecommunications Research Institute Method and apparatus for TTL-based cache management using reinforcement learning
CN112887992B (en) * 2021-01-12 2022-08-12 滨州学院 Dense wireless network edge caching method based on access balance core and replacement rate
CN112887992A (en) * 2021-01-12 2021-06-01 滨州学院 Dense wireless network edge caching method based on access balance core and replacement rate
CN113225584A (en) * 2021-03-24 2021-08-06 西安交通大学 Cross-layer combined video transmission method and system based on coding and caching
CN113225584B (en) * 2021-03-24 2022-02-22 西安交通大学 Cross-layer combined video transmission method and system based on coding and caching
CN113094982A (en) * 2021-03-29 2021-07-09 天津理工大学 Internet of vehicles edge caching method based on multi-agent deep reinforcement learning
CN113240324A (en) * 2021-06-02 2021-08-10 中国电子科技集团公司第五十四研究所 Air and space resource collaborative planning method considering communication mechanism
CN113573320A (en) * 2021-07-06 2021-10-29 西安理工大学 SFC deployment method based on improved actor-critic algorithm in edge network
CN113573324A (en) * 2021-07-06 2021-10-29 河海大学 Cooperative task unloading and resource allocation combined optimization method in industrial Internet of things
CN113573320B (en) * 2021-07-06 2024-03-22 西安理工大学 SFC deployment method based on improved actor-critter algorithm in edge network
CN113687960A (en) * 2021-08-12 2021-11-23 华东师范大学 Edge calculation intelligent caching method based on deep reinforcement learning
CN113687960B (en) * 2021-08-12 2023-09-29 华东师范大学 Edge computing intelligent caching method based on deep reinforcement learning
CN114302421A (en) * 2021-11-29 2022-04-08 北京邮电大学 Method and device for generating communication network architecture, electronic equipment and medium
CN114301909A (en) * 2021-12-02 2022-04-08 阿里巴巴(中国)有限公司 Edge distributed management and control system, method, equipment and storage medium
CN114301909B (en) * 2021-12-02 2023-09-22 阿里巴巴(中国)有限公司 Edge distributed management and control system, method, equipment and storage medium
CN114285853A (en) * 2022-01-14 2022-04-05 河海大学 Task unloading method based on end edge cloud cooperation in equipment-intensive industrial Internet of things
CN114285853B (en) * 2022-01-14 2022-09-23 河海大学 Task unloading method based on end edge cloud cooperation in equipment-intensive industrial Internet of things
CN115270867A (en) * 2022-07-22 2022-11-01 北京信息科技大学 Improved DQN fault diagnosis method and system for gas turbine rotor system
CN115499441A (en) * 2022-09-15 2022-12-20 中原工学院 Deep reinforcement learning-based edge computing task unloading method in ultra-dense network

Also Published As

Publication number Publication date
CN111970733B (en) 2024-05-14

Similar Documents

Publication Publication Date Title
CN111970733B (en) Collaborative edge caching algorithm based on deep reinforcement learning in ultra-dense network
Fadlullah et al. HCP: Heterogeneous computing platform for federated learning based collaborative content caching towards 6G networks
Hu et al. Twin-timescale artificial intelligence aided mobility-aware edge caching and computing in vehicular networks
Ye et al. Joint RAN slicing and computation offloading for autonomous vehicular networks: A learning-assisted hierarchical approach
CN111565419A (en) Delay optimization oriented collaborative edge caching algorithm in ultra-dense network
CN112995951B (en) 5G Internet of vehicles V2V resource allocation method adopting depth certainty strategy gradient algorithm
CN113543074B (en) Joint computing migration and resource allocation method based on vehicle-road cloud cooperation
CN110267338A (en) Federated resource distribution and Poewr control method in a kind of D2D communication
CN110769514B (en) Heterogeneous cellular network D2D communication resource allocation method and system
CN112954651B (en) Low-delay high-reliability V2V resource allocation method based on deep reinforcement learning
Abouaomar et al. A deep reinforcement learning approach for service migration in mec-enabled vehicular networks
CN114885426B (en) 5G Internet of vehicles resource allocation method based on federal learning and deep Q network
CN111132074A (en) Multi-access edge computing unloading and frame time slot resource allocation method in Internet of vehicles environment
Qi et al. Energy-efficient resource allocation for UAV-assisted vehicular networks with spectrum sharing
Chua et al. Resource allocation for mobile metaverse with the Internet of Vehicles over 6G wireless communications: A deep reinforcement learning approach
CN115134779A (en) Internet of vehicles resource allocation method based on information age perception
Zhang et al. Two time-scale caching placement and user association in dynamic cellular networks
Sun et al. A DQN-based cache strategy for mobile edge networks
Wu et al. Intelligent content precaching scheme for platoon-based edge vehicular networks
Gao et al. Reinforcement learning based resource allocation in cache-enabled small cell networks with mobile users
Wu et al. Dynamic handoff policy for RAN slicing by exploiting deep reinforcement learning
CN106060876A (en) Load balancing method for heterogeneous wireless network
Dai et al. Contextual multi-armed bandit for cache-aware decoupled multiple association in UDNs: A deep learning approach
Xu et al. Energy-efficient resource allocation for multiuser OFDMA system based on hybrid genetic simulated annealing
Hu et al. An efficient deep reinforcement learning based distributed channel multiplexing framework for V2X communication networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant