CN111970733B - Collaborative edge caching algorithm based on deep reinforcement learning in ultra-dense network - Google Patents


Info

Publication number: CN111970733B (application CN202010771674.7A)
Authority: CN (China)
Prior art keywords: content, time slot, base station, user equipment, network
Legal status: Active (application granted)
Other languages: Chinese (zh)
Other versions: CN111970733A
Inventors: 韩光洁 (Han Guangjie), 张帆 (Zhang Fan)
Assignee: Changzhou Campus of Hohai University
Application filed by Changzhou Campus of Hohai University
Priority to CN202010771674.7A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 28/00 Network traffic management; Network resource management
    • H04W 28/02 Traffic management, e.g. flow control or congestion control
    • H04W 28/10 Flow control between communication endpoints
    • H04W 28/14 Flow control between communication endpoints using intermediate storage
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a cooperative edge caching algorithm based on deep reinforcement learning in an ultra-dense network, which comprises the following specific steps: Step 1: set the parameters of the system model; Step 2: employ the Double DQN algorithm to make an optimal caching decision for each SBS so as to maximize the total content cache hit rate of all SBSs. The algorithm combines the DQN algorithm with the Double Q-learning algorithm, thereby effectively solving the problem that the DQN algorithm overestimates Q values. In addition, the algorithm adopts prioritized experience replay, which speeds up learning; Step 3: employ an improved branch-and-bound method to make an optimal bandwidth resource allocation decision for each SBS so as to minimize the total content download delay of all user equipments. The invention can effectively reduce the content download delay of all users in the ultra-dense network, improve the content cache hit rate and the spectrum resource utilization rate, has good robustness and scalability, and is suitable for large-scale ultra-dense networks with dense users.

Description

Collaborative edge caching algorithm based on deep reinforcement learning in ultra-dense network
Technical Field
The invention relates to a collaborative edge caching algorithm based on deep reinforcement learning in an ultra-dense network, and belongs to the field of edge caching of the ultra-dense network.
Background
In the 5G era, mobile data traffic has exploded with the popularity of smart mobile devices and mobile applications. To meet the 5G requirements of high capacity, high throughput, high user experience rate, high reliability and wide coverage, Ultra-Dense Networks (UDNs) have been developed. A UDN densely deploys low-power small base stations (Small Base Stations, SBSs) in indoor and outdoor hot-spot areas (such as office buildings, shopping malls, subways, airports and tunnels) within the coverage area of a macro base station (Macro Base Station, MBS), so as to improve network capacity and spatial reuse and to cover blind spots that the MBS cannot reach.
However, the SBSs in a UDN are connected to the core network through backhaul links, and as the numbers of SBSs and users increase, backhaul data traffic grows sharply, causing backhaul link congestion and larger service delay, thereby reducing the quality of service (Quality of Service, QoS) and the quality of user experience (Quality of Experience, QoE). Backhaul network problems have therefore become a performance bottleneck limiting the development of UDNs.
In view of the above problems, edge caching has become a promising solution. By caching popular content at the SBSs, users can obtain requested content directly from the local SBS without downloading it from a remote cloud server through the backhaul link, which reduces the traffic load of the backhaul link and the core network, reduces the content download delay, and improves QoS and QoE. However, since the caching capacity of a single SBS is limited, the performance of edge caching may be limited. In order to expand the cache capacity and increase cache diversity, a collaborative edge caching scheme can be adopted: multiple SBSs cache and update content cooperatively and share the cached content with each other, so as to improve the content cache hit rate and reduce the content download delay.
Most existing research on collaborative content caching requires prior knowledge such as the probability distribution of content popularity (e.g., a Zipf distribution) and user preference models. In practice, however, content popularity has complex spatio-temporal dynamics and is usually a non-stationary random process, so it is difficult to accurately predict and model its probability distribution.
Deep Reinforcement Learning (DRL) combines the strong perception ability of deep learning with the strong decision-making ability of reinforcement learning. The most common DRL algorithm is the Deep Q Network (DQN), which approximates the Q function with a Deep Neural Network (DNN) with weights θ, i.e., Q(s, a; θ) ≈ Q*(s, a), called the Q network, and then updates the weights θ by stochastic gradient descent to minimize a loss function. DQN is applicable to environments with large state and action spaces and thus alleviates the curse of dimensionality. However, the conventional DQN algorithm usually overestimates Q values, so the Double DQN algorithm is adopted; based on the Double Q-learning algorithm, it can effectively solve the overestimation problem of DQN. In addition, the conventional DQN algorithm usually extracts experience samples from the experience replay memory by uniform random sampling, i.e., every experience sample has the same probability of being drawn, so that rare but very valuable experience samples are not used efficiently; a Prioritized Experience Replay technique is therefore adopted to solve this sampling problem and speed up learning.
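As a brief illustration of the overestimation issue mentioned above, the following minimal NumPy sketch (not part of the patent) contrasts the standard DQN target, where the target network both selects and evaluates the next action, with the Double DQN target, where the online network selects the action and the target network evaluates it:

```python
import numpy as np

def dqn_target(r, q_next_target, gamma=0.99):
    # Standard DQN: the target network both selects and evaluates the next action;
    # taking the max over noisy estimates tends to overestimate the Q value.
    return r + gamma * np.max(q_next_target)

def double_dqn_target(r, q_next_online, q_next_target, gamma=0.99):
    # Double DQN: the online network selects the action, the target network evaluates it,
    # decoupling selection from evaluation and reducing the overestimation bias.
    a_star = int(np.argmax(q_next_online))
    return r + gamma * q_next_target[a_star]

# Toy per-action Q values of the next state under the two networks.
q_next_online = np.array([1.0, 2.5, 2.0])
q_next_target = np.array([1.2, 2.1, 2.6])
print(dqn_target(0.5, q_next_target))                        # 0.5 + 0.99 * 2.6
print(double_dqn_target(0.5, q_next_online, q_next_target))  # 0.5 + 0.99 * 2.1
```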
Disclosure of Invention
Aiming at the shortcomings of the prior art, the invention provides a collaborative edge caching algorithm based on deep reinforcement learning in an ultra-dense network, which is a centralized algorithm. The algorithm does not need prior knowledge such as the probability distribution of content popularity or a user preference model; instead, it computes content popularity from the instantaneous content requests of users, which simplifies the modeling of content popularity. The MBS is responsible for collecting the local content popularity information of all SBSs and making an optimal caching decision for each SBS, with the goal of maximizing the total content cache hit rate of all SBSs. Finally, after the optimal caching decision of each SBS is determined, each SBS makes an optimal resource allocation decision according to its bandwidth resources, with the goal of minimizing the total content download delay of all user equipments. The algorithm has good robustness and scalability and is suitable for large-scale user-dense UDNs.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
The collaborative edge caching algorithm based on deep reinforcement learning in an ultra-dense network comprises the following steps:
Step 1: setting the parameters of the system model;
Step 2: adopting the Double DQN algorithm to make an optimal caching decision for each SBS so as to maximize the total content cache hit rate of all SBSs, including the total cache hit rate served by the local SBS and the total cache hit rate served by other SBSs. The algorithm combines the DQN algorithm with the Double Q-learning algorithm, thereby effectively solving the problem that DQN overestimates Q values. In addition, the algorithm adopts the prioritized experience replay technique, which speeds up learning;
Step 3: adopting an improved branch-and-bound method to make an optimal bandwidth resource allocation decision for each SBS so as to minimize the total content download delay of all user equipments. The method combines the branch-and-bound method with a linear lower approximation method and is suitable for large-scale separable concave integer programs with many decision variables.
Preferably, the specific steps of the step 1 are as follows:
1.1 Setting up the network model: the model comprises three layers, namely a user equipment layer, an MEC layer and a cloud layer. The user equipment layer contains a number of user equipments (UEs), and each UE can only be connected to one SBS. The MEC layer contains M SBSs and one MBS; the MBS covers all SBSs, each SBS covers multiple UEs (each SBS represents a small cell), and the coverage areas of the SBSs do not overlap. Each SBS is deployed with one MEC server m ∈ M whose storage capacity is sc_m, and the storage capacities of all MEC servers form the storage capacity vector sc = [sc_1, sc_2, ..., sc_M]. The MEC server is responsible for providing edge caching resources for the UEs and, at the same time, for collecting the status information of its small cell (such as the size and popularity of each requested content and the channel gains) and transmitting it to the MBS; the SBSs can communicate with each other through the MBS and share their cached content. The MBS is responsible for collecting the status information of each SBS and making caching decisions for all SBSs, and it is connected to the cloud layer through a core backbone network (e.g., an optical fiber backhaul link). The cloud layer comprises several cloud servers with abundant computing and caching resources and caches all contents;
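For concreteness, the system-model parameters introduced in step 1 can be grouped as in the following illustrative Python sketch; all names and numerical values are assumptions chosen for illustration and are not taken from the patent:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SystemModel:
    """Illustrative container for the Step-1 system parameters (all values are examples)."""
    M: int = 4              # number of SBSs / MEC servers
    F: int = 50             # number of contents
    U: int = 40             # number of UEs sending requests
    T: int = 1000           # number of time slots
    B: float = 20e6         # total frequency bandwidth in Hz
    beta: int = 64          # number of orthogonal sub-channels
    sc: np.ndarray = None   # storage capacity vector sc = [sc_1, ..., sc_M] (MB)
    z: np.ndarray = None    # content size vector z = [z_1, ..., z_F] (MB)

    def __post_init__(self):
        rng = np.random.default_rng(0)
        if self.sc is None:
            self.sc = np.full(self.M, 500.0)            # each MEC server stores up to 500 MB
        if self.z is None:
            self.z = rng.uniform(10.0, 100.0, self.F)   # contents of different sizes

model = SystemModel()
print(model.sc.shape, model.z.shape)
```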
1.2 The whole time axis is divided into T time slots of equal length, where t ∈ T denotes the time slot index. A quasi-static model is adopted, i.e., within one time slot all system state parameters (such as the popularity of each content, the positions of the user equipments and the channel gains) remain unchanged, while they may differ between time slots;
1.3 Setting the content popularity model: there are F contents in total, each content f ∈ F has size z_f, the contents have different sizes, and the sizes of all contents form the content size vector z = [z_1, z_2, ..., z_f, ..., z_F]. The popularity of content f in cell m at time slot t is defined as p^t_{m,f}. Let R^t_{m,f} denote the total number of requests for content f in cell m at time slot t and R^t_m the total number of content requests of all UEs in cell m at time slot t; thus p^t_{m,f} = R^t_{m,f} / R^t_m. The popularities of all contents within cell m form the content popularity vector p^t_m = [p^t_{m,1}, ..., p^t_{m,F}], and the content popularity vectors of all cells form the content popularity matrix p^t;
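A minimal sketch of this popularity computation (array and symbol names are assumptions; the popularity of a content in a cell is simply its share of that cell's requests in the slot):

```python
import numpy as np

def content_popularity(requests):
    """Local content popularity at one time slot.

    requests: binary array of shape (M, U, F); requests[m, u, f] = 1 if UE u in
    cell m requested content f in this slot (cf. the request model in 1.4).
    Returns p[m, f] = (requests for f in cell m) / (all requests in cell m).
    """
    per_content = requests.sum(axis=1).astype(float)    # R^t_{m,f}
    per_cell = per_content.sum(axis=1, keepdims=True)   # R^t_m
    return np.divide(per_content, per_cell,
                     out=np.zeros_like(per_content),
                     where=per_cell > 0)

rng = np.random.default_rng(1)
requests = (rng.random((3, 10, 20)) < 0.1).astype(int)  # 3 cells, 10 UEs, 20 contents
p_t = content_popularity(requests)
print(p_t.shape, p_t.sum(axis=1))                        # rows sum to 1 where requests exist
```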
1.4 Setting the content request model: a total of U UEs send content requests. The set of all UEs that send content requests in cell m at time slot t is defined as 𝒰^t_m, and the number of UEs sending content requests in cell m at time slot t is U^t_m = |𝒰^t_m|. It is assumed that each UE requests each content at most once in time slot t. The content request vector of each UE u ∈ 𝒰^t_m in cell m at time slot t is defined as q^t_{m,u} = [q^t_{m,u,1}, ..., q^t_{m,u,F}], where each element q^t_{m,u,f} = 1 indicates that UE u in cell m requests content f at time slot t, and q^t_{m,u,f} = 0 indicates that it does not. The content request vectors of all UEs in cell m at time slot t form the content request matrix q^t_m;
1.5 Setting the cache model: the content caching decision vector to be maintained in the cache of each MEC server m at time slot t is defined as d^t_m = [d^t_{m,1}, ..., d^t_{m,F}], where each element d^t_{m,f} = 1 indicates that content f is cached on MEC server m at time slot t and d^t_{m,f} = 0 indicates that it is not; the total size of the cached content in each MEC server cannot exceed its storage capacity, i.e., Σ_{f∈F} d^t_{m,f} z_f ≤ sc_m. The content caching decision vectors of all MEC servers form the content caching decision matrix d^t;
1.6 Setting up the communication model: it is assumed that all SBSs operate on the same frequency band with bandwidth B, and that the MBS and the SBSs communicate over wired optical fiber, so the data transmission rate between an SBS and the MBS is high. Using orthogonal frequency division multiplexing, the bandwidth B is divided into β orthogonal sub-channels, each of bandwidth B/β, and each UE u in cell m at time slot t can be allocated a number β^t_{m,u} of orthogonal sub-channels. Since the coverage areas of the SBSs do not overlap, there is no co-channel interference between different SBSs or between different UEs of the same SBS. The downlink SNR between UE u and its local SBS m at time slot t is defined as
SNR^t_{m,u} = P^t_m h^t_{m,u} / σ²,
where P^t_m denotes the transmit power of SBS m at time slot t, h^t_{m,u} = (l^t_{m,u})^(-μ) denotes the channel gain between SBS m and UE u at time slot t, l^t_{m,u} denotes the distance between SBS m and UE u at time slot t, μ denotes the path loss factor, and σ² denotes the variance of the additive white Gaussian noise. Thus, the download rate between UE u and the local SBS m at time slot t is defined as
r^t_{m,u} = β^t_{m,u} (B/β) log₂(1 + SNR^t_{m,u}).
The data transmission rate between each SBS m and the MBS n is a constant r_{m,n}, and the data transmission rate between the MBS n and the cloud server c is a constant r_{n,c}. Thus, the download delay required for UE u to obtain content f from the local MEC server m at time slot t is defined as
D^{t,m}_{u,f} = z_f / r^t_{m,u}.
The download delay required for UE u to obtain content f from another (non-local) MEC server −m at time slot t is defined as
D^{t,−m}_{u,f} = z_f / r_{−m,n} + z_f / r_{m,n} + z_f / r^t_{m,u},
since the content traverses the backhaul links from SBS −m to the MBS and from the MBS to SBS m before the wireless downlink. The download delay required for UE u to obtain content f from the cloud server c at time slot t is defined as
D^{t,c}_{u,f} = z_f / r_{n,c} + z_f / r_{m,n} + z_f / r^t_{m,u}.
Thus, the download delay experienced by UE u for content f at time slot t depends on whether the content is served by the local MEC server, by another MEC server, or by the cloud server.
1.7 Setting the content delivery model: the basic process of content delivery is that each UE independently requests a number of contents from its local MEC server. If a requested content is cached in the local MEC server, it is transmitted directly to the UE by the local MEC server; if it is not cached in the local MEC server, it can be obtained from the MEC server of another SBS through the MBS and then transmitted to the UE by the local MEC server; if no MEC server caches the content, the content is relayed from the cloud server to the MBS through the core network, then transmitted from the MBS to the local MEC server, and finally delivered by the local MEC server to the UE. Whether UE u obtains content f from the local MEC server m at time slot t is defined as the binary variable x^{t,m}_{u,f}, where x^{t,m}_{u,f} = 1 indicates that UE u obtains content f from the local server m at time slot t and x^{t,m}_{u,f} = 0 otherwise; whether UE u obtains content f from a non-local server −m at time slot t is defined as the binary variable x^{t,−m}_{u,f}, where x^{t,−m}_{u,f} = 1 indicates that UE u obtains content f from the non-local server −m at time slot t and x^{t,−m}_{u,f} = 0 otherwise; whether UE u obtains content f from the cloud server c at time slot t is defined as the binary variable x^{t,c}_{u,f}, where x^{t,c}_{u,f} = 1 indicates that UE u obtains content f from the cloud server c at time slot t and x^{t,c}_{u,f} = 0 otherwise.
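The following Python sketch illustrates the communication and delivery model above; the channel-gain model, the parameter values and the exact delay expressions for the non-local and cloud cases are assumptions made for illustration:

```python
import math

def downlink_rate(n_subch, subch_bw, tx_power, distance, mu=3.0, noise_var=1e-13):
    """Shannon-type downlink rate between a UE and its local SBS (bit/s).

    A sketch of the communication model in 1.6: the channel gain is modelled as a
    simple distance-based path loss d^(-mu), and the rate scales with the number
    of allocated orthogonal sub-channels. All parameter values are illustrative.
    """
    gain = distance ** (-mu)
    snr = tx_power * gain / noise_var
    return n_subch * subch_bw * math.log2(1.0 + snr)

def download_delay(z_f, rate_ue, r_sbs_mbs, r_mbs_cloud, source):
    """Delay to deliver a content of size z_f (bits) depending on where it is cached (1.7).

    source: 'local'    -> served directly by the local MEC server,
            'neighbor' -> fetched from another SBS via the MBS (two backhaul hops),
            'cloud'    -> fetched from the cloud server via the MBS.
    The hop structure follows the delivery model; the exact expressions are assumptions.
    """
    if source == "local":
        return z_f / rate_ue
    if source == "neighbor":
        return z_f / r_sbs_mbs + z_f / r_sbs_mbs + z_f / rate_ue
    if source == "cloud":
        return z_f / r_mbs_cloud + z_f / r_sbs_mbs + z_f / rate_ue
    raise ValueError("unknown source")

rate = downlink_rate(n_subch=4, subch_bw=312.5e3, tx_power=0.2, distance=50.0)
content_bits = 8e6 * 8                                   # an 8 MB content
for src in ("local", "neighbor", "cloud"):
    print(src, download_delay(content_bits, rate, r_sbs_mbs=1e9, r_mbs_cloud=1e8, source=src))
```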
Preferably, the specific steps of the Double DQN algorithm in step 2 are as follows:
2.1 The content caching decision problem of the M SBSs is described as a Constrained Markov Decision Process (CMDP) problem, which can be expressed by the tuple ⟨S, A, r, Pr, c_1, c_2, ..., c_M⟩; the optimization objective is to maximize the long-term cumulative discounted reward of all SBSs, where
2.1.1 S denotes the state space; s_t ∈ S denotes the state set of all SBSs at time slot t, i.e., the content popularity matrix p^t formed by the content popularity vectors of all SBSs at time slot t, so s_t = p^t;
2.1.2 A denotes the action space; a_t ∈ A denotes the action selected by the MBS at time slot t, i.e., a_t = d^t;
2.1.3 r denotes the reward function; r_t(s_t, a_t) denotes the immediate reward obtained after the MBS performs action a_t in state s_t, and
r_t(s_t, a_t) = w_1 H^t_loc + w_2 H^t_oth,
where w_1 and w_2 are weights satisfying w_1 + w_2 = 1 and w_1 > w_2, with w_1 = 0.9 and w_2 = 0.1; H^t_loc denotes the total cache hit rate served by the local SBS m and H^t_oth denotes the total cache hit rate served by the non-local SBSs −m (a small sketch of this reward and of the capacity constraint is given after this list);
2.1.4 Pr denotes the state transition function, i.e., the probability Pr(s_{t+1} | s_t, a_t) that the MBS transitions from the current state s_t to the next state s_{t+1} after performing action a_t;
2.1.5 c_1, c_2, ..., c_M denote the constraints of the M SBSs, meaning that the total size of the cached content of each SBS must not exceed its storage capacity sc_m, i.e., Σ_{f∈F} d^t_{m,f} z_f ≤ sc_m.
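A small sketch of the per-SBS capacity constraint and the weighted hit-rate reward described above (the hit-rate definition here, as the fraction of requests served locally versus by other SBSs, is an assumption consistent with the text):

```python
import numpy as np

def cache_constraint_ok(d_m, z, sc_m):
    """Constraint c_m: the total size of content cached on SBS m must not exceed sc_m."""
    return float(np.dot(d_m, z)) <= sc_m

def reward(hit_local, hit_other, total_requests, w1=0.9, w2=0.1):
    """Weighted cache-hit-rate reward used by the MBS (a sketch of the reward in 2.1.3).

    hit_local / hit_other: numbers of requests served by the local SBS and by other
    SBSs in the slot; total_requests: all content requests in the slot.
    Local hits are weighted more heavily (w1 > w2, w1 + w2 = 1).
    """
    if total_requests == 0:
        return 0.0
    return w1 * hit_local / total_requests + w2 * hit_other / total_requests

d_m = np.array([1, 0, 1, 1, 0])                   # caching decision vector of one SBS
z = np.array([120.0, 80.0, 60.0, 200.0, 90.0])    # content sizes
print(cache_constraint_ok(d_m, z, sc_m=500.0))    # True: 380 <= 500
print(reward(hit_local=30, hit_other=12, total_requests=60))
```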
2.2 The Double DQN algorithm is adopted. Its training process is similar to that of the DQN algorithm and includes an online Q network and a target Q network; the difference is that Double DQN decomposes the max operation on the target Q value in DQN into action selection and action evaluation, i.e., the action is selected with the online Q network and evaluated with the target Q network. The Double DQN algorithm includes two processes, a training process and an execution process. The training process is as follows (a compact sketch of one training iteration is given after these steps):
2.2.1 In the initialization phase of the algorithm: initialize the capacity N of the experience replay memory, the sampling batch size k (N > k), the experience replay period K (i.e., the sampling period), the weights θ of the online Q network Q, the weights θ⁻ = θ of the target Q network, the learning rate α, the discount factor γ, the parameter ε of the ε-greedy strategy, the interval C for updating the target Q network parameters, the total number of training episodes EP, and the total number of time slots T (T > N) contained in each episode; define the episode index as i and initialize i = 1;
2.2.2 If i ≤ EP, go to 2.2.2.1; otherwise, training ends:
2.2.2.1 Initialize t = 1;
2.2.2.2 Input the current state s_t into the online Q network to output the Q values of all actions, then select, according to the constraints, all actions that satisfy the storage capacity requirement, and from these choose an action a_t with the ε-greedy strategy and execute it; under the ε-greedy strategy the agent selects a random action with a small probability ε in each time slot and selects the action with the highest Q value with the larger probability 1 − ε;
2.2.2.3 After performing action a_t, the agent obtains the immediate reward r_t and transitions to the next state s_{t+1}, and then stores the experience sample e_t = (s_t, a_t, r_t, s_{t+1}) in the experience replay memory;
2.2.2.4 If t < N, let t ← t + 1 and return to 2.2.2.2; otherwise, go to 2.2.2.5;
2.2.2.5 If t % K == 0, go to 2.2.2.6; otherwise, let t ← t + 1 and return to 2.2.2.2;
2.2.2.6 Suppose an experience sample j in the experience replay memory is e_j = (s_j, a_j, r_j, s_{j+1}); the priority of experience sample j is defined as
p_j = |δ_j| + ϵ   (9)
where ϵ > 0 ensures that every sample has a non-zero priority, and δ_j denotes the temporal difference (TD) error of sample j, i.e., the difference between the target Q value and the estimated Q value of sample j. The Double DQN algorithm uses the online Q network to select the action with the largest Q value and uses the target Q network to evaluate that action, i.e.,
δ_j = r_j + γ Q(s_{j+1}, argmax_{a′} Q(s_{j+1}, a′; θ); θ⁻) − Q(s_j, a_j; θ)   (10)
Therefore, the larger the TD error of a sample, the larger its priority. The priorities of all samples in the experience replay memory are then calculated through formulas (9) and (10);
2.2.2.7 employ a Sum Tree data structure to extract k experience samples from the experience playback memory, where each leaf node at the bottom layer represents the priority of each experience sample, the value of each parent node is equal to the Sum of the values of the two child nodes, and the root node at the top layer represents the Sum of the priorities of all samples. The specific process is as follows: firstly, dividing the value of a root node by k to obtain k priority intervals, then randomly selecting a value in each interval, judging which leaf node at the bottom layer corresponds to the value through searching from top to bottom, and selecting a sample corresponding to the leaf node to obtain k experience samples;
2.2.2.8 For each of the k experience samples j, calculate the target Q value y_j according to equation (11), i.e., use the online Q network to select the action with the largest Q value and use the target Q network to evaluate that action:
y_j = r_j + γ Q(s_{j+1}, argmax_{a′} Q(s_{j+1}, a′; θ); θ⁻)   (11)
2.2.2.9 Define the loss function Loss(θ) as the mean square error between the target Q value y_j and the estimated Q value Q(s_j, a_j; θ), i.e.,
Loss(θ) = E[(y_j − Q(s_j, a_j; θ))²]   (12)
where E[·] denotes mathematical expectation. Then, based on the k experience samples, update the weights θ of the online Q network by stochastic gradient descent to minimize the loss function;
2.2.2.10 If t % C == 0, copy the updated weights θ of the online Q network to the target Q network to update its weights θ⁻; otherwise, the weights θ⁻ of the target Q network are not updated;
2.2.2.11 If t < T, let t ← t + 1 and return to 2.2.2.2; otherwise, let i ← i + 1 and return to 2.2.2.1.
After the training process of the Double DQN algorithm is completed, the optimal weights θ* of the online Q network are obtained, and the trained Double DQN algorithm is then deployed on the MBS for execution. The execution process is as follows (a minimal sketch of this phase follows the steps below):
2.2.3 Initialize t = 1;
2.2.4 The MBS collects the state set s_t of all SBSs at time slot t and inputs s_t into the trained online Q network to output the Q values of all actions;
2.2.5 Select, according to the constraints, all actions that satisfy the storage capacity requirement, then select from these the action a_t with the maximum Q value and execute it, i.e.,
a_t = argmax_{a′} Q(s_t, a′; θ*)   (13)
2.2.6 After the MBS performs action a_t, it obtains the immediate reward r_t and transitions to the next state s_{t+1};
2.2.7 If t < T, let t ← t + 1 and return to 2.2.4; otherwise, the algorithm ends.
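A minimal sketch of this execution phase, with a feasibility mask standing in for the storage-capacity check (all values are illustrative):

```python
import numpy as np

def execute_caching_decision(q_values, feasible_mask):
    """Execution phase (2.2.3-2.2.7): the trained online Q network scores all caching
    actions for state s_t and the MBS picks the feasible action with the largest Q value
    (Eq. (13)); no exploration is used at this stage."""
    q = np.where(feasible_mask, q_values, -np.inf)   # rule out actions violating sc_m
    return int(np.argmax(q))

q_values = np.array([0.3, 1.7, 0.9, 1.2])            # Q(s_t, a'; theta*) for 4 candidate actions
feasible = np.array([True, False, True, True])       # action 1 would exceed the storage capacity
print(execute_caching_decision(q_values, feasible))  # -> 3
```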
Preferably, the specific steps of step 3 are as follows:
3.1 After the optimal content caching decision vector d^t*_m of each SBS m has been determined, the bandwidth resource allocation problem of each SBS is described as a nonlinear integer programming problem P over the numbers of sub-channels β^t_{m,u} allocated to its requesting UEs: minimize the total content download delay of all UEs in the cell, subject to the total number of allocated sub-channels not exceeding β and each β^t_{m,u} being a positive integer. Both the objective function and the constraint function can be expressed as sums of univariate functions of the individual decision variables β^t_{m,u}, i.e., in the separable forms Σ_u f_u(β^t_{m,u}) and Σ_u g_u(β^t_{m,u}). Within the feasible domain the objective function is a separable concave function and the constraint is linear, so the problem is a separable concave integer programming problem (an illustrative sketch of the branch-and-bound procedure is given after step 3.2.4 below);
3.2 each SBS adopts an improved branch-and-bound method to solve the separable concave integer programming problem, and the method comprises the following specific procedures:
3.2.1 The original problem P is first continuously relaxed (the integer constraints are removed) and the objective function is linearly approximated, yielding the continuous relaxation and linear approximation subproblem LSP of the original problem P, where LSP is a separable linear programming problem;
3.2.2 The continuous optimal solution of LSP is solved using the KKT conditions; if this continuous optimal solution is integral, it is the optimal solution of the original problem P, otherwise its objective function value is a lower bound on the optimal value of the original problem P;
3.2.3 Branching is then performed from the continuous optimal solution, where each branch corresponds to a subproblem, and the continuous relaxations of the subproblems are solved until a feasible integer solution is found; the objective function value of this feasible integer solution provides an upper bound for the original problem P, while the objective function value of the continuous optimal solution of each subproblem provides a lower bound for the corresponding subproblem. A branch can be pruned if it has no feasible solution, if its continuous optimal solution is integral, or if its lower bound exceeds the upper bound. The branching and pruning steps are repeated for the branches that have not been pruned until all branches are pruned. Whenever a branch yields a feasible integer solution, the upper bound is updated if necessary so that it equals the minimum objective function value among the feasible integer solutions found so far;
3.2.4 at the end of the algorithm, the best feasible integer solution at present is the optimal solution of the original problem P.
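The following self-contained sketch illustrates the branch-and-bound skeleton of 3.2 on a toy per-SBS sub-channel allocation instance. For simplicity the toy objective (a sum of delay-like terms z_u / x_u) is convex, so each node's continuous relaxation is solved directly with a generic NLP solver; the patent's improved method instead handles a separable concave objective by bounding it with a piecewise-linear lower approximation. All names and numbers are assumptions:

```python
import math
import numpy as np
from scipy.optimize import minimize

# Toy instance: minimise sum_u z[u] / x[u] over integer sub-channel counts x[u] >= 1
# with sum_u x[u] <= BETA (an illustrative stand-in for problem P).
z = np.array([40.0, 10.0, 25.0])
BETA = 12

def objective(x):
    return float(np.sum(z / x))

def solve_relaxation(lower, upper):
    """Continuous relaxation of a node: integrality dropped, box bounds kept."""
    x0 = np.clip(np.full(len(z), BETA / len(z)), lower, upper)
    res = minimize(objective, x0, method="SLSQP",
                   bounds=list(zip(lower, upper)),
                   constraints=[{"type": "ineq", "fun": lambda x: BETA - np.sum(x)}])
    return (res.x, res.fun) if res.success else (None, math.inf)

def branch_and_bound():
    best_x, best_val = None, math.inf                    # incumbent = current upper bound
    stack = [(np.ones(len(z)), np.full(len(z), float(BETA)))]
    while stack:
        lower, upper = stack.pop()
        x, val = solve_relaxation(lower, upper)
        if x is None or val >= best_val:                 # infeasible or bound-dominated: prune
            continue
        frac = int(np.argmax(np.abs(x - np.round(x))))   # most fractional variable
        if abs(x[frac] - round(x[frac])) < 1e-6:         # integral relaxation: new incumbent
            best_x, best_val = np.round(x).astype(int), val
            continue
        lo_up = upper.copy()
        lo_up[frac] = math.floor(x[frac])                # branch: x[frac] <= floor
        hi_lo = lower.copy()
        hi_lo[frac] = math.ceil(x[frac])                 # branch: x[frac] >= ceil
        stack.append((lower, lo_up))
        stack.append((hi_lo, upper))
    return best_x, best_val

x_star, val_star = branch_and_bound()
print("optimal integer allocation:", x_star, "objective:", val_star)
```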
The beneficial effects are that: the invention provides a collaborative edge caching algorithm based on deep reinforcement learning in an ultra-dense network, which can effectively reduce content downloading delay of all users in the ultra-dense network, improve content caching hit rate and spectrum resource utilization rate, has good robustness and expandability, and is suitable for large-scale user-dense ultra-dense networks.
Drawings
FIG. 1 is a network model of the UDN of step 1.1 employing edge caching;
fig. 2 is a schematic diagram illustrating the extraction of k samples using the data structure Sum Tree in step 2.2.2.7.
Detailed Description
In order that those skilled in the art may better understand the technical solutions of the present application, the technical solutions in the embodiments of the present application are described below clearly and completely. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of the present application.
The cooperative edge caching algorithm based on deep reinforcement learning in the ultra-dense network comprises the following specific steps:
Step 1: setting the parameters of the system model;
Step 2: adopting the Double DQN algorithm to make an optimal caching decision for each SBS so as to maximize the total content cache hit rate of all SBSs, including the total cache hit rate served by the local SBS and the total cache hit rate served by other SBSs. The algorithm combines the DQN algorithm with the Double Q-learning algorithm, thereby effectively solving the problem that DQN overestimates Q values. In addition, the algorithm adopts the prioritized experience replay technique, which speeds up learning;
Step 3: adopting an improved branch-and-bound method to make an optimal bandwidth resource allocation decision for each SBS so as to minimize the total content download delay of all user equipments. The method combines the branch-and-bound method with a linear lower approximation method and is suitable for large-scale separable concave integer programs with many decision variables. The detailed procedures of steps 1 to 3 are as described above.
The methods mentioned in the present invention all belong to conventional technical means known to the person skilled in the art and are not described in detail.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (3)

1. The cooperative edge caching algorithm based on deep reinforcement learning in the ultra-dense network is characterized by comprising the following specific steps of:
Step 1: setting the parameters of the system model;
1.1 Setting up the network model: the model comprises three layers, namely a user equipment layer, an MEC layer and a cloud layer; the user equipment layer comprises a plurality of user equipments, and each user equipment can only be connected to one small base station; the MEC layer comprises M small base stations and one macro base station, the macro base station covers all the small base stations, each small base station covers a plurality of user equipments, each small base station represents a small cell, and the coverage areas of the small base stations do not overlap; each small base station is deployed with an MEC server m ∈ M whose storage capacity is sc_m, and the storage capacities of all MEC servers form the storage capacity vector sc = [sc_1, sc_2, ..., sc_M]; the MEC server is responsible for providing edge caching resources for the user equipments and, at the same time, for collecting the status information of its small cell and transmitting it to the macro base station, and the small base stations communicate with each other through the macro base station and share cached content; the macro base station is responsible for collecting the status information of each small base station and making caching decisions for all the small base stations, and is connected to the cloud layer through a core backbone network; the cloud layer comprises a plurality of cloud servers with abundant computing and caching resources and caches all contents;
1.2 Dividing the whole time axis into T time slots of equal length, where t ∈ T denotes the time slot index; a quasi-static model is adopted, i.e., within one time slot all system state parameters remain unchanged, while they may differ between time slots;
1.3 Setting the content popularity model: there are F contents in total, each content f ∈ F has size z_f, the contents have different sizes, and the sizes of all contents form the content size vector z = [z_1, z_2, ..., z_f, ..., z_F]; the popularity of content f in cell m at time slot t is defined as p^t_{m,f}; the total number of requests for content f in cell m at time slot t is R^t_{m,f}, and the total number of content requests of all user equipments in cell m at time slot t is R^t_m, so p^t_{m,f} = R^t_{m,f} / R^t_m; the popularities of all contents within cell m form the content popularity vector p^t_m = [p^t_{m,1}, ..., p^t_{m,F}], and the content popularity vectors of all cells form the content popularity matrix p^t;
1.4 Setting the content request model: a total of U user equipments send content requests; the set of all user equipments sending content requests in cell m at time slot t is defined as 𝒰^t_m, and the number of user equipments sending content requests in cell m at time slot t is U^t_m = |𝒰^t_m|; it is assumed that each user equipment requests each content at most once in time slot t; the content request vector of each user equipment u ∈ 𝒰^t_m in cell m at time slot t is defined as q^t_{m,u} = [q^t_{m,u,1}, ..., q^t_{m,u,F}], where each element q^t_{m,u,f} = 1 indicates that user equipment u in cell m requests content f at time slot t and q^t_{m,u,f} = 0 indicates that it does not; the content request vectors of all user equipments in cell m at time slot t form the content request matrix q^t_m;
1.5 Setting the cache model: the content caching decision vector to be maintained in the cache of each MEC server m at time slot t is defined as d^t_m = [d^t_{m,1}, ..., d^t_{m,F}], where each element d^t_{m,f} = 1 indicates that content f is cached on MEC server m at time slot t and d^t_{m,f} = 0 indicates that it is not, and the total size of the cached content in each MEC server cannot exceed its storage capacity, i.e., Σ_{f∈F} d^t_{m,f} z_f ≤ sc_m; the content caching decision vectors of all MEC servers form the content caching decision matrix d^t;
1.6 Setting up the communication model: it is assumed that all small base stations operate on the same frequency band with bandwidth B, and that the macro base station and the small base stations communicate over wired optical fiber, so the data transmission rate between the small base stations and the macro base station is very high; using orthogonal frequency division multiplexing, the bandwidth B is divided into β orthogonal sub-channels, each of bandwidth B/β, and each user equipment u in cell m at time slot t can be allocated a number β^t_{m,u} of orthogonal sub-channels; because the coverage areas of the small base stations do not overlap, there is no co-channel interference between different small base stations or between different user equipments of the same small base station; the downlink SNR between user equipment u and the local small base station m at time slot t is defined as
SNR^t_{m,u} = P^t_m h^t_{m,u} / σ²,
where P^t_m denotes the transmit power of the small base station m at time slot t, h^t_{m,u} = (l^t_{m,u})^(-μ) denotes the channel gain between the small base station m and user equipment u at time slot t, l^t_{m,u} denotes the distance between the small base station m and user equipment u at time slot t, μ denotes the path loss factor, and σ² denotes the variance of the additive white Gaussian noise; the download rate between user equipment u and the local small base station m at time slot t is defined as
r^t_{m,u} = β^t_{m,u} (B/β) log₂(1 + SNR^t_{m,u});
the data transmission rate between each small base station m and the macro base station n is a constant r_{m,n}, and the data transmission rate between the macro base station n and the cloud server c is a constant r_{n,c}; the download delay required for user equipment u to obtain content f from the local MEC server m at time slot t is defined as
D^{t,m}_{u,f} = z_f / r^t_{m,u};
the download delay required for user equipment u to obtain content f from another non-local MEC server −m at time slot t is defined as
D^{t,−m}_{u,f} = z_f / r_{−m,n} + z_f / r_{m,n} + z_f / r^t_{m,u};
the download delay required for user equipment u to obtain content f from the cloud server c at time slot t is defined as
D^{t,c}_{u,f} = z_f / r_{n,c} + z_f / r_{m,n} + z_f / r^t_{m,u};
thus, the download delay experienced by user equipment u for content f at time slot t depends on whether the content is served by the local MEC server, by another MEC server, or by the cloud server;
1.7 Setting a content delivery model: the basic process of content delivery is that each user equipment independently requests a plurality of contents from a local MEC server, and if the contents are cached in a cache area of the local MEC server, the contents are directly transmitted to the user equipment by the local MEC server; if the content is not cached in the local MEC server, the content can be acquired from the MEC servers of other small base stations through the macro base station and then transmitted to the user equipment by the local MEC server; if all MEC servers do not cache the content, relaying the content from the cloud server to the macro base station through the core network, transmitting the content to the local MEC server through the macro base station, and finally delivering the content to the user equipment through the local MEC server;
Whether user equipment u obtains content f from the local MEC server m at time slot t is defined as the binary variable x^{t,m}_{u,f}, where x^{t,m}_{u,f} = 1 indicates that user equipment u obtains content f from the local server m at time slot t and x^{t,m}_{u,f} = 0 otherwise; whether user equipment u obtains content f from a non-local server −m at time slot t is defined as the binary variable x^{t,−m}_{u,f}, where x^{t,−m}_{u,f} = 1 indicates that user equipment u obtains content f from the non-local server −m at time slot t and x^{t,−m}_{u,f} = 0 otherwise; whether user equipment u obtains content f from the cloud server c at time slot t is defined as the binary variable x^{t,c}_{u,f}, where x^{t,c}_{u,f} = 1 indicates that user equipment u obtains content f from the cloud server c at time slot t and x^{t,c}_{u,f} = 0 otherwise;
Step 2: adopting a Double DQN algorithm to make an optimal cache decision for each small base station so as to maximize the total content cache hit rate of all small base stations, including the total cache hit rate hit by a local small base station and the total cache hit rate hit by other small base stations;
step 3: an improved branch-and-bound approach is employed to make optimal bandwidth resource allocation decisions for each small base station to minimize the total content download delay for all user equipment.
2. The collaborative edge caching algorithm based on deep reinforcement learning in an ultra dense network according to claim 1, wherein the specific steps of the Double DQN algorithm in step 2 are as follows:
2.1 The content caching decision problem of the M small base stations is described as a constrained Markov decision process problem expressed by the tuple ⟨S, A, r, Pr, c_1, c_2, ..., c_M⟩, with the objective of maximizing the long-term cumulative discounted reward of all the small base stations, where
S denotes the state space; s_t ∈ S denotes the state set of all the small base stations at time slot t, i.e., the content popularity matrix p^t formed by the content popularity vectors of all the small base stations at time slot t, so s_t = p^t;
A denotes the action space, and a_t ∈ A denotes the action selected by the macro base station at time slot t, i.e., a_t = d^t;
r denotes the reward function; r_t(s_t, a_t) denotes the immediate reward obtained after the macro base station performs action a_t in state s_t, and
r_t(s_t, a_t) = w_1 H^t_loc + w_2 H^t_oth,
where w_1 and w_2 are weights satisfying w_1 + w_2 = 1 and w_1 > w_2, with w_1 = 0.9 and w_2 = 0.1; H^t_loc denotes the total cache hit rate served by the local small base station m and H^t_oth denotes the total cache hit rate served by the non-local small base stations −m;
Pr denotes the state transition function, i.e., the probability Pr(s_{t+1} | s_t, a_t) that the macro base station transitions from the current state s_t to the next state s_{t+1} after performing action a_t;
c_1, c_2, ..., c_M denote the constraints of the M small base stations, namely that the total size of the cached content of each small base station must not exceed its storage capacity sc_m, i.e., Σ_{f∈F} d^t_{m,f} z_f ≤ sc_m;
2.2 The Double DQN algorithm includes two processes, namely a training process and an execution process, wherein the training process is as follows:
2.2.1 In the initialization phase of the algorithm: initialize the capacity N of the experience replay memory, the sampling batch size k (N > k) and the experience replay period K, namely the sampling period; initialize the weight θ of the online Q network Q and set the weight of the target Q network to θ⁻ = θ; initialize the learning rate α, the discount factor γ, the parameter ε of the ε-greedy strategy, the time interval C for updating the target Q network parameters, the total number of training episodes EP and the total number T (T > N) of time slots contained in each episode; define the episode index as i and initialize i = 1;
2.2.2 For each episode i ∈ {1, 2, ..., EP}, the following steps are performed:
2.2.2.1 initializing t=1;
2.2.2.2 Input the current state s_t into the online Q network so as to output the Q values of all actions; then select, according to the constraint conditions, all actions that satisfy the storage capacity requirement, and choose an action a_t from them with an ε-greedy strategy and execute it, wherein under the ε-greedy strategy the agent randomly selects an action with the small probability ε in each time slot and selects the action with the highest Q value with the larger probability 1 − ε;
2.2.2.3 After performing the action a_t, the agent obtains the immediate reward r_t and transitions to the next state s_{t+1}, and then stores the experience sample e_t = (s_t, a_t, r_t, s_{t+1}) in the experience replay memory;
2.2.2.4 If t < N, let t ← t + 1 and return to 2.2.2.2; otherwise, proceed to 2.2.2.5;
2.2.2.5 If t mod K = 0, proceed to 2.2.2.6; otherwise, let t ← t + 1 and return to 2.2.2.2;
2.2.2.6 Let an experience sample j in the experience replay memory be e_j = (s_j, a_j, r_j, s_{j+1}), and define the priority of the experience sample j as
p_j = |δ_j| + ε (9)
where ε > 0 is a small constant that ensures the priority of each sample is not 0, and δ_j represents the temporal-difference (TD) error of sample j, namely the difference between the target Q value and the estimated Q value of sample j; the Double DQN algorithm uses the online Q network to select the action with the largest Q value and uses the target Q network to evaluate the Q value of that action, namely
δ_j = r_j + γQ(s_{j+1}, argmax_{a′} Q(s_{j+1}, a′; θ); θ⁻) − Q(s_j, a_j; θ) (10)
Therefore, the larger the TD error of a sample, the higher its priority; the priorities of all samples in the experience replay memory are then calculated through formulas (9) and (10);
2.2.2.7 Extract k experience samples from the experience replay memory using a SumTree data structure, in which each leaf node at the bottom layer stores the priority of one experience sample, the value of each parent node equals the sum of the values of its two child nodes, and the root node at the top layer stores the sum of the priorities of all samples; the specific process is as follows: first divide the value of the root node by k to obtain k priority intervals, then randomly select a value in each interval, locate the corresponding bottom-layer leaf node by searching from top to bottom, and take the sample corresponding to that leaf node, thereby obtaining k experience samples (a simplified form of this sampling, together with the target and loss computation of formulas (11) and (12), is sketched after step 2.2.7 below);
2.2.2.8 Calculate the target Q value y_j of each of the k experience samples j according to formula (11), i.e., use the online Q network to select the action with the largest Q value and use the target Q network to evaluate the Q value of that action, namely
y_j = r_j + γQ(s_{j+1}, argmax_{a′} Q(s_{j+1}, a′; θ); θ⁻) (11)
2.2.2.9 Define the loss function Loss(θ) as the mean square error between the target Q value y_j and the estimated Q value Q(s_j, a_j; θ), namely
Loss(θ) = E[(y_j − Q(s_j, a_j; θ))²] (12)
where E[·] denotes the mathematical expectation; then, based on the k experience samples, the weight θ of the online Q network is updated by stochastic gradient descent so as to minimize the loss function;
2.2.2.10 If t mod C = 0, copy the updated weight θ of the online Q network to the target Q network to update its weight θ⁻; otherwise, the weight θ⁻ of the target Q network is not updated;
2.2.2.11 If t < T, let t ← t + 1 and return to 2.2.2.2; otherwise, let i ← i + 1 and return to 2.2.2.1; after the training process of the Double DQN algorithm is completed, the optimal weight θ* of the online Q network is obtained, and the trained Double DQN algorithm is then deployed on the macro base station for execution, wherein the execution process is as follows:
2.2.3 initializing t=1;
2.2.4 The macro base station collects the state set s_t of all small base stations in the time slot t, and then inputs s_t into the trained online Q network so as to output the Q values of all actions;
2.2.5 Select all actions that satisfy the storage capacity requirement according to the constraint conditions, then select from them the action a_t with the maximum Q value and execute it, namely
a_t = argmax_{a′} Q(s_t, a′; θ*) (13)
2.2.6 After performing the action a_t, the macro base station obtains the immediate reward r_t and transitions to the next state s_{t+1};
2.2.7 If t < T, let t ← t + 1 and return to 2.2.4; otherwise, the algorithm ends.
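The sketch below illustrates, under simplifying assumptions, one training update of the kind described in items 2.2.2.6 to 2.2.2.10: proportional prioritized sampling (shown here as a plain probability draw instead of a SumTree), the Double DQN target of formula (11), the loss of formula (12), the TD-error priorities of formulas (9) and (10), and the periodic copy of the online weights to the target network. The network size, buffer layout, function names and hyper-parameters are illustrative assumptions, not values fixed by the claims.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_q_net(state_dim: int, n_actions: int) -> nn.Module:
    # Small MLP standing in for the online / target Q networks.
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))

def double_dqn_step(buffer, priorities, q_online, q_target, optimizer,
                    k=32, gamma=0.99, eps0=1e-3):
    """One sampling + update step; `buffer` is a list of (s, a, r, s_next)."""
    # Proportional prioritized sampling (SumTree replaced by a direct draw).
    probs = priorities / priorities.sum()
    idx = np.random.choice(len(buffer), size=k, p=probs, replace=False)
    s, a, r, s_next = map(np.array, zip(*[buffer[i] for i in idx]))
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)

    # Double DQN target (formula (11)): online net selects, target net evaluates.
    with torch.no_grad():
        best_a = q_online(s_next).argmax(dim=1, keepdim=True)
        y = r + gamma * q_target(s_next).gather(1, best_a).squeeze(1)

    # Loss of formula (12), minimized by a gradient step on the online network.
    q_sa = q_online(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Refresh priorities with the new TD errors (formulas (9)-(10));
    # eps0 plays the role of the small constant in formula (9).
    priorities[idx] = (y - q_sa).abs().detach().numpy() + eps0
    return loss.item()
```

In a full implementation the macro base station would build the two networks with make_q_net, create an optimizer such as torch.optim.SGD(q_online.parameters(), lr=alpha), call double_dqn_step every K time slots, and copy the online weights to the target network every C updates with q_target.load_state_dict(q_online.state_dict()), as in step 2.2.2.10.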
3. The collaborative edge caching algorithm based on deep reinforcement learning in an ultra dense network according to claim 1, wherein the specific steps in step 3 are as follows:
3.1 After the best content caching decision vector of each small base station m has been determined, the bandwidth resource allocation problem of each small base station is described as a nonlinear integer programming problem P in which the total content download delay of the associated user equipments is minimized over the integer bandwidth allocation variables, wherein both the objective function and the constraint functions can be written as sums of univariate functions of the individual decision variables; the objective function is a separable concave function over the feasible domain and the constraints are linear over that domain, so the problem is a separable concave integer programming problem;
3.2 Each small base station adopts an improved branch-and-bound method to solve the separable concave integer programming problem, and the specific flow is as follows:
3.2.1 Perform continuous relaxation of the original problem P, namely remove the integer constraints, and linearly approximate the objective function, thereby obtaining the continuous-relaxation and linear-approximation subproblem LSP of the original problem P, where LSP is a separable linear programming problem;
3.2.2 solving a continuous optimal solution of the LSP by using a KKT condition, wherein if the continuous optimal solution is an integer solution, the continuous optimal solution is an optimal solution of the original problem P, otherwise, the objective function value of the continuous optimal solution is a lower bound of the optimal value of the original problem P;
3.2.3 Branching is then performed from the continuous optimal solution, wherein each branch corresponds to a subproblem, and the continuous relaxation of each subproblem is solved until a feasible integer solution is found; the objective function value of the feasible integer solution provides an upper bound for the original problem P, and the objective function value of the continuous optimal solution of each subproblem provides a lower bound for the corresponding subproblem; if a branch has no feasible solution, or its continuous optimal solution is an integer solution, or its lower bound exceeds the upper bound, the branch is pruned; the processes of branching and pruning are repeated for the remaining branches until all branches are pruned; whenever a branch yields a feasible integer solution, the upper bound is updated if necessary so that it always equals the minimum objective function value of the feasible integer solutions found so far;
3.2.4 When the algorithm terminates, the best feasible integer solution found so far is the optimal solution of the original problem P (an illustrative sketch of this branch-and-bound flow follows).
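As an illustration of the relax-bound-branch-prune flow of item 3.2, the sketch below allocates integer bandwidth units among users to minimize a separable download-delay objective of the hypothetical form Σ_u sizes[u]/b_u. The delay model, the closed-form continuous-relaxation lower bound (obtained from the KKT conditions of the relaxed subproblem via the Cauchy-Schwarz inequality) and all numbers are assumptions made only for the example; the claims use a linear-approximation subproblem (LSP) rather than this closed-form bound.

```python
import math

def bb_bandwidth_allocation(sizes, budget):
    """Branch and bound for: minimize sum_u sizes[u] / b[u]
       subject to sum_u b[u] <= budget, b[u] integer >= 1."""
    n = len(sizes)
    best_cost = math.inf
    best_alloc = None

    def relax_bound(u, remaining):
        # Continuous relaxation of the users not yet fixed: the relaxed
        # optimum allocates bandwidth proportionally to sqrt(sizes[v]),
        # giving the closed-form lower bound (sum sqrt(sizes[v]))^2 / remaining.
        rest = sizes[u:]
        if not rest:
            return 0.0
        return sum(math.sqrt(s) for s in rest) ** 2 / remaining

    def dfs(u, remaining, fixed_cost, alloc):
        nonlocal best_cost, best_alloc
        if u == n:                      # all variables fixed: feasible integer solution
            if fixed_cost < best_cost:  # update the incumbent upper bound
                best_cost, best_alloc = fixed_cost, alloc[:]
            return
        # Prune: even the relaxed remaining problem cannot beat the incumbent.
        if fixed_cost + relax_bound(u, remaining) >= best_cost:
            return
        # Branch on every feasible integer bandwidth for user u, keeping at
        # least one unit for each user still to be fixed.
        max_b = remaining - (n - u - 1)
        for b in range(max_b, 0, -1):
            alloc.append(b)
            dfs(u + 1, remaining - b, fixed_cost + sizes[u] / b, alloc)
            alloc.pop()

    dfs(0, budget, 0.0, [])
    return best_alloc, best_cost

# Hypothetical example: three users with bandwidth-normalised content sizes
# 8, 2 and 1 sharing 10 bandwidth units.
alloc, delay = bb_bandwidth_allocation([8.0, 2.0, 1.0], 10)
print(alloc, round(delay, 3))
```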
CN202010771674.7A 2020-08-04 2020-08-04 Collaborative edge caching algorithm based on deep reinforcement learning in ultra-dense network Active CN111970733B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010771674.7A CN111970733B (en) 2020-08-04 2020-08-04 Collaborative edge caching algorithm based on deep reinforcement learning in ultra-dense network

Publications (2)

Publication Number Publication Date
CN111970733A CN111970733A (en) 2020-11-20
CN111970733B (en) 2024-05-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant