CN111970733B - Collaborative edge caching algorithm based on deep reinforcement learning in ultra-dense network - Google Patents


Info

Publication number: CN111970733B (application CN202010771674.7A)
Authority: CN (China)
Prior art keywords: content, time slot, base station, user equipment, network
Legal status: Active (application granted)
Other languages: Chinese (zh)
Other versions: CN111970733A
Inventors: 韩光洁 (Han Guangjie), 张帆 (Zhang Fan)
Assignee: Changzhou Campus of Hohai University
Application filed by Changzhou Campus of Hohai University
Priority to CN202010771674.7A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W 28/00 Network traffic management; Network resource management
    • H04W 28/02 Traffic management, e.g. flow control or congestion control
    • H04W 28/10 Flow control between communication endpoints
    • H04W 28/14 Flow control between communication endpoints using intermediate storage
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00 Network arrangements or protocols for supporting network services or applications
    • H04L 67/01 Protocols
    • H04L 67/10 Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Mobile Radio Communication Systems (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a cooperative edge caching algorithm based on deep reinforcement learning in an ultra-dense network, which comprises the following specific steps: Step 1: set the parameters of the system model; Step 2: employ the Double DQN algorithm to make an optimal caching decision for each SBS so as to maximize the total content cache hit rate of all SBSs. The algorithm combines the DQN algorithm with the Double Q-learning algorithm, thereby effectively solving the problem that the DQN algorithm overestimates Q values. In addition, the algorithm adopts prioritized experience replay, which speeds up learning; Step 3: employ an improved branch-and-bound method to make an optimal bandwidth resource allocation decision for each SBS so as to minimize the total content download delay of all user equipments. The invention can effectively reduce the content download delay of all users in the ultra-dense network, improve the content cache hit rate and the spectrum resource utilization rate, has good robustness and scalability, and is suitable for large-scale ultra-dense networks with dense users.

Description

Collaborative edge caching algorithm based on deep reinforcement learning in ultra-dense network
Technical Field
The invention relates to a collaborative edge caching algorithm based on deep reinforcement learning in an ultra-dense network, and belongs to the field of edge caching of the ultra-dense network.
Background
In the 5G era, mobile data traffic has exploded with the popularity of smart mobile devices and mobile applications. To meet the 5G requirements of high capacity, high throughput, high user experience rate, high reliability and wide coverage, Ultra-Dense Networks (UDNs) have been developed. A UDN densely deploys low-power small base stations (Small Base Stations, SBSs) in indoor and outdoor hot-spot areas (such as office buildings, shopping malls, subways, airports and tunnels) within the coverage area of a macro base station (Macro Base Station, MBS), so as to improve network capacity and spatial reuse and to cover blind spots that the MBS cannot reach.
However, the SBSs in a UDN are connected to the core network through backhaul links, and as the numbers of SBSs and users increase, backhaul data traffic grows sharply, causing backhaul link congestion and larger service delay, thereby reducing the quality of service (Quality of Service, QoS) and the quality of user experience (Quality of Experience, QoE). Backhaul network problems have therefore become a performance bottleneck limiting the development of UDNs.
In view of the above problems, edge caching has become a promising solution. By caching popular content at the SBSs, users can obtain requested content directly from the local SBS without downloading it from a remote cloud server through the backhaul link, which reduces the traffic load of the backhaul link and the core network, reduces the content download delay, and improves QoS and QoE. However, since the caching capacity of a single SBS is limited, the performance of edge caching may be limited. In order to expand the cache capacity and increase cache diversity, a collaborative edge caching scheme can be adopted: multiple SBSs cache and update content cooperatively and share the cached content with each other, so as to improve the content cache hit rate and reduce the content download delay.
Most existing research on collaborative content caching requires prior knowledge such as the probability distribution of content popularity (e.g., a Zipf distribution) and user preference models. In practice, however, content popularity has complex spatio-temporal dynamics and is usually a non-stationary random process, so it is difficult to accurately predict and model its probability distribution.
Deep Reinforcement Learning (DRL) combines the strong perception ability of deep learning with the strong decision-making ability of reinforcement learning. The most common DRL algorithm is the Deep Q Network (DQN), which approximates the Q function with a Deep Neural Network (DNN) with weights θ, i.e., Q(s, a; θ) ≈ Q*(s, a), called the Q network, and then updates the weights θ by stochastic gradient descent to minimize a loss function. DQN is applicable to environments with large state and action spaces and thus alleviates the curse of dimensionality. However, the conventional DQN algorithm usually overestimates Q values, so the Double DQN algorithm is adopted; based on the Double Q-learning algorithm, it can effectively solve the overestimation problem of DQN. In addition, the conventional DQN algorithm usually extracts experience samples from the experience replay memory by uniform random sampling, i.e., every experience sample has the same probability of being drawn, so that rare but very valuable experience samples are not used efficiently; a Prioritized Experience Replay technique is therefore adopted to solve this sampling problem and speed up learning.
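As a brief illustration of the overestimation issue mentioned above, the following minimal NumPy sketch (not part of the patent) contrasts the standard DQN target, where the target network both selects and evaluates the next action, with the Double DQN target, where the online network selects the action and the target network evaluates it:

```python
import numpy as np

def dqn_target(r, q_next_target, gamma=0.99):
    # Standard DQN: the target network both selects and evaluates the next action;
    # taking the max over noisy estimates tends to overestimate the Q value.
    return r + gamma * np.max(q_next_target)

def double_dqn_target(r, q_next_online, q_next_target, gamma=0.99):
    # Double DQN: the online network selects the action, the target network evaluates it,
    # decoupling selection from evaluation and reducing the overestimation bias.
    a_star = int(np.argmax(q_next_online))
    return r + gamma * q_next_target[a_star]

# Toy per-action Q values of the next state under the two networks.
q_next_online = np.array([1.0, 2.5, 2.0])
q_next_target = np.array([1.2, 2.1, 2.6])
print(dqn_target(0.5, q_next_target))                        # 0.5 + 0.99 * 2.6
print(double_dqn_target(0.5, q_next_online, q_next_target))  # 0.5 + 0.99 * 2.1
```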
Disclosure of Invention
Aiming at the shortcomings of the prior art, the invention provides a collaborative edge caching algorithm based on deep reinforcement learning in an ultra-dense network, which is a centralized algorithm. The algorithm does not need prior knowledge such as the probability distribution of content popularity or a user preference model; instead, it computes content popularity from the instantaneous content requests of users, which simplifies the modeling of content popularity. The MBS is responsible for collecting the local content popularity information of all SBSs and making an optimal caching decision for each SBS, with the goal of maximizing the total content cache hit rate of all SBSs. Finally, after the optimal caching decision of each SBS is determined, each SBS makes an optimal resource allocation decision according to its bandwidth resources, with the goal of minimizing the total content download delay of all user equipments. The algorithm has good robustness and scalability and is suitable for large-scale user-dense UDNs.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
The collaborative edge caching algorithm based on deep reinforcement learning in an ultra-dense network comprises the following steps:
Step 1: setting the parameters of the system model;
Step 2: adopting the Double DQN algorithm to make an optimal caching decision for each SBS so as to maximize the total content cache hit rate of all SBSs, including the total cache hit rate served by the local SBS and the total cache hit rate served by other SBSs. The algorithm combines the DQN algorithm with the Double Q-learning algorithm, thereby effectively solving the problem that DQN overestimates Q values. In addition, the algorithm adopts the prioritized experience replay technique, which speeds up learning;
Step 3: adopting an improved branch-and-bound method to make an optimal bandwidth resource allocation decision for each SBS so as to minimize the total content download delay of all user equipments. The method combines the branch-and-bound method with a linear lower approximation method and is suitable for large-scale separable concave integer programs with many decision variables.
Preferably, the specific steps of the step 1 are as follows:
1.1 Setting up the network model: the model comprises three layers, namely a user equipment layer, an MEC layer and a cloud layer. The user equipment layer contains a number of user equipments (UEs), and each UE can only be connected to one SBS. The MEC layer contains M SBSs and one MBS; the MBS covers all SBSs, each SBS covers multiple UEs (each SBS represents a small cell), and the coverage areas of the SBSs do not overlap. Each SBS is deployed with one MEC server m ∈ M whose storage capacity is sc_m, and the storage capacities of all MEC servers form the storage capacity vector sc = [sc_1, sc_2, ..., sc_M]. The MEC server is responsible for providing edge caching resources for the UEs and, at the same time, for collecting the status information of its small cell (such as the size and popularity of each requested content and the channel gains) and transmitting it to the MBS; the SBSs can communicate with each other through the MBS and share their cached content. The MBS is responsible for collecting the status information of each SBS and making caching decisions for all SBSs, and it is connected to the cloud layer through a core backbone network (e.g., an optical fiber backhaul link). The cloud layer comprises several cloud servers with abundant computing and caching resources and caches all contents;
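For concreteness, the system-model parameters introduced in step 1 can be grouped as in the following illustrative Python sketch; all names and numerical values are assumptions chosen for illustration and are not taken from the patent:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SystemModel:
    """Illustrative container for the Step-1 system parameters (all values are examples)."""
    M: int = 4              # number of SBSs / MEC servers
    F: int = 50             # number of contents
    U: int = 40             # number of UEs sending requests
    T: int = 1000           # number of time slots
    B: float = 20e6         # total frequency bandwidth in Hz
    beta: int = 64          # number of orthogonal sub-channels
    sc: np.ndarray = None   # storage capacity vector sc = [sc_1, ..., sc_M] (MB)
    z: np.ndarray = None    # content size vector z = [z_1, ..., z_F] (MB)

    def __post_init__(self):
        rng = np.random.default_rng(0)
        if self.sc is None:
            self.sc = np.full(self.M, 500.0)            # each MEC server stores up to 500 MB
        if self.z is None:
            self.z = rng.uniform(10.0, 100.0, self.F)   # contents of different sizes

model = SystemModel()
print(model.sc.shape, model.z.shape)
```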
1.2 The whole time axis is divided into T time slots of equal length, where t ∈ T denotes the time slot index. A quasi-static model is adopted, i.e., within one time slot all system state parameters (such as the popularity of each content, the positions of the user equipments and the channel gains) remain unchanged, while they may differ between time slots;
1.3 Setting the content popularity model: there are F contents in total, each content f ∈ F has size z_f, the contents have different sizes, and the sizes of all contents form the content size vector z = [z_1, z_2, ..., z_f, ..., z_F]. The popularity of content f in cell m at time slot t is defined as p^t_{m,f}. Let R^t_{m,f} denote the total number of requests for content f in cell m at time slot t and R^t_m the total number of content requests of all UEs in cell m at time slot t; thus p^t_{m,f} = R^t_{m,f} / R^t_m. The popularities of all contents within cell m form the content popularity vector p^t_m = [p^t_{m,1}, ..., p^t_{m,F}], and the content popularity vectors of all cells form the content popularity matrix p^t;
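A minimal sketch of this popularity computation (array and symbol names are assumptions; the popularity of a content in a cell is simply its share of that cell's requests in the slot):

```python
import numpy as np

def content_popularity(requests):
    """Local content popularity at one time slot.

    requests: binary array of shape (M, U, F); requests[m, u, f] = 1 if UE u in
    cell m requested content f in this slot (cf. the request model in 1.4).
    Returns p[m, f] = (requests for f in cell m) / (all requests in cell m).
    """
    per_content = requests.sum(axis=1).astype(float)    # R^t_{m,f}
    per_cell = per_content.sum(axis=1, keepdims=True)   # R^t_m
    return np.divide(per_content, per_cell,
                     out=np.zeros_like(per_content),
                     where=per_cell > 0)

rng = np.random.default_rng(1)
requests = (rng.random((3, 10, 20)) < 0.1).astype(int)  # 3 cells, 10 UEs, 20 contents
p_t = content_popularity(requests)
print(p_t.shape, p_t.sum(axis=1))                        # rows sum to 1 where requests exist
```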
1.4 Setting the content request model: a total of U UEs send content requests. The set of all UEs that send content requests in cell m at time slot t is defined as 𝒰^t_m, and the number of UEs sending content requests in cell m at time slot t is U^t_m = |𝒰^t_m|. It is assumed that each UE requests each content at most once in time slot t. The content request vector of each UE u ∈ 𝒰^t_m in cell m at time slot t is defined as q^t_{m,u} = [q^t_{m,u,1}, ..., q^t_{m,u,F}], where each element q^t_{m,u,f} = 1 indicates that UE u in cell m requests content f at time slot t, and q^t_{m,u,f} = 0 indicates that it does not. The content request vectors of all UEs in cell m at time slot t form the content request matrix q^t_m;
1.5 Setting the cache model: the content caching decision vector to be maintained in the cache of each MEC server m at time slot t is defined as d^t_m = [d^t_{m,1}, ..., d^t_{m,F}], where each element d^t_{m,f} = 1 indicates that content f is cached on MEC server m at time slot t and d^t_{m,f} = 0 indicates that it is not; the total size of the cached content in each MEC server cannot exceed its storage capacity, i.e., Σ_{f∈F} d^t_{m,f} z_f ≤ sc_m. The content caching decision vectors of all MEC servers form the content caching decision matrix d^t;
1.6 Setting up the communication model: it is assumed that all SBSs operate on the same frequency band with bandwidth B, and that the MBS and the SBSs communicate over wired optical fiber, so the data transmission rate between an SBS and the MBS is high. Using orthogonal frequency division multiplexing, the bandwidth B is divided into β orthogonal sub-channels, each of bandwidth B/β, and each UE u in cell m at time slot t can be allocated a number β^t_{m,u} of orthogonal sub-channels. Since the coverage areas of the SBSs do not overlap, there is no co-channel interference between different SBSs or between different UEs of the same SBS. The downlink SNR between UE u and its local SBS m at time slot t is defined as
SNR^t_{m,u} = P^t_m h^t_{m,u} / σ²,
where P^t_m denotes the transmit power of SBS m at time slot t, h^t_{m,u} = (l^t_{m,u})^(-μ) denotes the channel gain between SBS m and UE u at time slot t, l^t_{m,u} denotes the distance between SBS m and UE u at time slot t, μ denotes the path loss factor, and σ² denotes the variance of the additive white Gaussian noise. Thus, the download rate between UE u and the local SBS m at time slot t is defined as
r^t_{m,u} = β^t_{m,u} (B/β) log₂(1 + SNR^t_{m,u}).
The data transmission rate between each SBS m and the MBS n is a constant r_{m,n}, and the data transmission rate between the MBS n and the cloud server c is a constant r_{n,c}. Thus, the download delay required for UE u to obtain content f from the local MEC server m at time slot t is defined as
D^{t,m}_{u,f} = z_f / r^t_{m,u}.
The download delay required for UE u to obtain content f from another (non-local) MEC server −m at time slot t is defined as
D^{t,−m}_{u,f} = z_f / r_{−m,n} + z_f / r_{m,n} + z_f / r^t_{m,u},
since the content traverses the backhaul links from SBS −m to the MBS and from the MBS to SBS m before the wireless downlink. The download delay required for UE u to obtain content f from the cloud server c at time slot t is defined as
D^{t,c}_{u,f} = z_f / r_{n,c} + z_f / r_{m,n} + z_f / r^t_{m,u}.
Thus, the download delay experienced by UE u for content f at time slot t depends on whether the content is served by the local MEC server, by another MEC server, or by the cloud server.
1.7 Setting the content delivery model: the basic process of content delivery is that each UE independently requests a number of contents from its local MEC server. If a requested content is cached in the local MEC server, it is transmitted directly to the UE by the local MEC server; if it is not cached in the local MEC server, it can be obtained from the MEC server of another SBS through the MBS and then transmitted to the UE by the local MEC server; if no MEC server caches the content, the content is relayed from the cloud server to the MBS through the core network, then transmitted from the MBS to the local MEC server, and finally delivered by the local MEC server to the UE. Whether UE u obtains content f from the local MEC server m at time slot t is defined as the binary variable x^{t,m}_{u,f}, where x^{t,m}_{u,f} = 1 indicates that UE u obtains content f from the local server m at time slot t and x^{t,m}_{u,f} = 0 otherwise; whether UE u obtains content f from a non-local server −m at time slot t is defined as the binary variable x^{t,−m}_{u,f}, where x^{t,−m}_{u,f} = 1 indicates that UE u obtains content f from the non-local server −m at time slot t and x^{t,−m}_{u,f} = 0 otherwise; whether UE u obtains content f from the cloud server c at time slot t is defined as the binary variable x^{t,c}_{u,f}, where x^{t,c}_{u,f} = 1 indicates that UE u obtains content f from the cloud server c at time slot t and x^{t,c}_{u,f} = 0 otherwise.
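The following Python sketch illustrates the communication and delivery model above; the channel-gain model, the parameter values and the exact delay expressions for the non-local and cloud cases are assumptions made for illustration:

```python
import math

def downlink_rate(n_subch, subch_bw, tx_power, distance, mu=3.0, noise_var=1e-13):
    """Shannon-type downlink rate between a UE and its local SBS (bit/s).

    A sketch of the communication model in 1.6: the channel gain is modelled as a
    simple distance-based path loss d^(-mu), and the rate scales with the number
    of allocated orthogonal sub-channels. All parameter values are illustrative.
    """
    gain = distance ** (-mu)
    snr = tx_power * gain / noise_var
    return n_subch * subch_bw * math.log2(1.0 + snr)

def download_delay(z_f, rate_ue, r_sbs_mbs, r_mbs_cloud, source):
    """Delay to deliver a content of size z_f (bits) depending on where it is cached (1.7).

    source: 'local'    -> served directly by the local MEC server,
            'neighbor' -> fetched from another SBS via the MBS (two backhaul hops),
            'cloud'    -> fetched from the cloud server via the MBS.
    The hop structure follows the delivery model; the exact expressions are assumptions.
    """
    if source == "local":
        return z_f / rate_ue
    if source == "neighbor":
        return z_f / r_sbs_mbs + z_f / r_sbs_mbs + z_f / rate_ue
    if source == "cloud":
        return z_f / r_mbs_cloud + z_f / r_sbs_mbs + z_f / rate_ue
    raise ValueError("unknown source")

rate = downlink_rate(n_subch=4, subch_bw=312.5e3, tx_power=0.2, distance=50.0)
content_bits = 8e6 * 8                                   # an 8 MB content
for src in ("local", "neighbor", "cloud"):
    print(src, download_delay(content_bits, rate, r_sbs_mbs=1e9, r_mbs_cloud=1e8, source=src))
```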
Preferably, the specific steps of the Double DQN algorithm in step 2 are as follows:
2.1 The content caching decision problem of the M SBSs is described as a Constrained Markov Decision Process (CMDP) problem, which can be expressed by the tuple ⟨S, A, r, Pr, c_1, c_2, ..., c_M⟩; the optimization objective is to maximize the long-term cumulative discounted reward of all SBSs, where
2.1.1 S denotes the state space; s_t ∈ S denotes the state set of all SBSs at time slot t, i.e., the content popularity matrix p^t formed by the content popularity vectors of all SBSs at time slot t, so s_t = p^t;
2.1.2 A denotes the action space; a_t ∈ A denotes the action selected by the MBS at time slot t, i.e., a_t = d^t;
2.1.3 r denotes the reward function; r_t(s_t, a_t) denotes the immediate reward obtained after the MBS performs action a_t in state s_t, and
r_t(s_t, a_t) = w_1 H^t_loc + w_2 H^t_oth,
where w_1 and w_2 are weights satisfying w_1 + w_2 = 1 and w_1 > w_2, with w_1 = 0.9 and w_2 = 0.1; H^t_loc denotes the total cache hit rate served by the local SBS m and H^t_oth denotes the total cache hit rate served by the non-local SBSs −m (a small sketch of this reward and of the capacity constraint is given after this list);
2.1.4 Pr denotes the state transition function, i.e., the probability Pr(s_{t+1} | s_t, a_t) that the MBS transitions from the current state s_t to the next state s_{t+1} after performing action a_t;
2.1.5 c_1, c_2, ..., c_M denote the constraints of the M SBSs, meaning that the total size of the cached content of each SBS must not exceed its storage capacity sc_m, i.e., Σ_{f∈F} d^t_{m,f} z_f ≤ sc_m.
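A small sketch of the per-SBS capacity constraint and the weighted hit-rate reward described above (the hit-rate definition here, as the fraction of requests served locally versus by other SBSs, is an assumption consistent with the text):

```python
import numpy as np

def cache_constraint_ok(d_m, z, sc_m):
    """Constraint c_m: the total size of content cached on SBS m must not exceed sc_m."""
    return float(np.dot(d_m, z)) <= sc_m

def reward(hit_local, hit_other, total_requests, w1=0.9, w2=0.1):
    """Weighted cache-hit-rate reward used by the MBS (a sketch of the reward in 2.1.3).

    hit_local / hit_other: numbers of requests served by the local SBS and by other
    SBSs in the slot; total_requests: all content requests in the slot.
    Local hits are weighted more heavily (w1 > w2, w1 + w2 = 1).
    """
    if total_requests == 0:
        return 0.0
    return w1 * hit_local / total_requests + w2 * hit_other / total_requests

d_m = np.array([1, 0, 1, 1, 0])                   # caching decision vector of one SBS
z = np.array([120.0, 80.0, 60.0, 200.0, 90.0])    # content sizes
print(cache_constraint_ok(d_m, z, sc_m=500.0))    # True: 380 <= 500
print(reward(hit_local=30, hit_other=12, total_requests=60))
```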
2.2 The Double DQN algorithm is adopted. Its training process is similar to that of the DQN algorithm and includes an online Q network and a target Q network; the difference is that Double DQN decomposes the max operation on the target Q value in DQN into action selection and action evaluation, i.e., the action is selected with the online Q network and evaluated with the target Q network. The Double DQN algorithm includes two processes, a training process and an execution process. The training process is as follows (a compact sketch of one training iteration is given after these steps):
2.2.1 In the initialization phase of the algorithm: initialize the capacity N of the experience replay memory, the sampling batch size k (N > k), the experience replay period K (i.e., the sampling period), the weights θ of the online Q network Q, the weights θ⁻ = θ of the target Q network, the learning rate α, the discount factor γ, the parameter ε of the ε-greedy strategy, the interval C for updating the target Q network parameters, the total number of training episodes EP, and the total number of time slots T (T > N) contained in each episode; define the episode index as i and initialize i = 1;
2.2.2 If i ≤ EP, go to 2.2.2.1; otherwise, training ends:
2.2.2.1 Initialize t = 1;
2.2.2.2 Input the current state s_t into the online Q network to output the Q values of all actions, then select, according to the constraints, all actions that satisfy the storage capacity requirement, and from these choose an action a_t with the ε-greedy strategy and execute it; under the ε-greedy strategy the agent selects a random action with a small probability ε in each time slot and selects the action with the highest Q value with the larger probability 1 − ε;
2.2.2.3 After performing action a_t, the agent obtains the immediate reward r_t and transitions to the next state s_{t+1}, and then stores the experience sample e_t = (s_t, a_t, r_t, s_{t+1}) in the experience replay memory;
2.2.2.4 If t < N, let t ← t + 1 and return to 2.2.2.2; otherwise, go to 2.2.2.5;
2.2.2.5 If t % K == 0, go to 2.2.2.6; otherwise, let t ← t + 1 and return to 2.2.2.2;
2.2.2.6 Suppose an experience sample j in the experience replay memory is e_j = (s_j, a_j, r_j, s_{j+1}); the priority of experience sample j is defined as
p_j = |δ_j| + ϵ   (9)
where ϵ > 0 ensures that every sample has a non-zero priority, and δ_j denotes the temporal difference (TD) error of sample j, i.e., the difference between the target Q value and the estimated Q value of sample j. The Double DQN algorithm uses the online Q network to select the action with the largest Q value and uses the target Q network to evaluate that action, i.e.,
δ_j = r_j + γ Q(s_{j+1}, argmax_{a′} Q(s_{j+1}, a′; θ); θ⁻) − Q(s_j, a_j; θ)   (10)
Therefore, the larger the TD error of a sample, the larger its priority. The priorities of all samples in the experience replay memory are then calculated through formulas (9) and (10);
2.2.2.7 employ a Sum Tree data structure to extract k experience samples from the experience playback memory, where each leaf node at the bottom layer represents the priority of each experience sample, the value of each parent node is equal to the Sum of the values of the two child nodes, and the root node at the top layer represents the Sum of the priorities of all samples. The specific process is as follows: firstly, dividing the value of a root node by k to obtain k priority intervals, then randomly selecting a value in each interval, judging which leaf node at the bottom layer corresponds to the value through searching from top to bottom, and selecting a sample corresponding to the leaf node to obtain k experience samples;
2.2.2.8 For each of the k experience samples j, calculate the target Q value y_j according to equation (11), i.e., use the online Q network to select the action with the largest Q value and use the target Q network to evaluate that action:
y_j = r_j + γ Q(s_{j+1}, argmax_{a′} Q(s_{j+1}, a′; θ); θ⁻)   (11)
2.2.2.9 Define the loss function Loss(θ) as the mean square error between the target Q value y_j and the estimated Q value Q(s_j, a_j; θ), i.e.,
Loss(θ) = E[(y_j − Q(s_j, a_j; θ))²]   (12)
where E[·] denotes mathematical expectation. Then, based on the k experience samples, update the weights θ of the online Q network by stochastic gradient descent to minimize the loss function;
2.2.2.10 If t % C == 0, copy the updated weights θ of the online Q network to the target Q network to update its weights θ⁻; otherwise, the weights θ⁻ of the target Q network are not updated;
2.2.2.11 If t < T, let t ← t + 1 and return to 2.2.2.2; otherwise, let i ← i + 1 and return to 2.2.2.1.
After the training process of the Double DQN algorithm is completed, the optimal weights θ* of the online Q network are obtained, and the trained Double DQN algorithm is then deployed on the MBS for execution. The execution process is as follows (a minimal sketch of this phase follows the steps below):
2.2.3 Initialize t = 1;
2.2.4 The MBS collects the state set s_t of all SBSs at time slot t and inputs s_t into the trained online Q network to output the Q values of all actions;
2.2.5 Select, according to the constraints, all actions that satisfy the storage capacity requirement, then select from these the action a_t with the maximum Q value and execute it, i.e.,
a_t = argmax_{a′} Q(s_t, a′; θ*)   (13)
2.2.6 After the MBS performs action a_t, it obtains the immediate reward r_t and transitions to the next state s_{t+1};
2.2.7 If t < T, let t ← t + 1 and return to 2.2.4; otherwise, the algorithm ends.
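A minimal sketch of this execution phase, with a feasibility mask standing in for the storage-capacity check (all values are illustrative):

```python
import numpy as np

def execute_caching_decision(q_values, feasible_mask):
    """Execution phase (2.2.3-2.2.7): the trained online Q network scores all caching
    actions for state s_t and the MBS picks the feasible action with the largest Q value
    (Eq. (13)); no exploration is used at this stage."""
    q = np.where(feasible_mask, q_values, -np.inf)   # rule out actions violating sc_m
    return int(np.argmax(q))

q_values = np.array([0.3, 1.7, 0.9, 1.2])            # Q(s_t, a'; theta*) for 4 candidate actions
feasible = np.array([True, False, True, True])       # action 1 would exceed the storage capacity
print(execute_caching_decision(q_values, feasible))  # -> 3
```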
Preferably, the specific steps of step 3 are as follows:
3.1 After the optimal content caching decision vector d^t*_m of each SBS m has been determined, the bandwidth resource allocation problem of each SBS is described as a nonlinear integer programming problem P over the numbers of sub-channels β^t_{m,u} allocated to its requesting UEs: minimize the total content download delay of all UEs in the cell, subject to the total number of allocated sub-channels not exceeding β and each β^t_{m,u} being a positive integer. Both the objective function and the constraint function can be expressed as sums of univariate functions of the individual decision variables β^t_{m,u}, i.e., in the separable forms Σ_u f_u(β^t_{m,u}) and Σ_u g_u(β^t_{m,u}). Within the feasible domain the objective function is a separable concave function and the constraint is linear, so the problem is a separable concave integer programming problem (an illustrative sketch of the branch-and-bound procedure is given after step 3.2.4 below);
3.2 each SBS adopts an improved branch-and-bound method to solve the separable concave integer programming problem, and the method comprises the following specific procedures:
3.2.1 The original problem P is first continuously relaxed (the integer constraints are removed) and the objective function is linearly approximated, yielding the continuous relaxation and linear approximation subproblem LSP of the original problem P, where LSP is a separable linear programming problem;
3.2.2 The continuous optimal solution of LSP is solved using the KKT conditions; if this continuous optimal solution is integral, it is the optimal solution of the original problem P, otherwise its objective function value is a lower bound on the optimal value of the original problem P;
3.2.3 Branching is then performed from the continuous optimal solution, where each branch corresponds to a subproblem, and the continuous relaxations of the subproblems are solved until a feasible integer solution is found; the objective function value of this feasible integer solution provides an upper bound for the original problem P, while the objective function value of the continuous optimal solution of each subproblem provides a lower bound for the corresponding subproblem. A branch can be pruned if it has no feasible solution, if its continuous optimal solution is integral, or if its lower bound exceeds the upper bound. The branching and pruning steps are repeated for the branches that have not been pruned until all branches are pruned. Whenever a branch yields a feasible integer solution, the upper bound is updated if necessary so that it equals the minimum objective function value among the feasible integer solutions found so far;
3.2.4 at the end of the algorithm, the best feasible integer solution at present is the optimal solution of the original problem P.
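The following self-contained sketch illustrates the branch-and-bound skeleton of 3.2 on a toy per-SBS sub-channel allocation instance. For simplicity the toy objective (a sum of delay-like terms z_u / x_u) is convex, so each node's continuous relaxation is solved directly with a generic NLP solver; the patent's improved method instead handles a separable concave objective by bounding it with a piecewise-linear lower approximation. All names and numbers are assumptions:

```python
import math
import numpy as np
from scipy.optimize import minimize

# Toy instance: minimise sum_u z[u] / x[u] over integer sub-channel counts x[u] >= 1
# with sum_u x[u] <= BETA (an illustrative stand-in for problem P).
z = np.array([40.0, 10.0, 25.0])
BETA = 12

def objective(x):
    return float(np.sum(z / x))

def solve_relaxation(lower, upper):
    """Continuous relaxation of a node: integrality dropped, box bounds kept."""
    x0 = np.clip(np.full(len(z), BETA / len(z)), lower, upper)
    res = minimize(objective, x0, method="SLSQP",
                   bounds=list(zip(lower, upper)),
                   constraints=[{"type": "ineq", "fun": lambda x: BETA - np.sum(x)}])
    return (res.x, res.fun) if res.success else (None, math.inf)

def branch_and_bound():
    best_x, best_val = None, math.inf                    # incumbent = current upper bound
    stack = [(np.ones(len(z)), np.full(len(z), float(BETA)))]
    while stack:
        lower, upper = stack.pop()
        x, val = solve_relaxation(lower, upper)
        if x is None or val >= best_val:                 # infeasible or bound-dominated: prune
            continue
        frac = int(np.argmax(np.abs(x - np.round(x))))   # most fractional variable
        if abs(x[frac] - round(x[frac])) < 1e-6:         # integral relaxation: new incumbent
            best_x, best_val = np.round(x).astype(int), val
            continue
        lo_up = upper.copy()
        lo_up[frac] = math.floor(x[frac])                # branch: x[frac] <= floor
        hi_lo = lower.copy()
        hi_lo[frac] = math.ceil(x[frac])                 # branch: x[frac] >= ceil
        stack.append((lower, lo_up))
        stack.append((hi_lo, upper))
    return best_x, best_val

x_star, val_star = branch_and_bound()
print("optimal integer allocation:", x_star, "objective:", val_star)
```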
The beneficial effects are that: the invention provides a collaborative edge caching algorithm based on deep reinforcement learning in an ultra-dense network, which can effectively reduce content downloading delay of all users in the ultra-dense network, improve content caching hit rate and spectrum resource utilization rate, has good robustness and expandability, and is suitable for large-scale user-dense ultra-dense networks.
Drawings
FIG. 1 is a network model of the UDN of step 1.1 employing edge caching;
fig. 2 is a schematic diagram illustrating the extraction of k samples using the data structure Sum Tree in step 2.2.2.7.
Detailed Description
In order that those skilled in the art may better understand the technical solutions of the present application, the technical solutions in the embodiments of the present application are described below clearly and completely. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort shall fall within the scope of the present application.
The cooperative edge caching algorithm based on deep reinforcement learning in the ultra-dense network comprises the following specific steps:
Step 1: setting the parameters of the system model;
Step 2: adopting the Double DQN algorithm to make an optimal caching decision for each SBS so as to maximize the total content cache hit rate of all SBSs, including the total cache hit rate served by the local SBS and the total cache hit rate served by other SBSs. The algorithm combines the DQN algorithm with the Double Q-learning algorithm, thereby effectively solving the problem that DQN overestimates Q values. In addition, the algorithm adopts the prioritized experience replay technique, which speeds up learning;
Step 3: adopting an improved branch-and-bound method to make an optimal bandwidth resource allocation decision for each SBS so as to minimize the total content download delay of all user equipments. The method combines the branch-and-bound method with a linear lower approximation method and is suitable for large-scale separable concave integer programs with many decision variables. The detailed procedures of steps 1 to 3 are as described above.
The methods mentioned in the present invention all belong to conventional technical means known to the person skilled in the art and are not described in detail.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (3)

1. The cooperative edge caching algorithm based on deep reinforcement learning in the ultra-dense network is characterized by comprising the following specific steps of:
Step 1: setting the parameters of the system model;
1.1 Setting up the network model: the model comprises three layers, namely a user equipment layer, an MEC layer and a cloud layer; the user equipment layer comprises a plurality of user equipments, and each user equipment can only be connected to one small base station; the MEC layer comprises M small base stations and one macro base station, the macro base station covers all the small base stations, each small base station covers a plurality of user equipments, each small base station represents a small cell, and the coverage areas of the small base stations do not overlap; each small base station is deployed with an MEC server m ∈ M whose storage capacity is sc_m, and the storage capacities of all MEC servers form the storage capacity vector sc = [sc_1, sc_2, ..., sc_M]; the MEC server is responsible for providing edge caching resources for the user equipments and, at the same time, for collecting the status information of its small cell and transmitting it to the macro base station, and the small base stations communicate with each other through the macro base station and share cached content; the macro base station is responsible for collecting the status information of each small base station and making caching decisions for all the small base stations, and is connected to the cloud layer through a core backbone network; the cloud layer comprises a plurality of cloud servers with abundant computing and caching resources and caches all contents;
1.2 Dividing the whole time axis into T time slots of equal length, where t ∈ T denotes the time slot index; a quasi-static model is adopted, i.e., within one time slot all system state parameters remain unchanged, while they may differ between time slots;
1.3 Setting the content popularity model: there are F contents in total, each content f ∈ F has size z_f, the contents have different sizes, and the sizes of all contents form the content size vector z = [z_1, z_2, ..., z_f, ..., z_F]; the popularity of content f in cell m at time slot t is defined as p^t_{m,f}; the total number of requests for content f in cell m at time slot t is R^t_{m,f}, and the total number of content requests of all user equipments in cell m at time slot t is R^t_m, so p^t_{m,f} = R^t_{m,f} / R^t_m; the popularities of all contents within cell m form the content popularity vector p^t_m = [p^t_{m,1}, ..., p^t_{m,F}], and the content popularity vectors of all cells form the content popularity matrix p^t;
1.4 Setting the content request model: a total of U user equipments send content requests; the set of all user equipments sending content requests in cell m at time slot t is defined as 𝒰^t_m, and the number of user equipments sending content requests in cell m at time slot t is U^t_m = |𝒰^t_m|; it is assumed that each user equipment requests each content at most once in time slot t; the content request vector of each user equipment u ∈ 𝒰^t_m in cell m at time slot t is defined as q^t_{m,u} = [q^t_{m,u,1}, ..., q^t_{m,u,F}], where each element q^t_{m,u,f} = 1 indicates that user equipment u in cell m requests content f at time slot t and q^t_{m,u,f} = 0 indicates that it does not; the content request vectors of all user equipments in cell m at time slot t form the content request matrix q^t_m;
1.5 Setting the cache model: the content caching decision vector to be maintained in the cache of each MEC server m at time slot t is defined as d^t_m = [d^t_{m,1}, ..., d^t_{m,F}], where each element d^t_{m,f} = 1 indicates that content f is cached on MEC server m at time slot t and d^t_{m,f} = 0 indicates that it is not, and the total size of the cached content in each MEC server cannot exceed its storage capacity, i.e., Σ_{f∈F} d^t_{m,f} z_f ≤ sc_m; the content caching decision vectors of all MEC servers form the content caching decision matrix d^t;
1.6 Setting up the communication model: it is assumed that all small base stations operate on the same frequency band with bandwidth B, and that the macro base station and the small base stations communicate over wired optical fiber, so the data transmission rate between the small base stations and the macro base station is very high; using orthogonal frequency division multiplexing, the bandwidth B is divided into β orthogonal sub-channels, each of bandwidth B/β, and each user equipment u in cell m at time slot t can be allocated a number β^t_{m,u} of orthogonal sub-channels; because the coverage areas of the small base stations do not overlap, there is no co-channel interference between different small base stations or between different user equipments of the same small base station; the downlink SNR between user equipment u and the local small base station m at time slot t is defined as
SNR^t_{m,u} = P^t_m h^t_{m,u} / σ²,
where P^t_m denotes the transmit power of the small base station m at time slot t, h^t_{m,u} = (l^t_{m,u})^(-μ) denotes the channel gain between the small base station m and user equipment u at time slot t, l^t_{m,u} denotes the distance between the small base station m and user equipment u at time slot t, μ denotes the path loss factor, and σ² denotes the variance of the additive white Gaussian noise; the download rate between user equipment u and the local small base station m at time slot t is defined as
r^t_{m,u} = β^t_{m,u} (B/β) log₂(1 + SNR^t_{m,u});
the data transmission rate between each small base station m and the macro base station n is a constant r_{m,n}, and the data transmission rate between the macro base station n and the cloud server c is a constant r_{n,c}; the download delay required for user equipment u to obtain content f from the local MEC server m at time slot t is defined as
D^{t,m}_{u,f} = z_f / r^t_{m,u};
the download delay required for user equipment u to obtain content f from another non-local MEC server −m at time slot t is defined as
D^{t,−m}_{u,f} = z_f / r_{−m,n} + z_f / r_{m,n} + z_f / r^t_{m,u};
the download delay required for user equipment u to obtain content f from the cloud server c at time slot t is defined as
D^{t,c}_{u,f} = z_f / r_{n,c} + z_f / r_{m,n} + z_f / r^t_{m,u};
thus, the download delay experienced by user equipment u for content f at time slot t depends on whether the content is served by the local MEC server, by another MEC server, or by the cloud server;
1.7 Setting a content delivery model: the basic process of content delivery is that each user equipment independently requests a plurality of contents from a local MEC server, and if the contents are cached in a cache area of the local MEC server, the contents are directly transmitted to the user equipment by the local MEC server; if the content is not cached in the local MEC server, the content can be acquired from the MEC servers of other small base stations through the macro base station and then transmitted to the user equipment by the local MEC server; if all MEC servers do not cache the content, relaying the content from the cloud server to the macro base station through the core network, transmitting the content to the local MEC server through the macro base station, and finally delivering the content to the user equipment through the local MEC server;
Whether user equipment u obtains content f from the local MEC server m at time slot t is defined as the binary variable x^{t,m}_{u,f}, where x^{t,m}_{u,f} = 1 indicates that user equipment u obtains content f from the local server m at time slot t and x^{t,m}_{u,f} = 0 otherwise; whether user equipment u obtains content f from a non-local server −m at time slot t is defined as the binary variable x^{t,−m}_{u,f}, where x^{t,−m}_{u,f} = 1 indicates that user equipment u obtains content f from the non-local server −m at time slot t and x^{t,−m}_{u,f} = 0 otherwise; whether user equipment u obtains content f from the cloud server c at time slot t is defined as the binary variable x^{t,c}_{u,f}, where x^{t,c}_{u,f} = 1 indicates that user equipment u obtains content f from the cloud server c at time slot t and x^{t,c}_{u,f} = 0 otherwise;
Step 2: adopting a Double DQN algorithm to make an optimal cache decision for each small base station so as to maximize the total content cache hit rate of all small base stations, including the total cache hit rate hit by a local small base station and the total cache hit rate hit by other small base stations;
step 3: an improved branch-and-bound approach is employed to make optimal bandwidth resource allocation decisions for each small base station to minimize the total content download delay for all user equipment.
2. The collaborative edge caching algorithm based on deep reinforcement learning in an ultra dense network according to claim 1, wherein the specific steps of the Double DQN algorithm in step 2 are as follows:
2.1 The content caching decision problem of the M small base stations is described as a constrained Markov decision process problem expressed by the tuple ⟨S, A, r, Pr, c_1, c_2, ..., c_M⟩, with the objective of maximizing the long-term cumulative discounted reward of all the small base stations, where
S denotes the state space; s_t ∈ S denotes the state set of all the small base stations at time slot t, i.e., the content popularity matrix p^t formed by the content popularity vectors of all the small base stations at time slot t, so s_t = p^t;
A denotes the action space, and a_t ∈ A denotes the action selected by the macro base station at time slot t, i.e., a_t = d^t;
r denotes the reward function; r_t(s_t, a_t) denotes the immediate reward obtained after the macro base station performs action a_t in state s_t, and
r_t(s_t, a_t) = w_1 H^t_loc + w_2 H^t_oth,
where w_1 and w_2 are weights satisfying w_1 + w_2 = 1 and w_1 > w_2, with w_1 = 0.9 and w_2 = 0.1; H^t_loc denotes the total cache hit rate served by the local small base station m and H^t_oth denotes the total cache hit rate served by the non-local small base stations −m;
Pr denotes the state transition function, i.e., the probability Pr(s_{t+1} | s_t, a_t) that the macro base station transitions from the current state s_t to the next state s_{t+1} after performing action a_t;
c_1, c_2, ..., c_M denote the constraints of the M small base stations, namely that the total size of the cached content of each small base station must not exceed its storage capacity sc_m, i.e., Σ_{f∈F} d^t_{m,f} z_f ≤ sc_m;
2.2 The Double DQN algorithm includes two processes, namely a training process and an execution process, wherein the training process is as follows:
2.2.1 In the initialization phase of the algorithm: initialize the capacity N of the experience replay memory, the sampling batch size k (N > k) and the experience replay period K, namely the sampling period; initialize the weight θ of the online Q network Q and set the weight of the target Q network to θ⁻ = θ; initialize the learning rate α, the discount factor γ, the parameter ε of the ε-greedy strategy, the time interval C for updating the target Q network parameters, the total number of training episodes EP and the total number T (T > N) of time slots contained in each episode; define the episode index as i and initialize i = 1;
2.2.2 For each episode i ∈ {1, 2, ..., EP}, the following steps are performed:
2.2.2.1 initializing t=1;
2.2.2.2 Input the current state s_t into the online Q network so as to output the Q values of all actions; then select, according to the constraint conditions, all actions that satisfy the storage capacity requirement, and choose an action a_t from them with an ε-greedy strategy and execute it, wherein under the ε-greedy strategy the agent randomly selects an action with the small probability ε in each time slot and selects the action with the highest Q value with the larger probability 1 − ε;
2.2.2.3 After performing the action a_t, the agent obtains the immediate reward r_t and transitions to the next state s_{t+1}, and then stores the experience sample e_t = (s_t, a_t, r_t, s_{t+1}) in the experience replay memory;
2.2.2.4 If t < N, let t ← t + 1 and return to 2.2.2.2; otherwise, proceed to 2.2.2.5;
2.2.2.5 If t mod K = 0, proceed to 2.2.2.6; otherwise, let t ← t + 1 and return to 2.2.2.2;
2.2.2.6 Let an experience sample j in the experience replay memory be e_j = (s_j, a_j, r_j, s_{j+1}), and define the priority of the experience sample j as
p_j = |δ_j| + ε (9)
where ε > 0 is a small constant that ensures the priority of each sample is not 0, and δ_j represents the temporal-difference (TD) error of sample j, namely the difference between the target Q value and the estimated Q value of sample j; the Double DQN algorithm uses the online Q network to select the action with the largest Q value and uses the target Q network to evaluate the Q value of that action, namely
δ_j = r_j + γQ(s_{j+1}, argmax_{a′} Q(s_{j+1}, a′; θ); θ⁻) − Q(s_j, a_j; θ) (10)
Therefore, the larger the TD error of a sample, the higher its priority; the priorities of all samples in the experience replay memory are then calculated through formulas (9) and (10);
2.2.2.7 Extract k experience samples from the experience replay memory using a SumTree data structure, in which each leaf node at the bottom layer stores the priority of one experience sample, the value of each parent node equals the sum of the values of its two child nodes, and the root node at the top layer stores the sum of the priorities of all samples; the specific process is as follows: first divide the value of the root node by k to obtain k priority intervals, then randomly select a value in each interval, locate the corresponding bottom-layer leaf node by searching from top to bottom, and take the sample corresponding to that leaf node, thereby obtaining k experience samples (a simplified form of this sampling, together with the target and loss computation of formulas (11) and (12), is sketched after step 2.2.7 below);
2.2.2.8 Calculate the target Q value y_j of each of the k experience samples j according to formula (11), i.e., use the online Q network to select the action with the largest Q value and use the target Q network to evaluate the Q value of that action, namely
y_j = r_j + γQ(s_{j+1}, argmax_{a′} Q(s_{j+1}, a′; θ); θ⁻) (11)
2.2.2.9 Define the loss function Loss(θ) as the mean square error between the target Q value y_j and the estimated Q value Q(s_j, a_j; θ), namely
Loss(θ) = E[(y_j − Q(s_j, a_j; θ))²] (12)
where E[·] denotes the mathematical expectation; then, based on the k experience samples, the weight θ of the online Q network is updated by stochastic gradient descent so as to minimize the loss function;
2.2.2.10 If t mod C = 0, copy the updated weight θ of the online Q network to the target Q network to update its weight θ⁻; otherwise, the weight θ⁻ of the target Q network is not updated;
2.2.2.11 If t < T, let t ← t + 1 and return to 2.2.2.2; otherwise, let i ← i + 1 and return to 2.2.2.1; after the training process of the Double DQN algorithm is completed, the optimal weight θ* of the online Q network is obtained, and the trained Double DQN algorithm is then deployed on the macro base station for execution, wherein the execution process is as follows:
2.2.3 initializing t=1;
2.2.4 The macro base station collects the state set s_t of all small base stations in the time slot t, and then inputs s_t into the trained online Q network so as to output the Q values of all actions;
2.2.5 Select all actions that satisfy the storage capacity requirement according to the constraint conditions, then select from them the action a_t with the maximum Q value and execute it, namely
a_t = argmax_{a′} Q(s_t, a′; θ*) (13)
2.2.6 After performing the action a_t, the macro base station obtains the immediate reward r_t and transitions to the next state s_{t+1};
2.2.7 If t < T, let t ← t + 1 and return to 2.2.4; otherwise, the algorithm ends.
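The sketch below illustrates, under simplifying assumptions, one training update of the kind described in items 2.2.2.6 to 2.2.2.10: proportional prioritized sampling (shown here as a plain probability draw instead of a SumTree), the Double DQN target of formula (11), the loss of formula (12), the TD-error priorities of formulas (9) and (10), and the periodic copy of the online weights to the target network. The network size, buffer layout, function names and hyper-parameters are illustrative assumptions, not values fixed by the claims.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_q_net(state_dim: int, n_actions: int) -> nn.Module:
    # Small MLP standing in for the online / target Q networks.
    return nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                         nn.Linear(64, n_actions))

def double_dqn_step(buffer, priorities, q_online, q_target, optimizer,
                    k=32, gamma=0.99, eps0=1e-3):
    """One sampling + update step; `buffer` is a list of (s, a, r, s_next)."""
    # Proportional prioritized sampling (SumTree replaced by a direct draw).
    probs = priorities / priorities.sum()
    idx = np.random.choice(len(buffer), size=k, p=probs, replace=False)
    s, a, r, s_next = map(np.array, zip(*[buffer[i] for i in idx]))
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)

    # Double DQN target (formula (11)): online net selects, target net evaluates.
    with torch.no_grad():
        best_a = q_online(s_next).argmax(dim=1, keepdim=True)
        y = r + gamma * q_target(s_next).gather(1, best_a).squeeze(1)

    # Loss of formula (12), minimized by a gradient step on the online network.
    q_sa = q_online(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q_sa, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Refresh priorities with the new TD errors (formulas (9)-(10));
    # eps0 plays the role of the small constant in formula (9).
    priorities[idx] = (y - q_sa).abs().detach().numpy() + eps0
    return loss.item()
```

In a full implementation the macro base station would build the two networks with make_q_net, create an optimizer such as torch.optim.SGD(q_online.parameters(), lr=alpha), call double_dqn_step every K time slots, and copy the online weights to the target network every C updates with q_target.load_state_dict(q_online.state_dict()), as in step 2.2.2.10.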
3. The collaborative edge caching algorithm based on deep reinforcement learning in an ultra dense network according to claim 1, wherein the specific steps in step 3 are as follows:
3.1 After the best content caching decision vector of each small base station m has been determined, the bandwidth resource allocation problem of each small base station is described as a nonlinear integer programming problem P in which the total content download delay of the associated user equipments is minimized over the integer bandwidth allocation variables, wherein both the objective function and the constraint functions can be written as sums of univariate functions of the individual decision variables; the objective function is a separable concave function over the feasible domain and the constraints are linear over that domain, so the problem is a separable concave integer programming problem;
3.2 Each small base station adopts an improved branch-and-bound method to solve the separable concave integer programming problem, and the specific flow is as follows:
3.2.1 Perform continuous relaxation of the original problem P, namely remove the integer constraints, and linearly approximate the objective function, thereby obtaining the continuous-relaxation and linear-approximation subproblem LSP of the original problem P, where LSP is a separable linear programming problem;
3.2.2 solving a continuous optimal solution of the LSP by using a KKT condition, wherein if the continuous optimal solution is an integer solution, the continuous optimal solution is an optimal solution of the original problem P, otherwise, the objective function value of the continuous optimal solution is a lower bound of the optimal value of the original problem P;
3.2.3 Branching is then performed from the continuous optimal solution, wherein each branch corresponds to a subproblem, and the continuous relaxation of each subproblem is solved until a feasible integer solution is found; the objective function value of the feasible integer solution provides an upper bound for the original problem P, and the objective function value of the continuous optimal solution of each subproblem provides a lower bound for the corresponding subproblem; if a branch has no feasible solution, or its continuous optimal solution is an integer solution, or its lower bound exceeds the upper bound, the branch is pruned; the processes of branching and pruning are repeated for the remaining branches until all branches are pruned; whenever a branch yields a feasible integer solution, the upper bound is updated if necessary so that it always equals the minimum objective function value of the feasible integer solutions found so far;
3.2.4 When the algorithm terminates, the best feasible integer solution found so far is the optimal solution of the original problem P (an illustrative sketch of this branch-and-bound flow follows).
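As an illustration of the relax-bound-branch-prune flow of item 3.2, the sketch below allocates integer bandwidth units among users to minimize a separable download-delay objective of the hypothetical form Σ_u sizes[u]/b_u. The delay model, the closed-form continuous-relaxation lower bound (obtained from the KKT conditions of the relaxed subproblem via the Cauchy-Schwarz inequality) and all numbers are assumptions made only for the example; the claims use a linear-approximation subproblem (LSP) rather than this closed-form bound.

```python
import math

def bb_bandwidth_allocation(sizes, budget):
    """Branch and bound for: minimize sum_u sizes[u] / b[u]
       subject to sum_u b[u] <= budget, b[u] integer >= 1."""
    n = len(sizes)
    best_cost = math.inf
    best_alloc = None

    def relax_bound(u, remaining):
        # Continuous relaxation of the users not yet fixed: the relaxed
        # optimum allocates bandwidth proportionally to sqrt(sizes[v]),
        # giving the closed-form lower bound (sum sqrt(sizes[v]))^2 / remaining.
        rest = sizes[u:]
        if not rest:
            return 0.0
        return sum(math.sqrt(s) for s in rest) ** 2 / remaining

    def dfs(u, remaining, fixed_cost, alloc):
        nonlocal best_cost, best_alloc
        if u == n:                      # all variables fixed: feasible integer solution
            if fixed_cost < best_cost:  # update the incumbent upper bound
                best_cost, best_alloc = fixed_cost, alloc[:]
            return
        # Prune: even the relaxed remaining problem cannot beat the incumbent.
        if fixed_cost + relax_bound(u, remaining) >= best_cost:
            return
        # Branch on every feasible integer bandwidth for user u, keeping at
        # least one unit for each user still to be fixed.
        max_b = remaining - (n - u - 1)
        for b in range(max_b, 0, -1):
            alloc.append(b)
            dfs(u + 1, remaining - b, fixed_cost + sizes[u] / b, alloc)
            alloc.pop()

    dfs(0, budget, 0.0, [])
    return best_alloc, best_cost

# Hypothetical example: three users with bandwidth-normalised content sizes
# 8, 2 and 1 sharing 10 bandwidth units.
alloc, delay = bb_bandwidth_allocation([8.0, 2.0, 1.0], 10)
print(alloc, round(delay, 3))
```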
CN202010771674.7A 2020-08-04 2020-08-04 Collaborative edge caching algorithm based on deep reinforcement learning in ultra-dense network Active CN111970733B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010771674.7A CN111970733B (en) 2020-08-04 2020-08-04 Collaborative edge caching algorithm based on deep reinforcement learning in ultra-dense network

Publications (2)

Publication Number Publication Date
CN111970733A CN111970733A (en) 2020-11-20
CN111970733B (en) 2024-05-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant