CN109617991B - Value function approximation-based cooperative caching method for codes of small stations of ultra-dense heterogeneous network - Google Patents
- Publication number
- CN109617991B (application CN201811634918.6A)
- Authority
- CN
- China
- Prior art keywords
- state
- base station
- file
- time slot
- small
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
- H04L67/568—Storing data temporarily at an intermediate stage, e.g. caching
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/06—Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W28/00—Network traffic management; Network resource management
- H04W28/02—Traffic management, e.g. flow control or congestion control
- H04W28/10—Flow control between communication endpoints
- H04W28/14—Flow control between communication endpoints using intermediate storage
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Mobile Radio Communication Systems (AREA)
Abstract
The invention discloses a value function approximation-based coded cooperative caching method for small stations in an ultra-dense heterogeneous network. Using the reinforcement learning method of value function approximation, the value function is expressed as a function of state and action, with the optimization target of maximizing the average cumulative number of file requests served directly by the small stations. Through continuous interaction with the environment, the method adapts to the environment's dynamic changes and mines the latent transition pattern of file requests to obtain an approximation formula for the value function, and thereby a cooperative caching decision matched to that transition pattern; the macro base station encodes the cooperative caching decision and transmits the coded cooperative caching result to each small station. The invention makes caching decisions from the file-request transition patterns mined by reinforcement learning from a real network; as a data-driven method it requires no assumption about the prior distribution of the data and is therefore better suited to practical systems. Through real-time interaction with the environment it can track time-varying file popularity and make corresponding caching strategies; the process is simple and feasible and requires no solution of an NP-hard problem.
Description
Technical Field
The invention belongs to the technical field of wireless network deployment in mobile communication, and particularly relates to a coded cooperative caching method for small stations in an ultra-dense heterogeneous network.
Background
With the popularization of intelligent terminals and the development of Internet services, and to meet users' demands for high data rates and high quality of service, the ultra-dense heterogeneous network will become one of the key technologies of the fifth-generation mobile communication system (5G). By deploying dense small stations within the coverage of a macro base station, the communication quality of users at the network edge can be effectively improved, raising spectrum efficiency and system throughput. However, since the small stations connect to the macro base station through wireless backhaul links, densely deployed small stations put great strain on those links, and the heavily loaded wireless backhaul becomes the bottleneck of the network. Ultra-dense network architectures therefore need to be integrated with other architectures or technologies to serve users better, and pushing network functionality to the mobile edge is a suitable option. Edge storage is an important concept in the mobile edge architecture: caching files at the small stations reduces mass data transmission at peak times, which effectively lowers the backhaul load of the system, reduces transmission delay, and improves user experience. Small stations in an ultra-dense heterogeneous network are numerous and closely spaced, so a user is generally within the coverage of several small stations; if the small stations transmit files for the user cooperatively, their limited cache space can be utilized more fully. The edge caching problem in ultra-dense heterogeneous networks is therefore worth intensive research.
Existing caching techniques model the caching decision as an optimization problem. First, file popularity is generally assumed to be time-invariant, whereas popularity in an actual network changes constantly; a method that solves an optimization problem under fixed popularity cannot track these changes, so the resulting caching decision does not suit the actual network well. Second, even if the fixed popularity is replaced by instantaneous popularity, the optimization problem must be solved anew every time the popularity changes, which brings huge network overhead; moreover, the modeled optimization problem is often NP-hard (non-polynomial hard) and very difficult to solve. Finally, because caching makes a decision from file-request behavior that has already occurred in order to prepare for requests yet to come, a caching decision based on traditional optimization cannot mine the transition pattern of file requests in the network, so the decision it produces is not optimal for the requests that will occur.
Disclosure of Invention
In order to solve the technical problems described in the background, the invention provides a value function approximation-based coded cooperative caching method for small stations in an ultra-dense heterogeneous network: the latent transition pattern of file requests is mined by the value function approximation method, yielding a cooperative caching strategy superior to that of traditional methods.
In order to achieve the technical purpose, the technical scheme of the invention is as follows:
the macro base station and the small stations within its coverage act as the machine. The macro base station is responsible for determining the action the small stations are to execute in the state of each time slot and sending it to the small stations; each small station is responsible for executing the action. The state comprises the file popularity of the current slot and the cooperative caching decision made in the previous slot; the action is the cooperative caching decision made in the current slot to serve the file requests of the next slot. Using the reinforcement learning method of value function approximation, the value function is expressed as a function of state and action, with the optimization target of maximizing the average cumulative number of file requests served directly by the small stations. Through continuous interaction with the environment, the method adapts to the environment's dynamic changes and mines the latent transition pattern of file requests to obtain an approximation formula for the value function, and thereby a cooperative caching decision matched to that pattern; the macro base station encodes the cooperative caching decision and transmits the coded cooperative caching result to each small station.
Further, the method comprises the following steps:
step 1, collecting network information and setting parameters: collect the macro base station set M, the small station set P, and the file request set C_1 in the network, together with the number p_m of small stations in the coverage area of the m-th macro base station, m ∈ M; obtain the small-station cache space K, which is determined by the operator according to the network operating conditions and hardware cost. The operator divides a day into T time slots according to the file-request situation in the ultra-dense heterogeneous network, sets the starting time of each slot, and divides each slot into three phases in order of occurrence: a file transmission phase, an information exchange phase, and a cache decision phase;
step 2, formulating a base station cooperation caching scheme based on MDS coding:
record the cooperative caching decision vector of the small stations as a(t), where each element a_c(t) ∈ [0,1], c ∈ C_1, represents the fraction of the c-th file cached in the t-th time slot; the set of files with a_c(t) ≠ 0 is the set of files cached in slot t, denoted C'(t). The c-th file contains B information bits, and the m-th macro base station encodes these B information bits into check bits via MDS coding:
in the above formula, d is the number of small stations whose received signal power exceeds a threshold; the threshold is determined by the operator according to the network operating conditions. All the check bits are divided into small-station candidate bits and macro-base-station candidate bits; the small-station candidate bits comprise p_m·B bits, i.e., each small station has B mutually non-repeating candidate bits, and in the t-th slot each small station caches the first a_c(t)·B bits of its own candidate bits;
the macro base station randomly selects (1 − d·a_c(t))·B bits from its candidate bits to cache; by the coding property of MDS, a file request that obtains at least B check bits can recover the whole file;
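As an illustration of the bit accounting in this caching scheme, the following sketch (in Python, with illustrative function and parameter names; the patent's own check-bit formula is not reproduced in this text) computes how many coded bits one small station and the macro base station cache for a file with caching fraction a_c:

```python
import math

def cache_allocation(a_c, B, p_m, d):
    """Bits cached per node for one file under the MDS-coded scheme.

    Each of the p_m small stations holds B non-overlapping candidate
    check bits and caches the first a_c * B of them; the macro base
    station caches (1 - d * a_c) * B bits, or none when the d covering
    small stations already supply at least B coded bits per request.
    """
    small_station_bits = math.floor(a_c * B)           # per small station
    macro_bits = max(0, math.ceil((1 - d * a_c) * B))  # macro backup copy
    return small_station_bits, macro_bits

# A request reached by d small stations collects d * a_c * B cached
# bits; the macro supplies the remainder, so at least B check bits
# arrive and MDS decoding recovers the file.
s, m = cache_allocation(a_c=0.25, B=1000, p_m=6, d=3)
assert s * 3 + m >= 1000
```

Note that for d·a_c ≥ 1 the macro caches nothing for that file, matching the transmission rule in step 3.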
step 3, formulating a base station cooperative transmission scheme:
each file request of a user first obtains d·a_c(t)·B bits from the d small stations covering it; if d·a_c(t) ≥ 1, the macro base station need not transmit any data. Otherwise, the macro base station selects the small station closest to the user among the d small stations and transmits (1 − d·a_c(t))·B bits to it, and that small station then transmits the bits to the user; the data transmitted by the macro base station is called the backhaul-link load;
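The per-slot backhaul load implied by this transmission scheme can be sketched as follows (Python, illustrative names; `a` and `N` are assumed to be the caching-fraction and request-count vectors):

```python
def backhaul_load(a, N, d, B):
    """Total backhaul-link load for one slot: for each request for
    file c that the d covering small stations cannot serve in full
    (d * a_c < 1), the macro base station forwards the missing
    (1 - d * a_c) * B bits."""
    return sum(max(0.0, 1 - d * a_c) * B * n for a_c, n in zip(a, N))

# file 1: d*a = 1.0 -> 0 bits; file 2: 0.6*100*5 = 300; file 3: 1*100*2 = 200
load = backhaul_load(a=[0.5, 0.2, 0.0], N=[10, 5, 2], d=2, B=100)
```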
step 4, describing a reinforcement learning task by using a Markov Decision Process (MDP):
establishing the reinforcement learning quadruple ⟨X, A, P, R⟩, where X represents the state space, A the action space, P^a_{x→x'} the state-transition probability, i.e., the probability of transitioning to state x' when action a is performed in state x, and R^a_{x→x'} the reward for that transition;
the specific form of reinforcement learning quadruple is as follows:
an action space: since the number of elements in the caching decision vector equals the number C of elements of the set C_1, the action space is a C-dimensional continuous space; a_c(t) is quantized into L discrete values, L being determined by the operator according to the computing capacity of the macro station. The discretized action space is A = {a_1, a_2, …, a_|A|}; any action vector a_j, j ∈ {1, 2, …, |A|}, must satisfy the cache-space constraint (the total size of the bits it caches may not exceed the small-station cache space K). The total number of action vectors satisfying this condition is |A|, and the caching decision a(t) of the t-th slot satisfies a(t) ∈ A;
state space: the total numbers of file requests at the p_m small stations within the coverage of the m-th macro station in the t-th slot are recorded as the vector N(t) = [N_1(t), N_2(t), …, N_C(t)], and the total file popularity as the vector Θ(t) = [θ_1(t), θ_2(t), …, θ_C(t)], where θ_c(t) = N_c(t) / Σ_{c'=1}^{C} N_{c'}(t), c ∈ C; the state of the t-th slot is recorded as x(t) = [Θ(t), a(t−1)]. Let H = {Θ_1, Θ_2, …, Θ_|H|} be the set of total file popularity vectors; after quantization, Θ(t) is an element of the set H, and the state space is recorded as X = {x_1, x_2, …, x_{|H||A|}}, with state x(t) ∈ X;
probability of state transition: after action a(t) is performed in the t-th time slot, it acts on the current state x(t), and the environment transitions from the current state to the next state x(t+1) with a latent transition probability; this transition probability is unknown;
reward: at the same time the environment transitions to x(t+1), it gives the machine a reward, defined here as the number of file requests served directly by the small stations:
in the above formula, u[·] represents the step function; the first quantity is the number of files whose content must be transmitted when the cache is updated in the cache decision phase of the t-th slot, and the second is the number of files transmitted by the macro base station in the information exchange phase of the (t+1)-th slot;
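A minimal sketch of the state quantities and the reward just described (Python; the popularity formula and the omission of the patent's file-transfer correction terms are assumptions of this sketch):

```python
def file_popularity(N):
    """theta_c(t) = N_c(t) / sum of all request counts: the fraction
    of the slot's requests that target file c (assumed form)."""
    total = sum(N)
    return [n / total for n in N]

def u(x):
    """Step function: 1 when the bracketed value is positive, else 0."""
    return 1 if x > 0 else 0

def direct_service_reward(N_next, a, d):
    """Number of requests served directly by the small stations: a
    request for file c needs no macro transmission when d * a_c >= 1.
    Minimal sketch; the patent's formula additionally subtracts the
    file-transfer counts for cache updates and macro transmissions."""
    return sum(n * u(d * a_c - 1) for n, a_c in zip(N_next, a))

theta = file_popularity([30, 50, 20])              # [0.3, 0.5, 0.2]
r = direct_service_reward([10, 5, 2], [0.6, 0.2, 0.0], d=2)
```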
step 5, defining a reinforcement learning target:
defining a deterministic policy function π(x), x ∈ X; according to this policy, the action to be executed in state x(t) is a(t) = π(x(t)). The state value function is:
in the above formula, the state value function represents the cumulative reward obtained by following policy π starting from state x(t); 0 ≤ γ < 1 measures the degree to which the action π(x(t)) executed in slot t influences future states;
after the state value function is obtained, the state-action value function, namely the Q function, is obtained:
in the above formula, the Q function of (x(t), a'(t)) represents the cumulative reward obtained by executing action a'(t) in state x(t) and thereafter following policy π;
replacing x(t), x(t+1), a'(t) with x, x', a respectively, the goal is to find the policy that maximizes the expected cumulative reward, denoted π*(x), with the corresponding optimal value function; under the optimal policy one obtains:
namely:
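The optimal-policy relations just stated, whose equation images do not survive in this text, take the standard Bellman form; a sketch in conventional notation, assuming the quadruple defined in step 4:

```latex
V^{*}(x) = \max_{a \in A} Q^{*}(x,a),
\qquad
Q^{*}(x,a) = \sum_{x' \in X} P^{a}_{x \to x'}
  \Bigl( R^{a}_{x \to x'} + \gamma \max_{a' \in A} Q^{*}(x',a') \Bigr),
\qquad
\pi^{*}(x) = \arg\max_{a \in A} Q^{*}(x,a).
```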
step 6, formulating a Q-learning process based on value function approximation:
(601) expressing the Q function by value function approximation, i.e., representing the Q function as a function of state and action, subject to the instantaneous reward; in state x(t), with action a'(t) executed, the Q function is approximately expressed as:
in the above formula, ω_1 and ω_2 represent the weights of the two parts, with ω_1 >> ω_2; β, η_i, ξ_i are unknown parameters that must be obtained through learning;
(602) solving the cooperative caching decision:
(603) establishing a Q-learning goal:
the target value of the cumulative reward for performing action a(t) in state x(t) is calculated according to the above formula:
in the above formula, the final term uses the action estimated for state x(t+1);
(604) defining a loss function:
in the above formula, η = [η_1, η_2, …, η_C], ξ = [ξ_1, ξ_2, …, ξ_C], and E_π denotes the expectation under policy π;
updating parameters beta, eta and xi according to the loss function;
step 7, setting the current time slot t = 1 and randomly setting the starting state x(t) = [Θ(t), a(t−1)]; initialize the parameters to β_p = 0, η_p = 0, ξ_p = 0. The operator sets the value of γ within [0, 1) according to the speed of network change, determines the value of the update step δ within (0, 1] according to the order of magnitude of the parameters to be updated, and sets the number of training time slots t_total according to the network scale;
Step 8, in the cache decision phase of the t-th time slot, use the ε-greedy strategy to take the cooperative caching decision a(t) to be executed in state x(t);
step 9, the macro base station carries out MDS coding on the files needing to be cached according to the step 2, and transmits the coded data packets to the small station for caching;
step 10, in the file transmission stage of the t +1 time slot, a user requests a file, and the base station performs cooperative transmission to serve the user according to the step 3;
step 11, in the information exchange phase of the (t+1)-th time slot, all small stations within the coverage of each macro base station report their file request counts for slot t+1 to the macro base station; the macro base station aggregates the total request counts into the vector N(t+1) and calculates the total file popularity, recorded as the vector Θ(t+1);
Step 13, estimating the action to be executed in the state x (t + 1):
step 14, updating parameters in the Q function approximation formula according to the step (604);
step 15, if t ═ ttotalIf yes, stopping training and entering step 16; otherwise, t is t +1, enter the next time slot, go back to step 8, continue training;
and step 16, starting from the t time slot, determining a cooperative caching decision based on the Q function approximation formula obtained by training, and serving a file request of the next time slot.
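The per-slot loop of steps 7 through 16 can be sketched as follows (Python; the environment class is a toy stand-in, since the real environment is the network itself and the real state is [Θ(t), a(t−1)]):

```python
class DummyEnv:
    """Toy two-state stand-in for the network environment, used only
    to demonstrate the control flow of the training loop."""
    def reset(self):
        self.x = 0
        return self.x

    def step(self, a):
        r = 1 if a == self.x else 0  # reward: caching decision fit the state
        self.x = 1 - self.x          # environment moves to the next state
        return self.x, r

def train(env, select_action, update, t_total):
    """Per-slot loop of steps 8-15: decide, cache and serve, observe,
    then update the Q-function parameters."""
    x = env.reset()
    history = []
    for _ in range(t_total):
        a = select_action(x)           # step 8: eps-greedy caching decision
        x_next, r = env.step(a)        # steps 9-11: cache, transmit, report
        update(x, a, r, x_next)        # steps 13-14: update Q parameters
        history.append(r)
        x = x_next
    return history

rewards = train(DummyEnv(), select_action=lambda x: x,
                update=lambda *args: None, t_total=4)
```

After `t_total` slots, step 16 drops the exploration and uses the learned approximation directly.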
Further, in step 3, the determination method of d is as follows:
let the probability of a user being served by d' small stations be p_{d'}. First, based on the operator's base-station deployment, p_{d'} is calculated from historical user-position data: over a time period τ, the positions of U users are recorded every τ' (τ and τ' are determined by the operator according to the network operating conditions), and for each user u ∈ {1, 2, …, U} the number of recorded positions at which exactly d' base stations have received signal power above the threshold is counted. From the historical positions of the U users one calculates:
in the above formula, the count denotes the number of positions, among user u's historical positions, at which i base stations provide service for user u;
the solution of the above equation is as follows:
according to the condition l_max·d/L ≥ 1, the maximum value of the elements in the cache decision vector is determined, l_max being the index of the largest element (elements take the form i/L); since, within the range satisfying the inequality, a smaller l_max is better, l_max is obtained by rounding up, where ⌈·⌉ represents rounding up;
secondly, the number z_i of elements equal to i/L in the caching decision vector, i = 1, 2, …, l_max, is calculated according to the cache space of the base station:
determining the position of each element: sort the coefficients η_i·θ_i(t), i = 1, 2, …, C, in descending order; the j-th element after sorting corresponds to the h_j-th file before sorting. First, preliminarily determine the positions of the elements:
then adjust: when the condition 1 − l_max·d/L < 0 holds, loop over the sorted positions from the last down to j = 1 to adjust the elements of the action vector: find the smallest j' satisfying the stated conditions, reduce that element by 1/L, and add 1/L to the other;
Further, in step 8, with probability 1 − ε the cooperative caching decision is selected according to step (602); with probability ε, one caching decision satisfying the constraints is selected at random as the cooperative caching decision.
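The ε-greedy selection described here can be sketched as (Python, illustrative names; the action list is assumed to contain only caching vectors that already satisfy the constraints):

```python
import random

def epsilon_greedy(feasible_actions, q_value, eps, rng=random):
    """With probability eps explore a random feasible caching vector;
    otherwise exploit the vector maximizing the approximate Q value."""
    if rng.random() < eps:
        return rng.choice(feasible_actions)   # exploration branch
    return max(feasible_actions, key=q_value) # exploitation branch

# eps = 0 makes the choice deterministic: pick the best-scored action
best = epsilon_greedy([(0.0,), (0.5,), (1.0,)],
                      lambda a: -abs(a[0] - 0.5), eps=0.0)
```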
Further, in step (604), a stochastic gradient descent method is adopted to update the parameters β, η, ξ in the Q-function approximation expression:
in the above formula, β_c and the corresponding quantities represent the parameters of the current time slot, β_p and the corresponding quantities the parameters of the previous time slot, and 0 < δ ≤ 1 represents the update step size.
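A minimal sketch of one such stochastic-gradient update, assuming a generic linear-in-parameters Q approximation (the patent's specific two-part feature form with weights ω_1 >> ω_2 is abstracted into a single feature vector; all names are illustrative):

```python
import numpy as np

def sgd_td_step(params, feats, r, feats_next, gamma, delta):
    """One stochastic-gradient step on the squared TD error for a
    linear Q approximation Q(x, a) = params . feats(x, a)."""
    q = params @ feats                              # current estimate
    target = r + gamma * (params @ feats_next)      # Q-learning target
    return params + delta * (target - q) * feats    # gradient step

p = sgd_td_step(np.zeros(3), np.array([1.0, 2.0, 0.0]),
                r=1.0, feats_next=np.zeros(3), gamma=0.9, delta=0.01)
```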
Advantageous effects brought by the above technical scheme:
the invention serves users by small-station cooperative coded caching and cooperative transmission, and makes caching decisions by mining, through reinforcement learning, the transition pattern of file requests collected from the real network. As a data-driven machine learning method it requires no assumption about the prior distribution of the data and is better suited to practical systems; through real-time interaction with the environment it can track time-varying file popularity and make corresponding caching strategies, the process is simple and feasible, and no NP-hard problem needs to be solved.
The invention makes cooperative caching decisions based on the value function approximation method: the macro base station collects state information through continuous interaction with the environment, makes the corresponding cooperative caching decision, and transmits it to each small station, so that the limited storage space of the small stations is used effectively to cache the most appropriate files, the number of file requests served directly by the small stations increases significantly, and the backhaul-link load of the system is reduced.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The technical scheme of the invention is explained in detail in the following with the accompanying drawings.
The invention provides a value function approximation-based coded cooperative caching method for small stations in an ultra-dense heterogeneous network, whose aim is to maximize the average cumulative number of file requests served directly by the small stations, under the premise that the total size of the files a small station caches does not exceed its cache space. The method mines the transition pattern of file requests through reinforcement learning and formulates a coded cooperative caching method for the small stations according to the mined pattern. The reinforcement learning task is described as an MDP (Markov Decision Process). The macro base station and the small stations within its coverage act as the machine: the macro base station determines the actions to be executed and sends them to the small stations; the small stations execute the actions and change the environment; the environment feeds a reward back to the machine according to a reward function. Through continuous interaction with the environment, the machine learns the action the small stations should execute in the state of each time slot, where the state is the partial description of the environment observed by the macro base station, comprising the file popularity of the current slot and the cooperative caching decision made in the previous slot, and the action is the cooperative caching decision made in the current slot to serve the file requests of the next slot. The reward function is defined according to the goal of the caching decision, here the number of file requests served directly by the small stations.
Value function approximation is a reinforcement learning method suited to tasks with a huge discrete state space or a continuous state space. It expresses the value function as a function of state and action, takes maximizing the average cumulative number of file requests served directly by the small stations as the optimization target, adapts to the dynamic changes of the environment through continuous interaction with it, and mines the latent transition pattern of file requests to obtain an approximation formula for the value function and, from it, a cooperative caching decision matched to that pattern. The macro base station encodes the files with the MDS (Maximum Distance Separable) coding method and finally transmits the coded cooperative caching result to each small station, so that the number of file requests served directly by the small stations increases significantly and the system backhaul-link load is reduced.
An embodiment is given below by taking an LTE-A system as an example, and as shown in fig. 1, the specific steps are as follows:
the first step is as follows: collecting network information, and setting parameters:
collect the macro base station set M, the small station set P, and the file request set C_1 in the network, together with the number p_m of small stations in the coverage area of the m-th macro base station, m ∈ M; the set C_1 contains C files. Obtain the small-station cache space K, which is determined by the operator according to the network operating conditions and hardware cost. The operator divides the time of day into T time slots according to the file-request situation in the ultra-dense heterogeneous network, sets the starting time of each slot, and divides each slot into three phases in order of occurrence: a file transmission phase, an information exchange phase, and a cache decision phase.
The second step is that: formulating a base station cooperation caching scheme based on MDS coding:
the cooperative caching decision vector of the small stations is recorded as a(t) = [a_1(t), a_2(t), …, a_C(t)], where 0 ≤ a_c(t) ≤ 1, c ∈ C, represents the fraction of the c-th file cached at the small stations in the t-th time slot; the set of files with a_c(t) ≠ 0 (i.e., the set of files cached in slot t) is denoted C'(t). File c contains B information bits, and macro base station m encodes these B information bits into check bits via MDS coding:
where d is the number of small stations whose received signal power exceeds a threshold; the threshold is determined by the operator according to the network operating conditions. All the check bits are divided into small-station candidate bits and macro-base-station candidate bits; the small-station candidate bits comprise p_m·B bits, i.e., each small station has B mutually non-repeating candidate bits, and in the t-th slot each small station caches the first a_c(t)·B bits of its own candidate bits;
the macro base station randomly selects (1 − d·a_c(t))·B bits from its candidate bits to cache; by the coding property of MDS, a file request that obtains at least B check bits can recover the whole file.
The third step: formulating a base station cooperative transmission scheme:
each file request of a user first obtains d·a_c(t)·B bits from the d small stations covering it; if d·a_c(t) ≥ 1, the macro base station need not transmit any data. Otherwise, the macro base station selects the small station closest to the user among the d small stations and transmits (1 − d·a_c(t))·B bits to it, which that small station then transmits to the user; the data transmitted by the macro base station is called the backhaul-link load. The method for determining d:
the probability of a user being served by d' small stations is p_{d'}. First, based on the operator's base-station deployment, p_{d'} is calculated from historical user-position data: over a time period τ, the positions of U users are recorded every τ' (τ and τ' are determined by the operator according to the network operating conditions), and for each user u ∈ {1, 2, …, U} the number of recorded positions at which exactly d' base stations have received signal power above the threshold is counted. From the historical positions of the U users one calculates:
where the count denotes the number of positions, among user u's historical positions, at which i base stations served user u.
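The empirical estimate of p_{d'} from recorded user positions can be sketched as (Python; the nested-dictionary layout of the per-user counts is an assumption of this sketch):

```python
def coverage_distribution(position_counts):
    """Empirical probability p_d that a recorded user position is
    covered by exactly d base stations above the power threshold.

    position_counts[u][i] = number of recorded positions of user u at
    which exactly i base stations exceeded the threshold.
    """
    total = sum(sum(counts.values()) for counts in position_counts)
    dist = {}
    for counts in position_counts:
        for i, n in counts.items():
            dist[i] = dist.get(i, 0) + n / total
    return dist

# two users, eight recorded positions in total
p = coverage_distribution([{1: 2, 2: 2}, {2: 4}])
```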
the fourth step: the reinforcement learning task is described by MDP:
the reinforcement learning quadruple ⟨X, A, P, R⟩ is established, where X represents the state space, A the action space, P^a_{x→x'} the probability of transitioning to state x' when action a is performed in state x, and R^a_{x→x'} the reward for the transition;
the specific form of reinforcement learning quadruples in this problem is as follows:
1. An action space: the action is defined as the cooperative caching decision vector of the small stations, and the actions the machine can take form the action space. The number of elements in the caching decision vector equals the number C of files, so the action space is a C-dimensional continuous space with 0 ≤ a_c ≤ 1 in each dimension, c ∈ C; a_c is quantized into L discrete values, L being determined by the operator according to the computing capacity of the macro station. The discretized action space is A = {a_1, a_2, …, a_|A|}; any action vector a_j, j ∈ {1, 2, …, |A|}, must satisfy the cache-space constraint (the total size of the bits it caches may not exceed the small-station cache space K). The total number of action vectors satisfying this condition is |A|, and the caching decision of the t-th slot satisfies a(t) ∈ A.
2. State space: the state is the machine's perceived description of the environment, composed of the file popularity vector and the cooperative caching decision vector of the small stations. The total numbers of file requests at the p_m small stations within the coverage of the m-th macro station in the t-th slot are recorded as the vector N(t) = [N_1(t), N_2(t), …, N_C(t)], and the total file popularity as the vector Θ(t) = [θ_1(t), θ_2(t), …, θ_C(t)], where θ_c(t) = N_c(t) / Σ_{c'=1}^{C} N_{c'}(t), c ∈ C; the state of the t-th slot is recorded as x(t) = [Θ(t), a(t−1)]. Let H = {Θ_1, Θ_2, …, Θ_|H|} be the set of total file popularity vectors; after quantization, Θ(t) is an element of the set H, and the state space is recorded as X = {x_1, x_2, …, x_{|H||A|}}, with state x(t) ∈ X.
3. Probability of state transition: after action a(t) is performed in the t-th time slot, it acts on the current state x(t), and the environment transitions from the current state to the next state x(t+1) with a latent transition probability; this transition probability is unknown.
4. Reward: at the same time the environment transitions to x(t+1), it gives the machine a reward, defined here as the number of file requests served directly by the small stations:
where u[·] represents the step function, whose value is 1 when the bracketed quantity is greater than 0 and 0 otherwise; the first quantity is the number of files whose content must be transmitted when the cache is updated in the cache decision phase of the t-th slot, and the second is the number of files transmitted by the macro base station in the information exchange phase of the (t+1)-th slot.
The fifth step: defining the reinforcement learning target:
defining a deterministic policy function π(x), x ∈ X; according to this policy, the action executed in state x(t) is a(t) = π(x(t)). Define the γ-discounted expected cumulative reward function:
where E_π denotes the expectation under policy π; the function represents the cumulative reward obtained by following policy π starting from state x(t), and 0 ≤ γ < 1 measures the degree to which the action π(x(t)) executed in slot t influences future states.
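The γ-discounted cumulative reward that this value function estimates can be computed for a finite reward sequence as (Python):

```python
def discounted_return(rewards, gamma):
    """Cumulative gamma-discounted reward: sum over k of
    gamma**k * r_{t+k+1}, evaluated backwards for numerical clarity."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

g = discounted_return([1, 1, 1], gamma=0.5)  # 1 + 0.5*1 + 0.25*1 = 1.75
```

With γ closer to 1 the caching decision of slot t weighs distant future requests more heavily, which is why the operator ties γ to the speed of network change.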
After obtaining the state value function, the state-action value function (the Q function) is obtained:
which represents the cumulative reward obtained by executing action a'(t) in state x(t) and thereafter following policy π; equations (4) and (5) are called the Bellman equations.
Replacing x(t), x(t+1), a′(t) with x, x′, a respectively, the goal is to find the policy maximizing the expected cumulative reward, denoted π*(x), with the corresponding optimal value function. Under the optimal policy, equations (4) and (5) give:
namely:
Equations (6) and (7) reveal how a non-optimal policy is improved, namely by changing the action selected by the policy to the currently optimal action:
When the reinforcement learning quadruple is fully known, the Bellman equation can be solved based on equation (8) by a value iteration or policy iteration algorithm to obtain the optimal policy.
The sixth step: Q-learning based on value function approximation when the state transition probability is unknown:
Because the state transition probability is unknown, the optimal policy cannot be obtained through a policy iteration or value iteration algorithm; for the same reason, converting the state value function into the Q function is difficult, so the Q function is estimated directly;
1. Q-function approximation: to overcome the difficulty of storing and exhaustively searching a Q-table caused by the large state and action spaces, the Q function is represented by value function approximation, i.e. it is expressed as a function of the state and the action together with the instantaneous reward. Taking time slot t as an example, with action a′(t) executed in state x(t), the Q function is approximated as:
where ω_1 and ω_2 are the weights of the two parts, set so that ω_1 ≫ ω_2; β, η_i, and ξ_i are unknown parameters to be learned.
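The exact feature map of equation (10) was lost with the figure and is not recoverable from the text; the sketch below only illustrates the general shape described here: a linear approximation with a bias β, a dominant ω_1-weighted part coupling the popularity θ_i(t) to the action through η_i, and a minor ω_2-weighted part coupling the previous decision through ξ_i. Every feature choice in this sketch is an assumption.

```python
import numpy as np

def q_approx(theta, a_prev, a, beta, eta, xi, d, w1=1.0, w2=1e-3):
    """Hypothetical linear Q-function in the spirit of eq. (10): bias beta,
    a dominant w1-weighted term coupling popularity theta_i (via eta_i) with
    the fraction of file i the small stations can serve (d*a_i, capped at 1),
    and a minor w2-weighted term coupling the previous decision via xi_i."""
    theta, a_prev, a = map(np.asarray, (theta, a_prev, a))
    served = np.minimum(d * a, 1.0)  # fraction of a request served locally
    return float(beta + w1 * np.dot(eta * theta, served) + w2 * np.dot(xi, a_prev))
```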
2. Selection of a collaborative caching decision:
Equation (11) selects the cooperative caching policy that maximizes the value in the brackets. As the bracketed expression shows, the factor η_iθ_i(t) multiplying (1 − d·a′_i(t)) directly determines the magnitude of the bracketed value: the larger η_iθ_i(t) is, the smaller the corresponding (1 − d·a′_i(t)) should be, so that the bracketed value becomes larger. Equation (11) is therefore solved as follows:
① According to l_max·d/L ≥ 1, determine the maximum element of the caching decision vector, l_max/L; since, within the range satisfying the inequality, the smaller l_max is the better, l_max = ⌈L/d⌉, where ⌈·⌉ represents rounding up;
② Calculate the number z_i of each element i/L, i = 1, 2, …, l_max, in the caching decision vector:
③ Determine the position of each element: arrange the coefficients η_iθ_i(t), i = 1, 2, …, C, in descending order; the j-th element after sorting corresponds to the h_j-th file before sorting. First, preliminarily determine the positions of the elements:
Then adjust: when the condition 1 − l_max·d/L < 0 holds, loop over the following steps from the last sorted position down to j = 1 to adjust the elements of the action vector: find the smallest j′ satisfying the stated conditions, decrease the element at position j by 1/L, and increase the element at position j′ by 1/L.
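The sub-steps above can be sketched as a greedy fill. The z_i counting rule of the lost formula is not recoverable, so this sketch simply assigns the maximum element l_max/L to files in descending order of their score η_iθ_i(t) until an assumed cache budget K (an upper bound on the sum of the a_c) is exhausted; the budget interpretation is an assumption.

```python
import math

def cache_decision(scores, L, d, K):
    """Greedy sketch of the decision construction: give files with the
    largest score eta_i*theta_i(t) the largest cache fraction l_max/L
    (the smallest multiple of 1/L with d*l/L >= 1), until the assumed
    cache budget sum_c a_c <= K is exhausted."""
    C = len(scores)
    l_max = math.ceil(L / d)            # smallest l with l*d/L >= 1
    a = [0.0] * C
    budget = K                          # remaining cache space
    for i in sorted(range(C), key=lambda i: -scores[i]):
        take = min(l_max / L, budget)
        take = math.floor(take * L) / L # keep a_i a multiple of 1/L
        a[i] = take
        budget -= take
        if budget <= 0:
            break
    return a
```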
3. Q-learning goal:
Substituting equation (6) into equation (5) yields:
Equation (14) gives the calculation of the true value of the cumulative reward obtained by executing action a(t) in state x(t):
Defining a loss function:
where the parameter vectors are η = [η_1, η_2, …, η_C] and ξ = [ξ_1, ξ_2, …, ξ_C]. The goal of Q-learning is to make the estimated value of the Q function as close as possible to its true value, i.e. to minimize the loss function.
4. Update the parameters β, η, and ξ in the approximate expression of the Q function by stochastic gradient descent:
where β_c (and likewise for the other parameters) denotes the value in the current time slot, β_p the value in the previous time slot, and 0 < δ ≤ 1 is the update step size.
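For a generic linear parameterization, the update of equation (17) reduces to the standard semi-gradient step toward the bootstrapped target of equation (14), r + γ·Q(x′, a*). Stacking β, η, and ξ into one parameter vector, as done below for compactness, is an assumption.

```python
import numpy as np

def sgd_update(params, features, target, delta):
    """One semi-gradient SGD step for a linear approximation
    Q(x,a) = params . features: move the stacked parameter vector
    (beta, eta, xi) toward the bootstrapped target r + gamma * Q(x', a*)
    with step size 0 < delta <= 1."""
    q_est = float(np.dot(params, features))
    return params + delta * (target - q_est) * features
```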
The seventh step: set the current time slot t = 1 and randomly set the starting state x(t) = [Θ(t), a(t−1)]; initialize the parameters β_p = 0, η_p = 0, ξ_p = 0. The operator sets the value of γ in the range [0, 1) according to how fast the network changes, determines the value of δ in the range (0, 1] according to the order of magnitude of the parameters to be updated, and sets the number of training time slots t_total according to the network scale.
Eighth step: in the cache decision stage of the t time slot, a strategy using an epsilon-greedy method takes a cooperative cache decision a (t) to be executed under a state x (t): selecting a cooperative caching decision according to the step 2 in the sixth step according to the probability 1-epsilon; randomly selecting one satisfying condition by probability epsilonAndto coordinate caching decisions.
The ninth step: the macro base station performs MDS coding on the files to be cached according to the second step and transmits the coded data packets to the small stations for caching.
The tenth step: in the file-transmission stage of the (t+1)-th time slot, users request files, and the base stations serve them by cooperative transmission according to the third step.
The eleventh step: in the information-exchange stage of the (t+1)-th time slot, all small stations in the coverage area of each macro base station report their file-request counts for the (t+1)-th time slot to the macro base station; the macro base station aggregates the total request counts into the vector N(t+1) and computes the total file popularity, recorded as the vector Θ(t+1).
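The popularity computation can be sketched as follows; since the patent's exact definition of θ_c(t) was lost with the figure, the normalized request frequency used here is an assumption.

```python
def popularity(counts):
    """Popularity vector Theta(t) from the aggregated request counts N(t),
    assumed here to be the normalized request frequencies
    theta_c = N_c / sum_j N_j."""
    total = sum(counts)
    return [n / total for n in counts] if total else [0.0] * len(counts)
```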
The twelfth step: the state transitions to x(t+1) = [Θ(t+1), a(t)]; calculate the reward function according to equation (3).
The thirteenth step: estimate the action to be executed in state x(t+1) according to step 2 of the sixth step:
The fourteenth step: update the parameters in the Q-function approximation according to equation (17).
The fifteenth step: if t = t_total, stop training and go to the sixteenth step; otherwise set t = t + 1, enter the next time slot, return to the eighth step, and continue training.
The sixteenth step: from the t-th time slot onward, determine the cooperative caching decision according to step 2 of the sixth step, based on the Q-function approximation obtained from training, to serve the file requests of the next time slot.
As described above, in the Q-function learning process the macro base station and the small stations in its coverage area act as the machine; the file popularity and the small stations' cooperative caching decision serve as the state, the cooperative caching decision as the action, and the number of file requests served directly by the small stations as the reward function. By continuously interacting with the environment with the goal of maximizing the cumulative reward, a Q-function approximation is learned, from which the cooperative caching decision in each state is obtained; the macro base station then MDS-encodes the files to be cached and transmits the coded results to the small stations for cooperative caching. The method uses reinforcement learning to discover patterns directly from the data, without solving an optimization problem based on an assumed data distribution. It can track file popularity as it changes in real time and fully exploit the latent file-request transition pattern when making cooperative caching decisions, making it better suited to practical systems; it significantly increases the number of file requests served directly by the small stations, effectively reduces the system's backhaul link load, improves system performance, and enhances the user experience.
The embodiments are only for illustrating the technical idea of the present invention, and the technical idea of the present invention is not limited thereto, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the scope of the present invention.
Claims (5)
1. A value-function-approximation-based coding cooperative caching method for small stations in an ultra-dense heterogeneous network, characterized in that: a macro base station and the small stations within its coverage area serve as the machine; the macro base station is responsible for determining the action to be executed by the small stations in each time slot's state and transmitting it to the small stations, while the small stations are responsible for executing the action; the state comprises the file popularity of the current time slot and the cooperative caching decision made in the previous time slot, and the action is the cooperative caching decision made in the current time slot to serve the file requests of the next time slot; a reinforcement learning method with value function approximation expresses the value function as a function of the state and the action and, taking maximization of the average cumulative number of file requests served directly by the small stations as the optimization objective, continuously interacts with the environment to adapt to its dynamic changes and mines the latent file-request transition pattern to obtain an approximate expression of the value function, from which a cooperative caching decision matched to the file-request transition pattern is derived; the macro base station encodes the files according to the cooperative caching decision and transmits the coded results to each small station;
the method comprises the following steps:
step 1, collecting network information, and setting parameters:
collecting the macro base station set M, the small station set P, and the requested-file set C_1 in the network, together with the number p_m of small stations in the coverage area of the m-th macro base station, m ∈ M; obtaining the small-station cache space K, determined by the operator according to network operating conditions and hardware cost; the operator divides a day into T time slots according to the file-request pattern in the ultra-dense heterogeneous network, sets the starting time of each time slot, and divides each time slot into three stages in order of occurrence: a file-transmission stage, an information-exchange stage, and a caching-decision stage;
step 2, formulating a base-station cooperative caching scheme based on MDS coding:
recording the cooperative caching decision vector of the small stations as a(t), where each element a_c(t) ∈ [0, 1], c ∈ C_1, represents the proportion of the c-th file cached in the t-th time slot; the set of files with a_c(t) ≠ 0 is the set of files cached in the t-th time slot, recorded as C′(t); the c-th file contains B information bits, from which the m-th macro base station generates check bits by MDS coding:
in the above formula, d is the number of small stations whose received signal power exceeds a threshold set by the operator according to network operating conditions; all check bits are divided into small-station candidate bits and macro-base-station candidate bits, where the small-station candidate bits comprise p_m·B bits, i.e. each small station has B mutually non-overlapping candidate bits, and in the t-th time slot each small station selects the first a_c(t)·B bits from its own candidate bits for caching; the macro base station randomly selects (1 − d·a_c(t))·B bits from its candidate bits for caching; by the MDS coding property, a file request can recover the whole file by acquiring at least B check bits of that file;
step 3, formulating a base station cooperative transmission scheme:
each file request of a user first obtains d·a_c(t)·B bits from the d small stations covering the user; if d·a_c(t) ≥ 1, the macro base station does not need to transmit data; otherwise, the macro base station selects the small station closest to the user among the d small stations and transmits (1 − d·a_c(t))·B bits to it, which the small station then forwards to the user; the data transmitted by the macro base station is called the backhaul link load;
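The transmission rule above can be sketched as a backhaul-load calculation; the aggregation over a request vector is an illustrative assumption.

```python
def backhaul_load(requests, a, d, B):
    """Backhaul load of the cooperative transmission scheme: for each request
    for file c the d covering small stations deliver d*a_c*B check bits;
    only when d*a_c < 1 does the macro base station transmit the remaining
    (1 - d*a_c)*B bits (any B check bits reconstruct the file under MDS)."""
    return sum(n * max(1.0 - d * a_c, 0.0) * B for n, a_c in zip(requests, a))
```

Note that the load vanishes for every file with d·a_c ≥ 1, which is exactly the event counted by the reward.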
step 4, describing a reinforcement learning task by using a Markov Decision Process (MDP):
establishing the reinforcement learning quadruple, in which X denotes the state space, A the action space, the third element the state transition probability, i.e. the probability of transitioning to state x′ when action a is executed in state x, and the fourth element the reward obtained for the transition;
the specific form of reinforcement learning quadruple is as follows:
action space: since the number of elements in the caching decision vector equals the number C of elements in the set C_1, the action space is a C-dimensional continuous space; each a_c(t) is quantized into L discrete values, with L determined by the operator according to the computing capacity of the macro station; the discretized action space is A = {a_1, a_2, …, a_|A|}, any action vector a_j, j ∈ {1, 2, …, |A|}, must satisfy the feasibility condition, the total number of action vectors satisfying the condition is |A|, and the caching decision of the t-th time slot satisfies a(t) ∈ A;
state space: the total file-request counts of the p_m small stations in the coverage area of the m-th macro station in the t-th time slot are recorded as a vector N(t) = [N_1(t), N_2(t), …, N_C(t)], and the total file popularity as a vector Θ(t) = [θ_1(t), θ_2(t), …, θ_C(t)]; the state of the t-th time slot is then recorded as x(t) = [Θ(t), a(t−1)]; let H = {Θ_1, Θ_2, …, Θ_|H|} be the set of all file popularity vectors; after quantization, Θ(t) is an element of the set H; the state space is denoted X = {x_1, x_2, …, x_|H||A|}, with state x(t) ∈ X;
state transition probability: after action a(t) is executed in the t-th time slot, it acts on the current state x(t), and the environment moves from the current state to the next state x(t+1) with an underlying transition probability; this transition probability is unknown;
reward: as the environment transitions to x(t+1), it gives the machine a reward, defined here as the number of file requests served directly by the small stations:
in the above formula, u[·] denotes the step function; the other two terms are, respectively, the number of files to be transmitted for the cache update in the caching-decision stage of the t-th time slot and the number of files transmitted by the macro base station in the information-exchange stage of the (t+1)-th time slot;
step 5, defining a reinforcement learning target:
defining a deterministic policy function π(x), x ∈ X, according to which the action to be executed in state x(t) is a(t) = π(x(t)); the state value function is:
in the above formula, the value function represents the cumulative reward obtained by following the policy π starting from state x(t), and 0 ≤ γ < 1 measures the degree to which the action π(x(t)) executed in time slot t influences future states;
after the state value function is obtained, a state-action value function, namely a Q function, is obtained:
in the above formula, the Q function represents the cumulative reward obtained by executing action a′(t) in state x(t) and following the policy π thereafter;
replacing x(t), x(t+1), a′(t) with x, x′, a respectively, the goal is to find the policy maximizing the expected cumulative reward, denoted π*(x), with the corresponding optimal value function; under the optimal policy:
namely:
step 6, formulating a Q-learning process based on value function approximation:
(601) representing the Q function by value function approximation, i.e. expressing it as a function of the state and the action together with the instantaneous reward; with action a′(t) executed in state x(t), the Q function is approximated as:
in the above formula, ω_1 and ω_2 are the weights of the two parts, set so that ω_1 ≫ ω_2; β, η_i, and ξ_i are unknown parameters to be obtained through learning;
(602) solving the cooperative caching decision:
(603) establishing a Q-learning goal:
calculating the true value of the cumulative reward obtained by executing action a(t) in state x(t) according to the above formula:
in the above formula, the action used for state x(t+1) is the estimated action in that state;
(604) defining a loss function:
in the above formula, η = [η_1, η_2, …, η_C], ξ = [ξ_1, ξ_2, …, ξ_C], and E_π denotes the expectation under the policy π;
updating parameters beta, eta and xi according to the loss function;
step 7, setting the current time slot t = 1 and randomly setting the starting state x(t) = [Θ(t), a(t−1)]; initializing the parameters β_p = 0, η_p = 0, ξ_p = 0; the operator sets the value of γ in the range [0, 1) according to how fast the network changes, determines the value of the update step size δ in the range (0, 1] according to the order of magnitude of the parameters to be updated, and sets the number of training time slots t_total according to the network scale;
step 8, in the caching-decision stage of the t-th time slot, using an ε-greedy policy to take the cooperative caching decision a(t) to be executed in state x(t);
step 9, the macro base station performing MDS coding on the files to be cached according to step 2 and transmitting the coded data packets to the small stations for caching;
step 10, in the file-transmission stage of the (t+1)-th time slot, users requesting files and the base stations serving them by cooperative transmission according to step 3;
step 11, in the information-exchange stage of the (t+1)-th time slot, all small stations in the coverage area of each macro base station reporting their file-request counts for the (t+1)-th time slot to the macro base station; the macro base station aggregating the total request counts into the vector N(t+1) and computing the total file popularity, recorded as the vector Θ(t+1);
Step 13, estimating the action to be executed in the state x (t + 1):
step 14, updating the parameters in the Q-function approximation according to step (604);
step 15, if t = t_total, stopping training and entering step 16; otherwise setting t = t + 1, entering the next time slot, returning to step 8, and continuing training;
and step 16, from the t-th time slot onward, determining the cooperative caching decision based on the Q-function approximation obtained by training and serving the file requests of the next time slot.
2. The method for cooperative caching of codes of small stations in ultra-dense heterogeneous network based on value function approximation as claimed in claim 1, wherein in step 3, the determination method of d is as follows:
let the probability that a user is served by d′ small stations be p_d′; first, based on the operator's base-station deployment, p_d′ is calculated from historical user-position data: over a time period τ, the positions of U users are recorded at intervals of τ′, with τ and τ′ set by the operator according to network operating conditions, and for each user u ∈ {1, 2, …, U} the number of positions at which exactly d′ base stations have received signal power greater than the threshold is counted; from the historical positions of the U users:
in the above formula, the counted quantity denotes the number of positions in user u's history at which i base stations provide service to user u;
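The empirical estimate of p_{d′} can be sketched as a frequency count over the recorded (user, position) samples; the list-of-lists input format is an illustrative assumption.

```python
from collections import Counter

def coverage_distribution(history):
    """Empirical estimate of p_{d'}: history[u] lists, for each recorded
    position of user u, how many base stations exceeded the received-power
    threshold there; p_{d'} is the fraction of all (user, position) samples
    with exactly d' covering base stations."""
    samples = [c for user in history for c in user]
    total = len(samples)
    return {d_cov: cnt / total for d_cov, cnt in Counter(samples).items()}
```

The operator can then pick d from this distribution, e.g. as the most probable coverage count.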
3. The value-function-approximation-based coding cooperative caching method for small stations in an ultra-dense heterogeneous network as claimed in claim 1, wherein in step (602), because ω_1 ≫ ω_2, the term weighted by ω_2 is omitted, yielding the caching decision:
the solution of the above equation is as follows:
① according to l_max·d/L ≥ 1, determine the maximum element of the caching decision vector, l_max/L; since, within the range satisfying the inequality, the smaller l_max is the better, l_max = ⌈L/d⌉, where ⌈·⌉ represents rounding up;
② according to the base-station cache space, calculate the number z_i of each element i/L, i = 1, 2, …, l_max, in the caching decision vector:
③ determine the position of each element: arrange the coefficients η_iθ_i(t), i = 1, 2, …, C, in descending order; the j-th element after sorting corresponds to the h_j-th file before sorting; first, preliminarily determine the positions of the elements:
then adjust: when the condition 1 − l_max·d/L < 0 holds, loop over the following steps from the last sorted position down to j = 1 to adjust the elements of the action vector: find the smallest j′ satisfying the stated conditions, decrease the element at position j by 1/L, and increase the element at position j′ by 1/L;
4. The value-function-approximation-based coding cooperative caching method for small stations in an ultra-dense heterogeneous network as claimed in claim 3, wherein in step 8, with probability 1 − ε the cooperative caching decision is selected according to step (602), and with probability ε a caching decision satisfying the feasibility conditions is selected at random.
5. The value-function-approximation-based coding cooperative caching method for small stations in an ultra-dense heterogeneous network as claimed in claim 1, wherein in step (604), a stochastic gradient descent method is used to update the parameters β, η, ξ:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811634918.6A CN109617991B (en) | 2018-12-29 | 2018-12-29 | Value function approximation-based cooperative caching method for codes of small stations of ultra-dense heterogeneous network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109617991A CN109617991A (en) | 2019-04-12 |
CN109617991B true CN109617991B (en) | 2021-03-30 |
Family
ID=66015366
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811634918.6A Active CN109617991B (en) | 2018-12-29 | 2018-12-29 | Value function approximation-based cooperative caching method for codes of small stations of ultra-dense heterogeneous network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109617991B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110138836B (en) * | 2019-04-15 | 2020-04-03 | 北京邮电大学 | Online cooperative caching method based on optimized energy efficiency |
CN110381540B (en) * | 2019-07-22 | 2021-05-28 | 天津大学 | Dynamic cache updating method for responding popularity of time-varying file in real time based on DNN |
CN111311996A (en) * | 2020-03-27 | 2020-06-19 | 湖南有色金属职业技术学院 | Online education informationization teaching system based on big data |
CN112218337B (en) * | 2020-09-04 | 2023-02-28 | 暨南大学 | Cache strategy decision method in mobile edge calculation |
CN112672402B (en) * | 2020-12-10 | 2022-05-03 | 重庆邮电大学 | Access selection method based on network recommendation in ultra-dense heterogeneous wireless network |
CN112911717B (en) * | 2021-02-07 | 2023-04-25 | 中国科学院计算技术研究所 | Transmission method for MDS (data packet System) encoded data packet of forwarding network |
CN113132466B (en) * | 2021-03-18 | 2022-03-15 | 中山大学 | Multi-access communication method, device, equipment and medium based on code cache |
CN115118728B (en) * | 2022-06-21 | 2024-01-19 | 福州大学 | Edge load balancing task scheduling method based on ant colony algorithm |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103929781A (en) * | 2014-04-09 | 2014-07-16 | 东南大学 | Cross-layer interference coordination optimization method in super dense heterogeneous network |
CN104782172A (en) * | 2013-09-18 | 2015-07-15 | 华为技术有限公司 | Small station communication method, device and system |
CN104955077A (en) * | 2015-05-15 | 2015-09-30 | 北京理工大学 | Heterogeneous network cell clustering method and device based on user experience speed |
CN106358308A (en) * | 2015-07-14 | 2017-01-25 | 北京化工大学 | Resource allocation method for reinforcement learning in ultra-dense network |
CN108882269A (en) * | 2018-05-21 | 2018-11-23 | 东南大学 | The super-intensive network small station method of switching of binding cache technology |
CN110445825A (en) * | 2018-05-04 | 2019-11-12 | 东南大学 | Super-intensive network small station coding cooperative caching method based on intensified learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5857405B2 (en) * | 2010-12-28 | 2016-02-10 | ソニー株式会社 | Information processing apparatus, playback control method, program, and content playback system |
Non-Patent Citations (1)
Title |
---|
Packet-based resource allocation in OFDMA femtocell two-tier networks; Zhang Haibo; Journal of Electronics & Information Technology; 2016-02-29; Vol. 38, No. 2; 262-268 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||