CN109617991B - Value function approximation-based cooperative caching method for codes of small stations of ultra-dense heterogeneous network - Google Patents
- Publication number
- CN109617991B (application CN201811634918.6A)
- Authority
- CN
- China
- Prior art keywords
- state
- base station
- file
- time slot
- small
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/50—Network services
- H04L67/56—Provisioning of proxy services
- H04L67/568—Storing data temporarily at an intermediate stage, e.g. caching
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/06—Protocols specially adapted for file transfer, e.g. file transfer protocol [FTP]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04W—WIRELESS COMMUNICATION NETWORKS
- H04W28/00—Network traffic management; Network resource management
- H04W28/02—Traffic management, e.g. flow control or congestion control
- H04W28/10—Flow control between communication endpoints
- H04W28/14—Flow control between communication endpoints using intermediate storage
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Mobile Radio Communication Systems (AREA)
Abstract
The invention discloses a value function approximation-based coded cooperative caching method for small stations in an ultra-dense heterogeneous network. Using the reinforcement learning method of value function approximation, the value function is expressed as a function of state and action, with the optimization target of maximizing the average cumulative number of file requests served directly by the small stations. Through continuous interaction with the environment, the method adapts to the environment's dynamic changes and mines the latent transition pattern of file requests to obtain an approximation formula for the value function, and thereby a cooperative caching decision matched to that transition pattern; the macro base station encodes the cooperative caching decision and transmits the coded cooperative caching result to each small station. The invention makes caching decisions from the file-request transition patterns mined by reinforcement learning from a real network; as a data-driven method it requires no assumption about the prior distribution of the data and is therefore better suited to practical systems. Through real-time interaction with the environment it can track time-varying file popularity and make corresponding caching strategies; the process is simple and feasible and requires no solution of an NP-hard problem.
Description
Technical Field
The invention belongs to the technical field of wireless network deployment in mobile communication, and particularly relates to a coded cooperative caching method for small stations in an ultra-dense heterogeneous network.
Background
With the popularization of intelligent terminals and the development of Internet services, and to meet users' demands for high data rates and high quality of service, the ultra-dense heterogeneous network will become one of the key technologies of the fifth-generation mobile communication system (5G). By deploying dense small stations within the coverage of a macro base station, the communication quality of users at the network edge can be effectively improved, raising spectrum efficiency and system throughput. However, since the small stations connect to the macro base station through wireless backhaul links, densely deployed small stations put great strain on those links, and the heavily loaded wireless backhaul becomes the bottleneck of the network. Ultra-dense network architectures therefore need to be integrated with other architectures or technologies to serve users better, and pushing network functionality to the mobile edge is a suitable option. Edge storage is an important concept in the mobile edge architecture: caching files at the small stations reduces mass data transmission at peak times, which effectively lowers the backhaul load of the system, reduces transmission delay, and improves user experience. Small stations in an ultra-dense heterogeneous network are numerous and closely spaced, so a user is generally within the coverage of several small stations; if the small stations transmit files for the user cooperatively, their limited cache space can be utilized more fully. The edge caching problem in ultra-dense heterogeneous networks is therefore worth intensive research.
Existing caching techniques model the caching decision as an optimization problem. First, file popularity is generally assumed to be time-invariant, whereas popularity in an actual network changes constantly; a method that solves an optimization problem under fixed popularity cannot track these changes, so the resulting caching decision does not suit the actual network well. Second, even if the fixed popularity is replaced by instantaneous popularity, the optimization problem must be solved anew every time the popularity changes, which brings huge network overhead; moreover, the modeled optimization problem is often NP-hard (non-polynomial hard) and very difficult to solve. Finally, because caching makes a decision from file-request behavior that has already occurred in order to prepare for requests yet to come, a caching decision based on traditional optimization cannot mine the transition pattern of file requests in the network, so the decision it produces is not optimal for the requests that will occur.
Disclosure of Invention
In order to solve the technical problems described in the background, the invention provides a value function approximation-based coded cooperative caching method for small stations in an ultra-dense heterogeneous network: the latent transition pattern of file requests is mined by the value function approximation method, yielding a cooperative caching strategy superior to that of traditional methods.
In order to achieve the technical purpose, the technical scheme of the invention is as follows:
the macro base station and the small stations within its coverage act as the machine. The macro base station is responsible for determining the action the small stations are to execute in the state of each time slot and sending it to the small stations; each small station is responsible for executing the action. The state comprises the file popularity of the current slot and the cooperative caching decision made in the previous slot; the action is the cooperative caching decision made in the current slot to serve the file requests of the next slot. Using the reinforcement learning method of value function approximation, the value function is expressed as a function of state and action, with the optimization target of maximizing the average cumulative number of file requests served directly by the small stations. Through continuous interaction with the environment, the method adapts to the environment's dynamic changes and mines the latent transition pattern of file requests to obtain an approximation formula for the value function, and thereby a cooperative caching decision matched to that pattern; the macro base station encodes the cooperative caching decision and transmits the coded cooperative caching result to each small station.
Further, the method comprises the following steps:
step 1, collecting network information and setting parameters: collect the macro base station set M, the small station set P, and the file request set C_1 in the network, together with the number p_m of small stations in the coverage area of the m-th macro base station, m ∈ M; obtain the small-station cache space K, which is determined by the operator according to the network operating conditions and hardware cost. The operator divides a day into T time slots according to the file-request situation in the ultra-dense heterogeneous network, sets the starting time of each slot, and divides each slot into three phases in order of occurrence: a file transmission phase, an information exchange phase, and a cache decision phase;
step 2, formulating a base station cooperation caching scheme based on MDS coding:
record the cooperative caching decision vector of the small stations as a(t), where each element a_c(t) ∈ [0,1], c ∈ C_1, represents the fraction of the c-th file cached in the t-th time slot; the set of files with a_c(t) ≠ 0 is the set of files cached in slot t, denoted C'(t). The c-th file contains B information bits, and the m-th macro base station encodes these B information bits into check bits via MDS coding:
in the above formula, d is the number of small stations whose received signal power exceeds a threshold; the threshold is determined by the operator according to the network operating conditions. All the check bits are divided into small-station candidate bits and macro-base-station candidate bits; the small-station candidate bits comprise p_m·B bits, i.e., each small station has B mutually non-repeating candidate bits, and in the t-th slot each small station caches the first a_c(t)·B bits of its own candidate bits;
the macro base station randomly selects (1 − d·a_c(t))·B bits from its candidate bits to cache; by the coding property of MDS, a file request that obtains at least B check bits can recover the whole file;
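As an illustration of the bit accounting in this caching scheme, the following sketch (in Python, with illustrative function and parameter names; the patent's own check-bit formula is not reproduced in this text) computes how many coded bits one small station and the macro base station cache for a file with caching fraction a_c:

```python
import math

def cache_allocation(a_c, B, p_m, d):
    """Bits cached per node for one file under the MDS-coded scheme.

    Each of the p_m small stations holds B non-overlapping candidate
    check bits and caches the first a_c * B of them; the macro base
    station caches (1 - d * a_c) * B bits, or none when the d covering
    small stations already supply at least B coded bits per request.
    """
    small_station_bits = math.floor(a_c * B)           # per small station
    macro_bits = max(0, math.ceil((1 - d * a_c) * B))  # macro backup copy
    return small_station_bits, macro_bits

# A request reached by d small stations collects d * a_c * B cached
# bits; the macro supplies the remainder, so at least B check bits
# arrive and MDS decoding recovers the file.
s, m = cache_allocation(a_c=0.25, B=1000, p_m=6, d=3)
assert s * 3 + m >= 1000
```

Note that for d·a_c ≥ 1 the macro caches nothing for that file, matching the transmission rule in step 3.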
step 3, formulating a base station cooperative transmission scheme:
each file request of a user first obtains d·a_c(t)·B bits from the d small stations covering it; if d·a_c(t) ≥ 1, the macro base station need not transmit any data. Otherwise, the macro base station selects the small station closest to the user among the d small stations and transmits (1 − d·a_c(t))·B bits to it, and that small station then transmits the bits to the user; the data transmitted by the macro base station is called the backhaul-link load;
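The per-slot backhaul load implied by this transmission scheme can be sketched as follows (Python, illustrative names; `a` and `N` are assumed to be the caching-fraction and request-count vectors):

```python
def backhaul_load(a, N, d, B):
    """Total backhaul-link load for one slot: for each request for
    file c that the d covering small stations cannot serve in full
    (d * a_c < 1), the macro base station forwards the missing
    (1 - d * a_c) * B bits."""
    return sum(max(0.0, 1 - d * a_c) * B * n for a_c, n in zip(a, N))

# file 1: d*a = 1.0 -> 0 bits; file 2: 0.6*100*5 = 300; file 3: 1*100*2 = 200
load = backhaul_load(a=[0.5, 0.2, 0.0], N=[10, 5, 2], d=2, B=100)
```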
step 4, describing a reinforcement learning task by using a Markov Decision Process (MDP):
establishing the reinforcement learning quadruple ⟨X, A, P, R⟩, where X represents the state space, A the action space, P^a_{x→x'} the state-transition probability, i.e., the probability of transitioning to state x' when action a is performed in state x, and R^a_{x→x'} the reward for that transition;
the specific form of reinforcement learning quadruple is as follows:
an action space: since the number of elements in the caching decision vector equals the number C of elements of the set C_1, the action space is a C-dimensional continuous space; a_c(t) is quantized into L discrete values, L being determined by the operator according to the computing capacity of the macro station. The discretized action space is A = {a_1, a_2, …, a_|A|}; any action vector a_j, j ∈ {1, 2, …, |A|}, must satisfy the cache-space constraint (the total size of the bits it caches may not exceed the small-station cache space K). The total number of action vectors satisfying this condition is |A|, and the caching decision a(t) of the t-th slot satisfies a(t) ∈ A;
state space: the total numbers of file requests at the p_m small stations within the coverage of the m-th macro station in the t-th slot are recorded as the vector N(t) = [N_1(t), N_2(t), …, N_C(t)], and the total file popularity as the vector Θ(t) = [θ_1(t), θ_2(t), …, θ_C(t)], where θ_c(t) = N_c(t) / Σ_{c'=1}^{C} N_{c'}(t), c ∈ C; the state of the t-th slot is recorded as x(t) = [Θ(t), a(t−1)]. Let H = {Θ_1, Θ_2, …, Θ_|H|} be the set of total file popularity vectors; after quantization, Θ(t) is an element of the set H, and the state space is recorded as X = {x_1, x_2, …, x_{|H||A|}}, with state x(t) ∈ X;
probability of state transition: after action a(t) is performed in the t-th time slot, it acts on the current state x(t), and the environment transitions from the current state to the next state x(t+1) with a latent transition probability; this transition probability is unknown;
reward: at the same time the environment transitions to x(t+1), it gives the machine a reward, defined here as the number of file requests served directly by the small stations:
in the above formula, u[·] represents the step function; the first quantity is the number of files whose content must be transmitted when the cache is updated in the cache decision phase of the t-th slot, and the second is the number of files transmitted by the macro base station in the information exchange phase of the (t+1)-th slot;
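A minimal sketch of the state quantities and the reward just described (Python; the popularity formula and the omission of the patent's file-transfer correction terms are assumptions of this sketch):

```python
def file_popularity(N):
    """theta_c(t) = N_c(t) / sum of all request counts: the fraction
    of the slot's requests that target file c (assumed form)."""
    total = sum(N)
    return [n / total for n in N]

def u(x):
    """Step function: 1 when the bracketed value is positive, else 0."""
    return 1 if x > 0 else 0

def direct_service_reward(N_next, a, d):
    """Number of requests served directly by the small stations: a
    request for file c needs no macro transmission when d * a_c >= 1.
    Minimal sketch; the patent's formula additionally subtracts the
    file-transfer counts for cache updates and macro transmissions."""
    return sum(n * u(d * a_c - 1) for n, a_c in zip(N_next, a))

theta = file_popularity([30, 50, 20])              # [0.3, 0.5, 0.2]
r = direct_service_reward([10, 5, 2], [0.6, 0.2, 0.0], d=2)
```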
step 5, defining a reinforcement learning target:
defining a deterministic policy function π(x), x ∈ X; according to this policy, the action to be executed in state x(t) is a(t) = π(x(t)). The state value function is:
in the above formula, the state value function represents the cumulative reward obtained by following policy π starting from state x(t); 0 ≤ γ < 1 measures the degree to which the action π(x(t)) executed in slot t influences future states;
after the state value function is obtained, the state-action value function, namely the Q function, is obtained:
in the above formula, the Q function of (x(t), a'(t)) represents the cumulative reward obtained by executing action a'(t) in state x(t) and thereafter following policy π;
replacing x(t), x(t+1), a'(t) with x, x', a respectively, the goal is to find the policy that maximizes the expected cumulative reward, denoted π*(x), with the corresponding optimal value function; under the optimal policy one obtains:
namely:
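The optimal-policy relations just stated, whose equation images do not survive in this text, take the standard Bellman form; a sketch in conventional notation, assuming the quadruple defined in step 4:

```latex
V^{*}(x) = \max_{a \in A} Q^{*}(x,a),
\qquad
Q^{*}(x,a) = \sum_{x' \in X} P^{a}_{x \to x'}
  \Bigl( R^{a}_{x \to x'} + \gamma \max_{a' \in A} Q^{*}(x',a') \Bigr),
\qquad
\pi^{*}(x) = \arg\max_{a \in A} Q^{*}(x,a).
```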
step 6, formulating a Q-learning process based on value function approximation:
(601) expressing the Q function by value function approximation, i.e., representing the Q function as a function of state and action, subject to the instantaneous reward; in state x(t), with action a'(t) executed, the Q function is approximately expressed as:
in the above formula, ω_1 and ω_2 represent the weights of the two parts, with ω_1 >> ω_2; β, η_i, ξ_i are unknown parameters that must be obtained through learning;
(602) solving the cooperative caching decision:
(603) establishing a Q-learning goal:
the target value of the cumulative reward for performing action a(t) in state x(t) is calculated according to the above formula:
in the above formula, the final term uses the action estimated for state x(t+1);
(604) defining a loss function:
in the above formula, η = [η_1, η_2, …, η_C], ξ = [ξ_1, ξ_2, …, ξ_C], and E_π denotes the expectation under policy π;
updating parameters beta, eta and xi according to the loss function;
step 7, setting the current time slot t = 1 and randomly setting the starting state x(t) = [Θ(t), a(t−1)]; initialize the parameters to β_p = 0, η_p = 0, ξ_p = 0. The operator sets the value of γ within [0, 1) according to the speed of network change, determines the value of the update step δ within (0, 1] according to the order of magnitude of the parameters to be updated, and sets the number of training time slots t_total according to the network scale;
Step 8, in the cache decision phase of the t-th time slot, use the ε-greedy strategy to take the cooperative caching decision a(t) to be executed in state x(t);
step 9, the macro base station carries out MDS coding on the files needing to be cached according to the step 2, and transmits the coded data packets to the small station for caching;
step 10, in the file transmission stage of the t +1 time slot, a user requests a file, and the base station performs cooperative transmission to serve the user according to the step 3;
step 11, in the information exchange phase of the (t+1)-th time slot, all small stations within the coverage of each macro base station report their file request counts for slot t+1 to the macro base station; the macro base station aggregates the total request counts into the vector N(t+1) and calculates the total file popularity, recorded as the vector Θ(t+1);
Step 13, estimating the action to be executed in the state x (t + 1):
step 14, updating parameters in the Q function approximation formula according to the step (604);
step 15, if t ═ ttotalIf yes, stopping training and entering step 16; otherwise, t is t +1, enter the next time slot, go back to step 8, continue training;
and step 16, starting from the t time slot, determining a cooperative caching decision based on the Q function approximation formula obtained by training, and serving a file request of the next time slot.
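The per-slot loop of steps 7 through 16 can be sketched as follows (Python; the environment class is a toy stand-in, since the real environment is the network itself and the real state is [Θ(t), a(t−1)]):

```python
class DummyEnv:
    """Toy two-state stand-in for the network environment, used only
    to demonstrate the control flow of the training loop."""
    def reset(self):
        self.x = 0
        return self.x

    def step(self, a):
        r = 1 if a == self.x else 0  # reward: caching decision fit the state
        self.x = 1 - self.x          # environment moves to the next state
        return self.x, r

def train(env, select_action, update, t_total):
    """Per-slot loop of steps 8-15: decide, cache and serve, observe,
    then update the Q-function parameters."""
    x = env.reset()
    history = []
    for _ in range(t_total):
        a = select_action(x)           # step 8: eps-greedy caching decision
        x_next, r = env.step(a)        # steps 9-11: cache, transmit, report
        update(x, a, r, x_next)        # steps 13-14: update Q parameters
        history.append(r)
        x = x_next
    return history

rewards = train(DummyEnv(), select_action=lambda x: x,
                update=lambda *args: None, t_total=4)
```

After `t_total` slots, step 16 drops the exploration and uses the learned approximation directly.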
Further, in step 3, the determination method of d is as follows:
let the probability of a user being served by d' small stations be p_{d'}. First, based on the operator's base-station deployment, p_{d'} is calculated from historical user-position data: over a time period τ, the positions of U users are recorded every τ' (τ and τ' are determined by the operator according to the network operating conditions), and for each user u ∈ {1, 2, …, U} the number of recorded positions at which exactly d' base stations have received signal power above the threshold is counted. From the historical positions of the U users one calculates:
in the above formula, the count denotes the number of positions, among user u's historical positions, at which i base stations provide service for user u;
the solution of the above equation is as follows:
according to the condition l_max·d/L ≥ 1, the maximum value of the elements in the cache decision vector is determined, l_max being the index of the largest element (elements take the form i/L); since, within the range satisfying the inequality, a smaller l_max is better, l_max is obtained by rounding up, where ⌈·⌉ represents rounding up;
secondly, the number z_i of elements equal to i/L in the caching decision vector, i = 1, 2, …, l_max, is calculated according to the cache space of the base station:
determining the position of each element: sort the coefficients η_i·θ_i(t), i = 1, 2, …, C, in descending order; the j-th element after sorting corresponds to the h_j-th file before sorting. First, preliminarily determine the positions of the elements:
then adjust: when the condition 1 − l_max·d/L < 0 holds, loop over the sorted positions from the last down to j = 1 to adjust the elements of the action vector: find the smallest j' satisfying the stated conditions, reduce that element by 1/L, and add 1/L to the other;
Further, in step 8, with probability 1 − ε the cooperative caching decision is selected according to step (602); with probability ε, one caching decision satisfying the constraints is selected at random as the cooperative caching decision.
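The ε-greedy selection described here can be sketched as (Python, illustrative names; the action list is assumed to contain only caching vectors that already satisfy the constraints):

```python
import random

def epsilon_greedy(feasible_actions, q_value, eps, rng=random):
    """With probability eps explore a random feasible caching vector;
    otherwise exploit the vector maximizing the approximate Q value."""
    if rng.random() < eps:
        return rng.choice(feasible_actions)   # exploration branch
    return max(feasible_actions, key=q_value) # exploitation branch

# eps = 0 makes the choice deterministic: pick the best-scored action
best = epsilon_greedy([(0.0,), (0.5,), (1.0,)],
                      lambda a: -abs(a[0] - 0.5), eps=0.0)
```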
Further, in step (604), a stochastic gradient descent method is adopted to update the parameters β, η, ξ in the Q-function approximation expression:
in the above formula, β_c and the corresponding quantities represent the parameters of the current time slot, β_p and the corresponding quantities the parameters of the previous time slot, and 0 < δ ≤ 1 represents the update step size.
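A minimal sketch of one such stochastic-gradient update, assuming a generic linear-in-parameters Q approximation (the patent's specific two-part feature form with weights ω_1 >> ω_2 is abstracted into a single feature vector; all names are illustrative):

```python
import numpy as np

def sgd_td_step(params, feats, r, feats_next, gamma, delta):
    """One stochastic-gradient step on the squared TD error for a
    linear Q approximation Q(x, a) = params . feats(x, a)."""
    q = params @ feats                              # current estimate
    target = r + gamma * (params @ feats_next)      # Q-learning target
    return params + delta * (target - q) * feats    # gradient step

p = sgd_td_step(np.zeros(3), np.array([1.0, 2.0, 0.0]),
                r=1.0, feats_next=np.zeros(3), gamma=0.9, delta=0.01)
```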
Advantageous effects brought by the above technical scheme:
the invention serves users by small-station cooperative coded caching and cooperative transmission, and makes caching decisions by mining, through reinforcement learning, the transition pattern of file requests collected from the real network. As a data-driven machine learning method it requires no assumption about the prior distribution of the data and is better suited to practical systems; through real-time interaction with the environment it can track time-varying file popularity and make corresponding caching strategies, the process is simple and feasible, and no NP-hard problem needs to be solved.
The invention makes cooperative caching decisions based on the value function approximation method: the macro base station collects state information through continuous interaction with the environment, makes the corresponding cooperative caching decision, and transmits it to each small station, so that the limited storage space of the small stations is used effectively to cache the most appropriate files, the number of file requests served directly by the small stations increases significantly, and the backhaul-link load of the system is reduced.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The technical scheme of the invention is explained in detail in the following with the accompanying drawings.
The invention provides a value function approximation-based coded cooperative caching method for small stations in an ultra-dense heterogeneous network, whose aim is to maximize the average cumulative number of file requests served directly by the small stations, under the premise that the total size of the files a small station caches does not exceed its cache space. The method mines the transition pattern of file requests through reinforcement learning and formulates a coded cooperative caching method for the small stations according to the mined pattern. The reinforcement learning task is described as an MDP (Markov Decision Process). The macro base station and the small stations within its coverage act as the machine: the macro base station determines the actions to be executed and sends them to the small stations; the small stations execute the actions and change the environment; the environment feeds a reward back to the machine according to a reward function. Through continuous interaction with the environment, the machine learns the action the small stations should execute in the state of each time slot, where the state is the partial description of the environment observed by the macro base station, comprising the file popularity of the current slot and the cooperative caching decision made in the previous slot, and the action is the cooperative caching decision made in the current slot to serve the file requests of the next slot. The reward function is defined according to the goal of the caching decision, here the number of file requests served directly by the small stations.
Value function approximation is a reinforcement learning method suited to tasks with a huge discrete state space or a continuous state space. It expresses the value function as a function of state and action, takes maximizing the average cumulative number of file requests served directly by the small stations as the optimization target, adapts to the dynamic changes of the environment through continuous interaction with it, and mines the latent transition pattern of file requests to obtain an approximation formula for the value function and, from it, a cooperative caching decision matched to that pattern. The macro base station encodes the files with the MDS (Maximum Distance Separable) coding method and finally transmits the coded cooperative caching result to each small station, so that the number of file requests served directly by the small stations increases significantly and the system backhaul-link load is reduced.
An embodiment is given below by taking an LTE-A system as an example, and as shown in fig. 1, the specific steps are as follows:
the first step is as follows: collecting network information, and setting parameters:
collect the macro base station set M, the small station set P, and the file request set C_1 in the network, together with the number p_m of small stations in the coverage area of the m-th macro base station, m ∈ M; the set C_1 contains C files. Obtain the small-station cache space K, which is determined by the operator according to the network operating conditions and hardware cost. The operator divides the time of day into T time slots according to the file-request situation in the ultra-dense heterogeneous network, sets the starting time of each slot, and divides each slot into three phases in order of occurrence: a file transmission phase, an information exchange phase, and a cache decision phase.
The second step is that: formulating a base station cooperation caching scheme based on MDS coding:
the cooperative caching decision vector of the small stations is recorded as a(t) = [a_1(t), a_2(t), …, a_C(t)], where 0 ≤ a_c(t) ≤ 1, c ∈ C, represents the fraction of the c-th file cached at the small stations in the t-th time slot; the set of files with a_c(t) ≠ 0 (i.e., the set of files cached in slot t) is denoted C'(t). File c contains B information bits, and macro base station m encodes these B information bits into check bits via MDS coding:
where d is the number of small stations whose received signal power exceeds a threshold; the threshold is determined by the operator according to the network operating conditions. All the check bits are divided into small-station candidate bits and macro-base-station candidate bits; the small-station candidate bits comprise p_m·B bits, i.e., each small station has B mutually non-repeating candidate bits, and in the t-th slot each small station caches the first a_c(t)·B bits of its own candidate bits;
the macro base station randomly selects (1 − d·a_c(t))·B bits from its candidate bits to cache; by the coding property of MDS, a file request that obtains at least B check bits can recover the whole file.
The third step: formulating a base station cooperative transmission scheme:
each file request of a user first obtains d·a_c(t)·B bits from the d small stations covering it; if d·a_c(t) ≥ 1, the macro base station need not transmit any data. Otherwise, the macro base station selects the small station closest to the user among the d small stations and transmits (1 − d·a_c(t))·B bits to it, which that small station then transmits to the user; the data transmitted by the macro base station is called the backhaul-link load. The method for determining d:
the probability of a user being served by d' small stations is p_{d'}. First, based on the operator's base-station deployment, p_{d'} is calculated from historical user-position data: over a time period τ, the positions of U users are recorded every τ' (τ and τ' are determined by the operator according to the network operating conditions), and for each user u ∈ {1, 2, …, U} the number of recorded positions at which exactly d' base stations have received signal power above the threshold is counted. From the historical positions of the U users one calculates:
where the count denotes the number of positions, among user u's historical positions, at which i base stations served user u.
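The empirical estimate of p_{d'} from recorded user positions can be sketched as (Python; the nested-dictionary layout of the per-user counts is an assumption of this sketch):

```python
def coverage_distribution(position_counts):
    """Empirical probability p_d that a recorded user position is
    covered by exactly d base stations above the power threshold.

    position_counts[u][i] = number of recorded positions of user u at
    which exactly i base stations exceeded the threshold.
    """
    total = sum(sum(counts.values()) for counts in position_counts)
    dist = {}
    for counts in position_counts:
        for i, n in counts.items():
            dist[i] = dist.get(i, 0) + n / total
    return dist

# two users, eight recorded positions in total
p = coverage_distribution([{1: 2, 2: 2}, {2: 4}])
```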
the fourth step: the reinforcement learning task is described by MDP:
the reinforcement learning quadruple ⟨X, A, P, R⟩ is established, where X represents the state space, A the action space, P^a_{x→x'} the probability of transitioning to state x' when action a is performed in state x, and R^a_{x→x'} the reward for the transition;
the specific form of reinforcement learning quadruples in this problem is as follows:
1. An action space: the action is defined as the cooperative caching decision vector of the small stations, and the actions the machine can take form the action space. The number of elements in the caching decision vector equals the number C of files, so the action space is a C-dimensional continuous space with 0 ≤ a_c ≤ 1 in each dimension, c ∈ C; a_c is quantized into L discrete values, L being determined by the operator according to the computing capacity of the macro station. The discretized action space is A = {a_1, a_2, …, a_|A|}; any action vector a_j, j ∈ {1, 2, …, |A|}, must satisfy the cache-space constraint (the total size of the bits it caches may not exceed the small-station cache space K). The total number of action vectors satisfying this condition is |A|, and the caching decision of the t-th slot satisfies a(t) ∈ A.
2. State space: the state is the machine's perceived description of the environment, composed of the file popularity vector and the cooperative caching decision vector of the small stations. The total numbers of file requests at the p_m small stations within the coverage of the m-th macro station in the t-th slot are recorded as the vector N(t) = [N_1(t), N_2(t), …, N_C(t)], and the total file popularity as the vector Θ(t) = [θ_1(t), θ_2(t), …, θ_C(t)], where θ_c(t) = N_c(t) / Σ_{c'=1}^{C} N_{c'}(t), c ∈ C; the state of the t-th slot is recorded as x(t) = [Θ(t), a(t−1)]. Let H = {Θ_1, Θ_2, …, Θ_|H|} be the set of total file popularity vectors; after quantization, Θ(t) is an element of the set H, and the state space is recorded as X = {x_1, x_2, …, x_{|H||A|}}, with state x(t) ∈ X.
3. Probability of state transition: after action a(t) is performed in the t-th time slot, it acts on the current state x(t), and the environment transitions from the current state to the next state x(t+1) with a latent transition probability; this transition probability is unknown.
4. Reward: at the same time the environment transitions to x(t+1), it gives the machine a reward, defined here as the number of file requests served directly by the small stations:
where u[·] represents the step function, whose value is 1 when the bracketed quantity is greater than 0 and 0 otherwise; the first quantity is the number of files whose content must be transmitted when the cache is updated in the cache decision phase of the t-th slot, and the second is the number of files transmitted by the macro base station in the information exchange phase of the (t+1)-th slot.
The fifth step: defining the reinforcement learning target:
defining a deterministic policy function π(x), x ∈ X; according to this policy, the action executed in state x(t) is a(t) = π(x(t)). Define the γ-discounted expected cumulative reward function:
where E_π denotes the expectation under policy π; the function represents the cumulative reward obtained by following policy π starting from state x(t), and 0 ≤ γ < 1 measures the degree to which the action π(x(t)) executed in slot t influences future states.
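The γ-discounted cumulative reward that this value function estimates can be computed for a finite reward sequence as (Python):

```python
def discounted_return(rewards, gamma):
    """Cumulative gamma-discounted reward: sum over k of
    gamma**k * r_{t+k+1}, evaluated backwards for numerical clarity."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

g = discounted_return([1, 1, 1], gamma=0.5)  # 1 + 0.5*1 + 0.25*1 = 1.75
```

With γ closer to 1 the caching decision of slot t weighs distant future requests more heavily, which is why the operator ties γ to the speed of network change.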
After obtaining the state value function, the state-action value function (the Q function) is obtained:
which represents the cumulative reward obtained by executing action a'(t) in state x(t) and thereafter following policy π; equations (4) and (5) are called the Bellman equations.
Replacing x(t), x(t+1), a′(t) with x, x′, a respectively, the goal is to find the policy maximizing the expected cumulative reward, denoted π*(x), with the corresponding optimal value function. Under the optimal policy, equations (4) and (5) give:
namely:
Equations (6) and (7) reveal how a non-optimal policy is improved, namely by changing the action selected by the policy to the currently optimal action:
When the reinforcement learning quadruple is fully known, the Bellman equation can be solved based on equation (8) by a value iteration or policy iteration algorithm to obtain the optimal policy.
The sixth step: Q-learning based on value function approximation when the state transition probability is unknown:
Because the state transition probability is unknown, the optimal policy cannot be obtained through a policy iteration or value iteration algorithm; for the same reason, converting the state value function into the Q function is difficult, so the Q function is estimated directly;
1. Q-function approximation: to overcome the difficulty of storing and exhaustively searching a Q-table caused by the large state and action spaces, the Q function is represented by value function approximation, i.e. it is expressed as a function of the state and the action together with the instantaneous reward. Taking time slot t as an example, with action a′(t) executed in state x(t), the Q function is approximated as:
where ω_1 and ω_2 are the weights of the two parts, set so that ω_1 ≫ ω_2; β, η_i, and ξ_i are unknown parameters to be learned.
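The exact feature map of equation (10) was lost with the figure and is not recoverable from the text; the sketch below only illustrates the general shape described here: a linear approximation with a bias β, a dominant ω_1-weighted part coupling the popularity θ_i(t) to the action through η_i, and a minor ω_2-weighted part coupling the previous decision through ξ_i. Every feature choice in this sketch is an assumption.

```python
import numpy as np

def q_approx(theta, a_prev, a, beta, eta, xi, d, w1=1.0, w2=1e-3):
    """Hypothetical linear Q-function in the spirit of eq. (10): bias beta,
    a dominant w1-weighted term coupling popularity theta_i (via eta_i) with
    the fraction of file i the small stations can serve (d*a_i, capped at 1),
    and a minor w2-weighted term coupling the previous decision via xi_i."""
    theta, a_prev, a = map(np.asarray, (theta, a_prev, a))
    served = np.minimum(d * a, 1.0)  # fraction of a request served locally
    return float(beta + w1 * np.dot(eta * theta, served) + w2 * np.dot(xi, a_prev))
```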
2. Selection of a collaborative caching decision:
Equation (11) selects the cooperative caching policy that maximizes the value in the brackets. As the bracketed expression shows, the factor η_iθ_i(t) multiplying (1 − d·a′_i(t)) directly determines the magnitude of the bracketed value: the larger η_iθ_i(t) is, the smaller the corresponding (1 − d·a′_i(t)) should be, so that the bracketed value becomes larger. Equation (11) is therefore solved as follows:
① According to l_max·d/L ≥ 1, determine the maximum element of the caching decision vector, l_max/L; since, within the range satisfying the inequality, the smaller l_max is the better, l_max = ⌈L/d⌉, where ⌈·⌉ represents rounding up;
② Calculate the number z_i of each element i/L, i = 1, 2, …, l_max, in the caching decision vector:
③ Determine the position of each element: arrange the coefficients η_iθ_i(t), i = 1, 2, …, C, in descending order; the j-th element after sorting corresponds to the h_j-th file before sorting. First, preliminarily determine the positions of the elements:
Then adjust: when the condition 1 − l_max·d/L < 0 holds, loop over the following steps from the last sorted position down to j = 1 to adjust the elements of the action vector: find the smallest j′ satisfying the stated conditions, decrease the element at position j by 1/L, and increase the element at position j′ by 1/L.
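The sub-steps above can be sketched as a greedy fill. The z_i counting rule of the lost formula is not recoverable, so this sketch simply assigns the maximum element l_max/L to files in descending order of their score η_iθ_i(t) until an assumed cache budget K (an upper bound on the sum of the a_c) is exhausted; the budget interpretation is an assumption.

```python
import math

def cache_decision(scores, L, d, K):
    """Greedy sketch of the decision construction: give files with the
    largest score eta_i*theta_i(t) the largest cache fraction l_max/L
    (the smallest multiple of 1/L with d*l/L >= 1), until the assumed
    cache budget sum_c a_c <= K is exhausted."""
    C = len(scores)
    l_max = math.ceil(L / d)            # smallest l with l*d/L >= 1
    a = [0.0] * C
    budget = K                          # remaining cache space
    for i in sorted(range(C), key=lambda i: -scores[i]):
        take = min(l_max / L, budget)
        take = math.floor(take * L) / L # keep a_i a multiple of 1/L
        a[i] = take
        budget -= take
        if budget <= 0:
            break
    return a
```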
3. Q-learning goal:
Substituting equation (6) into equation (5) yields:
Equation (14) gives the calculation of the true value of the cumulative reward obtained by executing action a(t) in state x(t):
Defining a loss function:
where the parameter vectors are η = [η_1, η_2, …, η_C] and ξ = [ξ_1, ξ_2, …, ξ_C]. The goal of Q-learning is to make the estimated value of the Q function as close as possible to its true value, i.e. to minimize the loss function.
4. Update the parameters β, η, and ξ in the approximate expression of the Q function by stochastic gradient descent:
where β_c (and likewise for the other parameters) denotes the value in the current time slot, β_p the value in the previous time slot, and 0 < δ ≤ 1 is the update step size.
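For a generic linear parameterization, the update of equation (17) reduces to the standard semi-gradient step toward the bootstrapped target of equation (14), r + γ·Q(x′, a*). Stacking β, η, and ξ into one parameter vector, as done below for compactness, is an assumption.

```python
import numpy as np

def sgd_update(params, features, target, delta):
    """One semi-gradient SGD step for a linear approximation
    Q(x,a) = params . features: move the stacked parameter vector
    (beta, eta, xi) toward the bootstrapped target r + gamma * Q(x', a*)
    with step size 0 < delta <= 1."""
    q_est = float(np.dot(params, features))
    return params + delta * (target - q_est) * features
```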
The seventh step: set the current time slot t = 1 and randomly set the starting state x(t) = [Θ(t), a(t−1)]; initialize the parameters β_p = 0, η_p = 0, ξ_p = 0. The operator sets the value of γ in the range [0, 1) according to how fast the network changes, determines the value of δ in the range (0, 1] according to the order of magnitude of the parameters to be updated, and sets the number of training time slots t_total according to the network scale.
Eighth step: in the cache decision stage of the t time slot, a strategy using an epsilon-greedy method takes a cooperative cache decision a (t) to be executed under a state x (t): selecting a cooperative caching decision according to the step 2 in the sixth step according to the probability 1-epsilon; randomly selecting one satisfying condition by probability epsilonAndto coordinate caching decisions.
The ninth step: the macro base station performs MDS coding on the files to be cached according to the second step and transmits the coded data packets to the small stations for caching.
The tenth step: in the file-transmission stage of the (t+1)-th time slot, users request files, and the base stations serve them by cooperative transmission according to the third step.
The eleventh step: in the information-exchange stage of the (t+1)-th time slot, all small stations in the coverage area of each macro base station report their file-request counts for the (t+1)-th time slot to the macro base station; the macro base station aggregates the total request counts into the vector N(t+1) and computes the total file popularity, recorded as the vector Θ(t+1).
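The popularity computation can be sketched as follows; since the patent's exact definition of θ_c(t) was lost with the figure, the normalized request frequency used here is an assumption.

```python
def popularity(counts):
    """Popularity vector Theta(t) from the aggregated request counts N(t),
    assumed here to be the normalized request frequencies
    theta_c = N_c / sum_j N_j."""
    total = sum(counts)
    return [n / total for n in counts] if total else [0.0] * len(counts)
```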
The twelfth step: the state transitions to x(t+1) = [Θ(t+1), a(t)]; calculate the reward function according to equation (3).
The thirteenth step: estimate the action to be executed in state x(t+1) according to step 2 of the sixth step:
The fourteenth step: update the parameters in the Q-function approximation according to equation (17).
The fifteenth step: if t = t_total, stop training and go to the sixteenth step; otherwise set t = t + 1, enter the next time slot, return to the eighth step, and continue training.
The sixteenth step: from the t-th time slot onward, determine the cooperative caching decision according to step 2 of the sixth step, based on the Q-function approximation obtained from training, to serve the file requests of the next time slot.
As described above, in the Q-function learning process the macro base station and the small stations in its coverage area act as the machine; the file popularity and the small stations' cooperative caching decision serve as the state, the cooperative caching decision as the action, and the number of file requests served directly by the small stations as the reward function. By continuously interacting with the environment with the goal of maximizing the cumulative reward, a Q-function approximation is learned, from which the cooperative caching decision in each state is obtained; the macro base station then MDS-encodes the files to be cached and transmits the coded results to the small stations for cooperative caching. The method uses reinforcement learning to discover patterns directly from the data, without solving an optimization problem based on an assumed data distribution. It can track file popularity as it changes in real time and fully exploit the latent file-request transition pattern when making cooperative caching decisions, making it better suited to practical systems; it significantly increases the number of file requests served directly by the small stations, effectively reduces the system's backhaul link load, improves system performance, and enhances the user experience.
The embodiments are only for illustrating the technical idea of the present invention, and the technical idea of the present invention is not limited thereto, and any modifications made on the basis of the technical scheme according to the technical idea of the present invention fall within the scope of the present invention.
Claims (5)
1. A value-function-approximation-based coding cooperative caching method for small stations in an ultra-dense heterogeneous network, characterized in that: a macro base station and the small stations within its coverage area serve as the machine; the macro base station is responsible for determining the action to be executed by the small stations in each time slot's state and transmitting it to the small stations, while the small stations are responsible for executing the action; the state comprises the file popularity of the current time slot and the cooperative caching decision made in the previous time slot, and the action is the cooperative caching decision made in the current time slot to serve the file requests of the next time slot; a reinforcement learning method with value function approximation expresses the value function as a function of the state and the action and, taking maximization of the average cumulative number of file requests served directly by the small stations as the optimization objective, continuously interacts with the environment to adapt to its dynamic changes and mines the latent file-request transition pattern to obtain an approximate expression of the value function, from which a cooperative caching decision matched to the file-request transition pattern is derived; the macro base station encodes the files according to the cooperative caching decision and transmits the coded results to each small station;
the method comprises the following steps:
step 1, collecting network information, and setting parameters:
collecting the macro base station set M, the small station set P, and the requested-file set C_1 in the network, together with the number p_m of small stations in the coverage area of the m-th macro base station, m ∈ M; obtaining the small-station cache space K, determined by the operator according to network operating conditions and hardware cost; the operator divides a day into T time slots according to the file-request pattern in the ultra-dense heterogeneous network, sets the starting time of each time slot, and divides each time slot into three stages in order of occurrence: a file-transmission stage, an information-exchange stage, and a caching-decision stage;
step 2, formulating a base-station cooperative caching scheme based on MDS coding:
recording the cooperative caching decision vector of the small stations as a(t), where each element a_c(t) ∈ [0, 1], c ∈ C_1, represents the proportion of the c-th file cached in the t-th time slot; the set of files with a_c(t) ≠ 0 is the set of files cached in the t-th time slot, recorded as C′(t); the c-th file contains B information bits, from which the m-th macro base station generates check bits by MDS coding:
in the above formula, d is the number of small stations whose received signal power exceeds a threshold set by the operator according to network operating conditions; all check bits are divided into small-station candidate bits and macro-base-station candidate bits, where the small-station candidate bits comprise p_m·B bits, i.e. each small station has B mutually non-overlapping candidate bits, and in the t-th time slot each small station selects the first a_c(t)·B bits from its own candidate bits for caching; the macro base station randomly selects (1 − d·a_c(t))·B bits from its candidate bits for caching; by the MDS coding property, a file request can recover the whole file by acquiring at least B check bits of that file;
step 3, formulating a base station cooperative transmission scheme:
each file request of a user first obtains d·a_c(t)·B bits from the d small stations covering the user; if d·a_c(t) ≥ 1, the macro base station does not need to transmit data; otherwise, the macro base station selects the small station closest to the user among the d small stations and transmits (1 − d·a_c(t))·B bits to it, which the small station then forwards to the user; the data transmitted by the macro base station is called the backhaul link load;
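The transmission rule above can be sketched as a backhaul-load calculation; the aggregation over a request vector is an illustrative assumption.

```python
def backhaul_load(requests, a, d, B):
    """Backhaul load of the cooperative transmission scheme: for each request
    for file c the d covering small stations deliver d*a_c*B check bits;
    only when d*a_c < 1 does the macro base station transmit the remaining
    (1 - d*a_c)*B bits (any B check bits reconstruct the file under MDS)."""
    return sum(n * max(1.0 - d * a_c, 0.0) * B for n, a_c in zip(requests, a))
```

Note that the load vanishes for every file with d·a_c ≥ 1, which is exactly the event counted by the reward.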
step 4, describing a reinforcement learning task by using a Markov Decision Process (MDP):
establishing the reinforcement learning quadruple, in which X denotes the state space, A the action space, the third element the state transition probability, i.e. the probability of transitioning to state x′ when action a is executed in state x, and the fourth element the reward obtained for the transition;
the specific form of reinforcement learning quadruple is as follows:
action space: since the number of elements in the caching decision vector equals the number C of elements in the set C_1, the action space is a C-dimensional continuous space; each a_c(t) is quantized into L discrete values, with L determined by the operator according to the computing capacity of the macro station; the discretized action space is A = {a_1, a_2, …, a_|A|}, any action vector a_j, j ∈ {1, 2, …, |A|}, must satisfy the feasibility condition, the total number of action vectors satisfying the condition is |A|, and the caching decision of the t-th time slot satisfies a(t) ∈ A;
state space: the total file-request counts of the p_m small stations in the coverage area of the m-th macro station in the t-th time slot are recorded as a vector N(t) = [N_1(t), N_2(t), …, N_C(t)], and the total file popularity as a vector Θ(t) = [θ_1(t), θ_2(t), …, θ_C(t)]; the state of the t-th time slot is then recorded as x(t) = [Θ(t), a(t−1)]; let H = {Θ_1, Θ_2, …, Θ_|H|} be the set of all file popularity vectors; after quantization, Θ(t) is an element of the set H; the state space is denoted X = {x_1, x_2, …, x_|H||A|}, with state x(t) ∈ X;
state transition probability: after action a(t) is executed in the t-th time slot, it acts on the current state x(t), and the environment moves from the current state to the next state x(t+1) with an underlying transition probability; this transition probability is unknown;
reward: as the environment transitions to x(t+1), it gives the machine a reward, defined here as the number of file requests served directly by the small stations:
in the above formula, u[·] denotes the step function; the other two terms are, respectively, the number of files to be transmitted for the cache update in the caching-decision stage of the t-th time slot and the number of files transmitted by the macro base station in the information-exchange stage of the (t+1)-th time slot;
step 5, defining a reinforcement learning target:
defining a deterministic policy function π(x), x ∈ X, according to which the action to be executed in state x(t) is a(t) = π(x(t)); the state value function is:
in the above formula, the value function represents the cumulative reward obtained by following the policy π starting from state x(t), and 0 ≤ γ < 1 measures the degree to which the action π(x(t)) executed in time slot t influences future states;
after the state value function is obtained, a state-action value function, namely a Q function, is obtained:
in the above formula, the Q function represents the cumulative reward obtained by executing action a′(t) in state x(t) and following the policy π thereafter;
replacing x(t), x(t+1), a′(t) with x, x′, a respectively, the goal is to find the policy maximizing the expected cumulative reward, denoted π*(x), with the corresponding optimal value function; under the optimal policy:
namely:
step 6, formulating a Q-learning process based on value function approximation:
(601) representing the Q function by value function approximation, i.e. expressing it as a function of the state and the action together with the instantaneous reward; with action a′(t) executed in state x(t), the Q function is approximated as:
in the above formula, ω_1 and ω_2 are the weights of the two parts, set so that ω_1 ≫ ω_2; β, η_i, and ξ_i are unknown parameters to be obtained through learning;
(602) solving the cooperative caching decision:
(603) establishing a Q-learning goal:
calculating the true value of the cumulative reward obtained by executing action a(t) in state x(t) according to the above formula:
in the above formula, the action used for state x(t+1) is the estimated action in that state;
(604) defining a loss function:
in the above formula, η = [η_1, η_2, …, η_C], ξ = [ξ_1, ξ_2, …, ξ_C], and E_π denotes the expectation under the policy π;
updating parameters beta, eta and xi according to the loss function;
step 7, setting the current time slot t = 1 and randomly setting the starting state x(t) = [Θ(t), a(t−1)]; initializing the parameters β_p = 0, η_p = 0, ξ_p = 0; the operator sets the value of γ in the range [0, 1) according to how fast the network changes, determines the value of the update step size δ in the range (0, 1] according to the order of magnitude of the parameters to be updated, and sets the number of training time slots t_total according to the network scale;
step 8, in the caching-decision stage of the t-th time slot, using an ε-greedy policy to take the cooperative caching decision a(t) to be executed in state x(t);
step 9, the macro base station performing MDS coding on the files to be cached according to step 2 and transmitting the coded data packets to the small stations for caching;
step 10, in the file-transmission stage of the (t+1)-th time slot, users requesting files and the base stations serving them by cooperative transmission according to step 3;
step 11, in the information-exchange stage of the (t+1)-th time slot, all small stations in the coverage area of each macro base station reporting their file-request counts for the (t+1)-th time slot to the macro base station; the macro base station aggregating the total request counts into the vector N(t+1) and computing the total file popularity, recorded as the vector Θ(t+1);
Step 13, estimating the action to be executed in the state x (t + 1):
step 14, updating the parameters in the Q-function approximation according to step (604);
step 15, if t = t_total, stopping training and entering step 16; otherwise setting t = t + 1, entering the next time slot, returning to step 8, and continuing training;
and step 16, from the t-th time slot onward, determining the cooperative caching decision based on the Q-function approximation obtained by training and serving the file requests of the next time slot.
2. The method for cooperative caching of codes of small stations in ultra-dense heterogeneous network based on value function approximation as claimed in claim 1, wherein in step 3, the determination method of d is as follows:
let the probability that a user is served by d′ small stations be p_d′; first, based on the operator's base-station deployment, p_d′ is calculated from historical user-position data: over a time period τ, the positions of U users are recorded at intervals of τ′, with τ and τ′ set by the operator according to network operating conditions, and for each user u ∈ {1, 2, …, U} the number of positions at which exactly d′ base stations have received signal power greater than the threshold is counted; from the historical positions of the U users:
in the above formula, the counted quantity denotes the number of positions in user u's history at which i base stations provide service to user u;
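The empirical estimate of p_{d′} can be sketched as a frequency count over the recorded (user, position) samples; the list-of-lists input format is an illustrative assumption.

```python
from collections import Counter

def coverage_distribution(history):
    """Empirical estimate of p_{d'}: history[u] lists, for each recorded
    position of user u, how many base stations exceeded the received-power
    threshold there; p_{d'} is the fraction of all (user, position) samples
    with exactly d' covering base stations."""
    samples = [c for user in history for c in user]
    total = len(samples)
    return {d_cov: cnt / total for d_cov, cnt in Counter(samples).items()}
```

The operator can then pick d from this distribution, e.g. as the most probable coverage count.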
3. The value-function-approximation-based coding cooperative caching method for small stations in an ultra-dense heterogeneous network as claimed in claim 1, wherein in step (602), because ω_1 ≫ ω_2, the term weighted by ω_2 is omitted, yielding the caching decision:
the solution of the above equation is as follows:
① according to l_max·d/L ≥ 1, determine the maximum element of the caching decision vector, l_max/L; since, within the range satisfying the inequality, the smaller l_max is the better, l_max = ⌈L/d⌉, where ⌈·⌉ represents rounding up;
② according to the base-station cache space, calculate the number z_i of each element i/L, i = 1, 2, …, l_max, in the caching decision vector:
③ determine the position of each element: arrange the coefficients η_iθ_i(t), i = 1, 2, …, C, in descending order; the j-th element after sorting corresponds to the h_j-th file before sorting; first, preliminarily determine the positions of the elements:
then adjust: when the condition 1 − l_max·d/L < 0 holds, loop over the following steps from the last sorted position down to j = 1 to adjust the elements of the action vector: find the smallest j′ satisfying the stated conditions, decrease the element at position j by 1/L, and increase the element at position j′ by 1/L;
4. The value-function-approximation-based coding cooperative caching method for small stations in an ultra-dense heterogeneous network as claimed in claim 3, wherein in step 8, with probability 1 − ε the cooperative caching decision is selected according to step (602), and with probability ε a caching decision satisfying the feasibility conditions is selected at random.
5. The value-function-approximation-based coding cooperative caching method for small stations in an ultra-dense heterogeneous network as claimed in claim 1, wherein in step (604), a stochastic gradient descent method is used to update the parameters β, η, ξ:
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811634918.6A CN109617991B (en) | 2018-12-29 | 2018-12-29 | Value function approximation-based cooperative caching method for codes of small stations of ultra-dense heterogeneous network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109617991A CN109617991A (en) | 2019-04-12 |
CN109617991B true CN109617991B (en) | 2021-03-30 |
Family
ID=66015366
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811634918.6A Active CN109617991B (en) | 2018-12-29 | 2018-12-29 | Value function approximation-based cooperative caching method for codes of small stations of ultra-dense heterogeneous network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109617991B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110138836B (en) * | 2019-04-15 | 2020-04-03 | 北京邮电大学 | Online cooperative caching method based on optimized energy efficiency |
CN110381540B (en) * | 2019-07-22 | 2021-05-28 | 天津大学 | Dynamic cache updating method for responding popularity of time-varying file in real time based on DNN |
CN111311996A (en) * | 2020-03-27 | 2020-06-19 | 湖南有色金属职业技术学院 | Online education informationization teaching system based on big data |
CN112218337B (en) * | 2020-09-04 | 2023-02-28 | 暨南大学 | Cache strategy decision method in mobile edge calculation |
CN112672402B (en) * | 2020-12-10 | 2022-05-03 | 重庆邮电大学 | Access selection method based on network recommendation in ultra-dense heterogeneous wireless network |
CN112911717B (en) * | 2021-02-07 | 2023-04-25 | 中国科学院计算技术研究所 | Transmission method for MDS (data packet System) encoded data packet of forwarding network |
CN113132466B (en) * | 2021-03-18 | 2022-03-15 | 中山大学 | Multi-access communication method, device, equipment and medium based on code cache |
CN115118728B (en) * | 2022-06-21 | 2024-01-19 | 福州大学 | Edge load balancing task scheduling method based on ant colony algorithm |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103929781A (en) * | 2014-04-09 | 2014-07-16 | 东南大学 | Cross-layer interference coordination optimization method in super dense heterogeneous network |
CN104782172A (en) * | 2013-09-18 | 2015-07-15 | 华为技术有限公司 | Small station communication method, device and system |
CN104955077A (en) * | 2015-05-15 | 2015-09-30 | 北京理工大学 | Heterogeneous network cell clustering method and device based on user experience speed |
CN106358308A (en) * | 2015-07-14 | 2017-01-25 | 北京化工大学 | Resource allocation method for reinforcement learning in ultra-dense network |
CN108882269A (en) * | 2018-05-21 | 2018-11-23 | 东南大学 | The super-intensive network small station method of switching of binding cache technology |
CN110445825A (en) * | 2018-05-04 | 2019-11-12 | 东南大学 | Super-intensive network small station coding cooperative caching method based on intensified learning |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5857405B2 (en) * | 2010-12-28 | 2016-02-10 | ソニー株式会社 | Information processing apparatus, playback control method, program, and content playback system |
Non-Patent Citations (1)
Title |
---|
Packet-based resource allocation in OFDMA femtocell two-tier networks; Zhang Haibo; Journal of Electronics & Information Technology; 2016-02-29; Vol. 38, No. 2; 262-268 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||