CN111800828B

CN111800828B - A mobile edge computing resource allocation method for ultra-dense networks

Info

Publication number: CN111800828B
Application number: CN202010597779.5A
Authority: CN
Inventors: 李立欣; 程倩倩; 张敬敏; 王大伟; 李旭; 梁微; 林文晟; 李煊
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2020-06-28
Filing date: 2020-06-28
Publication date: 2023-07-18
Anticipated expiration: 2040-06-28
Also published as: CN111800828A

Abstract

The present invention discloses a mobile edge computing resource allocation method of an ultra-dense network. Based on the ultra-dense network, the NOMA-MEC communication system in the ultra-dense network includes M={1,2,...,M} small base stations, Among them, each small base station is equipped with a MEC server to perform the computing tasks of user offloading; assuming that the set of users served by each small base station is N={1,2,...,N}, N users are divided into Y={1, 2,...,Y} groups, each group has K={1,2,...,K} users. It solves the problem that it is difficult for the prior art to deal with the mutual interference between users, thereby affecting the computing performance of the users.

Description

Mobile edge computing resource allocation method for ultra-dense network

[ field of technology ]

The invention belongs to the technical field of wireless communication, and particularly relates to a mobile edge computing resource allocation method of an ultra-dense network.

[ background Art ]

With the rapid development of the fifth generation (5G) mobile communication technology, deployment of Ultra Dense Networks (UDNs) has become a major architecture for future development. The UDN can effectively improve system capacity and data transmission rate to ensure user quality of service. However, due to the limited computing power of the user, solving the computationally intensive task in UDNs is a significant challenge. As an emerging technology, mobile Edge Computing (MEC) has been proposed to alleviate the computational pressure of users in UDNs. In particular, MECs offload computationally intensive tasks to the network edge to reduce the user's energy consumption and task delay.

In MEC systems, how to increase the utilization of spectrum resources between users is a significant challenge, as it directly affects energy consumption and task delay. As an emerging multiple access method, non-orthogonal multiple access (NOMA) can effectively improve the spectral efficiency of a system by allocating the same resources to multiple users. Thus, in certain operations NOMA has been applied to MEC systems to reduce energy consumption and task delays.

Average field gaming (MFG) is a tool suitable for scenes with large scale gaming individuals that can model relationships between individuals and groups in a UDN. Specifically, in UDN, the MFG averages the influence between each member, simplifying a complex model.

The authors in the literature 1"Learning deep mean field games for modeling large population behavior[in International Conference on Learning Representations, vancouver, canada, apr.2018 @, demonstrate an equilibrium solution for average field gaming with a Markov Decision Process (MDP) to predict the evolution of demographics over time.

Document 2"Collaborative Artificial Intelligence (AI) for User-Cell Association in Ultra-Dense Cellular Systems [ IEEE International Conference on Communications Workshops (ICCCWorkshops), kansas, MO, may2018]" proposes a neural Q learning algorithm to solve the problem of User association in ultra-dense network systems.

Unlike the prior art, the present invention models NOMA-MEC systems in UDN scenarios, where each Small Base Station (SBS) is equipped with a MEC server. When a user cannot handle a large number of computing tasks, some tasks will be offloaded onto the MEC server. Firstly, a User Clustering Matching Algorithm (UCMA) based on channel gain difference is provided, and the user is clustered, so that the data rate of the user is improved. Then, using NOMA-MEC system as model, establishing MFG theory frame, using deep deterministic strategy gradient (DDPG) algorithm in reinforcement learning to solve the equilibrium solution algorithm of MFG, so as to reduce energy consumption and task delay of user.

[ invention ]

The invention aims to provide a mobile edge computing resource allocation method of an ultra-dense network, which aims to solve the problem that the prior art is difficult to process mutual interference among users, so that the computing performance of the users is affected.

The technical scheme adopted by the invention is that the method for allocating the mobile edge computing resources of the ultra-dense network is based on the ultra-dense network, and a NOMA-MEC communication system in the ultra-dense network comprises M= {1,2, …, M } small base stations, wherein each small base station is provided with an MEC server to execute a computing task unloaded by a user; assuming that the set of users served by each small cell is n= {1,2, …, N }, the N users are divided into y= {1,2, …, Y } groups, and k= {1,2, …, K } users are in each group;

the resource allocation method is implemented according to the following steps:

step one, constructing an uplink NOMA-MEC communication system, wherein each SBS is provided with an MEC server to serve a plurality of users;

step two, clustering is carried out on all users in the NOMA-MEC communication system according to the difference of channel gains; the users in the clusters adopt a NOMA transmission mode, and the clusters adopt a TDMA transmission mode;

step three, calculating the calculation cost of the user, namely the time delay and the energy consumption when the user processes the task; wherein the computing costs include a local computing cost of the user and an offload computing cost;

modeling a NOMA-MEC communication system as an MFG framework; the SINR and the channel gain of the user are expressed as a state space, and the transmitting power, the unloading decision factor and the resource allocation factor of the user are expressed as an action space; constructing a reward function of the user according to the calculation cost of the user;

and fifthly, acquiring an equilibrium solution of the average field game by using a reinforcement learning method based on DDPG, namely an optimal resource allocation scheme in the mobile edge computing system.

Further, the specific method of the second step is as follows:

in the NOMA-MEC communication system model established in the first step, all users of each SBS service are ordered according to the magnitude of the channel gain, and then the users with the first M channel gains are sequentially selected as the first users in M NOMA clusters;

selecting a user with the maximum sum of channel gain differences of the NOMA cluster from other users according to a greedy matching method;

when the number of users cannot be uniformly allocated to each cluster, redundant users are randomly allocated to different clusters, and the channel gain of each user in the clusters is different.

Further, the specific mode of the third step is as follows:

3.1 Cost of local computation for the user:

let x be _mk Unloading variables representing the kth user in the mth group, for a local computing model, i.e., the user can accomplish the computing task locally, without unloading the computing task to the MEC server, assume f _m l _k And > 0 represents the local computing capacity of the kth user in the mth group, when the user performs the task locally, the time is:

when computing the energy consumption of local computing, a commonly used model of computing energy consumption is adopted, namely epsilon=κf ² . Where κ is an energy coefficient depending on the chip structure, the local energy consumption of the kth user in the mth group can be expressed as:

according to formulas (5) and (6), the local computation cost of the kth user in the mth group can be expressed as:

wherein,,and->Weight coefficients representing delay and energy consumption, respectively, and

3.2 Offloading computational cost for the user:

in the process of unloading to the MEC server for calculation, the method comprises two parts of transmission and calculation at the MEC server, wherein the transmission time and the execution time are respectively as follows:

wherein f _s Is the computational power of the MEC server;

the total time of the unloading process is:

the energy consumption of the unloading process also has two parts, namely, the energy consumption in the transmission process and the energy consumption for executing the calculation task at the MEC server are respectively:

according to equations (11) and (12), the total energy consumption of the unloading process is expressed as:

thus, the offload computation cost function for the kth user in the mth group is expressed as:

3.3 Total computation cost for the user):

according to 3.1 and 3.2, obtaining a user local computation cost and a user offload computation cost, the overall computation cost function of the user to complete the computation task can be expressed as:

further, the specific steps of the fourth step are as follows:

in a NOMA-MEC system of an ultra-dense network, the state and channel gain of the kth user in the mth group are expressed as a state space, and the state space is expressed as:

s _mk (t)＝{τ _mk (t),h _mk (t)} (16)，

each user is based on the current state s _mk (t) selecting action a from action space A _mk (t) the action of the kth user in the mth group consists of its power, unload variables and weight coefficients, action a _mk (t) ∈A is expressed as:

a _mk (t)＝{p _mk (t),x _mk ,λ _mk } (17)，

in the method, in the process of the invention,weight coefficients representing delay and energy consumption;

according to the analysis of the user calculation cost in the third step, the cost function of the user is expressed as:

therefore, the reward function for the kth user in the mth group is expressed as:

in average field gaming, the Hamilton-Jacobi-Bellman (HJB) equation and Fokker-Planck-Kolmogorov (FPK) equation describe the overall system model;

when the kth user in the mth group is in state s _mk (t) Down selection action a _mk At (t), its FPK equation can be expressed as:

π _mk (t+1)＝π _mk (t)P _mk (p _mk ,x _mk ,λ _mk ) (20)，

wherein pi _mk (t+1) is the state of the kth user in the mth group at the time (t+1), P _mk (p _mk ,x _mk ,λ _mk ) The probability that the kth user in the mth group transits from the t moment state to the (t+1) moment state is mainly determined by the action of the user;

according to the definition of the reward function, the state s at time t _mk The value function (i.e., the HJB equation) of (t) is expressed as:

and solving a Nash equilibrium solution of the MFG based on the FPK and HJB equations.

Further, the specific mode of the fifth step is as follows:

the DDPG algorithm is adopted to solve the equilibrium solution of the MFG, and the objective function of the DDPG algorithm is defined as:

wherein θ ^μ Is a parameter of the policy network that generates deterministic actions, and θ ^μ Updating through a strategy gradient;

there are mainly two in the Actor sectionNetworks, i.e., an online policy network and a target policy network. The deterministic strategy μ is used to directly derive each moment action a _t ＝μ(s _t |θ ^μ ) And (5) determining a value. Like the Actor section, the Critic section also has two networks, an online Q network and a target Q network. The Q function (i.e., action value function) defined by the bellman equation is a reward expectation of selecting an action under a deterministic strategy, using a Q network to fit the Q function, namely:

Qμ(s _t ,a _t )＝E[R+γQ(s _t+1 ,μ(s _t+1 ))] (23)，

wherein Q is ^μ (s _t ,a _t ) Represented in state s _t The deterministic strategy mu is adopted to select the action a _t The expected values obtained, in order to measure the performance of the policy, define the performance targets as follows:

wherein β represents the behavior policy, ρ ^β Is a probability density function of the state space. In the Critic section, the mean square error is used as a loss function, namely:

thus, the loss function L with respect to θ can be obtained from a standard back propagation algorithm ^Q The gradient of (a), namely:

and updating the gradient in real time to enable the objective function to be converged, and finally obtaining an optimal strategy, namely obtaining an optimal resource allocation scheme in the mobile edge computing system.

Compared with the prior art, the invention has the beneficial effects that:

1. the NOMA-MEC system is constructed as an MFG theoretical framework, and the equilibrium solution of the MFG is solved through reinforcement learning, so that the calculation cost of a user, including energy consumption and time delay, is minimized.

2. The invention constructs an uplink NOMA-MEC system in an ultra-dense network, and each SBS is provided with one MEC server to serve a plurality of users. In the system, all users of each SBS service are divided into different clusters according to a user clustering algorithm to increase the data rate of the users.

3. The NOMA-MEC system under ultra dense networks is modeled as a MFG framework. And then solving the equilibrium solution of the MFG by adopting a DDPG method, and reducing the energy consumption and task delay of the user by learning a dynamic resource allocation strategy.

4. According to the invention, the method can effectively learn the optimal resource allocation strategy through experiments, and compared with other methods, the method can more effectively reduce the calculation time delay and the energy consumption of the user.

[ description of the drawings ]

FIG. 1 is a block diagram of a system for mobile edge computation for ultra dense networks in accordance with the present invention;

FIG. 2 is a schematic diagram of the relationship between the average field gaming and reinforcement learning algorithms of the present invention;

FIG. 3 is a schematic diagram of the present invention employing a reinforcement learning algorithm to optimize resource allocation in a NOMA-MEC system;

FIG. 4 is a graph showing the relationship between the energy consumption and the maximum transmission power under different algorithm comparisons according to the present invention;

fig. 5 is a schematic diagram showing the relationship between the calculated time delay and the maximum transmitting power under the comparison of different algorithms according to the present invention.

[ detailed description ] of the invention

The invention will be described in detail below with reference to the drawings and the detailed description.

Unlike the existing literature, the invention researches the resource optimization in the uplink NOMA-MEC system in the ultra-dense network from the aspects of relieving network resources and overcoming the limitations of the mobile equipment, and the invention combines a deep reinforcement learning algorithm to minimize the system delay and energy consumption by optimizing the power and unloading strategy.

Step one, constructing a system model:

an uplink NOMA-MEC system is constructed, with one MEC server per SBS to serve multiple users.

The concrete construction mode is as follows:

as shown in fig. 1, the present invention contemplates a NOMA-MEC communication system in an ultra-dense network of m= {1,2, …, M } small base stations, each equipped with a MEC server to perform user offloaded computational tasks. Assuming that the set of users served by each small cell is n= {1,2, …, N }, in order to reduce interference between users, the users need to be grouped. In the invention, N users are divided into Y= {1,2, …, Y } groups, and K= {1,2, …, K } users are in each group.

In the information transmission, the bandwidth B of the whole system is divided into Y sub-channels, and the bandwidth of each sub-channel is expressed as B _sc = B/Y, while users in each group transmit information simultaneously in their sub-channels.

And step two, clustering all users in the system through a user clustering algorithm to improve the data transmission rate of the users. The users in the clusters adopt a NOMA transmission mode, and the clusters adopt a time division multiple access (Time division multiple access, TDMA) transmission mode.

The specific mode of the second step is as follows:

in the NOMA-MEC communication system model established in step one, all users of each SBS service are ordered according to the magnitudes of their channel gains, and then the user with the first M channel gains is sequentially selected as the first user in the M NOMA clusters. Next, a user having the NOMA cluster with the largest sum of channel gain differences is selected from the remaining users according to a greedy matching method. In addition, when the number of users cannot be uniformly allocated to each cluster, redundant users may be randomly allocated to different clusters, and the channel gain of each user in the clusters is different.

And step three, calculating the calculation cost of the user, namely the time delay and the energy consumption when the user processes the task. Including the local and offload computation costs of the user.

The specific mode of the third step is as follows:

and finishing clustering by the user according to the clustering algorithm in the step two. Because the NOMA technology is adopted by the users in the clusters when information is transmitted, the TDMA technology is adopted among the clusters, so that any user can be interfered by users in the same cluster and also can be interfered by users of other SBS services in the same time slot when information is transmitted.

For users within the NOMA cluster, users with greater channel gain will experience interference from users with less channel gain. The user with the smallest channel gain is not interfered by other users. Thus, the interference experienced by users within a NOMA cluster can be expressed as:

wherein p is _mf Representing the transmission power of the f-th user in the m-th NOMA cluster, h _mf Representing the channel gain of the f-th user in the m groups.

Secondly, in an ultra-dense network, users served by different small base stations will interfere when transmitting tasks in the same time slot, which can be expressed as:

wherein p is _jk Representing the transmission power of the kth user in the j group, h _jk Representing the channel gain of the kth user in the j group.

The SINR of the kth user in the mth group is expressed as:

wherein,,is the power of additive white Gaussian noise, so the firstThe data rate of the kth user in the m groups is expressed as:

R _mk ＝W _sc log(1+τ _mk ) (4)，

wherein W is _sc ＝W _total /M，W _total Is the system bandwidth.

The computing tasks of the kth user in the mth group may be defined asWherein d _mk Representing input data required by the kth user in the mth group to complete the computing task, c _mk Representing the kth user calculation d in the mth group _mk The number of CPU cycles required, ">Representing the last time the kth user in the mth group completed the computing task.

Let x be _mk Unloading variables representing the kth user in the mth group, for the local computational model, assume thatRepresenting the local computing capacity of the kth user in the mth group, when the user performs the task locally, its time is:

when computing the energy consumption of local computing, a commonly used model of computing energy consumption is adopted, namely epsilon=κf ² . Where κ is an energy coefficient depending on the chip structure, so the local energy consumption of the kth user in the mth group can be expressed as:

according to formulas (5) and (6), the calculation cost of the kth user in the mth group at the time of local calculation can be expressed as:

wherein,,and->Weight coefficients representing delay and energy consumption, respectively, andwhen->Indicating that the user is sensitive to delay and paying more attention to calculation time; otherwise, the user is indicated to have low energy, and the energy consumption of the calculation task is more focused.

wherein f _s Is the computational power of the MEC server. The total time for this unloading process is:

similarly, the energy consumption in the unloading process also has two parts, namely, the energy consumption in the transmission process and the energy consumption for executing the calculation task at the MEC server are respectively:

according to equations (11) and (12), the total energy consumption of the offloading process can be expressed as:

thus, the cost function of the kth user in the mth group during offloading can be expressed as:

further, the cost function of the kth user in the mth group to complete the computing task can be expressed as

Step four, establishing a cost function:

modeling NOMA-MEC as an MFG framework, wherein SINR and channel gain of a user are represented as a state space, and transmit power, offloading decision factors, and resource allocation factors of the user are represented as an action space; and constructing a reward function of the user according to the calculation cost of the user.

The specific steps of the fourth step are as follows:

interference can become very severe when many users compute tasks simultaneously. This severely reduces the data transfer rate of the user, thereby increasing the time delay and power consumption in offloading the computing tasks. Since each user is an independent individual, in an ultra-dense scenario, it only considers its own interests. Therefore, the present invention expresses this model as the MFG theoretical framework.

The status of each user comes only from its own local observations. In a NOMA-MEC system of an ultra-dense network, the state and channel gain of the kth user in the mth group are expressed as a state space, and the state space is expressed as:

s _mk (t)＝{τ _mk (t),h _mk (t)} (16)，

a _mk (t)＝{p _mk (t),x _mk ,λ _mk } (17)，

in the method, in the process of the invention,weight coefficients representing delay and energy consumption.

It is an object of the invention to minimize the computational cost of a user on the basis of a maximum delay. From the analysis of the user's computational cost in step three, the user's cost function can be expressed as:

therefore, the reward function for the kth user in the mth group can be expressed as:

in average field gaming, the Hamilton-Jacobi-Bellman (HJB) equation and the Fokker-Planck-Kolmogorov (FPK) equation describe the overall system model. When in group m

k users in state s _mk (t) Down selection action a _mk At (t), its FPK equation can be expressed as:

π _mk (t+1)＝π _mk (t)P _mk (p _mk ,x _mk ,λ _mk ) (20)，

wherein pi _mk (t+1) is the state of the kth user in the mth group at the time (t+1), P _mk (p _mk ,x _mk ,λ _mk ) Is the probability that the kth user in the mth group transitions from the t-time state to the (t+1) time state, which is primarily determined by the user's actions.

the Nash equilibrium solution for MFG can be solved based on FPK and HJB equations.

And fifthly, acquiring an equilibrium solution of the average field game by using a reinforcement learning method based on DDPG.

The specific mode of the fifth step is as follows:

the DDPG algorithm is adopted to solve the equilibrium solution of the MFG, which can solve the problem of continuous motion space, and the relationship between the MFG and reinforcement learning is shown in figure 2. The DDPG algorithm can be used for resource optimization problems in many communication scenarios.

A schematic diagram of optimizing resource allocation in a NOMA-MEC system using DDPG algorithm is shown in fig. 3. The DDPG algorithm is an Actor-Critic framework, and therefore is mainly divided into an Actor and Critic to illustrate the process of the DDPG algorithm. The Actor part outputs a specific action a by minimizing the action Q (s, a) through a deterministic strategy mu on the premise of inputting a state s; the Critic part outputs Q (s, a) updated by the bellman equation on the premise of inputting the state s and the specific action a. Thus, the objective function of the DDPG algorithm can be defined as:

wherein θ ^μ Is a parameter of the policy network that generates deterministic actions, and θ ^μ By policyThe gradient is updated.

In the Actor section there are mainly two networks, an online policy network and a target policy network. The deterministic strategy μ is used to directly derive each moment action a _t ＝μ(s _t |θ ^μ ) And (5) determining a value. Like the Actor section, the Critic section also has two networks, an online Q network and a target Q network. The Q function (i.e., action value function) defined by the bellman equation is a reward expectation of selecting an action under a deterministic strategy, using a Q network to fit the Q function, namely:

Q ^μ (s _t ,a _t )＝E[R+γQ(s _t+1 ,μ(s _t+1 ))] (23)，

wherein β represents the behavior policy, ρ ^β Is a probability density function of the state space. The purpose of training is to target the performance of the Q network J _β Maximization minimizes the loss of Q network. In the Critic section, the mean square error is used as a loss function, namely:

L(θ ^Q )＝E[R+γQ′(s _t+1 ,μ′(s _t+1 |θ ^μ′ )|θ ^Q′ )-Q(s _t ,a _t |θ ^Q )] (25)，

examples:

the illustrations provided in the examples below and the setting of specific parameter values in the models are mainly for illustrating the basic idea of the invention and for performing simulation verification on the invention, and in a specific application environment, the actual scene and the requirements can be appropriately adjusted.

The invention researches a NOMA-MEC system in an ultra-dense network, wherein 60 small base stations are randomly distributed within a range of 10km by 10km, the coverage range of each small base station is 20m, and 64 users are randomly distributed near the small base stations.

To implement the DDPG algorithm, the Actor network and Critic network use a fully connected neural network with three hidden layers, each hidden layer containing 300 neurons. For an Actor network, the last output layer uses Sigmoid activation functions to ensure that the probability of the last action output is between 0-1. For Critic networks, a ReLU activation function is used for each layer. Learning rates of the Actor network and the Critic network are set to 0.0001 and 0.001, respectively.

Fig. 4 and 5 show the effect of maximum transmit power for different algorithms and different multiple access modes. In fig. 4, it can be observed that the energy consumption of the system gradually increases with an increase in the maximum transmission power. The NOMA scheme may achieve lower power consumption when the maximum transmit power is fixed. This is because users in the NOMA cluster can simultaneously use the full spectrum resources to send information, which can reduce the power consumption of the system. As can be seen from fig. 5, the calculation delay decreases with increasing maximum transmit power. This is because, when the maximum transmission power is large, both the calculation speed and the data transmission rate of the user become large, resulting in a reduction in calculation delay.

Claims

1. A method for distributing computing resources at mobile edge of ultra-dense network is characterized by that,

the resource allocation method is based on an ultra-dense network, and a NOMA-MEC communication system in the ultra-dense network comprisesEach small base station is provided with an MEC server to execute the calculation task of user unloading; assume that each small base stationThe user set of the service is->N users are divided intoGroups, each group has +.>A user;

the resource allocation method is implemented according to the following steps:

step one, constructing an uplink NOMA-MEC communication system, wherein each small base station SBS is provided with an MEC server to serve a plurality of users;

2. The method for allocating mobile edge computing resources of an ultra-dense network according to claim 1, wherein the specific method in the second step is as follows:

3. The method for allocating mobile edge computing resources of an ultra dense network according to claim 1 or 2, wherein the specific manner of the third step is:

3.1 Cost of local computation for the user:

let x be _mk Unloading variables representing the kth user in the mth group, for a local computing model, i.e., the user can accomplish the computing task locally, without unloading the computing task to the MEC server, assumingRepresenting the local computing capacity of the kth user in the mth group, c _mk Representing the number of CPU cycles required by the kth user in the mth group to perform local computation, when the user performs the task locally, the time is:

when computing the energy consumption of local computing, a commonly used model of computing energy consumption is adopted, namely epsilon=κf ² The method comprises the steps of carrying out a first treatment on the surface of the Where ε represents the local computation energy consumption and κ is the energy coefficient depending on the chip structure, the local energy consumption of the kth user in the mth group can be expressed as:

wherein,,and->Weight coefficients representing delay and energy consumption, respectively, and +.>

3.2 Offloading computational cost for the user:

wherein f _s Is the computational power of the MEC server; r is R _mk Representing the data transmission rate of the kth user in the mth group;

the total time of the unloading process is:

wherein p is _mk Representing user power;

3.3 Total computation cost for the user):

4. the method for allocating mobile edge computing resources of an ultra dense network according to claim 1 or 2, wherein the step four comprises the following specific steps:

s _mk (t)＝{τ _mk (t),h _mk (t)} (16)，

wherein τ is _mk (t) represents the signal-to-interference-and-noise ratio of the user, h _mk (t) represents the channel gain of the user;

each user is based on the current state s _mk (t) from the action spaceIn selection action a _mk (t) the action of the kth user in the mth group consists of its power, unload variables and weight coefficients, action +.>Expressed as:

a _mk (t)＝{p _mk (t),x _mk ,λ _mk } (17)，

in the method, in the process of the invention,weight coefficients representing delay and energy consumption; p is p _mk (t) represents the data transmission power of the user, x _mk An unload variable representing a user;

wherein,,representing the local computational cost of the kth user in the mth group,/>representing the offload computation cost of the kth user in the mth group;

according to the definition of the reward function, the state s at time t _mk The value function of (t) is expressed as:

wherein V is _t ^μ (s _mk ) A value function representing a selection strategy μ at time t, R (p _mk ,x _mk ,λ _mk |s _mk ) Representing a reward function; and solving a Nash equilibrium solution of the MFG based on the FPK and HJB equations.

5. The method for allocating mobile edge computing resources of an ultra dense network according to claim 1 or 2, wherein the specific manner of the fifth step is:

wherein θ ^μ Is a parameter of the policy network that generates deterministic actions, and θ ^μ Updating by means of strategy gradients, E representing the expected value of the function, gamma representing the weighted value of the reward function, R _i Representing a prize function value;

two networks, namely an online strategy network and a target strategy network, are mainly arranged in the Actor part; the deterministic strategy μ is used to directly derive each moment action a _t ＝μ(s _t |θ ^μ ) A determined value; the Critic part is also provided with two networks, namely an online Q network and a target Q network, which are the same as the Actor part; the Q function defined by the bellman equation is the rewards expectation of the selection of actions under deterministic policies, the Q function is fitted using a Q network, namely:

wherein s represents the state of the user,representing user state set +.>Obeying the probability density function ρ ^β E represents the expected value of the function, beta represents the behavior strategy, ρ ^β Is a probability density function of the state space; in the Critic section, mean square error is usedThe difference is as a loss function, namely:

wherein,,represents the expected value of the function, R represents the value of the reward function, gamma represents the weighted value of the reward function, mu ' represents a deterministic strategy, Q ' represents the expected value obtained with a deterministic strategy mu ', theta ^Q Q network parameters, θ, representing the generation of desired values by strategy μ ^μ′ The representation is a parameter of the policy network μ' that generates deterministic actions, θ ^Q′ A Q network parameter representing the generation of the desired value by the strategy μ';