CN114885422A - Dynamic edge computing unloading method based on hybrid access mode in ultra-dense network - Google Patents

Dynamic edge computing unloading method based on hybrid access mode in ultra-dense network

Info

Publication number
CN114885422A
CN114885422A (application CN202210299457.1A)
Authority
CN
China
Prior art keywords
network
task
base station
user
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210299457.1A
Other languages
Chinese (zh)
Inventor
鲜永菊
刘闯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202210299457.1A priority Critical patent/CN114885422A/en
Publication of CN114885422A publication Critical patent/CN114885422A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 Local resource management
    • H04W72/04 Wireless resource allocation
    • H04W72/044 Wireless resource allocation based on the type of the allocated resource
    • H04W72/0473 Wireless resource allocation based on the type of the allocated resource, the resource being transmission power
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 Local resource management
    • H04W72/04 Wireless resource allocation
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 Local resource management
    • H04W72/50 Allocation or scheduling criteria for wireless resources
    • H04W72/51 Allocation or scheduling criteria for wireless resources based on terminal or device properties
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W72/00 Local resource management
    • H04W72/50 Allocation or scheduling criteria for wireless resources
    • H04W72/52 Allocation or scheduling criteria for wireless resources based on load
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Abstract

The invention belongs to the technical field of mobile communication, and particularly relates to a dynamic edge computing offloading method based on a hybrid access mode in an ultra-dense network, which comprises the following steps: constructing a multi-user ultra-dense network; user equipment generates a computation task and sends a task request to the macro base station; the macro base station constructs a task model and acquires the state information of the current network; different task transmission modes are executed according to the state information of the current network; the state information of the current network is input into the trained neural network to obtain an offloading decision and a resource allocation scheme; the macro base station sends the offloading decision to each user and the resource allocation scheme to the micro base stations; each user offloads its task according to the decision, and the micro base stations allocate resources according to the resource allocation scheme. The method solves the optimization problem with a twin delayed deep deterministic policy gradient (TD3) algorithm with target networks, which improves the training efficiency of the model.

Description

Dynamic edge computing unloading method based on hybrid access mode in ultra-dense network
Technical Field
The invention belongs to the technical field of mobile communication, and particularly relates to a dynamic edge computing unloading method based on a hybrid access mode in an ultra-dense network.
Background
With the rapid development of wireless communication technology and the widespread adoption of smart devices, mobile applications such as face recognition, online mobile gaming, Virtual Reality (VR) and Augmented Reality (AR) have grown explosively in recent years. Most of them are computation-intensive or delay-sensitive applications, whereas mobile devices (e.g. smartphones, wearable devices) are typically limited in both computational power and battery capacity. The tension between such applications and resource-constrained devices poses a significant challenge to improving the user computing experience.
Mobile Edge Computing (MEC) sinks computing servers from the cloud center to the edge of the network, greatly shortening the distance between user equipment and the server, so that users can use computation offloading to offload tasks to edge servers and meet the demands of intensive computing. In addition, the Ultra-Dense Network (UDN) under the 5G architecture is a heterogeneous network scheme with multi-base-station cooperative service; the overall performance of the network is improved by deploying a large number of micro base stations alongside macro base stations in hot-spot areas. Non-Orthogonal Multiple Access (NOMA) is a radio access technology that enables multiple users to share the same channel for information transmission, improving throughput and network capacity. Therefore, UDNs integrating MEC and NOMA are considered a reliable technology for 5G applications. However, the densely deployed micro base stations and MEC servers place multiple users within the coverage of several micro base stations, and different base stations have different computing capabilities, so channels should be allocated reasonably; how to make offloading decisions and resource allocation for the users is a challenge.
Existing resource allocation methods include a learning-assisted mean-field game method, which clusters groups of base stations in an ultra-dense NOMA-MEC system, with each cluster using NOMA to transmit information. In dynamic request scheduling optimization for mobile edge computing in the Internet of Things, the joint request offloading and resource scheduling problem is modeled as a mixed-integer nonlinear program, and user mobility is considered to minimize the response delay of requests.
Some of the above methods do consider user mobility in a dynamic MEC system; however, in those works users only make simple position changes during offloading, the impact on offloading is limited to changes in channel state, and changes in the number of users caused by user movement are not considered. In areas where foot traffic varies greatly, fixed base stations deployed there cannot meet every user's transmission and computation requirements when the number of users rises sharply, and the conventional transmission method no longer suffices. Therefore, designing an offloading scheme that meets the computing requirements of users in such areas of an ultra-dense network has important research value.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a dynamic edge computing unloading method based on a hybrid access mode in an ultra-dense network, which comprises the following steps:
constructing a multi-user ultra-dense network, and initializing the network;
the user equipment generates a calculation task and sends a task request to the macro base station;
a macro base station receives a task request in a current system and constructs a task model according to the task request;
the macro base station acquires state information of a current network, wherein the state information comprises a task model, computing resources of a base station server, a base station channel state and channel information between the base station and a user;
the controller of the macro base station judges, according to the state information of the current network, whether the network has reached the capacity limit of the OMA transmission mode; if it has, the additional users transmit their tasks in NOMA mode, and if it has not, tasks are transmitted in OMA mode;
the base station controller inputs the state information of the current network into the trained neural network to obtain an unloading decision and a resource allocation scheme;
the macro base station sends the decision scheme to each user and sends the resource allocation scheme to the micro base station; and the user unloads the tasks according to the decision scheme, and the micro base station allocates resources according to the resource allocation scheme.
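For illustration only, the per-slot control flow described above can be sketched as follows; the helper callables (get_state, actor, apply_decision) and the OMA-capacity bookkeeping are assumptions introduced for the sketch and are not part of the claimed method:

```python
def schedule_slot(num_users, oma_capacity, get_state, actor, apply_decision):
    """One scheduling slot at the macro base station (illustrative sketch).

    get_state()      -> current network state (task models, server resources, channel gains, occupancy)
    actor(state)     -> (offload_decision, tx_power, cpu_alloc) from the trained policy network
    apply_decision() -> pushes the offloading decision to users and the allocation to micro base stations
    """
    state = get_state()
    # Hybrid access: users beyond the OMA capacity (K channels x N base stations) transmit in NOMA mode.
    modes = ["OMA" if u < oma_capacity else "NOMA" for u in range(num_users)]
    offload_decision, tx_power, cpu_alloc = actor(state)
    apply_decision(offload_decision, tx_power, cpu_alloc, modes)
    return modes
```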
Preferably, the multi-user ultra-dense network comprises a macro base station and N micro base stations, and each micro base station is provided with an MEC server for executing computation tasks; each micro base station serves user equipment by orthogonal frequency division multiple access.
Preferably, constructing the task model comprises: the task of an existing user, Q_u(t) = [d_u(t), c_u(t), τ_u(t)], and the task of a newly added user, Q_{u⁺}(t) = [d_{u⁺}(t), c_{u⁺}(t), τ_{u⁺}(t)], wherein Q_u(t) and Q_{u⁺}(t) are the tasks of the existing user and the newly added user respectively, d_u(t) and d_{u⁺}(t) respectively represent the data size of the existing user's task and of the newly added user's task, c_u(t) and c_{u⁺}(t) respectively represent the number of CPU cycles per bit required to execute the existing user's task and the newly added user's task, and τ_u(t) and τ_{u⁺}(t) respectively represent the delay threshold of the existing user's task and of the newly added user's task.
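As a small illustration, the task tuple could be held in a data structure like the one below; the field names are assumptions made for the sketch, not the patent's notation:

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Computation task Q_u(t) = [d_u(t), c_u(t), tau_u(t)] arriving in slot t."""
    data_bits: float        # d_u(t): input data size in bits
    cycles_per_bit: float   # c_u(t): CPU cycles required per bit
    deadline_s: float       # tau_u(t): maximum tolerated delay in seconds

# Example: a 400 kbit task needing 1000 cycles/bit with a 30 ms deadline
task = Task(data_bits=400e3, cycles_per_bit=1000.0, deadline_s=0.030)
```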
Preferably, the OMA transmission mode comprises: carrying out data transmission over orthogonal frequency division multiplexing channels.
Preferably, the NOMA transmission mode comprises: when the network capacity is not enough to accommodate the number of users, the newly added users transmit data in NOMA mode, i.e. several users share the same channel, and the network capacity is increased at the cost of higher transmission power.
Preferably, the training of the neural network comprises: the neural network is trained with a twin delayed deterministic policy gradient algorithm, comprising the following steps:
step 1: initializing the parameters of the networks, wherein the neural network comprises a policy network and a value network;
step 2: in each time slot, the macro base station acting as the agent acquires the current environment state information, wherein the information comprises the number of occupied channels in the network, the users' computation task information, the computing resources of each micro base station server, and the channel states between the micro base stations and the users;
step 3: inputting the current environment state information into the policy network to obtain the task action; the task actions comprise the user offloading decision, power control, and computing resource allocation actions;
step 4: calculating the immediate reward of the agent according to the task action and the current network state information;
step 5: the agent storing the current network state, the task action, the immediate reward, and the network state at the next moment into the prioritized experience replay array as a quadruple;
step 6: training the policy network and the value network using the data in the prioritized experience replay array.
Further, the calculation expression of the immediate reward function is as follows:

r_t(s_t, a_t) = -Σ_{u∈U} [E_u(t) + u(t)]

wherein r_t represents the immediate reward function at time t, s_t represents the system environment state information at time t, a_t represents the task action at time t, U represents the user set, E_u(t) represents the total energy consumed by the u-th user at time t, and u(t) represents the penalty function, which takes different values depending on whether the task execution delay T_u(t) exceeds the threshold τ_u(t); υ_1 and υ_2 are two positive real numbers satisfying υ_1 ≤ E_u(t) ≤ υ_2.
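A minimal sketch of this reward, assuming the penalty is a fixed positive constant added whenever a task misses its deadline (the exact form of the penalty function is not fully specified above):

```python
def immediate_reward(energy, delay, deadline, penalty):
    """r_t = -(total user energy + penalty terms for missed deadlines) for one slot."""
    total = 0.0
    for e_u, t_u, tau_u in zip(energy, delay, deadline):
        total += e_u                 # E_u(t): energy consumed by user u
        if t_u > tau_u:              # T_u(t) > tau_u(t): deadline violated
            total += penalty         # penalty term u(t) (assumed constant here)
    return -total

# Example: three users, one of which misses its deadline
r = immediate_reward(energy=[0.4, 0.6, 0.5], delay=[0.02, 0.06, 0.03],
                     deadline=[0.03, 0.05, 0.05], penalty=2.0)
```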
Preferably, the process of training the policy network includes:
step 611: initializing parameters of a policy network;
step 612: randomly extracting a quadruple from the prioritized experience replay array, inputting the data in the quadruple into the policy network, and obtaining the score of the action in the quadruple, q_t = Q(s_t, a_t; θ_Q);
step 613: acquiring the target network corresponding to the value network after its parameters are updated;
step 614: updating the parameters of the policy network with a gradient descent algorithm according to the parameters of the value network;
step 615: recalculating the action score according to the updated parameters; the training of the policy network is finished when the action score is maximized.
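A sketch of steps 612 to 615 in PyTorch-style code, assuming `actor` and `critic` are torch modules with the forward signatures `actor(states)` and `critic(states, actions)` (these signatures are assumptions of the sketch): maximizing the score q_t is implemented as gradient descent on its negative.

```python
import torch

def update_actor(actor, critic, states, actor_optimizer):
    """One policy-network update: raise the critic's score Q(s, mu(s)) (steps 612-615)."""
    actions = actor(states)                        # a_t = mu(s_t; theta_mu)
    actor_loss = -critic(states, actions).mean()   # maximizing q_t == minimizing -q_t
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()
    return -actor_loss.item()                      # average score of the sampled actions
```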
Preferably, the process of training the value network includes:
step 621: initializing parameters of a value network;
step 622: extracting a set of experience tuples (s_t, a_t, r_t, s_{t+1}) from the experience replay array, and inputting the extracted experience tuples into the value network to obtain the task evaluation at time t, q_t = Q(s_t, a_t; θ_Q), and the task evaluation at time t+1, q_{t+1} = Q(s_{t+1}, a_{t+1}; θ_Q);
step 623: acquiring the target network after the policy network parameters are updated, calculating the task action at the next moment according to the acquired target network, and adding noise to the task action;
step 624: calculating two TD targets according to the task action at the next moment after the noise is added, and selecting the minimum of the two TD targets; the expressions are:

y_1 = r_t + γ·Q(s_{t+1}, a_{t+1}; θ_{Q1'})

y_2 = r_t + γ·Q(s_{t+1}, a_{t+1}; θ_{Q2'})

y_t = min(y_1, y_2)

wherein y_1 represents the TD target of value network 1, r_t represents the immediate reward, γ represents the discount factor, Q represents the value network, s_{t+1} represents the state at the next moment, a_{t+1} represents the action at the next moment, θ_{Q1'} represents the parameters of target value network 1, y_2 represents the TD target of value network 2, and θ_{Q2'} represents the parameters of target value network 2;
step 625: calculating a loss function of the model according to the TD target;
step 626: and updating parameters of the model by adopting a gradient descent algorithm, and finishing the training of the model when the loss function is minimum.
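A sketch of steps 621 to 626, again in PyTorch style under the same assumed module signatures; the noise scale, clipping range and discount factor are illustrative values, not values given in the patent:

```python
import torch

def update_value_networks(critic1, critic2, target_critic1, target_critic2, target_actor,
                          batch, opt1, opt2, gamma=0.99, noise_std=0.2, noise_clip=0.5):
    """Clipped double-Q update of the two value networks with target policy smoothing."""
    s, a, r, s_next = batch                                    # sampled quadruples
    with torch.no_grad():
        noise = (torch.randn_like(a) * noise_std).clamp(-noise_clip, noise_clip)
        a_next = target_actor(s_next) + noise                  # a_{t+1} = mu(s_{t+1}; theta_mu') + xi
        y1 = r + gamma * target_critic1(s_next, a_next)        # TD target of value network 1
        y2 = r + gamma * target_critic2(s_next, a_next)        # TD target of value network 2
        y = torch.min(y1, y2)                                  # take the smaller TD target
    loss = ((critic1(s, a) - y) ** 2).mean() + ((critic2(s, a) - y) ** 2).mean()
    opt1.zero_grad(); opt2.zero_grad()
    loss.backward()
    opt1.step(); opt2.step()
    return loss.item()
```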
Further, the formula for updating the parameters of the model with the gradient descent algorithm is:

θ_Q ← θ_Q − α · (1/|B|) · Σ_{b∈B} [Q(s_b, a_b; θ_Q) − y_b] · ∇_{θ_Q} Q(s_b, a_b; θ_Q)

wherein Q(s_b, a_b; θ_Q) − y_b is the TD error, α is the learning rate, |B| represents the number of experience tuples extracted from the experience replay array, ∇_{θ_Q} represents taking the gradient with respect to the value network parameters, Q represents the value network, b indexes a single experience tuple, s_b represents the state of the experience tuple, a_b represents the action of the experience tuple, and θ_Q represents the value network parameters.
The invention has the beneficial effects that:
the invention provides a mixed NOMA-MEC system, which can be used for transmitting tasks by a user in a NOMA mode when the number of the users is large and the original network cannot accommodate the users. An optimization problem is presented that minimizes energy consumption while maximizing user capacity. The traditional optimization algorithm is relatively laboursome in solving dynamic and multidimensional problems, and the invention adopts a double-delay deterministic strategy gradient algorithm with an advanced target to solve the optimization problem. Through simulation experiments, the system based on the hybrid access mode has certain advantages compared with a single access mode system.
Drawings
FIG. 1 is a diagram of a multi-user ultra-dense network scenario;
FIG. 2 is a frame diagram of a deep reinforcement learning algorithm;
FIG. 3 is a flowchart of a deep reinforcement learning algorithm;
FIG. 4 is a graph of the impact of number of users on average energy consumption;
FIG. 5 is a graph of the impact of number of users on task completion rate.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
A specific implementation mode of a dynamic edge computing unloading method based on a hybrid access mode in an ultra-dense network comprises the following steps:
constructing a multi-user ultra-dense network and initializing the network;
the method comprises the steps that user equipment generates a calculation task and sends a task request to a macro base station;
a macro base station receives a task request in a current system and constructs a task model according to the task request;
the macro base station acquires state information of a current network, wherein the state information comprises a task model, computing resources of a base station server, a base station channel state and channel information between the base station and a user;
the controller of the macro base station judges, according to the state information of the current network, whether the network has reached the capacity limit of the OMA transmission mode; if it has, the additional users transmit their tasks in NOMA mode, and if it has not, tasks are transmitted in OMA mode;
the base station controller inputs the state information of the current network into the trained neural network to obtain an unloading decision and a resource allocation scheme;
the macro base station sends the decision scheme to each user and sends the resource allocation scheme to the micro base station; and the user unloads the tasks according to the decision scheme, and the micro base station allocates resources according to the resource allocation scheme.
In a specific embodiment of the dynamic edge computing offloading method based on a hybrid access mode in an ultra-dense network, as shown in FIG. 1, building the system model of the multi-user ultra-dense network comprises: a heterogeneous network scenario in an ultra-dense network with multiple users and multiple base stations is considered. The current user set is U = {1, 2, ..., U}; when the number of users changes, the users that the system cannot accommodate are denoted U⁺ = {1, 2, ..., U⁺}. Each base station is equipped with an MEC server, and an SDN controller is deployed at the macro base station for centralized control. The base station set is denoted N = {1, 2, ..., N}, and each MEC server has a different computing capability. Orthogonal Frequency Division Multiple Access (OFDMA) transmission is normally used: the base station channel is divided into K orthogonal sub-channels with channel set K = {1, 2, ..., K}. Channels are multiplexed between base stations, so the capacity of the network is KN under normal conditions. A discrete time model divides time into a set of time slots of length l, T = {1, 2, ..., T}.
The task arriving at user u in time slot t is denoted Q_u(t) = [d_u(t), c_u(t), τ_u(t)], and Q_{u⁺}(t) denotes the task arriving at user u⁺ in time slot t, where d denotes the input data size (bits), c denotes the computation required per bit (cycles/bit), and τ is the maximum tolerated delay. Because the user positions change continuously, the network scene is modeled in three dimensions: the position of a user is denoted p_u(t) = (x_u, y_u, 0), while the base station positions and heights are fixed, with the position of each base station denoted p_n = (x_n, y_n, H), H being the base station height.
λ_{u,n}(t) denotes the offloading decision of user u: when λ_{u,n}(t) = k, the user chooses to offload its task to the n-th base station for computation and uses the k-th channel for task transmission; λ_{u⁺,n}(t) has the same meaning for a newly added user. When a user has a task to compute, the task is transmitted to the selected base station over the corresponding wireless uplink, the computation is executed by the MEC server of that base station, and after the task is completed the result is sent back to the user equipment over the wireless downlink. Since the result returned in the last phase is much smaller, only the first two phases, the communication phase and the computation phase, are considered.
The transmission rate of a user in the multi-user ultra-dense network is:

R_{u,n}(t) = (B_n / K) · log2(1 + p_{u,n}(t)·h_{u,n}(t) / (I_{u,n}(t) + σ²))

wherein B_n represents the fixed bandwidth of base station n, p_{u,n}(t) is the transmit power of user equipment u, σ² represents the Additive White Gaussian Noise (AWGN) power, and h_{u,n}(t) is the channel gain between the user and the base station.
In an ultra-dense network, interference between devices cannot be ignored since multiple base stations reuse the spectrum. I_{u,n}(t) is the interference experienced by the current device on the current channel:

I_{u,n}(t) = Σ_{n'≠n} Σ_{u'} p_{u',n'}(t) · h_{u',n}(t)

wherein p_{u',n'}(t) is the transmit power of other users using the current channel, and h_{u',n}(t) represents the channel gain between user u' on the current channel and base station n.
From the user's transmission rate, the transmission delay of offloaded computation is:

T^{tr}_{u,n}(t) = d_u(t) / R_{u,n}(t)

wherein d_u(t) represents the task data size.
The time consumed by the computation at the server side is:

T^{comp}_{u,n}(t) = d_u(t)·c_u(t) / f_{u,n}(t)

wherein c_u(t) represents the number of CPU cycles per bit required to execute the task, and f_{u,n}(t) represents the computing resource allocated by base station n to user u, satisfying Σ_u f_{u,n}(t) ≤ F_n, where F_n is the maximum computational capability of base station n.
The total delay of user u can be expressed as:

T_u(t) = T^{tr}_{u,n}(t) + T^{comp}_{u,n}(t)

The energy consumed by the user in uploading the task is:

E_u(t) = p_{u,n}(t) · T^{tr}_{u,n}(t)

Since only the user experience is considered, the computing energy consumption of the server is ignored, and the energy consumed in uploading the task is the total energy consumption of user u.
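For illustration, the OMA rate, delay and upload-energy expressions above can be evaluated as follows; the per-sub-channel bandwidth B_n/K and the numeric values in the example are assumptions of the sketch:

```python
import math

def oma_offload_cost(d_bits, c_cycles_per_bit, p_tx, h_gain, interference,
                     noise_power, bandwidth_hz, num_subchannels, f_alloc_hz):
    """Total delay and upload energy of one user offloading over an OMA sub-channel."""
    rate = (bandwidth_hz / num_subchannels) * math.log2(
        1 + p_tx * h_gain / (interference + noise_power))
    t_tx = d_bits / rate                               # transmission delay
    t_comp = d_bits * c_cycles_per_bit / f_alloc_hz    # server-side computation delay
    energy = p_tx * t_tx                               # upload energy (server energy ignored)
    return t_tx + t_comp, energy

# Example: 400 kbit task, 1000 cycles/bit, 0.5 W, B = 10 MHz over K = 4 sub-channels, 5 GHz allocated
delay, energy = oma_offload_cost(400e3, 1000.0, 0.5, 1e-6, 0.0, 1e-13, 10e6, 4, 5e9)
```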
At some point the number of users grows beyond what the system can accommodate, and newly joined users must offload their tasks over the same channels as users that have already started transmitting. After the base station to offload to has been determined by a given metric, a sub-channel to reuse must also be selected. According to the NOMA transmission principle and the SIC decoding rule, the receiver first decodes the user with the larger channel gain and treats the signal of the user with the smaller channel gain as interference; therefore, in order not to affect the experience of the earlier user, the newly added user should select a channel on which its channel gain is large relative to the current user's.
The transmission rate of user u⁺ when using NOMA is:

R^{N}_{u⁺,n}(t) = (B_n / K) · log2(1 + p^{N}_{u⁺,n}(t)·h_{u⁺,n}(t) / (I_{u⁺,n}(t) + p_{u,n}(t)·h_{u,n}(t) + σ²))

wherein B_n represents the total bandwidth of the base station, K represents the total number of transmission channels, σ² represents the Gaussian white noise power, p^{N}_{u⁺,n}(t) represents the transmit power when the newly added user uses NOMA transmission, h_{u⁺,n}(t) represents the channel gain, I_{u⁺,n}(t) represents the co-channel interference, and p_{u,n}(t)·h_{u,n}(t) is the interference from the original user on the same channel to the newly added user.
Once the earlier user on the channel has finished transmitting, the newly added user can continue transmitting in OMA mode to save energy, with transmission rate

R^{O}_{u⁺,n}(t) = (B_n / K) · log2(1 + p^{O}_{u⁺,n}(t)·h_{u⁺,n}(t) / (I_{u⁺,n}(t) + σ²))

wherein p^{O}_{u⁺,n}(t) represents the transmit power when the newly added user uses OMA transmission.
Its transmission time T^{tr}_{u⁺,n}(t) is determined by the part of the task data transmitted at rate R^{N}_{u⁺,n}(t) during the NOMA phase and the remainder transmitted at rate R^{O}_{u⁺,n}(t) afterwards, wherein d_{u⁺}(t) indicates the task data size of the newly added user.
The time consumed by the computation at the server side is:

T^{comp}_{u⁺,n}(t) = d_{u⁺}(t)·c_{u⁺}(t) / f_{u⁺,n}(t)

wherein c_{u⁺}(t) represents the number of CPU cycles per bit required to execute the newly added user's task, f_{u⁺,n}(t) represents the computing resource allocated by base station n to user u⁺, satisfying Σ f_{u⁺,n}(t) ≤ F_n, and F_n is the maximum computational capability of base station n.
The total delay of user u⁺ can be expressed as:

T_{u⁺}(t) = T^{tr}_{u⁺,n}(t) + T^{comp}_{u⁺,n}(t)

The energy consumed by user u⁺ in uploading its task is the transmit power multiplied by the transmission time in each phase; considering only the user experience and ignoring the computing energy consumption of the server, this upload energy is the total energy consumption E_{u⁺}(t) of user u⁺.
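A small sketch of the NOMA-phase rate for a newly added user, where the earlier user's signal on the shared channel is treated as additional interference per the SIC decoding order described above; the B_n/K factor and the numeric values are assumptions of the sketch:

```python
import math

def noma_uplink_rate(p_new, h_new, p_old, h_old, interference,
                     noise_power, bandwidth_hz, num_subchannels):
    """Rate of a newly added user sharing an occupied sub-channel via NOMA."""
    sinr = p_new * h_new / (interference + p_old * h_old + noise_power)
    return (bandwidth_hz / num_subchannels) * math.log2(1 + sinr)

# Newly added user reusing a channel occupied by an earlier user with smaller channel gain
rate = noma_uplink_rate(p_new=1.0, h_new=2e-6, p_old=0.5, h_old=1e-6,
                        interference=0.0, noise_power=1e-13,
                        bandwidth_hz=10e6, num_subchannels=4)
```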
This specific embodiment of the dynamic edge computing offloading method based on a hybrid access mode in an ultra-dense network aims to solve the optimization problem of minimizing user energy consumption while meeting the delay requirements and maximizing the number of admitted users, namely:

P: min_{λ, p, f} Σ_{t=1}^{T} [ Σ_{u=1}^{U} E_u(t) + Σ_{u⁺=1}^{U⁺} E_{u⁺}(t) ]

s.t.
C1: λ_{u,n}(t) ∈ {1, 2, ..., K}, λ_{u⁺,n}(t) ∈ {1, 2, ..., K}
C2: Σ_u G{λ_{u,n}(t) = k} ≤ 1, for every base station n and channel k
C3: Σ_u f_{u,n}(t) + Σ_{u⁺} f_{u⁺,n}(t) ≤ F_n, for every base station n
C4: 0 ≤ p_{u,n}(t) ≤ p_max
C5: 0 ≤ p^{O}_{u⁺,n}(t) ≤ p⁺_max, 0 ≤ p^{N}_{u⁺,n}(t) ≤ p⁺_max
C6: T_u(t) ≤ τ_u(t)
C7: T_{u⁺}(t) ≤ τ_{u⁺}(t)

wherein C1 and C2 are offloading-variable constraints: C1 limits the offloading decision values to the K channels, and C2 indicates that one channel of a base station can be allocated to at most one user, where G{#} = 1 indicates that # is true. C3 is the computing resource constraint: the computing resources allocated by a base station to all users cannot exceed the maximum resources it owns. C4 and C5 are transmit power constraints. C6 and C7 ensure that tasks can be completed on time. T represents the total time, U the total number of users, U⁺ the total number of newly added users, λ the offloading decision variable, p the transmit power, f the computing resource allocation variable, E_u(t) the energy consumption, E_{u⁺}(t) the energy consumption of a newly added user, N the total number of base stations, λ_{u,n}(t) the offloading decision variable, λ_{u⁺,n}(t) the offloading decision variable of a newly added user, K the total number of channels, u a user, u⁺ a newly added user, n a base station, G{·} the indicator function on the offloading decision, f_{u,n}(t) the computing resource allocation variable, f_{u⁺,n}(t) the computing resource allocation variable of a newly added user, F_n the maximum computing resource of base station n, p_{u,n}(t) the transmit power, p_max the maximum transmit power, p^{O}_{u⁺,n}(t) the transmit power when a newly added user uses OMA transmission, p^{N}_{u⁺,n}(t) the transmit power when a newly added user uses NOMA transmission, p⁺_max the maximum transmit power of a newly added user, T_u(t) the task execution delay, τ_u(t) the task execution delay threshold, T_{u⁺}(t) the execution delay of a newly added user's task, and τ_{u⁺}(t) the execution delay threshold of a newly added user's task.
The problem P involves the optimization of three variables: λ is a decision matrix of dimension U × N whose elements are discrete integers no greater than K, while p and f are continuous real vectors over all users. Problem P is therefore a non-convex mixed-integer nonlinear programming problem; moreover, the present work targets a dynamic MEC system, and solving P with conventional optimization algorithms under time-varying conditions is difficult. Reinforcement Learning (RL) is a decision-making method that obtains the maximum return by continuously performing trial-and-error learning in a target environment, observing the feedback and adjusting the strategy. Despite its many advantages, it lacks scalability and is inherently limited to rather low-dimensional problems, mainly because reinforcement learning algorithms share the memory, computational, and sample complexity issues of other learning algorithms. Therefore, to handle high-dimensional decision problems that are difficult for plain reinforcement learning, Deep Reinforcement Learning (DRL) combines the perception capability of deep learning with the decision capability of reinforcement learning, and tackles high-dimensional state and action spaces through function approximation with deep neural networks.
The present invention is intended to solve the above mixed integer nonlinear programming problem using deep reinforcement learning.
The reinforcement learning framework of the present invention comprises an agent, an environment, and three elements: the state space S, which is the set of all possible environment states; the action space A, which is the set of all possible actions of the agent; and the reward function, whereby after the agent performs an action the environment returns a reward r_t = r(s_t, a_t) to the agent. The agent learns continuously in discrete time-slot steps through constant interaction with the environment and the rewards it earns. At each time slot t, the agent observes the environment state s_t ∈ S and takes an action a_t ∈ A; the action of the agent is the output determined by the policy function μ. After performing the action, the environment returns a scalar reward r_t to the agent and transitions to the next state s_{t+1}.
The present invention defines the following elements according to the system model:
(1) State space: at the beginning of each time slot, the macro base station observes the system state of the wireless network, including the details of the tasks requested by all devices, the available computing resources of each base station's MEC server, the channel gains between the users and the base stations, and the channel occupancy. The system state s_t ∈ S can be defined as:

s_t = [V(t), H(t), F(t), J(t)]

where V(t) is the task request feature matrix of all users, containing the data features of all user devices' tasks; H(t) is the channel gain matrix between each base station and each user; F(t) is the computing resource vector of all base station servers, representing the computing resources currently available at each server; and J(t) is the channel occupancy of each base station.
(2) Action space: based on the currently observed system state s_t, the agent selects an action according to the decision variables of problem P. The action a_t can be defined as:

a_t = [λ(t), p(t), f(t)]

where λ(t) is the base station and channel selection action, p(t) is the transmit power allocation action of the user equipment, and f(t) is the computing resource allocation action of the base stations.
(3) Reward: the objective of the joint computation offloading and resource allocation problem P proposed by the invention is to minimize the energy consumption of the users, and the negative of this quantity can be used directly as the immediate reward. The immediate reward function is defined as:

r_t(s_t, a_t) = -Σ_{u∈U} [E_u(t) + u(t)]

where u(t) is a penalty function applied when the task execution delay exceeds its threshold, and υ_1 and υ_2 are two positive real numbers satisfying υ_1 ≤ E_u(t) ≤ υ_2.
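As an illustration of how the state s_t and action a_t could be packed into flat vectors for the neural networks, where the discretization of λ(t) from the actor's continuous output is an assumption of the sketch (the patent does not specify the mapping):

```python
import numpy as np

def build_state(task_features, channel_gains, server_resources, channel_occupancy):
    """s_t = [V(t), H(t), F(t), J(t)] flattened into one observation vector."""
    return np.concatenate([np.ravel(task_features), np.ravel(channel_gains),
                           np.ravel(server_resources), np.ravel(channel_occupancy)])

def split_action(raw_action, num_users, num_bs, num_channels):
    """a_t = [lambda(t), p(t), f(t)]: split the actor output into the three decision groups."""
    lam = raw_action[:num_users]                          # continuous values in [0, 1]
    # Map each value to a (base station, channel) index; this discretization is illustrative.
    offload = np.floor(lam * num_bs * num_channels).clip(0, num_bs * num_channels - 1).astype(int)
    power = raw_action[num_users:2 * num_users]           # transmit power fractions
    cpu = raw_action[2 * num_users:]                      # computing-resource fractions
    return offload, power, cpu
```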
The optimization problem of the invention contains continuous variables, and Deep Deterministic Policy Gradient (DDPG) is the most common continuous control method. It is an Actor-Critic based deep reinforcement learning algorithm consisting of a policy network (actor) μ(s; θ_μ) that approximates the policy function and a value network (critic) Q(s, a; θ_Q) that approximates the action-value function, where θ_μ and θ_Q are the corresponding neural network parameters. The policy network is responsible for outputting actions according to the current state, and its output is a deterministic action; the value network outputs the value of the current state and the action produced by the policy network, i.e., it scores the action given the current state to guide the actor toward better actions. However, DDPG is prone to overestimation during network training; research has tried to alleviate the overestimation with target networks, but the effect is not ideal. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm uses clipped double Q-learning to mitigate the overestimation problem.
TD3 uses two value networks and one policy network, namely Q(s_t, a_t; θ_Q1), Q(s_t, a_t; θ_Q2), and μ(s_t; θ_μ). Each of the three neural networks also has a corresponding target network, Q(s_t, a_t; θ_Q1'), Q(s_t, a_t; θ_Q2'), and μ(s_t; θ_μ'), where θ_Q1', θ_Q2', θ_μ' are the neural network parameters of the corresponding target networks. The network model is shown in FIG. 2.
Experience Replay is an important technique in reinforcement learning that can greatly improve its performance. The records (experiences) of the agent's interactions with the environment are stored in an array, referred to as the experience replay array (Replay Buffer), and used to train the agent. The policy network controls the agent's interaction with the environment, the quadruples (s_t, a_t, r_t, s_{t+1}) are placed into the experience replay array, and a certain amount of experience is then extracted from the array to train the policy network and the value network.
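A minimal replay buffer sketch; for brevity it samples uniformly, whereas the embodiment above stores the quadruples in a prioritized experience replay array:

```python
import random
from collections import deque

class ReplayBuffer:
    """Experience replay array holding (s_t, a_t, r_t, s_{t+1}) quadruples."""

    def __init__(self, capacity=512):
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```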
The process of training the policy network comprises the following steps: given the current state s_t, the policy network outputs an action a_t = μ(s_t; θ_μ), and the value network scores that action based on the current state: q_t = Q(s_t, a_t; θ_Q). The policy network parameters θ_μ influence the action a_t and thereby influence q_t; the aim of training the policy network is to improve θ_μ so that q_t becomes as large as possible. The parameters are updated using the following formula:

θ_μ ← θ_μ + β · (1/|B|) · Σ_{b∈B} ∇_{θ_μ} μ(s_b; θ_μ) · ∇_{a} Q(s_b, a; θ_Q) |_{a=μ(s_b; θ_μ)}

Gradient ascent is used to make the score higher, where β is the learning rate and |B| is the number of experience tuples extracted from the experience replay array.
The process of training the value network includes: the value network acts as the critic, and for its scoring of the actor to become more and more accurate, its scores must be calibrated against the actually observed rewards. The value network is trained mainly with the Temporal Difference (TD) algorithm, which fits the value network to the TD target. A set of experience tuples (s_t, a_t, r_t, s_{t+1}) is extracted from the experience replay array, and the value network first produces the evaluations q_t = Q(s_t, a_t; θ_Q) and q_{t+1} = Q(s_{t+1}, a_{t+1}; θ_Q), where the next action is given by the target policy network with added noise:

a_{t+1} = μ(s_{t+1}; θ_μ') + ξ

where ξ is a random noise variable following a truncated (clipped) normal distribution CN(0, σ², -c, c), i.e. a normal distribution with mean 0 and standard deviation σ whose values fall only in the interval [-c, c]. Using a truncated normal distribution prevents excessive noise, and adding this noise when computing the TD target makes the action output smoother, which reduces errors.
Two TD targets are then computed,

y_1 = r_t + γ·Q(s_{t+1}, a_{t+1}; θ_{Q1'})

y_2 = r_t + γ·Q(s_{t+1}, a_{t+1}; θ_{Q2'})

where γ is the discount factor, and the smaller of the two is taken as the TD target:

y_t = min(y_1, y_2)

where y_1 denotes the TD target of value network 1, r_t the immediate reward, γ the discount factor, Q the value network, s_{t+1} the state at the next moment, a_{t+1} the action at the next moment, θ_{Q1'} the parameters of target value network 1, y_2 the TD target of value network 2, and θ_{Q2'} the parameters of target value network 2.
The loss function of the model is calculated from the TD target as:

L(θ_Q) = (1/|B|) · Σ_{b∈B} [Q(s_b, a_b; θ_Q) − y_b]²

The parameters are updated with the gradient descent method:

θ_Q ← θ_Q − α · (1/|B|) · Σ_{b∈B} δ_b · ∇_{θ_Q} Q(s_b, a_b; θ_Q)

where δ_b = Q(s_b, a_b; θ_Q) − y_b is the TD error and α is the learning rate. Gradient descent makes the loss function smaller, i.e. it brings the evaluation of the value network closer to the TD target.
The Actor-Critic method uses the value network Q to guide the update of the policy network μ. If the value network itself is unreliable, its scores of the actions are inaccurate and do not help improve the policy network. Updating the policy network too eagerly while the value network is still poor not only fails to improve μ, but the changes in μ also make the training of Q unstable. Experiments have shown that the policy network and the three target networks should be updated more slowly than the value networks. Each round of training of a conventional Actor-Critic updates the policy network, the value network, and the target networks once; a better approach is to update the value networks once per round, but update the policy network and the three target networks only every k rounds. The specific algorithm flow is shown in FIG. 3.
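The delayed update schedule can be sketched as below; the Polyak (soft) target update and the coefficient tau are assumptions, since the passage above only states that the policy network and the three target networks are updated every k rounds:

```python
def soft_update(target_net, net, tau=0.005):
    """Slowly track the online network: theta' <- tau*theta + (1 - tau)*theta' (assumed Polyak update)."""
    for tp, p in zip(target_net.parameters(), net.parameters()):
        tp.data.copy_(tau * p.data + (1.0 - tau) * tp.data)

def train_round(step, k, update_value_networks_fn, update_actor_fn, soft_update_fns):
    """Update the value networks every round, the actor and target networks every k rounds."""
    update_value_networks_fn()
    if step % k == 0:
        update_actor_fn()
        for fn in soft_update_fns:
            fn()
```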
Fig. 4 and 5 compare the algorithm of the present invention with conventional algorithms. The main simulation parameters are set as follows. The number of micro base stations in the network is 10. Each user starts a computation task; in the task parameters the data size d is taken from [300, 500] kbit, the number of CPU cycles required per bit is c = 1000 cycles/bit for all tasks, and the maximum tolerated delay is τ ∈ [20, 50] ms. The maximum transmit power of the user equipment is p_max = 2 W. Different base stations have different computing capabilities, F ∈ [15, 25] GHz. The bandwidth of each base station is set to B = 10 MHz, and the channel is divided into K = 4 sub-channels. The additive white Gaussian noise is set to -174 dBm/Hz. The channel gain between a user and a base station follows a free-space path loss model with antenna gain A_d = 4.11 and carrier frequency f_c = 900 MHz, where the path loss depends on the distance between the user equipment and the base station.
The algorithm contains 6 neural networks: two value networks, a policy network, and their corresponding target networks. The value networks and the policy network each consist of four fully connected layers: an input layer, two hidden layers, and an output layer, with 256 neurons per hidden layer. The input layer of the policy network has the size of the state space and its output layer has the size of the action space; the input of the value network is the state together with the action, and its output layer has a single neuron. The hidden layers of the neural networks use the Rectified Linear Unit (ReLU) as the activation function, the output layers use the sigmoid function, and an Adam optimizer updates the neural network parameters. The neural network learning rate is 0.01, and the prioritized experience replay array size is 512.
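A sketch of the described network structure in PyTorch; the state and action dimensions used to instantiate the networks are placeholders, not values from the patent:

```python
import torch
import torch.nn as nn

def make_policy_network(state_dim, action_dim, hidden=256):
    """Policy network: input layer, two 256-neuron hidden layers (ReLU), sigmoid output layer."""
    return nn.Sequential(
        nn.Linear(state_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, action_dim), nn.Sigmoid(),
    )

def make_value_network(state_dim, action_dim, hidden=256):
    """Value network: takes the state-action pair and outputs a single score."""
    return nn.Sequential(
        nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )

policy = make_policy_network(state_dim=64, action_dim=30)
value = make_value_network(state_dim=64, action_dim=30)
policy_opt = torch.optim.Adam(policy.parameters(), lr=0.01)
value_opt = torch.optim.Adam(value.parameters(), lr=0.01)
```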
Fig. 4 and fig. 5 compare the effects of the three transmission modes. As the figures show, the average energy consumption of task execution increases with the number of users in all three modes, because as the number of users grows the computing resources allocated to each user decrease, and a larger transmit power is required to complete the task within the delay threshold. By comparison, as the number of users increases the energy consumption of the hybrid NOMA transmission method is slightly higher than that of the OMA method, but its advantage in task completion rate is obvious. This is because, as the number grows, users in the OMA method cannot transmit their tasks and can only wait, which easily exceeds the delay threshold, whereas hybrid NOMA can transmit the data in time at the cost of extra energy. Compared with pure NOMA, hybrid NOMA performs slightly worse in task completion rate but saves energy.
The above-mentioned embodiments, which further illustrate the objects, technical solutions and advantages of the present invention, should be understood that the above-mentioned embodiments are only preferred embodiments of the present invention, and should not be construed as limiting the present invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A dynamic edge computing unloading method based on a hybrid access mode in an ultra-dense network is characterized by comprising the following steps:
constructing a multi-user ultra-dense network and initializing the network;
the user equipment generates a calculation task and sends a task request to the macro base station;
a macro base station receives a task request in a current system and constructs a task model according to the task request;
the macro base station acquires state information of a current network, wherein the state information comprises a task model, computing resources of a base station server, a base station channel state and channel information between the base station and a user;
the controller of the macro base station judges, according to the state information of the current network, whether the network has reached the capacity limit of the OMA transmission mode; if it has, the additional users transmit their tasks in NOMA mode, and if it has not, tasks are transmitted in OMA mode;
the base station controller inputs the state information of the current network into the trained neural network to obtain an unloading decision and a resource allocation scheme;
the macro base station sends the decision scheme to each user and sends the resource allocation scheme to the micro base station; and the user unloads the tasks according to the decision scheme, and the micro base station allocates resources according to the resource allocation scheme.
2. The method for unloading dynamic edge computing based on a hybrid access mode in the ultra-dense network according to claim 1, wherein the multi-user ultra-dense network comprises a macro base station and N micro base stations, and each micro base station is configured with an MEC server to execute computing tasks; each micro base station adopts orthogonal frequency division multiple access to user equipment.
3. The method for dynamic edge computing offloading based on hybrid access in ultra-dense network as claimed in claim 1, wherein constructing the task model comprises: the task of an existing user, Q_u(t) = [d_u(t), c_u(t), τ_u(t)], and the task of a newly added user, Q_{u⁺}(t) = [d_{u⁺}(t), c_{u⁺}(t), τ_{u⁺}(t)], wherein Q_u(t) and Q_{u⁺}(t) are the tasks of the existing user and the newly added user respectively, d_u(t) and d_{u⁺}(t) respectively represent the data size of the existing user's task and of the newly added user's task, c_u(t) and c_{u⁺}(t) respectively represent the number of CPU cycles per bit required to execute the existing user's task and the newly added user's task, and τ_u(t) and τ_{u⁺}(t) respectively represent the delay threshold of the existing user's task and of the newly added user's task.
4. The method of claim 1, wherein the OMA transport mode comprises: and carrying out data transmission by adopting an orthogonal frequency division multiplexing channel.
5. The method of claim 1, wherein the NOMA transmission scheme includes: when the network capacity is not enough to accommodate the number of users, the newly added users adopt the NOMA mode to transmit data, namely a plurality of users use the same channel, and the network capacity is increased by using the mode of increasing transmission power.
6. The method according to claim 1, wherein the training of the neural network comprises: the neural network is trained with a twin delayed deterministic policy gradient algorithm, comprising the following steps:
step 1: initializing the parameters of the networks, wherein the neural network comprises a policy network and a value network;
step 2: in each time slot, the macro base station acting as the agent acquires the current environment state information, wherein the information comprises the number of occupied channels in the network, the users' computation task information, the computing resources of each micro base station server, and the channel states between the micro base stations and the users;
step 3: inputting the current environment state information into the policy network to obtain the task action; the task actions comprise the user offloading decision, power control, and computing resource allocation actions;
step 4: calculating the immediate reward of the agent according to the task action and the current network state information;
step 5: the agent storing the current network state, the task action, the immediate reward, and the network state at the next moment into the prioritized experience replay array as a quadruple;
step 6: training the policy network and the value network using the data in the prioritized experience replay array.
7. The method of claim 6, wherein the calculation expression of the immediate reward function is as follows:

r_t(s_t, a_t) = -Σ_{u∈U} [E_u(t) + u(t)]

wherein r_t represents the immediate reward function at time t, s_t represents the system environment state information at time t, a_t represents the task action at time t, U represents the user set, E_u(t) represents the total energy consumed by the u-th user at time t, and u(t) represents the penalty function, which takes different values depending on whether the task execution delay T_u(t) exceeds the threshold τ_u(t); υ_1 and υ_2 are two positive real numbers satisfying υ_1 ≤ E_u(t) ≤ υ_2.
8. The method of claim 6, wherein the training of the policy network comprises:
step 611: initializing parameters of a policy network;
step 612: random in preferred empirical replay arraysExtracting a quadruple, inputting data in the quadruple into a policy network to obtain the fraction q of actions in quadruple data t =Q(s t ,a t ;θ Q );
Step 613: acquiring a target network after the value network updating parameters are acquired;
step 614: updating the parameters of the policy network by adopting a gradient descent algorithm according to the parameters of the value network;
step 615: and recalculating the action score according to the updated parameters, and finishing the training of the strategy network when the action score is maximum.
9. The method of claim 6, wherein the training of the value network comprises:
step 621: initializing parameters of a value network;
step 622: extracting a set of experience tuples (s_t, a_t, r_t, s_{t+1}) from the experience replay array, and inputting the extracted experience tuples into the value network to obtain the task evaluation at time t, q_t = Q(s_t, a_t; θ_Q), and the task evaluation at time t+1, q_{t+1} = Q(s_{t+1}, a_{t+1}; θ_Q);
step 623: acquiring the target network after the policy network parameters are updated, calculating the task action at the next moment according to the acquired target network, and adding noise to the task action;
step 624: calculating two TD targets according to the task action at the next moment after the noise is added, and selecting the minimum of the two TD targets; the expressions are:

y_1 = r_t + γ·Q(s_{t+1}, a_{t+1}; θ_{Q1'})

y_2 = r_t + γ·Q(s_{t+1}, a_{t+1}; θ_{Q2'})

y_t = min(y_1, y_2)

wherein y_1 represents the TD target of value network 1, r_t represents the immediate reward, γ represents the discount factor, Q represents the value network, s_{t+1} represents the state at the next moment, a_{t+1} represents the action at the next moment, θ_{Q1'} represents the parameters of target value network 1, y_2 represents the TD target of value network 2, and θ_{Q2'} represents the parameters of target value network 2;
step 625: calculating a loss function of the model according to the TD target;
step 626: and updating parameters of the model by adopting a gradient descent algorithm, and finishing the training of the model when the loss function is minimum.
10. The method according to claim 9, wherein the formula for updating the parameters of the model by using the gradient descent algorithm is as follows:

θ_Q ← θ_Q − α · (1/|B|) · Σ_{b∈B} [Q(s_b, a_b; θ_Q) − y_b] · ∇_{θ_Q} Q(s_b, a_b; θ_Q)

wherein Q(s_b, a_b; θ_Q) − y_b is the TD error, α is the learning rate, |B| represents the number of experience tuples extracted from the experience replay array, ∇_{θ_Q} represents taking the gradient with respect to the value network parameters, Q represents the value network, b indexes a single experience tuple, s_b represents the state of the experience tuple, a_b represents the action of the experience tuple, and θ_Q represents the value network parameters.
CN202210299457.1A 2022-03-25 2022-03-25 Dynamic edge computing unloading method based on hybrid access mode in ultra-dense network Pending CN114885422A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210299457.1A CN114885422A (en) 2022-03-25 2022-03-25 Dynamic edge computing unloading method based on hybrid access mode in ultra-dense network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210299457.1A CN114885422A (en) 2022-03-25 2022-03-25 Dynamic edge computing unloading method based on hybrid access mode in ultra-dense network

Publications (1)

Publication Number Publication Date
CN114885422A true CN114885422A (en) 2022-08-09

Family

ID=82668159

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210299457.1A Pending CN114885422A (en) 2022-03-25 2022-03-25 Dynamic edge computing unloading method based on hybrid access mode in ultra-dense network

Country Status (1)

Country Link
CN (1) CN114885422A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115499441A (en) * 2022-09-15 2022-12-20 中原工学院 Deep reinforcement learning-based edge computing task unloading method in ultra-dense network
CN116939668A (en) * 2023-09-15 2023-10-24 清华大学 Method and device for distributing communication resources of vehicle-mounted WiFi-cellular heterogeneous network
CN116939668B (en) * 2023-09-15 2023-12-12 清华大学 Method and device for distributing communication resources of vehicle-mounted WiFi-cellular heterogeneous network


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination